UTF-8 conversion issues on legacy Windows

Short post: On Windows, UTF-16 was the dominant encoding, and UTF-8 was something you only converted to and from. (Microsoft jumped the gun, committing to 16-bit characters before Unicode expanded its code space.) While it got better (Windows 10 can use UTF-8 as the MBCS code page for the ANSI APIs), it was historically a lot worse.

For converting, you’d use MultiByteToWideChar and its opposite, WideCharToMultiByte. On legacy Windows, they have slightly confusing semantics, specifically around flags. While Vista and later introduced more flags that can be used with the UTF-8 code page (to deal with the quirks of conversion, like invalid characters), previously only MB_ERR_INVALID_CHARS was allowed, and only if you were running XP or 2000 SP4. Before that, you can’t pass any flags at all when converting between UTF-16 and UTF-8. It’s unfortunately a little dangerous, since invalid input can’t be flagged as an error, but that’s the rub.
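To make the flag rule concrete, here is a minimal sketch in C, not a definitive implementation. The helper names `utf8_mb_flags` and `utf8_to_utf16` are made up for illustration, and how you obtain the version numbers is left to the caller. The first function encodes the rule described above (2000 is version 5.0, XP is 5.1); the second uses the usual two-call pattern of asking for the required size and then converting.

```c
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-in so the flag-picking logic compiles off Windows; this is the
   same value the Windows SDK defines. */
#define MB_ERR_INVALID_CHARS 0x00000008
#endif

/* Hypothetical helper: which flags may accompany CP_UTF8 in
   MultiByteToWideChar on a given Windows version and service pack. */
unsigned long utf8_mb_flags(unsigned major, unsigned minor, unsigned sp)
{
    if (major > 5 || (major == 5 && minor >= 1))
        return MB_ERR_INVALID_CHARS;   /* XP and later */
    if (major == 5 && sp >= 4)
        return MB_ERR_INVALID_CHARS;   /* 2000 SP4 */
    return 0;                          /* anything older: no flags allowed */
}

#ifdef _WIN32
/* Hypothetical helper: convert a NUL-terminated UTF-8 string to a newly
   allocated UTF-16 string. Returns NULL on failure; the caller frees the
   result. With flags == 0, invalid input is not reported as an error. */
wchar_t *utf8_to_utf16(const char *utf8, unsigned long flags)
{
    /* First call: ask how many wchar_t units are needed. Passing -1 as the
       input length means the count includes the terminating NUL. */
    int len = MultiByteToWideChar(CP_UTF8, flags, utf8, -1, NULL, 0);
    if (len == 0)
        return NULL;

    wchar_t *out = malloc((size_t)len * sizeof *out);
    if (out == NULL)
        return NULL;

    /* Second call: do the actual conversion into the buffer. */
    if (MultiByteToWideChar(CP_UTF8, flags, utf8, -1, out, len) == 0) {
        free(out);
        return NULL;
    }
    return out;
}
#endif
```

On a system where `utf8_mb_flags` returns 0, the conversion still runs, but you lose the only way to have the API reject malformed UTF-8, so validating the input yourself beforehand is worth considering.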

It never had to be like this: the git “index”

Hot on the heels of another Git-related article that was making the rounds recently, I was reminded of Git’s own structure and how it influences user experience. Specifically, what we assume is part of how Git works is actually part of the porcelain (in Git speak, the user interface and the commands that back it). As someone developing a Git client, I find it interesting that Git’s user experience could be significantly different with a different interface, especially because people have a particular mental model of Git shaped by the default interface. That influence is strong enough that libgit2’s API emulates the porcelain’s semantics, in-process.
