UTF-8 conversion issues on legacy Windows

Short post: On Windows, UTF-16 was the dominant locale, and UTF-8 was something only to convert to and from. (Microsoft jumped the gun before Unicode expanded the address space.) While it got better (Windows 10 can use UTF-8 as an MBCS locale with ANSI APIs), it was historically a lot worse.

For converting, you’d use the MultiByteToWideChar and its opposite WideCharToMultiByte. On legacy Windows, they have slightly confusing semantics. Specifically, with flags. While Vista on introduced many flags that can be used with the UTF-8 codepage (to deal with the quirks of conversion, like invalid characters), previously only MB_ERR_INVALID_CHARS was allowed, and only if you were running XP or 2000 SP4. Before that, you can’t have any flags if you’re converting to or from UTF-16 and UTF-8. It’s unfortunately a little dangerous, but that’s the rub.

Files with accents not showing from an SMB share on macOS

I recently had an issue where some files with accents were showing, but not all of them. If none of them were showing, I might have assumed an encoding issue, but it’s clear something else was at play here. This was pretty annoying when I wanted to play a specific song.

As it turns out, on macOS, it’s almost certainly a Unicode normalization issue, where Apple is (unnecessarily) strict about Unicode. I seem to keep running into these issues on macOS – I first into it trying to notarize a zip file. Luckily, there’s a tool that can handle this for you.

Continue reading