Demand UTF-8

Each time you encode characters as bytes, you should do two things:
• Use UTF-8.
• Explicitly specify that UTF-8 is your charset.

Conversely, whenever you decode bytes to characters, you should also do two things:
• Confirm that the charset is explicitly set to UTF-8.
• Use UTF-8.

If you're working with data encoded in something other than UTF-8, or data with no charset specified, you should demand UTF-8.

Switching to UTF-8 everywhere is progress. If we all do this, then future generations won't need to learn the horrors, hassles and bugs of managing multiple character encodings.

A brief history of standardization

In 1964, the first IBM computers with 8-bit bytes were introduced. Today 8-bit bytes are universal and nobody maintains code to support 6-bit bytes.

In 1982 the US military standardized on TCP/IP. Today IP networking is universal and nobody maintains code to support IPX/SPX or AppleTalk.

In 1993, UTF-8 was released. We're very near to the day that we can drop support for ISO-8859-1 and many other obsolete character sets.

Why UTF-8

Because all other encodings are inferior:
• ASCII and ISO-8859-1 lack essential characters.
• UTF-16 is overweight, has endianness problems and needs surrogate pairs.
• UTF-32 is obese.
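The weight difference is easy to see directly. A quick illustrative comparison (the sample string is my own, chosen to mix ASCII and CJK):

```python
sample = "Hello, 世界"  # 7 ASCII characters plus 2 CJK characters

utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")  # little-endian chosen to avoid a BOM
utf32 = sample.encode("utf-32-le")

print(len(utf8), len(utf16), len(utf32))  # → 13 18 36
```

Even with two 3-byte CJK characters, UTF-8 comes out lightest here; UTF-32 spends four bytes on every character regardless.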

See UTF-8 Everywhere[1] for a more complete comparison, and tactics for upgrading to UTF-8 in your applications.

[1]: http://utf8everywhere.org/
 
Also please:
• Demand that the programming languages you use have full-featured UTF-8 string libraries.

Lua is the best example of a language that is completely modern in every way except for lacking a native UTF-8 processing library. Perl was on board early, but sadly that means its support is a little shaky and rough around some of the edges. PHP is good at UTF-8, which is almost a shame given its other terrible flaws.

What's the best UTF-8 library for C?
 
While I agree UTF-8 is good (and seems to be the default for on-the-wire transfers), the third reason for not using UTF-16 is not valid.  UTF-8 uses a concept similar to surrogate pairs to encode any value greater than 127.  In fact, for documents using many symbols higher up the Unicode range, it becomes less and less efficient.  It's really more important to use an encoding, specify which one, and make sure it's Unicode.  At the end of the day, UTF-8/UTF-16/UTF-32 all represent Unicode.

Also, UTF-16 can have some significant performance advantages over UTF-8 when dealing with non-ASCII text, as it doesn't require more complicated string parsing code.  As with all things in computing, there is a tradeoff.
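The size tradeoff being argued about here can be checked per code point. A small sketch (the sample characters are my own picks from different Unicode ranges):

```python
# Bytes spent per code point by UTF-8 vs. UTF-16:
#   U+0041 'A'  → 1 vs. 2    (ASCII: UTF-8 wins)
#   U+00E9 'é'  → 2 vs. 2    (Latin-1 range: tie)
#   U+4E2D '中' → 3 vs. 2    (U+0800..U+FFFF: UTF-16 wins)
#   U+1F600 '😀' → 4 vs. 4   (astral plane: tie, UTF-16 needs a surrogate pair)
for ch in ("A", "é", "中", "😀"):
    print(hex(ord(ch)),
          len(ch.encode("utf-8")),
          len(ch.encode("utf-16-le")))
```

So UTF-16 is only shorter for code points between U+0800 and U+FFFF, which is the band the next comment is talking about.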
 
+Matthew Dawson Using real text from languages that use predominantly non-ASCII characters, there are enough Latin and control characters present (consider HTML or XML, for example) that UTF-8 almost always comes out ahead of UTF-16.  Note that UTF-16 is only shorter than UTF-8 for encoding characters in the range U+0800 to U+FFFF -- everywhere else, UTF-8 is either shorter or the same length.  Further, if you care about size you will be compressing it, in which case the results are about the same.

UTF-16 is no easier to parse than UTF-8 -- you still have to check for multi-word characters in either case.  One simple AND and a switch is all you need for either.
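The "one AND and a switch" really is the whole classifier for UTF-8 lead bytes. A sketch (function name and structure are my own, not from the thread):

```python
def utf8_sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence that starts with byte `lead`,
    determined by masking the high bits and comparing the pattern."""
    if lead & 0b1000_0000 == 0:            # 0xxxxxxx: ASCII, 1 byte
        return 1
    if lead & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: 2-byte sequence
        return 2
    if lead & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: 3-byte sequence
        return 3
    if lead & 0b1111_1000 == 0b1111_0000:  # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("continuation byte or invalid lead byte")

# Lead bytes of 'A', 'é', '中', '😀' classify as 1-, 2-, 3-, 4-byte sequences:
print([utf8_sequence_length(s.encode("utf-8")[0]) for s in "Aé中😀"])  # → [1, 2, 3, 4]
```

The UTF-16 equivalent is the same shape: a mask-and-compare against the surrogate ranges 0xD800–0xDBFF and 0xDC00–0xDFFF.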
 
Every time you use something other than UTF-8, God kills a kitten!
 
You don't reference the first link anywhere in the post.
 
I've been heavily advocating UTF-8 for years and years. People need to learn. Organizations too.
 
+Matthew Dawson I don't think Jesse's objection to surrogate pairs is an efficiency consideration; rather, I think it's an objection to cognitive load on the developer and an aesthetic objection.

It's nearly impossible to use UTF-8 without handling the idea that some characters might be longer than one byte. True, there are many applications where everything's ASCII and so I suppose it's perfectly possible to write a program that is using UTF-8-capable libraries underneath in which you manipulate bytes as though they were characters; however, if you do that you likely do not think of your program as outputting UTF-8. And, therefore, if someone asks you "does your program do unicode", you'll say "no".

With UTF-16, however, because the BMP covers such a large portion of practical use cases outside China (and seems, to someone unfamiliar with the issues, to cover Chinese as well), it's perfectly possible to write programs that don't handle surrogate pairs correctly at all while believing that your program handles UTF-16. (Witness the people who will sloppily claim that Java's char type is "UTF-16" - no, it isn't. If anything, it's UCS-2, a format that was obsoleted after Unicode 1.0.)

There's also something elegant about the way UTF-8's bytes telescope out for different character code ranges.

The net effect here is that with UTF-8, multibyte characters are frequent enough to be a normal case and just part of the standard logic, whereas with UTF-16 characters formed by surrogate pairs become a special case. Special cases are a breeding ground for bugs that are found only late in the development cycle.
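The "one code point, two code units" trap described above can be made concrete. A sketch (the sample character is my own choice; Python 3 strings count code points, so the UTF-16 view has to be computed from the encoded bytes):

```python
s = "𝄞"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print(len(s))                           # 1 code point
print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units: a surrogate pair
print(len(s.encode("utf-8")))           # 4 UTF-8 bytes: just the normal multibyte case
```

A UCS-2-minded program that counts UTF-16 code units reports length 2 here, which is exactly the late-cycle bug class the comment describes; in UTF-8 the multibyte path is exercised constantly, so it can't stay broken for long.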
 
Broadly speaking, you can divide programs that output text by their output character set into:

1) US-ASCII
2) UTF-8
3) wrong
 
Let me guess, you guys never worked with Chinese developers using GB2312... And yes, I advocate for UTF-8 everywhere, but that's a tough sell for Chinese people, especially with respect to the official regulation.
 
+Konstantin Pribluda (4) is a subset of (3). "Legacy" is a particular type of "wrong" in this categorization. (See also: Russian legacy stuff using koi-8, Japanese legacy stuff using iso-2022, EBCDIC systems, etc.)
 
+Marc-Antoine Ruel This was talking about UTF-16 vs. UTF-8 - certainly legacy encodings like Shift JIS, Big5, etc. can be more efficient than UTF-8. However, those have their own issues.
 
Jesse was talking about "characters" vs "bytes".

I'm just saying you can't shove a specific encoding down everyone's throat in all circumstances.
 
I agree that +Marc-Antoine Ruel's argument is the one to contend with: for Asian languages, you can be more efficient than the Unicode world.

At the risk of making a self-centered claim, I still suspect it's worth going UTF8 or bust. Note that in any context where a standard encoding matters, we are talking about some form of interchange--either between components, or between different users and organizations. In such a case, the value of a hookup Just Working is enormous. It helps all the engineers sleep better at night. It reduces the lost time due to fire drills that can occur out of nowhere.

As a motivating case study, I was at LogicBlox when they made the switch to UTF8 or bust. As an aside to the GWT team, the positive experience with this approach on GWT was part of the motivation for LogicBlox to try it. It worked out quite well. Clients really didn't care; they just adjusted their software to also use UTF8. Moreover, nobody involved cared about the byte efficiency of cross-company data exchanges. Heck, most of that data exchange uses text-encoded CSV files, uncompressed.
 
+Lex Spoon Plus you avoid all the stupid crap like having to declare the character set inside the file itself, or receiving characters whose encoding you don't know (as in XML or HTML).
 
Convincing some that filenames originating on a Windows system need not be in CP-1252 can be quite something else...
 
+Drew Northup, I never even thought about that issue. Yuck; Linux has used UTF-8 for quite a while.

In fairness, I tend to think we'd be better off if filenames had to be ASCII and to not have spaces in them. Many hours have been lost trying to get a Windows batch file to work in the face of spaces. No, I'm not bitter about "Program Files".