Why would starting two-byte codes at C0 80 break the self-synchronization guarantees of UTF-8?
I think you're jumping to conclusions. The documented reason wasn't that it would break the self-synchronization guarantees, nor was it to increase error tolerance in the data; it seems to have been purely to reduce the complexity of the encoding/decoding implementation.
Answer heavily edited, as I had most likely misread the referenced text and misattributed the note.
The original X/Open FSS-UTF proposal (starting at "/usr/ken/utf/xutf from dump") did use offset values, but the "modified" Ken Thompson proposal that actually became UTF-8 no longer did. At the bottom of the file, Ken's copy of the FSS-UTF proposal has his improved variant appended (starting at "We define 7 byte types"), with what is apparently his footnote:
- The 2 byte sequence has 2¹¹ codes, yet only 2¹¹ − 2⁷ are allowed. The codes in the range 0-7f are illegal. I think this is preferable to a pile of magic additive constants for no real benefit. Similar comment applies to all of the longer sequences.
In other words, even though FSS-UTF was a major improvement over the original UTF (now known as UTF-1), Ken still considered it unnecessarily complex (and the Plan 9 team was famously averse to needless complexity).
A later Google+ post by Rob Pike also confirms this, stating that the originally proposed FSS/UTF "was more intricate than we liked":
It's been well documented elsewhere (http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt) that one Wednesday night, after a phone call from X/Open, Ken Thompson and I were sitting in a New Jersey diner talking about how best to represent Unicode as a byte stream. Given the experience we had accumulated dealing with the original UTF, which had many problems, we knew what we wanted. X/Open had offered us a deal: implement something better than their proposal, called FSS/UTF (File System Safe UTF; the name tells you something on its own), and do so before Monday. In return, they'd push it as existing practice.
UTF was awful. It had modulo-192 arithmetic, if I remember correctly, and was all but impossible to implement efficiently on old SPARCs with no divide hardware. Strings like "/*" could appear in the middle of a Cyrillic character, making your Russian text start a C comment. And more. It simply wasn't practical as an encoding: think what happens to that slash byte inside a Unix file name.
FSS/UTF addressed that problem, which was great. Big improvement though it was, however, FSS/UTF was more intricate than we liked and lacked one property we insisted on: If a byte is corrupted, it should be possible to re-synch the encoded stream without losing more than one character. When we claimed we wanted that property, and sensed we could press for a chance to design something right, X/Open gave us the green light to try.
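The re-synch property Pike mentions falls out of the final design's dedicated continuation-byte pattern (10xxxxxx), which the original FSS/UTF lacked: its continuation bytes (1yyyyyyy) were indistinguishable from its two-byte lead bytes. A minimal sketch of what the final design enables (my illustration, not Plan 9 code):

```c
#include <stddef.h>

/* After a corrupted or lost byte, skip to the next character boundary.
 * In final UTF-8 every continuation byte matches 10xxxxxx, so at most
 * one character's worth of bytes is discarded; under the original
 * FSS/UTF, a lead byte and a continuation byte could look identical,
 * making this scan impossible. */
size_t utf8_resync(const unsigned char *s, size_t i, size_t n)
{
    while (i < n && (s[i] & 0xc0) == 0x80)  /* 10xxxxxx: continuation */
        i++;
    return i;  /* index of the next lead byte (or end of input) */
}
```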