In UTF-8, two-byte codes are used to encode the code points from U+0080 to U+07FF (2¹¹ - 2⁷ code points) instead of U+0080 to U+087F (2¹¹ code points). I fail to see why the range of valid two-byte codes starts at C2 80 (11000010 10000000) instead of at C0 80 (11000000 10000000).

Why would starting two-byte codes at C0 80 break the self-synchronization guarantees of UTF-8? If a reader started reading in the middle of a multi-byte sequence at a continuation byte (10xxxxxx), it would still be able to uniquely identify (and move to) the next single (0xxxxxxx) or start (11xxxxxx) byte, wouldn't it? (Also, would starting the range of valid two-byte codes at C0 80 eliminate the potential problems associated with overlong encodings, since, for instance, C0 80 would then encode U+0080 instead of being an invalid encoding?)
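
A minimal sketch of that resynchronization step (in C; the helper name is mine, purely for illustration): a reader dropped at an arbitrary offset only has to skip continuation bytes until it reaches a byte that can start a character.

    #include <stddef.h>

    /* Skip forward over continuation bytes (10xxxxxx) until we reach a byte
     * that can start a character: ASCII (0xxxxxxx) or a lead byte (11xxxxxx). */
    size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
    {
        while (pos < len && (buf[pos] & 0xC0) == 0x80)
            pos++;
        return pos;
    }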

How does excluding just two particular byte values (C0 and C1) as obviously invalid increase error tolerance in any significant way? (After all, a single-bit error might just as well transform one valid code into another one?)

  • owlfolio.org/development/corrected-utf-8 Commented yesterday
  • Nothing to do with self-synchronisation. But it means that a decoder needs to only shift bits around, not do any arithmetic, to arrive at the code point. Commented yesterday
  • I would like to express my heartfelt thanks to everyone who took the time to answer my - rather esoteric - question. Your kindness and generosity are very much appreciated! Commented 42 mins ago

1 Answer

Why would starting two-byte codes at C0 80 break the self-synchronization guarantees of UTF-8?

I think you're jumping to conclusions. The documented reason wasn't that it would break the self-synchronization guarantees, nor that it would increase error tolerance in data; it seems to have been purely to reduce the complexity of the encoding/decoding implementation.

Answer heavily edited as I most likely have misread the referenced text and misattributed the note.

The original X/Open FSS-UTF proposal (starting at "/usr/ken/utf/xutf from dump") did offset the values, but the "modified" Ken Thompson proposal that actually became UTF-8 no longer did so. At the bottom of the file, Ken's copy of the FSS-UTF proposal has his improved variant appended (starting at "We define 7 byte types"), with what appears to be his footnote:

  1. The 2 byte sequence has 2¹¹ codes, yet only 2¹¹ - 2⁷ are allowed. The codes in the range 0-7f are illegal. I think this is preferable to a pile of magic additive constants for no real benefit. Similar comment applies to all of the longer sequences.

In other words, even though FSS-UTF was a major improvement over the original UTF (now known as UTF-1), it was still thought to be unnecessarily complex (the Plan 9 team was certainly one to avoid adding complexity).
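
To illustrate what "a pile of magic additive constants" means in practice, here is a small sketch of my own in C (not code from either proposal; the bias value is purely illustrative). In UTF-8 as adopted, decoding a two-byte sequence is pure masking and shifting, whereas a scheme in which C0 80 encodes U+0080 has to add a per-length offset so that the two-byte range starts right after the one-byte range:

    /* UTF-8 as adopted: the code point is just the payload bits of
     * 110xxxxx 10yyyyyy, nothing beyond mask and shift. */
    unsigned utf8_decode2(unsigned char b0, unsigned char b1)
    {
        return ((unsigned)(b0 & 0x1F) << 6) | (b1 & 0x3F);
    }

    /* An offset-based scheme (illustrative only): add a per-length bias so
     * that two-byte codes begin at U+0080, i.e. C0 80 would decode to U+0080. */
    unsigned biased_decode2(unsigned char b0, unsigned char b1)
    {
        return 0x80 + (((unsigned)(b0 & 0x1F) << 6) | (b1 & 0x3F));
    }

The encoder would have to subtract the same constants, and each sequence length would need its own constant, which is the complexity the footnote argues brings "no real benefit".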

A later Google+ post by Rob Pike also confirms this, stating that the originally proposed "FSS/UTF was more intricate than we liked".

It's been well documented elsewhere (http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt) that one Wednesday night, after a phone call from X/Open, Ken Thompson and I were sitting in a New Jersey diner talking about how best to represent Unicode as a byte stream. Given the experience we had accumulated dealing with the original UTF, which had many problems, we knew what we wanted. X/Open had offered us a deal: implement something better than their proposal, called FSS/UTF (File System Safe UTF; the name tells you something on its own), and do so before Monday. In return, they'd push it as existing practice.

UTF was awful. It had modulo-192 arithmetic, if I remember correctly, and was all but impossible to implement efficiently on old SPARCs with no divide hardware. Strings like "/*" could appear in the middle of a Cyrillic character, making your Russian text start a C comment. And more. It simply wasn't practical as an encoding: think what happens to that slash byte inside a Unix file name.

FSS/UTF addressed that problem, which was great. Big improvement though it was, however, FSS/UTF was more intricate than we liked and lacked one property we insisted on: If a byte is corrupted, it should be possible to re-synch the encoded stream without losing more than one character. When we claimed we wanted that property, and sensed we could press for a chance to design something right, X/Open gave us the green light to try.

  • Also to note: UTF-8 was not designed to be the most efficient encoding (but it can recover the next character in case of errors), so there was no need to add complexity. OTOH we now must check for overlong sequences and handle them as errors (for security reasons; see the sketch after these comments). But we can also encode 00 using two bytes, allowing Unicode strings that are not permitted as "C-language strings" (and the original UTF-8 allowed 31-bit encodings, so there was no problem with running out of codes). Commented yesterday
  • You have plenty of invalid byte sequences anyway, for example the high and low surrogates from D800 to DFFF, and a few explicitly invalid code points. Commented yesterday
  • @GiacomoCatenazzi, you can't have a two-byte NUL in standard UTF-8, as that would be an overlong encoding. Commented yesterday
  • @ilkkachu: I mean NUL encoded as two bytes, as 0xC0, 0x80. It is usually used only internally (because we should avoid exposing overlong UTF-8). Java seems to use it, but calls it MUTF-8 (Modified UTF-8). It is also used where NUL is allowed and using C-strings is easier. Yes, it is not standard UTF-8, but it is a nice feature of UTF-8 compared to UTF-1 (more space efficient, and more in line with the question, but with some disadvantages). Commented 20 hours ago
  • It's not "we should avoid overlong UTF-8". 0xC0 0x80 IS NOT UTF-8. You NEVER, ever, ever produce it. If you receive it, you either treat the whole UTF-8 string as invalid, or as an alternative you remove the first byte 0xC0 and start all over again. And all external UTF-8 input goes through that filter. Commented 17 hours ago
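
A small sketch (my own, in C, not code from the comments) of the overlong check being discussed: a strict decoder rejects any two-byte sequence whose lead byte is C0 or C1, because the decoded value would fall below U+0080, so the 0xC0 0x80 form of NUL never passes.

    #include <stdbool.h>

    /* Validate a two-byte UTF-8 sequence, rejecting overlong forms. */
    bool valid_two_byte(unsigned char b0, unsigned char b1)
    {
        if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
            return false;      /* not shaped like 110xxxxx 10yyyyyy */
        return b0 >= 0xC2;     /* C0 80 .. C1 BF would decode below U+0080 */
    }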
