Missing Pieces in Python 3 Unicode

12 thoughts
last posted March 3, 2015, 6:21 a.m.

The task of tracking these remaining concerns has migrated to the CPython issue tracker: http://bugs.python.org/issue22555


There's a case to be made that stdin and stdout should default to using "surrogateescape" rather than "strict", as we do for other operating system APIs.
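
To make the difference concrete, here's a rough sketch (not a suggestion for how the interpreter itself should do it) that writes a surrogate escaped string to an in-memory stream under each error handler. The write_text helper and the sample bytes are made up for illustration; io.TextIOWrapper and io.BytesIO are the real standard library APIs.

```python
import io


def write_text(text, errors):
    """Write text to an in-memory UTF-8 stream and return the resulting bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-8", errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


# A name containing a byte that isn't valid UTF-8, decoded with the
# surrogateescape handler (the byte 0xE9 becomes the lone surrogate U+DCE9)
name = b"caf\xe9".decode("utf-8", errors="surrogateescape")

# With surrogateescape on the stream, the original bytes pass through unchanged
passed_through = write_text(name, "surrogateescape")

# With strict, the same write fails instead
try:
    write_text(name, "strict")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False
```

So data that came in from the operating system can go back out through a "surrogateescape" stream, while a "strict" stream rejects it at the point of output.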


Something that may be useful is an explicit ability to assert a UTF-8 clean environment, and then have the interpreter operate on that basis:

  • 'surrogateescape' treated like 'strict'
  • default IO encoding set to 'utf-8'
  • standard streams all set to 'utf-8'
  • filesystem encoding set to 'utf-8' (also used for os.environ and sys.argv)

At the moment, Python's desire to be tolerant of environmental configuration errors makes it difficult to enforce consistency when you actually want it (treating any deviations as an error in the environment rather than in Python).
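
Until something like that exists, a program can at least check the environment for itself at startup. This is only a sketch under my own assumptions: utf8_deviations is a hypothetical helper name, it only inspects the standard streams and the filesystem encoding, and it does nothing about how 'surrogateescape' behaves.

```python
import codecs
import io
import sys


def _is_utf8(encoding):
    """Return True if the encoding name is an alias for UTF-8."""
    try:
        return codecs.lookup(encoding).name == "utf-8"
    except (LookupError, TypeError):
        # Unknown encoding name, or no encoding at all (None)
        return False


def utf8_deviations(streams=None, fs_encoding=None):
    """Describe every inspected setting that is not UTF-8 (hypothetical helper)."""
    if streams is None:
        streams = {"stdin": sys.stdin, "stdout": sys.stdout, "stderr": sys.stderr}
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    problems = []
    if not _is_utf8(fs_encoding):
        problems.append("filesystem encoding is %r" % (fs_encoding,))
    for name, stream in streams.items():
        encoding = getattr(stream, "encoding", None)
        if not _is_utf8(encoding):
            problems.append("%s encoding is %r" % (name, encoding))
    return problems


# Demonstration with stand-in streams rather than the real environment
clean = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
dirty = io.TextIOWrapper(io.BytesIO(), encoding="latin-1")
no_problems = utf8_deviations({"stdout": clean}, "utf-8")
one_problem = utf8_deviations({"stdout": dirty}, "utf-8")
```

Calling utf8_deviations() with no arguments checks the real process, and an application that wants to treat deviations as an environment error could exit if the result is non-empty.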


There is also this trick, which turns environmental encoding errors into immediate exceptions rather than allowing them to enter your Python application:

import codecs
codecs.register_error('surrogateescape', codecs.strict_errors)

If you have a UTF-8 clean environment, this may not be a bad idea.


Some other niceties we should probably offer:

  • ability to force UTF-8 as the global default encoding for everything (not arbitrary encodings, just UTF-8)
  • a way to switch the encoding and/or error handling of an existing IO stream (http://bugs.python.org/issue15216)
  • ensure any APIs that may return surrogate escaped strings (or strings improperly decoded as latin-1) are clearly marked as doing so

One way to handle this without introducing a new type might be to have an assumed_encoding attribute on strings.

APIs that know they're making unwarranted assumptions about the original binary encoding (including when they introduce surrogate escapes on decoding, or when they apply latin-1 as a blunt instrument) could set this attribute, triggering the following rules:

  • If two strings are combined and have the same assumed encoding, the result also has that assumed encoding
  • If two strings are combined, and one has an assumed encoding while the other does not, the result has that assumed encoding
  • Attempting to combine strings with different assumed encodings is an error
  • Attempting to encode a string with an assumed encoding using a different encoding is an error

This would allow the "decoding dance" described below to be standardised, rather than the originally assumed encoding needing to be remembered somewhere else:

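def fix_decoding_assumption(sorta_str, encoding):
    # Sketch only: relies on the proposed str.assumed_encoding attribute, which doesn't exist
    if sorta_str.assumed_encoding is None:
        # No assumption was made, so there's nothing to fix
        return sorta_str
    if sorta_str.assumed_encoding == encoding == 'latin-1':
        # latin-1 turned out to be correct, so just clear the marker
        return str(sorta_str, assumed_encoding=None)
    if sorta_str.assumed_encoding == encoding:
        # UnicodeDecodeError needs five arguments, so raise its parent class instead
        raise UnicodeError("Has surrogate escapes")
    # Recover the original bytes, then decode them strictly with the requested encoding
    raw = sorta_str.encode(sorta_str.assumed_encoding, errors='surrogateescape')
    return raw.decode(encoding, errors='strict')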
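
Since str has no such attribute today, the combination rules listed above can only be tried out with a stand-in. Here's a minimal sketch using a wrapper class: AssumedStr and everything on it are hypothetical names of my own, and it compares encoding names literally rather than normalising aliases.

```python
class AssumedStr:
    """Hypothetical stand-in for a str carrying an assumed_encoding attribute."""

    def __init__(self, text, assumed_encoding=None):
        self.text = text
        self.assumed_encoding = assumed_encoding

    def __add__(self, other):
        if isinstance(other, str):
            # An ordinary str carries no assumption
            other = AssumedStr(other)
        if not isinstance(other, AssumedStr):
            return NotImplemented
        mine, theirs = self.assumed_encoding, other.assumed_encoding
        if mine is not None and theirs is not None and mine != theirs:
            # Rule 3: different assumed encodings can't be combined
            raise ValueError("cannot combine %r with %r" % (mine, theirs))
        # Rules 1 and 2: the result keeps whichever assumption is present
        return AssumedStr(self.text + other.text, mine if mine is not None else theirs)

    def encode(self, encoding, errors="strict"):
        if self.assumed_encoding not in (None, encoding):
            # Rule 4: only the assumed encoding may be used to encode
            raise ValueError(
                "assumed %r, asked to encode as %r" % (self.assumed_encoding, encoding)
            )
        return self.text.encode(encoding, errors)


# A suspect string: bytes decoded as ASCII with surrogate escapes
suspect = AssumedStr(b"caf\xe9".decode("ascii", errors="surrogateescape"), "ascii")
combined = suspect + " au lait"
recovered = combined.encode("ascii", errors="surrogateescape")
```

The point of the sketch is just that the suspect origin travels with the data, so a mismatch is reported where strings are combined or encoded rather than somewhere far downstream.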

One problem is that there's no standard way to mark a string which must be considered suspect due to an assumed encoding - it looks just like an ordinary string, and has forgotten its dubious origins.

This probably needs to change - mixing strings that were decoded using latin-1 or surrogateescape when that may not be the actual encoding has all the same problems that mixing 8-bit strings with unknown encodings did in Python 2. While the symptoms are somewhat different, and slightly more likely to result in an exception rather than silent data corruption, any such errors are still likely to be reported far from where the problem was actually introduced.


However, for the status quo, there are still a few pieces missing. For a "sorta decoded" surrogate escaped string, the dance to turn it back into a properly decoded string with no surrogates is like this:

sorta_decoded_str.encode(assumed_encoding, errors="surrogateescape").decode(correct_encoding)

The case where the assumed encoding is latin-1 is just a special case of this one: the surrogate escape error handler will never fire in that situation, because latin-1 is a direct mapping of byte values to the first 256 Unicode code points.
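
Here's a small worked example of both cases, using made-up sample data (UTF-8 bytes that were initially decoded with the wrong encoding):

```python
original = "caf\u00e9"
raw = original.encode("utf-8")  # b'caf\xc3\xa9'

# Case 1: decoded as ASCII with surrogateescape, so the two non-ASCII bytes
# were smuggled through as lone surrogates
sorta_decoded = raw.decode("ascii", errors="surrogateescape")
fixed = sorta_decoded.encode("ascii", errors="surrogateescape").decode("utf-8")

# Case 2: decoded as latin-1, which never fails but yields the wrong characters
mojibake = raw.decode("latin-1")
also_fixed = mojibake.encode("latin-1", errors="surrogateescape").decode("utf-8")
```

In both cases the encode step recovers the original bytes exactly, which is why the dance only works if you still know what the assumed encoding was.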


I believe the eventual goal should be that Python just defaults to UTF-8 for everything, and you have to jump through hoops to make anything else work.

We may be getting closer to that being a reasonable option (Armin tells me that .NET defaults to UTF-8 for everything, even though other Windows tools still don't do that).

3.4 is probably still too soon, but it may be something to consider for 3.5.


For operating system APIs, it would be lovely if we could assume the shiny happy situation where all data is encoded correctly, and all encoding declarations are correct. Unfortunately, it just isn't true, and we took the position (in PEP 383) that it was better to ensure code like the following worked properly even if there was improperly encoded metadata in the filesystem:

all_stats = [os.stat(name) for name in os.listdir(dirname)]
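
What makes that work is the round trip through 'surrogateescape'. As a small illustration (assuming a POSIX system, since Windows handles filenames differently), os.fsdecode and os.fsencode apply the same filesystem encoding and error handler that the os functions use for filenames. The roundtrips helper and the sample name are made up for the example.

```python
import os


def roundtrips(raw_name):
    """Check that a bytes filename survives being decoded to str and encoded again."""
    return os.fsencode(os.fsdecode(raw_name)) == raw_name


# Neither valid UTF-8 nor ASCII, as might be left behind by a tool that
# wrote latin-1 filenames
bad_name = b"caf\xe9.txt"

if os.name == "posix":
    # The str form may contain a lone surrogate, but the bytes come back intact
    survived = roundtrips(bad_name)
```

That is how a name returned by os.listdir can be handed straight back to os.stat, even when it could not be decoded properly.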

If you're someone who lives on that serialisation boundary between binary data and text data, there are a lot of things about the Python 3 model that really suck. We swung the wrecking ball, but we're still in the process of building the replacements because we don't necessarily know what they should be.

There are three big areas where the distinction is blurry:

  • operating system APIs
  • file contents
  • wire protocols

When you see the complaints from folks like Armin Ronacher and Chris McDonough about the text model in Python 3, they're not wrong for the kind of code they're writing.

As cross-platform web framework developers they're constantly playing in the binary/text interface layer where most Python programmers don't spend their time. Be grateful to them, folks - just as the web framework developers directly inspired most of the default behaviours in the Python 3 text model, they're suffering through the pain of figuring out how to make the new model work for the use cases that we deliberately broke.


When we did the Python 3 migration, we knew we were swinging a wrecking ball through all the current strategies people used to cope with the messy reality of the blurry boundary between binary and text data.

Python 2 assumes you live on that boundary all the time, and will gleefully corrupt data by allowing implicit combination of data from different sources with different encodings.

The core Python 3 model is different: it assumes the shiny happy world where text is text, binary data is binary data, we use encoding and decoding to get between them, encoding declarations are never wrong, and data is never corrupted.
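
In code, that model looks roughly like the following. It's just an illustration with made-up sample data: mixing the two types is an error, and the only way across the boundary is an explicit encode or decode.

```python
text = "caf\u00e9"           # text: a sequence of Unicode code points
data = text.encode("utf-8")  # binary: the bytes b'caf\xc3\xa9'

# Implicit combination is rejected rather than guessed at
try:
    text + data
except TypeError:
    mixing_rejected = True
else:
    mixing_rejected = False

# Crossing the boundary requires naming the encoding
round_tripped = data.decode("utf-8")
```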