Missing Pieces in Python 3 Unicode

12 thoughts
last posted March 3, 2015, 6:21 a.m.

6 earlier thoughts

0

One way to handle this without introducing a new type might be to have an assumed_encoding attribute on strings.

APIs that know they're making unwarranted assumptions about the original binary encoding (including when they introduce surrogate escapes on decoding, or when they apply latin-1 as a blunt instrument) could set this attribute, triggering the following rules:

  • If two strings are combined and have the same assumed encoding, the result also has that assumed encoding
  • If two strings are combined, and one has an assumed encoding while the other does not, the result has that assumed encoding
  • Attempting to combine strings with different assumed encodings is an error
  • Attempting to encode a string with an assumed encoding using a different encoding is an error

This would allow the "decoding dance" above to be standardised, rather than the originally assumed encoding needing to be remembered somewhere else:

def fix_decoding_assumption(sorta_str, encoding):
    if sorta_str.assumed_encoding is None:
        return sorta_str
    if sorta_str.assumed_encoding == encoding == 'latin-1':
        return str(sorta_str, assumed_encoding=None)
    if sorta_str.assumed_encoding == encoding:
            raise UnicodeDecodeError("Has surrogate escapes")
     return sorta_str.encode(sorta_str.assumed_encoding, errors='surrogateescape').decode(encoding, errors='strict')

5 later thoughts