Missing Pieces in Python 3 Unicode

by ncoghlan_dev

12 thoughts
last posted March 3, 2015, 6:21 a.m.

One way to handle this without introducing a new type might be to have an assumed_encoding attribute on strings.

APIs that know they're making unwarranted assumptions about the original binary encoding (including when they introduce surrogate escapes on decoding, or when they apply latin-1 as a blunt instrument) could set this attribute, triggering the following rules:

If two strings are combined, and both have the same assumed encoding, the result also has that assumed encoding.
If two strings are combined, and one has an assumed encoding while the other does not, the result has that assumed encoding.
Attempting to combine strings with different assumed encodings is an error.
Attempting to encode a string with an assumed encoding using a different encoding is an error.
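
As a rough illustration of those four rules, here is a minimal sketch using a str subclass. Everything in it is an assumption for demonstration purposes: the AssumedStr name, the choice of ValueError for "is an error", and comparing encodings by their normalised codec names are not part of the proposal, which would put the attribute on str itself.

```python
import codecs


class AssumedStr(str):
    """Hypothetical str subclass that carries an assumed_encoding tag."""

    def __new__(cls, value="", assumed_encoding=None):
        self = super().__new__(cls, value)
        self.assumed_encoding = assumed_encoding
        return self

    @staticmethod
    def _same_codec(a, b):
        # Treat aliases such as 'latin-1' and 'latin1' as the same encoding.
        return codecs.lookup(a).name == codecs.lookup(b).name

    @classmethod
    def _merge(cls, left, right):
        # Plain str objects have no attribute, which counts as "no assumption".
        a = getattr(left, "assumed_encoding", None)
        b = getattr(right, "assumed_encoding", None)
        if a is not None and b is not None and not cls._same_codec(a, b):
            # Rule 3: different assumed encodings cannot be combined.
            raise ValueError(f"cannot combine assumed encodings {a!r} and {b!r}")
        # Rules 1 and 2: keep whichever assumption is present.
        return a if a is not None else b

    def __add__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        return AssumedStr(str.__add__(self, other), self._merge(self, other))

    def __radd__(self, other):
        # Handles plain_str + AssumedStr.
        if not isinstance(other, str):
            return NotImplemented
        return AssumedStr(str.__add__(other, self), self._merge(other, self))

    def encode(self, encoding="utf-8", errors="strict"):
        assumed = self.assumed_encoding
        if assumed is not None and not self._same_codec(assumed, encoding):
            # Rule 4: only the assumed encoding may be used to encode.
            raise ValueError(f"assumed encoding is {assumed!r}, not {encoding!r}")
        return str.encode(self, encoding, errors)


tagged = AssumedStr("caf\udce9", assumed_encoding="ascii")
combined = "menu: " + tagged + AssumedStr("!", assumed_encoding="ascii")
print(combined.assumed_encoding)
```

Only concatenation is covered here; a real implementation would also need to handle join, formatting, slicing and the other operations that produce new strings.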

This would allow the "decoding dance" above to be standardised, rather than requiring the originally assumed encoding to be remembered somewhere else:

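def fix_decoding_assumption(sorta_str, encoding):
    # No assumption recorded, so there is nothing to fix.
    if sorta_str.assumed_encoding is None:
        return sorta_str
    # latin-1 maps every byte to a code point, so just drop the marker.
    if sorta_str.assumed_encoding == encoding == 'latin-1':
        return str(sorta_str, assumed_encoding=None)
    # The assumed encoding is the requested one, so re-decoding cannot help.
    if sorta_str.assumed_encoding == encoding:
        # Pseudocode: the real UnicodeDecodeError constructor takes five arguments.
        raise UnicodeDecodeError("Has surrogate escapes")
    # Recover the original bytes, then decode them strictly with the requested encoding.
    raw = sorta_str.encode(sorta_str.assumed_encoding, errors='surrogateescape')
    return raw.decode(encoding, errors='strict')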
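
For comparison, the same dance can be done in Python as it exists today, as long as the caller remembers the assumed encoding separately. The sketch below uses only the standard str.encode and bytes.decode methods; the redecode helper name and the sample data are made up for illustration.

```python
def redecode(text, assumed_encoding, actual_encoding):
    """Undo a lossless-but-wrong decode and retry with the right encoding."""
    # surrogateescape turns escaped code points back into the original bytes.
    raw = text.encode(assumed_encoding, errors="surrogateescape")
    return raw.decode(actual_encoding, errors="strict")


original = "café".encode("utf-8")

# Case 1: decoded as ASCII, with undecodable bytes smuggled through as surrogates.
guessed = original.decode("ascii", errors="surrogateescape")
assert redecode(guessed, "ascii", "utf-8") == "café"

# Case 2: latin-1 used as a blunt instrument, since it accepts every byte.
mojibake = original.decode("latin-1")
assert redecode(mojibake, "latin-1", "utf-8") == "café"
```

In both cases the second argument has to be carried around by the caller, which is exactly the bookkeeping an assumed_encoding attribute would take over.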

created July 3, 2013, 2:02 a.m.
