One way to handle this without introducing a new type might be to have an assumed_encoding
attribute on strings.
APIs that know they're making unwarranted assumptions about the original binary encoding (whether by introducing surrogate escapes on decoding, or by applying latin-1
as a blunt instrument) could set this attribute, recording the assumption so it can be corrected later.
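
For illustration, here's a rough sketch of how an API might do that, using a hypothetical SortaStr subclass to stand in for the proposed attribute and a made-up decode_os_data helper that only records the assumption when surrogate escapes actually appear:

    class SortaStr(str):
        """Prototype of a str carrying the proposed assumed_encoding attribute."""
        def __new__(cls, value, assumed_encoding=None):
            self = super().__new__(cls, value)
            self.assumed_encoding = assumed_encoding
            return self

    def decode_os_data(raw, encoding='utf-8'):
        # Decode permissively, then flag the result only if the permissive
        # error handler actually had to smuggle undecodable bytes through
        text = raw.decode(encoding, errors='surrogateescape')
        if any('\udc80' <= ch <= '\udcff' for ch in text):
            return SortaStr(text, assumed_encoding=encoding)
        return SortaStr(text)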
This would allow the "decoding dance" above to be standardised, rather than requiring the originally assumed encoding to be remembered somewhere else:
    def fix_decoding_assumption(sorta_str, encoding):
        # No assumption recorded: the text is already correctly decoded
        if sorta_str.assumed_encoding is None:
            return sorta_str
        # latin-1 was assumed and turned out to be right: just clear the flag
        if sorta_str.assumed_encoding == encoding == 'latin-1':
            return str(sorta_str, assumed_encoding=None)
        # The right codec was already tried and still introduced surrogate
        # escapes, so a strict decode can never succeed (sketch only: a real
        # UnicodeDecodeError needs its full five-argument constructor)
        if sorta_str.assumed_encoding == encoding:
            raise UnicodeDecodeError("Has surrogate escapes")
        # Reverse the original (incorrect) decoding and redo it properly
        return sorta_str.encode(sorta_str.assumed_encoding,
                                errors='surrogateescape').decode(encoding, errors='strict')
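
A hypothetical walkthrough, combining that function with the SortaStr prototype sketched earlier (this path only exercises the final branch, so it doesn't depend on the proposed str constructor):

    # Bytes that are valid latin-1 but not valid UTF-8
    raw = 'café'.encode('latin-1')
    # A permissive API decodes them as UTF-8, introducing a surrogate escape
    sorta = decode_os_data(raw)       # sorta.assumed_encoding == 'utf-8'
    # Once the real encoding is known, the assumption can be corrected
    fixed = fix_decoding_assumption(sorta, 'latin-1')
    assert fixed == 'café'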