Missing Pieces in Python 3 Unicode

12 thoughts
last posted March 3, 2015, 6:21 a.m.

One problem is that there's no standard way to mark a string that must be considered suspect because it was decoded with an assumed encoding: it looks just like an ordinary string, and has forgotten its dubious origins.

This probably needs to change. Mixing strings that were decoded as latin-1, or with the surrogateescape error handler, when the assumed encoding may not be the actual one has all the same problems that mixing 8-bit strings of unknown encoding did in Python 2. While the symptoms are somewhat different, and slightly more likely to result in an exception rather than silent data corruption, any such errors are still likely to be reported far from where the problem was actually introduced.
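As a rough illustration of the point above, here is a minimal sketch using only the standard `bytes.decode` and `str.encode` methods. The sample text and the variable names are invented for the example; the input is assumed to be UTF-8 that some earlier layer decoded under a wrong guess.

```python
# Bytes that are really UTF-8, but whose encoding is assumed unknown here.
raw = "café".encode("utf-8")

# Guess 1: latin-1 maps every byte to a code point, so a wrong guess never fails.
as_latin1 = raw.decode("latin-1")

# Guess 2: ASCII plus surrogateescape smuggles the undecodable bytes through
# as lone surrogate code points.
escaped = raw.decode("ascii", errors="surrogateescape")

# Both results are plain str objects; nothing records how they were produced.
print(type(as_latin1) is str, type(escaped) is str)

# Mixing with correctly decoded text succeeds without complaint...
label = "menu: " + as_latin1   # silently wrong text (mojibake)
mixed = "menu: " + escaped     # carries hidden lone surrogates

# ...and the problem only surfaces later, at a strict encoding boundary.
try:
    mixed.encode("utf-8")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

print(repr(label))
print(type(failure).__name__)

# The original bytes can be recovered only by code that knows which
# assumption was made, which is exactly the information the str has lost.
recovered = escaped.encode("ascii", errors="surrogateescape")
```

The latin-1 path shows the silent-corruption case and the surrogateescape path shows the delayed-exception case; in both, the string gives no hint at the point of failure about where the guess was made.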