Hacker School Journal

14 thoughts
last posted April 8, 2014, 9:49 p.m.


For my own sense of propriety and politesse, I decided to programmatically assign my control characters from Unicode's Private Use Area, a range the standard reserves for private agreements and promises never to fill with official characters. I originally thought it might be nice to use Plane 15, the Supplementary Private Use Area-A, but in UTF-16 (and in narrow builds of Python 2) those code points have to be represented as surrogate pairs, which would complicate the issue. Later concerns ended up persuading me to move to Python 3, where every string is a Unicode string, which might have mooted the problem. Even so, it seems saner and more extensible to restrict myself to the standard Private Use Area in the Basic Multilingual Plane (U+E000 through U+F8FF), which still gives me 6,400 code points to make use of.
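A minimal sketch of what that programmatic assignment could look like in Python 3. The names `assign_pua_codes`, `PUA_START`, and `PUA_END` are made up for illustration, and the sketch assumes the abbreviations arrive as a list of distinct phrases:

```python
# Bounds of the Private Use Area in the Basic Multilingual Plane.
PUA_START = 0xE000
PUA_END = 0xF8FF  # inclusive, so 6,400 code points in total


def assign_pua_codes(phrases):
    """Map each phrase to its own Private Use Area character.

    Hypothetical helper: assumes `phrases` is a list of distinct strings.
    """
    capacity = PUA_END - PUA_START + 1
    if len(phrases) > capacity:
        raise ValueError(
            "need %d code points, only %d available" % (len(phrases), capacity)
        )
    # Hand out code points in order, starting at U+E000.
    return {phrase: chr(PUA_START + i) for i, phrase in enumerate(phrases)}


table = assign_pua_codes(["the", "and", "of the"])
for phrase, char in table.items():
    print(repr(phrase), "U+%04X" % ord(char))
```

Every character handed out this way is a single code point in the Basic Multilingual Plane, so no surrogate pairs are involved whichever Python version or build ends up running it.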

In any case, I should soon implement a step that sanitizes the source data before the abbreviations are applied, escaping any characters in that range that already appear in the document.
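One way that sanitizing step might work, as a rough sketch rather than a settled design: reserve a single Private Use Area code point as an escape marker and rewrite every pre-existing character from that range as the marker followed by four hex digits. The choice of U+F8FF as the marker and the names `escape_pua` and `unescape_pua` are assumptions of mine, and the abbreviation table would have to leave that one code point unassigned.

```python
import re

# Assumption: the last PUA code point is reserved as the escape marker.
ESCAPE = "\uf8ff"

# Matches any character in the BMP Private Use Area, marker included.
PUA_PATTERN = re.compile("[\ue000-\uf8ff]")

# Matches the marker followed by exactly four lowercase hex digits.
ESCAPED_PATTERN = re.compile(ESCAPE + "([0-9a-f]{4})")


def escape_pua(text):
    """Replace each PUA character with ESCAPE plus its code point in hex."""
    return PUA_PATTERN.sub(
        lambda m: ESCAPE + format(ord(m.group()), "04x"), text
    )


def unescape_pua(text):
    """Reverse escape_pua, restoring the original PUA characters."""
    return ESCAPED_PATTERN.sub(lambda m: chr(int(m.group(1), 16)), text)


original = "plain \ue001 text"
escaped = escape_pua(original)
assert unescape_pua(escaped) == original
```

After escaping, the only Private Use Area character left in the text is the marker itself, so any other code point in the range can be used for abbreviations without colliding with the source. Unescaping has to run after the abbreviations have been expanded again, so that the restored characters are not mistaken for abbreviations.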
