I've been mulling over the problem that the host language might provide UTF-16 strings (Java or JavaScript) or UTF-8 strings (C), but you sometimes want to write code that runs efficiently in both environments.
I think that since ASCII is a subset of both encodings, and many (most?) operations only care about the ASCII characters of a string, an API that handles ASCII specially could be implemented efficiently in both languages. For example:
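Here's a minimal sketch of the kind of API I have in mind, in Java; the interface and method names are made up purely to illustrate the shape, not taken from any existing library. Every operation only compares against ASCII characters (values below 0x80), and an ASCII value never appears inside a multi-unit sequence in either encoding, so each method can be a straight scan over UTF-8 bytes or UTF-16 chars with no decoding.

```java
import java.util.List;

// Hypothetical interface -- the names are illustrative, not from any real library.
// Each method only inspects ASCII code units, so it maps directly onto a
// code-unit scan over either UTF-8 bytes or UTF-16 chars.
interface AsciiStringOps {
    // Index of the next occurrence of an ASCII character, or -1 if absent.
    int indexOfAscii(char asciiChar, int fromIndex);

    // True if the string starts with the given ASCII-only prefix.
    boolean startsWithAscii(String asciiPrefix);

    // Split on a single ASCII delimiter such as ',' or '/'.
    List<String> splitOnAscii(char asciiDelimiter);

    // Case-insensitive comparison that only folds A-Z / a-z,
    // leaving all non-ASCII content untouched.
    boolean equalsAsciiIgnoreCase(String other);
}
```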
Code that does care about Unicode will inevitably want APIs that operate on code points, which requires (inefficient) decoding regardless of whether the underlying string is UTF-8 or UTF-16.
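To make that cost concrete, here is roughly what code-point access looks like over each representation. This is a sketch with no validation of malformed input; the point is just that neither case is a simple array index.

```java
// Decoding one code point from UTF-16 code units (what String#codePointAt does in Java).
static int codePointAtUtf16(char[] units, int i) {
    char hi = units[i];
    if (Character.isHighSurrogate(hi) && i + 1 < units.length
            && Character.isLowSurrogate(units[i + 1])) {
        return Character.toCodePoint(hi, units[i + 1]); // two 16-bit units
    }
    return hi;                                          // one 16-bit unit
}

// Decoding one code point from UTF-8 bytes (no error handling for bad sequences).
static int codePointAtUtf8(byte[] units, int i) {
    int b = units[i] & 0xFF;
    if (b < 0x80) return b;                             // 1 byte: ASCII
    if (b < 0xE0) return ((b & 0x1F) << 6)              // 2 bytes
            | (units[i + 1] & 0x3F);
    if (b < 0xF0) return ((b & 0x0F) << 12)             // 3 bytes
            | ((units[i + 1] & 0x3F) << 6)
            | (units[i + 2] & 0x3F);
    return ((b & 0x07) << 18)                           // 4 bytes
            | ((units[i + 1] & 0x3F) << 12)
            | ((units[i + 2] & 0x3F) << 6)
            | (units[i + 3] & 0x3F);
}
```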
I think the key thing here is to provide operations for as many of the things people want to do as possible, while avoiding having them think about or depend on the underlying encoding. Operations like length() and indexOf() are troublesome because they are ambiguous about whether they refer to (8- or 16-bit) code units or (32-bit) code points. So providing useful operations that mostly eliminate the need to ask about the length is good. Like, we never want people doing a loop using a length and an index...
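Something like the hypothetical sketch below (names are mine, just to show the shape): no method ever hands a length or an integer index back to the caller, so callers can't accidentally write code-unit arithmetic.

```java
import java.util.Optional;
import java.util.PrimitiveIterator;

// Hypothetical encoding-agnostic string type; the names are illustrative only.
interface Str {
    // Lazily decodes code points one at a time, whatever the underlying encoding.
    PrimitiveIterator.OfInt codePoints();

    boolean isEmpty();
    boolean startsWith(Str prefix);

    // The part after the first occurrence of an ASCII delimiter, or empty if absent;
    // replaces the usual indexOf() + substring(index + 1) dance.
    Optional<Str> after(char asciiDelimiter);

    // The part before the first occurrence of an ASCII delimiter.
    Optional<Str> before(char asciiDelimiter);
}

// The pattern we want to make unnecessary, because "length" and "i" silently
// count code units (bytes or chars, depending on the host):
//
//   for (int i = 0; i < s.length(); i++) { doSomethingWith(s.charAt(i)); }
```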