Imported Text and Python with Unicode
Sun, Nov 1, 2015

There is no such thing as “plain text” to a computer: all text is stored as bytes, and those bytes get translated into something human-readable according to the encoding applied to them. Below are some brief notes on the Unicode standard, UTF-8 encoding, and the underlying byte representations as they apply to Python 2.
Usually You’re Lucky
For a great introduction and history of Unicode, see Joel Spolsky’s overview. In a nutshell, most characters that English-language developers use are mapped to the same byte values in the majority of popular encoding schemes. As a result, text stored in one encoding and manipulated in another often works just fine, especially since many systems these days happen to use the same encoding (UTF-8).
However, there’s no guarantee that text is encoded the same everywhere, and assuming it is can cause hard-to-debug text munging. Anytime you work with text that enters or exits your system, you must be explicit about how that data is encoded into bytes and decoded into text.
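To see why implicit assumptions are dangerous, consider what happens when the same bytes are decoded with two different encodings. This is a hypothetical illustration (the character choice is arbitrary); note that the wrong decode raises no error at all:

```python
# -*- coding: utf-8 -*-
# The same two bytes, decoded two ways. Decoding with the wrong
# encoding silently produces mojibake rather than an exception.
data = u'õ'.encode('utf-8')     # two bytes: 0xC3 0xB5
right = data.decode('utf-8')    # round-trips back to u'õ'
wrong = data.decode('latin-1')  # silently becomes u'Ãµ'
print(right)
print(wrong)
```

Because nearly every byte sequence is valid in single-byte encodings like Latin-1, the mistake surfaces later as garbled text, not as a traceback at the point of the error.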
Unicode vs. Encodings
Unicode is a standard that specifies a code point for every character you could ever want to use. This means we can safely use Unicode in our system knowing it can handle diverse written languages and symbol sets. So, when dealing with text that comes in and out of your system, you should work in Unicode wherever you can while it’s inside your system.
Code points provide an abstraction layer for the sake of consistency: on every computer, a Unicode ‘A’ is represented with the code point U+0041. Similarly, ‘õ’ is U+00F5. However, Unicode doesn’t define how the text is translated into bytes when stored or transmitted. For ingesting text, and for emitting it, you need to specify an encoding to ensure the text is properly translated into and out of Unicode.
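You can inspect code points directly. A minimal sketch using the characters mentioned above:

```python
# -*- coding: utf-8 -*-
# Code points are just integers; ord() reveals them, and hex() shows
# them in the same base as the U+XXXX notation.
a_point = ord(u'A')    # 65  -> U+0041
o_point = ord(u'õ')    # 245 -> U+00F5
print(hex(a_point))
print(hex(o_point))
```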
There are a few Unicode-compliant encodings that each specify their own way to translate a code point into a series of bytes. The de facto standard of these is UTF-8, which is attractive for a few reasons:
- Backwards compatibility with ASCII - Code points U+007F and below map directly to the 7-bit values used in ASCII, meaning Unicode systems can consume ASCII without issue.
- Frugality with storage - In contrast to the wider fixed-size units of UTF-16 and UTF-32, UTF-8 stores low code points in a single byte, and consumes additional bytes for larger code points only when needed.
- Byte-oriented - Unlike UTF-16 and UTF-32, UTF-8 is read one byte at a time, so it has no byte order and avoids conflicts between mixed-endian systems.
- Widespread use - Many modern systems already use UTF-8. For example, over 85% of sampled websites use UTF-8.
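The first two properties are easy to verify. A quick sketch (the euro sign is an arbitrary choice of a higher code point):

```python
# -*- coding: utf-8 -*-
# UTF-8's variable width in action: code points at or below U+007F
# encode to the single byte ASCII already assigns them; higher code
# points spill into extra bytes only as needed.
ascii_bytes = u'A'.encode('utf-8')   # 1 byte
latin_bytes = u'õ'.encode('utf-8')   # 2 bytes
euro_bytes = u'€'.encode('utf-8')    # 3 bytes
print(len(ascii_bytes))
print(len(latin_bytes))
print(len(euro_bytes))

# ASCII compatibility: pure-ASCII text encodes identically either way.
assert u'cat'.encode('utf-8') == u'cat'.encode('ascii')
```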
The Unicode Sandwich
In practice, you often have little control over the encoding used in the text your system ingests. What you can control, however, is how that text is decoded into Unicode objects inside your system. You also have control over how text is saved or transmitted by your system. In between, you can use a Unicode representation so the text can be manipulated without having to worry about encoding at all.
This idea is referred to as the Unicode Sandwich:
- Decode ingested text into Unicode early.
- Manipulate the Unicode text without concern for encoding.
- Store or transmit the manipulated text by encoding it just before it leaves
your system (preferably with UTF-8 encoding).
Here’s some sample Python 2 code. It:
- Takes in a raw pangram in Icelandic, encoded in the mac_iceland encoding.
- Decodes this string into a Unicode object for manipulation.
- Runs a regex over it, using a raw string pattern and a Unicode replacement string.
- Prints and outputs the string using UTF-8 encoding.
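The steps above might look like the following sketch. The particular pangram and the whitespace-collapsing regex are illustrative choices, not a reproduction of any specific snippet:

```python
# -*- coding: utf-8 -*-
import re

# 1. Raw bytes as they might arrive, encoded in mac_iceland.
#    (Extra whitespace added so the regex below has work to do.)
raw = u'Kæmi  ný  öxi  hér,  ykist  þjófum  nú  bæði  víl  og  ádrepa.'.encode('mac_iceland')

# 2. Decode into a Unicode object as soon as it enters the system.
text = raw.decode('mac_iceland')

# 3. Manipulate the Unicode text: a raw string pattern with a Unicode
#    replacement, collapsing runs of whitespace to single spaces.
munged = re.sub(r'\s+', u' ', text)

# 4. Encode just before output, preferably as UTF-8.
output = munged.encode('utf-8')
print(munged)
```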
Voilà! It’s that simple. Note that Python also has UTF-8 codecs for file writing, which is probably what should be used for anything beyond a basic example.
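A minimal sketch of that codecs-based file writing (the file path here is illustrative; tempfile just keeps the example self-contained):

```python
# -*- coding: utf-8 -*-
# codecs.open() returns a file object that encodes Unicode text on
# write and decodes it on read, so the sandwich holds at the file
# boundary too.
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'pangram.txt')

with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(u'Kæmi ný öxi hér.')

with codecs.open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()

print(round_tripped)
```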