Imported Text and Python with Unicode
Sun, Nov 1, 2015

There is no such thing as “plain text” to a computer: all text is stored as bytes, and those bytes get translated into something human-readable according to the encoding applied to them. Below are some brief notes on the Unicode standard, UTF-8 encoding, and the underlying byte representations as they apply to Python 2.
Usually You’re Lucky
For a great introduction and history of Unicode, see Joel Spolsky’s overview. In a nutshell, most characters that English-language developers use are mapped to the same byte values in the majority of popular encoding schemes. As a result, text stored in one encoding and manipulated in another often works just fine, especially since many systems these days happen to use the same encoding (UTF-8).
However, there’s no guarantee that text is encoded the same everywhere, and assuming it is can cause hard-to-debug text munging. Anytime you work with text that enters or exits your system, you must be explicit about how that data is encoded into bytes and decoded into text.
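To see why implicit assumptions are dangerous, consider what happens when the same bytes are decoded with two different encodings. This is a hypothetical illustration (the character choice is arbitrary); note that the wrong decode raises no error at all:

```python
# -*- coding: utf-8 -*-
# The same two bytes, decoded two ways. Decoding with the wrong
# encoding silently produces mojibake rather than an exception.
data = u'õ'.encode('utf-8')     # two bytes: 0xC3 0xB5
right = data.decode('utf-8')    # round-trips back to u'õ'
wrong = data.decode('latin-1')  # silently becomes u'Ãµ'
print(right)
print(wrong)
```

Because nearly every byte sequence is valid in single-byte encodings like Latin-1, the mistake surfaces later as garbled text, not as a traceback at the point of the error.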
Unicode vs. Encodings
Unicode is a standard that specifies a code point for every character you could ever want to use. This means we can safely use Unicode in our system knowing it can handle diverse written languages and symbol sets. So, when dealing with text that comes in and out of your system, you should work in Unicode wherever you can while it’s inside your system.
Code points provide an abstraction layer for the sake of consistency: on every computer, a Unicode ‘A’ is represented with the code point U+0041. Similarly, ‘õ’ is U+00F5. However, Unicode doesn’t define how the text is translated into bytes when stored or transmitted. For ingesting text, and for emitting it, you need to specify an encoding to ensure the text is properly translated into and out of Unicode.
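You can inspect code points directly. A minimal sketch using the characters mentioned above:

```python
# -*- coding: utf-8 -*-
# Code points are just integers; ord() reveals them, and hex() shows
# them in the same base as the U+XXXX notation.
a_point = ord(u'A')    # 65  -> U+0041
o_point = ord(u'õ')    # 245 -> U+00F5
print(hex(a_point))
print(hex(o_point))
```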
There are a few Unicode-compliant encodings that each specify their own way to translate a code point into a series of bytes. The de facto standard of these is UTF-8, which is attractive for a few reasons:
- Backwards compatibility with ASCII - Code points U+007F and below map directly to the 7-bit values used in ASCII, meaning Unicode systems can consume ASCII without issue.
- Frugality with storage - In contrast to the wider fixed-size units of UTF-16 and UTF-32, UTF-8 stores low code points in a single byte, and consumes additional bytes for larger code points only when needed.
- Byte-oriented - Unlike UTF-16 and UTF-32, UTF-8 is read one byte at a time, so it has no byte order and avoids conflicts between mixed-endian systems.
- Widespread use - Many modern systems already use UTF-8. For example, over 85% of sampled websites use UTF-8.
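The first two properties are easy to verify. A quick sketch (the euro sign is an arbitrary choice of a higher code point):

```python
# -*- coding: utf-8 -*-
# UTF-8's variable width in action: code points at or below U+007F
# encode to the single byte ASCII already assigns them; higher code
# points spill into extra bytes only as needed.
ascii_bytes = u'A'.encode('utf-8')   # 1 byte
latin_bytes = u'õ'.encode('utf-8')   # 2 bytes
euro_bytes = u'€'.encode('utf-8')    # 3 bytes
print(len(ascii_bytes))
print(len(latin_bytes))
print(len(euro_bytes))

# ASCII compatibility: pure-ASCII text encodes identically either way.
assert u'cat'.encode('utf-8') == u'cat'.encode('ascii')
```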
The Unicode Sandwich
In practice, you often have little control over the encoding used in the text your system ingests. What you can control, however, is how that text is decoded into Unicode objects inside your system. You also have control over how text is saved or transmitted by your system. In between, you can use a Unicode representation so the text can be manipulated without having to worry about encoding at all.
This idea is referred to as the Unicode Sandwich:
- Decode ingested text into Unicode early.
- Manipulate the Unicode text without concern for encoding.
- Store or transmit the manipulated text by encoding it just before it leaves
your system (preferably with UTF-8 encoding).
Here’s some sample Python 2 code. It:
- Takes in a raw pangram in Icelandic, encoded in the mac_iceland encoding.
- Decodes this string into a Unicode object for manipulation.
- Runs a regex over it, using a raw string pattern and a Unicode replacement string.
- Prints and outputs the string using UTF-8 encoding.
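The steps above might look like the following sketch. The particular pangram and the whitespace-collapsing regex are illustrative choices, not a reproduction of any specific snippet:

```python
# -*- coding: utf-8 -*-
import re

# 1. Raw bytes as they might arrive, encoded in mac_iceland.
#    (Extra whitespace added so the regex below has work to do.)
raw = u'Kæmi  ný  öxi  hér,  ykist  þjófum  nú  bæði  víl  og  ádrepa.'.encode('mac_iceland')

# 2. Decode into a Unicode object as soon as it enters the system.
text = raw.decode('mac_iceland')

# 3. Manipulate the Unicode text: a raw string pattern with a Unicode
#    replacement, collapsing runs of whitespace to single spaces.
munged = re.sub(r'\s+', u' ', text)

# 4. Encode just before output, preferably as UTF-8.
output = munged.encode('utf-8')
print(munged)
```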
Voilà! It’s that simple. Note that Python also has UTF-8 codecs for file writing, which is probably what should be used for anything beyond a basic example.
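A minimal sketch of that codecs-based file writing (the file path here is illustrative; tempfile just keeps the example self-contained):

```python
# -*- coding: utf-8 -*-
# codecs.open() returns a file object that encodes Unicode text on
# write and decodes it on read, so the sandwich holds at the file
# boundary too.
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'pangram.txt')

with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(u'Kæmi ný öxi hér.')

with codecs.open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()

print(round_tripped)
```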