Data Science: Dealing with Unicode in Python

By John Paul Mueller, Luca Massaron

Text files are pure text, and this much is certain to data scientists using Python. What can differ is the way the text is encoded. For example, a character can use either seven or eight bits for encoding purposes. The use of special characters can differ as well. In short, the interpretation of the bits used to create characters differs from encoding to encoding, and a host of encodings exists.
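
To make the point about differing bit patterns concrete, here is a minimal sketch, assuming Python 3.x. The sample word is only an illustration. It encodes the same text with two encodings and then decodes the bytes with the wrong one.

```python
text = "caf\u00e9"  # "café", written with an escape so the source stays ASCII

utf8_bytes = text.encode("utf-8")      # é becomes two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # é becomes one byte: 0xE9

print(utf8_bytes)
print(latin1_bytes)

# Decoding with the wrong encoding raises no error in this case;
# it quietly produces different characters instead.
garbled = utf8_bytes.decode("latin-1")
print(garbled == text)
```

The same four characters produce byte strings of different lengths, which is why a reader has to know the encoding that the writer used.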

Sometimes you need to work with encodings other than the default encoding set within the Python environment. When working with Python 3.x, the default string encoding is Unicode Transformation Format 8-bit (UTF-8). This default is always UTF-8: the sys.setdefaultencoding() function doesn’t exist in Python 3.x, so trying to change the default causes an error message. You can still read and write files in other encodings by passing the encoding argument to open().
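
As a rough illustration for Python 3.x, the following sketch checks the default encoding and then writes and reads a file with an explicitly chosen encoding. The file name sample_latin1.txt and the sample text are hypothetical, and the file is created in a temporary directory.

```python
import os
import sys
import tempfile

print(sys.getdefaultencoding())            # default used for str/bytes conversions
print(hasattr(sys, "setdefaultencoding"))  # the Python 2.x setter is absent here

text = "na\u00efve caf\u00e9"  # "naïve café"

# Hypothetical file name, placed in a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "sample_latin1.txt")

# The encoding argument selects the encoding per file, independent of the default.
with open(path, "w", encoding="latin-1") as handle:
    handle.write(text)

with open(path, "r", encoding="latin-1") as handle:
    round_trip = handle.read()

# Reading in binary mode shows the bytes that actually landed on disk.
with open(path, "rb") as handle:
    raw = handle.read()

print(round_trip == text)
print(raw)
```

Choosing the encoding per file in this way leaves the interpreter's default alone.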

However, when working with Python 2.x, you can choose other encodings. In this case, the default encoding is the American Standard Code for Information Interchange (ASCII), but you can change it to some other encoding.
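
The ASCII default matters because ASCII covers only 128 characters. The sketch below, written for Python 3.x, performs the conversion explicitly to show what goes wrong when bytes outside that range are decoded as ASCII. The helper name try_decode is hypothetical.

```python
raw = b"caf\xc3\xa9"  # UTF-8 bytes for "café"

def try_decode(data, encoding):
    """Return the decoded text, or None when the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

ascii_result = try_decode(raw, "ascii")  # fails: 0xC3 is outside the ASCII range
utf8_result = try_decode(raw, "utf-8")   # succeeds

print(ascii_result)
print(utf8_result)
```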

You can use this technique in any IPython Notebook file, but you won’t actually see output from it. In order to see output, you need to work with the IPython prompt. The following steps help you see how to deal with Unicode characters, but only when working with Python 2.x (these steps will cause errors in the Python 3.x environment).

  1. Open a copy of the IPython command prompt.

    You see the IPython window.

  2. Type the following code, pressing Enter after each line.

    import sys
    sys.getdefaultencoding()

    You see the default encoding for Python, which is ascii in most cases.

  3. Type reload(sys) and press Enter.

    Python reloads the sys module and makes the setdefaultencoding() function available again. (Python removes this function from sys during startup.)

  4. Type sys.setdefaultencoding('utf-8') and press Enter.

    Python does change the encoding, but you won’t know that for certain until after the next step.

  5. Type sys.getdefaultencoding() and press Enter.

    You see that the default encoding has now changed to utf-8.
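
The steps above can be condensed into one guarded sketch. The function name change_default_encoding is hypothetical, and the version check is an addition so that the same code can be pasted into either environment. On Python 3.x it skips the change and only reports the existing default.

```python
import sys

def change_default_encoding(encoding="utf-8"):
    """Return (before, after) default encodings.

    The change is attempted only on Python 2.x; on Python 3.x the
    default is left untouched because the setter does not exist.
    """
    before = sys.getdefaultencoding()
    if sys.version_info[0] == 2:
        # reload() is a builtin only in Python 2.x; reloading sys restores
        # setdefaultencoding(), which Python removes during startup.
        reload(sys)
        sys.setdefaultencoding(encoding)
    after = sys.getdefaultencoding()
    return before, after

before, after = change_default_encoding("utf-8")
print(before, after)
```

On Python 2.x, reloading sys can have side effects inside interactive shells, so treat this as an experiment rather than something to put into a library.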

Changing the default encoding at the wrong time and in the incorrect way can prevent you from performing tasks such as importing modules. Make sure to test your code carefully and completely to ensure that any change in the default encoding won’t affect your ability to run the application. Good additional articles to read on this topic appear at blog.notdot.net and web.archive.org.