Handling Problems with Raw Text in Python - dummies

By Nikhil Abraham

Even when raw text contains no special formatting, you still have to consider how the text is stored and whether it contains special words. The multiple encodings found on web pages can cause interpretation problems that you need to account for as you work through the text.

For example, the way the text is encoded can differ because of different operating systems, languages, and geographical areas. Be prepared to find a host of different encodings as you retrieve data from the web. Human language is complex, and the original ASCII coding, whose letters are limited to the unaccented English alphabet, can’t represent all the different alphabets. That’s why so many encodings with special characters appeared.

For example, a character can use either seven or eight bits for encoding purposes. The use of special characters can differ as well. In short, the interpretation of bits used to create characters differs from encoding to encoding.
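To make this concrete, here is a minimal sketch (written for Python 3) that interprets the same raw bytes under three encodings. The sample word and the variable names are illustrative choices, not part of the original article.

```python
# The same sequence of bytes, interpreted under different encodings.
raw = "café".encode("utf-8")       # five bytes: the accented é takes two

as_utf8 = raw.decode("utf-8")      # the intended text
as_latin1 = raw.decode("latin-1")  # each byte read as one 8-bit character

# 7-bit ASCII has no meaning for byte values above 127, so decoding fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(as_utf8, as_latin1, ascii_ok)
```

Reading the bytes as Latin-1 doesn't raise an error; it silently produces the wrong characters, which is why a wrongly guessed encoding can be hard to spot.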

Sometimes you need to work with encodings other than the default encoding set within the Python environment. When working with Python 3.x, the default string encoding is Unicode Transformation Format 8-bit (UTF-8). This default is fixed: the function that changes it no longer exists, so trying to call it causes an error message. (You can still read and write individual files in other encodings by passing the encoding argument to open().) However, when working with Python 2.x, you can choose other default encodings. In this case, the default encoding is the American Standard Code for Information Interchange (ASCII), but you can change it to some other encoding.
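The following short sketch shows what this looks like in Python 3: the default encoding is reported as utf-8, the setter function is absent, and a file can still be written and read in another encoding by naming it explicitly. The temporary file name and the choice of Latin-1 are assumptions made only for illustration.

```python
import os
import sys
import tempfile

# On Python 3 the default string encoding is fixed at UTF-8.
default_encoding = sys.getdefaultencoding()

# The Python 2 setter is not available on Python 3.
has_setter = hasattr(sys, "setdefaultencoding")

# Files can still use another encoding when you name it explicitly.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file
with open(path, "w", encoding="latin-1") as f:
    f.write("café")
with open(path, "r", encoding="latin-1") as f:
    text = f.read()

print(default_encoding, has_setter, text)
```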

You can use this technique in any Python script. It can save the day when your code fails because Python can’t encode a character. However, working at the IPython prompt is actually easier in this case. The following steps help you see how to deal with Unicode characters, but only when working with Python 2.x. (These steps are unnecessary and cause errors in the Python 3.x environment.)

  1. Open a copy of the IPython command prompt.

    You see the IPython window.

  2. Type the following code, pressing Enter after each line.

    import sys

    sys.getdefaultencoding()

    You see the default encoding for Python, which is ascii in Python 2.x (in Python 3.x, it’s utf-8 instead). If you really do want to work with Jupyter Notebook, create a new cell after this step.

  3. Type reload(sys) and press Enter.

    Python reloads the sys module, which makes the setdefaultencoding() function available again (Python 2.x removes it from sys at startup).

  4. Type sys.setdefaultencoding('utf-8') and press Enter.

    Python does change the encoding, but you won’t know that for certain until after the next step. If you really do want to work with Jupyter Notebook, create a new cell after this step.

  5. Type sys.getdefaultencoding() and press Enter.

    You see that the default encoding has now changed to utf-8.
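The steps above can also be collected into a single script. This is a sketch rather than a definitive recipe: the version check is an addition that isn't part of the original steps, included so that the Python 2.x-only calls are skipped under Python 3.x instead of raising an error.

```python
import sys

# Step 2: check the current default encoding.
before = sys.getdefaultencoding()

# Steps 3 and 4 apply only to Python 2.x; the guard is an added assumption
# so the script also runs unchanged on Python 3.x.
if sys.version_info[0] == 2:
    reload(sys)  # Python 2 builtin; restores sys.setdefaultencoding
    sys.setdefaultencoding("utf-8")

# Step 5: confirm the default encoding.
after = sys.getdefaultencoding()
print("before: %s, after: %s" % (before, after))
```

Under Python 2.x, the first value should be ascii and the second utf-8; under Python 3.x, both values are already utf-8.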

Changing the default encoding at the wrong time and in the incorrect way can prevent you from performing tasks such as importing modules. Make sure to test your code carefully and completely to ensure that any change in the default encoding won’t affect your ability to run the application.

Good additional articles to read on this topic appear at blog.notdot.net and webarchive.org.