I got the following error when trying to read a file off a network drive for use by Pandas:
UnicodeDecodeError                        Traceback (most recent call last)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 136: invalid start byte

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-17-cda0979be675> in <module>
    203     print(f)
    204 # Read in the classified import file:
--> 205 readAndParse("I:\_classUpload\internetWImages.txt")

<ipython-input-17-cda0979be675> in readAndParse(sourceFile)
    165 def readAndParse(sourceFile):
    166 
--> 167     source_pd = pd.read_csv(sourceFile, sep='|', names = ['class_code', 'ad_text', 'images'], header=None)
    168     dest_file = open(CFG.exportFileAndPath, "w")
    169 
My code for reading the file is:
source_pd = pd.read_csv(sourceFile, sep='|', names = ['class_code', 'ad_text', 'images'], header=None)
Possible Fix:
I've never had an issue reading this type of file with Pandas before, so I was surprised to find that I needed to specify an encoding to read it properly. You can usually see what encoding a file is being opened with using this Python code:
with open('I:\_classUpload\internetWImages.txt') as f:
    print(f)
In my case, the result was:
<_io.TextIOWrapper name='I:\\_classUpload\\internetWImages.txt' mode='r' encoding='cp1252'>
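One caveat worth knowing: when open() is called without an encoding argument, the encoding shown is Python's platform default (cp1252 on most Windows setups), not something detected from the file itself, so treat it as a starting guess. If you want an independent guess, the third-party chardet package can inspect the raw bytes; here is a minimal sketch, assuming chardet is installed and using my file path:

import chardet  # third-party package, assumed installed (pip install chardet)

# Read a chunk of raw bytes and let chardet guess the encoding.
with open(r'I:\_classUpload\internetWImages.txt', 'rb') as f:
    sample = f.read(100000)

print(chardet.detect(sample))  # prints a dict with 'encoding' and 'confidence' keys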
With the encoding identified, the fix is simple: add encoding='cp1252' to the Pandas read_csv call, like this:
source_pd = pd.read_csv(sourceFile, sep='|', names = ['class_code', 'ad_text', 'images'], header=None, encoding='cp1252')
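If you only need the file to load and can tolerate a few substituted characters, newer versions of Pandas (1.3 and later) also accept an encoding_errors argument; this is a sketch of that option, assuming your Pandas is recent enough:

# Assumes pandas >= 1.3; undecodable bytes become replacement characters instead of raising.
source_pd = pd.read_csv(sourceFile, sep='|', names=['class_code', 'ad_text', 'images'], header=None, encoding='cp1252', encoding_errors='replace')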
Update: Additional Possible Fix:
I ran into additional UnicodeDecodeErrors while working on the code above, and found that the encoding Python reports isn't always the file's actual encoding. A search turned up a post where "mancia14" proposed iterating over a list of encodings and attempting to read the CSV file with each one. It's a useful debugging approach:
# Code copied from https://stackoverflow.com/questions/25530142/list-of-pandas-read-csv-encoding-list
import pandas as pd

codecs = ['ascii','big5','big5hkscs','cp037','cp273','cp424','cp437','cp500','cp720','cp737','cp775','cp850','cp852','cp855',
          'cp856','cp857','cp858','cp860','cp861','cp862','cp863','cp864','cp865','cp866','cp869','cp874','cp875','cp932','cp949',
          'cp950','cp1006','cp1026','cp1125','cp1140','cp1250','cp1251','cp1252','cp1253','cp1254','cp1255','cp1256','cp1257','cp1258',
          'euc_jp','euc_jis_2004','euc_jisx0213','euc_kr','gb2312','gbk','gb18030','hz','iso2022_jp','iso2022_jp_1','iso2022_jp_2',
          'iso2022_jp_2004','iso2022_jp_3','iso2022_jp_ext','iso2022_kr','latin_1','iso8859_2','iso8859_3','iso8859_4','iso8859_5','iso8859_6',
          'iso8859_7','iso8859_8','iso8859_9','iso8859_10','iso8859_11','iso8859_13','iso8859_14','iso8859_15','iso8859_16','johab','koi8_r','koi8_t',
          'koi8_u','kz1048','mac_cyrillic','mac_greek','mac_iceland','mac_latin2','mac_roman','mac_turkish','ptcp154','shift_jis','shift_jis_2004',
          'shift_jisx0213','utf_32','utf_32_be','utf_32_le','utf_16','utf_16_be','utf_16_le','utf_7','utf_8','utf_8_sig']

for x in range(len(codecs)):
    print(x, ': Now checking use of:', codecs[x])
    try:
        df = pd.read_csv('*your_csv_file*.csv', header=0, encoding=codecs[x], sep=';')
        print(df.info())
        print(input('Press any key...'))
    except:
        print('I can\'t load data for', codecs[x], '\n')
        print(input('Press any key...'))
For my purposes, I replaced the pd.read_csv() call with a call to my own function that reads and processes the CSV file. That way each encoding was tested against the entire file, so I could catch errors that only showed up during processing after the file was opened. Every encoding produced errors except 'latin_1'.
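If you want to do something similar, the same loop can wrap your own routine instead of a bare pd.read_csv; this is a rough sketch, where read_and_process is a hypothetical stand-in for whatever reads and transforms your file end to end (in my case, readAndParse):

import pandas as pd

def read_and_process(path, encoding):
    # Hypothetical placeholder for your own end-to-end routine: read the file,
    # then do whatever downstream processing might also trip over bad characters.
    df = pd.read_csv(path, sep='|', names=['class_code', 'ad_text', 'images'], header=None, encoding=encoding)
    return df

# Try each candidate encoding against a full read-and-process pass.
candidates = ['utf_8', 'cp1252', 'latin_1']
for codec in candidates:
    try:
        read_and_process(r'I:\_classUpload\internetWImages.txt', encoding=codec)
        print('Succeeded with', codec)
        break
    except (UnicodeDecodeError, UnicodeError) as err:
        print('Failed with', codec, '-', err)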