I got the following error when trying to read a file off a network drive for use by Pandas:
UnicodeDecodeError                        Traceback (most recent call last)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 136: invalid start byte

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-17-cda0979be675> in <module>
    203     print(f)
    204 # Read in the classified import file:
--> 205 readAndParse("I:\_classUpload\internetWImages.txt")

<ipython-input-17-cda0979be675> in readAndParse(sourceFile)
    165 def readAndParse(sourceFile):
    166 
--> 167     source_pd = pd.read_csv(sourceFile, sep='|', names = ['class_code', 'ad_text', 'images'], header=None)
    168     dest_file = open(CFG.exportFileAndPath, "w")
    169 
My code for reading the file is:
source_pd = pd.read_csv(sourceFile, sep='|', names = ['class_code', 'ad_text', 'images'], header=None)
Possible Fix:
I've never had an issue reading this type of file with Pandas before, so I was surprised to find that I needed to specify an encoding to read it properly. You can usually see what encoding a file is being opened with using this Python code:
with open('I:\_classUpload\internetWImages.txt') as f:
    print(f)
In my case, the result was:
<_io.TextIOWrapper name='I:\\_classUpload\\internetWImages.txt' mode='r' encoding='cp1252'>
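One caveat worth knowing: when open() is called without an encoding argument, the encoding shown is Python's platform default (cp1252 on most Windows setups), not something detected from the file itself, so treat it as a starting guess. If you want an independent guess, the third-party chardet package can inspect the raw bytes; here is a minimal sketch, assuming chardet is installed and using my file path:

import chardet  # third-party package, assumed installed (pip install chardet)

# Read a chunk of raw bytes and let chardet guess the encoding.
with open(r'I:\_classUpload\internetWImages.txt', 'rb') as f:
    sample = f.read(100000)

print(chardet.detect(sample))  # prints a dict with 'encoding' and 'confidence' keys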
With the encoding identified, the fix is simple: add encoding='cp1252' to the Pandas read_csv call, like this:
source_pd = pd.read_csv(sourceFile, sep='|', names = ['class_code', 'ad_text', 'images'], header=None, encoding='cp1252')
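If you only need the file to load and can tolerate a few substituted characters, newer versions of Pandas (1.3 and later) also accept an encoding_errors argument; this is a sketch of that option, assuming your Pandas is recent enough:

# Assumes pandas >= 1.3; undecodable bytes become replacement characters instead of raising.
source_pd = pd.read_csv(sourceFile, sep='|', names=['class_code', 'ad_text', 'images'], header=None, encoding='cp1252', encoding_errors='replace')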
Update: Additional Possible Fix:
I ran into additional UnicodeDecodeErrors while working on the code above, and found that the encoding Python reports isn't always the file's actual encoding. A search turned up a post where "mancia14" proposed iterating over a list of encodings and attempting to read the CSV file with each one. It's a useful debugging approach:
# Code copied from https://stackoverflow.com/questions/25530142/list-of-pandas-read-csv-encoding-list
import pandas as pd

codecs = ['ascii','big5','big5hkscs','cp037','cp273','cp424','cp437','cp500','cp720','cp737','cp775','cp850','cp852','cp855',
          'cp856','cp857','cp858','cp860','cp861','cp862','cp863','cp864','cp865','cp866','cp869','cp874','cp875','cp932','cp949',
          'cp950','cp1006','cp1026','cp1125','cp1140','cp1250','cp1251','cp1252','cp1253','cp1254','cp1255','cp1256','cp1257','cp1258',
          'euc_jp','euc_jis_2004','euc_jisx0213','euc_kr','gb2312','gbk','gb18030','hz','iso2022_jp','iso2022_jp_1','iso2022_jp_2',
          'iso2022_jp_2004','iso2022_jp_3','iso2022_jp_ext','iso2022_kr','latin_1','iso8859_2','iso8859_3','iso8859_4','iso8859_5','iso8859_6',
          'iso8859_7','iso8859_8','iso8859_9','iso8859_10','iso8859_11','iso8859_13','iso8859_14','iso8859_15','iso8859_16','johab','koi8_r','koi8_t',
          'koi8_u','kz1048','mac_cyrillic','mac_greek','mac_iceland','mac_latin2','mac_roman','mac_turkish','ptcp154','shift_jis','shift_jis_2004',
          'shift_jisx0213','utf_32','utf_32_be','utf_32_le','utf_16','utf_16_be','utf_16_le','utf_7','utf_8','utf_8_sig']

for x in range(len(codecs)):
    print(x, ': Now checking use of:', codecs[x])
    try:
        df = pd.read_csv('*your_csv_file*.csv', header=0, encoding=codecs[x], sep=';')
        print(df.info())
        print(input('Press any key...'))
    except:
        print('I can\'t load data for', codecs[x], '\n')
        print(input('Press any key...'))
For my purposes, I replaced the pd.read_csv() call with a call to my own function that reads and processes the CSV file. That way each encoding was tested against the entire file, so I could catch errors that only showed up during processing after the file was opened. Every encoding produced errors except 'latin_1'.
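If you want to do something similar, the same loop can wrap your own routine instead of a bare pd.read_csv; this is a rough sketch, where read_and_process is a hypothetical stand-in for whatever reads and transforms your file end to end (in my case, readAndParse):

import pandas as pd

def read_and_process(path, encoding):
    # Hypothetical placeholder for your own end-to-end routine: read the file,
    # then do whatever downstream processing might also trip over bad characters.
    df = pd.read_csv(path, sep='|', names=['class_code', 'ad_text', 'images'], header=None, encoding=encoding)
    return df

# Try each candidate encoding against a full read-and-process pass.
candidates = ['utf_8', 'cp1252', 'latin_1']
for codec in candidates:
    try:
        read_and_process(r'I:\_classUpload\internetWImages.txt', encoding=codec)
        print('Succeeded with', codec)
        break
    except (UnicodeDecodeError, UnicodeError) as err:
        print('Failed with', codec, '-', err)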