I am new to data science. I want to apply preprocessing to my dataset in Jupyter Notebook. Here is what I have done so far:

import pandas as pd
import numpy as np
from sklearn import preprocessing

country = pd.read_csv('data.csv', encoding='utf_8')

But it gives me this error:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-19-80e6ff7ff11c> in <module>()
----> 1 country = pd.read_csv('data.csv', encoding='utf_8')

/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    707                     skip_blank_lines=skip_blank_lines)
    708 
--> 709         return _read(filepath_or_buffer, kwds)
    710 
    711     parser_f.__name__ = name

/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    453 
    454     try:
--> 455         data = parser.read(nrows)
    456     finally:
    457         parser.close()

/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1067                 raise ValueError('skipfooter not supported for iteration')
   1068 
-> 1069         ret = self._engine.read(nrows)
   1070 
   1071         if self.options.get('as_recarray'):

/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1837     def read(self, nrows=None):
   1838         try:
-> 1839             data = self._reader.read(nrows)
   1840         except StopIteration:
   1841             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 63

I have also tried other encodings, such as latin1 and iso-8859-1, among others.

Link to CSV

  • Can you post your sample CSV? Commented Apr 8, 2018 at 11:15
  • @RehanAzher I just updated the question. Please check. Commented Apr 8, 2018 at 11:23
  • Per the link updated by @Jezrael, your CSV has missing data in some rows. awesomescreenshot.com/image/3283728/… Commented Apr 8, 2018 at 11:25

1 Answer

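The first 4 lines of the file are not part of the table (the parser expected 3 fields and then found 63 in line 5, which is the real header row), so skip them with the skiprows parameter of read_csv: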

df = pd.read_csv('data.csv', skiprows=4)
print (df.head())

  Country Name Country Code     Indicator Name Indicator Code       1960  \
0        Aruba          ABW  Population, total    SP.POP.TOTL    54211.0   
1  Afghanistan          AFG  Population, total    SP.POP.TOTL  8996351.0   
2       Angola          AGO  Population, total    SP.POP.TOTL  5643182.0   
3      Albania          ALB  Population, total    SP.POP.TOTL  1608800.0   
4      Andorra          AND  Population, total    SP.POP.TOTL    13411.0   

        1961       1962       1963       1964       1965     ...       \
0    55438.0    56225.0    56695.0    57032.0    57360.0     ...        
1  9166764.0  9345868.0  9533954.0  9731361.0  9938414.0     ...        
2  5753024.0  5866061.0  5980417.0  6093321.0  6203299.0     ...        
3  1659800.0  1711319.0  1762621.0  1814135.0  1864791.0     ...        
4    14375.0    15370.0    16412.0    17469.0    18549.0     ...        

         2009        2010        2011        2012        2013        2014  \
0    101453.0    101669.0    102053.0    102577.0    103187.0    103795.0   
1  28004331.0  28803167.0  29708599.0  30696958.0  31731688.0  32758020.0   
2  22549547.0  23369131.0  24218565.0  25096150.0  25998340.0  26920466.0   
3   2927519.0   2913021.0   2905195.0   2900401.0   2895092.0   2889104.0   
4     84462.0     84449.0     83751.0     82431.0     80788.0     79223.0   

         2015        2016  2017  Unnamed: 62  
0    104341.0    104822.0   NaN          NaN  
1  33736494.0  34656032.0   NaN          NaN  
2  27859305.0  28813463.0   NaN          NaN  
3   2880703.0   2876101.0   NaN          NaN  
4     78014.0     77281.0   NaN          NaN  

[5 rows x 63 columns]
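
If you are not sure how many lines to skip, it can help to print the first few lines of the file with their line numbers before calling read_csv. The following is only a minimal sketch: the `raw` string is a made-up miniature that stands in for data.csv (its metadata lines are assumptions, not the real file contents), and `io.StringIO` is used so the example runs without the file.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: four metadata lines, then header and data.
raw = "\n".join([
    "Data Source,World Development Indicators",
    "Last Updated Date,2018-03-01",
    "Note,first metadata line",
    "Note,second metadata line",
    "Country Name,Country Code,1960,1961",
    "Aruba,ABW,54211,55438",
    "Angola,AGO,5643182,5753024",
])

# Print the first lines with their numbers to see where the table begins.
for number, line in enumerate(raw.splitlines()[:6], start=1):
    print(number, line)

# Skip the four metadata lines so that line 5 is used as the header row.
df = pd.read_csv(io.StringIO(raw), skiprows=4)
print(df)
```

With a real file, the same check can be done by opening data.csv and printing its first lines; the number of lines before the header is the value to pass to skiprows.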

To also remove the columns that contain only NaN values, add dropna:

print (df.dropna(how='all', axis=1).head())
  Country Name Country Code     Indicator Name Indicator Code       1960  \
0        Aruba          ABW  Population, total    SP.POP.TOTL    54211.0   
1  Afghanistan          AFG  Population, total    SP.POP.TOTL  8996351.0   
2       Angola          AGO  Population, total    SP.POP.TOTL  5643182.0   
3      Albania          ALB  Population, total    SP.POP.TOTL  1608800.0   
4      Andorra          AND  Population, total    SP.POP.TOTL    13411.0   

        1961       1962       1963       1964       1965     ...      \
0    55438.0    56225.0    56695.0    57032.0    57360.0     ...       
1  9166764.0  9345868.0  9533954.0  9731361.0  9938414.0     ...       
2  5753024.0  5866061.0  5980417.0  6093321.0  6203299.0     ...       
3  1659800.0  1711319.0  1762621.0  1814135.0  1864791.0     ...       
4    14375.0    15370.0    16412.0    17469.0    18549.0     ...       

         2007        2008        2009        2010        2011        2012  \
0    101220.0    101353.0    101453.0    101669.0    102053.0    102577.0   
1  26616792.0  27294031.0  28004331.0  28803167.0  29708599.0  30696958.0   
2  20997687.0  21759420.0  22549547.0  23369131.0  24218565.0  25096150.0   
3   2970017.0   2947314.0   2927519.0   2913021.0   2905195.0   2900401.0   
4     82683.0     83861.0     84462.0     84449.0     83751.0     82431.0   

         2013        2014        2015        2016  
0    103187.0    103795.0    104341.0    104822.0  
1  31731688.0  32758020.0  33736494.0  34656032.0  
2  25998340.0  26920466.0  27859305.0  28813463.0  
3   2895092.0   2889104.0   2880703.0   2876101.0  
4     80788.0     79223.0     78014.0     77281.0  

[5 rows x 61 columns]
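
Note that how='all' only drops columns in which every value is NaN; a column with some missing values is kept. A small sketch on a hand-built frame (the values are copied from the output above, except the NaN in the "2015" column, which is made up for illustration):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the NaN in "2015" is invented to show a partly empty column.
df = pd.DataFrame({
    "Country Name": ["Aruba", "Angola"],
    "2015": [104341.0, np.nan],       # partly missing: kept by how='all'
    "2016": [104822.0, 28813463.0],
    "2017": [np.nan, np.nan],         # entirely missing: dropped
    "Unnamed: 62": [np.nan, np.nan],  # entirely missing: dropped
})

# axis=1 works on columns; how='all' requires every value in the column to be NaN.
cleaned = df.dropna(how="all", axis=1)
print(cleaned.columns.tolist())
```

dropna returns a new DataFrame, so assign the result back (as done with `cleaned` here) if you want to keep working with the reduced frame.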
2 Comments

@AminMemariani - You are welcome! Feel free to upvote the answer too. Thanks.
@AminMemariani - The answer was edited to show how to remove the all-NaN columns.
