
I'm reading an xls file and converting it to a csv file in Databricks using pyspark. My input data is the string 101101114501700 in the xls file, but after converting it to CSV format using pandas and writing to the datalake folder, my data shows as 101101114501700.0. My code is given below. Please help me understand why I am getting the decimal part in the data.

import os
import time

import pandas as pd

for file in os.listdir("/path/to/file"):
    if file.endswith(".xls"):
        filepath = os.path.join("/path/to/file", file)
        filepath_pd = pd.ExcelFile(filepath)
        names = filepath_pd.sheet_names
        # merge all sheets into one dataframe, then write it out as CSV
        df = pd.concat([filepath_pd.parse(name) for name in names])
        df.to_csv(os.path.join("/path/to/file", file.split('.')[0] + ".csv"),
                  sep=',', encoding='utf-8', index=False)
        print(time.strftime("%Y%m%d-%H%M%S") + ": XLS files converted to CSV and moved to folder")

2 Answers


I think the field is automatically parsed as a float when reading the Excel file. I would correct it afterwards:

df['column_name'] = df['column_name'].astype(int)

If your column contains nulls you can't convert it to integer, so you will need to fill the nulls first:

df['column_name'] = df['column_name'].fillna(0).astype(int)

Then you can concatenate and store the data the way you were doing it.
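
Put together in the context of the question's loop, a minimal sketch might look like this ('column_name' is a placeholder for the affected column):

# Sketch: cast the offending column before writing the CSV.
# 'column_name' is a placeholder; replace it with your real column.
df = pd.concat([filepath_pd.parse(name) for name in names])
df['column_name'] = df['column_name'].fillna(0).astype(int)
df.to_csv(os.path.join("/path/to/file", file.split('.')[0] + ".csv"),
          sep=',', encoding='utf-8', index=False)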


2 Comments

Thanks gigorio for your reply. But I need that column to be a string, because after converting to csv I'm doing substr on some columns. Can I have an alternate solution using read_excel?
df['column_name'] = df['column_name'].fillna(0).astype(int).astype(str)

Your question has nothing to do with Spark or PySpark. It's related to Pandas.

This is because Pandas interprets and infers column data types automatically. Since all the values of your column are numeric, Pandas will treat it as the float data type.
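
You can reproduce the effect without an Excel file (a minimal sketch; numeric Excel cells are handed to pandas as Python floats):

import pandas as pd

# Numeric cells come back as floats, so the value gains a '.0' in CSV output.
s = pd.Series([101101114501700.0])
print(s.iloc[0])                          # 101101114501700.0
print(s.astype(int).astype(str).iloc[0])  # '101101114501700'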

To avoid this, the pandas.ExcelFile.parse method accepts an argument called converters; you can use it to tell Pandas the data type of specific columns:

# if you want one specific column as string
df = pd.concat([filepath_pd.parse(name, converters={'column_name': str}) for name in names])

OR

# if you want all columns as string
# and you have multi sheets and they do not have same columns
# this merge all sheets into one dataframe
def get_converters(excel_file, sheet_name, dt_cols):
    cols = excel_file.parse(sheet_name).columns
    converters = {col: str for col in cols if col not in dt_cols}
    for col in dt_cols:
        converters[col] = pd.to_datetime
    return converters

df = pd.concat([
    filepath_pd.parse(name, converters=get_converters(filepath_pd, name, ['date_column']))
    for name in names
]).reset_index(drop=True)

OR

# if you want all columns as string
# and all your sheets have same columns
cols = filepath_pd.parse().columns
dt_cols = ['date_column']
converters = {col: str for col in cols if col not in dt_cols}
for col in dt_cols:
    converters[col] = pd.to_datetime
df = pd.concat([filepath_pd.parse(name, converters=converters) for name in names]).reset_index(drop=True)
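
As an aside, for the read_excel question raised in the comments: pandas.read_excel accepts the same converters argument, and with sheet_name=None it returns a dict of one DataFrame per sheet (a sketch; 'column_name' is a placeholder):

# Sketch using pandas.read_excel instead of ExcelFile.parse.
# sheet_name=None reads every sheet into a {sheet_name: DataFrame} dict.
sheets = pd.read_excel(filepath, sheet_name=None, converters={'column_name': str})
df = pd.concat(sheets.values()).reset_index(drop=True)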

15 Comments

Thanks Yuan for your reply. I have tried your solution, but apparently dtype is not supported with the Python engine. If you have any other solution, please share. Many thanks in advance.
Thanks Yuan, for one column it is working fine. If there are multiple columns, how can I make it generic?
I updated the answer; it unions all your sheets into one dataframe and converts all columns of all sheets into strings.
Thanks Yuan, it works well for number columns, but date data like 2019-03-19 comes out as 2019-03-19 00:00:00. Can you please let me know why this happens? And can we do the same thing using read_excel?
Your date string problem is because Pandas automatically converts date columns into the pandas.Timestamp type. When you apply converters (which apply str() to the values), the default string representation __str__() of pandas.Timestamp uses the format %Y-%m-%d %H:%M:%S, which is what shows up in your data.
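If you want only the date part back, one option is a converter that formats the timestamp explicitly instead of relying on str() (a sketch; assumes the values parse cleanly with pandas.to_datetime):

# Sketch: format date columns yourself rather than via str().
# 'date_column' is a placeholder for your real date column.
converters['date_column'] = lambda v: pd.to_datetime(v).strftime('%Y-%m-%d')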
