How to Read Large CSV Files with Pandas Without Knowing Their Column Data Types
TL;DR: read a sample of the data, infer the data types, and read the full data with the inferred data types.
Context
Reading large (and sometimes not-so-large) CSV files with Pandas is a common task. It is not uncommon to find CSV files with millions of rows and hundreds of columns whose column data types are unknown in advance.
Steps
1. Read a sample of the data
Use the nrows parameter

The first step is to read a sample of the data. This is done with the nrows parameter of the read_csv function. This helps you understand the data structure without loading the entire file into memory.
import pandas as pd

df = pd.read_csv('data.csv', nrows=1000)
The nrows parameter can be set to any number, but it is recommended to use a number that is large enough to give a representative idea of the data structure, yet small enough not to use too much memory.
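If you want to estimate whether the full file will fit in memory, one option is to measure the sample and extrapolate by row count. A minimal sketch, assuming the 1,000-row sample from above:

import pandas as pd

sample = pd.read_csv('data.csv', nrows=1000)

# deep=True counts the real size of Python objects such as strings
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
print(f"approx. {bytes_per_row:.0f} bytes per row")
# multiply by the total row count (if known) to estimate the full size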
Use the usecols parameter

You can also use less memory by reading only a subset of the columns with the usecols parameter, if you already know which columns you need.
import pandas as pd

df = pd.read_csv('data.csv', usecols=['col1', 'col2'])
Use the nrows and usecols parameters together

You can use both parameters together to read a sample of the data restricted to a subset of the columns.
import pandas as pd

df = pd.read_csv('data.csv', nrows=1000, usecols=['col1', 'col2'])
2. Inspect Data Types
Use the dtypes attribute

The next step is to inspect the data types of the columns. This is done with the dtypes attribute of the DataFrame.
print(df.dtypes)
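If the inferred types look correct, you can also turn them into a dictionary instead of writing one by hand, and reuse it in step 4. A minimal sketch; dtypes_dict is simply the name used in the later steps:

import pandas as pd

sample = pd.read_csv('data.csv', nrows=1000)

# map each column name to the name of its inferred dtype,
# e.g. {'col1': 'int64', 'col2': 'object'}
dtypes_dict = {col: dtype.name for col, dtype in sample.dtypes.items()}
print(dtypes_dict)

Keep in mind that types inferred from a sample can be wrong for the rest of the file (for example, a column that looks like integers may contain blanks further down), which is why the next step reviews them manually.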
3. Define data types manually
Based on the analysis of the data sample, you can define the data types of the columns manually.
dtypes_dict = {
    'col_str': str,              # String
    'col_int': int,              # Integer
    'col_float': float,          # Float
    'col_bool': bool,            # Boolean
    'col_category': 'category',  # Category
    'col_complex': complex,      # Complex number
    'col_bytes': bytes,          # Bytes
    'col_object': object,        # Generic Python object
}

# Note: datetime and timedelta columns cannot be declared through dtype;
# read_csv parses datetimes via the parse_dates parameter, and timedeltas
# are converted after loading (see step 4).
Here are the Pandas dtypes for reference.
Some of the not-so-common data types are:
- category: a pandas Categorical type. Useful for columns with a limited number of unique values (e.g. gender, country, city). Check Categorical data for more information.
- complex: a complex number. Check Complex Numbers for more information.
- bytes: a sequence of bytes. Check Bytes and Bytearray for more information.
- object: a generic Python object. Check Object for more information.
- datetime: a single object containing all the information from a date object and a time object. Check datetime for more information.
- timedelta: a duration, representing the difference between two dates or times. Check timedelta for more information.
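Of these, category is the one that most often pays off for large CSV files. A quick way to see the effect on your own sample; the column name col_category is just a placeholder:

import pandas as pd

sample = pd.read_csv('data.csv', nrows=1000)

# compare the memory footprint of a repetitive string column
# stored as object vs. as category
as_object = sample['col_category'].memory_usage(deep=True)
as_category = sample['col_category'].astype('category').memory_usage(deep=True)
print(f"object: {as_object} bytes, category: {as_category} bytes")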
4. Read the full data with the manually-defined data types
The next step is to load the data with the manually-defined data types from the earlier step.
For this, you can use the dtype parameter of the read_csv function.
import pandas as pd

df = pd.read_csv('data.csv', dtype=dtypes_dict)
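Because read_csv does not accept datetime or timedelta entries in dtype, those columns are handled separately: datetimes through the parse_dates parameter, timedeltas by converting after loading. A sketch using the placeholder column names from step 3:

import pandas as pd

df = pd.read_csv(
    'data.csv',
    dtype=dtypes_dict,
    parse_dates=['col_datetime'],  # parsed to datetime64 while reading
)

# timedeltas cannot be parsed by read_csv; convert them afterwards
df['col_timedelta'] = pd.to_timedelta(df['col_timedelta'])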
Important

If you get a ValueError, it means that the data types you defined are not compatible with all the data in the file, and you should check the data types again using more rows or other chunks of the data. For that you can also use the skiprows parameter.
import pandas as pd

# skip rows 1-1000 but keep the header row, then read the next 1,000 rows
df = pd.read_csv('data.csv', skiprows=range(1, 1001), nrows=1000)
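skiprows also accepts a callable, which lets you pull a random sample from across the whole file instead of one contiguous block; this makes an unrepresentative sample less likely. A sketch; the 1% sampling rate is arbitrary:

import random

import pandas as pd

# keep the header (row 0) and roughly 1% of the remaining rows
df = pd.read_csv(
    'data.csv',
    skiprows=lambda i: i > 0 and random.random() > 0.01,
)
print(df.dtypes)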
5. Use the chunksize parameter if the data is too large to fit in memory

If using the options above is not enough to load the data into memory, you can use the chunksize parameter, which returns an iterator that can be used to read the data in chunks.
import pandas as pd

df_iterator = pd.read_csv('data.csv', chunksize=1000)
for df in df_iterator:
    # do something with df
    ...
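In practice, chunksize is usually combined with the dtype mapping from step 3, reducing each chunk as it arrives so that only the result stays in memory. A sketch of that pattern; the filter on col_int is just an illustration:

import pandas as pd

results = []
for chunk in pd.read_csv('data.csv', dtype=dtypes_dict, chunksize=100_000):
    # reduce each chunk immediately, e.g. keep only the rows of interest
    results.append(chunk[chunk['col_int'] > 0])

# concatenate only the (much smaller) filtered pieces
df = pd.concat(results, ignore_index=True)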