How to Read Large CSV Files with Pandas Without Knowing Their Column Data Types

TLDR: read a sample of the data, infer the data types, and read the full data with the inferred data types.

Context

Reading large (and sometimes not-so-large) CSV files with Pandas is a common task. Files with millions of rows and hundreds of columns are not unusual, and the data types of the columns are often unknown up front.

Steps

1. Read a sample of the data

Use the nrows parameter

The first step is to read a sample of the data. This is done with the nrows parameter of the read_csv function.

This helps in understanding the data structure without loading the entire file into memory.

import pandas as pd

df = pd.read_csv('data.csv', nrows=1000)

The nrows parameter can be set to any number; pick one large enough to give a representative picture of the data, but small enough to load quickly without using too much memory.
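
One way to check whether your sample size is large enough is to read two samples of different sizes and compare the inferred types. A quick sketch (the file name and sample sizes are illustrative):

import pandas as pd

small = pd.read_csv('data.csv', nrows=1000)
large = pd.read_csv('data.csv', nrows=10000)

# Columns whose inferred dtype changes with the sample size
# are the ones to watch out for.
for col in small.columns:
    if small[col].dtype != large[col].dtype:
        print(col, small[col].dtype, '->', large[col].dtype)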

Use the usecols parameter

You can also save memory by reading only a subset of the columns with the usecols parameter, if you already know which columns you need.

import pandas as pd

df = pd.read_csv('data.csv', usecols=['col1', 'col2'])
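
usecols also accepts a callable that is evaluated against each column name, which is handy when you know a naming pattern but not the exact names. A small sketch (the prefix is illustrative):

import pandas as pd

# Keep only the columns whose names start with 'col'.
df = pd.read_csv('data.csv', nrows=1000,
                 usecols=lambda name: name.startswith('col'))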

Use the nrows and usecols parameters together

You can use both parameters together to read a sample of the data and only a subset of the columns.

import pandas as pd

df = pd.read_csv('data.csv', nrows=1000, usecols=['col1', 'col2'])

2. Inspect Data Types

Use the dtypes attribute

The next step is to inspect the data types of the columns. This is done with the dtypes attribute of the dataframe.

print(df.dtypes)
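
If the inferred types look right, you can also turn them directly into a dict of dtype names as a starting point for the manual mapping in the next step. A small sketch (it assumes the df sample from step 1):

# Map each column to the name of its inferred dtype,
# e.g. {'col1': 'int64', 'col2': 'object'}.
dtypes_dict = {col: str(dtype) for col, dtype in df.dtypes.items()}
print(dtypes_dict)

Keep in mind that this mapping is only as good as the sample: a column that looks numeric in the first 1000 rows may contain strings later in the file.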

3. Define data types manually

Based on the analysis of the data sample, you can define the data types of the columns manually.

dtypes_dict = {
    'col_str': str,                  # String
    'col_int': int,                  # Integer
    'col_float': float,              # Float
    'col_bool': bool,                # Boolean
    'col_category': 'category',      # Category
    'col_complex': complex,          # Complex number
    'col_bytes': bytes,              # Bytes
    'col_object': object             # Generic Python object
}

# Datetime and timedelta columns cannot be declared via dtype;
# read_csv raises an error for them. List datetime columns in
# parse_dates instead, and convert timedeltas after reading (step 4).
parse_dates_cols = ['col_datetime']

See the Pandas dtypes documentation for the full list of supported types.

Some of the not-so-common data types are:

  • category: A pandas Categorical type. Useful for columns with a limited number of unique values (e.g. gender, country, city). Check Categorical data for more information; a quick memory comparison follows this list.
  • complex: A complex number. Check Complex Numbers for more information.
  • bytes: A sequence of bytes. Check Bytes and Bytearray for more information.
  • object: A generic Python object. Check Object for more information.
  • datetime: A single object combining all the information from a date and a time. read_csv does not accept datetimes through the dtype parameter; list such columns in parse_dates instead. Check datetime for more information.
  • timedelta: A duration, the difference between two dates or times. read_csv cannot parse timedeltas directly; read the column as a string and convert it with pd.to_timedelta afterwards. Check timedelta for more information.
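
The memory savings from category can be substantial for low-cardinality columns. A quick way to see the difference, using a synthetic column (the values are illustrative):

import pandas as pd

s = pd.Series(['US', 'DE', 'FR'] * 100_000)
print(s.memory_usage(deep=True))                     # stored as object
print(s.astype('category').memory_usage(deep=True))  # stored as category

With only three unique values, the categorical version typically needs an order of magnitude less memory, since each row stores a small integer code instead of a full Python string.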

4. Read the full data with the manually-defined data types

The next step is to load the data with the manually-defined data types from the earlier step.

For this, you can use the dtype parameter of the read_csv function.

import pandas as pd

df = pd.read_csv('data.csv', dtype=dtypes_dict, parse_dates=parse_dates_cols)
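
Timedelta columns still arrive as plain strings at this point. A minimal sketch for converting one afterwards, assuming an illustrative column named col_timedelta holding strings such as '1 days 02:30:00':

# Convert a timedelta-like string column after reading.
df['col_timedelta'] = pd.to_timedelta(df['col_timedelta'])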

Important

If you get a ValueError, the data types you defined are not compatible with all the data in the file. Re-check them against more rows, or against other parts of the file; the skiprows parameter helps with the latter.

import pandas as pd

# Skip data rows 1-1000 but keep row 0 (the header),
# then read the next 1000 rows.
df = pd.read_csv('data.csv', skiprows=range(1, 1001), nrows=1000)
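
To validate the dtypes against the entire file without loading it all at once, you can also stream it in chunks and let the first incompatible chunk raise the error. A sketch, assuming the dtypes_dict from step 3:

import pandas as pd

try:
    for chunk in pd.read_csv('data.csv', dtype=dtypes_dict, chunksize=100_000):
        pass  # the cast happens while parsing, so iterating is enough
except ValueError as err:
    print('Incompatible dtype:', err)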

5. Use the chunksize parameter if the data is too large to fit in memory

If the options above are still not enough to fit the data in memory, use the chunksize parameter; read_csv then returns an iterator that yields the file one chunk at a time.

import pandas as pd

df_iterator = pd.read_csv('data.csv', chunksize=1000)

for chunk in df_iterator:
    # each chunk is an ordinary DataFrame with up to 1000 rows
    print(chunk.shape)
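
chunksize combines naturally with the dtype mapping from step 3. A common pattern, sketched below with an illustrative filter on col_int, is to reduce each chunk and concatenate only the much smaller results:

import pandas as pd

parts = []
for chunk in pd.read_csv('data.csv', dtype=dtypes_dict, chunksize=100_000):
    parts.append(chunk[chunk['col_int'] > 0])  # keep only the rows you need

result = pd.concat(parts, ignore_index=True)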

