Selected Reading

Python Pandas - Working with CSV Format



Working with the CSV format is a common task in data analysis and data science. CSV (Comma-Separated Values) files are widely used to store tabular data because they are lightweight, human-readable, and supported by almost all data analysis tools and programming languages.

The Python Pandas library is a powerful tool for working with data, it offers extensive functionality for reading, processing, and writing data in CSV format. With Pandas, you can easily handle complex operations like filtering, grouping, and manipulating data in CSV files.

A CSV file is a plain text file where data values are separated by commas, representing tabular data in plain text format. A CSV file has a .csv extension. Below you can see how the data present in the CSV file looks like −

Sr.no,Name,Gender,Age
1,Braund,male,22
2,Cumings,female,38
3,Heikkinen,female,26
4,Futrelle,female,35

In this tutorial, we will learn how to work with CSV files using Pandas, including reading CSV files into DataFrames, understanding alternative reading methods, and handling large datasets, to exporting data back to CSV.

Reading a CSV File in Pandas

The pandas.read_csv() function is used to read the CSV format file into the Pandas DataFrame or TextFileReader. This function accepts CSV data from a URL or a local file path to load the data into the Pandas environment.

Example

The following example demonstrates how to read CSV data using the pandas.read_csv() function. Here we are using the StringIO to load the CSV string into a file-like object.

import pandas as pd

# Import StringIO to load a file-like object for reading CSV
from io import StringIO

# Create string representing CSV data
data = """Name,Gender,Age
Braund,male,22
Cumings,female,38
Heikkinen,female,26
Futrelle,female,35"""

# Use StringIO to convert the string data into a file-like object
obj = StringIO(data)

# read CSV into a Pandas DataFrame
df = pd.read_csv(obj)

print(df)

Output

Following is the output of the above code −

        Name  Gender  Age
0     Braund    male   22
1    Cumings  female   38
2  Heikkinen  female   26
3   Futrelle  female   35

Writing Data to a CSV File

Pandas provides a method called to_csv() to create or write CSV file using the Pandas data structures, such as DataFrame or Series objects. This function allows you to export your data to a CSV format.

Example

Here is an example demonstrating how to write a Pandas DataFrame to a CSV file using the DataFrame.to_csv() method.

import pandas as pd

# dictionary of lists
d = {'Car': ['BMW', 'Lexus', 'Audi', 'Mercedes', 'Jaguar', 'Bentley'],
'Date_of_purchase': ['2024-10-10', '2024-10-12', '2024-10-17', '2024-10-16', '2024-10-19', '2024-10-22']}

# creating dataframe from the above dictionary of lists
dataFrame = pd.DataFrame(d)
print("Original DataFrame:\n",dataFrame)

# write dataFrame to SalesRecords CSV file
dataFrame.to_csv("Output_written_CSV_File.csv")

# display the contents of the output csv
print("The output csv file written successfully...")

Output

Following is the output of the above code −

Original DataFrame:
         Car Date_of_purchase
0       BMW       2024-10-10
1     Lexus       2024-10-12
2      Audi       2024-10-17
3  Mercedes       2024-10-16
4    Jaguar       2024-10-19
5   Bentley       2024-10-22
The output csv file written successfully...

If you visit your working directory after executing the above code, you can see the created CSV file named Output_written_CSV_File.csv.

Handling Large CSV Files in Pandas

When working with large CSV files, loading the entire file may cause memory issues. Pandas provides option like chunksize in pandas.read_csv() function to process such files efficiently in smaller chunks.

Example

Below is an example that initially creates a large CSV file using the DataFrame.to_csv() method with random integers and then processes it in chunks using the Pandas read_csv() function.

import pandas as pd
import numpy as np

# Generate a DataFrame with random integers
data = np.random.randint(0, 100, size=(1000, 5))
column_names = [f"Col_{i}" for i in range(1, 5 + 1)]

# Create a DataFrame and save it as a CSV file
large_csv_file = "large_file.csv"
df = pd.DataFrame(data, columns=column_names)
df.to_csv(large_csv_file, index=False)
print(f"Large CSV file is created successfully.\n")

# Read large CSV file in chunks
chunk_size = 200

print("Output CSV data in chunks:")
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
   print('Data in chunks:') 
   print(chunk.head(2))

Output

While executing the above code we obtain the following output −

Large CSV file is created successfully.

Output CSV data in chunks:
Data in chunks:
   Col_1  Col_2  Col_3  Col_4  Col_5
0     56     62     66     83     25
1     49     72     93      6     55
Data in chunks:
     Col_1  Col_2  Col_3  Col_4  Col_5
200     59     50      4     15      4
201     53     38     14     66     52
Data in chunks:
     Col_1  Col_2  Col_3  Col_4  Col_5
400     99     67     69     69     63
401     65     53     70     38     58
Data in chunks:
     Col_1  Col_2  Col_3  Col_4  Col_5
600     90     43     79     29      1
601     31     96     77     75     47
Data in chunks:
     Col_1  Col_2  Col_3  Col_4  Col_5
800      1     94     46     14     31
801     17     75     55      5      6

Alternatives to Reading CSV Files

In addition to the pandas.read_csv() function, Pandas provides an alternative method for reading CSV data using pandas.read_table() function.

The pandas.read_table() function is used to read general delimited files such as CSV, TSV, or other delimiter-separated formats into a Pandas DataFrame. It is a good alternative for loading CSV files, and it easily handles various delimiters using the sep parameter. Additionally, this function supports iterating or breaking of the file into chunks.

Example

This example shows an alternative way to load CSV data into the Pandas DataFrame using the pd.read_table() function. Here you need to specify the delimiter with the sep parameter to read comma-separated values (CSV).

import pandas as pd

url ="https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master/blood_pressure.csv"

# read CSV into a Pandas DataFrame using the read_table() function
df = pd.read_table(url,sep=',')

print(df.head(5))

Output

Following is the output of the above code −

   
patient   sex agegrp  bp_before  bp_after
0        1  Male  30-45        143       153
1        2  Male  30-45        163       170
2        3  Male  30-45        153       168
3        4  Male  30-45        153       142
4        5  Male  30-45        146       141
python_pandas_io_tool.htm
Advertisements