Selected Reading

Python Pandas - Quick Guide



Python Pandas - Introduction

Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from the term "Panel Data", used in econometrics for multidimensional data.

In 2008, developer Wes McKinney started developing pandas when he needed a high-performance, flexible tool for data analysis.

Prior to Pandas, Python was mostly used for data munging and preparation; it contributed very little to data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish the five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.

Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.

Key Features of Pandas

  • Fast and efficient DataFrame object with default and customized indexing.
  • Tools for loading data into in-memory data objects from different file formats.
  • Data alignment and integrated handling of missing data.
  • Reshaping and pivoting of data sets.
  • Label-based slicing, indexing and subsetting of large data sets.
  • Columns from a data structure can be deleted or inserted.
  • Group by functionality for data aggregation and transformations.
  • High performance merging and joining of data.
  • Time Series functionality.

Python Pandas - Environment Setup

Setting up an environment to use the Pandas library is straightforward, and there are multiple ways to achieve this. Whether you prefer using Anaconda, Miniconda, or pip, you can easily get Pandas up and running on your system. This tutorial will guide you through the different methods to install Pandas.

Installing Pandas with pip

The most common way to install Pandas is by using pip, the Python package manager that lets you install modules and packages. This method is suitable if you already have Python installed on your system. Note that the standard Python distribution does not come bundled with the Pandas module.

To install the pandas package using pip, open the command prompt on your system (assuming your machine runs a Windows operating system) and run the following command −

pip3 install pandas

This command will download and install the Pandas package along with its dependencies. (If you install the Anaconda Python distribution instead, Pandas is included by default.)

Upgrading pip (if necessary)

If you encounter any errors regarding the pip version, you can upgrade pip using the following command −

python -m pip install --upgrade pip

Then, rerun the Pandas installation command.

Installing a Specific Version of Pandas

If you need a specific version of Pandas, you can specify it using the following command −

pip3 install pandas==2.3.3

Every time you try to install a package, pip first checks whether the package's dependencies are already installed on the system; if not, it installs them. Once all dependencies are satisfied, it proceeds to install the requested package(s).
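Once pip finishes, you can confirm that the package is importable from Python. A minimal check (the version string printed depends on the version pip installed):

```python
# Confirm that pandas is installed and importable
import pandas as pd

# Print the installed version, e.g. "2.3.3"
print(pd.__version__)
```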

Installing Pandas Using Anaconda

Anaconda is a popular distribution for data science that includes Python and many scientific libraries, including Pandas.

Following are the steps to install Anaconda −

  • Download Anaconda: Go to Anaconda's official website and download the installer suitable for your operating system.
  • Install Anaconda: Follow the installation instructions provided on the Anaconda website.

Pandas comes pre-installed with Anaconda, so you can directly import it in your Python environment.

import pandas as pd

Installing a Specific Version of Pandas with Anaconda

If you need a specific version of Pandas, you can install it using the conda command −

conda install pandas=2.3.3

Anaconda takes up several gigabytes of disk space (and more for air-gapped deployments) because it comes with the most common data science packages in Python, such as NumPy, Pandas, and many more.

Installing Pandas Using Miniconda

Both Anaconda and Miniconda use the conda package installer, but Anaconda occupies more system storage because it ships with more than 100 packages that are installed automatically.

Miniconda is a minimal installer for conda, which includes only the conda package manager and Python. It is lightweight compared to Anaconda and is suitable if you want more control over the packages you install.

Following are the steps to install Miniconda −

  • Download Miniconda: Visit the Miniconda download page and download the installer for your operating system.
  • Install Miniconda: Follow the installation instructions provided on the Miniconda website.

Installing Pandas with Miniconda

After successfully installing Miniconda, you can use the conda command to install Pandas −

conda install pandas

Installing Pandas on Linux

On Linux, you can use the package manager of your respective distribution to install Pandas and other scientific libraries.

For Ubuntu Users

sudo apt-get install python-numpy python-scipy python-matplotlib ipython
ipython-notebook python-pandas python-sympy python-nose

For Fedora Users

sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy
python-nose atlas-devel

By following any of these methods, you can set up Pandas on your system and start using it for data analysis and manipulation.

Python Pandas - Introduction to Data Structures

Python Pandas Data Structures

Data structures in Pandas are designed to handle data efficiently. They allow for the organization, storage, and modification of data in a way that optimizes memory usage and computational performance. Python Pandas library provides two primary data structures for handling and analyzing data −

  • Series
  • DataFrame

In general programming, the term "data structure" refers to the method of collecting, organizing, and storing data to enable efficient access and modification. Data structures are collections of data types that provide the best way of organizing items (values) in terms of memory usage.

Pandas is built on top of NumPy and integrates well within a scientific computing environment with many other third-party libraries. This tutorial will provide a detailed introduction to these data structures.

Dimension and Description of Pandas Data Structures

Data Structure Dimensions Description
Series 1 A one-dimensional labeled, homogeneous array; size-immutable.
DataFrame 2 A two-dimensional labeled, size-mutable tabular structure with potentially heterogeneously typed columns.

Working with two or more dimensional arrays can be complex and time-consuming, as users need to carefully consider the data's orientation when writing functions. However, Pandas simplifies this process by reducing the mental effort required. For example, when dealing with tabular data (DataFrame), it is easier to think in terms of rows and columns rather than axis 0 and axis 1.

Mutability of Pandas Data Structures

All Pandas data structures are value mutable, meaning their contents can be changed. However, their size mutability varies −

  • Series − Size immutable.
  • DataFrame − Size mutable.
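This distinction can be demonstrated in a short sketch; the column names and sample values below are arbitrary:

```python
import pandas as pd

# A Series and a DataFrame with arbitrary sample values
s = pd.Series([10, 20, 30])
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Values are mutable in both structures
s.iloc[0] = 99           # replace a value in the Series
df.loc[0, 'A'] = 100     # replace a value in the DataFrame

# A DataFrame is size mutable: columns can be added or removed in place
df['C'] = [5, 6]
del df['B']

print(s.iloc[0])              # 99
print(df.columns.tolist())    # ['A', 'C']
```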

Series

A Series is a one-dimensional labeled array that can hold any data type. It can store integers, strings, floating-point numbers, etc. Each value in a Series is associated with a label (index), which can be an integer or a string.

Name Steve
Age 35
Gender Male
Rating 3.5

Example

Consider the following Series, which is a collection of different data types −

import pandas as pd

data = ['Steve', '35', 'Male', '3.5']
series = pd.Series(data, index=['Name', 'Age', 'Gender', 'Rating'])
print(series)

On executing the above program, you will get the following output

Name      Steve
Age          35
Gender     Male
Rating      3.5
dtype: object

Key Points

Following are the key points related to the Pandas Series.

  • Homogeneous data
  • Size Immutable
  • Values of Data Mutable

DataFrame

A DataFrame is a two-dimensional labeled data structure with columns that can hold different data types. It is similar to a table in a database or a spreadsheet. Consider the following data representing the performance rating of a sales team −

Name Age Gender Rating
Steve 32 Male 3.45
Lia 28 Female 4.6
Vin 45 Male 3.9
Katie 38 Female 2.78

Example

The above tabular data can be represented in a DataFrame as follows −

import pandas as pd

# Data represented as a dictionary
data = {
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78]
}

# Creating the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

Output

On executing the above code you will get the following output −

    Name  Age  Gender  Rating
0  Steve   32    Male    3.45
1    Lia   28  Female    4.60
2    Vin   45    Male    3.90
3  Katie   38  Female    2.78

Key Points

Following are the key points related to the Pandas DataFrame −

  • Heterogeneous data
  • Size Mutable
  • Data Mutable

Purpose of Using More Than One Data Structure

Pandas data structures are flexible containers for lower-dimensional data. For instance, a DataFrame is a container for Series, and a Series is a container for scalars. This flexibility allows for efficient data manipulation and storage.

Building and handling multi-dimensional arrays can be tedious and requires careful consideration of the data's orientation when writing functions. Pandas reduces this mental effort by providing intuitive data structures.

Example

The following example accesses a Series within a DataFrame.

import pandas as pd

# Data represented as a dictionary
data = {
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78]
}

# Creating the DataFrame
df = pd.DataFrame(data)

# Display a Series within a DataFrame
print(df['Name'])

Output

On executing the above code you will get the following output −

0    Steve
1      Lia
2      Vin
3    Katie
Name: Name, dtype: object

Python Pandas - Index Objects

In Pandas, Index objects play an important role in organizing and accessing data in a structured way. They work like labeled arrays, defining how data is arranged and accessed in structures like Series and DataFrames. The Index enables quick data searches and efficient slicing, and keeps data properly aligned, while giving each row a meaningful label.

An Index is used to label the rows of a DataFrame or the elements in a Series. These labels can be numbers, strings, or dates, and they help you identify the data. One key thing to remember about Pandas indexes is that they are immutable, meaning you cannot change them once created.
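The immutability of an Index can be verified directly; attempting to assign to one of its elements raises a TypeError:

```python
import pandas as pd

idx = pd.Index([10, 20, 30])

# An Index rejects in-place element assignment
try:
    idx[0] = 99
except TypeError as err:
    print("TypeError:", err)
```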

In this tutorial, we will learn about Pandas Index Objects, and various types of indexes in pandas.

The Index Class

The Index class is a basic object for storing all index types in Pandas objects. It provides the basic functionality for accessing and manipulating data.

Key Features of Index Object

  • Immutable: An Index object is an immutable sequence; it cannot be modified once it is created.

  • Alignment: Index ensures that data from different DataFrames or Series can be combined correctly, based on the index values.

  • Slicing: Index allows fast slicing and retrieval of data based on labels.

Syntax

Following is the syntax of the Index class −

class pandas.Index(data=None, dtype=None, copy=False, name=None, tupleize_cols=True)

Where,

  • data: The data for the index, which can be an array-like structure (like a list or numpy array) or another index object.

  • dtype: Specifies the data type of the index values. If not provided, Pandas infers the data type from the values.

  • copy: A boolean parameter (True or False) specifying whether to create a copy of the input data.

  • name: This parameter gives a label to the index.

  • tupleize_cols: Also a boolean parameter (True or False). When True, pandas tries to create a MultiIndex if possible.
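As a quick illustration of these parameters, the following sketch builds an Index from a list with an explicit dtype and name (the label row_id is arbitrary):

```python
import pandas as pd

# Build an Index with an explicit dtype and a name label
idx = pd.Index([1, 2, 3, 4], dtype='int64', name='row_id')

print(idx.name)    # row_id
print(idx.dtype)   # int64
```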

Types of Indexes in Pandas

Pandas provides various types of indexes to handle different kinds of data, including NumericIndex, CategoricalIndex, IntervalIndex, MultiIndex, DatetimeIndex, TimedeltaIndex, and PeriodIndex.

Let's discuss each of these index types.

NumericIndex

A NumericIndex is the basic index type in Pandas; it contains numerical values. It is the default index: Pandas automatically assigns one if you do not provide any index.

Example

The following example demonstrates how pandas automatically assigns a numeric index to a pandas DataFrame object.

import pandas as pd

# Generate some data for DataFrame
data = {
   'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
   'Age': [32, 28, 45, 38],
   'Gender': ['Male', 'Female', 'Male', 'Female'],
   'Rating': [3.45, 4.6, 3.9, 2.78]
}
# Creating the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

print("\nDataFrame Index Object Type:",df.index.dtype)

Output

Following is the output of the above code −

    Name  Age  Gender  Rating
0  Steve   32    Male    3.45
1    Lia   28  Female    4.60
2    Vin   45    Male    3.90
3  Katie   38  Female    2.78

DataFrame Index Object Type: int64

Categorical Index

The CategoricalIndex is useful for dealing with duplicate labels. This index is memory-efficient when handling a large number of duplicate elements.

Example

The following example creates a Pandas DataFrame with a CategoricalIndex.

import pandas as pd

# Creating a CategoricalIndex
categories = pd.CategoricalIndex(['a','b', 'a', 'c'])
df = pd.DataFrame({'Col1': [50, 70, 90, 60], 'Col2':[1, 3, 5, 8]}, index=categories)
print("Input DataFrame:\n",df)

print("\nDataFrame Index Object Type:",df.index.dtype)

Output

Following is the output of the above code −

Input DataFrame:
    Col1  Col2
a    50     1
b    70     3
a    90     5
c    60     8

DataFrame Index Object Type: category

IntervalIndex

An IntervalIndex is used to represent intervals (ranges) in your data. This type of index can be created using the interval_range() method.

Example

The following example creates a DataFrame with an IntervalIndex using the interval_range() method.

import pandas as pd

# Creating an IntervalIndex
interval_idx = pd.interval_range(start=0, end=4)

# Creating a DataFrame with IntervalIndex
df = pd.DataFrame({'Col1': [1, 2, 3, 4], 'Col2':[1, 3, 5, 8]}, index=interval_idx)

print("Input DataFrame:\n",df)

print("\nDataFrame Index Object Type:",df.index.dtype)

Output

Following is the output of the above code −

Input DataFrame:
         Col1  Col2
(0, 1]     1     1
(1, 2]     2     3
(2, 3]     3     5
(3, 4]     4     8

DataFrame Index Object Type: interval[int64, right]

MultiIndex

Pandas MultiIndex is used to represent multiple levels (layers) in the index of a Pandas data structure; it is also called a hierarchical index.

Example

The following example shows the creation of a simple MultiIndexed DataFrame.

import pandas as pd

# Create MultiIndex
arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
multi_idx = pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))

# Create a DataFrame with MultiIndex
df = pd.DataFrame({'Col1': [1, 2, 3, 4], 'Col2':[1, 3, 5, 8]}, index=multi_idx)

print("MultiIndexed DataFrame:\n",df)

Output

Following is the output of the above code −

MultiIndexed DataFrame:
               Col1  Col2
number color            
1      red       1     1
       blue      2     3
2      red       3     5
       blue      4     8

DatetimeIndex

The Pandas DatetimeIndex object is used to represent date and time values. It is used for time-series data, where each row is linked to a specific timestamp.

Example

The following example creates a Pandas DataFrame with a DatetimeIndex.

import pandas as pd

# Create DatetimeIndex
datetime_idx = pd.DatetimeIndex(["2020-01-01 10:00:00", "2020-02-01 11:00:00"])

# Create a DataFrame with DatetimeIndex
df = pd.DataFrame({'Col1': [1, 2], 'Col2':[1, 3]}, index=datetime_idx )

print("DatetimeIndexed DataFrame:\n",df)

Output

Following is the output of the above code −

DatetimeIndexed DataFrame:
                      Col1  Col2
2020-01-01 10:00:00     1     1
2020-02-01 11:00:00     2     3

TimedeltaIndex

Pandas TimedeltaIndex is used to represent durations between two dates or times, such as the number of days or hours between events.

Example

This example creates a Pandas DataFrame with a TimedeltaIndex.

import pandas as pd

# Create TimedeltaIndex
timedelta_idx = pd.TimedeltaIndex(['0 days', '1 days', '2 days'])

# Create a DataFrame with TimedeltaIndex
df = pd.DataFrame({'Col1': [1, 2, 3], 'Col2':[1, 3, 3]}, index=timedelta_idx )

print("TimedeltaIndexed DataFrame:\n",df)

Output

Following is the output of the above code −

TimedeltaIndexed DataFrame:
         Col1  Col2
0 days     1     1
1 days     2     3
2 days     3     3

PeriodIndex

Pandas PeriodIndex is used to represent regular periods in time, like quarters, months, or years.

Example

This example creates a Pandas DataFrame with a PeriodIndex object.

import pandas as pd

# Create PeriodIndex
period_idx = pd.PeriodIndex(year=[2020, 2024], quarter=[1, 3])

# Create a DataFrame with PeriodIndex
df = pd.DataFrame({'Col1': [1, 2], 'Col2':[1, 3]}, index=period_idx )

print("PeriodIndexed DataFrame:\n",df)

Output

Following is the output of the above code −

PeriodIndexed DataFrame:
         Col1  Col2
2020Q1     1     1
2024Q3     2     3

Python Pandas - Panel

A panel is a 3D container of data. The term Panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s.

The Panel class is deprecated and has been removed in recent versions of pandas. The recommended way to represent 3-D data is with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. pandas provides a to_xarray() method to automate this conversion.

The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data. They are −

  • items: axis 0, each item corresponds to a DataFrame contained inside.

  • major_axis: axis 1, it is the index (rows) of each of the DataFrames.

  • minor_axis: axis 2, it is the columns of each of the DataFrames.
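Since Panel is gone from current pandas, the same three-axis layout can be recovered with a MultiIndexed DataFrame, as noted above. A minimal sketch using pd.concat (the item names and random data are arbitrary):

```python
import pandas as pd
import numpy as np

# Two "items", each a DataFrame (the old Panel's axis 0)
data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 3))}

# Stack them into one DataFrame; the dict keys become the outer
# level of a MultiIndex, playing the role of the Panel's items axis
df = pd.concat(data)
df.index.names = ['item', 'major']   # label the two levels

print(df.index.nlevels)       # 2: items plus the old major_axis
print(df.loc['Item1'].shape)  # (4, 3): one item's DataFrame back
```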

pandas.Panel()

A Panel can be created using the following constructor −

pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

The parameters of the constructor are as follows −

Parameter Description
data Data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame
items axis=0
major_axis axis=1
minor_axis axis=2
dtype Data type of each column
copy Copy data. Default, false

Create Panel

A Panel can be created in multiple ways, such as −

  • From ndarrays
  • From dict of DataFrames

From 3D ndarray

# creating a panel from a 3D ndarray
import pandas as pd
import numpy as np

data = np.random.rand(2,4,5)
p = pd.Panel(data)
print(p)

Its output is as follows −

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4

Note: Compare the dimensions of this panel with those of the empty panel created later; all the objects are different.

From dict of DataFrame Objects

# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np

data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)), 
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)

Its output is as follows −

Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

Create an Empty Panel

An empty panel can be created using the Panel constructor as follows −

#creating an empty panel
import pandas as pd
p = pd.Panel()
print(p)

Its output is as follows −

<class 'pandas.core.panel.Panel'>
Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
Items axis: None
Major_axis axis: None
Minor_axis axis: None

Selecting the Data from Panel

Select the data from the panel using −

  • Items
  • Major_axis
  • Minor_axis

Using Items

# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)), 
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])

Its output is as follows −

            0          1          2
0    0.488224  -0.128637   0.930817
1    0.417497   0.896681   0.576657
2   -2.775266   0.571668   0.290082
3   -0.400538  -0.144234   1.110535

We have two items, and we retrieved item1. The result is a DataFrame with 4 rows and 3 columns, which are the Major_axis and Minor_axis dimensions.

Using major_axis

Data can be accessed using the method panel.major_xs(index).

# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)), 
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.major_xs(1))

Its output is as follows −

      Item1       Item2
0   0.417497    0.748412
1   0.896681   -0.557322
2   0.576657       NaN

Using minor_axis

Data can be accessed using the method panel.minor_xs(index).

# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)), 
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.minor_xs(1))

Its output is as follows −

       Item1       Item2
0   -0.128637   -1.047032
1    0.896681   -0.557322
2    0.571668    0.431953
3   -0.144234    1.302466

Note: Observe the changes in the dimensions.

Python Pandas - Basic Functionality

Pandas is a powerful data manipulation library in Python, providing essential tools to work with data in both Series and DataFrame formats. These two data structures are crucial for handling and analyzing large datasets.

Understanding the basic functionality of Pandas, including its attributes and methods, is essential for effectively managing data. These attributes and methods provide valuable insights into your data, making it easier to understand and process. In this tutorial, you will learn about the basic attributes and methods in Pandas that are crucial for working with these data structures.

Working with Attributes in Pandas

Attributes in Pandas let you access metadata about your Series and DataFrame objects. Using these attributes, you can explore and easily understand your data.

Series and DataFrame Attributes

Following are the widely used attributes of both Series and DataFrame objects −

Sr.No. Attribute & Description
1 dtype: Returns the data type of the elements in the Series or DataFrame.
2 index: Provides the index (row labels) of the Series or DataFrame.
3 values: Returns the data in the Series or DataFrame as a NumPy array.
4 shape: Returns a tuple representing the dimensionality of the DataFrame (rows, columns).
5 ndim: Returns the number of dimensions of the object. A Series is always 1D, and a DataFrame is 2D.
6 size: Gives the total number of elements in the object.
7 empty: Checks whether the object is empty, and returns True if it is.
8 columns: Provides the column labels of the DataFrame object.

Example

Let's create a Pandas Series and explore these attributes in action.

import pandas as pd
import numpy as np

# Create a Series with random numbers
s = pd.Series(np.random.randn(4))

# Exploring attributes
print("Data type of Series:", s.dtype)
print("Index of Series:", s.index)
print("Values of Series:", s.values)
print("Shape of Series:", s.shape)
print("Number of dimensions of Series:", s.ndim)
print("Size of Series:", s.size)
print("Is Series empty?:", s.empty)

Its output is as follows −

Data type of Series: float64
Index of Series: RangeIndex(start=0, stop=4, step=1)
Values of Series: [-1.02016329  1.40840089  1.36293022  1.33091391]
Shape of Series: (4,)
Number of dimensions of Series: 1
Size of Series: 4
Is Series empty?: False

Example

Let's look at the example below to understand how these attributes work on a DataFrame object.

import pandas as pd
import numpy as np

# Create a DataFrame with random numbers
df = pd.DataFrame(np.random.randn(3, 4), columns=list('ABCD'))
print("DataFrame:")
print(df)

print("Results:")
print("Data types:", df.dtypes)
print("Index:", df.index)
print("Columns:", df.columns)
print("Values:")
print(df.values)
print("Shape:", df.shape)
print("Number of dimensions:", df.ndim)
print("Size:", df.size)
print("Is empty:", df.empty)

On executing the above code you will get the following output −

DataFrame:
          A         B         C         D
0  2.161209 -1.671807 -1.020421 -0.287065
1  0.308136 -0.592368 -0.183193  1.354921
2 -0.963498 -1.768054 -0.395023 -2.454112

Results:
Data types: 
A    float64
B    float64
C    float64
D    float64
dtype: object
Index: RangeIndex(start=0, stop=3, step=1)
Columns: Index(['A', 'B', 'C', 'D'], dtype='object')
Values:
[[ 2.16120893 -1.67180742 -1.02042138 -0.28706468]
 [ 0.30813618 -0.59236786 -0.18319262  1.35492058]
 [-0.96349817 -1.76805364 -0.3950226  -2.45411245]]
Shape: (3, 4)
Number of dimensions: 2
Size: 12
Is empty: False

Exploring Basic Methods in Pandas

Pandas offers several basic methods for both data structures that make it easy to quickly inspect and understand your data. These methods help you get a summary and explore the details without much effort.

Series and DataFrame Methods

Sr.No. Method & Description
1 head(n): Returns the first n rows of the object. The default value of n is 5.
2 tail(n): Returns the last n rows of the object. The default value of n is 5.
3 info(): Provides a concise summary of a DataFrame, including the index dtype and column dtypes, non-null values, and memory usage.
4 describe(): Generates descriptive statistics of the DataFrame or Series, such as count, mean, std, min, and max.

Example

Let us now create a Series and see how its basic methods work.

import pandas as pd
import numpy as np

# Create a Series with random numbers
s = pd.Series(np.random.randn(10))

print("Series:")
print(s)

# Using basic methods
print("First 5 elements of the Series:\n", s.head())
print("\nLast 3 elements of the Series:\n", s.tail(3))
print("\nDescriptive statistics of the Series:\n", s.describe())

Its output is as follows −

Series:
0   -0.295898
1   -0.786081
2   -1.189834
3   -0.410830
4   -0.997866
5    0.084868
6    0.736541
7    0.133949
8    1.023674
9    0.669520
dtype: float64
First 5 elements of the Series:
 0   -0.295898
1   -0.786081
2   -1.189834
3   -0.410830
4   -0.997866
dtype: float64

Last 3 elements of the Series:
 7    0.133949
8    1.023674
9    0.669520
dtype: float64

Descriptive statistics of the Series:
 count    10.000000
mean     -0.103196
std       0.763254
min      -1.189834
25%      -0.692268
50%      -0.105515
75%       0.535627
max       1.023674
dtype: float64

Example

Now look at the example below to understand how the basic methods work on a DataFrame object.

import pandas as pd
import numpy as np

#Create a Dictionary of series
data = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]), 
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
 
#Create a DataFrame
df = pd.DataFrame(data)
print("Our data frame is:\n")
print(df)

# Using basic methods
print("\nFirst 5 rows of the DataFrame:\n", df.head())
print("\nLast 3 rows of the DataFrame:\n", df.tail(3))
print("\nInfo of the DataFrame:")
df.info()
print("\nDescriptive statistics of the DataFrame:\n", df.describe())

On executing the above code you will get the following output −

Our data frame is:

    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Smith   29    4.60
6   Jack   23    3.80

First 5 rows of the DataFrame:
     Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20

Last 3 rows of the DataFrame:
     Name  Age  Rating
4  Steve   30     3.2
5  Smith   29     4.6
6   Jack   23     3.8

Info of the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      object 
 1   Age     7 non-null      int64  
 2   Rating  7 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 296.0+ bytes

Descriptive statistics of the DataFrame:
              Age    Rating
count   7.000000  7.000000
mean   25.857143  3.658571
std     2.734262  0.698628
min    23.000000  2.560000
25%    24.000000  3.220000
50%    25.000000  3.800000
75%    27.500000  4.105000
max    30.000000  4.600000

Python Pandas - Indexing and Selecting Data

In pandas, indexing and selecting data are crucial for efficiently working with data in Series and DataFrame objects. These operations help you to slice, dice, and access subsets of your data easily.

These operations involve retrieving specific parts of your data structure, whether it's a Series or DataFrame. This process is crucial for data analysis as it allows you to focus on relevant data, apply transformations, and perform calculations.

Indexing in pandas is essential because it provides metadata that helps with analysis, visualization, and interactive display. It automatically aligns data for easier manipulation and simplifies the process of getting and setting data subsets.

This tutorial will explore various methods to slice, dice, and manipulate data using Pandas, helping you understand how to access and modify subsets of your data.

Types of Indexing in Pandas

Similar to Python and NumPy indexing ([ ]) and attribute (.) operators, Pandas provides straightforward methods for accessing data within its data structures. However, because the data type being accessed can be unpredictable, relying exclusively on these standard operators may lead to optimization challenges.

Pandas provides several methods for indexing and selecting data, such as −

  • Label-Based Indexing with .loc

  • Integer Position-Based Indexing with .iloc

  • Indexing with Brackets []

Label-Based Indexing with .loc

The .loc indexer is used for label-based indexing, which means you can access rows and columns by their labels. It also supports boolean arrays for conditional selection.

The .loc indexer supports multiple access methods, such as −

  • Single scalar label: Selects a single row or column, e.g., df.loc['a'].

  • List of labels: Selects multiple rows or columns, e.g., df.loc[['a', 'b']].

  • Label slicing: Uses slices with labels, e.g., df.loc['a':'f'] (both the start and end labels are included).

  • Boolean arrays: Filters data based on conditions, e.g., df.loc[boolean_array].

.loc takes two single/list/range operators separated by ','. The first indicates the rows and the second indicates the columns.

Example 1

Here is a basic example that selects all rows for a specific column using the loc indexer.

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

print("Original DataFrame:\n", df)

#select all rows for a specific column
print('\nResult:\n',df.loc[:,'A'])

Its output is as follows −

Original DataFrame:
           A         B         C         D
a  0.962766 -0.195444  1.729083 -0.701897
b -0.552681  0.797465 -1.635212 -0.624931
c  0.581866 -0.404623 -2.124927 -0.190193
d -0.284274  0.019995 -0.589465  0.914940
e  0.697209 -0.629572 -0.347832  0.272185
f -0.181442 -0.000983  2.889981  0.104957
g  1.195847 -1.358104  0.110449 -0.341744
h -0.121682  0.744557  0.083820  0.355442

Result:
 a    0.962766
b   -0.552681
c    0.581866
d   -0.284274
e    0.697209
f   -0.181442
g    1.195847
h   -0.121682
Name: A, dtype: float64

Note: The output generated will vary with each execution because the DataFrame is created using NumPy's random number generator.

Example 2

This example selects all rows for multiple columns.

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select all rows for multiple columns, say list[]
print(df.loc[:,['A','C']])

Its output is as follows −

            A           C
a    0.391548    0.745623
b   -0.070649    1.620406
c   -0.317212    1.448365
d   -2.162406   -0.873557
e    2.202797    0.528067
f    0.613709    0.286414
g    1.050559    0.216526
h    1.122680   -1.621420

Example 3

This example selects specific rows for specific columns.

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select few rows for multiple columns, say list[]
print(df.loc[['a','b','f','h'],['A','C']])

Its output is as follows −

           A          C
a   0.391548   0.745623
b  -0.070649   1.620406
f   0.613709   0.286414
h   1.122680  -1.621420

Example 4

The following example selects a range of rows for all columns using the loc indexer.

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select range of rows for all columns
print(df.loc['c':'e'])

Its output is as follows −

          A         B         C         D
c  0.044589  1.966278  0.894157  1.798397
d  0.451744  0.233724 -0.412644 -2.185069
e -0.865967 -1.090676 -0.931936  0.214358
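
Example 5

The list above also mentions boolean arrays. The following sketch filters rows with .loc using a condition; hypothetical fixed values are used instead of random numbers so the output is reproducible.

```python
import pandas as pd

# Fixed (hypothetical) values so the output is reproducible
df = pd.DataFrame({'A': [1, -2, 3, -4], 'B': [10, 20, 30, 40]},
                  index=['a', 'b', 'c', 'd'])

# Boolean array: keep only the rows where column A is positive
result = df.loc[df['A'] > 0]
print(result)
```

Its output is as follows −

   A   B
a  1  10
c  3  30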

Integer Position-Based Indexing with .iloc

The .iloc indexer is used for integer-based indexing, which allows you to select rows and columns by their numerical position. This method is similar to standard Python and NumPy indexing (i.e., 0-based indexing).

  • Single Integer: Selects data by its position, e.g., df.iloc[0].

  • List of Integers: Select multiple rows or columns by their positions, e.g., df.iloc[[0, 1, 2]].

  • Integer Slicing: Use slices with integers, e.g., df.iloc[1:3].

  • Boolean Arrays: Similar to .loc, but for positions.

Example 1

Here is a basic example that selects the first 4 rows for all columns using the iloc indexer.

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

print("Original DataFrame:\n", df)

# select the first four rows for all columns
print('\nResult:\n',df.iloc[:4])

Its output is as follows −

Original DataFrame:
           A         B         C         D
0 -1.152267  2.206954 -0.603874  1.275639
1 -0.799114 -0.214075  0.283186  0.030256
2 -1.823776  1.109537  1.512704  0.831070
3 -0.788280  0.961695 -0.127322 -0.597121
4  0.764930 -1.310503  0.108259 -0.600038
5 -1.683649 -0.602324 -1.175043 -0.343795
6  0.323984 -2.314158  0.098935  0.065528
7  0.109998 -0.259021 -0.429467  0.224148

Result:
           A         B         C         D
0 -1.152267  2.206954 -0.603874  1.275639
1 -0.799114 -0.214075  0.283186  0.030256
2 -1.823776  1.109537  1.512704  0.831070
3 -0.788280  0.961695 -0.127322 -0.597121

Example 2

The following example selects specific data using integer slicing.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Integer slicing
print(df.iloc[:4])
print(df.iloc[1:5, 2:4])

Its output is as follows −

           A          B           C           D
0   0.699435   0.256239   -1.270702   -0.645195
1  -0.685354   0.890791   -0.813012    0.631615
2  -0.783192  -0.531378    0.025070    0.230806
3   0.539042  -1.284314    0.826977   -0.026251

           C          D
1  -0.813012   0.631615
2   0.025070   0.230806
3   0.826977  -0.026251
4   1.423332   1.130568

Example 3

This example selects data by passing lists of integer positions.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Slicing through list of values
print(df.iloc[[1, 3, 5], [1, 3]])

Its output is as follows −

           B           D
1   0.890791    0.631615
3  -1.284314   -0.026251
5  -0.512888   -0.518930
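
Example 4

The list above also notes that .iloc accepts boolean arrays. A minimal sketch follows, using fixed (hypothetical) values for reproducibility; note that .iloc expects a plain boolean array or list, not a labeled Series.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(8, 2), columns=['A', 'B'])

# Boolean mask by position: keep every second row
mask = np.arange(len(df)) % 2 == 0
result = df.iloc[mask]
print(result)
```

Its output is as follows −

    A   B
0   0   1
2   4   5
4   8   9
6  12  13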

Direct Indexing with Brackets "[]"

Direct indexing with [] is a quick and intuitive way to access data, similar to indexing with Python dictionaries and NumPy arrays. It is often used for basic operations −

  • Single Column: Access a single column by its name.

  • Multiple Columns: Select multiple columns by passing a list of column names.

  • Row Slicing: Slice rows using integer-based indexing.

Example 1

This example demonstrates how to use the direct indexing with brackets for accessing a single column.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Accessing a Single Column
print(df['A'])

Its output is as follows −

0   -0.850937
1   -1.588211
2   -1.125260
3    2.608681
4   -0.156749
5    0.154958
6    0.396192
7   -0.397918
Name: A, dtype: float64

Example 2

This example selects the multiple columns using the direct indexing.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Accessing Multiple Columns
print(df[['A', 'B']])

Its output is as follows −

          A         B
0  0.167211 -0.080335
1 -0.104173  1.352168
2 -0.979755 -0.869028
3  0.168335 -1.362229
4 -1.372569  0.360735
5  0.428583 -0.203561
6 -0.119982  1.228681
7 -1.645357  0.331438
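
Example 3

The third bullet above, row slicing with brackets, can be sketched as follows; fixed (hypothetical) values are used so the output is predictable. As with Python lists, the end position is excluded.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# Row slicing with []: selects rows at positions 1 and 2
result = df[1:3]
print(result)
```

Its output is as follows −

   A  B
1  2  6
2  3  7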

Python Pandas - Series

In the Python Pandas library, a Series is one of the primary data structures, offering a convenient way to handle and manipulate one-dimensional data. It is similar to a column in a spreadsheet or a single column in a database table. In this tutorial, you will learn more about the Pandas Series and how to use it effectively for data manipulation and analysis.

What is a Series?

A Series in Pandas is a one-dimensional labeled array capable of holding data of any type, including integers, floats, strings, and Python objects. It consists of two main components −

  • Data: The actual values stored in the Series.
  • Index: The labels or indices that correspond to each data value.

A Series is similar to a one-dimensional ndarray (NumPy array) but with labels, which are also known as indices. These labels can be used to access the data within the Series. By default, the index values are integers starting from 0 to the length of the Series minus one, but you can also manually set the index labels.

Creating a Pandas Series

A pandas Series can be created using the following constructor −

class pandas.Series(data, index, dtype, name, copy)

The parameters of the constructor are as follows −

Sr.No Parameter & Description
1

data

Data takes various forms like ndarray, list, or constants.

2

index

Index values must be unique and hashable, with the same length as data. Default is np.arange(n) if no index is passed.

3

dtype

Data type. If None, data type will be inferred.

4

name

Gives a name to the Series.

5

copy

Copy data. Default is False.

A series object can be created using various inputs like −

  • List
  • ndarray
  • Dict
  • Scalar value or constant

Create an Empty Series

If no data is provided to the Series constructor pandas.Series(), it creates a basic empty Series object.

Example

Following is the example demonstrating creating the empty Series.

#import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series()

# Display the result
print('Resultant Empty Series:\n',s)

Its output is as follows −

Resultant Empty Series: 
Series([], dtype: object)

Create a Series from ndarray

If an ndarray is provided as input data to the Series constructor, it creates a Series with that data. If you want to specify a custom index, the index passed must be of the same length as the input data. If no index is specified, Pandas automatically generates a default index starting from 0 up to the length of the input data minus one, i.e., range(len(data)).

Example

Here's the example creating a Pandas Series using an ndarray.

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

Its output is as follows −

0   a
1   b
2   c
3   d
dtype: object

We did not pass any index, so by default, it assigned the indexes ranging from 0 to len(data)-1, i.e., 0 to 3.

Example

This example demonstrates applying the custom index to the series object while creating.

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print("Output:\n",s)

Its output is as follows −

Output:
100  a
101  b
102  c
103  d
dtype: object

In this example we have provided the index values. Now we can observe the customized indexed values in the output.

Create a Series from Python Dictionary

A dictionary can be passed as input to the pd.Series() constructor to create a Series from the dictionary's values. If no index is specified, the dictionary keys are used in insertion order to construct the Series index (older versions of pandas sorted the keys). If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

Example 1

Here is the basic example of creating the Series object using a Python dictionary.

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

Its output is as follows −

a 0.0
b 1.0
c 2.0
dtype: float64

Observe − Dictionary keys are used to construct index.

Example 2

In this example a Series object is created with Python dictionary by explicitly specifying the index labels.

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','x','a'])
print(s)

Its output is as follows −

b 1.0
c 2.0
x NaN
a 0.0
dtype: float64

Observe − Index order is persisted and the missing element is filled with NaN (Not a Number).

Create a Series from Scalar

If you provide a single scalar value as data to the pd.Series() constructor with specified index labels, that single value will be repeated to match the length of the provided index.

Example

Following is the example that demonstrates creating a Series object using a single scalar value.

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

Its output is as follows −

0  5
1  5
2  5
3  5
dtype: int64

Python Pandas - Slicing a Series Object

Pandas Series slicing is a process of selecting a group of elements from a Series object. A Series in Pandas is a one-dimensional labeled array that works similarly to the one-dimensional ndarray (NumPy array) but with labels, which are also called indexes.

Pandas Series slicing is very similar to Python and NumPy slicing, but it comes with additional features, like slicing based on both position and labels. In this tutorial, we will learn about the slicing operations on a Pandas Series object.

Basics of Pandas Series Slicing

Series slicing can be done by using the [:] operator, which allows you to select a subset of elements from the Series object by specifying start and end points.

Below are the syntaxes for slicing a Series −

  • Series[start:stop:step]: It selects elements from start to stop with the specified step value.

  • Series[start:stop]: It selects items from start to stop with step 1.

  • Series[start:]: It selects items from start to the rest of the object with step 1.

  • Series[:stop]: It selects the items from the beginning to stop with step 1.

  • Series[:]: It selects all elements from the series object.
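
Example: Slicing with a Step Value

The step form Series[start:stop:step] from the list above can be sketched as follows.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60])

# Select every second element from position 0 up to (but excluding) position 5
result = s[0:5:2]
print(result)
```

Its output is as follows −

0    10
2    30
4    50
dtype: int64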

Slicing a Series by Position

A Pandas Series allows you to select elements based on their integer position, just like a Python list object.

Example: Slicing range of values from a Series

Following is an example demonstrating how to slice a range of values from a Series object using positions.

import pandas as pd
import numpy as np

data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)

# Display the Original series
print('Original Series:',s, sep='\n')

# Slice the range of values
result = s[1:3]

# Display the output
print('Values after slicing the Series:', result, sep='\n')

Following is the output of the above code −

Original Series:
0    a
1    b
2    c
3    d
dtype: object
Values after slicing the Series:
1    b
2    c
dtype: object

Example: Slicing the First Three Elements from a Series

This example retrieves the first three elements of the Series using their positions.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first three element
print(s[:3])

Its output is as follows −

a  1
b  2
c  3
dtype: int64

Example: Slicing the Last Three Elements from a Series

Similar to the above example, the following example retrieves the last three elements from the Series using element positions.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the last three element
print(s[-3:])

Its output is as follows −

c  3
d  4
e  5
dtype: int64

Slicing a Series by Label

A Pandas Series is like a fixed-size Python dict in that you can get and set values by index labels.
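
For instance, a single value can be read and written by its label, much like with a dict; a minimal sketch −

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Get a single value by its label
print(s['b'])    # 2

# Set a value by its label
s['b'] = 20
print(s['b'])    # 20
```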

Example: Slicing Group of elements from a Series using the Labels

The following example retrieves multiple elements with slicing the label of a Series.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Slice multiple elements
print(s['a':'d'])

Its output is as follows −

a    1
b    2
c    3
d    4
dtype: int64

Example: Slicing First Three Elements using the Labels

The following example slices the first few elements using the labels of the Series data.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Slice multiple elements
print(s[:'c'])

Its output is as follows −

a    1
b    2
c    3
dtype: int64

Modifying Values after Slicing

After slicing a Series, you can also modify the values by assigning new values to the sliced elements.

Example

The following example demonstrates how to modify the series values after accessing the range values through slice.

import pandas as pd
s = pd.Series([1,2,3,4,5])

# Display the original series 
print("Original Series:\n",s)

# Modify the values of first two elements
s[:2] = [100, 200]

print("Series after modifying the first two elements:",s)

Following is the output of the above code −

Original Series:
 0    1
1    2
2    3
3    4
4    5
dtype: int64
Series after modifying the first two elements:
 0    100
1    200
2      3
3      4
4      5
dtype: int64

Python Pandas - Attributes of a Series Object

The Pandas Series is one of the primary data structures, providing a convenient way to handle and manipulate one-dimensional data. It looks similar to a single column in a spreadsheet or a single column in a database table.

Series object attributes are tools that help you get information about a Series object and its data. Pandas provides multiple attributes to understand and manipulate the data in a Series. In this tutorial, you will learn about Pandas Series attributes.

Data Information

These attributes provide information about the data in the Series −

Sr.No. Methods & Description
1

dtype

Returns the data type of the underlying data.

2

dtypes

Returns the data type of the underlying data.

3

nbytes

Returns the number of bytes in the underlying data.

4

ndim

Returns the number of dimensions of the underlying data, which is always 1 for a Series.

5

shape

Returns a tuple representing the shape of the underlying data.

6

size

Returns the number of elements in the underlying data.

7

values

Returns the Series as an ndarray or ndarray-like object depending on the data type.

Data Access

These attributes help in accessing data within the Series −

Sr.No. Methods & Description
1

at

Accesses a single value using a row/column label pair.

2

iat

Accesses a single value by integer position.

3

loc

Accesses a group of rows and columns by labels or a boolean array.

Data Properties

These attributes provide properties and metadata about the Series −

Sr.No. Methods & Description
1

empty

Indicates whether the Series or DataFrame is empty.

2

flags

Gets the properties associated with the Pandas object.

3

hasnans

Returns True if there are any NaN values.

4

index

Returns the index (axis labels) of the Series.

5

is_monotonic_decreasing

Returns True if the values are monotonically decreasing.

6

is_monotonic_increasing

Returns True if the values are monotonically increasing.

7

is_unique

Returns True if all values are unique.

8

name

Returns the name of the Series.

Other

This category includes attributes that perform a variety of other operations −

Sr.No. Methods & Description
1

array

Provides the underlying data of the Series as an ExtensionArray.

2

attrs

Returns a dictionary of global attributes of the dataset.

3

axes

Returns a list of the row axis labels.

4

T

Returns the transpose of the Series, which is essentially the same as the original Series.
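
Example

A short sketch demonstrating a few of the attributes listed above on a sample Series (the data and name here are hypothetical).

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, np.nan], index=['a', 'b', 'c'], name='scores')

print(s.dtype)     # float64
print(s.ndim)      # 1
print(s.shape)     # (3,)
print(s.size)      # 3
print(s.hasnans)   # True
print(s.name)      # scores
```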

Python Pandas - Arithmetic Operations on Series Object

The Pandas Series is one of the primary data structures; it stores one-dimensional labeled data. The data can be of any type, such as integers, floats, or strings. One of the primary advantages of using a Pandas Series is the ability to perform arithmetic operations in a vectorized manner. This means arithmetic operations on a Series are performed without needing to loop through the elements manually.

In this tutorial, we will learn how to apply arithmetic operations like addition(+), subtraction(-), multiplication(*), and division(/) to a single Series or between two Series objects.

Arithmetic Operations on a Series with Scalar Value

Arithmetic operations on a Pandas Series object can be directly applied to an entire Series elements, which means the operation is executed element-wise across all values. This is very similar to how operations work with NumPy arrays.

Following is the list of commonly used arithmetic operations on Pandas Series −

Operation Example Description
Addition s + 2 Adds 2 to each element
Subtraction s - 2 Subtracts 2 from each element
Multiplication s * 2 Multiplies each element by 2
Division s / 2 Divides each element by 2
Exponentiation s ** 2 Raises each element to the power of 2
Modulus s % 2 Finds remainder when divided by 2
Floor Division s // 2 Divides and floors the quotient

Example

The following example demonstrates how to apply all of these arithmetic operations to a Series object with a scalar value.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Display the Input Series
print('Input Series\n',s)

# Apply all Arithmetic Operation and Display the Results
print('\nAddition:\n',s+2)
print('\nSubtraction:\n', s-2)
print('\nMultiplication:\n', s * 2)
print('\nDivision:\n', s/2)
print('\nExponentiation:\n', s**2)
print('\nModulus:\n', s%2)
print('\nFloor Division:\n', s//2)

Following is the output of the above code −

Input Series
 a    1
b    2
c    3
d    4
e    5
dtype: int64

Addition:
 a    3
b    4
c    5
d    6
e    7
dtype: int64

Subtraction:
 a   -1
b    0
c    1
d    2
e    3
dtype: int64

Multiplication:
 a     2
b     4
c     6
d     8
e    10
dtype: int64

Division:
 a    0.5
b    1.0
c    1.5
d    2.0
e    2.5
dtype: float64

Exponentiation:
 a     1
b     4
c     9
d    16
e    25
dtype: int64

Modulus:
 a    1
b    0
c    1
d    0
e    1
dtype: int64

Floor Division:
 a    0
b    1
c    1
d    2
e    2
dtype: int64

Arithmetic Operations Between Two Series

You can perform arithmetic operations between two Series objects. Pandas automatically aligns the data by index labels. If a label exists in one Series but not in the other, the resultant value for that label will be NaN.

Example

This example demonstrates applying arithmetic operations on two Series objects.

import pandas as pd
s1 = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
s2 = pd.Series([9, 8, 6, 5], index=['x','a','b','c'])

# Apply all Arithmetic Operations and Display the Results
print('\nAddition:\n',s1+s2)
print('\nSubtraction:\n', s1-s2)
print('\nMultiplication:\n', s1 * s2)
print('\nDivision:\n', s1/s2)

Following is the output of the above code −

Addition:
 a    9.0
b    8.0
c    8.0
d    NaN
e    NaN
x    NaN
dtype: float64

Subtraction:
 a   -7.0
b   -4.0
c   -2.0
d    NaN
e    NaN
x    NaN
dtype: float64

Multiplication:
 a     8.0
b    12.0
c    15.0
d     NaN
e     NaN
x     NaN
dtype: float64

Division:
 a    0.125000
b    0.333333
c    0.600000
d         NaN
e         NaN
x         NaN
dtype: float64

Python Pandas - Converting Series to Other Objects

Pandas Series is a one-dimensional array-like object containing data of any type, such as integers, floats, and strings. Each data element is associated with a label (index). In some situations, you need to convert a Pandas Series into different formats for various use cases, like creating lists, NumPy arrays, dictionaries, or even converting the Series into a DataFrame.

In this tutorial, we will learn about various methods available in Pandas to convert a Series into different formats such as lists, NumPy arrays, dictionaries, DataFrames, and strings.

Following are the commonly used methods for converting Series into other formats −

Method Description
to_list() Converts the Series into a Python list.
to_numpy() Converts the Series into a NumPy array.
to_dict() Converts the Series into a dictionary.
to_frame() Converts the Series into a DataFrame.
to_string() Converts the Series into a string representation for display.

Converting Series to List

The Series.to_list() method converts a Pandas Series to a Python list, where each element of the Series becomes an element of the returned list. The type of each element in the list matches the type of the corresponding element in the Series.

Example

Here is the example of converting a Pandas Series into a Python list Using the Series.to_list() method.

import pandas as pd

# Create a Pandas Series
s = pd.Series([1, 2, 3])

# Convert Series to a Python list
result = s.to_list()

print("Output:",result)
print("Output Type:", type(result))

Output

Following is the output of the above code −

Output: [1, 2, 3]
Output Type: <class 'list'>

Converting Series to NumPy Array

The Pandas Series.to_numpy() method can be used to convert a Pandas Series into a NumPy array. This method provides additional features, like specifying the data type (dtype), handling missing values (na_value), and controlling whether the result should be a copy or a view.

Example

This example converts a Series into a NumPy array using the Series.to_numpy() method.

import pandas as pd

# Create a Pandas Series
s = pd.Series([1, 2, 3])

# Convert Series to a NumPy Array
result = s.to_numpy()

print("Output:",result)
print("Output Type:", type(result))

Output

Output: [1, 2, 3]
Output Type: <class 'numpy.ndarray'>

Converting Pandas Series to a Dictionary

The Pandas Series.to_dict() method is used to convert a Series into a Python dictionary, where each label (index) becomes a key and each corresponding value becomes the dictionary's value.

Example

The following example converts a Series into a Python dictionary using the Series.to_dict() method.

import pandas as pd

# Create a Pandas Series
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Convert Series to a Python dictionary
result = s.to_dict()

print("Output:",result)
print("Output Type:", type(result))

Output

Output: {'a': 1, 'b': 2, 'c': 3}
Output Type: <class 'dict'>

Converting a Series to DataFrame

The Series.to_frame() method allows you to convert a Series into a DataFrame. Each Series becomes a single column in the DataFrame. This method provides a name parameter to set the column name of the resulting DataFrame.

Example

This example uses the Series.to_frame() method to convert a Series into a Pandas DataFrame with a single column.

import pandas as pd

# Create a Pandas Series
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Convert Series to a Pandas DataFrame
result = s.to_frame(name='Numbers')

print("Output:\n",result)
print("Output Type:", type(result))

Output

Output:
    Numbers
a        1
b        2
c        3
Output Type: <class 'pandas.core.frame.DataFrame'>

Converting Series to Python String

To convert a Pandas Series object to a Python string, you can use the Series.to_string() method, which renders a string representation of the Series.

This method returns a string showing the index and values of the Series. You can customize the output string using various parameters like na_rep (represent missing values), header, index, float_format, length, etc.

Example

This example converts a Series into the Python string representation using the Series.to_string() method.

import pandas as pd

# Create a Pandas Series
s = pd.Series([1, 2, 3], index=['r1', 'r2', 'r3'])

# Convert Series to string representation
result = s.to_string()

print("Output:",repr(result))
print("Output Type:", type(result))

Output

Output: 'r1    1\nr2    2\nr3    3'
Output Type: <class 'str'>
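
Example

The parameters mentioned above can be combined to customize the rendered string. The following sketch (with hypothetical data) hides the index and represents missing values with a dash.

```python
import pandas as pd
import numpy as np

# Create a Series containing a missing value
s = pd.Series([1.5, np.nan, 3.0], index=['r1', 'r2', 'r3'])

# Hide the index and render NaN as '-'
result = s.to_string(index=False, na_rep='-')
print(repr(result))
```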

Python Pandas - DataFrame

A DataFrame in Python's pandas library is a two-dimensional labeled data structure that is used for data manipulation and analysis. It can handle different data types such as integers, floats, and strings. Each column has a unique label, and each row is labeled with a unique index value, which helps in accessing specific rows.

DataFrame is widely used in machine learning tasks, allowing users to manipulate and analyze large data sets. It supports operations such as filtering, sorting, merging, grouping, and transforming data.

Features of DataFrame

Following are the features of the Pandas DataFrame −

  • Columns can be of different types.
  • Size is mutable.
  • Labeled axes (rows and columns).
  • Can Perform Arithmetic operations on rows and columns.

Python Pandas DataFrame Structure

You can think of a DataFrame as similar to an SQL table or a spreadsheet data representation. Let us assume that we are creating a data frame with students' data.


Creating a pandas DataFrame

A pandas DataFrame can be created using the following constructor −

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

The parameters of the constructor are as follows −

Sr.No Parameter & Description
1

data

data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.

2

index

For the row labels, the index to be used for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.

3

columns

This parameter specifies the column labels. Optional; defaults to np.arange(n) if no column labels are passed.

4

dtype

Data type of each column.

5

copy

Copies the input data if True. Default is False.

Creating a DataFrame from Different Inputs

A pandas DataFrame can be created using various inputs like −

  • Lists
  • Dictionary
  • Series
  • Numpy ndarrays
  • Another DataFrame
  • External input files like CSV, JSON, HTML, Excel sheets, and more.

In the subsequent sections of this chapter, we will see how to create a DataFrame using these inputs.

Create an Empty DataFrame

An empty DataFrame can be created using the DataFrame constructor without any input.

Example

Following is the example creating an empty DataFrame.

#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)

Its output is as follows −

Empty DataFrame
Columns: []
Index: []

Create a DataFrame from Lists

The DataFrame can be created using a single list or a list of lists.

Example

The following example demonstrates how to create a pandas DataFrame from a Python list object.

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

Its output is as follows −

     0
0    1
1    2
2    3
3    4
4    5

Example

Here is another example of creating a Pandas DataFrame from a Python list of lists.

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

Its output is as follows −

      Name      Age
0     Alex      10
1     Bob       12
2     Clarke    13

Create a DataFrame from Dict of ndarrays / Lists

All the ndarrays must be of the same length. If an index is passed, then the length of the index should be equal to the length of the arrays.

If no index is passed, then by default, index will be range(n), where n is the array length.

Example

Here is the example of creating the DataFrame from a Python dictionary.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

Its output is as follows −

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42

Note − Observe the values 0,1,2,3. They are the default indexes assigned to each row using the function range(n).

Example

Let us now create an indexed DataFrame using arrays.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

Its output is as follows −

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42

Note − Observe, the index parameter assigns an index to each row.

Create a DataFrame from List of Dicts

List of Dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.

Example

The following example shows how to create a DataFrame by passing a list of dictionaries.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Its output is as follows −

    a    b      c
0   1   2     NaN
1   5   10   20.0

Note − Observe, NaN (Not a Number) is appended in missing areas.

Example

The following example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

Its output is as follows −

#df1 output
         a  b
first    1  2
second   5  10

#df2 output
         a  b1
first    1  NaN
second   5  NaN

Note − Observe, the df2 DataFrame is created with a column index (b1) that is not a dictionary key; thus, NaNs are appended in its place. Whereas df1 is created with column indices same as the dictionary keys, so no NaNs are appended.

Create a DataFrame from Dict of Series

Dictionary of Series can be passed to form a DataFrame. The resultant index is the union of all the series indexes passed.

Example

Here is the example −

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

Its output is as follows −

      one    two
a     1.0    1
b     2.0    2
c     3.0    3
d     NaN    4

Note − Observe, for the series one, there is no label d passed, so in the result, NaN is appended for the d label.

Example

Another example of creating a Pandas DataFrame from a Series −

import pandas as pd

data = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

df = pd.DataFrame(data)
print(df)

Its output is as follows −

       0
a      1
b      2
c      3
d      4

Python Pandas - Accessing DataFrame

Pandas DataFrame is a two-dimensional labeled data structure with row and column labels; it looks and works similar to a table in a database or a spreadsheet. To work with the DataFrame labels, Pandas provides simple tools to access and modify the rows and columns using the index and columns attributes of a DataFrame.

In this tutorial, we will learn about how to access and modify rows and columns in a Pandas DataFrame using the index and columns attributes of the DataFrame.

Accessing the DataFrame Row Labels

The index attribute in Pandas is used to access row labels in a DataFrame. It returns an Index object containing the labels corresponding to each row of the DataFrame. These labels can be integers, strings, or other hashable types.

Example

The following example accesses the DataFrame row labels using the index attribute.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78]},
    index=['r1', 'r2', 'r3', 'r4'])

# Access the rows of the DataFrame
result = df.index
print('Output Accessed Row Labels:', result)

Output

Following is the output of the above code −

Output Accessed Row Labels: Index(['r1', 'r2', 'r3', 'r4'], dtype='object')

Modifying DataFrame Row Labels

With the index attribute you can also modify the row labels of a DataFrame.

Example

Here is an example that demonstrates accessing and modifying the row labels of the Pandas DataFrame using the index attribute.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78]},
    index=['r1', 'r2', 'r3', 'r4'])

# Display the Input DataFrame
print('Input DataFrame:\n', df)

# Modify the Row labels of the DataFrame
df.index = [100, 200, 300, 400]
print('Output Modified DataFrame with the updated index labels:\n', df)

Output

On executing the above code you will get the following output −

Input DataFrame:
      Name  Age  Gender  Rating
r1  Steve   32    Male    3.45
r2    Lia   28  Female    4.60
r3    Vin   45    Male    3.90
r4  Katie   38  Female    2.78
Output Modified DataFrame with the updated index labels:
       Name  Age  Gender  Rating
100  Steve   32    Male    3.45
200    Lia   28  Female    4.60
300    Vin   45    Male    3.90
400  Katie   38  Female    2.78

Accessing the DataFrame Column Labels

The Pandas columns attribute is used to access the labels of the columns in the DataFrame. You can access and modify these column labels similarly to how we work with row labels.

Example

The following example demonstrates how to access the DataFrame column labels using the columns attribute.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78]},
    index=['r1', 'r2', 'r3', 'r4'])

# Access the column labels of the DataFrame
result = df.columns
print('Output Accessed column Labels:', result)

Output

Following is the output of the above code −

Output Accessed column Labels: Index(['Name', 'Age', 'Gender', 'Rating'], dtype='object')

Modifying the DataFrame Column Labels

Column labels can be modified using the columns attribute.

Example

This example demonstrates how to access and modify the DataFrame column labels using the columns attribute.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78]},
    index=['r1', 'r2', 'r3', 'r4'])
    
# Display the Input DataFrame
print('Input DataFrame:\n', df)

# Modify the Column labels of the DataFrame
df.columns = ['Col1', 'Col2', 'Col3', 'Col4']
print('Output Modified DataFrame with the updated Column Labels:\n', df)

Output

Following is the output of the above code −

Input DataFrame:
      Name  Age  Gender  Rating
r1  Steve   32    Male    3.45
r2    Lia   28  Female    4.60
r3    Vin   45    Male    3.90
r4  Katie   38  Female    2.78
Output Modified DataFrame with the updated Column Labels:      
     Col1  Col2    Col3  Col4
r1  Steve    32    Male  3.45
r2    Lia    28  Female  4.60
r3    Vin    45    Male  3.90
r4  Katie    38  Female  2.78

Python Pandas - Slicing a DataFrame Object

Pandas DataFrame slicing is the process of extracting specific rows, columns, or subsets of data based on either position or labels. DataFrame slicing is a common operation while working with large datasets. Similar to Python lists and NumPy ndarrays, DataFrame slicing uses the [] operator and dedicated slicing attributes like .iloc[] and .loc[] to retrieve data efficiently.

In this tutorial, we will learn about how to slice Pandas DataFrames using both positional and label-based indexing.

Introduction to Pandas DataFrame Slicing

Pandas DataFrame slicing is performed using two main attributes, which are −

  • .iloc[]: For slicing based on position (integer-based indexing).

  • .loc[]: For slicing based on labels (index labels or column labels).

Let's learn about all possible ways of slicing a Pandas DataFrame.
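Before looking at each in detail, note one subtle difference: with .iloc[], the end position is excluded from the result, while with .loc[], the end label is included. A minimal sketch with hypothetical data −

```python
import pandas as pd

# Hypothetical DataFrame with labeled rows
df = pd.DataFrame({'col1': [10, 20, 30, 40]}, index=['r1', 'r2', 'r3', 'r4'])

# .iloc[] excludes the end position: returns rows at positions 0 and 1 only
print(df.iloc[0:2])

# .loc[] includes the end label: returns rows 'r1', 'r2' and 'r3'
print(df.loc['r1':'r3'])
```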

Slicing a DataFrame by Position

The Pandas DataFrame.iloc[] attribute is used to slice a DataFrame based on the integer position (i.e., integer-based indexing) of rows and columns.

Following is the syntax of slicing a DataFrame using the .iloc[] attribute −

DataFrame.iloc[row_start:row_end, column_start:column_end]

Where row_start and row_end indicate the start and end integer positions of the DataFrame rows (the end position is excluded). Similarly, column_start and column_end are the column positions.

Example: Slicing DataFrame Rows by Position

The following example demonstrates how to slice the DataFrame rows using the DataFrame.iloc[] attribute.

import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame([['a','b'], ['c','d'], ['e','f'], ['g','h']], columns=['col1', 'col2'])

# Display the DataFrame
print("Input DataFrame:")
print(df)

# Slice rows based on position
result = df.iloc[1:3, :]
print("Output:")
print(result)

Following is the output of the above code −


Input DataFrame:
  col1 col2
0    a    b
1    c    d
2    e    f
3    g    h

Output:
  col1 col2
1    c    d
2    e    f

Slicing a DataFrame by Label

The Pandas DataFrame.loc[] attribute is used to slice a DataFrame based on the labels of rows and columns.

Following is the syntax of slicing a DataFrame using the .loc[] attribute −

DataFrame.loc[row_label_start:row_label_end, column_label_start:column_label_end]

Where row_label_start and row_label_end indicate the start and end labels of the DataFrame rows (unlike .iloc[], the end label is included). Similarly, column_label_start and column_label_end are the column labels.

Example: Slicing DataFrame Rows and Columns using .loc[]

The following example demonstrates how to slice DataFrame rows and columns by their labels using the .loc[] attribute.

import pandas as pd

# Create a DataFrame with labeled indices
df = pd.DataFrame([['a','b'], ['c','d'], ['e','f'], ['g','h']], columns=['col1', 'col2'], index=['r1', 'r2', 'r3', 'r4'])

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Slice rows and columns by label
result = df.loc['r1':'r3', 'col1']
print("Output:")
print(result)

Following is the output of the above code −

Original DataFrame:
   col1 col2
r1    a    b
r2    c    d
r3    e    f
r4    g    h

Output:
r1    a
r2    c
r3    e
Name: col1, dtype: object

DataFrame Column Slicing

Similar to the above row slicing, Pandas DataFrame column slicing can also be done using .iloc[] for positions and .loc[] for labels.

Example: Column Slicing using iloc[]

The following example slices the DataFrame columns based on their integer positions.

import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

# Slice a single column
col_A = df.iloc[:, 0]
print("Slicing a single column A using iloc[]:")
print(col_A)

# Slice multiple columns
cols_AB = df.iloc[:, 0:2]
print("Slicing multiple columns A and B using iloc[]:")
print(cols_AB)

Following is the output of the above code −

Slicing a single column A using iloc[]:
0    1
1    2
2    3
Name: A, dtype: int64

Slicing multiple columns A and B using iloc[]:
   A  B
0  1  4
1  2  5
2  3  6

Example: Column Slicing Using loc[]

This example slices the DataFrame columns by their labels using the .loc[] attribute.

import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

# Slice a single column by label
col_A = df.loc[:, 'A']
print("Slicing a single column A using loc[]:")
print(col_A)

# Slice multiple columns by label
cols_AB = df.loc[:, 'A':'B']
print("Slicing Multiple columns A and B using loc[]:")
print(cols_AB)

Following is the output of the above code −

Slicing a single column A using loc[]:
0    1
1    2
2    3
Name: A, dtype: int64

Slicing Multiple columns A and B using loc[]:
   A  B
0  1  4
1  2  5
2  3  6

Modifying Values After Slicing

After slicing a DataFrame, you can modify the sliced values directly. This can be done by assigning new values to the selected elements.

Example

This example demonstrates how to modify the sliced DataFrame values directly.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['e', 'f'], ['g', 'h']], 
                  columns=['col1', 'col2'])

# Display the Original DataFrame
print("Original DataFrame:", df, sep='\n')

# Modify a subset of the DataFrame using iloc
df.iloc[1:3, 0] = ['x', 'y']

# Display the modified DataFrame
print('Modified DataFrame:',df, sep='\n')

Following is the output of the above code −

Original DataFrame:
  col1 col2
0    a    b
1    c    d
2    e    f
3    g    h

Modified DataFrame:
  col1 col2
0    a    b
1    x    d
2    y    f
3    g    h
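When modifying sliced values, it is safest to perform the selection and the assignment in a single .loc[] or .iloc[] call. Chained indexing (selecting twice in a row) may operate on a temporary copy and leave the original DataFrame unchanged, which Pandas flags with a SettingWithCopyWarning. A small sketch −

```python
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['e', 'f']], columns=['col1', 'col2'])

# Single-step assignment with .loc reliably updates df
df.loc[df['col1'] == 'c', 'col2'] = 'Z'
print(df)

# By contrast, chained indexing such as
#   df[df['col1'] == 'c']['col2'] = 'Z'
# may assign into a temporary copy and not modify df at all
```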

Python Pandas - Modifying DataFrame

Pandas DataFrame is a two-dimensional data structure that can be used for storing and manipulating tabular data. It consists of rows and columns, making it similar to a spreadsheet or SQL table. Modifying a Pandas DataFrame is a crucial step in data preprocessing, data analysis, and data cleaning.

Some of the most common DataFrame modifications include −

  • Renaming column or row labels.

  • Adding or inserting new columns.

  • Updating or replacing existing column values.

  • Removing unnecessary columns.

In this tutorial, we will learn about how to modify Pandas DataFrames in different ways.

Renaming Column Labels in a DataFrame

Renaming column or row labels improves data readability and helps standardize column names. The rename() method in Pandas allows renaming one or more columns or row labels.

Example

The following example uses the DataFrame.rename() method to rename a column of a DataFrame.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3],'B': [4, 5, 6]})

# Display original DataFrame
print("Original DataFrame:")
print(df)

# Rename column 'A' to 'aa'
df = df.rename(columns={'A': 'aa'})

# Display modified DataFrame
print("Modified DataFrame:")
print(df)

Output

Following is the output of the above code −

Original DataFrame:
   A  B
0  1  4
1  2  5
2  3  6
Modified DataFrame:
   aa  B
0   1  4
1   2  5
2   3  6

Renaming Row Labels in a DataFrame

Similarly, you can rename the row labels of a Pandas DataFrame using the index parameter of the rename() method.

Example

This example demonstrates how to rename the row labels of a Pandas DataFrame using the rename(index={}) method.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3],'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# Display original DataFrame
print("Original DataFrame:")
print(df)

# Rename the multiple row labels
df = df.rename(index={'x': 'r1', 'y':'r2', 'z':'r3'})

# Display modified DataFrame
print("Modified DataFrame:")
print(df)

Output

Following is the output of the above code −

Original DataFrame:
   A  B
x  1  4
y  2  5
z  3  6
Modified DataFrame:
    A  B
r1  1  4
r2  2  5
r3  3  6

Adding or Inserting Columns in a DataFrame

Adding a new column to an existing DataFrame is straightforward. The simplest way is to directly assign values to the DataFrame using a new column name. Additionally, you can use the DataFrame.insert() method to insert a new column at a specified location.

Example: Adding a New Column Directly

The following example demonstrates how to add a new column directly to a DataFrame.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3],'B': [4, 5, 6]})

# Add a new column 'C' with values
df['C'] = [7, 8, 9]

# Display updated DataFrame
print("DataFrame after adding a new column 'C':")
print(df)

Output

Following is the output of the above code −

DataFrame after adding a new column 'C':
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

Example: Inserting a Column at a Specific Position

This example demonstrates how to insert a column at a specific position of a DataFrame using the DataFrame.insert() method. In this example, we insert the column 'D' at index position 1.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3],'B': [4, 5, 6]})

# Insert a new column 'D' at position 1
df.insert(1, 'D', [10, 11, 12])

# Display updated DataFrame
print("DataFrame after inserting column 'D' at position 1:")
print(df)

Output

Following is the output of the above code −

DataFrame after inserting column 'D' at position 1:
   A   D  B
0  1  10  4
1  2  11  5
2  3  12  6

Replacing the Contents of a DataFrame

The contents of a DataFrame can be replaced in multiple ways; one of the easiest is assigning new values directly to the relevant part of the DataFrame.

Example: Replacing a Column's Values

The following example demonstrates how to replace/update particular column values of a DataFrame using direct assignment.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3],'B': [4, 5, 6]})

# Replace the contents of column 'A' with new values
df['A'] = [10, 20, 30]

# Display updated DataFrame
print("DataFrame after replacing column 'A':")
print(df)

Output

Following is the output of the above code −

DataFrame after replacing column 'A':
    A  B
0  10  4
1  20  5
2  30  6

Replacing Specific Values using the replace() method

You can also use the DataFrame.replace() method to replace specific values within a column of a DataFrame.

Example

This example demonstrates how to replace specific values in a DataFrame using the DataFrame.replace() method.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3],'B': [4, 5, 6]})

# Display the Input DataFrame 
print("Original DataFrame:", df, sep='\n')

# Replace value 1 in column 'A' and value 6 in column 'B' with 100
df.replace({'A': 1, 'B': 6}, 100, inplace=True)

# Display updated DataFrame
print("DataFrame after replacing the specified values:")
print(df)

Output

Following is the output of the above code −

Original DataFrame:
   A  B
0  1  4
1  2  5
2  3  6
DataFrame after replacing the specified values:
     A    B
0  100    4
1    2    5
2    3  100

Deleting Columns

Removing unnecessary columns is essential for data cleaning. You can delete single or multiple columns of a DataFrame using the DataFrame.drop() method.

Example

Here is an example that demonstrates how to delete multiple columns from a Pandas DataFrame using the DataFrame.drop() method.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3],'B': [4, 5, 6],'C': [7, 8, 9]})

# Display the original DataFrame
print("Original DataFrame:", df, sep='\n')

# Delete columns 'A' and 'B'
df = df.drop(columns=['A', 'B'])

# Display updated DataFrame
print("DataFrame after deleting columns 'A' and 'B':")
print(df)

Output

Following is the output of the above code −

Original DataFrame:
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
DataFrame after deleting columns 'A' and 'B':
   C
0  7
1  8
2  9

Python Pandas - Removing Rows from a DataFrame

Data cleaning is an essential step in preprocessing, and removing unwanted rows is a common operation in Pandas. A Pandas DataFrame is a two-dimensional data structure in Python that organizes data in a tabular format, consisting of rows and columns. It is widely used for data analysis and manipulation tasks, enabling efficient handling of large datasets.

Removing rows may be necessary for various reasons −

  • Removing the irrelevant data

  • Removing duplicate or missing values

  • Deleting specific rows based on conditions

Pandas provides multiple ways to remove rows efficiently. In this tutorial, we will learn about various techniques to remove/drop rows from a pandas DataFrame, including −

  • Using the .drop() method

  • Removing rows based on conditions

  • Dropping rows with index slicing

Dropping Rows using the drop() method

The pandas DataFrame.drop() method is used to remove specific rows from a pandas DataFrame. It drops rows by their index labels (which are integers by default) and returns a new DataFrame with the selected rows removed.

Example: Dropping DataFrame Rows by Index Values

Here is a basic example of deleting a row from a DataFrame object using the DataFrame.drop() method based on its index value.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],'B': [4, 5, 6, 7, 8]})

# Display original DataFrame
print("Original DataFrame:")
print(df)

# Drop the row with index 3
result = df.drop(3)

# Display the result
print("\nAfter dropping the row at index 3:")
print(result)

Output

Following is the output of the above code −

Original DataFrame:
   A  B
0  1  4
1  2  5
2  3  6
3  4  7
4  5  8

After dropping the row at index 3:
   A  B
0  1  4
1  2  5
2  3  6
4  5  8

Note: This method raises a KeyError if the specified row label is not found in the index of the DataFrame. This error can be suppressed by setting the errors parameter from 'raise' (the default) to 'ignore'.
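The behavior described in the note can be sketched as follows −

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# df.drop(10) would raise KeyError because label 10 is not in the index;
# errors='ignore' silently returns the DataFrame unchanged instead
result = df.drop(10, errors='ignore')
print(result)
```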

Dropping Multiple Rows by Labels

By providing a list of row labels to the drop() method, we can easily remove multiple rows at a time from a DataFrame.

Example

Similar to the previous example, the following one deletes multiple rows from a DataFrame based on their row labels using the DataFrame.drop() method. Here, we pass a list of row labels to the drop() method.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],'B': [4, 5, 6, 7, 8],
'C': [9, 10, 11, 12, 13]}, index=['r1', 'r2', 'r3', 'r4', 'r5'])

# Display original DataFrame
print("Original DataFrame:")
print(df)

# Drop the rows by row-labels
result = df.drop(['r1', 'r3'])

# Display the result
print("\nAfter dropping the rows:")
print(result)

Output

Following is the output of the above code −

Original DataFrame:
    A  B   C
r1  1  4   9
r2  2  5  10
r3  3  6  11
r4  4  7  12
r5  5  8  13

After dropping the rows:
    A  B   C
r2  2  5  10
r4  4  7  12
r5  5  8  13

Removing Rows Based on a Condition

Rows can be removed based on a conditional expression, meaning that you can use a condition inside the selection brackets [] to filter the rows. This method is useful when filtering out rows that meet a specific condition, such as missing values or unwanted entries.

Example

This example demonstrates how to drop one or more rows from a Pandas DataFrame based on a conditional statement specified inside the []. In this example, row deletion is based on a column value.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],'B': [4, 5, 6, 7, 8],
'C': [90, 0, 11, 12, 13]}, index=['r1', 'r2', 'r3', 'r4', 'r5'])

# Display original DataFrame
print("Original DataFrame:")
print(df)

# Dropping rows where column 'C' contains 0
result = df[df["C"] != 0]

# Display the result
print("\nAfter dropping the row where 'C' has 0:")
print(result)

Output

Following is the output of the above code −

Original DataFrame:
    A  B   C
r1  1  4  90
r2  2  5   0
r3  3  6  11
r4  4  7  12
r5  5  8  13

After dropping the row where 'C' has 0:
    A  B   C
r1  1  4  90
r3  3  6  11
r4  4  7  12
r5  5  8  13
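The earlier bullet list also mentioned removing rows with duplicate or missing values; Pandas provides the dropna() and drop_duplicates() methods for these cases. A brief sketch with hypothetical data −

```python
import pandas as pd
import numpy as np

# Row at index 1 duplicates row 0; row at index 2 has a missing value
df = pd.DataFrame({'A': [1, 1, np.nan, 4], 'B': [5, 5, 7, 8]})

# Drop rows containing any missing value
cleaned = df.dropna()
print(cleaned)

# Drop fully duplicated rows, keeping the first occurrence
deduped = df.drop_duplicates()
print(deduped)
```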

Removing Rows using Index Slicing

Another approach to removing or dropping rows is index slicing. This technique drops a range of rows based on their index positions.

Example

This example demonstrates how to drop single or multiple rows from a DataFrame using the index slicing technique.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],'B': [4, 5, 6, 7, 8]})

# Display original DataFrame
print("Original DataFrame:")
print(df)

# Drop the row using the index slicing
result = df.drop(df.index[2:4])

# Display the result
print("\nAfter dropping the row at 2 and 3:")
print(result)

Output

Following is the output of the above code −

Original DataFrame:
   A  B
0  1  4
1  2  5
2  3  6
3  4  7
4  5  8

After dropping the row at 2 and 3:
   A  B
0  1  4
1  2  5
4  5  8

Python Pandas - Arithmetic Operations on Dataframe

Pandas DataFrame is a two-dimensional, labeled data structure that allows for efficient data manipulation and analysis. One of the primary features of Pandas is its ability to perform vectorized arithmetic operations on DataFrames. This means you can apply mathematical operations without looping through the elements manually.

Applying arithmetic operations in Pandas allows you to manipulate data quickly and efficiently, whether you're working with a single DataFrame or performing operations between multiple DataFrames.

In this tutorial, we will learn how to apply arithmetic operations like addition, subtraction, multiplication, and division on Pandas DataFrames.

Arithmetic Operations on DataFrame with Scalar Value

You can perform arithmetic operations on a DataFrame with scalar values directly. These operations are applied element-wise, meaning that every value in the DataFrame is affected by the arithmetic operation.

Following is the list of commonly used arithmetic operators on Pandas DataFrame −

Operation        Example   Description
Addition         df + 2    Adds 2 to each element of the DataFrame
Subtraction      df - 2    Subtracts 2 from each element
Multiplication   df * 2    Multiplies each element by 2
Division         df / 2    Divides each element by 2
Exponentiation   df ** 2   Raises each element to the power of 2
Modulus          df % 2    Finds the remainder when divided by 2
Floor Division   df // 2   Divides and floors the quotient

Example

The following example demonstrates how to apply all of the above arithmetic operators on a Pandas DataFrame with a scalar value.

import pandas as pd

# Create a sample DataFrame
data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}
df = pd.DataFrame(data)

# Display the input DataFrame
print("Input DataFrame:\n", df)

# Perform arithmetic operations
print("\nAddition:\n", df + 2)
print("\nSubtraction:\n", df - 2)
print("\nMultiplication:\n", df * 2)
print("\nDivision:\n", df / 2)
print("\nExponentiation:\n", df ** 2)
print("\nModulus:\n", df % 2)
print("\nFloor Division:\n", df // 2)

Output

Following is the output of the above code −

Input DataFrame:
    A  B
0  1  5
1  2  6
2  3  7
3  4  8

Addition:
    A   B
0  3   7
1  4   8
2  5   9
3  6  10

Subtraction:
    A  B
0 -1  3
1  0  4
2  1  5
3  2  6

Multiplication:
    A   B
0  2  10
1  4  12
2  6  14
3  8  16

Division:
      A    B
0  0.5  2.5
1  1.0  3.0
2  1.5  3.5
3  2.0  4.0

Exponentiation:
     A   B
0   1  25
1   4  36
2   9  49
3  16  64

Modulus:
    A  B
0  1  1
1  0  0
2  1  1
3  0  0

Floor Division:
    A  B
0  0  2
1  1  3
2  1  3
3  2  4

Arithmetic Operations Between Two DataFrames

Pandas allows you to apply arithmetic operators between two DataFrames efficiently. These operations are applied element-wise, meaning corresponding elements in both DataFrames are used in calculations.

When performing arithmetic operations on two DataFrames, Pandas aligns them based on their index and column labels. If a particular index or column is missing in either DataFrame, the result for those entries will be NaN, indicating missing values.

Example

This example demonstrates applying arithmetic operations on two DataFrames, including addition, subtraction, multiplication, and division.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
df2 = pd.DataFrame({'A': [10, 20, 30], 'B': [50, 60, 70]}, index=[1, 2, 3])

# Display the input DataFrames
print("DataFrame 1:\n", df1)
print("\nDataFrame 2:\n", df2)

# Perform arithmetic operations
print("\nAddition of Two DataFrames:\n", df1 + df2)
print("\nSubtraction of Two DataFrames:\n", df1 - df2)
print("\nMultiplication of Two DataFrames:\n", df1 * df2)
print("\nDivision of Two DataFrames:\n", df1 / df2)

Output

Following is the output of the above code −

DataFrame 1:
    A  B
0  1  5
1  2  6
2  3  7
3  4  8

DataFrame 2:
     A   B
1  10  50
2  20  60
3  30  70

Addition of Two DataFrames:
       A     B
0   NaN   NaN
1  12.0  56.0
2  23.0  67.0
3  34.0  78.0

Subtraction of Two DataFrames:
       A     B
0   NaN   NaN
1  -8.0 -44.0
2 -17.0 -53.0
3 -26.0 -62.0

Multiplication of Two DataFrames:
        A      B
0    NaN    NaN
1   20.0  300.0
2   60.0  420.0
3  120.0  560.0

Division of Two DataFrames:
           A         B
0       NaN       NaN
1  0.200000  0.120000
2  0.150000  0.116667
3  0.133333  0.114286

Arithmetic Functions in Pandas

In addition to the above operators, Pandas provides various functions to perform arithmetic operations on Pandas data structures. These functions can handle missing values efficiently and provide additional options for customization, like selecting the axis and specifying levels.

S.No Function Description
1 add(other[, axis, level, fill_value]) Element-wise addition (binary operator +).
2 sub(other[, axis, level, fill_value]) Element-wise subtraction (binary operator -).
3 mul(other[, axis, level, fill_value]) Element-wise multiplication (binary operator *).
4 div(other[, axis, level, fill_value]) Element-wise floating division (binary operator /).
5 truediv(other[, axis, level, ...]) Element-wise floating division (binary operator /).
6 floordiv(other[, axis, level, ...]) Element-wise integer division (binary operator //).
7 mod(other[, axis, level, fill_value]) Element-wise modulo operation (binary operator %).
8 pow(other[, axis, level, fill_value]) Element-wise exponential power (binary operator **).
9 dot(other) Matrix multiplication with another DataFrame or array.
10 radd(other[, axis, level, fill_value]) Reverse element-wise addition.
11 rsub(other[, axis, level, fill_value]) Reverse element-wise subtraction.
12 rmul(other[, axis, level, fill_value]) Reverse element-wise multiplication.
13 rdiv(other[, axis, level, fill_value]) Reverse element-wise floating division.
14 rfloordiv(other[, axis, level, ...]) Reverse element-wise integer division.
15 rmod(other[, axis, level, fill_value]) Reverse element-wise modulo operation.
16 rpow(other[, axis, level, fill_value]) Reverse element-wise exponential power.
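The fill_value parameter accepted by most of these functions addresses the NaN propagation shown in the earlier two-DataFrame example: an entry missing in one of the two DataFrames is replaced by the given value before the operation is applied. For instance −

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
df2 = pd.DataFrame({'A': [10, 20, 30], 'B': [50, 60, 70]}, index=[1, 2, 3])

# Index 0 exists only in df1; fill_value=0 substitutes 0 for the
# missing df2 entries instead of producing NaN
result = df1.add(df2, fill_value=0)
print(result)
```

Note that fill_value only helps when an entry is missing in one of the two operands; if it is missing in both, the result is still NaN.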

Python Pandas - IO Tools

The Pandas library offers powerful I/O tools (API) for data import and export, enabling seamless handling of various file formats like CSV, Excel, JSON, and many more. This API includes top-level reader functions like pd.read_csv() and pd.read_clipboard(), and corresponding writer methods like to_csv() and to_clipboard(), for easy data handling.

In this tutorial, we will learn about the overview of the Pandas I/O tools and learn how to use them effectively.

Overview of Pandas IO Tools

The Pandas I/O API supports a wide variety of data formats. Here is a summary of supported formats and their corresponding reader and writer functions −

Format                  Reader Function     Writer Function
Tabular Data            read_table()        NA
CSV                     read_csv()          to_csv()
Fixed-Width Text File   read_fwf()          NA
Clipboard               read_clipboard()    to_clipboard()
Pickling                read_pickle()       to_pickle()
Excel                   read_excel()        to_excel()
JSON                    read_json()         to_json()
HTML                    read_html()         to_html()
XML                     read_xml()          to_xml()
LaTeX                   NA                  to_latex()
HDF5 Format             read_hdf()          to_hdf()
Feather                 read_feather()      to_feather()
Parquet                 read_parquet()      to_parquet()
ORC                     read_orc()          to_orc()
SQL                     read_sql()          to_sql()
Stata                   read_stata()        to_stata()

Among these, the most frequently used functions for handling text files are read_csv() and read_table(). Both convert flat files into DataFrame objects.

Example: Reading CSV Data

This example shows how to read CSV data using the Pandas read_csv() function. Here, we use StringIO to load a CSV string into a Pandas DataFrame object.

import pandas as pd

# Import StringIO to load a file-like object for reading CSV
from io import StringIO

# Create string representing CSV data
data = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900"""

# Use StringIO to convert the string data into a file-like object
obj = StringIO(data)

# read CSV into a Pandas DataFrame
df = pd.read_csv(obj)

print(df)

Output

Its output is as follows −

   S.No    Name  Age       City  Salary
0     1     Tom   28    Toronto   20000
1     2     Lee   32   HongKong    3000
2     3  Steven   43   Bay Area    8300
3     4     Ram   38  Hyderabad    3900

Customizing Parsing Options

Pandas allows several customization options when parsing data. You can modify how the data is parsed using parameters like −

  • index_col

  • dtype

  • names

  • skiprows

Below, we discuss the common parsing options for customization.

Customizing the Index

You can customize the row labels or index of the Pandas object by using index_col parameter. Setting index_col=False forces Pandas to not use the first column as the index, which can be helpful when handling malformed files with extra delimiters.

Example

This example uses the index_col parameter to customize the row labels while reading the CSV data.

import pandas as pd

# Import StringIO to load a file-like object for reading CSV
from io import StringIO

# Create string representing CSV data
data = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900"""

# Use StringIO to convert the string data into a file-like object
obj = StringIO(data)

# read CSV into a Pandas DataFrame
df = pd.read_csv(obj, index_col=['S.No'])

# Display the DataFrame
print(df)

Output

Its output is as follows −

        Name  Age       City  Salary
S.No                                
1        Tom   28    Toronto   20000
2        Lee   32   HongKong    3000
3     Steven   43   Bay Area    8300
4        Ram   38  Hyderabad    3900
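The index_col=False case mentioned above (malformed files with a delimiter at the end of every line) can be sketched as follows; without it, Pandas would treat the first data column as the index −

```python
import pandas as pd
from io import StringIO

# Each data row ends with a stray trailing comma (one field more than the header)
data = """Name,Age,City
Tom,28,Toronto,
Lee,32,HongKong,"""

# index_col=False prevents Pandas from using the first column as the index,
# so the trailing empty field is discarded and the columns stay aligned
df = pd.read_csv(StringIO(data), index_col=False)
print(df)
```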

Converters and Data Types

Pandas also provides the ability to specify the data type for columns using the dtype parameter. You can convert columns to specific types like {'Col_1': np.float64, 'Col_2': np.int32, 'Col3': 'Int64'}.

Example

This example customizes the data type of JSON data while parsing it using the read_json() method with the dtype parameter.

import pandas as pd
from io import StringIO
import numpy as np

# Create a string representing JSON data
data = """[
    {"Name": "Braund", "Gender": "Male", "Age": 30},
    {"Name": "Cumings", "Gender": "Female", "Age": 25},
    {"Name": "Heikkinen", "Gender": "Female", "Age": 35}
]"""

# Use StringIO to convert the JSON-formatted string data into a file-like object
obj = StringIO(data)

# Read JSON into a Pandas DataFrame
df = pd.read_json(obj, dtype={'Age': np.float64})

# Display the DataFrame
print(df.dtypes)

Output

Its output is as follows −

Name       object
Gender     object
Age       float64
dtype: object

By default, the dtype of the 'Age' column would be inferred as int, but the result shows it as float because we have explicitly cast the type.

Thus, the data looks like float −

        Name  Gender   Age
0     Braund    Male  30.0
1    Cumings  Female  25.0
2  Heikkinen  Female  35.0
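Besides dtype, the read_csv() function accepts a converters parameter: a dictionary mapping column names (or indices) to functions applied to each value during parsing. A small sketch; the transformations here are purely illustrative:

```python
import pandas as pd
from io import StringIO

data = """Name,Salary
Tom,20000
Lee,3000"""

# Each converter receives the raw string value of its column
df = pd.read_csv(StringIO(data), converters={
   'Name': str.upper,                  # upper-case every name
   'Salary': lambda s: int(s) / 1000   # express salary in thousands
})
print(df)
```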

Customizing the Header Names

When reading data files, Pandas assumes the first row is the header. However, you can customize this using the names parameter to provide custom column names.

Example

This example reads the XML data into a Pandas DataFrame object by customizing the header names using the names parameter of the read_xml() method.

import pandas as pd
from io import StringIO

# Create a String representing XML data 
xml = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>"""

# Parse the XML data with custom column names
df = pd.read_xml(StringIO(xml), names=['a', 'b', 'c','d','e'])

# Display the Output DataFrame
print('Output DataFrame from XML:')
print(df)

Output

Its output is as follows −

Output DataFrame from XML:
          a                 b                    c     d      e
0   cooking  Everyday Italian  Giada De Laurentiis  2005  30.00
1  children      Harry Potter         J K. Rowling  2005  29.99
2       web      Learning XML          Erik T. Ray  2003  39.95

Example: Reading with custom column names and header row

If the header is in a row other than the first, pass the row number to header. This will skip the preceding rows.

import pandas as pd

# Import StringIO to load a file-like object for reading CSV
from io import StringIO

# Create string representing CSV data
data = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900"""

# Use StringIO to convert the string data into a file-like object
obj = StringIO(data)

# read CSV into a Pandas DataFrame
df = pd.read_csv(obj, names=['a', 'b', 'c','d','e'], header=0)

# Display the DataFrame
print(df)

Output

Its output is as follows −

   a       b   c          d      e
0  1     Tom  28    Toronto  20000
1  2     Lee  32   HongKong   3000
2  3  Steven  43   Bay Area   8300
3  4     Ram  38  Hyderabad   3900

Skipping Rows

The skiprows parameter allows you to skip a specific number of rows or line numbers when reading a file. It can also accept a callable function to decide which rows to skip based on conditions.

Example

This example shows skipping rows of the input data while parsing.

import pandas as pd

# Import StringIO to load a file-like object for reading CSV
from io import StringIO

# Create string representing CSV data
data = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900"""

# Use StringIO to convert the string data into a file-like object
obj = StringIO(data)
    
# read CSV into a Pandas DataFrame
df = pd.read_csv(obj, skiprows=2)

# Display the DataFrame
print(df)

Output

Its output is as follows −

   2     Lee  32   HongKong  3000
0  3  Steven  43   Bay Area  8300
1  4     Ram  38  Hyderabad  3900
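As mentioned, skiprows also accepts a callable that receives each line number and returns True for lines to skip. A sketch that keeps the header (line 0) and skips every second data line:

```python
import pandas as pd
from io import StringIO

data = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900"""

# Keep line 0 (the header); skip every even-numbered line after it
df = pd.read_csv(StringIO(data), skiprows=lambda x: x > 0 and x % 2 == 0)
print(df)
```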

Python Pandas - Working with CSV Format

Working with the CSV format is a common task in data analysis and data science. CSV (Comma-Separated Values) files are widely used to store tabular data because they are lightweight, human-readable, and supported by almost all data analysis tools and programming languages.

The Python Pandas library is a powerful tool for working with data; it offers extensive functionality for reading, processing, and writing data in CSV format. With Pandas, you can easily handle complex operations like filtering, grouping, and manipulating data in CSV files.

A CSV file is a plain text file in which data values are separated by commas, representing tabular data. A CSV file has a .csv extension. The data in a CSV file looks like this −

Sr.no,Name,Gender,Age
1,Braund,male,22
2,Cumings,female,38
3,Heikkinen,female,26
4,Futrelle,female,35

In this tutorial, we will learn how to work with CSV files using Pandas, from reading CSV files into DataFrames, alternative reading methods, and handling large datasets, to exporting data back to CSV.

Reading a CSV File in Pandas

The pandas.read_csv() function is used to read a CSV file into a Pandas DataFrame (or a TextFileReader when iterating in chunks). This function accepts CSV data from a URL or a local file path to load the data into the Pandas environment.

Example

The following example demonstrates how to read CSV data using the pandas.read_csv() function. Here we are using the StringIO to load the CSV string into a file-like object.

import pandas as pd

# Import StringIO to load a file-like object for reading CSV
from io import StringIO

# Create string representing CSV data
data = """Name,Gender,Age
Braund,male,22
Cumings,female,38
Heikkinen,female,26
Futrelle,female,35"""

# Use StringIO to convert the string data into a file-like object
obj = StringIO(data)

# read CSV into a Pandas DataFrame
df = pd.read_csv(obj)

print(df)

Output

Following is the output of the above code −

        Name  Gender  Age
0     Braund    male   22
1    Cumings  female   38
2  Heikkinen  female   26
3   Futrelle  female   35

Writing Data to a CSV File

Pandas provides a method called to_csv() to create or write a CSV file from Pandas data structures, such as DataFrame or Series objects. This method allows you to export your data to CSV format.

Example

Here is an example demonstrating how to write a Pandas DataFrame to a CSV file using the DataFrame.to_csv() method.

import pandas as pd

# dictionary of lists
d = {'Car': ['BMW', 'Lexus', 'Audi', 'Mercedes', 'Jaguar', 'Bentley'],
'Date_of_purchase': ['2024-10-10', '2024-10-12', '2024-10-17', '2024-10-16', '2024-10-19', '2024-10-22']}

# creating dataframe from the above dictionary of lists
dataFrame = pd.DataFrame(d)
print("Original DataFrame:\n",dataFrame)

# write the DataFrame to a CSV file
dataFrame.to_csv("Output_written_CSV_File.csv")

# confirm that the output csv has been written
print("The output csv file written successfully...")

Output

Following is the output of the above code −

Original DataFrame:
         Car Date_of_purchase
0       BMW       2024-10-10
1     Lexus       2024-10-12
2      Audi       2024-10-17
3  Mercedes       2024-10-16
4    Jaguar       2024-10-19
5   Bentley       2024-10-22
The output csv file written successfully...

If you visit your working directory after executing the above code, you can see the created CSV file named Output_written_CSV_File.csv.

Handling Large CSV Files in Pandas

When working with large CSV files, loading the entire file may cause memory issues. Pandas provides options like the chunksize parameter of the pandas.read_csv() function to process such files efficiently in smaller chunks.

Example

Below is an example that initially creates a large CSV file using the DataFrame.to_csv() method with random integers and then processes it in chunks using the Pandas read_csv() function.

import pandas as pd
import numpy as np

# Generate a DataFrame with random integers
data = np.random.randint(0, 100, size=(1000, 5))
column_names = [f"Col_{i}" for i in range(1, 5 + 1)]

# Create a DataFrame and save it as a CSV file
large_csv_file = "large_file.csv"
df = pd.DataFrame(data, columns=column_names)
df.to_csv(large_csv_file, index=False)
print("Large CSV file is created successfully.\n")

# Read large CSV file in chunks
chunk_size = 200

print("Output CSV data in chunks:")
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
   print('Data in chunks:') 
   print(chunk.head(2))

Output

While executing the above code we obtain the following output −

Large CSV file is created successfully.

Output CSV data in chunks:
Data in chunks:
   Col_1  Col_2  Col_3  Col_4  Col_5
0     56     62     66     83     25
1     49     72     93      6     55
Data in chunks:
     Col_1  Col_2  Col_3  Col_4  Col_5
200     59     50      4     15      4
201     53     38     14     66     52
Data in chunks:
     Col_1  Col_2  Col_3  Col_4  Col_5
400     99     67     69     69     63
401     65     53     70     38     58
Data in chunks:
     Col_1  Col_2  Col_3  Col_4  Col_5
600     90     43     79     29      1
601     31     96     77     75     47
Data in chunks:
     Col_1  Col_2  Col_3  Col_4  Col_5
800      1     94     46     14     31
801     17     75     55      5      6

Alternatives to Reading CSV Files

In addition to the pandas.read_csv() function, Pandas provides an alternative method for reading CSV data using pandas.read_table() function.

The pandas.read_table() function is used to read general delimited files such as CSV, TSV, or other delimiter-separated formats into a Pandas DataFrame. It is a good alternative for loading CSV files, and it easily handles various delimiters using the sep parameter. Additionally, this function supports iterating or breaking of the file into chunks.

Example

This example shows an alternative way to load CSV data into the Pandas DataFrame using the pd.read_table() function. Here you need to specify the delimiter with the sep parameter to read comma-separated values (CSV).

import pandas as pd

url ="https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master/blood_pressure.csv"

# read CSV into a Pandas DataFrame using the read_table() function
df = pd.read_table(url, sep=',')

print(df.head(5))

Output

Following is the output of the above code −

   
   patient   sex agegrp  bp_before  bp_after
0        1  Male  30-45        143       153
1        2  Male  30-45        163       170
2        3  Male  30-45        153       168
3        4  Male  30-45        153       142
4        5  Male  30-45        146       141
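As noted above, read_table() also supports chunked reading through the chunksize parameter, which returns an iterator of DataFrames. A minimal sketch using an in-memory buffer instead of a URL:

```python
import pandas as pd
from io import StringIO

data = """a,b
1,2
3,4
5,6
7,8"""

# chunksize yields DataFrames of at most 2 rows each
for chunk in pd.read_table(StringIO(data), sep=',', chunksize=2):
   print(chunk)
```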

Python Pandas - Reading and Writing JSON Files

JSON (JavaScript Object Notation) is a lightweight, human-readable data-interchange format widely used for data storage and transfer. It is widely used for transmitting data between a server and a web application. Python's Pandas library provides robust functionalities for reading and writing JSON files efficiently using the read_json() and to_json() methods.

A JSON file stores data in a structured format that looks similar to a dictionary or a list in Python. A JSON file has a .json extension. The data in a JSON file looks like this −

[
    {
        "Name": "Braund",
        "Gender": "Male",
        "Age": 30
    },
    {
        "Name": "Cumings",
        "Gender": "Female",
        "Age": 25
    },
    {
        "Name": "Heikkinen",
        "Gender": "female",
        "Age": 35
    }
]

In this tutorial, we will learn the basics of working with JSON files using Pandas, including reading and writing JSON files and some common configurations.

Reading JSON Files with Pandas

The pandas.read_json() function is used to read JSON data into a Pandas DataFrame. This function can take a file path, URL, or JSON string as input.

Example

The following example demonstrates how to read JSON data using the pandas.read_json() function. Here we are using the StringIO to load the JSON string into a file-like object.

import pandas as pd
from io import StringIO

# Create a string representing JSON data
data = """[
    {"Name": "Braund", "Gender": "Male", "Age": 30},
    {"Name": "Cumings", "Gender": "Female", "Age": 25},
    {"Name": "Heikkinen", "Gender": "Female", "Age": 35}
]"""

# Use StringIO to convert the JSON formatted string data into a file-like object
obj = StringIO(data)

# Read JSON into a Pandas DataFrame
df = pd.read_json(obj)

print(df)

Output

Following is the output of the above code −

        Name  Gender  Age
0     Braund    Male   30
1    Cumings  Female   25
2  Heikkinen  Female   35

Writing JSON Files with Pandas

Pandas provides the to_json() method to export or write a JSON file from a Pandas DataFrame or Series object. This method converts a Pandas data structure into a JSON string and offers multiple configuration options for customizing the JSON output.

Example: Basic Example of writing a JSON file

Here is an example demonstrating how to write a Pandas DataFrame to a JSON file.

import pandas as pd

# Create a DataFrame from a dictionary of lists
df = pd.DataFrame({"Name":["Braund", "Cumings", "Heikkinen"], 
"Gender": ["Male", "Female", "Female"],
"Age": [30, 25, 25]})
print("Original DataFrame:\n", df)    

# Write DataFrame to a JSON file
df.to_json("output_written_json_file.json", orient='records', lines=True)

print("The output JSON file has been written successfully.")

Output

Following is the output of the above code −

Original DataFrame:
         Name  Gender  Age
0     Braund    Male   30
1    Cumings  Female   25
2  Heikkinen  Female   25
The output JSON file has been written successfully.

After executing the above code, you can find the created JSON file named output_written_json_file.json in your working directory.
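Since the file above was written with orient='records' and lines=True (the JSON Lines layout, one record per line), it can be read back by passing lines=True to read_json(). A sketch using an in-memory string:

```python
import pandas as pd
from io import StringIO

# JSON Lines: one JSON object per line
jsonl = """{"Name": "Braund", "Gender": "Male", "Age": 30}
{"Name": "Cumings", "Gender": "Female", "Age": 25}"""

df = pd.read_json(StringIO(jsonl), lines=True)
print(df)
```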

Example: Writing a JSON file using the split orientation

The following example writes a simple DataFrame object into JSON using the split orientation.

import pandas as pd
from json import loads, dumps

# Create a DataFrame
df = pd.DataFrame(
    [["x", "y"], ["z", "w"]],
    index=["row_1", "row_2"],
    columns=["col_1", "col_2"],
)

# Convert DataFrame to JSON with 'split' orientation
result = df.to_json(orient="split")
parsed = loads(result)

# Display the JSON output
print("JSON Output (split orientation):")
print(dumps(parsed, indent=4))

Output

Following is the output of the above code −

JSON Output (split orientation):
{
    "columns": [
        "col_1",
        "col_2"
    ],
    "index": [
        "row_1",
        "row_2"
    ],
    "data": [
        [
            "x",
            "y"
        ],
        [
            "z",
            "w"
        ]
    ]
}

Python Pandas - Reading Data from an Excel File

The Pandas library provides powerful tools for data manipulation and analysis. Among its many features, it offers the ability to read and write data to Excel files easily. Excel files are widely used to store and organize data in tabular format, making them an excellent choice for analysis. Excel files can have multiple sheets, each containing rows and columns of data, and usually come with extensions like .xls or .xlsx.

Pandas provides easy-to-use methods for working with Excel data directly in Python. One such method is read_excel(), which reads Excel files and loads the data into a Pandas DataFrame.

In this tutorial, we will learn how to read data from Excel files using the pandas.read_excel() method, covering different scenarios like loading a single sheet, specific sheets, and multiple sheets.

Supported Excel File Formats in Pandas

Pandas uses different backends to read various Excel file formats −

  • The Excel 2007+ (.xlsx) files can be read using the openpyxl Python module.

  • The Excel 2003 (.xls) files can be read using the xlrd module.

  • The Binary Excel (.xlsb) files can be read using the pyxlsb module.

  • Additionally, all formats can be read using the Calamine engine.

Note: Please make sure that you have installed the required packages (xlrd and openpyxl) in your system. If these packages are not installed, use the following commands to install them −
pip3 install openpyxl 
or
pip3 install xlrd

Reading an Excel File in Pandas

The pandas.read_excel() method reads Excel files and loads the data into a Pandas DataFrame. This method supports multiple Excel file formats like .xls, .xlsx, .xlsm, and more, from a local filesystem or a URL.

Example

Here is a simple example of reading a local system Excel file into a DataFrame using the pandas.read_excel() method by specifying the file path.

import pandas as pd

# Read an Excel file
df = pd.read_excel('data.xlsx')

# Print the DataFrame
print('Output DataFrame:')
print(df)

Output

Following is the output of the above code −

Output DataFrame:
       Name  Gender  Age
0    Braund  female   38
1   Cumings    male   22
2   Heikkin  female   35
3  Futrelle  female   26

Reading a Specific Sheet from an Excel file

The Excel files may contain multiple sheets with different names. To read a specific sheet into a Pandas DataFrame, you can specify the sheet name or index to the sheet_name parameter of the pandas.read_excel() method.

Example

The following example demonstrates how to read a specific sheet from an Excel file into a Pandas DataFrame using the pandas.read_excel() method. Here we pass the sheet name to the sheet_name parameter to read that specific sheet.

import pandas as pd

# Read a specific sheet
df = pd.read_excel('data.xlsx', sheet_name="Sheet_2")

# Print the DataFrame
print('Output DataFrame:')
print(df)

Following is the output of the above code −

Output DataFrame:
      Name  Value
0  string1      1
1  string2      2
2  comment      5

Reading Multiple Sheets as a Dictionary of DataFrames

If an Excel file contains multiple sheets and you need to read a few of them into Pandas DataFrames, you can pass a list of sheet names or indices to the sheet_name parameter of the pandas.read_excel() method.

Example

This example uses the pandas.read_excel() method to read multiple sheets of an Excel file into a dictionary of DataFrames.

import pandas as pd

# Read multiple sheets
df = pd.read_excel('data.xlsx', sheet_name=[0, 1])

# Print the DataFrame
print('Output Dict of DataFrames:')
print(df)

Following is the output of the above code −

Output Dict of DataFrames:
{0:        Name  Gender  Age
0    Braund  female   38
1   Cumings    male   22
2   Heikkin  female   35
3  Futrelle  female   26, 1:       Name  Value
0  string1      1
1  string2      2
2  comment      5}
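To load every sheet at once, you can pass sheet_name=None, which returns a dictionary keyed by sheet name. A self-contained sketch that first creates a two-sheet workbook (the file name is arbitrary, and the openpyxl package is assumed to be installed):

```python
import pandas as pd

# Build a small two-sheet workbook to read back
with pd.ExcelWriter('all_sheets_demo.xlsx') as writer:
   pd.DataFrame({'A': [1, 2]}).to_excel(writer, sheet_name='First', index=False)
   pd.DataFrame({'B': [3, 4]}).to_excel(writer, sheet_name='Second', index=False)

# sheet_name=None loads every sheet into a dict of DataFrames
sheets = pd.read_excel('all_sheets_demo.xlsx', sheet_name=None)
print(list(sheets.keys()))
```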

Reading MultiIndex Data from Excel

You can read MultiIndexed data from an Excel file using the pandas.read_excel() method, which is useful for working with hierarchical data. Specify a list of columns for index_col and a list of rows for header to handle MultiIndex rows and columns, respectively.

Example

This example uses the to_excel() and pandas.read_excel() methods to create an Excel file with MultiIndexed data and read it back into a Pandas DataFrame, respectively.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')])

# Create a DataFrame
data = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

df.to_excel("multiindex_data.xlsx")

# Read MultiIndex rows and columns
df = pd.read_excel("multiindex_data.xlsx", index_col=[0, 1])

print('Output DataFrame from Excel File:')
print(df)

Following is the output of the above code −

Output DataFrame from Excel File:
       X  Y
A one  1  2
  two  3  4
B one  5  6
  two  7  8

Python Pandas - Writing Data to Excel Files

Pandas is a data analysis library in Python, which is widely used for working with structured data from various formats including CSV, SQL, and Excel files. One of the key features of this library is that it allows you to easily export data from Pandas DataFrames and Series directly into Excel spreadsheets using the to_excel() method.

The to_excel() method in Pandas allows you to export the data from a DataFrame or Series into an Excel file. This method provides the flexibility in specifying various parameters such as file path, sheet name, formatting options, and more.

In the previous tutorial, we learned about Reading Excel Files with Pandas, now we will learn how to write Pandas data to Excel files in Python using Pandas. This complete guide will discuss the to_excel() method, exporting multiple sheets, appending data, and memory-based operations with examples.

Overview of The Pandas to_excel() Method

The Pandas to_excel() method is used to write a DataFrame or Series to an Excel file. It allows you to specify various configurations such as the sheet name, columns to write, and more.

Following is the syntax of this method −

DataFrame.to_excel(excel_writer, *, sheet_name='Sheet1', na_rep='', columns=None, header=True, index=True, ...)

Key parameters are −

  • excel_writer: This represents the path to the Excel file or an existing ExcelWriter object. The ExcelWriter object is used when writing multiple sheets to a file.

  • sheet_name: This specifies the name of the sheet where the DataFrame will be written. By default, it's set to 'Sheet1'.

  • na_rep: A string to represent missing data in the Excel file. Default is an empty string.

  • columns: A sequence or list of column names to write.

Writing a Single DataFrame to an Excel

By simply calling the DataFrame.to_excel() method with the Excel file name, and an optional sheet name, you can directly export the contents of the Pandas DataFrame object into a sheet of an Excel file.

Example

Here is a basic example of writing the contents of a Pandas DataFrame to an Excel file using the DataFrame.to_excel() method.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame([[5, 2], [4, 1]],index=["One", "Two"],columns=["Rank", "Subjects"])

# Display the DataFrame
print("DataFrame:\n", df)

# Export DataFrame to Excel
df.to_excel('Basic_example_output.xlsx')

print('The Basic_example_output.xlsx file is saved successfully..')

Following is the output of the above code −

DataFrame:
      Rank  Subjects
One     5         2
Two     4         1
The Basic_example_output.xlsx file is saved successfully..

Note: After executing each code, you can find the generated output files in your working directory.

Exporting Multiple DataFrames to Different Sheets

Writing multiple DataFrames to different sheets within the same Excel file is possible using the ExcelWriter class.

Example

The following example writes multiple DataFrames to different sheets within the same Excel file using the ExcelWriter class and the to_excel() method.

import pandas as pd

df1 = pd.DataFrame(
   [[5, 2], [4, 1]],
   index=["One", "Two"],
   columns=["Rank", "Subjects"]
)

df2 = pd.DataFrame(
   [[15, 21], [41, 11]],
   index=["One", "Two"],
   columns=["Rank", "Subjects"]
)

print("DataFrame 1:\n", df1)
print("DataFrame 2:\n", df2)

with pd.ExcelWriter('output_multiple_sheets.xlsx') as writer:
   df1.to_excel(writer, sheet_name='Sheet_name_1')
   df2.to_excel(writer, sheet_name='Sheet_name_2')

print('The output_multiple_sheets.xlsx file is saved successfully..')

Following is the output of the above code −

DataFrame 1:
      Rank  Subjects
One     5         2
Two     4         1
DataFrame 2:
      Rank  Subjects
One    15        21
Two    41        11
The output_multiple_sheets.xlsx file is saved successfully..

Appending Data to an Existing Excel File

Appending the contents of a DataFrame to an existing Excel file is possible by using ExcelWriter with mode='a' (append mode is supported by the openpyxl engine). The ExcelWriter object opens the existing Excel file in append mode and allows you to add new sheets to it.

Example

The following example demonstrates how to append the contents of a DataFrame to the existing Excel file.

import pandas as pd

# Create a new DataFrame
df3 = pd.DataFrame([[51, 11], [21, 38]],index=["One", "Two"],columns=["Rank", "Subjects"])

# Append the DataFrame to an existing Excel file
with pd.ExcelWriter('output_multiple_sheets.xlsx', mode='a') as writer:
    df3.to_excel(writer, sheet_name='Sheet_name_3', index=False)

print('The output_multiple_sheets.xlsx file is saved successfully with the appended sheet..')

Following is the output of the above code −

The output_multiple_sheets.xlsx file is saved successfully with the appended sheet..

Writing Excel Files to Memory Using Pandas

Writing Excel files to memory (buffer-like objects) instead of saving them to disk is possible by using BytesIO along with the to_excel() method or the ExcelWriter class. Because Excel files are binary, a bytes buffer like BytesIO is used rather than StringIO.

Example

The following example demonstrates how to write an Excel file to an in-memory object using BytesIO and the to_excel() method.

import pandas as pd
from io import BytesIO

df = pd.DataFrame(
[[5, 2], [4, 1]],
index=["One", "Two"],
columns=["Rank", "Subjects"])

print("Input DataFrame :\n", df)

# Create a BytesIO object
bio = BytesIO()

# Write the DataFrame to the BytesIO buffer
df.to_excel(bio, sheet_name='Sheet1')

# Get the Excel file from memory
bio.seek(0)
excel_data = bio.read()

print('\nThe Excel file is saved in memory successfully..')

Following is the output of the above code −

Input DataFrame :
      Rank  Subjects
One     5         2
Two     4         1

The Excel file is saved in memory successfully..

Choosing an Excel Writer Engine in Pandas

Pandas supports multiple engines for writing Excel files, such as openpyxl and xlsxwriter. You can specify the engine explicitly using the engine parameter of the DataFrame.to_excel() method. Make sure the required engine is installed on your system.

Example

This example demonstrates saving an Excel file with a specified engine using the engine parameter of the DataFrame.to_excel() method.

import pandas as pd

df = pd.DataFrame(
   [[5, 2], [4, 1]],
   index=["One", "Two"],
   columns=["Rank", "Subjects"]
)

# Write DataFrame using xlsxwriter engine
df.to_excel('output_xlsxwriter.xlsx', sheet_name='Sheet1', engine='xlsxwriter')

print('The output_xlsxwriter.xlsx is saved successfully..')

Following is the output of the above code −

The output_xlsxwriter.xlsx is saved successfully..

Python Pandas - Working with HTML Data

The Pandas library provides extensive functionalities for handling data from various formats. One such format is HTML (HyperText Markup Language), which is a commonly used format for structuring web content. The HTML files may contain tabular data, which can be extracted and analyzed using the Pandas library.

An HTML table is a structured format used to represent tabular data in rows and columns within a webpage. Extracting this tabular data from HTML is possible using the pandas.read_html() function. Writing a Pandas DataFrame back to an HTML table is also possible, using the DataFrame.to_html() method.

In this tutorial, we will learn how to work with HTML data using Pandas, including reading HTML tables and writing Pandas DataFrames to HTML tables.

Reading HTML Tables from a URL

The pandas.read_html() function is used for reading tables from HTML files, strings, or URLs. It automatically parses <table> elements in HTML and returns a list of pandas.DataFrame objects.

Example

Here is a basic example of reading HTML tables from a URL using the pandas.read_html() function.

import pandas as pd

# Read HTML table from a URL
url = "https://www.tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url)

# Access the first table from the URL
df = tables[0]

# Display the resultant DataFrame
print('Output First DataFrame:', df.head())

Output

Following is the output of the above code −

Output First DataFrame:    ID      NAME  AGE    ADDRESS  SALARY
0   1    Ramesh   32  Ahmedabad  2000.0
1   2    Khilan   25      Delhi  1500.0
2   3   Kaushik   23       Kota  2000.0
3   4  Chaitali   25     Mumbai  6500.0
4   5    Hardik   27     Bhopal  8500.0

Reading HTML Data from a String

Reading HTML data directly from a string is possible using Python's io.StringIO module.

Example

The following example demonstrates how to read the HTML string using StringIO without saving to a file.

import pandas as pd
from io import StringIO

# Create an HTML string
html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Read the HTML string
dfs = pd.read_html(StringIO(html_str))
print(dfs[0])

Output

Following is the output of the above code −

  C1 C2 C3
0  a  b  c
1  x  y  z

Example

This is an alternative way of reading the HTML string without using the io.StringIO module. Here we save the HTML string to a temporary file and read it using the pandas.read_html() function.

import pandas as pd

# Create an HTML string
html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Save to a temporary file and read
with open("temp.html", "w") as f:
    f.write(html_str)

df = pd.read_html("temp.html")[0]
print(df)

Output

Following is the output of the above code −

  C1 C2 C3
0  a  b  c
1  x  y  z

Handling Multiple Tables from an HTML file

When an HTML file contains multiple tables, you can use the match parameter of the pandas.read_html() function to read only the tables containing specific text.

Example

The following example uses the match parameter to read a table containing specific text from an HTML page that has multiple tables.

import pandas as pd

# Read tables from a SQL tutorial
url = "https://www.tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url, match='Field')

# Access the table
df = tables[0]
print(df.head())

Output

Following is the output of the above code −

     Field           Type Null  Key  Default  Extra
0       ID        int(11)   NO  PRI      NaN    NaN
1     NAME    varchar(20)   NO  NaN      NaN    NaN
2      AGE        int(11)   NO  NaN      NaN    NaN
3  ADDRESS       char(25)  YES  NaN      NaN    NaN
4   SALARY  decimal(18,2)  YES  NaN      NaN    NaN

Writing DataFrames to HTML

Pandas DataFrame objects can be converted to HTML tables using the DataFrame.to_html() method. This method returns a string if the parameter buf is set to None.

Example

The following example demonstrates how to write a Pandas DataFrame to an HTML Table using the DataFrame.to_html() method.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# Convert the DataFrame to HTML table
html = df.to_html()

# Display the HTML string
print(html)

Output

Following is the output of the above code −

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>A</th>
      <th>B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>2</td>
    </tr>
    <tr>
      <th>1</th>
      <td>3</td>
      <td>4</td>
    </tr>
  </tbody>
</table>
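The to_html() method also accepts formatting options; for instance, index=False omits the row labels from the generated markup. A short sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# index=False drops the index column from the HTML table
html = df.to_html(index=False)
print(html)
```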

Python Pandas - Clipboard

Copying and pasting data between different applications is a common task in data analysis. Here, the clipboard acts as a temporary data buffer used to store short-term data and transfer it between applications like Excel, text editors, and Python scripts. The Pandas library provides easy tools to work with the system clipboard −

  • read_clipboard(): Reads clipboard data and converts it into a Pandas DataFrame.

  • to_clipboard(): Copies a DataFrame to the clipboard for pasting elsewhere.

These methods make it easy to transfer data between Pandas data structures and other applications like Excel, text editors, or any tool that supports copy-paste functionality.

In this tutorial, we will learn about how to use the Pandas read_clipboard() and to_clipboard() methods effectively.

Note: If you get a pandas.errors.PyperclipException error on Linux, you may need to install the xclip or xsel utility to enable clipboard functionality. Windows and macOS generally do not require additional packages.

Reading Clipboard Data using read_clipboard()

The pandas.read_clipboard() method is used to directly import data from your system clipboard into a Pandas DataFrame. This method parses the clipboard data similarly to how CSV data is parsed using the pandas.read_csv() method.

The syntax of the pandas.read_clipboard() method is as follows −

pandas.read_clipboard(sep='\\s+', dtype_backend=<no_default>, **kwargs)

Key parameters −

  • sep: Defines the string delimiter. By default it is set to '\s+', which matches one or more whitespace characters.

  • dtype_backend: Selects the back-end data type. "numpy_nullable" returns a DataFrame backed by nullable dtypes, and "pyarrow" returns a DataFrame backed by pyarrow's nullable ArrowDtype (introduced in Pandas 2.0).

  • **kwargs: Additional keyword arguments passed to read_csv() to fine-tune the data reading.

Example

Here is a basic example of using the pandas.read_clipboard() method to generate a DataFrame from copied data. In this example, we first place the data on the clipboard using the to_clipboard() method.

import pandas as pd

# Creating a sample DataFrame
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])

# Copy DataFrame to clipboard
df.to_clipboard()

# Read data from clipboard
clipboard_df = pd.read_clipboard()

# Display the DataFrame
print('DataFrame from clipboard:')
print(clipboard_df)

Output

When we run the above program, it produces the following result −

DataFrame from clipboard:
   A  B  C
0  1  2  3
1  4  5  6

Reading Tabular Data from Clipboard

When clipboard data includes row and column labels, read_clipboard() automatically detects and converts it into a structured DataFrame.

Example

The following example demonstrates how to use the pandas.read_clipboard() method to generate a DataFrame from the copied tabular data.

First, copy the following data to your clipboard using the Ctrl+C (Windows/Linux) or Cmd+C (macOS) keyboard shortcut.

C1 C2 C3
X 1 2 3
Y 4 5 6
Z a b c

Then run the following code −

import pandas as pd

# Read clipboard content into a DataFrame
df = pd.read_clipboard()
print(df)

Output

Following is the output of the above code −

  C1 C2 C3
X  1  2  3
Y  4  5  6
Z  a  b  c

Reading Non-Tabular Data from Clipboard

When the clipboard holds non-tabular data with a specific delimiter, you can use the sep parameter of the read_clipboard() method to read that data into a Pandas DataFrame.

Example

Below is an example that demonstrates how to read non-tabular clipboard data into a Pandas DataFrame using the pandas.read_clipboard() method.

Copy the following data to your clipboard, then run the program below −

Python,Pandas,Clipboard,DataFrame
import pandas as pd

# Read clipboard content into a DataFrame
df = pd.read_clipboard(sep=',', header=None)
print(df)

Output

Following is the output of the above code −

        0       1          2          3
0  Python  Pandas  Clipboard  DataFrame

Writing Data to Clipboard with to_clipboard()

The to_clipboard() method is used to write the content of a DataFrame or Series object to the system clipboard. This makes it easy to paste data into other applications, such as Excel or text editors.

Following is the syntax of the to_clipboard() method −

DataFrame.to_clipboard(*, excel=True, sep=None, **kwargs)

Parameters

  • excel: A boolean parameter. If set to True (the default), formats the DataFrame as CSV for easy pasting into Excel. If False, writes the DataFrame's string representation to the clipboard.

  • sep: Defines the field delimiter. If sep=None, it defaults to a tab (\t) delimiter.

  • **kwargs: Any additional arguments are passed to DataFrame.to_csv().

Example

Here is an example of copying a DataFrame to the clipboard using DataFrame.to_clipboard() and pasting it elsewhere, such as into a text editor.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    "C1": [1, 2, 3], 
    "C2": [4, 5, 6], 
    "C3": ["a", "b", "c"]
}, index=["x", "y", "z"])

# Copies the DataFrame to the clipboard
df.to_clipboard(sep=',')
print('DataFrame is successfully copied to the clipboard. Please paste it into any text editor or Excel sheet.')

Output

Following is the output of the above code −

DataFrame is successfully copied to the clipboard. 
Please paste it into any text editor or Excel sheet.
,C1,C2,C3
x,1,4,a
y,2,5,b
z,3,6,c

Python Pandas - Working with HDF5 Format

When working with large datasets, we may get "out of memory" errors. These types of problems can be avoided by using an optimized storage format like HDF5. The pandas library offers tools like the HDFStore class and read/write APIs to easily store, retrieve, and manipulate data while optimizing memory usage and retrieval speed.

HDF5, which stands for Hierarchical Data Format version 5, is an open-source file format designed to store large, complex, and heterogeneous data efficiently. It organizes data in a hierarchical structure similar to a file system, with groups acting like directories and datasets functioning as files. An HDF5 file can store different types of data (such as arrays, images, tables, and documents) in this structure, making it ideal for managing heterogeneous data.

Creating an HDF5 file using HDFStore in Pandas

The HDFStore class in pandas is a dictionary-like object that reads and writes Pandas data in the HDF5 format using the PyTables library.

Example

Here is an example demonstrating how to create an HDF5 file in Pandas using the pandas.HDFStore class.

import pandas as pd
import numpy as np

# Create the store using the HDFStore class
store = pd.HDFStore("store.h5")

# Display the store
print(store)

# It is important to close the store after use
store.close()

Output

Following is the output of the above code −

<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

Note: To work with HDF5 format in pandas, you need the pytables library. It is an optional dependency for pandas and must be installed separately using one of the following commands −

# Using pip3
pip3 install tables

# or using conda installer
conda install pytables

Writing and Reading Data with HDFStore in Pandas

The HDFStore is a dict-like object, so we can write data to and read data from the HDF5 store directly using key-value pairs.

Example

The below example demonstrates how to write and read data to and from the HDF5 file using the HDFStore in Pandas.

import pandas as pd
import numpy as np

# Create the store
store = pd.HDFStore("store.h5")

# Create the data 
index = pd.date_range("1/1/2024", periods=8)
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

# Write Pandas data to the Store, which is equivalent to store.put('s', s)
store["s"] = s  
store["df"] = df

# Read Data from the store, which is equivalent to store.get('df')
from_store = store["df"]
print('Retrieved Data From the HDFStore:\n',from_store)

# Close the store after use
store.close()

Output

Following is the output of the above code −

Retrieved Data From the HDFStore:
                    A         B         C
2024-01-01  0.553352  0.113960 -1.874169
2024-01-02  0.017491 -1.790975 -0.036893
2024-01-03  1.927842 -0.945139  0.158321
2024-01-04 -0.427214  1.663949 -0.830819
2024-01-05  0.810059  0.627320  1.574442
2024-01-06  1.351602 -0.920513 -0.321486
2024-01-07 -0.776996  0.068870  0.432984
2024-01-08  0.779704 -0.603734  0.301488

Read and write HDF5 Format Using Pandas APIs

Pandas also provides high-level APIs to simplify interaction with HDFStore (i.e., HDF5 files). These APIs allow you to read and write data directly to and from HDF5 files without needing to manually create an HDFStore object. Following are the primary APIs for handling HDF5 files in pandas −

Writing Pandas Data to HDF5 Using to_hdf()

The to_hdf() function allows you to write pandas objects such as DataFrames and Series directly to an HDF5 file using the HDFStore. This function provides various optional parameters like compression, handling missing values, format options, and more, allowing you to store your data efficiently.

Example

This example uses the DataFrame.to_hdf() function to write data to the HDF5 file.

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},index=['x', 'y', 'z']) 

# Write data to an HDF5 file using the to_hdf()
df.to_hdf("data_store.h5", key="df", mode="w", format="table")

print("Data successfully written to HDF5 file")

Output

Following is the output of the above code −

Data successfully written to HDF5 file

Reading Data from HDF5 Using read_hdf()

The pandas.read_hdf() method is used to retrieve a Pandas object stored in an HDF5 file. It accepts a file name, file path, or buffer from which the data is read.

Example

This example demonstrates how to read data stored under the key "df" from the HDF5 file "data_store.h5" using the pd.read_hdf() method.

import pandas as pd

# Read data from the HDF5 file using the read_hdf()
retrieved_df = pd.read_hdf("data_store.h5", key="df")

# Display the retrieved data
print("Retrieved Data:\n", retrieved_df.head())

Output

Following is the output of the above code −

Retrieved Data:
    A  B
x  1  4
y  2  5
z  3  6

Appending Data to HDF5 Files Using to_hdf()

You can append data to an existing HDF5 file by using the mode="a" option of the to_hdf() function. This is useful when you want to add new data to a file without overwriting the existing content.

Example

This example demonstrates how to append data to an existing HDF5 file using the to_hdf() function.

import pandas as pd
import numpy as np

# Create a DataFrame to append
df_new = pd.DataFrame({'A': [7, 8], 'B': [1, 1]},index=['i', 'j'])

# Append the new data to the existing HDF5 file
df_new.to_hdf("data_store.h5", key="df", mode="a", format="table", append=True)

print("Data successfully appended")

# Now read data from the HDF5 file using the read_hdf()
retrieved_df = pd.read_hdf("data_store.h5", key='df')

# Display the retrieved data
print("Retrieved Data:\n", retrieved_df.head())

Output

Following is the output of the above code −

Data successfully appended
Retrieved Data:
    A  B
x  1  4
y  2  5
z  3  6
i  7  1
j  8  1

Python Pandas - Comparison with SQL

Pandas is a powerful Python library for data manipulation and analysis, widely used in data science and engineering. Many potential Pandas users come from a background in SQL, a language designed for managing and querying relational databases. Understanding how to perform SQL-like operations using Pandas can significantly ease the transition and enhance productivity.

This tutorial provides a side-by-side comparison of common SQL operations and their equivalents in Pandas, using the popular "tips" dataset.

Importing the Necessary Libraries

Before we dive into the comparison, let's start by importing the necessary libraries.

import pandas as pd
import numpy as np

We will also load the "tips" dataset, which will be used throughout this tutorial.

import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'

tips = pd.read_csv(url)
print(tips.head())

Its output is as follows −

    total_bill   tip      sex  smoker  day     time  size
0        16.99  1.01   Female      No  Sun  Dinner      2
1        10.34  1.66     Male      No  Sun  Dinner      3
2        21.01  3.50     Male      No  Sun  Dinner      3
3        23.68  3.31     Male      No  Sun  Dinner      2
4        24.59  3.61   Female      No  Sun  Dinner      4

Selecting Columns

In SQL, the SELECT statement is used to retrieve specific columns from a table. Selection is done using a comma-separated list of the columns you want (or a * to select all columns) −

SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;

In Pandas, you can achieve the same result by selecting columns from a DataFrame using a list of column names −

tips[['total_bill', 'tip', 'smoker', 'time']].head(5)

Example

Let's check the full program of displaying the first five rows of the selected columns −

import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'
 
tips = pd.read_csv(url)
print(tips[['total_bill', 'tip', 'smoker', 'time']].head(5))

Its output is as follows −

   total_bill   tip  smoker     time
0       16.99  1.01      No   Dinner
1       10.34  1.66      No   Dinner
2       21.01  3.50      No   Dinner
3       23.68  3.31      No   Dinner
4       24.59  3.61      No   Dinner

Calling the DataFrame without a list of column names displays all columns (akin to SQL's *).

Filtering Rows

In SQL, the WHERE clause is used to filter records based on specific conditions.

SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;

DataFrames can be filtered in multiple ways, the most intuitive of which is Boolean indexing.

tips[tips['time'] == 'Dinner'].head(5)

Example

Let's check the full program of displaying the first five records where the time is equal to 'Dinner' −

import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'

tips = pd.read_csv(url)
print(tips[tips['time'] == 'Dinner'].head(5))

Its output is as follows −

   total_bill   tip      sex  smoker  day    time  size
0       16.99  1.01   Female     No   Sun  Dinner    2
1       10.34  1.66     Male     No   Sun  Dinner    3
2       21.01  3.50     Male     No   Sun  Dinner    3
3       23.68  3.31     Male     No   Sun  Dinner    2
4       24.59  3.61   Female     No   Sun  Dinner    4

The above statement passes a Series of True/False values to the DataFrame, returning all rows where the value is True.
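Just as SQL combines conditions with AND and OR, Boolean masks can be combined with & and |, with each condition wrapped in parentheses. Below is a small sketch using an inline stand-in for the tips data (the values are made up for illustration):

```python
import pandas as pd

# A small stand-in for the tips dataset
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "time": ["Dinner", "Lunch", "Dinner", "Dinner"],
})

# SQL: SELECT * FROM tips WHERE time = 'Dinner' AND total_bill > 20;
result = tips[(tips["time"] == "Dinner") & (tips["total_bill"] > 20)]
print(result)
```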

Grouping Data

SQL's GROUP BY clause is used to group rows that have the same values in specified columns and perform aggregate functions on them. For example, to count the number of tips left by each gender −

SELECT sex, count(*)
FROM tips
GROUP BY sex;

In Pandas, the groupby() method is used to achieve the same result −

tips.groupby('sex').size()

Example

Let's check the full program of displaying the count of tips grouped by gender −

import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'

tips = pd.read_csv(url)
print(tips.groupby('sex').size())

Its output is as follows −

sex
Female   87
Male    157
dtype: int64
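SQL queries that apply several aggregates at once, such as AVG and COUNT per group, map to the agg() method on a grouped column. The following sketch uses a small inline stand-in for the tips data (the values are made up for illustration):

```python
import pandas as pd

# A small stand-in for the tips dataset
tips = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "tip": [1.01, 1.66, 3.50, 3.31],
})

# SQL: SELECT sex, AVG(tip), COUNT(*) FROM tips GROUP BY sex;
result = tips.groupby("sex")["tip"].agg(["mean", "count"])
print(result)
```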

Limiting the Number of Rows

In SQL, the LIMIT clause is used to limit the number of rows returned by a query. For example −

SELECT * FROM tips
LIMIT 5;

In Pandas, the head() method is used to achieve this −

tips.head(5)

Example

Let's check the full example of displaying the first five rows of the DataFrame −

import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'

tips = pd.read_csv(url)
tips = tips[['smoker', 'day', 'time']].head(5)
print(tips)

Its output is as follows −

   smoker   day     time
0      No   Sun   Dinner
1      No   Sun   Dinner
2      No   Sun   Dinner
3      No   Sun   Dinner
4      No   Sun   Dinner
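SQL's OFFSET clause, often used together with LIMIT, has a Pandas counterpart in iloc slicing. A minimal sketch with a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

# SQL: SELECT * FROM df LIMIT 3 OFFSET 5;
result = df.iloc[5:8]
print(result)
```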

These are a few of the basic SQL operations and their Pandas equivalents, which we learnt in the previous chapters of the Pandas library.

Python Pandas - Sorting

Sorting is a fundamental operation when working with data in Pandas, whether you're organizing rows, columns, or specific values. Sorting can help you to arrange your data in a meaningful way for better understanding and easy analysis.

Pandas provides powerful tools for sorting your data efficiently, which can be done by labels or actual values. In this tutorial, we'll explore various methods for sorting data in Pandas, from basic sorting by index or column labels to more advanced techniques like sorting by multiple columns and choosing specific sorting algorithms.

Types of Sorting in Pandas

There are two kinds of sorting available in Pandas. They are −

  • Sorting by Label − This involves sorting the data based on the index labels.

  • Sorting by Value − This involves sorting data based on the actual values in the DataFrame or Series.

Sorting by Label

To sort by the index labels, you can use the sort_index() method. By passing the axis argument and the sort order, the data structure object can be sorted. By default, this method sorts the DataFrame in ascending order based on the row labels.

Example

Let's take a basic example demonstrating how to sort a DataFrame using the sort_index() method.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

print("Original DataFrame:\n", unsorted_df)

# Sort the DataFrame by labels
sorted_df=unsorted_df.sort_index()
print("\nOutput Sorted DataFrame:\n", sorted_df)

Its output is as follows −

Original DataFrame:
        col2      col1
1  1.116188  1.631727
4  0.287900 -1.097359
6  0.058885 -0.642273
2 -2.070172  0.148255
3 -1.458229  1.298907
5 -0.723663  2.220048
9 -1.271494  2.001025
8 -0.412954 -0.808688
0  0.922697 -0.429393
7 -0.476054 -0.351621

Output Sorted DataFrame:
        col2      col1
0  0.922697 -0.429393
1  1.116188  1.631727
2 -2.070172  0.148255
3 -1.458229  1.298907
4  0.287900 -1.097359
5 -0.723663  2.220048
6  0.058885 -0.642273
7 -0.476054 -0.351621
8 -0.412954 -0.808688
9 -1.271494  2.001025

Example − Controlling the Order of Sorting

By passing a Boolean value to the ascending parameter, the order of the sorting can be controlled. Let us consider the following example to understand the same.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

print("Original DataFrame:\n", unsorted_df)

# Sort the DataFrame index in descending order
sorted_df = unsorted_df.sort_index(ascending=False)
print("\nOutput Sorted DataFrame:\n", sorted_df)

Its output is as follows −

Original DataFrame:
        col2      col1
1 -0.668366  0.576422
4  0.605218 -0.066065
6  1.140478  0.236687
2  0.137617  0.312423
3 -0.055631  0.774057
5  0.108002  1.038820
9 -0.929134 -0.982358
8 -0.207542 -1.283386
0 -0.210571 -0.656371
7 -0.106388  0.672418

Output Sorted DataFrame:
        col2      col1
9 -0.929134 -0.982358
8 -0.207542 -1.283386
7 -0.106388  0.672418
6  1.140478  0.236687
5  0.108002  1.038820
4  0.605218 -0.066065
3 -0.055631  0.774057
2  0.137617  0.312423
1 -0.668366  0.576422
0 -0.210571 -0.656371

Example − Sort the Columns

By passing the axis argument a value of 0 or 1, the sorting can be done on the row or column labels. By default axis=0, which sorts by row labels; passing axis=1 sorts the column labels. Let us consider the following example to understand the same.

import pandas as pd
import numpy as np
 
unsorted_df = pd.DataFrame(np.random.randn(6,4),index=[1,4,2,3,5,0],columns = ['col2','col1', 'col4', 'col3'])

print("Original DataFrame:\n", unsorted_df)

# Sort the DataFrame columns
sorted_df=unsorted_df.sort_index(axis=1)
print("\nOutput Sorted DataFrame:\n", sorted_df)

Its output is as follows −

Original DataFrame:
        col2      col1      col4      col3
1 -0.828951 -0.798286 -1.794752 -0.082656
4  0.440243 -0.693218 -0.218277 -0.790168
2  1.017670  1.443679 -1.939119 -1.887223
3 -0.992471 -1.425046  0.651336 -0.278247
5 -0.103537 -0.879433  0.471838  0.860885
0 -0.222297  1.094805  0.501531 -0.580382

Output Sorted DataFrame:
        col1      col2      col3      col4
1 -0.798286 -0.828951 -0.082656 -1.794752
4 -0.693218  0.440243 -0.790168 -0.218277
2  1.443679  1.017670 -1.887223 -1.939119
3 -1.425046 -0.992471 -0.278247  0.651336
5 -0.879433 -0.103537  0.860885  0.471838
0  1.094805 -0.222297 -0.580382  0.501531

Sorting by Actual Values

Like index sorting, sorting by actual values can be done using the sort_values() method. This method allows sorting by one or more columns. It accepts a by argument that takes the name of the DataFrame column (or columns) by which the values are to be sorted.

Example − Sorting Series Values

The following example demonstrates how to sort a pandas Series object using the sort_values() method.

import pandas as pd

panda_series = pd.Series([18, 95, 66, 12, 55, 0])
print("Unsorted Pandas Series: \n", panda_series)

panda_series_sorted = panda_series.sort_values(ascending=True)
print("\nSorted Pandas Series: \n", panda_series_sorted)

On executing the above code you will get the following output −

Unsorted Pandas Series: 
 0    18
1    95
2    66
3    12
4    55
5     0
dtype: int64

Sorted Pandas Series: 
 5     0
3    12
0    18
4    55
2    66
1    95
dtype: int64

Example − Sorting DataFrame Values

The following example demonstrates the working of the sort_values() method on a DataFrame object.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,9,5,0],'col2':[1,3,2,4]})
print("Original DataFrame:\n", unsorted_df)

# Sort the DataFrame by values
sorted_df = unsorted_df.sort_values(by='col1')
print("\nOutput Sorted DataFrame:\n", sorted_df)

Its output is as follows −

Original DataFrame:
    col1  col2
0     2     1
1     9     3
2     5     2
3     0     4

Output Sorted DataFrame:
    col1  col2
3     0     4
0     2     1
2     5     2
1     9     3

Observe that the col1 values are sorted, and the respective col2 values and row indices move along with col1. Thus, they look unsorted.

Example − Sorting by Multiple Columns

You can also sort by multiple columns by passing a list of column names to the 'by' parameter.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,0,1],'col2':[1,3,4,2]})

print("Original DataFrame:\n", unsorted_df)

# Sort the DataFrame multiple columns by values
sorted_df = unsorted_df.sort_values(by=['col1','col2'])
print("\nOutput Sorted DataFrame:\n", sorted_df)

Its output is as follows −

Original DataFrame:
    col1  col2
0     2     1
1     1     3
2     0     4
3     1     2

Output Sorted DataFrame:
    col1  col2
2     0     4
3     1     2
1     1     3
0     2     1
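The ascending parameter also accepts a list of Booleans, one per column in by, so each column can be sorted in its own direction. A short sketch using the same data:

```python
import pandas as pd

unsorted_df = pd.DataFrame({'col1': [2, 1, 0, 1], 'col2': [1, 3, 4, 2]})

# Sort col1 in ascending order and col2 in descending order
sorted_df = unsorted_df.sort_values(by=['col1', 'col2'], ascending=[True, False])
print(sorted_df)
```

Rows with equal col1 values now appear with their col2 values in descending order.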

Choosing a Sorting Algorithm

Pandas allows you to specify the sorting algorithm using the kind parameter in the sort_values() method. You can choose between 'mergesort', 'heapsort', and 'quicksort'; 'mergesort' is the only stable algorithm (recent pandas versions also accept 'stable' as an alias for it).

Example

The following example sorts a DataFrame using the sort_values() method with a specific algorithm.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,5,0,1],'col2':[1,3,0,4]})
print("Original DataFrame:\n", unsorted_df)

# Sort the DataFrame 
sorted_df = unsorted_df.sort_values(by='col1', kind='mergesort')
print("\nOutput Sorted DataFrame:\n", sorted_df)

Its output is as follows −

Original DataFrame:
    col1  col2
0     2     1
1     5     3
2     0     0
3     1     4

Output Sorted DataFrame:
    col1  col2
2     0     0
3     1     4
0     2     1
1     5     3

Python Pandas - Reindexing

Reindexing is a powerful and fundamental operation in Pandas that allows you to align your data with a new set of labels. Whether you're working with rows or columns, reindexing gives you control over how your data aligns with the labels you specify.

This operation is especially useful when working with time series data, aligning datasets from different sources, or simply reorganizing data to match a particular structure.

What is Reindexing?

Reindexing in Pandas refers to the process of conforming your data to match a new set of labels along a specified axis (rows or columns). This process can accomplish several tasks −

  • Reordering: Reorder the existing data to match a new set of labels.

  • Inserting Missing Values: If a label in the new set does not exist in the original data, Pandas will insert a missing value (NaN) for that label.

  • Filling Missing Data: You can specify how to fill in missing values that result from reindexing, using various filling methods.

The reindex() method is the primary tool for performing reindexing in Pandas. It allows you to modify the row and column labels of Pandas data structures.

Key Methods Used in Reindexing

  • reindex(): This method is used to align an existing data structure with a new index (or columns). It can reorder and/or insert missing labels.

  • reindex_like(): This method allows you to reindex one DataFrame or Series to match another. It's useful when you want to ensure two data structures are aligned similarly.

  • Filling Methods: When reindexing introduces NaN values, you can fill them using methods like ffill, bfill, and nearest.

Example: Reindexing a Pandas Series

The following example demonstrates reindexing a Pandas Series object using the reindex() method. In this case, the "f" label was not present in the original Series, so it appears as NaN in the output reindexed Series.

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
print("Original Series:\n",s)

s_reindexed = s.reindex(["e", "b", "f", "d"])
print('\nOutput Reindexed Series:\n',s_reindexed)

On executing the above code you will get the following output −

Original Series:
 a    0.148874
b    0.592275
c   -0.903546
d    1.031230
e   -0.254599
dtype: float64

Output Reindexed Series:
 e   -0.254599
b    0.592275
f         NaN
d    1.031230
dtype: float64
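Instead of leaving NaN for new labels, the reindex() method accepts a fill_value parameter that substitutes a default value of your choice:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# The missing label "f" receives 0 instead of NaN
s_reindexed = s.reindex(["a", "b", "f"], fill_value=0)
print(s_reindexed)
```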

Example: Reindexing a DataFrame

Consider the following example of reindexing a DataFrame using the reindex() method. With a DataFrame, you can reindex both the rows (index) and columns.

import pandas as pd
import numpy as np

N=5

df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})

print("Original DataFrame:\n", df)

#reindex the DataFrame
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])

print("\nOutput Reindexed DataFrame:\n",df_reindexed)

Its output is as follows −

Original DataFrame:
            A    x         y       C           D
0 2016-01-01  0.0  0.513990  Medium  118.143385
1 2016-01-02  1.0  0.751248     Low   91.041201
2 2016-01-03  2.0  0.332970  Medium  100.644345
3 2016-01-04  3.0  0.723816    High  108.810386
4 2016-01-05  4.0  0.376326    High  101.346443

Output Reindexed DataFrame:
            A       C   B
0 2016-01-01  Medium NaN
2 2016-01-03  Medium NaN
5        NaT     NaN NaN

Reindex to Align with Other Objects

Sometimes, you may need to reindex one DataFrame to align it with another. The reindex_like() method allows you to do this seamlessly.

Example

The following example demonstrates how to reindex a DataFrame (df1) to match another DataFrame (df2) using the reindex_like() method.

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])

df1 = df1.reindex_like(df2)
print(df1)

Its output is as follows −

          col1         col2         col3
0    -2.467652    -1.211687    -0.391761
1    -0.287396     0.522350     0.562512
2    -0.255409    -0.483250     1.866258
3    -1.150467    -0.646493    -0.222462
4     0.152768    -2.056643     1.877233
5    -1.155997     1.528719    -1.343719
6    -1.015606    -1.245936    -0.295275

Note: Here, the df1 DataFrame is altered and reindexed like df2. The column names must match; otherwise, NaN will be added for the entire column label.

Filling While Reindexing

The reindex() method provides an optional parameter method for filling missing values. The available methods include −

  • pad/ffill: Fill values forward.

  • bfill/backfill: Fill values backward.

  • nearest: Fill from the nearest index values.

Example

The following example demonstrates the working of the ffill method.

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(6, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])

# Padding NaNs
print(df2.reindex_like(df1))

# Now fill the NaNs with preceding values
print("Data Frame with Forward Fill:")
print(df2.reindex_like(df1, method='ffill'))

Its output is as follows −

         col1        col2       col3
0    1.311620   -0.707176   0.599863
1   -0.423455   -0.700265   1.133371
2         NaN         NaN        NaN
3         NaN         NaN        NaN
4         NaN         NaN        NaN
5         NaN         NaN        NaN

Data Frame with Forward Fill:
         col1        col2        col3
0    1.311620   -0.707176    0.599863
1   -0.423455   -0.700265    1.133371
2   -0.423455   -0.700265    1.133371
3   -0.423455   -0.700265    1.133371
4   -0.423455   -0.700265    1.133371
5   -0.423455   -0.700265    1.133371

Note: The last four rows are padded.
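The bfill method works in the opposite direction, taking each missing label's value from the next valid entry. A minimal sketch on a Series with a monotonic integer index:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=[0, 2, 4])

# Labels 1 and 3 are filled from the next valid values (2 and 3)
filled = s.reindex(range(5), method='bfill')
print(filled)
```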

Limits on Filling While Reindexing

The limit argument provides additional control over filling while reindexing. The limit specifies the maximum count of consecutive matches.

Example

Let us consider the following example to understand specifying limits on filling −

import pandas as pd
import numpy as np
 
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

# Padding NaNs
print(df2.reindex_like(df1))

# Now fill the NaNs with preceding values
print("Data Frame with Forward Fill limiting to 1:")
print(df2.reindex_like(df1, method='ffill', limit=1))

Its output is as follows −

         col1        col2        col3
0    0.247784    2.128727    0.702576
1   -0.055713   -0.021732   -0.174577
2         NaN         NaN         NaN
3         NaN         NaN         NaN
4         NaN         NaN         NaN
5         NaN         NaN         NaN

Data Frame with Forward Fill limiting to 1:
         col1        col2        col3
0    0.247784    2.128727    0.702576
1   -0.055713   -0.021732   -0.174577
2   -0.055713   -0.021732   -0.174577
3         NaN         NaN         NaN
4         NaN         NaN         NaN
5         NaN         NaN         NaN

Note: The forward fill (ffill) is limited to only one row.

Python Pandas - Iteration

Iterating over pandas objects is a fundamental task in data manipulation, and the behavior of iteration depends on the type of object you're dealing with. This tutorial explains how iteration works in pandas, specifically focusing on Series and DataFrame objects.

The iteration behavior in pandas varies between Series and DataFrame objects −

  • Series: Iterating over a Series object yields the values directly, making it similar to an array-like structure.

  • DataFrame: Iterating over a DataFrame follows a dictionary-like convention, where the iteration produces the column labels (i.e., the keys).
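The difference between the two behaviours can be seen in a couple of lines:

```python
import pandas as pd

s = pd.Series([10, 20, 30])
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Iterating a Series yields its values
print(list(s))    # [10, 20, 30]

# Iterating a DataFrame yields its column labels
print(list(df))   # ['col1', 'col2']
```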

Iterating Through Rows in a DataFrame

To iterate over the rows or columns of a DataFrame, we can use the following methods −

  • items(): iterates over the columns as (key, value) pairs

  • iterrows(): iterates over the rows as (index, Series) pairs

  • itertuples(): iterates over the rows as namedtuples

Iterate Over Column Pairs

The items() method allows you to iterate over each column as a key-value pair, with the label as the key and the column values as a Series object. This method is consistent with the dictionary-like interface of a DataFrame.

Example

The following example iterates over a DataFrame's columns using the items() method. In this example, each column is iterated separately as a key-value pair whose value is a Series.

import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])

print("Original DataFrame:\n", df)

# Iterate Through DataFrame rows
print("Iterated Output:")
for key,value in df.items():
   print(key,value)

Its output is as follows −

Original DataFrame:
        col1      col2      col3
0  0.422561  0.094621 -0.214307
1  0.430612 -0.334812 -0.010867
2  0.350962 -0.145470  0.988463
3  1.466426 -1.258297 -0.824569

Iterated Output:
col1 0    0.422561
1    0.430612
2    0.350962
3    1.466426
Name: col1, dtype: float64
col2 0    0.094621
1   -0.334812
2   -0.145470
3   -1.258297
Name: col2, dtype: float64
col3 0   -0.214307
1   -0.010867
2    0.988463
3   -0.824569
Name: col3, dtype: float64

Observe that each column is iterated separately: the key is the column name, and the value is the corresponding Series object.

Iterate Over DataFrame as Series Pairs

The iterrows() method returns an iterator that yields index and row pairs, where each row is represented as a Series object, containing the data in each row.

Example

The following example iterates the DataFrame rows using the iterrows() method.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])

print("Original DataFrame:\n", df)

# Iterate Through DataFrame rows
print("Iterated Output:")
for row_index,row in df.iterrows():
   print(row_index,row)

Its output is as follows −

Original DataFrame:
        col1      col2      col3
0  0.468160 -0.634193 -0.603612
1  1.231840  0.090565 -0.449989
2 -1.645371  0.032578 -0.165950
3  1.956370 -0.261995  2.168167

Iterated Output:
0 col1    0.468160
col2   -0.634193
col3   -0.603612
Name: 0, dtype: float64
1 col1    1.231840
col2    0.090565
col3   -0.449989
Name: 1, dtype: float64
2 col1   -1.645371
col2    0.032578
col3   -0.165950
Name: 2, dtype: float64
3 col1    1.956370
col2   -0.261995
col3    2.168167
Name: 3, dtype: float64

Note: Because iterrows() returns each row as a Series, it does not preserve the data types across the row; the values are upcast to a common dtype. Here 0, 1, 2, 3 are the row indices and col1, col2, col3 are the column labels.
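
The dtype upcasting can be seen with a small made-up DataFrame that mixes an integer column with a float column:

```python
import pandas as pd

df = pd.DataFrame({'ints': [1, 2], 'floats': [1.5, 2.5]})
print(df['ints'].dtype)   # integer column in the DataFrame itself

# Each row returned by iterrows() is a Series holding values from
# every column, so the integers are upcast to float64.
first_row = next(df.iterrows())[1]
print(first_row['ints'])  # 1.0, now a float
```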

Iterate Over DataFrame as Namedtuples

The itertuples() method returns an iterator yielding a namedtuple for each row in the DataFrame. The first element of the tuple is the row's index value, while the remaining elements are the row's values. This method is generally faster than iterrows() and preserves the data types of the row elements.

Example

The following example uses the itertuples() method to loop through the DataFrame rows as namedtuples.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])

print("Original DataFrame:\n", df)

# Iterate Through DataFrame rows
print("Iterated Output:")
for row in df.itertuples():
   print(row)

Its output is as follows −

Original DataFrame:
        col1      col2      col3
0  0.501238 -0.353269 -0.058190
1 -0.426044 -0.012733 -0.532594
2 -0.704042  2.201186 -1.960429
3  0.514151 -0.844160  0.508056

Iterated Output:
Pandas(Index=0, col1=0.5012381423628608, col2=-0.3532690739340918, col3=-0.058189913290578134)
Pandas(Index=1, col1=-0.42604395958954777, col2=-0.012733326002509393, col3=-0.5325942971498149)
Pandas(Index=2, col1=-0.7040424042099052, col2=2.201186165472291, col3=-1.9604285032438307)
Pandas(Index=3, col1=0.5141508750506754, col2=-0.8441600001815068, col3=0.5080555294913854)

Iterating Through DataFrame Columns

When you iterate over a DataFrame directly, it simply yields the column names.

Example

Consider the following example to understand iteration over the columns of a DataFrame.

import pandas as pd
import numpy as np
 
N = 5
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
   'x': np.linspace(0, stop=N-1, num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
   'D': np.random.normal(100, 10, size=N).tolist()
})

print("Original DataFrame:\n", df)

# Iterate Through DataFrame Columns
print("Output:")
for col in df:
   print(col)

Its output is as follows −

Original DataFrame:
            A    x         y     C           D
0 2016-01-01  0.0  0.990949   Low  114.143838
1 2016-01-02  1.0  0.314517  High   95.559640
2 2016-01-03  2.0  0.180237   Low  121.134817
3 2016-01-04  3.0  0.170095   Low   95.643132
4 2016-01-05  4.0  0.920718   Low   96.379692

Output:
A
x
y
C
D

Example

While iterating over a DataFrame, you should not modify any object. Iteration is meant for reading; the iterator returns a copy of each row, so changes made to it will not be reflected in the original object. The following example demonstrates this.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])

for index, row in df.iterrows():
   row['a'] = 10
print(df)

Its output is as follows −

        col1       col2       col3
0  -1.739815   0.735595  -0.295589
1   0.635485   0.106803   1.527922
2  -0.939064   0.547095   0.038585
3  -1.016509  -0.116580  -0.523158

As you can see, no changes are reflected in the DataFrame, since the iteration only provides a copy of each row.
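
If you do need to update values while examining each row, write back through .loc on the DataFrame itself (or, better, use a vectorized operation). A minimal sketch with made-up numbers:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])

# Writing through df.loc targets the DataFrame itself,
# not the row copy yielded by iterrows().
for index, row in df.iterrows():
    df.loc[index, 'a'] = row['a'] * 10

print(df['a'].tolist())  # the changes persist
```

The equivalent vectorized form, df['a'] = df['a'] * 10, is both simpler and faster and should be preferred whenever possible.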

Python Pandas - Concatenation

Concatenation in Pandas refers to the process of joining two or more Pandas objects (like DataFrames or Series) along a specified axis. This operation is very useful when you need to merge data from different sources or datasets.

The primary tool for this operation is the pd.concat() function, which works with both Series and DataFrame objects and can combine them either row-wise or column-wise.

In this tutorial, we'll explore how to concatenate Pandas objects using the pd.concat() function, covering different scenarios including concatenating along rows, using keys to distinguish concatenated DataFrames, ignoring indexes during concatenation, and concatenating along columns.

Understanding the pd.concat() Function

The pandas.concat() function is the primary method used for concatenation in Pandas. It allows you to concatenate pandas objects along a particular axis with various options for handling indexes.

The syntax of the pd.concat() function is as follows −

pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, 
levels=None, names=None, verify_integrity=False, sort=False, copy=None)

Where,

  • objs: This is a sequence or mapping of Series or DataFrame objects.

  • axis: {0, 1, ...}, default 0. This is the axis to concatenate along.

  • join: {"inner", "outer"}, default "outer". How to handle indexes on other axis(es). Outer for union and inner for intersection.

  • ignore_index: boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, ..., n - 1.

  • keys: Used to create a hierarchical index along the concatenation axis.

  • levels: Specific levels to use for the MultiIndex in the result.

  • names: Names for the levels in the resulting hierarchical index.

  • verify_integrity: If True, checks for duplicate entries in the new axis and raises an error if duplicates are found.

  • sort: When combining DataFrames with unaligned columns, this parameter ensures the columns are sorted.

  • copy: default None. If False, do not copy data unnecessarily.
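
The join parameter, for instance, controls which columns survive when the inputs do not align. A quick sketch with two small hypothetical DataFrames:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
right = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

# 'outer' (the default) keeps the union of columns, filling gaps with NaN
outer = pd.concat([left, right], join='outer', ignore_index=True)
print(outer.columns.tolist())   # ['A', 'B', 'C']

# 'inner' keeps only the columns common to both inputs
inner = pd.concat([left, right], join='inner', ignore_index=True)
print(inner.columns.tolist())   # ['B']
```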

The concat() function does all of the heavy lifting of performing concatenation operations along an axis. Let us create different objects and do concatenation.

Example: Concatenating DataFrames

In this example, the two DataFrames are concatenated along rows, with the resulting DataFrame having duplicated indices.

import pandas as pd

# Creating two DataFrames
one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])

# Concatenating DataFrames
result = pd.concat([one, two])
print(result)

Output

Its output is as follows −

     Name subject_id  Marks_scored
1    Alex       sub1            98
2     Amy       sub2            90
3   Allen       sub4            87
4   Alice       sub6            69
5  Ayoung       sub5            78
1   Billy       sub2            89
2   Brian       sub4            80
3    Bran       sub3            79
4   Bryce       sub6            97
5   Betty       sub5            88

Example: Concatenating with Keys

If you want to distinguish between the concatenated DataFrames, you can use the keys parameter to associate specific keys with each part of the DataFrame.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y']))

Output

Its output is as follows −

       Name subject_id  Marks_scored
x 1    Alex       sub1            98
  2     Amy       sub2            90
  3   Allen       sub4            87
  4   Alice       sub6            69
  5  Ayoung       sub5            78
y 1   Billy       sub2            89
  2   Brian       sub4            80
  3    Bran       sub3            79
  4   Bryce       sub6            97
  5   Betty       sub5            88

Here, the x and y keys create a hierarchical index, allowing easy identification of which original DataFrame each row came from.

Example: Ignoring Indexes During Concatenation

If the resultant object has to follow its own indexing, set ignore_index to True.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y'],ignore_index=True))

Output

Its output is as follows −

     Name subject_id  Marks_scored
0    Alex       sub1            98
1     Amy       sub2            90
2   Allen       sub4            87
3   Alice       sub6            69
4  Ayoung       sub5            78
5   Billy       sub2            89
6   Brian       sub4            80
7    Bran       sub3            79
8   Bryce       sub6            97
9   Betty       sub5            88

Observe that the index changes completely and the keys are also overridden.

Example: Concatenating Along Columns

Instead of concatenating along rows, you can concatenate along columns by setting the axis parameter to 1.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one,two],axis=1))

Output

Its output is as follows −

    Name subject_id  Marks_scored   Name subject_id  Marks_scored
1    Alex       sub1            98  Billy       sub2            89
2     Amy       sub2            90  Brian       sub4            80
3   Allen       sub4            87   Bran       sub3            79
4   Alice       sub6            69  Bryce       sub6            97
5  Ayoung       sub5            78  Betty       sub5            88

Python Pandas - Statistical Functions

In data analysis, understanding the patterns and relationships within your data is crucial. Statistical methods in Pandas help to extract meaningful information, patterns, and relationships from data, enabling you to make decisions and analyze the behavior of your data.

In this tutorial, we will explore some key statistical functions available in Pandas. These functions are designed to help you summarize and understand your data in different ways. Whether you want to measure changes over time, assess relationships between variables, or rank your data, Pandas provides the tools you need.

Analyzing Fractional Change

The pct_change() function in Pandas calculates the fractional change between the current and a prior element. It is a valuable tool for understanding how data evolves over time, commonly used in financial data analysis.

Example

The following example calculates the fractional change between the current and a prior element of a Pandas Series and DataFrame using the pct_change() method.

import pandas as pd
import numpy as np
s = pd.Series([1,2,3,4,5,4])
print(s.pct_change())

df = pd.DataFrame(np.random.randn(5, 2))
print(df.pct_change())

Its output is as follows −

0        NaN
1   1.000000
2   0.500000
3   0.333333
4   0.250000
5  -0.200000
dtype: float64

            0          1
0         NaN        NaN
1  -15.151902   0.174730
2  -0.746374   -1.449088
3  -3.582229   -3.165836
4   15.601150  -1.860434

By default, pct_change() operates on columns; if you want to apply the same operation row-wise, use the axis=1 argument.
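
The periods parameter sets how far back the comparison reaches. A small sketch with made-up numbers:

```python
import pandas as pd

s = pd.Series([10, 12, 9, 18])

# Compare each element with the one two positions earlier;
# the first two entries have no prior element and become NaN.
result = s.pct_change(periods=2)
print(result)   # NaN, NaN, -0.1 (9 vs 10), 0.5 (18 vs 12)
```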

Understanding Covariance

Covariance measures how two variables change together. In Pandas, the cov() method computes the covariance between two Series objects or across all pairs of columns in a DataFrame.

Example

Here is an example of calculating the covariance between two Series objects using the Series.cov() method.

import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print(s1.cov(s2))

Its output is as follows −

0.02429227824398636

Example

When applied to a DataFrame, the cov() method computes the pairwise covariance between all columns.

import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(frame['a'].cov(frame['b']))
print(frame.cov())

Its output is as follows −

-0.58312921152741437

           a           b           c           d            e
a   1.780628   -0.583129   -0.185575    0.003679    -0.136558
b  -0.583129    1.297011    0.136530   -0.523719     0.251064
c  -0.185575    0.136530    0.915227   -0.053881    -0.058926
d   0.003679   -0.523719   -0.053881    1.521426    -0.487694
e  -0.136558    0.251064   -0.058926   -0.487694     0.960761

Note: The covariance between columns a and b in the first statement matches the corresponding entry in the covariance matrix returned for the full DataFrame.

Measuring Correlation

Correlation shows the linear relationship between any two arrays of values (Series). The Pandas corr() function supports different correlation methods, including Pearson (default), Spearman, and Kendall.

Example

This example calculates the correlation between two columns of a DataFrame using the corr() function.

import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

print(frame['a'].corr(frame['b']))
print(frame.corr())

Its output is as follows −

-0.383712785514

           a          b          c          d           e
a   1.000000  -0.383713  -0.145368   0.002235   -0.104405
b  -0.383713   1.000000   0.125311  -0.372821    0.224908
c  -0.145368   0.125311   1.000000  -0.045661   -0.062840
d   0.002235  -0.372821  -0.045661   1.000000   -0.403380
e  -0.104405   0.224908  -0.062840  -0.403380    1.000000

If any non-numeric column is present in the DataFrame, it is excluded automatically.
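
The method parameter selects the correlation measure. With made-up data that is perfectly monotonic but not linear, Spearman (which ranks the values first) scores exactly 1.0 while Pearson does not:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([1, 4, 9, 16, 25])   # s2 = s1 squared: monotonic, not linear

pearson = s1.corr(s2)                      # strong but below 1.0
spearman = s1.corr(s2, method='spearman')  # exactly 1.0 on the ranks
print(pearson, spearman)
```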

Ranking Data

The rank() function assigns ranks to elements in a Series or DataFrame. In cases where multiple elements have the same value, it assigns the average rank by default, but this behavior can be adjusted.

Example

The following example calculates numerical data ranks of the Series elements using the rank() method.

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b'] # so there's a tie
print(s.rank())

Its output is as follows −

a  1.0
b  3.5
c  2.0
d  3.5
e  5.0
dtype: float64

The rank() method optionally takes an ascending parameter, which defaults to True; when False, data is reverse-ranked, with larger values assigned a smaller rank. It supports different tie-breaking methods, specified with the method parameter −

  • average: average rank of tied group

  • min: lowest rank in the group

  • max: highest rank in the group

  • first: ranks assigned in the order they appear in the array
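
The tie-breaking options can be compared directly on a small made-up Series containing a tie:

```python
import pandas as pd

s = pd.Series([7, 7, 3, 9])

avg = s.rank()                  # ties share the average rank
mn = s.rank(method='min')       # ties take the lowest rank in the group
desc = s.rank(ascending=False)  # largest value gets rank 1

print(avg.tolist())   # [2.5, 2.5, 1.0, 4.0]
print(mn.tolist())    # [2.0, 2.0, 1.0, 4.0]
print(desc.tolist())  # [2.5, 2.5, 4.0, 1.0]
```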

Python Pandas - Descriptive Statistics

Descriptive statistics are essential tools in data analysis, offering a way to summarize and understand your data. In Python's Pandas library, there are numerous methods available for computing descriptive statistics on Series and DataFrame objects.

These methods provide various aggregations like sum(), mean(), and quantile(), as well as operations like cumsum() and cumprod() that return an object of the same size.

In this tutorial, we will discuss some of the most commonly used descriptive statistics functions in Pandas, applied to both Series and DataFrame objects. These methods can be classified into different categories based on their functionality, such as aggregation functions, cumulative functions, and more.

Aggregation Functions

Aggregation functions produce a single value from a series of data, providing a concise summary of your dataset. Here are some key aggregation functions −

Sr.No. Methods & Description
1. mean() − Returns the mean of the values over the requested axis.
2. sum() − Returns the sum of the values over the requested axis.
3. median() − Returns the arithmetic median of the values.
4. min() − Returns the minimum of the values over the requested axis.
5. max() − Returns the maximum of the values over the requested axis.
6. count() − Returns the number of non-NA/null observations in the given object.
7. quantile() − Returns the value at the given quantile(s).
8. mode() − Returns the mode(s) of each element along the selected axis/Series.
9. var() − Returns unbiased variance over the requested axis.
10. kurt() − Returns unbiased kurtosis over the requested axis.
11. skew() − Returns unbiased skew over the requested axis.
12. sem() − Returns unbiased standard error of the mean over the requested axis.
13. corr() − Computes correlation with other objects, excluding missing values.
14. cov() − Computes the covariance between two objects, excluding NA/null values.
15. autocorr() − Computes the lag-N autocorrelation.
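
A few of these aggregations applied to a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})

print(df.sum())               # column totals: x=10, y=100
print(df.mean())              # column means: x=2.5, y=25.0
print(df['x'].quantile(0.5))  # the 50% quantile (median) of x: 2.5
```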

Cumulative Functions

Cumulative functions provide running totals or products and maintain the same shape as the input data. These are useful in time series analysis or for understanding trends −

Sr.No. Methods & Description
1. cumsum() − Returns cumulative sum over a DataFrame or Series axis.
2. cumprod() − Returns cumulative product over a DataFrame or Series axis.
3. cummax() − Returns cumulative maximum over a DataFrame or Series axis.
4. cummin() − Returns cumulative minimum over a DataFrame or Series axis.

Boolean Functions

Boolean functions return boolean values based on logical operations across the Series −

Sr.No. Methods & Description
1. all() − Returns True if all elements are True, potentially along an axis.
2. any() − Returns True if any element is True, potentially along an axis.
3. between() − Returns True for each element if it is between the left and right bounds.
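
These boolean reductions are often combined with a comparison, as in this small made-up example:

```python
import pandas as pd

s = pd.Series([5, 10, 15])

print((s > 0).all())      # True: every element is positive
print((s > 12).any())     # True: at least one element exceeds 12
print(s.between(6, 20))   # element-wise bounds check: [False, True, True]
```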

Transformation Functions

Transformation functions apply a mathematical operation to each element in the Series, returning a transformed Series−

Sr.No. Methods & Description
1. diff() − Computes the difference between elements in the object, over the specified number of periods.
2. pct_change() − Computes the percentage change between the current and a prior element.
3. rank() − Computes the rank of values in the given object.

Index Related Functions

These functions relate to the Series index and provide ways to manipulate and analyze index labels −

Sr.No. Methods & Description
1. idxmax() − Returns the index of the first occurrence of the maximum value.
2. idxmin() − Returns the index of the first occurrence of the minimum value.
3. value_counts() − Returns a Series containing counts of unique values.
4. unique() − Returns an array of unique values in the Series elements.
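
A short sketch of these index-related methods on a made-up labeled Series:

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2], index=['a', 'b', 'c', 'd'])

print(s.idxmax())        # 'a': label of the first occurrence of the maximum (3)
print(s.idxmin())        # 'b': label of the minimum (1)
print(s.value_counts())  # 3 appears twice; 1 and 2 once each
```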

Statistical Functions

These functions provide various statistical metrics on the Series data −

Sr.No. Methods & Description
1. nunique() − Returns the number of unique values in the given object.
2. std() − Returns the standard deviation of the Series values.
3. abs() − Returns a Series/DataFrame with the absolute numeric value of each element.
4. clip() − Trims values at input thresholds, setting values outside the bounds to the boundary values.
5. round() − Rounds each value in the given object to the specified number of decimals.
6. prod() − Returns the product of the given object's elements.
7. describe() − Generates descriptive statistics of the given object.

Python Pandas - Working with Text Data

Pandas provides powerful tools for working with text data using the .str accessor. This accessor lets us apply various string operations to Series and Index objects, making string manipulation within a Pandas DataFrame efficient.

The .str accessor provides a variety of string methods that can perform operations like string transformation, concatenation, searching, and many others on string objects. Below, these methods are categorized based on their functionality −

String Transformation

This category includes methods that transform the strings in some way, such as changing the case, formatting, or modifying specific characters.

Sr.No. Methods & Description
1. Series.str.capitalize() − Transforms the first character of each string in the Series or Index to uppercase and the rest to lowercase.
2. Series.str.casefold() − Converts each string to lowercase in a more aggressive manner suitable for case-insensitive comparisons.
3. Series.str.lower() − Converts all characters in each string of the Series or Index to lowercase.
4. Series.str.upper() − Converts all characters in each string of the Series or Index to uppercase.
5. Series.str.title() − Converts each string to titlecase, where the first character of each word is capitalized.
6. Series.str.swapcase() − Swaps case: converts uppercase characters to lowercase and vice versa.
7. Series.str.replace() − Replaces occurrences of a pattern or regular expression in each string with another string.
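
A few of these transformations applied to a small made-up Series of strings:

```python
import pandas as pd

s = pd.Series(['alpha', 'BETA', 'Gamma'])

print(s.str.upper().tolist())       # ['ALPHA', 'BETA', 'GAMMA']
print(s.str.capitalize().tolist())  # ['Alpha', 'Beta', 'Gamma']

# replace() substitutes every occurrence of the pattern;
# here the literal character 'a' becomes '_'
print(s.str.replace('a', '_').tolist())  # ['_lph_', 'BETA', 'G_mm_']
```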

String Trimming

This category includes methods to trim leading or trailing characters from strings, or to remove a specified prefix or suffix.

Sr.No. Methods & Description
1. Series.str.lstrip() − Removes leading characters (by default, whitespace) from each string.
2. Series.str.strip() − Removes leading and trailing characters (by default, whitespace) from each string.
3. Series.str.rstrip() − Removes trailing characters (by default, whitespace) from each string.
4. Series.str.removeprefix(prefix) − Removes the specified prefix from each string in the Series or Index, if it exists.
5. Series.str.removesuffix(suffix) − Removes the specified suffix from each string in the Series or Index, if it exists.

String Concatenation and Joining Methods

These methods allow you to combine multiple strings into one or join elements within strings using specified separators.

Sr.No. Methods & Description
1. Series.str.cat() − Concatenates strings in the Series or Index with an optional separator.
2. Series.str.join() − Joins the elements in lists contained in each string of the Series or Index using the specified separator.

String Padding Methods

This category includes methods to pad strings to a specific length or align them within a specified width.

Sr.No. Methods & Description
1. Series.str.center() − Centers each string in the Series or Index within a specified width, padding with a character.
2. Series.str.pad() − Pads each string in the Series or Index to a specified width, with an option to pad from the left, right, or both sides.
3. Series.str.ljust() − Pads the right side of each string in the Series or Index with a specified character to reach the specified width.
4. Series.str.rjust() − Pads the left side of each string in the Series or Index with a specified character to reach the specified width.
5. Series.str.zfill() − Pads each string in the Series or Index with zeros on the left, up to the specified width.
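
For instance, zero-padding and right-justification on a small made-up Series:

```python
import pandas as pd

s = pd.Series(['7', '42'])

print(s.str.zfill(4).tolist())       # ['0007', '0042']
print(s.str.rjust(4, '*').tolist())  # ['***7', '**42']
```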

String Searching Methods

These methods help you locate substrings, count occurrences, or check for patterns within the text.

Sr.No. Methods & Description
1. Series.str.contains() − Checks whether each string in the Series or Index contains a specified pattern.
2. Series.str.count() − Counts occurrences of a pattern or regular expression in each string of the Series or Index.
3. Series.str.find() − Finds the lowest index of a substring in each string of the Series or Index.
4. Series.str.rfind() − Finds the highest index of a substring in each string of the Series or Index.
5. Series.str.index() − Similar to find(), but raises an exception if the substring is not found.
6. Series.str.rindex() − Similar to rfind(), but raises an exception if the substring is not found.
7. Series.str.match() − Checks for a match only at the beginning of each string.
8. Series.str.fullmatch() − Checks for a match across the entire string.
9. Series.str.extract() − Extracts matched groups in each string using regular expressions.
10. Series.str.extractall() − Extracts all matches in each string using regular expressions.
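
A short sketch of searching and extraction on made-up strings:

```python
import pandas as pd

s = pd.Series(['order-101', 'order-202', 'refund-303'])

# contains() returns a boolean mask for substring/pattern matches
print(s.str.contains('order').tolist())    # [True, True, False]

# extract() pulls out the first regex group as a new column
print(s.str.extract(r'(\d+)')[0].tolist())  # ['101', '202', '303']
```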

String Splitting Methods

Splitting methods divide strings based on a delimiter or pattern, which is useful for parsing text data into separate components.

Sr.No. Methods & Description
1. Series.str.split() − Splits each string in the Series or Index by the specified delimiter or regular expression, and returns a list of strings.
2. Series.str.rsplit() − Splits each string in the Series or Index by the specified delimiter or regular expression, starting from the right side, and returns a list of strings.
3. Series.str.partition() − Splits each string at the first occurrence of the delimiter, and returns three elements: the part before the delimiter, the delimiter itself, and the part after the delimiter.
4. Series.str.rpartition() − Splits each string at the last occurrence of the delimiter, and returns three elements: the part before the delimiter, the delimiter itself, and the part after the delimiter.
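
The difference between split() and partition() on a small made-up Series:

```python
import pandas as pd

s = pd.Series(['a,b,c', 'x,y'])

# split() returns a list of all pieces per string
sp = s.str.split(',')
print(sp.tolist())  # [['a', 'b', 'c'], ['x', 'y']]

# partition() splits only at the first delimiter and keeps it,
# returning three columns: before, delimiter, after
parts = s.str.partition(',')
print(parts.values.tolist())  # [['a', ',', 'b,c'], ['x', ',', 'y']]
```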

String Filtering Methods

These methods are useful for filtering out non-alphanumeric characters, controlling character sets, or cleaning text data.

Sr.No. Methods & Description
1. Series.str.filter() − Returns elements for which a provided function evaluates to true.
2. Series.str.get() − Extracts the element from each component at the specified position.
3. Series.str.get_dummies() − Splits each string in the Series by the specified delimiter and returns a DataFrame of dummy/indicator variables.
4. Series.str.isalpha() − Checks whether each string consists only of alphabetic characters.
5. Series.str.isdigit() − Checks whether each string consists only of digits.
6. Series.str.isnumeric() − Checks whether each string consists only of numeric characters.
7. Series.str.isspace() − Checks whether each string consists only of whitespace.
8. Series.str.isupper() − Checks whether all characters in each string are uppercase.
9. Series.str.islower() − Checks if all characters in each string are lowercase.
10. Series.str.isalnum() − Checks if all characters in each string are alphanumeric (letters and digits).
11. Series.str.istitle() − Checks if each string in the Series or Index is in title case, where each word starts with a capital letter.
12. Series.str.isdecimal() − Checks if all characters in each string are decimal characters.
13. Series.str.len() − Computes the length of each string in the Series or Index.
14. Series.str.findall() − Finds all occurrences of a pattern or regular expression in each string.

Miscellaneous Methods

This category includes methods that perform a variety of other operations on strings, such as encoding, decoding, and checking for the presence of certain characters.

Sr.No. Methods & Description
1. Series.str.encode() − Encodes each string using the specified encoding.
2. Series.str.decode() − Decodes each string using the specified encoding.
3. Series.str.expandtabs() − Expands tab characters ('\t') into spaces.
4. Series.str.repeat() − Repeats each string in the Series or Index by the specified number of times.
5. Series.str.slice_replace() − Replaces a slice in each string with a passed replacement.
6. Series.str.translate() − Maps each character in the string through a translation table.
7. Series.str.slice() − Slices each string in the Series or Index by a passed argument.
8. Series.str.startswith() − Checks whether each string in the Series or Index starts with a specified pattern.
9. Series.str.endswith() − Checks whether each string in the Series or Index ends with a specified pattern.
10. Series.str.normalize() − Normalizes the Unicode representation of each string in the Series or Index to the specified normalization form.
11. Series.str.wrap() − Wraps each string in the Series or Index to the specified line width, breaking lines as needed.

Python Pandas - Function Application

Pandas provides powerful methods to apply custom or library functions to DataFrame and Series objects. Depending on whether you want to apply a function to the entire DataFrame, row- or column-wise, or element-wise, Pandas offers several methods to achieve these tasks.

In this tutorial, we will explore three essential methods for function application in Pandas −

  • Table wise Function Application: pipe()
  • Row or Column Wise Function Application: apply()
  • Element wise Function Application: map()

Let's dive into each method and see how they can be utilized effectively.

Table-wise Function Application

The pipe() function allows you to apply chainable functions that expect a DataFrame or Series as input. This method is useful for performing custom operations on the entire DataFrame in a clean and readable manner.

Example: Applying a Custom Function to the Entire DataFrame

The following example adds a value to every element of the DataFrame using the pipe() function. Note that pipe() returns a new object rather than modifying the DataFrame in place, so the result must be captured.

import pandas as pd
import numpy as np

def adder(ele1, ele2):
   return ele1 + ele2

df = pd.DataFrame(np.random.randn(5,3), columns=['col1','col2','col3'])
print('Original DataFrame:\n', df)

# pipe() returns a new DataFrame; assign the result to keep it
result = df.pipe(adder, 2)
print('Modified DataFrame:\n', result)

Its output is as follows −

Original DataFrame:
        col1      col2      col3
0  2.349190  1.908931 -0.121444
1  1.306488 -0.946431  0.308926
2 -0.235694 -0.720602  1.089614
3  0.960508 -1.273928  0.943044
4 -1.180202 -0.959529  0.464541

Modified DataFrame:
        col1      col2      col3
0  4.349190  3.908931  1.878556
1  3.306488  1.053569  2.308926
2  1.764306  1.279398  3.089614
3  2.960508  0.726072  2.943044
4  0.819798  1.040471  2.464541

Row or Column Wise Function Application

The apply() function is versatile and allows you to apply a function along the axes of a DataFrame. By default, it applies the function column-wise, but you can specify row-wise application using the axis parameter.

Example: Applying a Function Column-wise

This example applies a function to the DataFrame columns. Here the np.mean() function calculates the mean of each column.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
print('Original DataFrame:\n', df)

result = df.apply(np.mean)
print('Result:\n',result)

Its output is as follows −

Original DataFrame:
        col1      col2      col3
0 -0.024666  0.058480  0.658520
1 -0.040997  1.253245 -1.242394
2  1.073832 -1.039897  0.840698
3  0.248157 -1.985475  0.310767
4 -0.973393 -1.002330 -0.890125

Result:
 col1    0.056587
col2   -0.543195
col3   -0.064507
dtype: float64

By passing 1 to the axis parameter, the operation is performed row-wise.

Example: Applying a Function Row-wise

This example applies the np.mean() function to the rows of a pandas DataFrame.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
print('Original DataFrame:\n', df)

result = df.apply(np.mean, axis=1)
print('Result:\n',result)

Its output is as follows −

Original DataFrame:
        col1      col2      col3
0  0.069495 -1.228534 -1.431796
1  0.468724  0.497217 -0.270103
2 -0.754304  0.053360 -1.298396
3  0.762669 -2.181029 -2.067756
4  0.129679  0.131104  1.010851

Result:
 0   -0.863612
1    0.231946
2   -0.666446
3   -1.162039
4    0.423878
dtype: float64  

Example: Applying a Lambda Function

The following example applies the lambda function to the DataFrame elements using the apply() method.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
print('Original DataFrame:\n', df)

result = df.apply(lambda x: x.max() - x.min())
print('Result:\n',result)

Its output is as follows −

Original DataFrame:
        col1      col2      col3
0 -1.143522  0.413272  0.633881
1  0.200806 -0.050024  0.108580
2 -2.147704 -0.400682 -1.191469
3  2.342222 -2.398639  0.063151
4 -1.071437  1.895879 -0.916805

Result:
 col1    4.489926
col2    4.294518
col3    1.825350
dtype: float64

Element Wise Function Application

When you need to apply a function to each element individually, you can use the map() function on a Series. This approach is particularly useful when the function cannot be vectorized.

Example: Using map() Function

The following example demonstrates how to use the map() function to apply a custom function to the elements of a DataFrame column.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])

# Scale col1 by 100 element-wise
df['col1'] = df['col1'].map(lambda x: x * 100)
print(df.apply(np.mean))

Its output is as follows −

col1    48.074200
col2    0.454185
col3    0.266563
dtype: float64
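Besides callables, map() also accepts a dictionary (or a Series) for value substitution, which is handy for recoding categorical values. Here is a minimal sketch using a hypothetical grade-to-score mapping −

```python
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'A', 'C']})

# Map each grade letter to a numeric score;
# values missing from the dictionary would become NaN
df['score'] = df['grade'].map({'A': 4, 'B': 3, 'C': 2})
print(df)
```

This produces a new score column with values 4, 3, 4, 2 for the four rows.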

Python Pandas - Options and Customization

Pandas provides an API to customize various aspects of its behavior, particularly display settings. This customization is essential for adjusting how data is presented based on your needs. Whether you want to adjust how many rows and columns are displayed or change the precision of floating-point numbers, Pandas provides a flexible and powerful API for these customizations.

The primary functions available for these customizations are −

  • get_option()
  • set_option()
  • reset_option()
  • describe_option()
  • option_context()

Frequently used Parameters

Before learning about the customization options, let's look at some of the frequently used Pandas display parameters that you can use for customization −

Sr.No   Parameter & Description

1       display.max_rows − Maximum number of rows to display.

2       display.max_columns − Maximum number of columns to display.

3       display.expand_frame_repr − Whether to expand the display of DataFrames across multiple lines.

4       display.max_colwidth − Maximum width of columns.

5       display.precision − Precision to display for decimal numbers.

Let us now understand how these customization functions operate.

Getting the Current Options

The get_option() function retrieves the current value of a specified parameter. This is useful for checking the current configuration of Pandas.

Example: Checking Maximum Rows Displayed

The following example gets and returns the default maximum number of rows displayed. The interpreter uses this value as the upper limit on the number of rows to display.

import pandas as pd
print(pd.get_option("display.max_rows"))

Its output is as follows −

60

Example: Checking Maximum Columns Displayed

This example returns the default maximum number of columns displayed. The interpreter uses this value as the upper limit on the number of columns to display.

import pandas as pd
print(pd.get_option("display.max_columns"))

Its output is as follows −

0

Here, 60 and 0 are the default configuration parameter values (a value of 0 for display.max_columns tells pandas to auto-detect the width of the terminal).

Setting a New Option

The set_option() function allows you to change the value of a specific parameter, enabling you to customize how data is displayed.

Example: Changing Maximum Rows Displayed

Using set_option(), we can change the default number of rows to be displayed. Here is the example −

import pandas as pd
pd.set_option("display.max_rows",10)
print(pd.get_option("display.max_rows"))

Its output is as follows −

10

Example: Changing Maximum Columns Displayed

Following is the example that uses the set_option() function to change the default number of columns to be displayed.

import pandas as pd

pd.set_option("display.max_columns",30)

print(pd.get_option("display.max_columns"))

Its output is as follows −

30
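The same set_option()/reset_option() pattern works for the other display parameters listed above. For instance, the following sketch adjusts display.precision to control the number of decimal places shown when a DataFrame is printed −

```python
import pandas as pd

df = pd.DataFrame({'value': [3.14159265, 2.71828183]})

# Show only 2 decimal places when the DataFrame is printed
pd.set_option("display.precision", 2)
print(df)

# Restore the default precision (6)
pd.reset_option("display.precision")
```

Note that display.precision only affects how values are printed; the underlying data keeps its full precision.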

Resetting an Option to Its Default Value

The reset_option() function resets the value of a specified parameter back to its default setting.

Example: Resetting Maximum Rows Displayed

Using the reset_option() function, we can change the value back to the default number of rows to be displayed.

import pandas as pd

pd.reset_option("display.max_rows")
print(pd.get_option("display.max_rows"))

Its output is as follows −

60

Describing an Option

The describe_option() function provides a description of a specified parameter, explaining what it does and its default value.

Example: Describing Maximum Rows Displayed

This example uses the describe_option() function to get the description of the display.max_rows parameter.

import pandas as pd
pd.describe_option("display.max_rows")

Its output is as follows −

display.max_rows : int
   If max_rows is exceeded, switch to truncate view. Depending on
   'large_repr', objects are either centrally truncated or printed as
   a summary view. 'None' value means unlimited.

   In case python/IPython is running in a terminal and `large_repr`
   equals 'truncate' this can be set to 0 and pandas will auto-detect
   the height of the terminal and print a truncated object which fits
   the screen height. The IPython notebook, IPython qtconsole, or
   IDLE do not run in a terminal and hence it is not possible to do
   correct auto-detection.
   [default: 60] [currently: 60]

Temporary Option Setting

The option_context() function allows you to set an option temporarily within a with statement. Once the context is exited, the option is automatically reverted to its previous value.

Example: Temporarily Changing Maximum Rows Displayed

This example uses the option_context() function to temporarily set the maximum number of rows to display.

import pandas as pd
with pd.option_context("display.max_rows",10):
   print(pd.get_option("display.max_rows"))
print(pd.get_option("display.max_rows"))

Its output is as follows −

10
60

Notice the difference between the first and the second print statements. The first prints the value set by option_context(), which applies only within the with block. Once the block exits, the second print statement shows that the previous value has been restored.

Python Pandas - Window Functions

Window functions in Pandas provide a powerful way to perform operations on a series of data, allowing you to compute statistics and other aggregations over a window of data points. These functions are particularly useful in time series analysis and other situations where you need to consider a range of data points around each observation.

Pandas supports four main types of windowing operations −

  • Rolling Window: A sliding window that can be fixed or variable in size.

  • Weighted Window: A non-rectangular, weighted window supplied by the scipy.signal library.

  • Expanding Window: An accumulating window that includes all data points up to the current one.

  • Exponentially Weighted Window: An accumulating window that applies exponential weighting to previous data points.

We will now learn how each of these can be applied on DataFrame objects.

Rolling Window

A rolling window operation involves moving a fixed-size window across the data, performing an aggregation function (like sum or mean) within that window. It is very flexible and supports various time-based operations.

To perform this operation we can use the rolling() function. This function can be applied on a series of data. Specify the window=n argument and apply the appropriate statistical function on top of it. The rolling() function returns a pandas.api.typing.Rolling object.

Example

Following is the example of applying the rolling window operation on DataFrame using the rolling() function.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df.rolling(window=3).mean())

Its output is as follows −

                    A           B           C           D
2000-01-01        NaN         NaN         NaN         NaN
2000-01-02        NaN         NaN         NaN         NaN
2000-01-03   0.434553   -0.667940   -1.051718   -0.826452
2000-01-04   0.628267   -0.047040   -0.287467   -0.161110
2000-01-05   0.398233    0.003517    0.099126   -0.405565
2000-01-06   0.641798    0.656184   -0.322728    0.428015
2000-01-07   0.188403    0.010913   -0.708645    0.160932
2000-01-08   0.188043   -0.253039   -0.818125   -0.108485
2000-01-09   0.682819   -0.606846   -0.178411   -0.404127
2000-01-10   0.688583    0.127786    0.513832   -1.067156

Note: Since the window size is 3, the first two results are NaN; from the third row onward, each value is the mean of the current row and the two preceding rows. Various other statistical functions can be applied in the same way.
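The leading NaN values can be avoided by passing the min_periods argument, which sets the minimum number of observations required to produce a value. Here is a small sketch on a Series −

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Without min_periods, the first two results are NaN
print(s.rolling(window=3).sum())

# min_periods=1 produces a result as soon as one observation is available
result = s.rolling(window=3, min_periods=1).sum()
print(result)
```

With min_periods=1, the rolling sums are 1, 3, 6, 9, 12 instead of NaN, NaN, 6, 9, 12.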

Weighted Window

A weighted window applies a non-rectangular window function, often used in signal processing. The win_type parameter is used to specify the window type, which corresponds to a window function from the scipy.signal library.

Example

This example demonstrates applying the weighted window operation on the Pandas Series object.

import pandas as pd

# Creating a series
s = pd.Series(range(10))

# Applying a triangular weighted window
result = s.rolling(window=5, win_type="triang").mean()
print(result)

When we run the above program, it produces the following result −

0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

Expanding Window

An expanding window calculates the aggregation statistic over all the data points available up to the current point, allowing for cumulative calculations.

The expanding() function can be applied on a series of data. Specify the min_periods=n argument and apply the appropriate statistical function on top of it. This function returns a pandas.api.typing.Expanding object.

Example

Here is an example of applying the expanding window operation on the DataFrame object.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df.expanding(min_periods=3).mean())

Its output is as follows −

                    A           B           C           D
2000-01-01        NaN         NaN         NaN         NaN
2000-01-02        NaN         NaN         NaN         NaN
2000-01-03   0.434553   -0.667940   -1.051718   -0.826452
2000-01-04   0.743328   -0.198015   -0.852462   -0.262547
2000-01-05   0.614776   -0.205649   -0.583641   -0.303254
2000-01-06   0.538175   -0.005878   -0.687223   -0.199219
2000-01-07   0.505503   -0.108475   -0.790826   -0.081056
2000-01-08   0.454751   -0.223420   -0.671572   -0.230215
2000-01-09   0.586390   -0.206201   -0.517619   -0.267521
2000-01-10   0.560427   -0.037597   -0.399429   -0.376886
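For aggregations such as sum, an expanding window with min_periods=1 is equivalent to a cumulative computation. The following sketch illustrates this on a simple Series −

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Expanding sum grows the window one element at a time
expanding_sum = s.expanding(min_periods=1).sum()
print(expanding_sum)

# For this case it matches the cumulative sum
print(s.cumsum())
```

Both calls produce 1, 3, 6, 10, 15; expanding() is the more general tool since it supports any of the window aggregation functions.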

Exponentially Weighted Window

This type of windowing operation applies exponential weighting to previous data points, which means that older data points are given progressively less importance.

The ewm() function is applied on a series of data. Specify one of the com, span, or halflife arguments and apply the appropriate statistical function on top of it. It assigns the weights exponentially. This function returns a pandas.api.typing.ExponentialMovingWindow object.

Example

Following is an example of applying the exponentially weighted window operation on a DataFrame using the ewm() function.

import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df.ewm(com=0.5).mean())

Its output is as follows −

                    A           B           C           D
2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
2000-01-02   0.865131   -0.453626   -1.137961    0.058747
2000-01-03  -0.132245   -0.807671   -0.308308   -1.491002
2000-01-04   1.084036    0.555444   -0.272119    0.480111
2000-01-05   0.425682    0.025511    0.239162   -0.153290
2000-01-06   0.245094    0.671373   -0.725025    0.163310
2000-01-07   0.288030   -0.259337   -1.183515    0.473191
2000-01-08   0.162317   -0.771884   -0.285564   -0.692001
2000-01-09   1.147156   -0.302900    0.380851   -0.607976
2000-01-10   0.600216    0.885614    0.569808   -1.110113

Window functions are mainly used to find trends within data graphically by smoothing the curve. When everyday data varies a lot and many data points are available, one method is to take samples and plot them; another is to apply window computations and plot the results. Either method smooths the curve and reveals the trend.

Python Pandas - Aggregations

Aggregating data is a key step in data analysis, especially when dealing with large datasets. In Pandas, you can perform aggregations using the DataFrame.agg() method. This method is flexible, enabling various operations that summarize and analyze your data. Aggregation operations in Pandas can be applied to either the index axis (default) or the column axis.

In this tutorial, we will discuss how to use the DataFrame.agg() method to perform various aggregation techniques, including how to apply multiple aggregation functions, customize aggregations for specific columns, and work with both rows and columns.

Understanding the DataFrame.agg() Method

The DataFrame.agg() method (an alias for aggregate) is a powerful tool that allows you to apply one or more aggregation functions to a DataFrame, either across rows or columns, providing a summary of the data.

Syntax

Following is the syntax −

DataFrame.agg(func=None, axis=0, *args, **kwargs)

Where,

  • func: This parameter specifies the aggregation function(s) to be applied. It accepts a single function or function name (e.g., np.sum, 'mean'), a list of functions or function names, or a dictionary mapping axis labels to functions.

  • axis: Specifies the axis along which to apply the aggregation. 0 or 'index' applies the function(s) to each column (default), while 1 or 'columns' applies the function(s) to each row.

  • *args: Positional arguments to pass to the aggregation function(s).

  • **kwargs: Keyword arguments to pass to the aggregation function(s).

The result of the agg() method depends on the input: it returns a scalar or a Series if a single function is used, or a DataFrame if multiple functions are applied.
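The following sketch illustrates how the return type varies with the input: a single function name yields a Series (one value per column), while a list of functions yields a DataFrame (one row per function) −

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# A single function returns a Series
s = df.agg('sum')
print(s)

# A list of functions returns a DataFrame
frame = df.agg(['sum', 'mean'])
print(frame)
```

Here s is a Series with A=6 and B=15, and frame is a DataFrame indexed by the function names 'sum' and 'mean'.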

Applying Aggregations on DataFrame Rows

You can aggregate multiple functions over the rows (index axis) using the agg function. This method applies the specified aggregation functions to each column in the DataFrame.

Example

Let us create a DataFrame and apply aggregation functions sum and min on it. In this example, the sum and min functions are applied to each column, providing a summary of the data.

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3, 1],
                   [4, 5, 6, np.nan],
                   [7, 8, 9, 2],
                   [np.nan, 2, np.nan, 3]],
   index = pd.date_range('1/1/2024', periods=4),
   columns = ['A', 'B', 'C', 'D'])

print("Input DataFrame:\n",df)
result = df.agg(['sum', 'min'])
print("\nResults:\n",result)

Its output is as follows −

Input DataFrame:
               A  B    C    D
2024-01-01  1.0  2  3.0  1.0
2024-01-02  4.0  5  6.0  NaN
2024-01-03  7.0  8  9.0  2.0
2024-01-04  NaN  2  NaN  3.0

Results:
         A   B     C    D
sum  12.0  17  18.0  6.0
min   1.0   2   3.0  1.0

Applying Different Functions Per Column

You can also apply different aggregation functions to different columns by passing a dictionary to the agg function. Each key in the dictionary corresponds to a column, and the value is a list of aggregation functions to apply.

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3, 1],
                   [4, 5, 6, np.nan],
                   [7, 8, 9, 2],
                   [np.nan, 2, np.nan, 3]],
   index = pd.date_range('1/1/2024', periods=4),
   columns = ['A', 'B', 'C', 'D'])

print("Input DataFrame:\n",df)
result = df.agg({'A': ['sum', 'min'], 'B': ['min', 'max']})
print("\nResults:\n",result)

Its output is as follows −

Input DataFrame:
               A  B    C    D
2024-01-01  1.0  2  3.0  1.0
2024-01-02  4.0  5  6.0  NaN
2024-01-03  7.0  8  9.0  2.0
2024-01-04  NaN  2  NaN  3.0

Results:
         A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0

Apply Aggregation on a Single Column

You can apply aggregation functions to individual columns, such as calculating a rolling sum.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3,min_periods=1)
print(r['A'].aggregate(np.sum))

Its output is as follows −

                 A           B           C           D
2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
2000-01-02   1.879182   -1.038796   -3.215581   -0.299575
2000-01-03   1.303660   -2.003821   -3.155154   -2.479355
2000-01-04   1.884801   -0.141119   -0.862400   -0.483331
2000-01-05   1.194699    0.010551    0.297378   -1.216695
2000-01-06   1.925393    1.968551   -0.968183    1.284044
2000-01-07   0.565208    0.032738   -2.125934    0.482797
2000-01-08   0.564129   -0.759118   -2.454374   -0.325454
2000-01-09   2.048458   -1.820537   -0.535232   -1.212381
2000-01-10   2.065750    0.383357    1.541496   -3.201469
2000-01-01   1.088512
2000-01-02   1.879182
2000-01-03   1.303660
2000-01-04   1.884801
2000-01-05   1.194699
2000-01-06   1.925393
2000-01-07   0.565208
2000-01-08   0.564129
2000-01-09   2.048458
2000-01-10   2.065750
Freq: D, Name: A, dtype: float64

Customizing the Result

Pandas allows you to aggregate different functions across the columns and rename the resulting DataFrame's index. This can be done by passing tuples to the agg() function.

Example

The following example applies the aggregation with custom index labels.

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3, 1],
                   [4, 5, 6, np.nan],
                   [7, 8, 9, 2],
                   [np.nan, 2, np.nan, 3]],
   index = pd.date_range('1/1/2024', periods=4),
   columns = ['A', 'B', 'C', 'D'])

print("Input DataFrame:\n",df)
result = df.agg(x=('A', 'max'), y=('B', 'min'), z=('C', 'mean'))
print("\nResults:\n",result)

Its output is as follows −

Input DataFrame:
               A  B    C    D
2024-01-01  1.0  2  3.0  1.0
2024-01-02  4.0  5  6.0  NaN
2024-01-03  7.0  8  9.0  2.0
2024-01-04  NaN  2  NaN  3.0

Results:
      A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0

Applying Aggregating Over Columns

In addition to aggregating over rows, you can aggregate over the columns by setting the axis parameter to columns (axis=1). This is useful when you want to apply an aggregation function across the rows.

Example

This example applies the mean() function across the columns for each row.

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3, 1],
                   [4, 5, 6, np.nan],
                   [7, 8, 9, 2],
                   [np.nan, 2, np.nan, 3]],
   index = pd.date_range('1/1/2024', periods=4),
   columns = ['A', 'B', 'C', 'D'])

print("Input DataFrame:\n",df)
result = df.agg("mean", axis="columns")
print("\nResults:\n",result)

Its output is as follows −

Input DataFrame:
               A  B    C    D
2024-01-01  1.0  2  3.0  1.0
2024-01-02  4.0  5  6.0  NaN
2024-01-03  7.0  8  9.0  2.0
2024-01-04  NaN  2  NaN  3.0

Results:
 2024-01-01    1.75
2024-01-02    5.00
2024-01-03    6.50
2024-01-04    2.50
Freq: D, dtype: float64

Python Pandas - Merging/Joining

Pandas provides high-performance, in-memory join operations similar to those in SQL databases. These operations allow you to merge multiple DataFrame objects based on common keys or indexes efficiently.

The merge() Method in Pandas

The DataFrame.merge() method in Pandas enables merging of DataFrame or named Series objects using database-style joins. A named Series is treated as a DataFrame with a single named column. Joins can be performed on columns or indexes.

If merging on columns, the DataFrame indexes are ignored. If merging on indexes, or on indexes combined with columns, the index is preserved. However, in cross merges (how='cross'), you cannot specify column names to merge on.

Below is the syntax of this method −

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)

The key parameters are −

  • right: A DataFrame or a named Series to merge with.

  • on: Columns (names) to join on. Must be found in both the DataFrame objects.

  • left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

  • right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

  • left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.

  • right_index: Same usage as left_index for the right DataFrame.

  • how: Determines type of join operation, available options are 'left', 'right', 'outer', 'inner', and 'cross'. Defaults to 'inner'. Each method has been described below.

  • sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to False; setting it to True can substantially degrade performance in many cases.
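The 'cross' option mentioned above produces the Cartesian product of the two DataFrames, pairing every row of the left frame with every row of the right (it requires pandas 1.2 or later, and no key columns may be specified). Here is a minimal sketch −

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue']})
sizes = pd.DataFrame({'size': ['S', 'M', 'L']})

# Cross merge: every color is paired with every size (2 x 3 = 6 rows)
result = colors.merge(sizes, how='cross')
print(result)
```

Cross merges are handy for building all combinations of two small lookup tables, but the result grows as the product of the row counts, so use them with care on large frames.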

Example

Let's create two DataFrames and perform merge operations on them.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

Output

Its output is as follows −

Left DataFrame:
   id    Name subject_id
0   1    Alex       sub1
1   2     Amy       sub2
2   3   Allen       sub4
3   4   Alice       sub6
4   5  Ayoung       sub5

Right DataFrame:
   id   Name subject_id
0   1  Billy       sub2
1   2  Brian       sub4
2   3   Bran       sub3
3   4  Bryce       sub6
4   5  Betty       sub5

Merge Two DataFrames on a Key

You can merge two DataFrames using a common key column by specifying the column name in the on parameter of the merge() method.

Example

The following example demonstrates how to merge two DataFrames on a key using the DataFrame.merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging DataFrames on a key 'id'
result = left.merge(right, on='id')
print(result)

Output

Its output is as follows −

   id  Name_x subject_id_x Name_y subject_id_y
0   1    Alex         sub1  Billy         sub2
1   2     Amy         sub2  Brian         sub4
2   3   Allen         sub4   Bran         sub3
3   4   Alice         sub6  Bryce         sub6
4   5  Ayoung         sub5  Betty         sub5

Merge Two DataFrames on Multiple Keys

To merge two DataFrames on multiple keys, provide a list of column names to the on parameter.

Example

The following example demonstrates how to merge DataFrames on multiple keys using the merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})

# Merging on multiple keys 'id' and 'subject_id'
result = left.merge(right, on=['id', 'subject_id'])
print(result)

Output

Its output is as follows −

   id  Name_x subject_id Name_y
0   4   Alice       sub6  Bryce
1   5  Ayoung       sub5  Betty

Merge Using 'how' Argument

The how argument determines which keys to include in the resulting DataFrame. If a key combination does not appear in either the left or right DataFrame, the values in the joined table will be NaN.

Merge Methods and Their SQL Equivalents

The following table summarizes the how options and their SQL equivalents −

Merge Method   SQL Equivalent     Description
left           LEFT OUTER JOIN    Use keys from the left object.
right          RIGHT OUTER JOIN   Use keys from the right object.
outer          FULL OUTER JOIN    Union of keys from both DataFrames.
inner          INNER JOIN         Intersection of keys from both DataFrames.

Example: Left Join

This example demonstrates merging the DataFrame by using the left method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging DataFrames using the left join method 
print(left.merge(right, on='subject_id', how='left'))

Output

Its output is as follows −

   id_x  Name_x subject_id  id_y Name_y
0     1    Alex       sub1   NaN    NaN
1     2     Amy       sub2   1.0  Billy
2     3   Allen       sub4   2.0  Brian
3     4   Alice       sub6   4.0  Bryce
4     5  Ayoung       sub5   5.0  Betty

Example: Right Join

This example performs the right join operation on two DataFrames using the merge() method by setting the how='right'.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging DataFrames using the right join method
print(left.merge(right, on='subject_id', how='right'))

Output

Its output is as follows −

   id_x  Name_x subject_id  id_y Name_y
0   2.0     Amy       sub2     1  Billy
1   3.0   Allen       sub4     2  Brian
2   NaN     NaN       sub3     3   Bran
3   4.0   Alice       sub6     4  Bryce
4   5.0  Ayoung       sub5     5  Betty

Example: Outer Join

This example will apply the outer join operation on two DataFrames by specifying the how='outer' in the merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging the DataFrames using the outer join 
print(left.merge(right, how='outer', on='subject_id'))

Output

Its output is as follows −

   id_x  Name_x subject_id  id_y Name_y
0   1.0    Alex       sub1   NaN    NaN
1   2.0     Amy       sub2   1.0  Billy
2   3.0   Allen       sub4   2.0  Brian
3   4.0   Alice       sub6   4.0  Bryce
4   5.0  Ayoung       sub5   5.0  Betty
5   NaN     NaN       sub3   3.0   Bran

Inner Join

An inner join keeps only the key combinations that appear in both DataFrames; rows without a match in the other DataFrame are dropped. This is the default behavior of the merge() method (how='inner'). Note that the related DataFrame.join() method, covered below, joins on the index and honors the object on which it is called, so a.join(b) is not equal to b.join(a).

Example

The following example demonstrates how to apply inner join operation to the two DataFrames using the DataFrame.merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merge the DataFrames using the inner join method
print(left.merge(right, on='subject_id', how='inner'))

Output

Its output is as follows −

   id_x  Name_x subject_id  id_y Name_y
0     2     Amy       sub2     1  Billy
1     3   Allen       sub4     2  Brian
2     4   Alice       sub6     4  Bryce
3     5  Ayoung       sub5     5  Betty

The join() Method in Pandas

Pandas also provides a DataFrame.join() method, which is useful for merging DataFrames based on their index. It works similarly to DataFrame.merge() but is more efficient for index-based operations.

Below is the syntax of this method −

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='')

Example

This example demonstrates how to use the DataFrame.join() method for merging DataFrames using indexes instead of columns.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merge the DataFrames using the join() method
result = left.join(right, lsuffix='_left', rsuffix='_right')
print(result)

Output

Its output is as follows −

   id_left Name_left subject_id_left  id_right Name_right subject_id_right
0        1      Alex            sub1         1      Billy             sub2
1        2       Amy            sub2         2      Brian             sub4
2        3     Allen            sub4         3       Bran             sub3
3        4     Alice            sub6         4      Bryce             sub6
4        5    Ayoung            sub5         5      Betty             sub5
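Because join() aligns on the index, setting a shared key column as the index first gives a result similar to a column-based merge. Here is a small sketch using hypothetical data −

```python
import pandas as pd

# Use subject_id as the index of both frames
left = pd.DataFrame({
   'subject_id': ['sub1', 'sub2', 'sub4'],
   'Name': ['Alex', 'Amy', 'Allen']
}).set_index('subject_id')

right = pd.DataFrame({
   'subject_id': ['sub2', 'sub4', 'sub3'],
   'Marks': [90, 80, 70]
}).set_index('subject_id')

# join() aligns on the index; how='inner' keeps only matching keys
result = left.join(right, how='inner')
print(result)
```

Only sub2 and sub4 appear in both indexes, so the joined result has two rows, each carrying the Name from the left frame and the Marks from the right.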

Python Pandas - Basics of MultiIndex

MultiIndex, also called hierarchical indexing, is a powerful feature in pandas that allows you to work with higher-dimensional data in lower-dimensional structures like Series (1D) and DataFrame (2D). With MultiIndex, pandas objects can have multiple levels of index labels. Using MultiIndex, you can represent and manipulate data with multiple levels of indexing, making it easier to handle complex data sets efficiently.

In this tutorial, we will learn about the basics of MultiIndex, including how to create MultiIndexed Series and DataFrames, perform basic indexing on MultiIndex axes, and align data using MultiIndex.

Creating MultiIndexed Pandas Objects

There are several ways to create a MultiIndex object in pandas, including from lists of arrays, tuples, products of iterables, or directly from a DataFrame.

Following is a list of helper methods to construct a new MultiIndex −

  • MultiIndex.from_arrays()

  • MultiIndex.from_product()

  • MultiIndex.from_tuples()

  • MultiIndex.from_frame()

Creating MultiIndex from Lists of Arrays

By using the pandas.MultiIndex.from_arrays() method, we can create a MultiIndex from a list of arrays.

Example: Creating a MultiIndexed Series from a List of Lists

The following example demonstrates the creation of MultiIndexed Series object using the pandas.MultiIndex.from_arrays() method.

import pandas as pd
import numpy as np

# Create a 2D list
list_2d = [["BMW", "BMW", "Lexus", "Lexus", "foo", "foo", "Audi", "Audi"],
["1", "2", "1", "2", "1", "2", "1", "2"]]

# Create a MultiIndex object
index = pd.MultiIndex.from_arrays(list_2d, names=["first", "second"])

# Creating a MultiIndexed Series
s = pd.Series(np.random.randn(8), index=index)

# Display the output Series 
print("Output MultiIndexed Series:\n",s)

Output

Following is the output of the above code −

Output MultiIndexed Series:
 first  second
BMW    1         0.507702
       2         0.315580
Lexus  1        -0.913939
       2        -0.470642
foo    1        -0.419916
       2        -0.617791
Audi   1        -0.394219
       2         0.324891
dtype: float64

Creating MultiIndex from Tuples

The Pandas MultiIndex.from_tuples() method is used to convert a list of tuples to a MultiIndex.

Example: Creating MultiIndexed DataFrame from Tuples

This example demonstrates the creation of MultiIndexed DataFrame object using the pandas.MultiIndex.from_tuples() method.

import pandas as pd
import numpy as np

# Create a 2D list
list_2d = [["BMW", "BMW", "Lexus", "Lexus", "foo", "foo", "Audi", "Audi"],
["1", "2", "1", "2", "1", "2", "1", "2"]]

# Create a MultiIndex object
tuples = list(zip(*list_2d))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

# Creating a MultiIndexed DataFrame
df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=["A", "B", "C", "D"])

# Display the output DataFrame
print("Output MultiIndexed DataFrame:\n", df)

Output

Following is the output of the above code −

Output MultiIndexed DataFrame:
                      A         B         C         D
first second                                        
BMW   1      -0.936446 -0.274192  0.308845  0.825323
      2       0.418242  0.830447 -0.539598  1.080315
Lexus 1      -1.139546 -1.018409 -1.849736  0.166109
      2      -0.704976  0.503610 -0.689764  1.412166
foo   1       0.464476  1.714391  0.070771 -0.587529
      2       0.427123  1.002659 -0.408395 -1.933066
Audi  1      -0.093110 -0.020240 -0.569153  0.116810
      2      -0.426054 -0.537713 -1.526764 -0.137989

Creating MultiIndex Using from_product()

The Pandas MultiIndex.from_product() method uses the Cartesian product of multiple iterables to create a MultiIndex. It is useful when you want every possible combination of elements from two or more iterables.

Example: Creating MultiIndexed DataFrame Using from_product()

This example demonstrates how to create the MultiIndexed DataFrame using the pandas MultiIndex.from_product() method.

import pandas as pd
import numpy as np

# Create a list of lists
iterable = [[1, 2, 3], ['green', 'black']]

# Create a MultiIndex object
index = pd.MultiIndex.from_product(iterable, names=["number", "color"])

# Creating a MultiIndexed DataFrame
df = pd.DataFrame(np.random.randn(6, 3), index=index, columns=["A", "B", "C"])

# Display the output DataFrame
print("Output MultiIndexed DataFrame:\n", df)

Output

Following is the output of the above code −

Output MultiIndexed DataFrame:
                      A         B         C
number color                              
1      green  1.399149 -0.995173  1.537441
       black -0.962953 -0.398537  0.072796
2      green  0.064748 -0.148591  0.111019
       black -0.204023 -1.706223  1.415122
3      green -0.219234 -0.113010  0.626351
       black  0.069791 -0.665270  0.900951

Creating MultiIndex from DataFrame

The Pandas MultiIndex.from_frame() method is used to create a MultiIndex from a DataFrame.

Example: Creating MultiIndex from DataFrame

This example uses the pd.MultiIndex.from_frame() method to directly create a MultiIndex object from a DataFrame.

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame([["BMW", 1], ["BMW", 2], ["Lexus", 1],["Lexus", 2]], 
 columns=["first", "second"])

# Create a MultiIndex object
index = pd.MultiIndex.from_frame(df)

# Creating a MultiIndexed DataFrame
df = pd.DataFrame(np.random.randn(4, 3), index=index, columns=["A", "B", "C"])
# Display the output DataFrame
print("Output MultiIndexed DataFrame:\n", df)

Output

Following is the output of the above code −

Output MultiIndexed DataFrame:
                      A         B         C
first second                              
BMW   1       0.918728 -1.224909 -1.486071
      2      -0.413480 -0.239801  0.000995
Lexus 1       2.550773  0.885128  1.252554
      2       1.077487 -1.021780 -0.360193

Basic Indexing on Axis with MultiIndex

Indexing with a MultiIndex lets you slice and select data in more flexible ways than a regular index does.

Example: Selecting Data by Index Level

Here is a basic example demonstrating indexing of a MultiIndexed Series object using the .loc[] indexer.

import pandas as pd
import numpy as np

# Creating MultiIndex from arrays
arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"]]

# Creating a list of tuples from the arrays
tuples = list(zip(*arrays))

# Creating a MultiIndex from tuples
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

# Creating a Series with MultiIndex
s = pd.Series([2, 3, 1, 4, 6, 1, 7, 8], index=index)

print("MultiIndexed Series:\n", s)

# Indexing the MultiIndexed Series using .loc[]
print("\nSelecting data at index ('bar', 'one'):")
print(s.loc[('bar', 'one')])

Output

Following is the output of the above code −

MultiIndexed Series:
 first  second
bar    one       2
       two       3
baz    one       1
       two       4
foo    one       6
       two       1
qux    one       7
       two       8
dtype: int64

Selecting data at index ('bar', 'one'):
2
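
Partial indexing is one of the main conveniences of a MultiIndex: supplying only the outer label returns the whole inner sub-Series. A minimal sketch, rebuilding a smaller Series of the same shape as the one above:

```python
import pandas as pd

# Rebuild a small MultiIndexed Series (same shape as the example above)
index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")],
    names=["first", "second"])
s = pd.Series([2, 3, 1, 4], index=index)

# Supplying only the outer label returns the inner sub-Series
print(s.loc["bar"])

# Supplying the full tuple returns a single scalar
print(s.loc[("baz", "two")])
```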

Python Pandas - Indexing with MultiIndex

Indexing with MultiIndex refers to accessing and selecting data in a Pandas DataFrame that has multiple levels of indexing. Unlike standard DataFrames that have a single index, a MultiIndexed DataFrame allows hierarchical indexing, where rows and columns are labeled using multiple keys.

This type of indexing is useful for handling structured datasets, making it easier to perform operations like grouping, slicing, and advanced selections. Instead of using a single label or position-based indexing, you can use tuples of labels to access data at different levels.

In this tutorial, you will learn how to use MultiIndex for advanced indexing and selection, including slicing, and Boolean indexing.

Basic Indexing with MultiIndex

Indexing with MultiIndex is similar to single-index DataFrames, but here you can also use tuples to index by multiple levels.

Example

Here is a basic example of selecting a subset of data using an outer level label with the .loc[] indexer.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')])

# Create a DataFrame
data = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Select all rows based on the level label
print('Selected Subset:\n',df.loc['A'])   

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
        X  Y
A one  1  2
  two  3  4
B one  5  6
  two  7  8
Selected Subset:
      X  Y
one  1  2
two  3  4

Example

Here is another example demonstrating indexing a MultiIndexed DataFrame using a tuple of level labels with the .loc[] indexer.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')])

# Create a DataFrame
data = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)


# Index the data based on the tuple of level labels   
print('Selected Subset:')
print(df.loc[('B', 'one')])  

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
        X  Y
A one  1  2
  two  3  4
B one  5  6
  two  7  8
Selected Subset:
X    5
Y    6
Name: (B, one), dtype: int64
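
When you want every row that shares a particular inner-level label, the DataFrame.xs() cross-section method is often more convenient than .loc[]. A short sketch using the same data as above:

```python
import pandas as pd

# Same data as the example above
index = pd.MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')])
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                  index=index, columns=['X', 'Y'])

# Cross-section: every row whose second-level label is 'one';
# xs() drops the selected level from the result by default
ones = df.xs('one', level=1)
print(ones)
```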

Advanced Indexing with MultiIndexed Data

Advanced indexing on a MultiIndexed DataFrame can be done with the .loc[] indexer, which allows you to specify more complex conditions and selections in a MultiIndexed DataFrame.

Example

Following is an example of selecting data from a MultiIndexed DataFrame using advanced indexing with the .loc[] indexer.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')])

# Create a DataFrame
data = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Select specific element 
print('Selected data:')
print(df.loc[('A', 'two'), 'Y']) 

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
        X  Y
A one  1  2
  two  3  4
B one  5  6
  two  7  8
Selected data:
4

Boolean Indexing with MultiIndex

Pandas MultiIndexed objects allow you to apply boolean indexing to filter data based on conditions. You create a boolean mask and apply it to the DataFrame.

Example

The following example demonstrates applying the boolean indexing to the MultiIndexed DataFrame to select the rows where 'X' is greater than 2.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')])

# Create a DataFrame
data = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Select data based on the boolean indexing
print('Selected data:')   
mask = df['X'] > 2
print(df[mask])  

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
        X  Y
A one  1  2
  two  3  4
B one  5  6
  two  7  8
Selected data:
       X  Y
A two  3  4
B one  5  6
  two  7  8
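
The mask does not have to come from a column; it can also be built from the index itself via Index.get_level_values(). A small sketch on the same data:

```python
import pandas as pd

# Same data as the example above
index = pd.MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')])
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                  index=index, columns=['X', 'Y'])

# Build a mask from an index level instead of a column
mask = df.index.get_level_values(1) == 'one'
filtered = df[mask]
print(filtered)
```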

Slicing with MultiIndex

Slicing with a MultiIndex works similarly to single-index DataFrames, but more complex selections require tuples or per-level lists of labels.

Example

This example demonstrates how to apply slicing to a MultiIndexed DataFrame using the pandas slicer and the .loc[] indexer.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('A', 'three'),('B', 'one'), ('B', 'two'), ('B', 'three')])
# Create a DataFrame
data = [[1, 2], [3, 4], [1, 1], [5, 6], [7, 8], [2, 2]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Select outer labels 'A' and 'B' with inner labels 'one' and 'three'
print('Sliced data:')
print(df.loc[(['A', 'B'], ['one', 'three']), :])

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
          X  Y
A one    1  2
  two    3  4
  three  1  1
B one    5  6
  two    7  8
  three  2  2
Sliced data:
         X  Y
A one    1  2
  three  1  1
B one    5  6
  three  2  2
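
For multi-level selections like the one above, pd.IndexSlice offers a more readable spelling, with one selector per level. A sketch on the same data (list-based selection, so no prior sorting is required):

```python
import pandas as pd

# Same data as the example above
index = pd.MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'two'), ('A', 'three'),
     ('B', 'one'), ('B', 'two'), ('B', 'three')])
df = pd.DataFrame([[1, 2], [3, 4], [1, 1], [5, 6], [7, 8], [2, 2]],
                  index=index, columns=['X', 'Y'])

# pd.IndexSlice spells out per-level selections explicitly
idx = pd.IndexSlice
result = df.loc[idx[['A', 'B'], ['one', 'three']], :]
print(result)
```

Note that range-based slices such as idx['A':'B', 'one':'two'] additionally require the index to be lexsorted (call df.sort_index() first).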

Python Pandas - Advanced Reindexing with MultiIndex

In Pandas, a MultiIndex (hierarchical indexing) allows you to work with data structures that have multiple levels of indexing for rows and columns. When dealing with this type of structured dataset, advanced reindexing with a MultiIndex becomes essential for reshaping and aligning data across different levels.

Advanced reindexing and alignment in MultiIndexed DataFrames enable flexible data manipulation and reshaping in Pandas. By using methods like reindex(), swaplevel(), and reorder_levels(), you can easily perform data manipulation and restructuring tasks.

Reindexing DataFrame with MultiIndex

Reindexing allows you to change the index of a DataFrame to match a new set of labels. The Pandas DataFrame.reindex() method is used to reindex data, optionally along a specific level of a MultiIndex.

Example

Let us explore using the df.reindex() method to reindex a MultiIndexed DataFrame.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('A', 'three'),('B', 'one'), ('B', 'two'), ('B', 'three')])
# Create a DataFrame
data = [[1, 2], [3, 4], [1, 1], [5, 6], [7, 8], [2, 2]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# New index for reindexing
new_index = [('A', 'one'), ('foo', 'two'), ('B', 'two'), ('A', 'three'), ('B', 'one'), ('A', 'two')]

# Reindexing the DataFrame
reindexed_df = df.reindex(new_index)

print('\nReindexed DataFrame:\n', reindexed_df)

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
          X  Y
A one    1  2
  two    3  4
  three  1  1
B one    5  6
  two    7  8
  three  2  2

Reindexed DataFrame:
              X    Y
A   one    1.0  2.0
foo two    NaN  NaN
B   two    7.0  8.0
A   three  1.0  1.0
B   one    5.0  6.0
A   two    3.0  4.0
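
reindex() can also target a single level through its level parameter, which reorders or filters whole groups while leaving the other level untouched. A minimal sketch (the level names 'letter' and 'number' are illustrative, not from the example above):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')],
    names=['letter', 'number'])
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                  index=index, columns=['X', 'Y'])

# Reindex only the outer level: groups are reordered wholesale,
# while the inner-level order within each group is preserved
result = df.reindex(['B', 'A'], level='letter')
print(result)
```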

Changing MultiIndex Levels with swaplevel()

In a MultiIndexed DataFrame, you can swap the order of the levels using the DataFrame.swaplevel() method. This is useful for reordering the levels of a DataFrame to perform operations across different hierarchical levels.

Example

The following example swaps the levels of a MultiIndexed DataFrame using the df.swaplevel() method.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('A', 'three'),('B', 'one'), ('B', 'two'), ('B', 'three')])
# Create a DataFrame
data = [[1, 2], [3, 4], [1, 1], [5, 6], [7, 8], [2, 2]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Swap the levels of the original DataFrame
swapped_df = df.swaplevel(0, 1, axis=0)

print('\nDataFrame After Swapping Levels:\n', swapped_df)

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
          X  Y
A one    1  2
  two    3  4
  three  1  1
B one    5  6
  two    7  8
  three  2  2

DataFrame After Swapping Levels:
          X  Y
one   A  1  2
two   A  3  4
three A  1  1
one   B  5  6
two   B  7  8
three B  2  2

Reordering MultiIndex Levels with reorder_levels()

Similar to the above approach, the Pandas DataFrame.reorder_levels() method can also be used to rearrange the index levels of a MultiIndexed object.

Example

This example uses the DataFrame.reorder_levels() method to reorder the levels of a MultiIndexed DataFrame.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('A', 'three'),('B', 'one'), ('B', 'two'), ('B', 'three')])
# Create a DataFrame
data = [[1, 2], [3, 4], [1, 1], [5, 6], [7, 8], [2, 2]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Reordering levels
reordered_df = df.reorder_levels([1, 0], axis=0)
print('\nDataFrame after reordering levels:\n', reordered_df)

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
          X  Y
A one    1  2
  two    3  4
  three  1  1
B one    5  6
  two    7  8
  three  2  2

DataFrame after reordering levels:
          X  Y
one   A  1  2
two   A  3  4
three A  1  1
one   B  5  6
two   B  7  8
three B  2  2

Python Pandas - Renaming MultiIndex Labels

Renaming MultiIndex labels of Pandas data structures is a common task, especially when working with hierarchical datasets. It involves renaming specific labels, axis names, or index levels of MultiIndexed objects. Pandas provides several methods to efficiently rename index labels, column labels, or index levels in MultiIndexed objects −

  • rename(): Renames specific index or column labels.

  • rename_axis(): Renames the names of the axis for the index or columns.

  • set_names(): Directly sets or changes the names of MultiIndex levels.

In this tutorial, you will learn about various ways to rename labels and names of MultiIndexed data structures in Pandas.

Renaming MultiIndex Labels Using rename()

To rename the labels of the index or columns in a MultiIndexed object, you can use the Pandas DataFrame.rename() method. This method is useful for renaming individual labels in either the index or the columns of Pandas objects using the index and columns parameters.

Example: Renaming the Specific Index Labels

Here is a basic example of using the df.rename() method to rename specific index labels of a MultiIndexed DataFrame. Note that each mapping is applied wherever the label occurs, at any level of the index.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('A', 'three'),('B', 'one'), ('B', 'two'), ('B', 'three')])
# Create a DataFrame
data = [[1, 2], [3, 4], [1, 1], [5, 6], [7, 8], [2, 2]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Renaming specific index labels
df_renamed = df.rename(index={"A": "aaa", "one": "1"})
print("Renamed DataFrame:")
print(df_renamed)

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
          X  Y
A one    1  2
  two    3  4
  three  1  1
B one    5  6
  two    7  8
  three  2  2
Renamed DataFrame:
           X  Y
aaa 1      1  2
    two    3  4
    three  1  1
B   1      5  6
    two    7  8
    three  2  2
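
To restrict renaming to a single level rather than matching labels at every level, pass the level parameter to rename(); a short sketch on a smaller index:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')])
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                  index=index, columns=['X', 'Y'])

# The mapping is applied to the second index level only
renamed = df.rename(index={'one': '1st'}, level=1)
print(renamed)
```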

Example: Renaming the Specific Column Labels

Following is another example of using the df.rename() method to rename specific column labels of a MultiIndexed DataFrame.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('A', 'three'),('B', 'one'), ('B', 'two'), ('B', 'three')])
# Create a DataFrame
data = [[1, 2], [3, 4], [1, 1], [5, 6], [7, 8], [2, 2]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Renaming columns
df_renamed = df.rename(columns={'X': "col0", 'Y': "col1"})
print("Renamed DataFrame:")
print(df_renamed)

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
          X  Y
A one    1  2
  two    3  4
  three  1  1
B one    5  6
  two    7  8
  three  2  2
Renamed DataFrame:
         col0  col1
A one       1     2
  two       3     4
  three     1     1
B one       5     6
  two       7     8
  three     2     2

Renaming the MultiIndex Axis Names

The pandas DataFrame.rename_axis() method is used to rename or set the names of the index levels in a MultiIndex. This can be particularly useful when working with multi-level indexing.

Example: Specifying/Renaming the Names of the Index Levels

This example demonstrates the use of the df.rename_axis() method to rename the names of the index levels in a MultiIndexed DataFrame.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('A', 'three'),('B', 'one'), ('B', 'two'), ('B', 'three')])
# Create a DataFrame
data = [[1, 2], [3, 4], [1, 1], [5, 6], [7, 8], [2, 2]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Set names for the index levels
result = df.rename_axis(index=["level1", "level2"])
print("Resultant DataFrame:")
print(result)

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
          X  Y
A one    1  2
  two    3  4
  three  1  1
B one    5  6
  two    7  8
  three  2  2
Resultant DataFrame:
               X  Y
level1 level2      
A      one     1  2
       two     3  4
       three   1  1
B      one     5  6
       two     7  8
       three   2  2

Renaming MultiIndex Levels Using set_names()

The pandas Index.set_names() method is used to rename the levels of a MultiIndex directly. This method allows you to set or change the names of individual levels in the index.

Example: Renaming the Names of the MultiIndex Levels

This example demonstrates how to change the names of a MultiIndex levels using the Index.set_names() method.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('A', 'three'),('B', 'one'), ('B', 'two'), ('B', 'three')],
names=["level0", "level1"])

# Create a DataFrame
data = [[1, 2], [3, 4], [1, 1], [5, 6], [7, 8], [2, 2]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Renaming a specific level
df.index = df.index.set_names("new_name", level=0)
print("Resultant DataFrame:")
print(df)

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
                X  Y
level0 level1      
A      one     1  2
       two     3  4
       three   1  1
B      one     5  6
       two     7  8
       three   2  2
Resultant DataFrame:
                 X  Y
new_name level1      
A        one     1  2
         two     3  4
         three   1  1
B        one     5  6
         two     7  8
         three   2  2
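
To rename every level in one call, pass a list of names and omit the level parameter; a minimal sketch:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 'one'), ('B', 'two')], names=['level0', 'level1'])
df = pd.DataFrame([[1, 2], [3, 4]], index=index, columns=['X', 'Y'])

# Passing a list (and no 'level') replaces the names of all levels at once
df.index = df.index.set_names(['outer', 'inner'])
print(df.index.names)
```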

Python Pandas - Sorting a MultiIndex

Sorting a MultiIndex in Pandas helps you organize hierarchical datasets efficiently. A MultiIndex, also known as a hierarchical index, provides multiple levels of index labels in Pandas data structures such as DataFrame or Series objects. Each level of a MultiIndexed object can be sorted independently, enabling efficient slicing, indexing, filtering, and retrieval operations on your data.

Below are the key methods to sort MultiIndexed objects in Pandas −

  • sort_index(): Sort object by labels.

  • sortlevel(): Used for sorting the MultiIndexed object at a specific level.

  • sort_values(): Used to get a sorted copy of the index.

In this tutorial, we will learn how to sort MultiIndexed objects in Pandas using these methods with different approaches.

Sorting MultiIndex Using sort_index()

The Pandas DataFrame.sort_index() method is used to sort a MultiIndex by all levels. Sorting a MultiIndex object can be useful for efficient indexing and slicing of the data.

Example

Here is a basic example of using the df.sort_index() method to sort a MultiIndex by all levels. This sorts the data according to both levels of the MultiIndex.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('A', 'one'), ('A', 'two'), ('A', 'three'),('B', 'one'), ('B', 'two'), ('B', 'three')],
names=["level0", "level1"])

# Create a DataFrame
data = [[1, 2], [3, 4], [1, 1], [5, 6], [7, 8], [2, 2]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Sort MultiIndex with default levels
sorted_df = df.sort_index()
print("Resultant DataFrame:")
print(sorted_df)

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
                X  Y
level0 level1      
A      one     1  2
       two     3  4
       three   1  1
B      one     5  6
       two     7  8
       three   2  2
Resultant DataFrame:
               X  Y
level0 level1      
A      one     1  2
       three   1  1
       two     3  4
B      one     5  6
       three   2  2
       two     7  8
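
sort_index() also accepts ascending as a single flag or as one flag per level, so each level can be sorted in its own direction. A short sketch:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')])
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                  index=index, columns=['X', 'Y'])

# Sort both levels in descending order
desc = df.sort_index(ascending=False)

# Sort the outer level ascending and the inner level descending
mixed = df.sort_index(ascending=[True, False])

print(desc)
print(mixed)
```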

Sorting MultiIndex by Specific Level

If you want to sort by a specific level of the MultiIndex, you can use the level parameter of the df.sort_index() method.

Example

Following is an example of sorting a MultiIndex by its first level (i.e., level=0).

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('C', 'one'), ('C', 'two'),('B', 'one'), ('B', 'two')])

# Create a DataFrame
data = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Sort MultiIndex by the first level
sorted_df = df.sort_index(level=0)
print("Resultant DataFrame:")
print(sorted_df)

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
        X  Y
C one  1  2
  two  3  4
B one  5  6
  two  7  8
Resultant DataFrame:
       X  Y
B one  5  6
  two  7  8
C one  1  2
  two  3  4

Sorting MultiIndex by Level Names

Similar to the above approach, you can also sort the MultiIndex by level names instead of numerical positions, using the df.sort_index() method with the level parameter.

Example

This example sorts the MultiIndex by the level name passed to the level parameter of the sort_index() method.

import pandas as pd

# Create a MultiIndex object
index = pd.MultiIndex.from_tuples([('D', 'z'), ('D', 'x'), ('D', 'y'),('B', 't'), ('B', 's'), ('B', 'v')],
names=["level0", "level1"])

# Create a DataFrame
data = [[1, 2], [3, 4], [1, 1], [5, 6], [7, 8], [2, 2]]
df = pd.DataFrame(data, index=index, columns=['X', 'Y'])

# Display the input DataFrame
print('Original MultiIndexed DataFrame:\n',df)

# Sort by the level name
sorted_df = df.sort_index(level='level1')
print("Resultant DataFrame:")
print(sorted_df)

Output

Following is the output of the above code −

Original MultiIndexed DataFrame:
                X  Y
level0 level1      
D      z       1  2
       x       3  4
       y       1  1
B      t       5  6
       s       7  8
       v       2  2
Resultant DataFrame:
               X  Y
level0 level1      
B      s       7  8
       t       5  6
       v       2  2
D      x       3  4
       y       1  1
       z       1  2

Sorting MultiIndex at Specific Levels with sortlevel()

By using the MultiIndex.sortlevel() method, you can sort a MultiIndex at a specific level. Unlike sort_index(), it returns a tuple containing the sorted index and an array of the positions that produced the sort order.

Example

Following is an example of sorting a MultiIndex object using the MultiIndex.sortlevel() method.

import pandas as pd

# Create arrays
arrays = [[2, 4, 3, 1], ['Peter', 'Chris', 'Andy', 'Jacob']]

# The from_arrays() is used to create a MultiIndex
multiIndex = pd.MultiIndex.from_arrays(arrays, names=('ranks', 'student'))

# display the MultiIndex
print("The Multi-index...\n",multiIndex)

# get the levels in MultiIndex
print("\nThe levels in Multi-index...\n",multiIndex.levels)

# Sort MultiIndex
# The specific level to sort is set as a parameter i.e. level 1 here
print("\nSort MultiIndex at the requested level...\n",multiIndex.sortlevel(1))

Output

Following is the output of the above code −

The Multi-index...
 MultiIndex([(2, 'Peter'),
            (4, 'Chris'),
            (3,  'Andy'),
            (1, 'Jacob')],
           names=['ranks', 'student'])

The levels in Multi-index...
 [[1, 2, 3, 4], ['Andy', 'Chris', 'Jacob', 'Peter']]

Sort MultiIndex at the requested level...
 (MultiIndex([(3,  'Andy'),
            (4, 'Chris'),
            (1, 'Jacob'),
            (2, 'Peter')],
           names=['ranks', 'student']), array([2, 1, 3, 0]))

Sorting MultiIndex Using sort_values()

The sort_values() method sorts the index object and returns a sorted copy of the index.

Example

The following example demonstrates how to sort the MultiIndex object using the sort_values() method.

import pandas as pd

# Create arrays
arrays = [[2, 4, 3, 1], ['Peter', 'Chris', 'Andy', 'Jacob']]

# The from_arrays() is used to create a MultiIndex
multiIndex = pd.MultiIndex.from_arrays(arrays, names=('ranks', 'student'))

# display the MultiIndex
print("The Multi-index...\n",multiIndex)

# Sort MultiIndex using the sort_values() method
print("\nSort MultiIndex...\n",multiIndex.sort_values())

Output

Following is the output of the above code −

The Multi-index...
 MultiIndex([(2, 'Peter'),
            (4, 'Chris'),
            (3,  'Andy'),
            (1, 'Jacob')],
           names=['ranks', 'student'])

Sort MultiIndex...
 MultiIndex([(1, 'Jacob'),
            (2, 'Peter'),
            (3,  'Andy'),
            (4, 'Chris')],
           names=['ranks', 'student'])

Python Pandas - Binary Comparison Operations

Binary comparison operations in Pandas are used to compare elements in a Pandas data structure, such as a Series or DataFrame object, with a scalar value or with another data structure. These operations return Boolean results indicating the outcome of each comparison, and they are useful for filtering, condition-based operations, and data analysis.

In this tutorial, you will learn how to perform binary comparison operations like less than, greater than, equal to, and others, on a Pandas Data structure with scalar values and between other DataFrames/Series objects.

Binary Comparison Operators in Pandas

Binary comparison operators are used to compare elements in a Pandas Series or DataFrame with a scalar value. The result of these operations is a Boolean data structure in which True indicates that the given condition is satisfied and False indicates that it is not.

Here is a list of common binary comparison operators that can be used on a Pandas DataFrame or Series −

  • <: Checks if each element is less than the given value.

  • >: Checks if each element is greater than the given value.

  • <=: Checks if each element is less than or equal to the given value.

  • >=: Checks if each element is greater than or equal to the given value.

  • ==: Checks if each element is equal to the given value.

  • !=: Checks if each element is not equal to the given value.

Example

The following example demonstrates how to apply comparison operators to a Pandas DataFrame with a scalar value.

import pandas as pd

# Create a sample DataFrame
data = {'A': [1, 5, 3, 8], 'B': [4, 6, 2, 9]}
df = pd.DataFrame(data)

# Display the input DataFrame
print("Input DataFrame:\n", df)

# Perform binary comparison operations
print("\nLess than 5:\n", df < 5)
print("\nGreater than 5:\n", df > 5)
print("\nLess than or equal to 5:\n", df <= 5)
print("\nGreater than or equal to 5:\n", df >= 5)
print("\nEqual to 5:\n", df == 5)
print("\nNot equal to 5:\n", df != 5)

Output

Following is the output of the above code −

Input DataFrame:
    A  B
0  1  4
1  5  6
2  3  2
3  8  9

Less than 5:
        A      B
0   True   True
1  False  False
2   True   True
3  False  False

Greater than 5:
        A      B
0  False  False
1  False   True
2  False  False
3   True   True

Less than or equal to 5:
        A      B
0   True   True
1   True  False
2   True   True
3  False  False

Greater than or equal to 5:
        A      B
0  False  False
1   True   True
2  False  False
3   True   True

Equal to 5:
        A      B
0  False  False
1   True  False
2  False  False
3  False  False

Not equal to 5:
        A     B
0   True  True
1  False  True
2   True  True
3   True  True
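
Besides comparing with scalars, the same operators work element-wise between two data structures whose labels align. A minimal sketch with illustrative sample data:

```python
import pandas as pd

# Two DataFrames with matching index and columns
df1 = pd.DataFrame({'A': [1, 5, 3], 'B': [4, 6, 2]})
df2 = pd.DataFrame({'A': [2, 5, 1], 'B': [4, 0, 9]})

# Element-wise comparison between two aligned DataFrames
print(df1 < df2)
print(df1.eq(df2))
```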

Binary Comparison Functions in Pandas

In addition to the above operators, Pandas provides various functions to perform binary comparison operations on Pandas data structures, with additional options for customization, such as selecting the axis and specifying levels for MultiIndexed objects.

Following is the list of binary comparison functions in Pandas −

S.No Function Description
1 lt(other[, axis, level]) Element-wise less than comparison.
2 gt(other[, axis, level]) Element-wise greater than comparison.
3 le(other[, axis, level]) Element-wise less than or equal comparison.
4 ge(other[, axis, level]) Element-wise greater than or equal comparison.
5 ne(other[, axis, level]) Element-wise not equal comparison.
6 eq(other[, axis, level]) Element-wise equal comparison.
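The example below sketches the extra customization these functions provide: comparing a DataFrame against a per-row Series by passing axis=0. The DataFrame and threshold values here are illustrative.

```python
import pandas as pd

# Sample DataFrame and a per-row threshold Series
df = pd.DataFrame({'A': [1, 5], 'B': [4, 6]})
row_thresholds = pd.Series([2, 7])

# axis=0 aligns the Series with the row index instead of the columns,
# so row 0 is compared against 2 and row 1 against 7
result = df.gt(row_thresholds, axis=0)
print(result)
```

With the plain `>` operator, a Series would instead be aligned against the column labels; the axis parameter is what makes row-wise comparison possible.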

Example: Binary Comparison Operations on Pandas Series

This example demonstrates applying the binary comparison functions between a Pandas Series and a scalar value.

import pandas as pd

# Create a Pandas Series
s = pd.Series([10, 20, 30, 40, 50])

# Display the Series
print("Pandas Series:\n", s)

# Perform comparison operations
print("\nLess than 25:\n", s.lt(25))
print("\nGreater than 25:\n", s.gt(25))
print("\nLess than or equal to 30:\n", s.le(30))
print("\nGreater than or equal to 40:\n", s.ge(40))
print("\nNot equal to 30:\n", s.ne(30))
print("\nEqual to 50:\n", s.eq(50))

Output

Following is the output of the above code −


Pandas Series:
 0    10
1    20
2    30
3    40
4    50
dtype: int64

Less than 25:
 0     True
1     True
2    False
3    False
4    False
dtype: bool

Greater than 25:
 0    False
1    False
2     True
3     True
4     True
dtype: bool

Less than or equal to 30:
 0     True
1     True
2     True
3    False
4    False
dtype: bool

Greater than or equal to 40:
 0    False
1    False
2    False
3     True
4     True
dtype: bool

Not equal to 30:
 0     True
1     True
2    False
3     True
4     True
dtype: bool

Equal to 50:
 0    False
1    False
2    False
3    False
4     True
dtype: bool

Example: Binary Comparison Operations on Pandas DataFrame

Similar to the above example, this one performs binary comparison operations between a DataFrame and a scalar value using the Pandas binary comparison functions.

import pandas as pd

# Create a DataFrame
data = {'A': [10, 20, 30], 'B': [40, 50, 60]}
df = pd.DataFrame(data)

# Display the DataFrame
print("DataFrame:\n", df)

# Perform comparison operations
print("\nLess than 25:\n", df.lt(25))
print("\nGreater than 50:\n", df.gt(50))
print("\nEqual to 30:\n", df.eq(30))
print("\nLess than or equal to 30:\n", df.le(30))
print("\nGreater than or equal to 40:\n", df.ge(40))
print("\nNot equal to 30:\n", df.ne(30))

Output

Following is the output of the above code −

DataFrame:
     A   B
0  10  40
1  20  50
2  30  60

Less than 25:
        A      B
0   True  False
1   True  False
2  False  False

Greater than 50:
        A      B
0  False  False
1  False  False
2  False   True

Equal to 30:
        A      B
0  False  False
1  False  False
2   True  False

Less than or equal to 30:
       A      B
0  True  False
1  True  False
2  True  False

Greater than or equal to 40:
        A     B
0  False  True
1  False  True
2  False  True

Not equal to 30:
        A     B
0   True  True
1   True  True
2  False  True

Example: Binary Comparison Between Two Pandas Data Structures

This example compares the two DataFrames element-wise using the eq(), ne(), lt(), gt(), le(), and ge() functions.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({'A': [1, 0, 3], 'B': [9, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 1], 'B': [6, 5, 4]})

# Display the Input DataFrames
print("DataFrame 1:\n", df1)
print("\nDataFrame 2:\n", df2)

# Perform comparison operations between two DataFrames
print("\nEqual :\n", df1.eq(df2))
print("\nNot Equal:\n", df1.ne(df2))
print("\ndf1 Less than df2:\n", df1.lt(df2))
print("\ndf1 Greater than df2:\n", df1.gt(df2))
print("\ndf1 Less than or equal to df2:\n", df1.le(df2))
print("\ndf1 Greater than or equal to df2:\n", df1.ge(df2))

Output

Following is the output of the above code −

DataFrame 1:
    A  B
0  1  9
1  0  5
2  3  6

DataFrame 2:
    A  B
0  1  6
1  2  5
2  1  4

Equal :
        A      B
0   True  False
1  False   True
2  False  False

Not Equal:
        A      B
0  False   True
1   True  False
2   True   True

df1 Less than df2:
        A      B
0  False  False
1   True  False
2  False  False

df1 Greater than df2:
        A      B
0  False   True
1  False  False
2   True   True

df1 Less than or equal to df2:
        A      B
0   True  False
1   True   True
2  False  False

df1 Greater than or equal to df2:
        A     B
0   True  True
1  False  True
2   True  True

Python Pandas - Boolean Indexing

Boolean indexing is a technique used to filter data based on specific conditions. It creates masks or filters that extract subsets of data meeting defined criteria, selecting elements from an array, list, or DataFrame using boolean values (True or False).

Instead of manually iterating through data to find values that meet a condition, Boolean indexing simplifies the process by applying logical expressions.
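As a quick illustration, the manual loop and the one-line boolean expression below select the same rows (a minimal sketch with made-up scores):

```python
import pandas as pd

df = pd.DataFrame({'score': [45, 90, 72, 88]})

# Manual approach: iterate and collect the positions of matching rows
rows = [i for i, v in enumerate(df['score']) if v > 80]
manual = df.iloc[rows]

# Boolean indexing: one vectorized expression
filtered = df[df['score'] > 80]

print(filtered)
```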

What is Boolean Indexing in Pandas?

In Pandas, Boolean indexing is used to filter rows or columns of a DataFrame or Series based on conditional statements. It helps extract specific data that meets the defined condition by creating boolean masks, which are arrays of True and False values. A True value indicates that the corresponding data should be selected, while a False value indicates that it should be excluded.

In this tutorial, we will learn how to access data in a Pandas DataFrame using Boolean indexing with conditional expressions, .loc[], and .iloc[] methods. We will also explore how to apply complex conditions using logical operators for advanced filtering.

Creating a Boolean Index

Creating a boolean index is done by applying a conditional statement to a DataFrame or Series object. For example, if you specify a condition to check whether values in a column are greater than a specific number, then Pandas will return a series of True or False values, which results in a Boolean index.

Example: Creating a Boolean Index

The following example demonstrates how to create a boolean index based on a condition.

import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])

# Display the DataFrame
print("Input DataFrame:\n", df)

# Create Boolean Index
result = df > 2

print('Boolean Index:\n', result)

Output

Following is the output of the above code −

Input DataFrame:
    A  B
0  1  2
1  3  4
2  5  6
Boolean Index:
        A      B
0  False  False
1   True   True
2   True   True

Filtering Data Using Boolean Indexing

Once a boolean index is created, you can use it to filter rows or columns in the DataFrame. This is done by using .loc[] for label-based indexing and .iloc[] for position-based indexing.

Example: Filtering Data using the Boolean Index with .loc

The following example demonstrates filtering the data using boolean indexing with the .loc method. The .loc method is used to filter rows based on the boolean index and specify columns by their label.

import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])

# Display the DataFrame
print("Input DataFrame:\n", df)

# Create Boolean Index
s = (df['A'] > 2)

# Filter DataFrame using the Boolean Index with .loc
print('Output Filtered DataFrame:\n',df.loc[s, 'B'])

Output

Following is the output of the above code −

Input DataFrame:
    A  B
0  1  2
1  3  4
2  5  6
Output Filtered DataFrame:
 1    4
2    6
Name: B, dtype: int64

Filtering Data using the Boolean Index with .iloc

Similar to the above approach, the .iloc method is used for position-based indexing.

Example: Using .iloc with a Boolean Index

This example uses the .iloc method for positional indexing. By converting the boolean index to a NumPy array with the .values attribute, we can filter the DataFrame similarly to the .loc method.

import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])

# Display the DataFrame
print("Input DataFrame:\n", df)

# Create Boolean Index
s = (df['A'] > 2)

# Filter data using .iloc and the Boolean Index
print('Output Filtered Data:\n',df.iloc[s.values, 1])

Output

Following is the output of the above code −

Input DataFrame:
    A  B
0  1  2
1  3  4
2  5  6
Output Filtered Data:
 1    4
2    6
Name: B, dtype: int64

Advanced Boolean Indexing with Multiple Conditions

Pandas supports more complex boolean indexing by combining multiple conditions with the operators & (and), | (or), and ~ (not). You can also apply these conditions across different columns to create highly specific filters. Note that each condition must be wrapped in parentheses, because these operators bind more tightly than comparison operators.

Example: Using Multiple Conditions Across Columns

The following example demonstrates how to apply boolean indexing with multiple conditions across columns.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 3, 5, 7],'B': [5, 2, 8, 4],'C': ['x', 'y', 'x', 'z']})

# Display the DataFrame
print("Input DataFrame:\n", df)

# Apply multiple conditions using boolean indexing
result = df.loc[(df['A'] > 2) & (df['B'] < 5), 'A':'C']

print('Output Filtered DataFrame:\n',result)

Output

Following is the output of the above code −

Input DataFrame:
    A  B  C
0  1  5  x
1  3  2  y
2  5  8  x
3  7  4  z
Output Filtered DataFrame:
    A  B  C
1  3  2  y
3  7  4  z
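The | (or) and ~ (not) operators combine conditions in the same way; here is a small sketch using an illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5, 7], 'B': [5, 2, 8, 4]})

# Rows where A > 4 OR B > 4
either = df[(df['A'] > 4) | (df['B'] > 4)]
print(either)

# Rows where NOT (A > 4), i.e. A <= 4
negated = df[~(df['A'] > 4)]
print(negated)
```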

Python Pandas - Boolean Masking

A boolean mask is an array of boolean values (True or False) used to filter data. It is created by applying conditional expressions to the dataset, which evaluates each element and returns True for matching conditions and False otherwise.

Boolean Masking in Pandas

Boolean masking in Pandas is a useful technique to filter data based on specific conditions. It works by creating a boolean mask, where each element in a DataFrame or Series is represented as either True or False. When you apply this mask to a DataFrame or Series to select data, it selects only the rows or columns that satisfy the given condition.

Why Use Boolean Masks?

Boolean masks provide an efficient way to filter and manipulate data in Pandas without using loops. They are useful for −

  • Selecting data based on specific conditions.

  • Performing conditional operations on DataFrames.

  • Filtering data based on index and column values.

In this tutorial, we will learn how to create a boolean mask and apply it to a Pandas DataFrame or Series to filter data based on index and column values.

Creating a Boolean Mask

Creating a boolean mask is done by applying a conditional statement to a DataFrame or Series object. For example, if you specify a condition to check whether values in a series are greater than a specific number, then Pandas will return a series of True or False values, which results in a Boolean mask.

Example

The following example demonstrates how to create a boolean mask for Series object in Pandas.

import pandas as pd

# Create a Pandas Series
s = pd.Series([1, 5, 2, 8, 4], index=['A', 'B', 'C', 'D', 'E'])

# Display the Series
print("Input Series:")
print(s)

# Create Boolean mask
result = s > 2

print('\nBoolean Mask:')
print(result)

Output

Following is the output of the above code −

Input Series:
A    1
B    5
C    2
D    8
E    4
dtype: int64

Boolean Mask:
A    False
B     True
C    False
D     True
E     True
dtype: bool          

Selecting Data with Boolean Mask

Selecting or filtering data in a DataFrame is done by creating a boolean mask that defines the conditions for selecting rows.

Example

The following example demonstrates how to filter data using boolean masking.

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'Col1': [1, 3, 5, 7, 9],
                   'Col2': ['A', 'B', 'A', 'C', 'A']})

# Display the Input DataFrame
print('Original DataFrame:\n', df)

# Create a boolean mask
mask = (df['Col2'] == 'A') & (df['Col1'] > 4)

# Apply the mask to the DataFrame
filtered_data = df[mask]

print('Filtered Data:\n',filtered_data)

Output

Following is the output of the above code −

Original DataFrame:
    Col1 Col2
0     1    A
1     3    B
2     5    A
3     7    C
4     9    A
Filtered Data:
    Col1 Col2
2     5    A
4     9    A

Masking Data Based on Index Value

You can filter data based on the index values of a DataFrame by creating a mask on the index, which lets you select rows by their position or label.

Example

This example uses the index.isin() method to create a boolean mask based on the index labels.

import pandas as pd

# Create a DataFrame with a custom index
df = pd.DataFrame({'A1': [10, 20, 30, 40, 50],
                   'A2': [9, 3, 5, 3, 2]},
                  index=['a', 'b', 'c', 'd', 'e'])

# Display the Input DataFrame
print('Original DataFrame:\n', df)

# Define a mask based on the index
mask = df.index.isin(['b', 'd'])

# Apply the mask
filtered_data = df[mask]

print('Filtered Data:\n',filtered_data)

Output

Following is the output of the above code −

Original DataFrame:
    A1  A2
a  10   9
b  20   3
c  30   5
d  40   3
e  50   2
Filtered Data:
    A1  A2
b  20   3
d  40   3

Masking Data Based on Column Value

In addition to filtering based on index values, you can also filter data based on specific column values using boolean masks. The Series.isin() method is used to check whether the values in a column match a list of values.

Example

The following example demonstrates how to create and apply a boolean mask to select data based on DataFrame column values.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})

# Display the Input DataFrame
print('Original DataFrame:\n', df)

# Define a mask for specific values in column 'A' and 'B'
mask = df['A'].isin([1, 3]) | df['B'].isin(['a'])

# Apply the mask using the boolean indexing
filtered_data = df[mask]

print('Filtered Data:\n', filtered_data)

Output

Following is the output of the above code −

Original DataFrame:
    A  B
0  1  a
1  2  b
2  3  f
Filtered Data:
    A  B
0  1  a
2  3  f

Python Pandas - Pivoting

Pivoting in Python Pandas is a powerful data transformation technique that reshapes data for easier analysis and visualization. It changes the data representation from a "long" format to a "wide" format, making it simpler to perform aggregations and comparisons.

This technique is particularly useful when dealing with time series data or datasets with multiple columns. Pandas provides two primary methods for pivoting −

  • pivot(): Reshapes data according to specified column or index values.

  • pivot_table(): It is a more flexible method that allows you to create a spreadsheet-style pivot table as a DataFrame.

In this tutorial, we will learn about the pivoting in Pandas using these methods with examples to demonstrate their applications in data manipulation.

Pivoting with pivot()

The Pandas df.pivot() method is used to reshape data when there are unique values for the specified index and column pairs. It is straightforward and useful when your data is well-structured without duplicate entries for the index/column combination.

Example

Here is a basic example demonstrating pivoting a Pandas DataFrame with the Pandas df.pivot() method.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({"Col1": range(12),
                   "Col2": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
                   "date": pd.to_datetime(["2024-01-03", "2024-01-04", "2024-01-05"] * 4)})

# Display the Input DataFrame
print('Original DataFrame:\n', df)

# Pivot the DataFrame
pivoted = df.pivot(index="date", columns="Col2", values="Col1")

# Display the output
print('Pivoted DataFrame:\n', pivoted)

Output

Following is the output of the above code −

Original DataFrame:
     Col1 Col2       date
0      0    A 2024-01-03
1      1    A 2024-01-04
2      2    A 2024-01-05
3      3    B 2024-01-03
4      4    B 2024-01-04
5      5    B 2024-01-05
6      6    C 2024-01-03
7      7    C 2024-01-04
8      8    C 2024-01-05
9      9    D 2024-01-03
10    10    D 2024-01-04
11    11    D 2024-01-05
Pivoted DataFrame:
 Col2        A  B  C   D
date                   
2024-01-03  0  3  6   9
2024-01-04  1  4  7  10
2024-01-05  2  5  8  11

Note: The pivot() method requires that the index and columns specified have unique values. If your data contains duplicates, you should use the pivot_table() method instead.
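A minimal sketch of this behavior, using a made-up DataFrame with a duplicated index/column pair:

```python
import pandas as pd

# Duplicate (index, column) pair: ('a', 'x') appears twice
df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'col': ['x', 'x', 'x'],
                   'val': [1, 2, 3]})

# pivot() cannot decide which value to keep, so it raises ValueError
try:
    df.pivot(index='key', columns='col', values='val')
except ValueError as err:
    print('pivot() failed:', err)

# pivot_table() resolves duplicates by aggregating them (mean by default)
table = pd.pivot_table(df, index='key', columns='col', values='val')
print(table)
```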

Pivoting with pivot_table()

The pivot() method is a straightforward way to reshape data, while pivot_table() offers added flexibility for aggregation, making it suitable for more complex data manipulation tasks. It is particularly useful for summarizing data that contains duplicates and requires aggregation.

Example

This example demonstrates pivoting a DataFrame using the pd.pivot_table() function.

import numpy as np
import pandas as pd
import datetime

# Create a DataFrame
df = pd.DataFrame({"A": [1, 1, 2, 3] * 6,
                   "B": ["A", "B", "C"] * 8,
                   "C": ["x", "x", "x", "y", "y", "y"] * 4,
                   "D": np.random.randn(24),
                   "E": np.random.randn(24),
                   "F": [datetime.datetime(2013, i, 1) for i in range(1, 13)]
                        + [datetime.datetime(2013, i, 15) for i in range(1, 13)]})

# Display the Input DataFrame
print('Original DataFrame:\n', df)

# Pivot the DataFrame
pivot_table = pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])

# Display the output
print('Pivoted DataFrame:\n', pivot_table)

Output

Following is the output of the above code −

Original DataFrame:
     A  B  C         D         E          F
0   1  A  x  1.326898  0.106289 2013-01-01
1   1  B  x  0.173894  1.815661 2013-02-01
2   2  C  x -1.244485  2.225515 2013-03-01
3   3  A  y  1.333984  0.440766 2013-04-01
4   1  B  y  0.084371 -0.287495 2013-05-01
5   1  C  y -1.005378 -0.778836 2013-06-01
6   2  A  x  0.028336  0.074827 2013-07-01
7   3  B  x -0.726482  0.911117 2013-08-01
8   1  C  x -0.136928 -1.215960 2013-09-01
9   1  A  y -0.854257  1.398881 2013-10-01
10  2  B  y -0.339238  0.286684 2013-11-01
11  3  C  y -0.036610  1.820238 2013-12-01
12  1  A  x -0.589002  0.386662 2013-01-15
13  1  B  x -0.097922 -0.284418 2013-02-15
14  2  C  x -0.259232  0.751310 2013-03-15
15  3  A  y -0.685608  0.048374 2013-04-15
16  1  B  y  0.293147 -1.217476 2013-05-15
17  1  C  y  0.491561 -0.050036 2013-06-15
18  2  A  x -1.404094  1.101318 2013-07-15
19  3  B  x -0.551091 -1.400021 2013-08-15
20  1  C  x  0.300324 -1.230676 2013-09-15
21  1  A  y  1.278413 -1.970118 2013-10-15
22  2  B  y -0.866687  1.002037 2013-11-15
23  3  C  y -1.154852  1.245371 2013-12-15
Pivoted DataFrame:
 C           x         y
A B                    
1 A  0.368948  0.212078
  B  0.037986  0.188759
  C  0.081698 -0.256909
2 A -0.687879       NaN
  B       NaN -0.602962
  C -0.751859       NaN
3 A       NaN  0.324188
  B -0.638786       NaN
  C       NaN -0.595731

Pivoting with Aggregation

The Pandas pivot_table() method accepts an aggregation function through its aggfunc parameter. By default it calculates the mean, but you can also use functions like sum, count, or even a custom function to aggregate the pivoted values.

Example

This example demonstrates how to apply an aggregation function while pivoting a DataFrame using the pd.pivot_table() function.

import numpy as np
import datetime
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({"A": [1, 1, 2, 3] * 6,
                   "B": ["A", "B", "C"] * 8,
                   "C": ["x", "x", "x", "y", "y", "y"] * 4,
                   "D": np.random.randn(24),
                   "E": np.random.randn(24),
                   "F": [datetime.datetime(2013, i, 1) for i in range(1, 13)]
                        + [datetime.datetime(2013, i, 15) for i in range(1, 13)]})

# Display the Input DataFrame
print('Original DataFrame:\n', df)

# Pivot the DataFrame with an aggregate function
pivot_table = pd.pivot_table(df, values=["D", "E"], index=["B"], columns=["A", "C"], aggfunc="sum")

# Display the output
print('Pivoted DataFrame:\n', pivot_table)

Output

Following is the output of the above code −

Original DataFrame:
     A  B  C         D         E          F
0   1  A  x  0.122689 -1.893287 2013-01-01
1   1  B  x  0.224266 -0.233766 2013-02-01
2   2  C  x  0.054157 -1.079204 2013-03-01
3   3  A  y  0.340879 -0.388190 2013-04-01
4   1  B  y  1.104747  1.402694 2013-05-01
5   1  C  y -1.788430 -1.925411 2013-06-01
6   2  A  x -0.445921 -0.835855 2013-07-01
7   3  B  x -0.866662 -0.110326 2013-08-01
8   1  C  x  0.881280 -0.828522 2013-09-01
9   1  A  y  0.270564 -0.484500 2013-10-01
10  2  B  y -1.674650 -0.121973 2013-11-01
11  3  C  y  0.887327  1.591251 2013-12-01
12  1  A  x -0.233174  0.387652 2013-01-15
13  1  B  x  0.224098 -1.580638 2013-02-15
14  2  C  x  0.872398  0.389193 2013-03-15
15  3  A  y -0.576675  0.178034 2013-04-15
16  1  B  y  0.812351  0.188114 2013-05-15
17  1  C  y  0.930502 -1.714477 2013-06-15
18  2  A  x -0.247967  1.381405 2013-07-15
19  3  B  x  0.556805 -1.048162 2013-08-15
20  1  C  x -1.040586  1.107391 2013-09-15
21  1  A  y -0.648565  1.118233 2013-10-15
22  2  B  y  1.660616  0.252881 2013-11-15
23  3  C  y  1.745446 -0.541543 2013-12-15
Pivoted DataFrame:
           D                      ...         E                    
A         1                   2  ...         2         3          
C         x         y         x  ...         y         x         y
B                                ...                              
A -0.110485 -0.378000 -0.693888  ...       NaN       NaN -0.210156
B  0.448364  1.917098       NaN  ...  0.130908 -1.158488       NaN
C -0.159307 -0.857928  0.926556  ...       NaN       NaN  1.049708

[3 rows x 12 columns]
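Beyond aggfunc, pivot_table() also accepts a margins option that appends an "All" row and column holding the overall aggregate; a small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a', 'b', 'b'],
                   'grp': ['x', 'y', 'x', 'y'],
                   'val': [1, 2, 3, 4]})

# margins=True appends an 'All' row and column with the overall aggregate
table = pd.pivot_table(df, values='val', index='cat', columns='grp',
                       aggfunc='sum', margins=True)
print(table)
```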

Python Pandas - Stacking and Unstacking

Stacking and unstacking in Pandas are useful techniques for reshaping DataFrames to view the same data in different ways, and they work efficiently with multi-level indexes. Whether it's compressing columns into row levels or expanding rows into columns, these operations are crucial for handling complex datasets.

The Pandas library provides two main methods for these operations −

  • stack(): Used for Stacking, converts columns into row indices, creating a long-format DataFrame.

  • unstack(): For Unstacking, which moves an index level back to columns, converting long-format data into a wide format.

In this tutorial, we will learn about stacking and unstacking techniques in Pandas along with practical examples, including handling missing data.

Stacking in Pandas

Stacking in Pandas is a process of compressing a DataFrame columns into rows. The DataFrame.stack() method in Pandas is used for stacking the levels from column to index. This method pivots a level of column labels (possibly hierarchical) into row labels, and returns a new DataFrame or Series with a multi-level index.

Example

The following example uses the df.stack() method to pivot the columns into the row index.

import pandas as pd
import numpy as np

# Create MultiIndex
tuples = [["x", "x", "y", "y", "", "f", "z", "z"],["1", "2", "1", "2", "1", "2", "1", "2"]]
index = pd.MultiIndex.from_arrays(tuples, names=["first", "second"])

# Create a DataFrame
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

# Display the input DataFrame
print('Input DataFrame:\n', df)

# Stack columns
stacked = df.stack()

print('Output Reshaped DataFrame:\n', stacked)

Output

Following is the output of the above code −


Input DataFrame:
                      A         B
first second                    
x     1      -0.339419 -0.512231
      2      -1.054681  0.357907
y     1      -1.099937  0.508792
      2      -1.072396  0.044404
      1       0.889355 -1.364669
f     2      -0.966859  1.629298
z     1      -1.505033  1.454150
      2      -1.272312  0.169660
Output Reshaped DataFrame:
 first  second   
x      1       A   -0.339419
               B   -0.512231
       2       A   -1.054681
               B    0.357907
y      1       A   -1.099937
               B    0.508792
       2       A   -1.072396
               B    0.044404
       1       A    0.889355
               B   -1.364669
f      2       A   -0.966859
               B    1.629298
z      1       A   -1.505033
               B    1.454150
       2       A   -1.272312
               B    0.169660
dtype: float64

Here, the stack() method pivots the columns A and B into the index, compressing the DataFrame into a long format.

Unstacking in Pandas

Unstacking reverses the stacking operation by moving the row index level back to the columns. The Pandas DataFrame.unstack() method is used to pivot a level of the row index to become a column, which is useful for converting a long-format DataFrame into a wide format.

Example

The following example demonstrates the working of the df.unstack() method for unstacking a DataFrame.

import pandas as pd
import numpy as np

# Create MultiIndex
tuples = [["x", "x", "y", "y", "", "f", "z", "z"],["1", "2", "1", "2", "1", "2", "1", "2"]]
index = pd.MultiIndex.from_arrays(tuples, names=["first", "second"])

# Create a DataFrame
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

# Display the input DataFrame
print('Input DataFrame:\n', df)

# Unstack the DataFrame
unstacked = df.unstack()

print('Output Reshaped DataFrame:\n', unstacked)

Output

Following is the output of the above code −

Input DataFrame:
                      A         B
first second                    
x     1       1.063385  0.573406
      2       0.276126 -1.181493
y     1       1.543061  0.686945
      2      -0.305357  0.046154
      1      -0.686433 -0.491983
f     2      -1.158806 -0.213141
z     1      -0.896814  0.449843
      2       1.576535  0.726083
Output Reshaped DataFrame:
                A                   B          
second         1         2         1         2
first                                         
       -0.686433       NaN -0.491983       NaN
f            NaN -1.158806       NaN -0.213141
x       1.063385  0.276126  0.573406 -1.181493
y       1.543061 -0.305357  0.686945  0.046154
z      -0.896814  1.576535  0.449843  0.726083
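By default unstack() moves the innermost index level to the columns, but you can pass a level number or name to move a different one; a minimal sketch with an illustrative Series:

```python
import pandas as pd

# A Series with a two-level index
index = pd.MultiIndex.from_product([['x', 'y'], [1, 2]],
                                   names=['first', 'second'])
s = pd.Series([10, 20, 30, 40], index=index)

# Default unstack() moves the innermost level ('second') to columns
print(s.unstack())

# Passing a level name (or number) moves that level instead
print(s.unstack(level='first'))
```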

Handling Missing Data during Unstacking

Unstacking can produce missing values when the reshaped DataFrame has unequal label sets in subgroups. Pandas handles these missing values with NaN by default, but you can specify a custom fill value.

Example

This example demonstrates how to handle missing values when unstacking a DataFrame.

import pandas as pd
import numpy as np

# Create Data
index = pd.MultiIndex.from_product([["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"])
columns = pd.MultiIndex.from_tuples([("A", "cat"), ("B", "dog"), ("B", "cat"), ("A", "dog")], names=["exp", "animal"])

df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)

# Create a DataFrame
df3 = df.iloc[[0, 1, 4, 7], [1, 2]]

print(df3)

# Unstack the DataFrame
unstacked = df3.unstack()

# Display the Unstacked DataFrame
print("Unstacked DataFrame without Filling:\n",unstacked)

unstacked_filled = df3.unstack(fill_value=1)
print("Unstacked DataFrame with Filling:\n",unstacked_filled)

Output

Following is the output of the above code −

exp                  B          
animal             dog       cat
first second                    
bar   one    -0.556587 -0.157084
      two     0.109060  0.856019
foo   one    -1.034260  1.548955
qux   two    -0.644370 -1.871248

Unstacked DataFrame without Filling:
exp            B                             
animal       dog                cat          
second       one      two       one       two
first                                        
bar    -0.556587  0.10906 -0.157084  0.856019
foo    -1.034260      NaN  1.548955       NaN
qux          NaN -0.64437       NaN -1.871248

Unstacked DataFrame with Filling:
exp            B                             
animal       dog                cat          
second       one      two       one       two
first                                        
bar    -0.556587  0.10906 -0.157084  0.856019
foo    -1.034260  1.00000  1.548955  1.000000
qux     1.000000 -0.64437  1.000000 -1.871248

Python Pandas - Melting

Melting in Pandas is the process of converting a DataFrame from a wide format to a long format. In the wide format, data is spread across multiple columns. In simpler terms, it "unpivots" the DataFrame columns into rows, and it is useful for visualizing and performing statistical analysis on datasets.

Pandas provides two primary methods for melting DataFrames −

  • melt(): This function "unpivots" DataFrame from wide to long format, making it easier to reshape the data.

  • wide_to_long(): This function offers finer control over melting, especially when column names follow a structured pattern of stubnames and suffixes.

In this tutorial, we will learn about the melt() and wide_to_long() functions in Pandas and how these two methods can be used to transform a DataFrame from a wide format to a long format.

Melting in Pandas

The melt() function in Pandas converts a wide DataFrame into a long format; in other words, it "unpivots" the DataFrame.

Example

The following example demonstrates melting a simple DataFrame using the pandas.melt() function.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},'B': {0: 1, 1: 3, 2: 5},'C': {0: 2, 1: 4, 2: 6}})

# Display the input DataFrame
print('Input DataFrame:\n', df)

# Melt the DataFrame
melted_df = pd.melt(df, id_vars=['A'], value_vars=['B'])

print('Output melted DataFrame:\n', melted_df)

Output

Following is the output of the above code −

Input DataFrame:
    A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
Output melted DataFrame:
    A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
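The melt() function also accepts var_name and value_name parameters to rename the generated "variable" and "value" columns; a quick sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'],
                   'math': [90, 80],
                   'physics': [70, 60]})

# Rename the default 'variable'/'value' columns while melting
melted = pd.melt(df, id_vars='name', var_name='subject', value_name='score')
print(melted)
```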

Example: Handling Index Values While Melting

This example demonstrates how to preserve the original index while melting a DataFrame by passing ignore_index=False to the pandas.melt() function.

import pandas as pd

# Create a DataFrame
index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])
df = pd.DataFrame({
    "first": ["John", "Mary"], "last": ["Doe", "Bo"],
    "height": [5.5, 6.0], "weight": [130, 150]}, index=index)

# Display the input DataFrame
print('Input DataFrame:\n', df)

# Melt the DataFrame
melted_df = pd.melt(df, id_vars=["first", "last"], ignore_index=False)

print('Output melted DataFrame:\n', melted_df)

Output

Following is the output of the above code −

Input DataFrame:
          first last  height  weight
person A  John  Doe     5.5     130
       B  Mary   Bo     6.0     150
Output melted DataFrame:
          first last variable  value
person A  John  Doe   height    5.5
       B  Mary   Bo   height    6.0
       A  John  Doe   weight  130.0
       B  Mary   Bo   weight  150.0

Melting with wide_to_long()

The pandas wide_to_long() function provides more control over the transformation. It's useful when your columns have a structured naming pattern that includes a suffix.

Example

This example uses the wide_to_long() function to perform a more advanced melting transformation.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'ht1': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
                   'ht2': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]})

# Display the input DataFrame
print('Input DataFrame:\n', df)

# Melt the DataFrame using wide_to_long()
long_df = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age')

print('Output Long Melted DataFrame:\n', long_df)

Output

Following is the output of the above code −

Input DataFrame:
    famid  birth  ht1  ht2
0      1      1  2.8  3.4
1      1      2  2.9  3.8
2      1      3  2.2  2.9
3      2      1  2.0  3.2
4      2      2  1.8  2.8
5      2      3  1.9  2.4
6      3      1  2.2  3.3
7      3      2  2.3  3.4
8      3      3  2.1  2.9
Output Long Melted DataFrame:
                   ht
famid birth age     
1     1     1    2.8
            2    3.4
      2     1    2.9
            2    3.8
      3     1    2.2
            2    2.9
2     1     1    2.0
            2    3.2
      2     1    1.8
            2    2.8
      3     1    1.9
            2    2.4
3     1     1    2.2
            2    3.3
      2     1    2.3
            2    3.4
      3     1    2.1
            2    2.9

Python Pandas - Computing Dummy Variables

Dummy variables, also known as indicator variables, are binary (0 or 1) variables created to represent categorical data numerically. In data analysis, particularly when working with categorical data, it is often necessary to convert categorical variables into a numerical format. Converting categorical variables into dummy variables is essential for statistical modeling and machine learning, where numerical inputs are required.

Pandas provides two key functions for handling dummy variables −

  • get_dummies(): Converts categorical data into dummy/indicator variables.

  • from_dummies(): Reconstructs the original categorical variable from dummy variables.

In this tutorial, we will explore how to create dummy variables using get_dummies(), customize them with prefixes, handle collinearity, and revert them back to categorical format using from_dummies().

Creating Dummy Variables with get_dummies()

The get_dummies() function in Pandas is used to convert categorical variables of a Series or a DataFrame into dummy variables.

Example: Basic example of creating the Dummy Variables

Here is a basic example that creates dummy variables using the pandas.get_dummies() function.

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({"keys": list("aeeioou"), "values": range(7)})

# Display the Input DataFrame
print('Input DataFrame:\n',df)

# Create dummy variables for the keys column
dummies = pd.get_dummies(df["keys"])
print('Resultant Dummy Variables:\n',dummies)

Output

Following is the output of the above code −

Input DataFrame:
   keys  values
0    a       0
1    e       1
2    e       2
3    i       3
4    o       4
5    o       5
6    u       6
Resultant Dummy Variables:
        a      e      i      o      u
0   True  False  False  False  False
1  False   True  False  False  False
2  False   True  False  False  False
3  False  False   True  False  False
4  False  False  False   True  False
5  False  False  False   True  False
6  False  False  False  False   True

Creating Dummy Variables with Prefix

The get_dummies() function allows you to add a prefix to the dummy variable column names by using the prefix parameter.

Example

This example demonstrates creating dummy variables with a prefix using the pandas.get_dummies() function.

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({"keys": list("aeeioou"), "values": range(7)})

# Display the Input DataFrame
print('Input DataFrame:\n',df)

# Create dummy variables for the keys column
dummies = pd.get_dummies(df["keys"], prefix="Col_")

print('Resultant Dummy Variables with Prefix:\n',dummies)

Output

Following is the output of the above code −

Input DataFrame:
   keys  values
0    a       0
1    e       1
2    e       2
3    i       3
4    o       4
5    o       5
6    u       6
Resultant Dummy Variables with Prefix:
    Col__a  Col__e  Col__i  Col__o  Col__u
0    True   False   False   False   False
1   False    True   False   False   False
2   False    True   False   False   False
3   False   False    True   False   False
4   False   False   False    True   False
5   False   False   False    True   False
6   False   False   False   False    True
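
Notice the double underscore in Col__a above: get_dummies() joins the prefix to the category name with prefix_sep (default "_"), so a prefix that already ends in an underscore gets a second one. A short sketch of the separator behavior:

```python
import pandas as pd

s = pd.Series(list("ab"))

# The default prefix_sep="_" inserts the underscore for you
print(pd.get_dummies(s, prefix="Col").columns.tolist())
# ['Col_a', 'Col_b']

# A custom separator can be supplied instead
print(pd.get_dummies(s, prefix="Col", prefix_sep="-").columns.tolist())
# ['Col-a', 'Col-b']
```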

Handling Collinearity While Creating Dummy Variables

To avoid collinearity issues in statistical models, you can drop the first dummy variable by setting the drop_first parameter to True.

Example

This example drops the first dummy variable using the drop_first parameter of the pandas.get_dummies() function.

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({"keys": list("aeeioou"), "values": range(7)})

# Display the Input DataFrame
print('Input DataFrame:\n',df)

# Create dummy variables for the keys column
dummies = pd.get_dummies(df["keys"], drop_first=True)

print('Resultant Dummy Variables after Dropping First:\n',dummies)

Output

Following is the output of the above code −

Input DataFrame:
   keys  values
0    a       0
1    e       1
2    e       2
3    i       3
4    o       4
5    o       5
6    u       6
Resultant Dummy Variables after Dropping First:
        e      i      o      u
0  False  False  False  False
1   True  False  False  False
2   True  False  False  False
3  False   True  False  False
4  False  False   True  False
5  False  False   True  False
6  False  False  False   True

Creating Categorical Variables from Dummies

The pandas.from_dummies() function is used to convert the output of get_dummies() back into categorical data, returned as a DataFrame of the original categories.

Example

This example demonstrates creating a categorical Series from dummy variables using the pandas.from_dummies() function.

import pandas as pd
import numpy as np

# Create a DataFrame with dummy variables
df = pd.DataFrame({"Col_a": [0, 1, 0], "Col_b": [1, 0, 1]})

# Display the Input DataFrame
print('Input DataFrame:\n',df)

# Convert the dummy variables back to categorical
original_series = pd.from_dummies(df, sep="_")

print('Resultant Categorical Variables:\n',original_series )

Output

Following is the output of the above code −

Input DataFrame:
    Col_a  Col_b
0      0      1
1      1      0
2      0      1
Resultant Categorical Variables:
   Col
0   b
1   a
2   b
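
When the dummies were created with drop_first=True, one category has no column of its own; from_dummies() can restore it through the default_category parameter. A hedged sketch, where the dropped category "a" is an assumption of this example:

```python
import pandas as pd

# Dummies where category "a" was dropped (e.g. by drop_first=True)
df = pd.DataFrame({"Col_b": [1, 0, 1], "Col_c": [0, 0, 0]})

# Rows in which every dummy is 0 map back to the default category
result = pd.from_dummies(df, sep="_", default_category="a")
print(result["Col"].tolist())
# ['b', 'a', 'b']
```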

Python Pandas - Categorical Data

In pandas, categorical data refers to a data type that represents categorical variables, similar to the concept of factors in R. It is a specialized data type designed for handling categorical variables, commonly used in statistics. A categorical variable can represent values like "male" or "female," or ratings on a scale such as "poor," "average," and "excellent." Unlike numerical data, you cannot perform mathematical operations like addition or division on categorical data.

In Pandas, categorical data is stored more efficiently because it uses a combination of an array of category values and an array of integer codes that refer to those categories. This saves memory and improves performance when working with large datasets containing repeated values.

The categorical data type is useful in the following cases −

  • A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.

  • The lexical order of a variable is not the same as the logical order (one, two, three). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.

  • As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
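
The memory saving comes from the codes-plus-categories layout described above, which can be inspected through the .cat accessor. A small sketch:

```python
import pandas as pd

# A string Series with only a few distinct values
s = pd.Series(["low", "high", "low", "medium", "low"] * 1000)

# Convert to categorical
cat = s.astype("category")

# The categories are stored once, plus small integer codes per row
print(cat.cat.categories.tolist())    # ['high', 'low', 'medium']
print(cat.cat.codes.head().tolist())  # [1, 0, 1, 2, 1]

# The categorical version uses noticeably less memory
print(s.memory_usage(deep=True) > cat.memory_usage(deep=True))  # True
```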

In this tutorial, we will learn the basics of working with categorical data in Pandas, including Series and DataFrame creation, controlling behavior, and recovering the original data from categorical values.

Series and DataFrame Creation with Categorical Data

A Pandas Series or DataFrame object can be created directly with categorical data by using the dtype="category" parameter of the Series() or DataFrame() constructor.

Example: Series Creation with Categorical Data

Following is the basic example of creating a Pandas Series object with the categorical data.

import pandas as pd

# Create Series object with categorical data
s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Display the categorical Series 
print('Series with Categorical Data:\n', s)

Output

Following is the output of the above code −

Series with Categorical Data:
0    a
1    b
2    c
3    a
dtype: category

Categories (3, object): ['a', 'b', 'c']

Example: Converting an Existing DataFrame Column to Categorical

This example demonstrates converting an existing Pandas DataFrame column to categorical data type using the astype() method.

import pandas as pd
import numpy as np

# Create a DataFrame 
df = pd.DataFrame({"Col_a": list("aeeioou"), "Col_b": range(7)})

# Display the Input DataFrame
print('Input DataFrame:\n',df)
print('\nVerify the Data type of each column:\n', df.dtypes)

# Convert the Data type of col_a to categorical
df['Col_a'] = df["Col_a"].astype("category")

# Display the Input DataFrame
print('\nConverted DataFrame:\n',df)
print('\nVerify the Data type of each column:\n', df.dtypes)

Output

Following is the output of the above code −

Input DataFrame:
   Col_a  Col_b
0     a      0
1     e      1
2     e      2
3     i      3
4     o      4
5     o      5
6     u      6

Verify the Data type of each column:
 Col_a    object
Col_b     int64
dtype: object

Converted DataFrame:
   Col_a  Col_b
0     a      0
1     e      1
2     e      2
3     i      3
4     o      4
5     o      5
6     u      6

Verify the Data type of each column:
 Col_a    category
Col_b       int64
dtype: object

Controlling Behavior of the Categorical Data

By default, Pandas infers categories from the data and treats them as unordered. To control the behavior, you can use the CategoricalDtype class from the pandas.api.types module.

Example

This example demonstrates how to apply the CategoricalDtype to a whole DataFrame.

import pandas as pd
from pandas.api.types import CategoricalDtype

# Create a DataFrame 
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Display the Input DataFrame
print('Input DataFrame:\n',df)
print('\nVerify the Data type of each column:\n', df.dtypes)

# Applying CategoricalDtype to a DataFrame
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
df_cat = df.astype(cat_type)

# Display the Input DataFrame
print('\nConverted DataFrame:\n', df_cat)
print('\nVerify the Data type of each column:\n', df_cat.dtypes)

Output

Following is the output of the above code −

Input DataFrame:
    A  B
0  a  b
1  b  c
2  c  c
3  a  d

Verify the Data type of each column:
 A    object
B    object
dtype: object

Converted DataFrame:
    A  B
0  a  b
1  b  c
2  c  c
3  a  d

Verify the Data type of each column:
 A    category
B    category
dtype: object

Converting the Categorical Data Back to Original

After converting a Series to categorical data, you can convert it back to its original form using Series.astype() or np.asarray().

Example

This example converts the categorical data of Series object back to the object data type using the astype() method.

import pandas as pd

# Create Series object with categorical data
s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Display the categorical Series 
print('Series with Categorical Data:\n', s)

# Display the converted Series
print('Converted Series back to original:\n ', s.astype(str))

Output

Following is the output of the above code −

Series with Categorical Data:
 0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
Converted Series back to original:
  0    a
1    b
2    c
3    a
dtype: object
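
Alternatively, np.asarray() extracts the category values as a plain NumPy array:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# np.asarray() returns the values as a NumPy object array
arr = np.asarray(s)
print(arr)         # ['a' 'b' 'c' 'a']
print(arr.dtype)   # object
```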

Describing a Data Column

Using the describe() method on categorical data produces output similar to that for a Series or DataFrame of type string.

Example

The following example demonstrates how to get the description of Pandas categorical DataFrame using the describe() method.

import pandas as pd
import numpy as np

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})

print("Description for whole DataFrame:")
print(df.describe())

print("\nDescription only for a DataFrame column:")
print(df["cat"].describe())

Output

Its output is as follows −

Description for whole DataFrame:
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2

Description only for a DataFrame column:
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

Python Pandas - Ordering & Sorting Categorical Data

In data analysis, we often need to work with categorical data, especially in columns with repeated string values such as country names, gender, or ratings. Categorical data refers to data that can take only a limited number of distinct values. For instance, values like 'India' and 'Australia' in a country column, or "male" and "female" in a gender column, are categorical. These values can also be given an order, allowing for logical sorting.

Categorical data is a Pandas data type used to handle variables with a fixed number of possible values, also known as "categories." This type of data is commonly used in statistical analysis. In this tutorial, we will learn how to order and sort categorical data using Pandas.

Ordering Categorical Data

Ordered categorical data in Pandas has a meaningful order, allowing you to perform operations like sorting, min(), max(), and comparisons. Pandas will raise a TypeError when you try to apply min/max operations to unordered data. The Pandas .cat accessor provides the as_ordered() method to convert a categorical data type into an ordered one.

Example

The following example demonstrates how to create an ordered categorical series using the .cat.as_ordered() method and perform operations such as finding the minimum and maximum values on the ordered categorical series.

import pandas as pd

# Create a categorical series
s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"]).astype(pd.CategoricalDtype())

# Convert the categorical series into ordered using the .cat.as_ordered() method 
s = s.cat.as_ordered()

# Display the ordered categorical series
print('Ordered Categorical Series:\n',s)

# Perform the minimum and maximum operation on ordered categorical series
print('Minimum value of the categorical series:',s.min())
print('Maximum value of the categorical series:', s.max())

Output

Following is the output of the above code −

Ordered Categorical Series: 
0    a
1    b
2    c
3    a
4    a
5    a
6    b
7    b
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

Minimum value of the categorical series: a
Maximum value of the categorical series: c
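
For contrast, applying min() to an unordered categorical series raises the TypeError mentioned above. A brief sketch:

```python
import pandas as pd

# An unordered categorical series
s = pd.Series(["a", "b", "c"], dtype="category")

# min()/max() are undefined without an ordering
try:
    s.min()
except TypeError as e:
    print("TypeError:", e)
```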

Reordering Categories

Pandas allows you to reorder or reset the categories in your categorical data using .cat.reorder_categories() and .cat.set_categories() methods.

  • reorder_categories(): This method is used to reorder the existing categories to a specified new order.

  • set_categories(): This method allows you to define a new set of categories, which may involve adding new categories or removing existing ones.

Example

The following example demonstrates how to reorder categories using both reorder_categories() and set_categories() methods.

import pandas as pd

# Create a categorical series with a specific order
s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

# Reorder categories using reorder_categories
s_reordered = s.cat.reorder_categories(["b", "a", "c"], ordered=True)

print("Reordered Categories:\n", s_reordered)

# Set new categories using set_categories
s_new_categories = s.cat.set_categories(["d", "b", "a", "c"], ordered=True)

print("\nNew Categories Set:\n", s_new_categories)

Output

Following is the output of the above code −

Reordered Categories:
0    b
1    a
2    c
3    a
4    b
dtype: category
Categories (3, object): ['b' < 'a' < 'c']

New Categories Set:
0    b
1    a
2    c
3    a
4    b
dtype: category
Categories (4, object): ['d' < 'b' < 'a' < 'c']

Sorting the Categorical Data

Sorting categorical data refers to arranging data according to the defined order of the categories. For example, if the categories are ordered as ["c", "a", "b"], sorting will arrange the values in that order. If you do not specify an order explicitly, sorting behaves lexically (alphabetically or numerically).

Example

The following example demonstrates how the sorting behaves in Pandas with both unordered and ordered categorical data.

import pandas as pd

# Create a categorical series without any specific order
s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"], dtype="category")

# Sort the categorical series without any predefined order (lexical sorting)
print("Lexical Sorting:\n", s.sort_values())

# Define a custom order for the categories
s = s.cat.set_categories(['c', 'a', 'b'], ordered=True)

# Sort the categorical series with the defined order
print("\nSorted with Defined Category Order:\n", s.sort_values())

Output

Following is the output of the above code −

Lexical Sorting:
0    a
3    a
4    a
5    a
1    b
6    b
7    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Sorted with Defined Category Order:
2    c
0    a
3    a
4    a
5    a
1    b
6    b
7    b
dtype: category
Categories (3, object): ['c' < 'a' < 'b']

Multi-Column Sorting with Categorical Data

If your DataFrame contains a categorical column, it can be sorted together with other columns, and its order will follow the defined categories.

Example

In this example, a DataFrame is created with two categorical columns, "A" and "B". The DataFrame is then sorted first by column "A" based on its categorical order, and then by column "B".

import pandas as pd

# Create a DataFrame with categorical columns
dfs = pd.DataFrame({
    "A": pd.Categorical(["X", "X", "Y", "Y", "X", "Z", "Z", "X"], categories=["Y", "Z", "X"], ordered=True),
    "B": [1, 2, 1, 2, 2, 1, 2, 1]
})

# Sort by multiple columns
sorted_dfs = dfs.sort_values(by=["A", "B"])

print("Sorted DataFrame:\n", sorted_dfs)

Output

Following is the output of the above code −

Sorted DataFrame:
    A  B
2  Y  1
3  Y  2
5  Z  1
6  Z  2
0  X  1
7  X  1
1  X  2
4  X  2

Python Pandas - Comparing Categorical Data

Comparing categorical data is an essential task for getting insights and understanding the relationships between different categories of the data. In Python, Pandas provides various ways to perform comparisons using comparison operators (==, !=, >, >=, <, and <=) on categorical data. These comparisons can be made in three main scenarios −

  • Equality comparison (== and !=).

  • All comparisons (==, !=, >, >=, <, and <=).

  • Comparing categorical data to a scalar value.

It is important to note that any non-equality comparison between categorical data with different categories, or between a categorical Series and a list-like object, will raise a TypeError. This is because the ordering of the categories could be interpreted in two ways: one that takes the ordering into account and one that does not.

In this tutorial, we will learn how to compare categorical data in Python Pandas library using the comparison operators such as ==, !=, >, >=, <, and <=.

Equality comparisons of Categorical Data

In Pandas, comparing categorical data for equality is possible with a variety of objects such as lists, arrays, or Series objects of the same length as the categorical data.

Example

The following example demonstrates how to perform equality and inequality comparisons between categorical Series and list-like objects.

import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np

# Creating a categorical Series
s = pd.Series([1, 2, 1, 1, 2, 3, 1, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

# Creating another categorical Series for comparison
s2 = pd.Series([2, 2, 2, 1, 1, 3, 3, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

# Equality comparison
print("Equality comparison (s == s2):")
print(s == s2)

print("\nInequality comparison (s != s2):")
print(s != s2)

# Equality comparison with a NumPy array
print("\nEquality comparison with NumPy array:")
print(s == np.array([1, 2, 3, 1, 2, 3, 2, 1]))

Output

Following is the output of the above code −

Equality comparison (s == s2):
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
dtype: bool

Inequality comparison (s != s2):
0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
dtype: bool

Equality comparison with NumPy array:
0     True
1     True
2    False
3     True
4     True
5     True
6    False
7    False
dtype: bool

All Comparisons of Categorical Data

Pandas allows you to perform various comparison operations, including >, >=, <, and <=, between ordered categorical data.

Example

This example demonstrates how to perform non-equality comparisons (>, >=, <, <=) on ordered categorical data.

import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np

# Creating a categorical Series
s = pd.Series([1, 2, 1, 1, 2, 3, 1, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

# Creating another categorical Series for comparison
s2 = pd.Series([2, 2, 2, 1, 1, 3, 3, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

# Greater than comparison 
print("Greater than comparison:\n",s > s2)

# Less than comparison 
print("\nLess than comparison:\n",s < s2)

# Greater than or equal to comparison 
print("\nGreater than or equal to comparison:\n",s >= s2)

# Less than or equal to comparison 
print("\nLess than or equal to comparison:\n",s <= s2)

Output

Following is the output of the above code −

Greater than comparison: 
0     True
1    False
2     True
3    False
4    False
5    False
6     True
7    False
dtype: bool

Less than comparison: 
0    False
1    False
2    False
3    False
4     True
5    False
6    False
7    False
dtype: bool

Greater than or equal to comparison: 
0     True
1     True
2     True
3     True
4    False
5     True
6     True
7     True
dtype: bool

Less than or equal to comparison: 
0    False
1     True
2    False
3     True
4     True
5     True
6    False
7     True
dtype: bool

Comparing Categorical Data to Scalars

Categorical data can also be compared to scalar values using all comparison operators (==, !=, >, >=, <, and <=). The categorical values are compared to the scalar based on the order of their categories.

Example

The following example demonstrates how the categorical data can be compared to a scalar value.

import pandas as pd

# Creating a categorical Series
s = pd.Series([1, 2, 3]).astype(pd.CategoricalDtype([3, 2, 1], ordered=True))

# Compare to a scalar
print("Comparing categorical data to a scalar:")
print(s > 2)

Output

Following is the output of the above code −

Comparing categorical data to a scalar:
0     True
1    False
2    False
dtype: bool

Comparing Categorical Data with Different Categories

When comparing two categorical Series that have different categories or orderings, a TypeError will be raised.

Example

The following example demonstrates handling the TypeError raised when comparing two categorical Series objects with different categories or orderings.

import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np

# Creating a categorical Series
s = pd.Series([1, 2, 1, 1, 2, 3, 1, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

# Creating another categorical Series for comparison
s3 = pd.Series([2, 2, 2, 1, 1, 3, 1, 2]).astype(CategoricalDtype(ordered=True))

try:
    print("Attempting to compare differently ordered two Series objects:")
    print(s > s3)
except TypeError as e:
    print("TypeError:", str(e))

Output

Following is the output of the above code −

Attempting to compare differently ordered two Series objects:
TypeError: Categoricals can only be compared if 'categories' are the same.

Python Pandas - Missing Data

Missing data is a common problem in real-life scenarios, particularly in areas like machine learning and data analysis. Missing values can significantly impact the accuracy of models and analyses, making it crucial to address them properly. This tutorial explains how to identify and handle missing data in Python Pandas.

When and Why Is Data Missed?

Consider a scenario where an online survey is conducted for a product. Many a times, people do not share all the information related to them, they might skip some questions, leading to incomplete data. For example, some might share their experience with the product but not how long they have been using it, or vice versa. Missing data is a frequent occurrence in such real-time scenarios, and handling it effectively is essential.

Representing Missing Data in Pandas

Pandas uses different sentinel values to represent missing data (NA or NaN), depending on the data type.

  • numpy.nan: Used for NumPy data types. When missing values are introduced in an integer or boolean array, the array is upcast to np.float64 or object, as NaN is a floating-point value.

  • NaT: Used for missing dates and times in np.datetime64, np.timedelta64, and PeriodDtype. NaT stands for "Not a Time".

  • <NA>: A more flexible missing value representation for StringDtype, Int64Dtype, Float64Dtype, BooleanDtype, and ArrowDtype. This type preserves the original data type when missing values are introduced.

Example

Let us now see how Pandas represents missing data for different data types.

import pandas as pd
import numpy as np

ser1 = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
ser2 = pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
ser3 = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])

df = pd.DataFrame({'NumPy':ser1, 'Dates':ser2, 'Others':ser3} )
print(df)

Output

Its output is as follows −

   NumPy                         Dates  Others
0    1.0 1970-01-01 00:00:00.000000001       1
1    2.0 1970-01-01 00:00:00.000000002       2
2    NaN                           NaT    <NA>

Checking for Missing Values

Pandas provides the isna() and notna() functions to detect missing values, which work across different data types. These functions return a Boolean Series indicating the presence of missing values.

Example

The following example detects missing values using the isna() method.

import pandas as pd
import numpy as np

ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
print(pd.isna(ser))

Output

On executing the above code we will get the following output −

0    False
1     True
dtype: bool

It is important to note that None is also treated as a missing value when using isna() and notna().
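
For instance, both None and np.nan register as missing in a float Series:

```python
import pandas as pd
import numpy as np

# None is converted to NaN when placed in a float Series
s = pd.Series([1, None, np.nan, 4], dtype=float)

print(pd.isna(s).tolist())    # [False, True, True, False]
print(s.notna().tolist())     # [True, False, False, True]
```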

Calculations with Missing Data

When performing calculations with missing data, Pandas skips NA values by default; for summation, NA is treated as zero. Note that operations like mean() return NA when all the data are NA, while sum() of an all-NA column returns 0.

Example

This example calculates the sum of the values in the DataFrame's "one" column, which contains missing data.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'].sum())

Output

Its output is as follows −

2.02357685917
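
Because the example above uses random data, the exact sum will differ between runs. The skipna parameter (default True) controls whether NA values are excluded; a deterministic sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

print(s.sum())                # 4.0 -- NaN skipped by default
print(s.sum(skipna=False))    # nan

# An all-NA column: sum() gives 0, mean() gives NaN
print(pd.Series([np.nan, np.nan]).sum())    # 0.0
print(pd.Series([np.nan, np.nan]).mean())   # nan
```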

Replacing/Filling Missing Data

Pandas provides several methods to handle missing data. One common approach is to replace missing values with a specific value using the fillna() method.

Example

The following program shows how you can replace NaN with a scalar value ("NaN" with "0") using the fillna() method.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c'])

print("Input DataFrame:\n",df)
print("Resultant DataFrame after NaN replaced with '0':")
print(df.fillna(0))

Output

Its output is as follows −

Input DataFrame:
         one       two     three
a  0.542556 -0.241177 -0.040763
b       NaN       NaN       NaN
c  0.156530 -1.760620 -1.515743
Resultant DataFrame after NaN replaced with '0':
        one       two     three
a  0.542556 -0.241177 -0.040763
b  0.000000  0.000000  0.000000
c  0.156530 -1.760620 -1.515743

Drop Missing Values

If you want to simply exclude the missing values instead of replacing them, use the dropna() function to drop the missing values.

Example

This example removes the missing values using the dropna() function.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())

Output

Its output is as follows −

        one       two     three
a -1.261841  0.150864  0.399744
c -0.431818  1.861201  0.400416
e  0.413045  0.054227  1.849954
f -1.217879  1.115346  1.558122
h  1.006885 -1.017327  0.777561

Python Pandas - Filling Missing Data

Filling missing data is a process of replacing the missing (NaN) values with meaningful alternatives. Whether you want to replace missing values with a constant value, or propagate the values forward or backward, Pandas has built-in functions to achieve this.

In this tutorial, we'll learn different ways to fill missing data in Pandas including −

  • Replacing missing values with a scalar.

  • Forward and backward filling.

  • Using a specified limit for filling.

  • Replacing Data with the replace() method.

  • Replacing values with regular expressions.

Filling Missing Data with Scalar Value

The fillna() method in Pandas is used to fill missing values (NA or NaN) with a scalar value, such as a specific number.

Example

The following example demonstrates how to fill missing values (NaN) with the scalar value 5 using the fillna() method.

import pandas as pd
import numpy as np

# Create DataFrame with missing values
data = {"Col1": [3, np.nan, np.nan, 2], "Col2": [1.0, pd.NA, pd.NA, 2.0]}
df = pd.DataFrame(data)

# Display the original DataFrame with missing values
print("Original DataFrame:\n",df)

# Fill missing values with 5
df_filled = df.fillna(5)
print("\nResultant DataFrame after NaN replaced with '5':\n", df_filled)

Output

Its output is as follows −

Original DataFrame:
    Col1  Col2
0   3.0   1.0
1   NaN  <NA>
2   NaN  <NA>
3   2.0   2.0

Resultant DataFrame after NaN replaced with '5':
    Col1  Col2
0   3.0   1.0
1   5.0   5.0
2   5.0   5.0
3   2.0   2.0
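
fillna() also accepts a dictionary mapping column names to fill values, so each column can receive its own replacement. A brief sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Col1": [3.0, np.nan, 2.0], "Col2": [1.0, np.nan, 2.0]})

# Fill each column with a different value (here, 0 and the column mean)
filled = df.fillna({"Col1": 0, "Col2": df["Col2"].mean()})
print(filled["Col1"].tolist())   # [3.0, 0.0, 2.0]
print(filled["Col2"].tolist())   # [1.0, 1.5, 2.0]
```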

Filling Missing Values Forward or Backward

You can also propagate the last valid observation forward or backward to fill gaps using the ffill() and bfill() methods respectively.

Sr.No Method & Action
1

ffill()

This method fills missing values with the previous valid value.

2

bfill()

This method fills missing values with the next valid value.

Example: Forward Fill

This example fills missing values using the forward-fill ffill() method.

import pandas as pd
import numpy as np

# Create DataFrame with missing values
df = pd.DataFrame([[9, -3, -2], [-5, 1, 8], [6, 4, -8]],
                  index=['a', 'c', 'd'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e'])
 
# Display the original DataFrame with missing values
print("Original DataFrame:\n",df)

# Forward Fill the missing values
result = df.ffill()
print("\nResultant DataFrame after Forward fill:\n", result)

Output

Its output is as follows −

Original DataFrame:
    one  two  three
a  9.0 -3.0   -2.0
b  NaN  NaN    NaN
c -5.0  1.0    8.0
d  6.0  4.0   -8.0
e  NaN  NaN    NaN

Resultant DataFrame after Forward fill:
    one  two  three
a  9.0 -3.0   -2.0
b  9.0 -3.0   -2.0
c -5.0  1.0    8.0
d  6.0  4.0   -8.0
e  6.0  4.0   -8.0

Example: Backward Fill

This example fills missing values using the backward-fill bfill() method.

import pandas as pd
import numpy as np

# Create DataFrame with missing values
df = pd.DataFrame([[9, -3, -2], [-5, 1, 8], [6, 4, -8]],
                  index=['a', 'c', 'd'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e'])
 
# Display the original DataFrame with missing values
print("Original DataFrame:\n",df)

# Backward Fill the missing values
result = df.bfill()
print("\nResultant DataFrame after Backward fill:\n", result)

Output

Its output is as follows −

Original DataFrame:
    one  two  three
a  9.0 -3.0   -2.0
b  NaN  NaN    NaN
c -5.0  1.0    8.0
d  6.0  4.0   -8.0
e  NaN  NaN    NaN

Resultant DataFrame after Backward fill:
    one  two  three
a  9.0 -3.0   -2.0
b -5.0  1.0    8.0
c -5.0  1.0    8.0
d  6.0  4.0   -8.0
e  NaN  NaN    NaN

Limiting the Number of Fills

You can also control how many consecutive missing values are filled by specifying the limit parameter.

Example

The following example demonstrates how to set a limit for filling missing values using the ffill() method with the limit parameter.

import pandas as pd
import numpy as np

# Create DataFrame with missing values
df = pd.DataFrame([[9, -3, -2], [-5, 1, 8], [6, 4, -8]],
                  index=['a', 'c', 'd'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'd', 'e', 'f'])
 
# Display the original DataFrame with missing values
print("Original DataFrame:\n",df)

# Forward Fill the missing values with limit
result = df.ffill(limit=1)
print("\nResultant DataFrame after Forward fill:\n", result)

Output

Following is the output of the above code −

Original DataFrame:
    one  two  three
a  9.0 -3.0   -2.0
b  NaN  NaN    NaN
d  6.0  4.0   -8.0
e  NaN  NaN    NaN
f  NaN  NaN    NaN

Resultant DataFrame after Forward fill:
    one  two  three
a  9.0 -3.0   -2.0
b  9.0 -3.0   -2.0
d  6.0  4.0   -8.0
e  6.0  4.0   -8.0
f  NaN  NaN    NaN
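
The limit parameter works the same way with backward fill. The sketch below, using hypothetical row labels, back-fills at most one consecutive missing row per gap:

```python
import pandas as pd
import numpy as np

# DataFrame with two consecutive missing rows at the top
df = pd.DataFrame([[9, -3, -2], [-5, 1, 8], [6, 4, -8]],
                  index=['a', 'c', 'd'], columns=['one', 'two', 'three'])
df = df.reindex(['x', 'y', 'a', 'c', 'd'])

# Backward fill at most one consecutive NaN per gap
result = df.bfill(limit=1)
print(result)
```

Row 'y' is filled from row 'a', while row 'x' stays NaN because the gap is two rows long and limit=1 allows only one fill.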

Replacing Data with the replace() method

Many times, we have to replace a generic value with some specific value. We can achieve this by applying the replace() method.

Replacing NA with a scalar value is equivalent to the behavior of the fillna() function.
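
A minimal sketch of this equivalence, using a small hypothetical Series:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

# replace() with a scalar target and fillna() give the same result here
print(s.replace(np.nan, 0))
print(s.fillna(0))
```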

Example

Here is the example of replacing the generic values using the replace() method.

import pandas as pd
import numpy as np

# Create DataFrame 
df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})

# Replace the generic values
print(df.replace({1000:10,2000:60}))

Output

Its output is as follows −

   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
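
The replace() method also accepts a nested dictionary to restrict a replacement to a single column. A small sketch (the column names and values here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'one': [10, 2000], 'two': [1000, 2000]})

# Replace 2000 with 60 only in column 'one'; column 'two' is left untouched
result = df.replace({'one': {2000: 60}})
print(result)
```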

Replacing Missing Data Using Regular Expressions

You can also use regex patterns with the replace() method to replace values that represent missing data, such as placeholder characters.

Example

Here is an example of replacing specific data using a regular expression with the replace() method.

import pandas as pd
import numpy as np

# Create DataFrame with missing values
df = pd.DataFrame({"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]})

# Display the original DataFrame with missing values
print("Original DataFrame:\n",df)

# Replace the missing values with regular exp
result = df.replace(r"\.", 10, regex=True)

print("\nResultant DataFrame after filling the missing values using regex:\n", result)

Output

Its output is as follows −

Original DataFrame:
    a  b    c
0  0  a    a
1  1  b    b
2  2  .  NaN
3  3  .    d

Resultant DataFrame after filling the missing values using regex:
    a   b    c
0  0   a    a
1  1   b    b
2  2  10  NaN
3  3  10    d
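
A common variation is to convert the placeholder into a real NaN first and then apply one of the filling methods covered earlier. A sketch, assuming '.' marks missing entries:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"b": list("ab..")})

# Turn the '.' placeholder into a real NaN, then forward fill the gap
result = df.replace(r"^\.$", np.nan, regex=True).ffill()
print(result)
```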

Python Pandas - Interpolation of Missing Values

Interpolation is a powerful technique in Pandas used for handling missing values in a dataset. It estimates the missing values based on the other data points in the dataset. Pandas provides the interpolate() method for both DataFrame and Series objects to fill in missing values using various interpolation methods.

In this tutorial, we will learn how to use the interpolate() method to fill missing values in time series data, numeric data, and more, using different interpolation methods.

Basic Interpolation

The Pandas interpolate() method, available on both DataFrame and Series objects, fills the missing values using different interpolation strategies. By default, it uses linear interpolation.

Example

Here is a basic example of calling the interpolate() method for filling the missing values.

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.1, np.nan, 3.5, np.nan, np.nan, np.nan, 6.2, 7.9],
"B": [0.25, np.nan, np.nan, 4.7, 10, 14.7, 1.3, 9.2],
})

print("Original DataFrame:")
print(df)

# Using the  interpolate() method
result = df.interpolate()
print("\nResultant DataFrame after applying the interpolation:")
print(result)

Output

Following is the output of the above code −

Original DataFrame:
     A      B
0  1.1   0.25
1  NaN    NaN
2  3.5    NaN
3  NaN   4.70
4  NaN  10.00
5  NaN  14.70
6  6.2   1.30
7  7.9   9.20

Resultant DataFrame after applying the interpolation:
       A          B
0  1.100   0.250000
1  2.300   1.733333
2  3.500   3.216667
3  4.175   4.700000
4  4.850  10.000000
5  5.525  14.700000
6  6.200   1.300000
7  7.900   9.200000

Different Interpolating Methods

Pandas supports several interpolation methods, including linear, polynomial, pchip, akima, spline, and more. Most of the non-linear methods require the SciPy library to be installed. These methods provide flexibility for filling the missing values depending on the nature of your data.

Example

The following example demonstrates using the interpolate() method with the barycentric interpolation technique.

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.1, np.nan, 3.5, np.nan, np.nan, np.nan, 6.2, 7.9],
"B": [0.25, np.nan, np.nan, 4.7, 10, 14.7, 1.3, 9.2],
})

print("Original DataFrame:")
print(df)

# Applying the interpolate() with Barycentric method
result = df.interpolate(method='barycentric')

print("\nResultant DataFrame after applying the interpolation:")
print(result)

Output

Following is the output of the above code −

Original DataFrame:
     A      B
0  1.1   0.25
1  NaN    NaN
2  3.5    NaN
3  NaN   4.70
4  NaN  10.00
5  NaN  14.70
6  6.2   1.30
7  7.9   9.20

Resultant DataFrame after applying the interpolation:
          A          B
0  1.100000   0.250000
1  2.596429  57.242857
2  3.500000  24.940476
3  4.061429   4.700000
4  4.531429  10.000000
5  5.160714  14.700000
6  6.200000   1.300000
7  7.900000   9.200000

Handling Limits in Interpolation

By default, Pandas interpolation fills all the missing values, but you can limit how many consecutive NaN values are filled using the limit parameter of the interpolate() method.

Example

The following example demonstrates filling the missing values of a Pandas DataFrame by limiting the consecutive fills using the limit parameter of the interpolate() method.

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.1, np.nan, 3.5, np.nan, np.nan, np.nan, 6.2, 7.9],
"B": [0.25, np.nan, np.nan, 4.7, 10, 14.7, 1.3, 9.2],
})

print("Original DataFrame:")
print(df)

# Applying the interpolate() with limit
result = df.interpolate(method='spline', order=2, limit=1)

print("\nResultant DataFrame after applying the interpolation:")
print(result)

Output

Following is the output of the above code −

Original DataFrame:
     A      B
0  1.1   0.25
1  NaN    NaN
2  3.5    NaN
3  NaN   4.70
4  NaN  10.00
5  NaN  14.70
6  6.2   1.30
7  7.9   9.20

Resultant DataFrame after applying the interpolation:
          A          B
0  1.100000   0.250000
1  2.231383  -1.202052
2  3.500000        NaN
3  4.111529   4.700000
4       NaN  10.000000
5       NaN  14.700000
6  6.200000   1.300000
7  7.900000   9.200000
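
Alongside limit, the interpolate() method accepts a limit_direction parameter ('forward', 'backward', or 'both') that controls which side of a gap is filled. A minimal sketch with a hypothetical Series:

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Fill gaps in both directions, so leading and trailing NaN are filled too
result = s.interpolate(limit_direction='both')
print(result)
```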

Interpolating Time Series Data

Interpolation can be applied to the Pandas time series data as well. It is useful when filling gaps in missing data points over time.

Example

The following example applies time-based interpolation to a Series with a DatetimeIndex, using method='time' so that the fill accounts for the spacing between timestamps.

import numpy as np
import pandas as pd

indx = pd.date_range("2024-01-01", periods=10, freq="D")
data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64)
s = pd.Series(data, index=indx)
s.iloc[[1, 2, 5, 6, 9]] = np.nan

print("Original Series:")
print(s)

result = s.interpolate(method="time")

print("\nResultant Time Series after applying the interpolation:")
print(result)

Output

Following is the output of the above code −

Original Series:
2024-01-01    8.0
2024-01-02    NaN
2024-01-03    NaN
2024-01-04    2.0
2024-01-05    4.0
2024-01-06    NaN
2024-01-07    NaN
2024-01-08    0.0
2024-01-09    3.0
2024-01-10    NaN
Freq: D, dtype: float64

Resultant Time Series after applying the interpolation:
2024-01-01    8.000000
2024-01-02    6.000000
2024-01-03    4.000000
2024-01-04    2.000000
2024-01-05    4.000000
2024-01-06    2.666667
2024-01-07    1.333333
2024-01-08    0.000000
2024-01-09    3.000000
2024-01-10    3.000000
Freq: D, dtype: float64

Python Pandas - Dropping Missing Values

Missing data is a common issue when working with real-world datasets. The Python Pandas library provides an easy way for removing rows or columns that contain missing values (NaN or NaT) from a dataset using the dropna() method.

The dropna() method in Pandas is a useful tool to handle missing data by dropping rows or columns based on your specific requirements. In this tutorial, we will learn how to use dropna() to clean your dataset by dropping missing data based on various conditions.

The dropna() Method

The Pandas dropna() method allows you to remove missing values from Pandas data structures such as Series and DataFrame objects. It offers several options to customize how rows or columns are dropped based on the presence of NaN values. This method returns a new Pandas object with the missing data dropped, or None if the inplace parameter is set to True.

Syntax

Following is the syntax −

DataFrame.dropna(*, axis=0, how=<no_default>, thresh=<no_default>, subset=None, inplace=False, ignore_index=False)

Where,

  • axis: 0 or 'index' (default) to drop rows; 1 or 'columns' to drop columns.

  • how: By default it is set to 'any', which drops a row or column if any missing value is present. If set to 'all', it drops a row or column only if all of its values are missing.

  • thresh: Require a minimum number of non-NA values to retain the row or column.

  • subset: List of specific columns (if dropping rows) or rows (if dropping columns) to consider.

  • inplace: Modify the DataFrame in place (default is False).

  • ignore_index: Reset the index of the result (default is False).

Let's explore how the dropna() method drops missing data based on various conditions.

Drop Rows with Any Missing Values

By default, the dropna() method removes rows where any missing values are present.

Example

The following example uses the dropna() method to drop the rows that have any missing values.

import pandas as pd
import numpy as np

dataset = {"Student_name": ["Ajay", "Krishna", "Deepak", "Swati"], "Roll_number": [23, 45, np.nan, 18],
           "Major_Subject": ["Maths", "Physics", "Arts", "Political science"], "Marks": [57, np.nan, 98, np.nan]}

df = pd.DataFrame(dataset, index= [1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop the rows that have any missing values
df_cleaned = df.dropna()
print('\nResultant DataFrame after removing row:\n',df_cleaned)

Output

Following is the output of the above code −

Original DataFrame:
  Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing row:
   Student_name  Roll_number Major_Subject  Marks
1         Ajay         23.0         Maths   57.0

Drop Rows Where All Values Are Missing

To drop rows where all values are missing, set the how='all' parameter of the dropna() method.

Example

The following example demonstrates how to drop the rows where all values are missing in a DataFrame.

import pandas as pd
import numpy as np

dataset = {"Student name": ["Ajay", np.nan, "Deepak", "Swati"], 
"Roll number": [23, np.nan, np.nan, 18],
"Major Subject": ["Maths", np.nan, "Arts", "Political science"], 
"Marks": [57, np.nan, 98, np.nan]}

df = pd.DataFrame(dataset, index= [1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop rows where all values are missing
result = df.dropna(how='all')
print('\nResultant DataFrame after removing row:\n',result)

Output

Following is the output of the above code −

Original DataFrame:
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
2          NaN          NaN                NaN    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing row:
   Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Keep Rows with a Minimum Number of Missing Values

The Pandas dropna() method provides the thresh parameter to specify the minimum number of non-NA values required to keep a row or column.

Example

This example demonstrates how to keep only the rows that have a minimum number of non-missing values.

import pandas as pd
import numpy as np

dataset = {"Student name": ["Ajay", "Krishna", "Deepak", "Swati"], 
"Roll number": [23, np.nan, np.nan, 18],
"Major Subject": ["Maths", np.nan, "Arts", "Political science"], 
"Marks": [57, np.nan, 98, np.nan]}

df = pd.DataFrame(dataset, index= [1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop the rows with a threshold 
result = df.dropna(thresh=2)
print('\nResultant DataFrame after removing row:\n',result)

Output

Following is the output of the above code −

Original DataFrame:
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna          NaN                NaN    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing row:
   Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Drop Columns with Any Missing Values

To drop columns that contain any missing values, we can use the axis parameter of the dropna() method to select the columns.

Example

This example shows how the dropna() method removes an entire column where any of the values are missing.

import pandas as pd
import numpy as np

dataset = {"Student_name": ["Ajay", "Krishna", "Deepak", "Swati"], 
"Roll_number": [23, 45, np.nan, 18],
"Major_Subject": ["Maths", "Physics", "Arts", "Political science"], 
"Marks": [57, np.nan, 98, np.nan]}

df = pd.DataFrame(dataset, index= [1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop column with any missing values
result = df.dropna(axis='columns')
print('\nResultant DataFrame after removing columns:\n',result)

Output

Following is the output of the above code −

Original DataFrame:
  Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing columns:
   Student_name      Major_Subject
1         Ajay              Maths
2      Krishna            Physics
3       Deepak               Arts
4        Swati  Political science

Drop Rows Based on Missing Data in Specific Columns

You can use the subset parameter of the dropna() method to consider only particular columns while dropping rows where data is missing.

Example

This example shows how to remove the rows based on missing data present in the specific column using the subset parameter of the dropna() method.

import pandas as pd
import numpy as np

dataset = {"Student_name": ["Ajay", "Krishna", "Deepak", "Swati"], 
"Roll_number": [23, 45, np.nan, 18],
"Major_Subject": ["Maths", "Physics", np.nan, "Political science"], 
"Marks": [57, np.nan, 98, np.nan]}

df = pd.DataFrame(dataset, index= [1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop Rows Based on Missing Data in Specific Columns
result = df.dropna(subset=['Roll_number', 'Major_Subject'])
print('\nResultant DataFrame after removing rows:\n',result)

Output

Following is the output of the above code −

Original DataFrame:
  Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN                NaN   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing rows:
   Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
4        Swati         18.0  Political science    NaN
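
The thresh and axis parameters can also be combined to drop sparse columns, keeping only those with enough non-NA values. A sketch with hypothetical columns:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Roll_number": [23, 45, np.nan, 18],
                   "Marks": [57, np.nan, 98, np.nan]})

# Keep only columns that have at least 3 non-NA values
result = df.dropna(axis='columns', thresh=3)
print(result)
```

'Roll_number' has three non-NA values and is kept; 'Marks' has only two and is dropped.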

Python Pandas - Calculations with Missing Data

When working with data, you will often come across missing values, which are represented as NaN (Not a Number) in Pandas. Calculations with missing values require extra attention, since NaN values propagate through most arithmetic operations and may alter the results.

Pandas offers flexible ways to manage missing data during calculations, allowing you to control how these values affect your results. In this tutorial, we will learn how Pandas handles missing data during calculations, including arithmetic operations, descriptive statistics, and cumulative operations.

Arithmetic Operations with Missing Data

When performing arithmetic operations between Pandas objects, missing values (NaN) are propagated by default. For example, when you add two Series with NaN values, the result will have NaN wherever a value was missing in either series.

Example

The following example demonstrates performing the arithmetic operations between two series objects with missing values.

import pandas as pd
import numpy as np

# Create 2 input series objects
ser1 = pd.Series([1, np.nan, np.nan, 2])
ser2 = pd.Series([2, np.nan, 1, np.nan])

# Display the series
print("Input Series 1:\n",ser1)
print("\nInput Series 2:\n",ser2)

# Adding two series with NaN values
result = ser1 + ser2
print('\nResult After adding Two series:\n',result)

Output

Following is the output of the above code −

Input Series 1:
 0    1.0
1    NaN
2    NaN
3    2.0
dtype: float64

Input Series 2:
 0    2.0
1    NaN
2    1.0
3    NaN
dtype: float64

Result After adding Two series:
 0    3.0
1    NaN
2    NaN
3    NaN
dtype: float64
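
If you want a missing value on one side to be treated as a neutral element instead of propagating, the flexible arithmetic methods (add(), sub(), and so on) accept a fill_value parameter. A sketch using the same two series:

```python
import pandas as pd
import numpy as np

ser1 = pd.Series([1, np.nan, np.nan, 2])
ser2 = pd.Series([2, np.nan, 1, np.nan])

# A NaN on one side is replaced with 0 before adding;
# positions where both sides are NaN remain NaN
result = ser1.add(ser2, fill_value=0)
print(result)
```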

Handling Missing Data in Descriptive Statistics

The Pandas library provides several methods for computing descriptive statistics, such as summing, calculating the product, or finding the cumulative sum or product. These methods are designed to handle missing data efficiently.

Example: Summing with Missing Values

When summing data with missing values, NaN values are excluded. This allows you to calculate meaningful totals even when some data is missing.

The following example performs a summing operation on a DataFrame column using the sum() method. By default, NaN values are skipped in the summation.

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {'A': [np.nan, 2, np.nan, 4], 'B': [5, 6, 7, 8]}
df = pd.DataFrame(data)

# Display the input DataFrame
print("Input DataFrame:\n", df)

# Summing a column with NaN values
result = df['A'].sum()

print('\nResult After Summing the values of a column:\n',result)

Output

Following is the output of the above code −

Input DataFrame:
      A  B
0  NaN  5
1  2.0  6
2  NaN  7
3  4.0  8

Result After Summing the values of a column:
 6.0

Example: Product Calculation with Missing Values

Similar to summing, when calculating the product of values, missing data (NaN) is treated as 1. This ensures that missing values do not alter the final product.

The following example uses the Pandas df.prod() method to calculate the product of the values in each column of a DataFrame.

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {'A': [np.nan, 2, np.nan, 4], 'B': [5, 6, np.nan, np.nan]}
df = pd.DataFrame(data)

# Display the input DataFrame
print("Input DataFrame:\n", df)

# Product with NaN values
result = df.prod()

print('\nProduct of the values in each column:\n',result)

Output

Following is the output of the above code −

Input DataFrame:
      A    B
0  NaN  5.0
1  2.0  6.0
2  NaN  NaN
3  4.0  NaN

Product of the values in each column:
 A     8.0
B    30.0
dtype: float64

Cumulative Operations with Missing Data

Pandas provides cumulative methods like cumsum() and cumprod() to generate running totals or products. By default, these methods ignore missing values but preserve them in the output. If you want to include the missing data in the calculation, you can set the skipna parameter to False.

Example: Cumulative Sum with Missing Values

The following example demonstrates calculating the cumulative sum of a DataFrame with missing values using the df.cumsum() method.

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {'A': [np.nan, 2, np.nan, 4], 'B': [5, 6, np.nan, np.nan]}
df = pd.DataFrame(data)

# Display the input DataFrame
print("Input DataFrame:\n", df)

# Calculate cumulative sum by ignoring NaN
print('Cumulative sum by ignoring NaN:\n',df.cumsum())

Output

Following is the output of the above code −

Input DataFrame:
      A    B
0  NaN  5.0
1  2.0  6.0
2  NaN  NaN
3  4.0  NaN
Cumulative sum by ignoring NaN:
      A     B
0  NaN   5.0
1  2.0  11.0
2  NaN   NaN
3  6.0   NaN

From the above output, you can observe that the missing values are skipped, and the cumulative sum is computed only for the available values.

Example: Including NaN in Cumulative Sum

This example shows how the cumulative sum is performed including the missing values, using the df.cumsum() method with skipna=False.

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {'A': [np.nan, 2, np.nan, 4], 'B': [5, 6, np.nan, np.nan]}
df = pd.DataFrame(data)

# Display the input DataFrame
print("Input DataFrame:\n", df)

# Calculate the cumulative sum by preserving NaN
print('Cumulative sum by including NaN:\n', df.cumsum(skipna=False))

Output

Following is the output of the above code −

Input DataFrame:
      A    B
0  NaN  5.0
1  2.0  6.0
2  NaN  NaN
3  4.0  NaN
Cumulative sum by including NaN:
     A     B
0 NaN   5.0
1 NaN  11.0
2 NaN   NaN
3 NaN   NaN

With skipna=False, the cumulative sum stops when it encounters a NaN value, and all subsequent values also become NaN.
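
The cumprod() method behaves the same way. A sketch on the same data, with missing values skipped but preserved in the output:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [np.nan, 2, np.nan, 4], 'B': [5, 6, np.nan, np.nan]})

# Running product; NaN entries are skipped but kept in the result
result = df.cumprod()
print(result)
```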

Python Pandas - Duplicated Data

Duplicated data refers to rows in a dataset that appear more than once. Duplicate data can occur due to various reasons such as data collection errors, repeated records, or merging datasets. Identifying and removing duplicates is an essential task in Data Preprocessing and Data Analysis to avoid incorrect results.

Consider this sample dataset containing student names and their dates of birth −

Student    Date of Birth
Rahul      01 December 2017
Raj        14 April 2018
Rahul      01 December 2017

In this dataset, the first and last rows contain repeated values, indicating that "Rahul" is a duplicate entry.

Pandas provides two primary methods to detect and remove duplicate rows in a DataFrame −

  • duplicated(): Identifies duplicate rows and returns a Boolean mask, where True indicates a duplicate entry.

  • drop_duplicates(): Removes duplicate rows from the DataFrame while keeping the first occurrence by default.

In this tutorial, we will learn how to identify duplicates, check for duplicates in specific columns, and remove them using Pandas.

Identifying Duplicates in a DataFrame

Pandas DataFrame.duplicated() method is used to identify duplicate rows in a DataFrame. By default, it considers all columns to identify duplicates and marks them as True, except for the first occurrence.

This method returns a Boolean Series indicating whether a row is duplicated, where −

  • False: The row is not a duplicate (i.e., it's the first occurrence).

  • True: The row is a duplicate of another row in the DataFrame.

Example

The following example demonstrates how to identify duplicate rows in a Pandas DataFrame using duplicated() method.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
'Name': ['Rahul', 'Raj', 'Rahul'],
'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017']})

print("Original DataFrame:")
print(df)

# Find duplicates in the DataFrame
result = df.duplicated()

# Display the resultant Duplicates
print('\nResult after finding the duplicates:')
print(result)

Output

Following is the output of the above code −

Original DataFrame:
    Name     Date_of_Birth
0  Rahul  01 December 2017
1    Raj     14 April 2018
2  Rahul  01 December 2017

Result after finding the duplicates:
0    False
1    False
2     True
dtype: bool

In the example, the third row is marked as a duplicate since it has the same values as the first row.
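
The duplicated() method also accepts a keep parameter that controls which occurrence is marked. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017']})

# keep='last' marks the first occurrence as the duplicate
print(df.duplicated(keep='last'))

# keep=False marks every occurrence of a duplicated row
print(df.duplicated(keep=False))
```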

Identifying Duplicates on Specific Columns

To find duplicates based on specific columns, use the subset parameter of the duplicated() method.

Example

The following example demonstrates how to identify the duplicate values on a specific column using the subset parameter of the duplicated() method.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017', '14 July 2000', '26 May 2000', '14 July 2000'],
'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

print("Original DataFrame:")
print(df)

# Find duplicates in the DataFrame
result = df.duplicated(subset=['Name', 'City'])

# Display the resultant Duplicates
print('\nResult after finding the duplicates:')
print(result) 

Output

Following is the output of the above code −

Original DataFrame:
      Name     Date_of_Birth       City
0    Rahul  01 December 2017  Hyderabad
1      Raj     14 April 2018    Chennai
2    Rahul  01 December 2017    Kolkata
3  Karthik      14 July 2000  Hyderabad
4     Arya       26 May 2000    Chennai
5  Karthik      14 July 2000  Hyderabad

Result after finding the duplicates:
0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

Removing Duplicates

The drop_duplicates() method is used to remove duplicate rows from the DataFrame. By default, it considers all columns and keeps the first occurrence of each duplicated row, while removing the rest.

Example

This example removes the duplicate rows from a Pandas DataFrame using the drop_duplicates() method.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017', '14 July 2000', '26 May 2000', '14 July 2000'],
'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

print("Original DataFrame:")
print(df)

# Drop duplicates in the DataFrame

result = df.drop_duplicates()

# Display the resultant DataFrame
print('\nResult after removing the duplicates:')
print(result) 

Output

Following is the output of the above code −

Original DataFrame:
      Name     Date_of_Birth       City
0    Rahul  01 December 2017  Hyderabad
1      Raj     14 April 2018    Chennai
2    Rahul  01 December 2017    Kolkata
3  Karthik      14 July 2000  Hyderabad
4     Arya       26 May 2000    Chennai
5  Karthik      14 July 2000  Hyderabad

Result after removing the duplicates:
      Name     Date_of_Birth       City
0    Rahul  01 December 2017  Hyderabad
1      Raj     14 April 2018    Chennai
2    Rahul  01 December 2017    Kolkata
3  Karthik      14 July 2000  Hyderabad
4     Arya       26 May 2000    Chennai
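
The drop_duplicates() method supports the same keep parameter; with keep=False, every row that has a duplicate is removed rather than keeping one copy. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017']})

# keep=False drops every occurrence of a duplicated row
result = df.drop_duplicates(keep=False)
print(result)
```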

Removing Duplicates in Specific Columns

You can also remove duplicates based on specific columns using the subset parameter of the drop_duplicates() method.

Example

This example removes the duplicate data of a DataFrame based on specific columns using the subset parameter of the drop_duplicates() method.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017', '14 July 2000', '26 May 2000', '14 July 2000'],
'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

print("Original DataFrame:")
print(df)

# Drop duplicates in the DataFrame
result = df.drop_duplicates(subset=['Date_of_Birth'])

# Display the resultant DataFrame
print('\nResult after removing the duplicates:')
print(result) 

Output

Following is the output of the above code −

Original DataFrame:
      Name     Date_of_Birth       City
0    Rahul  01 December 2017  Hyderabad
1      Raj     14 April 2018    Chennai
2    Rahul  01 December 2017    Kolkata
3  Karthik      14 July 2000  Hyderabad
4     Arya       26 May 2000    Chennai
5  Karthik      14 July 2000  Hyderabad

Result after removing the duplicates:
      Name     Date_of_Birth       City
0    Rahul  01 December 2017  Hyderabad
1      Raj     14 April 2018    Chennai
3  Karthik      14 July 2000  Hyderabad
4     Arya       26 May 2000    Chennai

Python Pandas - Counting and Retrieving Unique Elements

When working with real-world data, we often encounter duplicated entries, which are rows or entities in a dataset that appear more than once. Duplicate data can occur due to various reasons, such as data collection errors, repeated records, or merging datasets. Counting and identifying unique elements in duplicated data is an essential task in data preprocessing and data analysis.

Pandas provides several methods for handling unique values, including −

  • nunique(): Counts the number of distinct values in each column or row.

  • value_counts(): Returns the frequency of each unique value in an object.

  • unique(): Retrieves unique values based on a hash table.

In this tutorial, we will explore how to count and retrieve unique elements in a Pandas DataFrame.

Counting Unique Elements

The Pandas DataFrame.nunique() method is used to count the number of distinct elements along a specific axis of a DataFrame. It can be applied row-wise or column-wise, and it can optionally ignore NaN values.

Syntax

Following is the syntax −

DataFrame.nunique(axis=0, dropna=True)

Where,

  • axis: Defines whether to count unique elements row-wise (axis=1) or column-wise (axis=0, default).

  • dropna: When set to True (default), it ignores NaN values in the counts.

Example: Counting Unique Elements Column-wise

Here is a basic example demonstrating the working of the DataFrame.nunique() method for counting the column-wise unique values of a Pandas DataFrame.

import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({'A': [4, 5, 6],'B': [4, 1, 1]})

# Display the Original DataFrame
print("Original DataFrame:")
print(df)

# Counting unique values column-wise
result = df.nunique()
print('Column wise count of the unique elements:\n', result)

Output

Following is the output of the above code −

Original DataFrame:
   A  B
0  4  4
1  5  1
2  6  1
Column wise count of the unique elements:
 A    3
B    2
dtype: int64

Here, column A has 3 unique values, while column B has 2 unique values.

Example: Counting Unique Values Row-wise

This example demonstrates the working of the DataFrame.nunique() method for counting the row-wise unique values of a Pandas DataFrame, by setting the axis parameter to 1.

import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({'A': [4, 5, 6],'B': [4, 1, 1]})

# Display the Original DataFrame
print("Original DataFrame:")
print(df)

# Counting unique values row-wise
result = df.nunique(axis=1)
print('\nRow wise count of the unique elements:\n', result)

Output

Following is the output of the above code −

Original DataFrame:
   A  B
0  4  4
1  5  1
2  6  1

Row wise count of the unique elements:
 0    1
1    2
2    2
dtype: int64

Count Unique Values using the value_counts()

The Pandas value_counts() method provides more detailed information on unique values. It returns a Series containing the counts of unique values, sorted in descending order.

Example

This example uses the DataFrame.value_counts() method for counting the unique values in a DataFrame column.

import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({'A': [4, 5, 6],'B': [4, 1, 1]})

# Display the Original DataFrame
print("Original DataFrame:")
print(df)

# Count the frequency of unique values in column 'B'
result = df['B'].value_counts()

print('\nThe unique values:')
print(result)

Output

Following is the output of the above code −

Original DataFrame:
   A  B
0  4  4
1  5  1
2  6  1

The unique values:
B
1    2
4    1
Name: count, dtype: int64
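
The value_counts() method can also return relative frequencies instead of raw counts by passing normalize=True. A sketch on the same column:

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 5, 6], 'B': [4, 1, 1]})

# normalize=True divides each count by the total number of rows
print(df['B'].value_counts(normalize=True))
```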

Retrieving Unique Elements

The pandas.unique() function is used to get an array of unique values from a single column of a DataFrame or a Series. Unlike nunique(), which counts unique values, unique() returns them directly.

Syntax

Following is the syntax −

pandas.unique(values)

Where, values is a single parameter that can be a 1D array-like structure, such as a Series or DataFrame column.

Example

The following example uses the pandas.unique() function to get all the unique elements from a Pandas DataFrame column.

import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({'A': [4, 5, 6],'B': [4, 1, 1]})

# Display the Original DataFrame
print("Original DataFrame:")
print(df)

# Get unique values from a column
result= pd.unique(df['A'])
print('\nThe unique values:\n', result)

Output

Following is the output of the above code −

Original DataFrame:
   A  B
0  4  4
1  5  1
2  6  1

The unique values:
 [4 5 6]

Python Pandas - Duplicated Labels

In Pandas, row and column labels in both Series and DataFrames are not required to be unique. When a dataset contains repeated index labels, we call them duplicated labels; they can lead to unexpected results in operations such as filtering, aggregating, or slicing.

Pandas provides several methods to detect, manage, and handle such duplicated labels. In this tutorial, we will learn various ways to detect, manage, and handle duplicated labels in Pandas.

Checking for Unique Labels

To check if the row or column labels of a DataFrame are unique, you can use the pandas Index.is_unique attribute. If it returns False, then it means there are duplicate labels in your Index.

Example

The following example uses the pandas Index.is_unique attribute for checking the unique labels of a DataFrame.

import pandas as pd

# Creating a DataFrame with duplicate row labels
df = pd.DataFrame({"A": [0, 1, 2], 'B': [4, 1, 1]}, index=["a", "a", "b"])

# Display the Original DataFrame
print("Original DataFrame:")
print(df)

# Check if the row index is unique
print("Is row index unique:", df.index.is_unique)

# Check if the column index is unique
print('Is column index unique:', df.columns.is_unique)

Output

Following is the output of the above code −

Original DataFrame:
   A  B
a  0  4
a  1  1
b  2  1
Is row index unique: False
Is column index unique: True

Detecting Duplicate Labels

The Index.duplicated() method detects duplicate labels in a Pandas Index. It returns a boolean array indicating whether each label in the Index is duplicated.

Example

The following example uses the Index.duplicated() method to detect duplicate row labels in a Pandas DataFrame.

import pandas as pd

# Creating a DataFrame with duplicate row labels
df = pd.DataFrame({"A": [0, 1, 2], 'B': [4, 1, 1]}, index=["a", "a", "b"])

# Display the Original DataFrame
print("Original DataFrame:")
print(df)

# Identify duplicated row labels
print('Duplicated Row Labels:', df.index.duplicated())

Output

Following is the output of the above code −

Original DataFrame:
   A  B
a  0  4
a  1  1
b  2  1
Duplicated Row Labels: [False  True False]
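The boolean array returned by Index.duplicated() can be inverted with ~ to keep only the first occurrence of each label — a common idiom, shown here as a sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2], 'B': [4, 1, 1]}, index=["a", "a", "b"])

# Keep only the first row for each duplicated label
deduped = df[~df.index.duplicated(keep='first')]
print(deduped)
```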

Rejecting Duplicate Labels

Pandas provides the ability to reject duplicate labels. By default, pandas allows duplicate labels, but you can disallow them by calling .set_flags(allows_duplicate_labels=False). This applies to both Series and DataFrames. If pandas then detects duplicate labels, it raises a DuplicateLabelError.

Example

The following example demonstrates creating a Pandas Series object while disallowing duplicate labels.

import pandas as pd

# Create a Series with duplicate labels and disallow duplicates
try:
    pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)
except pd.errors.DuplicateLabelError as e:
    print(e)

Output

Following is the output of the above code −

Index has duplicates.

      positions
label          
b        [1, 2]
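The flag can also be inspected on an existing object through the .flags attribute; a minimal sketch:

```python
import pandas as pd

# Unique labels, so setting the flag succeeds
s = pd.Series([0, 1, 2], index=["a", "b", "c"]).set_flags(allows_duplicate_labels=False)

# The flag is stored on the object and can be queried later
print(s.flags.allows_duplicate_labels)
```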

Python Pandas - GroupBy

Pandas groupby() is an essential method for data aggregation and analysis in Python. It follows the "Split-Apply-Combine" pattern, which allows users to −

  • Split data into groups based on specific criteria.

  • Apply functions independently to each group.

  • Combine the results into a structured format.

In this tutorial, we will learn the basics of groupby operations in pandas, such as splitting data, viewing groups, and selecting specific groups, using an example dataset.

Introduction to GroupBy Operations

Every groupby() operation involves three key steps: splitting the data into groups based on some criteria, applying a function independently to each group, and then combining the results back into a meaningful structure.

In many situations, we apply a function to each of the split groups. Within the apply step, we can perform the following operations −

  • Aggregation: Computing summary statistics like mean, sum, etc.

  • Transformation: Applying a function to transform data.

  • Filtration: Removing groups based on some condition.
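The three apply-step operations above can be sketched on a small hypothetical DataFrame (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Riders'],
    'Points': [876, 789, 863, 673, 694],
})
grouped = df.groupby('Team')

# Aggregation: one summary value per group
print(grouped['Points'].mean())

# Transformation: result has the same shape as the input
print(grouped['Points'].transform(lambda x: x - x.mean()))

# Filtration: keep only groups with at least 3 rows
print(grouped.filter(lambda g: len(g) >= 3))
```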

Split Data into Groups

Pandas objects can be split into groups based on any of their column values using the groupby() method.

Example

Let us now see how a Pandas DataFrame can be split into groups using the groupby() method.

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}

df = pd.DataFrame(ipl_data)

# Display the Original DataFrame
print("Original DataFrame:")
print(df)

# Display the Grouped Data
print('\nGrouped Data:')
print(df.groupby('Team'))

Output

Following is the output of the above code −

Original DataFrame:
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690

Grouped Data:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fca22795340>
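The DataFrameGroupBy object shown above is iterable; looping over it yields (group name, sub-DataFrame) pairs, which is a convenient way to inspect each split. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Devils', 'Riders'],
                   'Points': [876, 863, 694]})

# Each iteration yields the group key and the matching rows
for name, group in df.groupby('Team'):
    print(name)
    print(group)
```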

GroupBy with Multiple Columns

You can group data based on multiple columns by passing a list of column names to the groupby() method.

Example

Here is an example where the data is grouped by multiple columns.

# import the pandas library
import pandas as pd

# Create a DataFrame
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

# Display the Grouped Data
print('Grouped Data:')

print(df.groupby(['Team','Year']).groups)

Output

Its output is as follows −

Grouped Data:
{('Devils', 2014): [2], ('Devils', 2015): [3], ('Kings', 2014): [4], 
('Kings', 2016): [6], ('Kings', 2017): [7], ('Riders', 2014): [0], 
('Riders', 2015): [1], ('Riders', 2016): [8], ('Riders', 2017): [11], 
('Royals', 2014): [9], ('Royals', 2015): [10], ('kings', 2015): [5]}

Viewing Grouped Data

Once you have your data split into groups, you can view them using different methods. One of the simplest ways is to view how it has been internally stored using the .groups attribute.

Example

The following example demonstrates how to view the grouped data using the .groups attribute.

# import the pandas library
import pandas as pd

# Create DataFrame 
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print('Viewing Grouped Data:')
print(df.groupby('Team').groups)

Output

Its output is as follows −

Viewing Grouped Data:
{'Devils': [2, 3], 'Kings': [4, 6, 7], 'Riders': [0, 1, 8, 11], 
'Royals': [9, 10], 'kings': [5]}

Selecting a Specific Group

Using the get_group() method, we can select a specific group.

Example

The following example demonstrates selecting a group from a grouped data using the get_group() method.

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')

# Display the Selected Data
print('Selected Group Data:')
print(grouped.get_group(2014))

Output

Its output is as follows −

Selected Group Data:
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701

Python Pandas - Date Functionality

In time-series data analysis, especially in financial domains, date functionality plays a crucial role. Pandas provides robust tools to work with dates, allowing you to generate date sequences, manipulate date frequencies, and work with business days.

This tutorial will cover some of the essential date functionalities in Pandas, including generating sequences of dates, converting date series to different frequencies, and creating ranges of dates.

Pandas date functionality is divided into four primary concepts −

  • Date times: Represent specific points in time, like datetime.datetime from the standard library.

  • Time deltas: Represent duration in time, similar to datetime.timedelta.

  • Time spans: Define a span of time with a specific frequency, like months or years.

  • Date offsets: Represent relative time changes that respect calendar rules. Similar to dateutil.relativedelta.relativedelta from the dateutil package.

Concept      | Scalar Class | Array Class    | Pandas Data Type | Creation Method
Date times   | Timestamp    | DatetimeIndex  | datetime64[ns]   | to_datetime() or date_range()
Time deltas  | Timedelta    | TimedeltaIndex | timedelta64[ns]  | to_timedelta() or timedelta_range()
Time spans   | Period       | PeriodIndex    | period[freq]     | Period() or period_range()
Date offsets | DateOffset   | None           | None             | DateOffset

Generating a Sequence of Dates

You can create a range of dates using the date_range() function. By specifying the start date, number of periods, and frequency, you can generate a sequence of dates.

Example

The following example uses the date_range() function to generate a date range with the default frequency of one day ('D').

import pandas as pd

print(pd.date_range('1/1/2024', periods=5))

Its output is as follows −

DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
   dtype='datetime64[ns]', freq='D')

Changing the Date Frequency

The frequency of a date range can be changed using the freq parameter in the pd.date_range() function. Pandas supports a variety of frequency options, allowing you to customize the intervals between dates.

Example

This example specifies the frequency 'M' to generate dates at the end of each month. (In pandas 2.2 and later, the month-end alias is 'ME'; 'M' emits a deprecation warning.)

import pandas as pd

print(pd.date_range('1/1/2024', periods=5,freq='M'))

Its output is as follows −

DatetimeIndex(['2024-01-31', '2024-02-28', '2024-03-31', '2024-04-30', '2024-05-31'],
   dtype='datetime64[ns]', freq='M')

Working with Business Days

When analyzing financial data, it is common to exclude weekends and holidays. Pandas provides the bdate_range() function (short for business date range), which generates date ranges while excluding weekends. Unlike date_range(), it skips Saturdays and Sundays.

Example

This example uses the bdate_range() function to generate 10 working days.

import pandas as pd

print(pd.bdate_range('1/1/2024', periods=10))

Its output is as follows −

DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
               '2024-01-05', '2024-01-08', '2024-01-09', '2024-01-10',
               '2024-01-11', '2024-01-12'],
              dtype='datetime64[ns]', freq='B')

Observe that the output excludes the weekend (January 6th and 7th) and continues from the next business day, January 8th.
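For calendars that also observe holidays, bdate_range() accepts freq='C' (custom business days) together with the holidays and weekmask parameters. A sketch, assuming a single New Year's Day holiday:

```python
import pandas as pd

# Custom business days: skip weekends and the listed holiday
idx = pd.bdate_range('1/1/2024', periods=5, freq='C',
                     holidays=['2024-01-01'])
print(idx)
```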

Using Offset Aliases

Pandas uses a set of predefined string aliases for common time series frequencies. These aliases, known as offset aliases, simplify the process of setting the frequency of your date ranges.

Commonly Used Offset Aliases

Below are the commonly used offset aliases in pandas.

Alias   Description
B       business day frequency
D       calendar day frequency
W       weekly frequency
M       month end frequency
SM      semi-month end frequency
BM      business month end frequency
MS      month start frequency
SMS     semi-month start frequency
BMS     business month start frequency
Q       quarter end frequency
BQ      business quarter end frequency
QS      quarter start frequency
BQS     business quarter start frequency
A       annual (year) end frequency
BA      business year end frequency
BAS     business year start frequency
BH      business hour frequency
H       hourly frequency
T, min  minutely frequency
S       secondly frequency
L, ms   milliseconds
U, us   microseconds
N       nanoseconds

For example, using the alias 'B' with pd.date_range() creates a date range with only business days, while 'M' sets the frequency to the end of the month.
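A couple of these aliases in action (a sketch; 'W' anchors dates on Sundays by default):

```python
import pandas as pd

# Weekly frequency: dates fall on the week's anchor day (Sunday by default)
print(pd.date_range('2024-01-01', periods=3, freq='W'))

# Business-day frequency: weekends are skipped
print(pd.date_range('2024-01-05', periods=3, freq='B'))
```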

Python Pandas - Timedelta

Timedelta in Pandas represents a duration, or the difference between two dates or times, expressed in units such as days, hours, minutes, or seconds. Timedeltas are useful for performing arithmetic operations on datetime objects and can represent both positive and negative durations.

Pandas Timedelta Class

The pandas.Timedelta class is a powerful tool for representing a duration, or the difference between two dates or times. It is the equivalent of Python's datetime.timedelta object and can be used interchangeably with it in most cases.

Syntax

Following is the syntax of the class −

class pandas.Timedelta(value=<object object>, unit=None, **kwargs)

Where,

  • value − Accepts any of the following time objects: Timedelta, timedelta, np.timedelta64, str, or int.

  • unit − An optional parameter that specifies the unit of the input if the input is an integer. Supported units include: 'W', 'D', 'days', 'hours', 'minutes', 'seconds', 'milliseconds', 'microseconds', 'nanoseconds'.

  • **kwargs − Accepts keyword arguments like days, seconds, microseconds, milliseconds, minutes, hours, and weeks.

Example

Following is the basic example of creating the Timedelta object.

import pandas as pd

# Initialize Timedelta with value and unit
td = pd.Timedelta(1, "d")
print(td)
print('Data Type of the Resultant Object:',type(td))

Following is the output of the above code −

1 days 00:00:00
Data Type of the Resultant Object: <class 'pandas._libs.tslibs.timedeltas.Timedelta'>

Also, you can create Timedelta objects in various ways, such as by passing a string, integer, or by using data offsets. Additionally, Pandas provides a top-level function to_timedelta() to convert scalar, array, list, or series into Timedelta type.

Creating Timedelta with a String

You can create a Timedelta object by passing a string that represents a duration.

Example

Here is the example of creating the Timedelta object using the string.

import pandas as pd

print(pd.Timedelta('2 days 2 hours 15 minutes 30 seconds'))

Its output is as follows −

2 days 02:15:30

Creating Timedelta with an Integer

Passing an integer value together with the unit argument creates a Timedelta object.

Example

This example converts an integer into the Timedelta object.

import pandas as pd

print(pd.Timedelta(6,unit='h'))

Its output is as follows −

0 days 06:00:00

Creating Timedelta with Data Offsets

Data offsets such as weeks, days, hours, minutes, seconds, milliseconds, microseconds, and nanoseconds can also be used as keyword arguments in construction.

Example

Here is the example −

import pandas as pd

print(pd.Timedelta(days=2))

Its output is as follows −

2 days 00:00:00

Using pd.to_timedelta() Function

The pd.to_timedelta function converts a scalar, array, list, or series from a recognized timedelta format or value into a Timedelta type. It will construct a Series if the input is a Series, a scalar if the input is scalar-like, or a TimedeltaIndex otherwise.

import pandas as pd

# Convert a list of timedelta strings into a TimedeltaIndex
print(pd.to_timedelta(['1 days', '2 days 02:15:30']))

Its output is as follows −

TimedeltaIndex(['1 days 00:00:00', '2 days 02:15:30'], dtype='timedelta64[ns]', freq=None)

Timedelta Operations

You can perform arithmetic operations on Series or DataFrames containing datetime64[ns] and timedelta64[ns] data types.

Example − Addition Operation

Let us now create a DataFrame with Timedelta and datetime objects and perform Addition operation on it −

import pandas as pd

s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
td = pd.Series([ pd.Timedelta(days=i) for i in range(3) ])
df = pd.DataFrame(dict(A = s, B = td))
df['C']=df['A']+df['B']

print(df)

Its output is as follows −

           A      B          C
0 2012-01-01 0 days 2012-01-01
1 2012-01-02 1 days 2012-01-03
2 2012-01-03 2 days 2012-01-05

Example − Subtraction Operation

Here is the example of subtracting the Timedelta values.

import pandas as pd

s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
td = pd.Series([ pd.Timedelta(days=i) for i in range(3) ])
df = pd.DataFrame(dict(A = s, B = td))
df['C']=df['A']+df['B']
df['D']=df['C']-df['B']

print(df)

Its output is as follows −

           A      B          C          D
0 2012-01-01 0 days 2012-01-01 2012-01-01
1 2012-01-02 1 days 2012-01-03 2012-01-02
2 2012-01-03 2 days 2012-01-05 2012-01-03

Timedelta Class Properties and Methods

The Timedelta object provides various properties and methods that are useful in date-time manipulation.

Properties

Following is the list of attributes of the Timedelta object.

Sr.No. Property & Description
1

Timedelta.asm8

Return a numpy timedelta64 array scalar view.

2

Timedelta.components

Return a components namedtuple-like.

3

Timedelta.days

Returns the days of the timedelta.

4

Timedelta.max

Return the maximum timedelta object.

5

Timedelta.microseconds

Return the microseconds of the timedelta.

6

Timedelta.min

Return the minimum timedelta object.

7

Timedelta.nanoseconds

Return the number of nanoseconds (n), where 0 <= n < 1 microsecond.

8

Timedelta.resolution

Return the resolution of the timedelta.

9

Timedelta.seconds

Return the total hours, minutes, and seconds of the timedelta as seconds.

10

Timedelta.unit

Return the unit of the timedelta.

11

Timedelta.value

Return the underlying value of the timedelta in nanoseconds.
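A few of the attributes above in action, as a short sketch:

```python
import pandas as pd

td = pd.Timedelta('1 days 2 hours 30 minutes')

print(td.days)             # the whole-days part of the duration
print(td.seconds)          # remaining hours/minutes/seconds, expressed in seconds
print(td.components)       # namedtuple-like breakdown of all components
print(td.total_seconds())  # the full duration in seconds
```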

Methods

The following table lists the methods of the Timedelta object.

Sr.No. Method & Description
1

Timedelta.as_unit(unit[, round_ok])

Convert the underlying int64 representation to the given unit.

2

Timedelta.ceil(freq)

Return a new Timedelta ceiled to this resolution.

3

Timedelta.floor(freq)

Return a new Timedelta floored to this resolution.

4

Timedelta.isoformat()

Format the Timedelta as ISO 8601 Duration.

5

Timedelta.round(freq)

Round the Timedelta to the specified resolution.

6

Timedelta.to_pytimedelta()

Convert a pandas Timedelta object into a python datetime.timedelta object.

7

Timedelta.to_timedelta64()

Return a numpy.timedelta64 object with 'ns' precision.

8

Timedelta.to_numpy([dtype, copy])

Convert the Timedelta to a NumPy timedelta64.

9

Timedelta.total_seconds()

Return the total seconds in the duration.

10

Timedelta.view(dtype)

Array view compatibility.
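A short sketch exercising a few of these methods:

```python
import pandas as pd

td = pd.Timedelta('1 days 2 hours 45 minutes')

print(td.floor('D'))        # floored to whole days
print(td.ceil('D'))         # ceiled to whole days
print(td.isoformat())       # ISO 8601 duration string
print(td.to_pytimedelta())  # standard-library datetime.timedelta
```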

Python Pandas - Sparse Data

Sparse data structures in Pandas are used to store data in a compressed format. They are particularly useful when you have large datasets with many repeated values (such as NaN). The compression is achieved by not storing these repeated values, making the storage more efficient.

Pandas provides specialized data structures for efficiently storing sparse data. Unlike typical sparse structures that mostly store zeros, Pandas' sparse objects allow you to compress data by omitting any values matching a specific fill value (like NaN). This compression leads to significant memory savings, especially with large datasets.

In this tutorial we will learn about the Sparse objects in pandas.

Sparse Arrays and Dtypes

Pandas offers the SparseArray class for handling sparse data at the array level. You can access the dtype information, which includes both the data type of the stored elements and the fill value.

Example

Let's see an example of creating a Series with a sparse data structure and verifying its datatype.

import pandas as pd
import numpy as np

# Generate random data
arr = np.random.randn(10)
arr[2:-2] = np.nan  # Introduce NaN values

# Convert to a sparse Series
sparse_series = pd.Series(pd.arrays.SparseArray(arr))

print("Output sparse Series:\n",sparse_series)
print("DataType of the Series:",sparse_series.dtype)

Its output is as follows −

Output sparse Series:
 0    0.763830
1    0.821392
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.532463
9    1.169153
dtype: Sparse[float64, nan]

DataType of the Series: Sparse[float64, nan]

Notice the dtype Sparse[float64, nan]. The nan indicates that NaN values are not actually stored, only the non-NaN elements are.

Memory Efficiency with Sparse DataFrames

Sparse objects are ideal for enhancing memory efficiency when working with large datasets containing many NaN values.

Example

Let us now assume you had a DataFrame with mostly NaN values and execute the following code −

import pandas as pd
import numpy as np

# Create a DataFrame and introduce NaN values 
df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = np.nan

# Convert to a sparse DataFrame
sparse_df = df.astype(pd.SparseDtype("float", np.nan))

# Display the first few rows and data types
print("Sparse DataFrame: \n",sparse_df.head())
print("\nDataType:\n",sparse_df.dtypes)

# Compare memory usage
print("\nMemory Comparison:")
print('Dense: {:.2f} KB'.format(df.memory_usage().sum() / 1e3))
print('Sparse: {:.2f} KB'.format(sparse_df.memory_usage().sum() / 1e3))

Its output is as follows −

Sparse DataFrame: 
     0   1   2   3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN

DataType:
 0    Sparse[float64, nan]
1    Sparse[float64, nan]
2    Sparse[float64, nan]
3    Sparse[float64, nan]
dtype: object

Memory Comparison:
Dense: 320.13 KB
Sparse: 0.22 KB

By converting the DataFrame to a sparse format, memory usage is significantly reduced.

Converting Sparse Arrays to Dense

Any sparse object can be converted back to the standard dense form by calling sparse.to_dense().

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 2), columns=['A', 'B'])
df.iloc[:5] = np.nan

# Convert to a sparse DataFrame
sparse_df = df.astype(pd.SparseDtype("float", np.nan))

# Display input the sparse object
print("sparse object:\n",sparse_df.dtypes)

result = sparse_df.sparse.to_dense()

# Output Dense
print("Output Dense:\n", result.dtypes)

Its output is as follows −

sparse object:
 A    Sparse[float64, nan]
B    Sparse[float64, nan]
dtype: object

Output Dense:
 A    float64
B    float64
dtype: object

Working with Sparse Accessor

Pandas offers a .sparse accessor to work with sparse data structures, similar to .str for string data or .dt for datetime data.

Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, the fill_value default changes −

  • float64 − np.nan

  • int64 − 0

  • bool − False

Example

Let us execute the following code to understand the working of the sparse accessor −

import pandas as pd
import numpy as np

# Create a sparse object
sparse_series = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

# Display input of the sparse object
print("sparse object:\n",sparse_series)

# Output of working with the Sparse Accessor
print("Percent of non-fill_value points:",sparse_series.sparse.density)
print("Fill value:", sparse_series.sparse.fill_value)
print("The number of non- fill_value points:", sparse_series.sparse.npoints)
print("Non fill value:", sparse_series.sparse.sp_values)

Its output is as follows −

sparse object:
0    0
1    0
2    1
3    2
dtype: Sparse[int64, 0]

Percent of non-fill_value points: 0.5
Fill value: 0
The number of non- fill_value points: 2
Non fill value: [1 2]

Python Pandas - Visualization

Visualization plays an important role in data analysis: it helps you represent data graphically for better understanding and for identifying patterns. Although the Pandas library is primarily used for data manipulation and analysis, it also provides data visualization capabilities through Python's Matplotlib library.

In Python, the Pandas library provides a basic method called .plot() for generating a wide variety of visualizations, along with different specialized plotting methods. These visualization tools are built on top of Python's Matplotlib library, offering flexibility and customization options.

Behind the scenes, every plot generated by Pandas is actually a Matplotlib object. This integration allows users to leverage Matplotlib's extensive customization options for fine-tuning Pandas-generated plots.

In this tutorial, we will learn about basics of visualizing data using the Pandas data structures.

Setting Up the Environment for Visualization

Before learning about Pandas data visualization, we should ensure that the Matplotlib library is installed. Following is the command for installing Pandas along with Matplotlib −

pip install pandas matplotlib

Importing Libraries

Along with import pandas as pd, you need to import Matplotlib's functional interface for displaying, customizing, and saving plots, using the following command −

import matplotlib.pyplot as plt

Displaying the plots

In environments like Jupyter Notebook or IPython shell, plots are often displayed automatically as they are generated. However, in a standard Python script or shell, this does not happen automatically. To explicitly display a plot in such environments, we need to call the following command −

plt.show()

This command renders the Matplotlib figure object in a GUI window.

Pandas Basic Plotting Method

The Pandas library provides a basic plotting method called plot() on both Series and DataFrame objects for producing different kinds of plots. This method is a simple wrapper around Matplotlib's plt.plot() method.

Syntax

Following is the syntax of the Pandas .plot() method −

DataFrame.plot(*args, **kwargs)

Where,

  • kind: Specifies the type of plot (default: 'line').

  • *args: Positional arguments passed through to the underlying Matplotlib plotting function.

  • **kwargs: Additional keyword options (such as x, y, title, and figsize) forwarded to Matplotlib.

Example

The following example plots a random DataFrame using the Pandas plot() method.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating a random DataFrame 
df = pd.DataFrame(np.random.randn(10,4),index=pd.date_range('1/1/2000',
   periods=10), columns=list('ABCD'))

# Plotting the DataFrame
df.plot()
plt.show()

Output

Its output is as follows −

Basic Plotting

If the index consists of dates, the pandas .plot() method calls Matplotlib's gcf().autofmt_xdate() to format the x-axis labels.

Also we can plot one column versus another using the x and y keywords.
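Plotting one column against another can be sketched as follows (the column names are illustrative):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical DataFrame with two numeric columns
df = pd.DataFrame(np.random.randn(20, 2), columns=['A', 'B'])

# Plot column 'B' against column 'A'; the x-axis is labeled 'A'
df.plot(x='A', y='B')
plt.show()
```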

Types of Plots Available in Pandas

Pandas supports various plot types through the kind parameter or specialized plotting methods. Following is the overview of the different plotting methods −

Plot Type      | kind Value         | Specialized Method   | Use Case
Line           | 'line'             | .line()              | Visualizing trends over time or a sequence.
Bar            | 'bar'              | .bar()               | Comparing quantities across categories.
Horizontal Bar | 'barh'             | .barh()              | Same as bar charts, but horizontal.
Histogram      | 'hist'             | .hist()              | Visualizing the distribution of numeric data.
Box Plot       | 'box'              | .box()               | Summarizing data distribution and outliers.
Area           | 'area'             | .area()              | Highlighting trends with cumulative data.
Scatter        | 'scatter'          | .scatter()           | Relationship between two variables; DataFrame only.
Hexbin         | 'hexbin'           | .hexbin()            | Visualizing data density in two dimensions; DataFrame only.
Density        | 'kde' or 'density' | .kde() or .density() | Smoothing data distributions (Kernel Density Estimation).
Pie            | 'pie'              | .pie()               | Proportional data in a circular graph.

Example: Plotting Bar Plot with plot() method

Let us now see what a Bar Plot is by creating one. A bar plot can be created in the following way −

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating a random DataFrame 
df = pd.DataFrame(np.random.rand(10,4), columns=['a','b','c','d'])

# Plotting the bar plot
df.plot(kind='bar')
plt.show()

Output

Its output is as follows −

Bar Plot

To produce a stacked bar plot, pass stacked=True −

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating a random DataFrame 
df = pd.DataFrame(np.random.rand(10,4), columns=['a','b','c','d'])

# Plotting the stacked Bar plot
df.plot(kind='bar', stacked=True)
plt.show()

Output

Its output is as follows −

Stacked Bar Plot

To get horizontal bar plots, use the barh option −

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating a random DataFrame 
df = pd.DataFrame(np.random.rand(10,4), columns=['a','b','c','d'])

# Plotting the horizontal bar plot
df.plot(kind='barh', stacked=True)
plt.show()

Output

Its output is as follows −

Horizontal Bar Plot

Histograms

Histograms can be plotted by passing 'hist' to the kind argument of the plot() method. The number of bins can be specified with the bins parameter.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating a random DataFrame 
df = pd.DataFrame({'a':np.random.randn(1000)+1,'b':np.random.randn(1000),
'c':np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])

df.plot(kind='hist', bins=20)
plt.show()

Output

Its output is as follows −

Histograms using plot.hist()

Box Plots

Box plots can be drawn by passing the 'box' kind for both Series and DataFrame objects, to visualize the distribution of values within each column.

For instance, here is a boxplot representing five trials of 10 observations of a uniform random variable on [0,1).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating a random DataFrame 
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])

df.plot(kind='box')
plt.show()

Output

Its output is as follows −

Box Plots

Area Plot

Area plot can be created using the plot(kind='area') option.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating a random DataFrame 
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot(kind='area')

plt.show()

Output

Its output is as follows −

Area Plot

Scatter Plot

Scatter plot can be created using the plot(kind='scatter') option.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating a random DataFrame 
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
df.plot(kind='scatter', x='a', y='b')

plt.show()

Output

Its output is as follows −

Scatter Plot

Pie Chart

Pie chart can be created using the plot(kind='pie') option.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating a random DataFrame 
df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
df.plot(kind='pie', subplots=True)

plt.show()

Output

Its output is as follows −

Pie Chart

Python Pandas - Caveats & Gotchas

A caveat is a warning, and a gotcha is an unseen problem.

Using If/Truth Statement with Pandas

Pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the Boolean operations and, or, or not. It is not clear what the result should be: should it be True because it is not zero-length, or False because there are False values? It is unclear, so instead Pandas raises a ValueError −

import pandas as pd

if pd.Series([False, True, False]):
   print('I am True')

Output

Its output is as follows −

ValueError: The truth value of a Series is ambiguous.
Use a.empty, a.bool(), a.item(), a.any() or a.all().

Inside an if condition, it is unclear what to do with the Series. The error message suggests making the intent explicit by using .empty, .bool(), .item(), .any(), or .all().

import pandas as pd

if pd.Series([False, True, False]).any():
   print("I am any")

Output

Its output is as follows −

I am any

To evaluate single-element pandas objects in a Boolean context, use the method .bool() −

import pandas as pd

print(pd.Series([True]).bool())

Output

Its output is as follows −

True
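Note that Series.bool() has been deprecated in recent pandas releases (2.1 and later). A minimal sketch of the modern equivalent, using .item() to extract the single value instead:

```python
import pandas as pd

s = pd.Series([True])

# .item() returns the single scalar value; it raises a ValueError
# if the Series does not contain exactly one element.
print(s.item())   # True

# .any() / .all() also work and make the intent explicit
print(s.any(), s.all())
```

Either form avoids the ambiguous truth-value error while stating clearly what "truthiness" means for your Series.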

Bitwise Boolean

Comparison operators like == and != operate element-wise and return a Boolean Series, which is almost always what is required anyway.

import pandas as pd

s = pd.Series(range(5))
print(s == 4)

Output

Its output is as follows −

0    False
1    False
2    False
3    False
4     True
dtype: bool
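This ties back to the truth-value gotcha above: to combine Boolean Series element-wise, use the bitwise operators &, | and ~ rather than the keywords and, or, and not, and parenthesize each comparison because the bitwise operators bind more tightly. A minimal sketch:

```python
import pandas as pd

s = pd.Series(range(5))

# Using the keyword 'and' here would raise the ambiguous-truth-value
# ValueError; '&' combines the two masks element-wise instead.
# The parentheses are required: '&' binds tighter than '>' and '<'.
mask = (s > 0) & (s < 4)
print(mask)

print(s[mask])   # selects 1, 2, 3
</```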

isin Operation

This returns a Boolean series showing whether each element in the Series is exactly contained in the passed sequence of values.

import pandas as pd

s = pd.Series(list('abc'))
s = s.isin(['a', 'c', 'e'])
print(s)

Output

Its output is as follows −

0     True
1    False
2     True
dtype: bool
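isin also works on a DataFrame, testing every element, and passing a dict restricts the lookup per column. A short sketch (the column names and values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ids': ['a', 'b', 'c'], 'vals': [1, 2, 3]})

# Element-wise membership test across the whole DataFrame
print(df.isin(['a', 1]))

# Per-column membership: dict keys are column names,
# values are the allowed entries for that column
print(df.isin({'ids': ['a', 'c'], 'vals': [2]}))
```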

Reindexing vs ix Gotcha

Many users will find themselves using the ix indexing capabilities as a concise means of selecting data from a Pandas object. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the examples below require an older pandas version; modern code should use .loc (label-based) or .iloc (position-based) instead −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4),
   columns=['one', 'two', 'three', 'four'], index=list('abcdef'))

print(df)
# .ix requires pandas < 1.0; modern code uses df.loc[['b', 'c', 'e']]
print(df.ix[['b', 'c', 'e']])

Output

Its output is as follows −

          one        two      three       four
a   -1.582025   1.335773   0.961417  -1.272084
b    1.461512   0.111372  -0.072225   0.553058
c   -1.240671   0.762185   1.511936  -0.630920
d   -2.380648  -0.029981   0.196489   0.531714
e    1.846746   0.148149   0.275398  -0.244559
f   -1.842662  -0.933195   2.303949   0.677641

          one        two      three       four
b    1.461512   0.111372  -0.072225   0.553058
c   -1.240671   0.762185   1.511936  -0.630920
e    1.846746   0.148149   0.275398  -0.244559

This is, of course, completely equivalent in this case to using the reindex method −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4),
   columns=['one', 'two', 'three', 'four'], index=list('abcdef'))

print(df)
print(df.reindex(['b', 'c', 'e']))

Output

Its output is as follows −

          one        two      three       four
a    1.639081   1.369838   0.261287  -1.662003
b   -0.173359   0.242447  -0.494384   0.346882
c   -0.106411   0.623568   0.282401  -0.916361
d   -1.078791  -0.612607  -0.897289  -1.146893
e    0.465215   1.552873  -1.841959   0.329404
f    0.966022  -0.190077   1.324247   0.678064

          one        two      three       four
b   -0.173359   0.242447  -0.494384   0.346882
c   -0.106411   0.623568   0.282401  -0.916361
e    0.465215   1.552873  -1.841959   0.329404

Some might conclude that ix and reindex are 100% equivalent based on this. This is true except in the case of integer indexing. For example, the above operation can alternatively be expressed as −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4),
   columns=['one', 'two', 'three', 'four'], index=list('abcdef'))

print(df)
# .ix requires pandas < 1.0; given an integer list on a string index,
# it falls back to positional indexing, unlike the strictly
# label-based reindex
print(df.ix[[1, 2, 4]])
print(df.reindex([1, 2, 4]))

Output

Its output is as follows −

          one        two      three       four
a   -1.015695  -0.553847   1.106235  -0.784460
b   -0.527398  -0.518198  -0.710546  -0.512036
c   -0.842803  -1.050374   0.787146   0.205147
d   -1.238016  -0.749554  -0.547470  -0.029045
e   -0.056788   1.063999  -0.767220   0.212476
f    1.139714   0.036159   0.201912   0.710119

          one        two      three       four
b   -0.527398  -0.518198  -0.710546  -0.512036
c   -0.842803  -1.050374   0.787146   0.205147
e   -0.056788   1.063999  -0.767220   0.212476

    one  two  three  four
1   NaN  NaN    NaN   NaN
2   NaN  NaN    NaN   NaN
4   NaN  NaN    NaN   NaN

It is important to remember that reindex is strict label indexing only. This can lead to some potentially surprising results in pathological cases where an index contains, say, both integers and strings.
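Since .ix has been removed from modern pandas, the two behaviors it mixed are now written explicitly with .loc and .iloc. A sketch of the equivalents for the selections above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4),
                  columns=['one', 'two', 'three', 'four'],
                  index=list('abcdef'))

# Label-based selection (what df.ix[['b', 'c', 'e']] did here)
print(df.loc[['b', 'c', 'e']])

# Position-based selection (what df.ix[[1, 2, 4]] fell back to)
print(df.iloc[[1, 2, 4]])
```

Because 'b', 'c' and 'e' sit at positions 1, 2 and 4, both calls select the same rows on this index; being explicit about label versus position removes the ambiguity that made .ix a gotcha.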
