How to Use SQL in Pandas using pandasql Queries?

Data manipulation might sound complicated, but it is not actually! If you're familiar with SQL and want to work with Pandas, there's a fantastic thing for you: pandasql. We are here to help you connect the dots between SQL and Pandas in a straightforward way.

Throughout this blog, we'll guide you through pandasql, showing you how to apply SQL-style commands to filter, merge, and manage your data in Pandas. It doesn't matter if you're a data enthusiast, analyst, or just someone keen to learn – we'll make using SQL in Pandas a breeze.

Get ready to explore the best of both worlds and supercharge your data skills!

How to Use pandasql?

The pandasql Python library helps you query pandas dataframes using SQL commands, without connecting to an SQL server. It operates using SQLite syntax, automatically identifying any pandas dataframe and treating it as a standard SQL table.


Set Up the Environment

First, install pandasql:

pip install pandasql

Then, import the required packages:

from pandasql import sqldf
import pandas as pd

We imported the 'sqldf' function directly from 'pandasql', and it's essentially the library's most meaningful function. As the name suggests, it allows you to query dataframes using SQL syntax. Besides this function, 'pandasql' includes two straightforward built-in datasets that can be loaded using the self-explanatory functions: 'load_births()' and 'load_meat'.


pandasql Syntax

sqldf(query, env=None)

Here, the "query" parameter is important, and it takes a SQL query as a string. Additionally, there is an optional parameter "env," which is rarely used and can take on the values of either "locals()" or "globals()." This parameter allows sqldf to access the corresponding set of variables in your Python environment.

The sqldf function responds to the query as a pandas dataframe.

Examples of Using pandasql

Now, let’s take a more detailed look at running SQL queries on pandas dataframes using the sqldf function. For practice, let's use a built-in dataset of the seaborn library—penguins:

import seaborn as sns
penguins = sns.load_dataset('penguins')


No. Species Island bill_length_mm bill_depth_mm flipper_length_mm
0 Adelie Torgersen 39.1 18.7 181.0
1 Adelie Torgersen 39.5 17.4 186.0
2 Adelie Torgersen 40.3 18.0 195.0
3 Adelie Torgersen NA NA NA
4 Adelie Torgersen 36.7 19.3 193.0


No. body_mass_g sex
0 3750.0 Male
1 3800.0 Female
2 3250.0 Female
4 3450.0 Female


Extracting Data With pandasql
print(sqldf('''SELECT species, island 
FROM penguins 
LIMIT 5'''))


No. Species Island
0 Adelie Torgersen
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen


Here, we have extracted information about the species and location of the first five penguins from the penguins dataframe. Note that running the sqldf function returns a pandas dataframe:

print(type(sqldf('''SELECT species, island 
                      FROM penguins 
                     LIMIT 5''')))


<class 'pandas.core.frame.DataFrame'>

In pure pandas, it would be:

print(penguins[['species', 'island']].head())


Sorting Data With pandasql
print(sqldf('''SELECT body_mass_g 
                 FROM penguins 
                ORDER BY body_mass_g DESC 
                LIMIT 5''')


No. body_mass_g
0 6300.0
1 6050.0
2 6050.0
3 6000.0
4 5950.0

We sorted our penguins by body mass in descending order and displayed the top five values of body mass.

In pandas, it would be:



0 6300.0
1 6050.0
2 6000.0
3 6000.0
4 5950.0

Name: body_mass_g, dtype: float64


Filtering Data With pandasql

Let’s say we have a group of penguins who are male and have flippers longer than 210 mm:

print(sqldf('''SELECT DISTINCT species
                 FROM penguins 
                WHERE sex = 'Male' 
                  AND flipper_length_mm > 210'''))

0  Chinstrap
1     Gentoo

Here, we filtered the data with these two conditions: sex = 'Male' and flipper_length_mm > 210.

The same code in pandas would look a bit more overwhelming:

print(penguins[(penguins['sex'] == 'Male') & (penguins['flipper_length_mm'] > 210)]['species'].unique())


['Chinstrap' 'Gentoo']

Grouping and Aggregating Data With pandasql

Now, let's try data grouping and aggregation to find each species’ longest bill in the dataframe:

print(sqldf('''SELECT species, MAX(bill_length_mm)
                 FROM penguins 
                GROUP BY species'''))


No. species MAX(bill_length_mm)
0 Adelie 46.0
1 Chinstrap 58.0
2 Gentoo 59.6

Same code in pandas:

print(penguins[['species', 'bill_length_mm']].groupby('species', as_index=False).max())


No. species bill_length_mm
0 Adelie 46.0
1 Chinstrap 58.0
2 Gentoo 59.6



Using SQL within Pandas via the pandasql library is an efficient way to handle data manipulation. The seamless integration of SQL queries with Pandas dataframes empowers users to employ their SQL expertise within the Python environment.

This approach simplifies complex data operations, creating succinct, readable code that enhances data analysis workflows. With the power of SQL and pandasql, you can perform diverse operations with ease, from filtering and aggregating data to handling complex joins.

Ultimately, SQL and Pandas improve the analytical capabilities, creating a smoother and more powerful data processing experience.


You might also like:

Read Also: How To Solve Route Not Found Exception Laravel 10

Read Also: How To Get Current Month Records In MySQL Query

Read Also: How to Validate Input Based on Condition in Laravel 10

Read Also: How To Force Redirect HTTP To HTTPS In Laravel


Techsolutionstuff | The Complete Guide

I'm a software engineer and the founder of Hailing from India, I craft articles, tutorials, tricks, and tips to aid developers. Explore Laravel, PHP, MySQL, jQuery, Bootstrap, Node.js, Vue.js, and AngularJS in our tech stack.