Data manipulation might sound complicated, but it is not actually! If you're familiar with SQL and want to work with Pandas, there's a fantastic thing for you: pandasql. We are here to help you connect the dots between SQL and Pandas in a straightforward way.
Throughout this blog, we'll guide you through pandasql, showing you how to apply SQL-style commands to filter, merge, and manage your data in Pandas. It doesn't matter if you're a data enthusiast, analyst, or just someone keen to learn – we'll make using SQL in Pandas a breeze.
Get ready to explore the best of both worlds and supercharge your data skills!
The pandasql Python library helps you query pandas dataframes using SQL commands, without connecting to an SQL server. It operates using SQLite syntax, automatically identifying any pandas dataframe and treating it as a standard SQL table.
First, install pandasql:
pip install pandasql
Then, import the required packages:
from pandasql import sqldf
import pandas as pd
We imported the 'sqldf' function directly from 'pandasql', and it's essentially the library's most meaningful function. As the name suggests, it allows you to query dataframes using SQL syntax. Besides this function, 'pandasql' includes two straightforward built-in datasets that can be loaded using the self-explanatory functions: 'load_births()' and 'load_meat'.
pandasql Syntax
sqldf(query, env=None)
Here, the "query" parameter is important, and it takes a SQL query as a string. Additionally, there is an optional parameter "env," which is rarely used and can take on the values of either "locals()" or "globals()." This parameter allows sqldf to access the corresponding set of variables in your Python environment.
The sqldf function responds to the query as a pandas dataframe.
Now, let’s take a more detailed look at running SQL queries on pandas dataframes using the sqldf function. For practice, let's use a built-in dataset of the seaborn library—penguins:
import seaborn as sns
penguins = sns.load_dataset('penguins')
print(penguins.head())
Output:
No. | Species | Island | bill_length_mm | bill_depth_mm | flipper_length_mm |
---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 |
3 | Adelie | Torgersen | NA | NA | NA |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 |
No. | body_mass_g | sex |
0 | 3750.0 | Male |
1 | 3800.0 | Female |
2 | 3250.0 | Female |
3 | NA | NA |
4 | 3450.0 | Female |
print(sqldf('''SELECT species, island
FROM penguins
LIMIT 5'''))
Output:
No. | Species | Island |
---|---|---|
0 | Adelie | Torgersen |
1 | Adelie | Torgersen |
2 | Adelie | Torgersen |
3 | Adelie | Torgersen |
4 | Adelie | Torgersen |
Here, we have extracted information about the species and location of the first five penguins from the penguins dataframe. Note that running the sqldf function returns a pandas dataframe:
print(type(sqldf('''SELECT species, island
FROM penguins
LIMIT 5''')))
Output:
<class 'pandas.core.frame.DataFrame'>
In pure pandas, it would be:
print(penguins[['species', 'island']].head())
print(sqldf('''SELECT body_mass_g
FROM penguins
ORDER BY body_mass_g DESC
LIMIT 5''')
Output:
No. | body_mass_g |
0 | 6300.0 |
1 | 6050.0 |
2 | 6050.0 |
3 | 6000.0 |
4 | 5950.0 |
We sorted our penguins by body mass in descending order and displayed the top five values of body mass.
In pandas, it would be:
print(penguins['body_mass_g'].sort_values(ascending=False,
ignore_index=True).head())
Output:
0 | 6300.0 |
1 | 6050.0 |
2 | 6000.0 |
3 | 6000.0 |
4 | 5950.0 |
Name: body_mass_g, dtype: float64
Let’s say we have a group of penguins who are male and have flippers longer than 210 mm:
print(sqldf('''SELECT DISTINCT species
FROM penguins
WHERE sex = 'Male'
AND flipper_length_mm > 210'''))
Output:
species
0 Chinstrap
1 Gentoo
Here, we filtered the data with these two conditions: sex = 'Male' and flipper_length_mm > 210.
The same code in pandas would look a bit more overwhelming:
print(penguins[(penguins['sex'] == 'Male') & (penguins['flipper_length_mm'] > 210)]['species'].unique())
Output:
['Chinstrap' 'Gentoo']
Now, let's try data grouping and aggregation to find each species’ longest bill in the dataframe:
print(sqldf('''SELECT species, MAX(bill_length_mm)
FROM penguins
GROUP BY species'''))
Output:
No. | species | MAX(bill_length_mm) |
0 | Adelie | 46.0 |
1 | Chinstrap | 58.0 |
2 | Gentoo | 59.6 |
Same code in pandas:
print(penguins[['species', 'bill_length_mm']].groupby('species', as_index=False).max())
Output:
No. | species | bill_length_mm |
0 | Adelie | 46.0 |
1 | Chinstrap | 58.0 |
2 | Gentoo | 59.6 |
Using SQL within Pandas via the pandasql library is an efficient way to handle data manipulation. The seamless integration of SQL queries with Pandas dataframes empowers users to employ their SQL expertise within the Python environment.
This approach simplifies complex data operations, creating succinct, readable code that enhances data analysis workflows. With the power of SQL and pandasql, you can perform diverse operations with ease, from filtering and aggregating data to handling complex joins.
Ultimately, SQL and Pandas improve the analytical capabilities, creating a smoother and more powerful data processing experience.
You might also like:
Read Also: How To Solve Route Not Found Exception Laravel 10
Read Also: How To Get Current Month Records In MySQL Query
Read Also: How to Validate Input Based on Condition in Laravel 10
Read Also: How To Force Redirect HTTP To HTTPS In Laravel