Transform Pandas DataFrame with SQL
Pandas syntax can be verbose. Use SQL to transform Pandas DataFrame to write concise code.
David @ WiseData · June 29, 2023
Introduction
As data analysts and scientists, we often use Pandas for data transformation. However, sometimes we might spend a lot of time searching for correct syntax to transform Pandas data. Coming from SQL background, I’ve always thought that SQL syntax is simple most of the times. Now, that's possible with WiseData.
In this blog post, we will explore how to use WiseData, a Python library, to transform Pandas data using SQL.
You can download the Jupyter notebook from HERE.
Usage Instructions
1. Obtain an API Key
To use WiseData, you need to obtain an API Key. Simply visit https://www.wisedata.app/, fill out your email address. And the API Key used for Python package will be delivered to your email.
2. Installation
To install the library, use the following pip command:
pip install wisedata pandas numpy
3. Instantiation
Instantiate the WiseData class with your API key:
from wisedata import WiseData
# TODO: Copy your API key which you've received in your email here
wd = WiseData(api_key="YOUR_API_KEY")
4. Transform Data with SQL
Let's say we have a Pandas DataFrame with the following data:
country | gdp | happiness_index |
---|---|---|
United States | 19294482071552 | 6.94 |
United Kingdom | 2891615567872 | 7.16 |
France | 2411255037952 | 6.66 |
Germany | 3435817336832 | 7.07 |
Italy | 1745433788416 | 6.38 |
Spain | 1181205135360 | 6.4 |
Canada | 1607402389504 | 7.23 |
Australia | 1490967855104 | 7.22 |
Japan | 4380756541440 | 5.87 |
China | 14631844184064 | 5.12 |
We want to count number of countries. We can use WiseData to apply an SQL query to achieve this transformation.
import pandas as pd
countries = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
"happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})
df = wd.sql("SELECT COUNT(country) FROM countries", {
"countries": countries
})
print(df)
The resulting DataFrame will look like this:
count | |
---|---|
0 | 10 |
5. Joining Data with SQL
country | population |
---|---|
United States | 1 |
United Kingdom | 2 |
France | 3 |
Germany | 4 |
Italy | 5 |
Spain | 6 |
Canada | 7 |
Australia | 8 |
Japan | 9 |
China | 10 |
Let’s join 2 datasets together:
country_populations = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"population": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})
df_joined = wd.sql("SELECT * FROM countries LEFT JOIN country_populations ON countries.country = country_populations.country", {
"countries": countries,
"country_populations": country_populations
})
print(df_joined)
Now you will get a DataFrame with joined data!
6. Putting everything together
Let’s put everything together into code:
from wisedata import WiseData
# TODO: Copy your API key which you've received in your email here
wd = WiseData(api_key="YOUR_API_KEY")
import pandas as pd
countries = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
"happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})
df = wd.sql("SELECT COUNT(country) FROM countries", {
"countries": countries
})
print(df)
country_populations = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"population": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})
df_joined = wd.sql("SELECT * FROM countries LEFT JOIN country_populations ON countries.country = country_populations.country", {
"countries": countries,
"country_populations": country_populations
})
print(df_joined)
Benefits of Using WiseData
Using WiseData to transform Pandas DataFrame with SQL has several benefits. First, it allows us to leverage our SQL skills to manipulate data in Pandas. SQL is a powerful language for data manipulation, and being able to use it with Pandas can help us achieve complex transformations more easily. Second, it allows us to write more concise and readable code. SQL queries are often shorter and more expressive than their equivalent Pandas code. Lastly, we can now write data transformation in Pandas syntax as well as in SQL syntax.
Conclusion
In this blog post, we learned how to use WiseData to transform Pandas data using SQL, and we discussed the benefits of using SQL for data manipulation. By using WiseData, we can leverage our SQL skills to achieve more complex transformations, and write more concise and readable code.
If you're looking for a way to streamline your data analysis workflow, give WiseData a try @ https://www.wisedata.app!