3 min read

Transform Pandas DataFrame with SQL


Pandas syntax can be verbose. Use SQL to transform Pandas DataFrame to write concise code.


David @ WiseData · June 29, 2023


Introduction

As data analysts and scientists, we often use Pandas for data transformation. However, sometimes we might spend a lot of time searching for correct syntax to transform Pandas data. Coming from SQL background, I’ve always thought that SQL syntax is simple most of the times. Now, that's possible with WiseData.

In this blog post, we will explore how to use WiseData, a Python library, to transform Pandas data using SQL.

You can download the Jupyter notebook from HERE.

Usage Instructions

1. Obtain an API Key

To use WiseData, you need to obtain an API Key. Simply visit https://www.wisedata.app/, fill out your email address. And the API Key used for Python package will be delivered to your email.

2. Installation

To install the library, use the following pip command:

pip install wisedata pandas numpy

3. Instantiation

Instantiate the WiseData class with your API key:

from wisedata import WiseData

# TODO: Copy your API key which you've received in your email here
wd = WiseData(api_key="YOUR_API_KEY")

4. Transform Data with SQL

Let's say we have a Pandas DataFrame with the following data:

country gdp happiness_index
United States 19294482071552 6.94
United Kingdom 2891615567872 7.16
France 2411255037952 6.66
Germany 3435817336832 7.07
Italy 1745433788416 6.38
Spain 1181205135360 6.4
Canada 1607402389504 7.23
Australia 1490967855104 7.22
Japan 4380756541440 5.87
China 14631844184064 5.12

We want to count number of countries. We can use WiseData to apply an SQL query to achieve this transformation.

import pandas as pd

countries = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

df = wd.sql("SELECT COUNT(country) FROM countries", {
  "countries": countries
})
print(df)

The resulting DataFrame will look like this:

count
0 10

5. Joining Data with SQL

country population
United States 1
United Kingdom 2
France 3
Germany 4
Italy 5
Spain 6
Canada 7
Australia 8
Japan 9
China 10

Let’s join 2 datasets together:

country_populations = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "population": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

df_joined = wd.sql("SELECT * FROM countries LEFT JOIN country_populations ON countries.country = country_populations.country", {
  "countries": countries,
  "country_populations": country_populations
})
print(df_joined)

Now you will get a DataFrame with joined data!

6. Putting everything together

Let’s put everything together into code:

from wisedata import WiseData

# TODO: Copy your API key which you've received in your email here
wd = WiseData(api_key="YOUR_API_KEY")

import pandas as pd

countries = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

df = wd.sql("SELECT COUNT(country) FROM countries", {
  "countries": countries
})
print(df)

country_populations = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "population": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

df_joined = wd.sql("SELECT * FROM countries LEFT JOIN country_populations ON countries.country = country_populations.country", {
  "countries": countries,
  "country_populations": country_populations
})
print(df_joined)

Benefits of Using WiseData

Using WiseData to transform Pandas DataFrame with SQL has several benefits. First, it allows us to leverage our SQL skills to manipulate data in Pandas. SQL is a powerful language for data manipulation, and being able to use it with Pandas can help us achieve complex transformations more easily. Second, it allows us to write more concise and readable code. SQL queries are often shorter and more expressive than their equivalent Pandas code. Lastly, we can now write data transformation in Pandas syntax as well as in SQL syntax.

Conclusion

In this blog post, we learned how to use WiseData to transform Pandas data using SQL, and we discussed the benefits of using SQL for data manipulation. By using WiseData, we can leverage our SQL skills to achieve more complex transformations, and write more concise and readable code.

If you're looking for a way to streamline your data analysis workflow, give WiseData a try @ https://www.wisedata.app!