Stronger together: Python, dataframes, and SQL
For years working in data and analytics engineering roles, I treasured the daily camaraderie sharing a small office space with talented folks using a range of tools - from analysts using SQL and Excel to data scientists working in Python. I always sensed that there was so much we could work on in collaboration with each other - but siloed data and tooling made this much more difficult. The diversity of our tools and languages made the potential for collaboration all the more interesting, since we could have folks with different areas of expertise each bringing their unique spin to the project. But logistically, it just couldn’t be done in a scalable way.
So I couldn’t be more excited about dbt’s polyglot capabilities arriving in dbt Core 1.3. This release brings Python dataframe libraries that are crucial to data scientists and enables general-purpose Python but still uses a shared database for reading and writing data sets. Analytics engineers and data scientists are stronger together, and I can’t wait to work side-by-side in the same repo with all my data scientist friends.
Going polyglot is a major next step in the journey of dbt Core. While it expands possibilities, we also recognize the potential for confusion. When combined in an intentional manner, SQL, dataframes, and Python are also stronger together. Polyglot dbt allows informed practitioners to choose the language that best fits your use case.
In this post, we’ll give you your hands-on experience and seed your imagination with potential applications. We’ll walk you through a demo that showcases string parsing - one simple way that Python can be folded into a dbt project.
We’ll also give you the intellectual resources to compare/contrast:
- different dataframe implementations within different data platforms
- dataframes vs. SQL
Finally, we’ll share “gotchas” and best practices we’ve learned so far and invite you to participate in discovering the answers to outstanding questions we are still curious about ourselves.
Based on our early experiences, we recommend that you:
✅ Do: Use Python when it is better suited for the job – model training, using predictive models, matrix operations, exploratory data analysis (EDA), Python packages that can assist with complex transformations, and select other cases where Python is a more natural fit for the problem you are trying to solve.
❌ Don’t: Use Python where the solution in SQL is just as direct. Although a pure Python dbt project is possible, we’d expect the most impactful projects to be a mixture of SQL and Python.
Polyglot dbt: An alloy of Python, dataframes, and SQL
dbt Core 1.3 gives you the ability to use Python models to materialize dataframes as tables within your dbt DAG. In combination, Python and SQL models form an alloy within a dbt project, yielding net new properties not found in isolation.
This is a simultaneous two-part unlock. One, we get a general purpose programming language in Python. Two, we get access to special-purpose dataframe libraries for reading from / writing to the database.
Some pretty cool functionality follows:
- Vast ecosystem of Python packages: The rich Python ecosystem of packages is the heart of varied use cases like machine learning, exploratory data analysis (EDA), data set generation, and many, many more.
- Dataframe syntax for data set manipulation: There’s a vibrant community of Python-first developers that can more easily contribute to dbt projects when they can use dataframe syntax. This is especially useful in conjunction with data science use-cases.
- Python workflows where your data lives: Most Python work being done today is done outside of the data platform that stores the source data, meaning developers have to transfer the data into the Python environment first, which adds significant friction to the development workflow. In the case of Snowpark DataFrames, data movement is reduced by pushing the computation down to the data platform.
Trade-offs between SQL and dataframe syntax
Once you’re set up - the next question is: what should you use Python for? How should you think about tradeoffs between SQL vs. dataframes? I haven’t personally pondered this deeply… but the folks at Ponder have 😉 They published a series of posts comparing the two in terms of:
- convenience - dataframes go well with data science libraries like Scikit-learn
- conciseness - dataframes have concise syntax for normalization, one-hot encoding, rolling averages, and other uses
- flexibility - rows and columns in dataframes can be transposed and functions can be applied along columns or rows
Gaining your own sense of these differences will empower you to create your own alloy mix of polyglot dbt models.
Comparing dataframe libraries
Before we get into our hands-on example, let’s take a look at the nuts and bolts of getting your project working with different dataframe types. Multiple data platforms and dataframe libraries are supported in dbt Core as of version 1.3, but not uniformly (see compatibility table below). See here for platform-specific setup instructions.
Type of data frame | Snowflake | Databricks | BigQuery |
---|---|---|---|
Snowpark DataFrames | ✅ | ❌ | ❌ |
pandas DataFrames | ✅ | ✅ | ✅ |
Spark DataFrames | ❌ | ✅ | ✅ |
pandas-on-Spark DataFrames | ❌ | ✅ | ✅ |
Koalas DataFrames | ❌ | ✅ | ✅ |
More detailed comparisons and trade-offs
First hands-on experience using a Python package
Now that we have sufficient background covered, let’s dive into a real-world use-case.
The full source code for this demo is available at:
This example is using DuckDB as the database backend. You can use this same code by copying it in your project that is using your database adapter of choice.
Our example will give you hands-on experience with three things:
- read data from a table into a dataframe
- parse uniform datetimes out of a variety of string formats using the
dateutil
library - writes the result back into a table
Date/time parsing is a common use-case because dates and times are ubiquitous in transactional source data. Often, the source format is unknown, and it may even be a mixture of multiple formats. The dateutil
package has a method that will do a best-effort attempt at parsing a string into a Python datetime object, and it will raise an exception when it is unable to parse the input. (Users of pandas may already be familiar with the pandas.to_datetime
method, which uses dateutil
). The following demo uses dateutil
to parse source data with unknown date/time formats.
In this example we’ll:
- Install the requirements in a virtual environment
- Build the dbt project
- Examine the output
Components of the dbt project
This dbt project has only two main pieces (besides our seed data to mimic source data):
- Transformation logic within the dbt Python model
- Configuration of data testing for quality assurance
The dbt Python model
import dateutil
def try_dateutil_parse(x):
try:
return dateutil.parser.parse(x)
except:
return
def model(dbt, session):
df = dbt.ref("source_data")
df['parsed_transaction_time'] = df['transaction_time'].apply(try_dateutil_parse)
return df
This model tries to parse the raw string value into a Python datetime. When not successful, it yields a None
value rather than raising an error. The dateutil
can handle a wider range of formats than most data platforms’ native functionality.
Testing the result
During the build process, dbt will check if any of the values are null. This is using the built-in not_null
test, which will generate and execute SQL in the data platform.
Our initial recommendation for testing Python models is to use generic and singular tests.
version: 2
models:
- name: transactions
columns:
- name: parsed_transaction_time
tests:
- not_null
Download the repository and install requirements
The full source code for this demo is available at:
The only prerequisites for this demo are python3
and git
. You can verify both are installed and you’re on the right version via these commands in your terminal:
python3 --version
git --version
Assuming both are availble in your system, then you can clone the example repository using your method of choice:
HTTPS
git clone https://github.com/dbt-labs/demo-python-blog.git
cd demo-python-blog
SSH
GitHub CLI
Then you'll create a virtual environment and install all the dependencies. Choose your shell / operating system from the list below and run the commands (defaulting to zsh
/bash
):
POSIX bash/zsh
python3 -m venv env
source env/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
source env/bin/activate
POSIX fish
POSIX csh/tcsh
POSIX PowerShell Core
Windows cmd.exe
Windows PowerShell
Build it
Once the dependencies are all installed, we can build the project:
dbt build
Query the result of the dbt transformation
Congrats on successfully running your first dbt Python model! Let’s confirm the output visually by running the following query:
duckcli demo.duckdb --table --execute "select id, transaction_time, parsed_transaction_time from parse_datetimes order by id"
As you can see, each of the various input formats were successfully parsed into a uniform and standardized format.
id | transaction_time | parsed_transaction_time |
---|---|---|
1 | Fri, 16 Dec 2022 02:59:36 +0000 | 2022-12-16 02:59:36 |
2 | Sun, 25 Dec 22 02:59:36 +0000 | 2022-12-25 02:59:36 |
3 | Thursday, 31-Mar-22 02:59:36 UTC | 2022-03-31 02:59:36 |
4 | Thu, 02 Jun 22 02:59:36 +0000 | 2022-06-02 02:59:36 |
5 | Thu, 17 Feb 2022 02:59:36 +0000 | 2022-02-17 02:59:36 |
6 | 2022-03-28 02:59:36+00:00 | 2022-03-28 02:59:36 |
7 | 2022-10-22 02:59:36+00:00 | 2022-10-22 02:59:36 |
8 | 2022-10-02 02:59:36+00:00 | 2022-10-02 02:59:36 |
9 | Monday, 03-Jan-2022 02:59:36 UTC | 2022-01-03 02:59:36 |
10 | Thu, 25 Aug 2022 02:59:36 +0000 | 2022-08-25 02:59:36 |
11 | 2022-05-29T02:59:36+00:00 | 2022-05-29 02:59:36 |
12 | 2022-08-02 02:59:36+00:00 | 2022-08-02 02:59:36 |
13 | 2022-10-18 02:59:36+00:00 | 2022-10-18 02:59:36 |
⚠️ Remember to deactivate the environment as a final step:
deactivate
And there you go! A real live python example for a common analytics use case! Now let’s think about what we learned.
Best practice recommendations
Based on our early experiences using dbt Python models and our ongoing conversations with members of the dbt Community, here are some recommended “do’s and don'ts” we think will help set you up for success.
✅ Do:
- Use Python when it is better suited for the job – model training, using predictive models, matrix operations, exploratory data analysis (EDA), and Python packages that can assist with complex transformations.
- Use the native the dataframe type and syntax for your data platform. Use a notebook environment (and a small sample of your data set) for initial development before copying it into dbt.
❌ Don’t:
- Use Python where the solution in SQL is just as direct. Although a pure Python dbt project is possible, we’d expect the most impactful projects to be a mixture of SQL and Python.
- Perform web scraping or download data from the web.
- Use pandas unless absolutely necessary. Although pandas can be useful in the prototyping stage, scaling to larger data sets often requires a platform-native type like Snowpark, Spark, or pandas-on-Spark.
- Translate your entire project to be Python-only. Although possible, if models are already written in SQL and there’s not a specific benefit to it being in Python, just leave it as SQL.
🤷 We don’t know yet!
There are several outstanding questions where you are invited to contribute to the discussion:
- Reusable Python logic
- Debugging output
- Testing Python models
- Abstractions over different dataframe APIs
Conclusion
Python is particularly well suited for many use-cases within a dbt project, including:
- Data science model training or in-line deployment
- Utilizing packages within the Python ecosystem like ftfy, dateutil, etc.
- Exploratory data analysis (EDA) using functions and packages such as
pandas.describe
, Pandas profiling, Great Expectations, etc. - Generating synthetic data sets using Faker
- Linear programming libraries like CVXOPT, PULP, CVXPY, ECOS, Google OR-Tools, SciPy, etc.
- Simulation using SimPy, Monte Carlo simulation, what-if analysis
- More yet to come!
There is no one programming language to rule them all, so there’s no programming language hill that we are going to die on! We’ll make sure that dbt supports a wide variety of languages and let you make well-reasoned decisions for each individual use case.
We are excited to hear what you discover and build! Please share with us in the #dbt-core-python-models or #i-made-this channels in Slack.
Comments