Python Data Transformation Tools for ETL

The other day, I went on Reddit to ask if I should use Python for ETL-related transformations, and the overwhelming response was yes.

However, while my fellow Redditors enthusiastically supported using Python, they advised looking into libraries outside of Pandas — citing concerns about Pandas performance with large datasets.

After doing some research, I found a ton of Python libraries built for data transformation: some improve Pandas performance, while others offer their own solutions.

I couldn’t find a comprehensive list of these tools, so I thought I’d compile one using the research I did — if I missed something or got something wrong, please let me know!

Pandas

Website: https://pandas.pydata.org/

Overview

Pandas certainly doesn’t need an introduction, but I’ll give it one anyway.

Pandas adds the concept of a DataFrame into Python, and is widely used in the data science community for analyzing and cleaning datasets. It is extremely useful as an ETL transformation tool because it makes manipulating data very easy and intuitive.
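To make that concrete, here is a minimal sketch of a typical Pandas transformation step (the orders data and column names are made up for illustration): fill missing values, derive a column, and aggregate.

```python
import pandas as pd

# Hypothetical extracted dataset standing in for real source data.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region": ["east", "west", "east", "west"],
    "amount": [100.0, 250.0, None, 80.0],
})

# Common ETL-style transformations: clean, derive, aggregate.
orders["amount"] = orders["amount"].fillna(0.0)
orders["amount_with_tax"] = orders["amount"] * 1.08
totals = orders.groupby("region")["amount"].sum().reset_index()
print(totals)
```

In a real pipeline the DataFrame would typically come from `pd.read_csv` or `pd.read_sql`, and the result would be written back out with `to_csv` or `to_sql`.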

Pros

  • Widely used for data manipulation
  • Simple, intuitive syntax
  • Integrates well with other Python tools, including visualization libraries
  • Support for common data formats (read from SQL databases, CSV files, etc.)

Drawbacks

  • Since it loads all data into memory, it isn’t scalable and can be a bad choice for very large (larger than memory) datasets


Dask

Website: https://dask.org/

Overview

According to their website, “Dask is a flexible library for parallel computing in Python.”

Essentially, Dask extends common interfaces such as Pandas for use in distributed environments — for instance, the Dask DataFrame mimics Pandas.

Pros

  • Scalability — Dask can run on your local machine and scale up to a cluster
  • Ability to work with datasets that don’t fit in memory
  • Increased performance with the same functionality, even on the same hardware (thanks to parallel computing)
  • Minimal code changes to switch from Pandas
  • Designed to integrate with other Python libraries

Drawbacks

  • Parallelism isn’t the only way to improve Pandas performance, and other optimizations often yield larger gains
  • There is little benefit if the computations you’re doing are small
  • Some functions aren’t implemented in the Dask DataFrame


Modin

Website: https://github.com/modin-project/modin

Overview

Modin is similar to Dask in that it tries to increase the efficiency of Pandas by using parallelism and enabling distributed DataFrames. Unlike Dask, Modin is based on Ray, a task-parallel execution framework.

The main upside to Modin over Dask is that Modin automatically handles distributing your data across your machine’s cores (no configuration necessary).

Pros

  • Scalability — this comes more from Ray than from Modin itself
  • Increased performance with the exact same functionality, even on the same hardware
  • Minimal code changes to switch from Pandas (changing the import statement)
  • Provides all the Pandas functionality — more of a “drop-in” solution than Dask is

Drawbacks

  • Parallelism isn’t the only way to improve Pandas performance, and other optimizations often yield larger gains
  • There is little benefit if the computations you’re doing are small


petl

Website: https://petl.readthedocs.io/en/stable/

Overview

petl includes many of the features Pandas has, but is designed more specifically for ETL and thus lacks extra features such as those for analysis. petl has tools for all three parts of ETL, but this post focuses solely on transforming data.

Although petl can transform tables, other tools such as Pandas are more widely used and better documented for transformation, making petl less appealing for this purpose.

Pros

  • Minimizes use of system memory, enabling it to scale to millions of rows
  • Useful for migrating between SQL databases
  • Lightweight and efficient

Drawbacks

  • Because it minimizes use of system memory, petl executes more slowly, so it isn’t recommended for applications where performance is important
  • Less widely used for data manipulation than the other solutions on this list


PySpark

Website: http://spark.apache.org/

Overview

Spark is designed for processing and analyzing big data, and offers APIs in numerous languages. The primary advantage of using Spark is that Spark DataFrames use distributed memory and make use of lazy execution, so they can process much larger datasets using a cluster — which isn’t possible with tools like Pandas.

Spark is a good choice for ETL if the data you’re working with is very large and you need speed and scale in your data operations.

Pros

  • Scalability and support for larger datasets
  • Spark DataFrames are very similar to those of Pandas in terms of syntax
  • Querying using SQL syntax via Spark SQL
  • Compatible with other popular ETL tools, including Pandas (you can actually convert a Spark DataFrame to a Pandas DataFrame, enabling you to work with all sorts of other libraries)
  • Compatible with Jupyter Notebooks
  • Built-in support for SQL, streaming, and graph processing

Drawbacks

  • Works best with a distributed storage system such as HDFS or S3
  • Data formats like CSV limit the benefits of lazy execution, typically requiring conversion to formats such as Parquet
  • Lack of direct support for data visualization tools like Matplotlib and Seaborn, both of which are well-supported by Pandas


Notable Mentions

Although I wanted this to be a comprehensive list, I didn’t want this post to become too long!

There really are many, many Python tools for data transformation, so I have included this section to at least mention other projects I missed (I might explore these further in a second part to this post).

Conclusion

I hope this list helped you at least get an idea of what tools Python has to offer for data transformation. After doing this research I am confident that Python is a great choice for ETL — these tools and their developers have made it an amazing platform to use.

As I said at the beginning of this post, I’m not an expert in this field — please feel free to comment if you have something to add!
