However, while my fellow Redditors enthusiastically supported using Python, they advised looking into libraries outside of Pandas — citing concerns about Pandas performance with large datasets.
After doing some research, I found a ton of Python libraries built for data transformation: some improve Pandas performance, while others offer their own solutions.
I couldn’t find a comprehensive list of these tools, so I thought I’d compile one using the research I did — if I missed something or got something wrong, please let me know!
Website: https://pandas.pydata.org/
Overview
Pandas certainly doesn’t need an introduction, but I’ll give it one anyway.
Pandas adds the concept of a DataFrame to Python, and is widely used in the data science community for analyzing and cleaning datasets. It is extremely useful as an ETL transformation tool because it makes manipulating data very easy and intuitive.
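To give a flavor of why that is, here is a minimal sketch of a typical Pandas transformation step (the column names and the 0.9 conversion rate are made up for illustration): filter rows, derive a new column, and aggregate.

```python
import pandas as pd

# Hypothetical raw order data loaded into a DataFrame.
orders = pd.DataFrame({
    "customer": ["alice", "bob", "alice", "carol"],
    "amount": [120.0, 80.0, 45.5, 200.0],
})

# Typical transformations: filter rows, derive a column, aggregate.
large = orders[orders["amount"] > 50]
large = large.assign(amount_eur=large["amount"] * 0.9)
totals = large.groupby("customer")["amount"].sum()
```

Each step reads almost like the English description of the transformation, which is a big part of Pandas' appeal for ETL work.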
Pros
Drawbacks
Further Reading
Website: https://dask.org/
Overview
According to their website, “Dask is a flexible library for parallel computing in Python.”
Essentially, Dask extends common interfaces such as Pandas for use in distributed environments — for instance, the Dask DataFrame mimics Pandas.
Pros
Drawbacks
Further Reading
Website: https://github.com/modin-project/modin
Overview
Modin is similar to Dask in that it tries to increase the efficiency of Pandas by using parallelism and enabling distributed DataFrames. Unlike Dask, Modin is based on Ray, a task-parallel execution framework.
The main upside to Modin over Dask is that Modin automatically handles distributing your data across your machine’s cores (no configuration necessary).
Pros
Drawbacks
Further Reading
Website: https://petl.readthedocs.io/en/stable/
Overview
petl includes many of the features Pandas has, but it is designed specifically for ETL and therefore lacks extra features, such as those for analysis. petl has tools for all three parts of ETL, but this post focuses solely on transforming data.
Although petl offers the ability to transform tables, other tools such as Pandas seem to be more widely used and better documented for transformation, making petl less appealing for this purpose.
Pros
Drawbacks
Further Reading
Website: http://spark.apache.org/
Overview
Spark is designed for processing and analyzing big data, and offers APIs in numerous languages. The primary advantage of using Spark is that Spark DataFrames use distributed memory and make use of lazy execution, so they can process much larger datasets using a cluster — which isn’t possible with tools like Pandas.
Spark is a good choice for ETL if the data you're working with is very large and speed and scale matter in your data operations.
Pros
Drawbacks
Further Reading
Although I wanted this to be a comprehensive list, I didn’t want this post to become too long!
There really are many, many Python tools for data transformation, so I have included this section to at least mention other projects I missed (I might explore these further in a second part to this post).
I hope this list helped you at least get an idea of what tools Python has to offer for data transformation. After doing this research I am confident that Python is a great choice for ETL — these tools and their developers have made it an amazing platform to use.
As I said at the beginning of this post, I’m not an expert in this field — please feel free to comment if you have something to add!