Speed Up pandas Data Processing with Modin*
Overview
Preparing and manipulating large AI datasets can be time-consuming when using the popular pandas library. This is because pandas can only run on one CPU core at a time. Modin* is an open source, drop-in replacement for pandas that uses all your available cores to parallelize operations. Simply change your import statement and Modin automatically distributes your workloads, whether on your laptop or in the cloud.
This video walks through an example that shows how to get started with Modin. It also covers when not to use Modin, and how to apply Modin and pandas selectively to get a faster overall turnaround time.
Highlights
00:05 Because pandas is limited to using one core at a time, compute-intensive operations can become bottlenecks.
00:16 Modin is an alternative to pandas. Modin is open source and automatically parallelizes DataFrame operations across processor cores.
00:30 Learn how to install and set up Modin.
00:48 Find where to download the dataset and script from this video.
01:05 Watch a demonstration of Modin parallelizing a groupby operation across all cores.
01:38 Watch a demonstration of how pandas handles the same process.
02:04 Get more information on when Modin is appropriate for your work.
Featured Software
Download Modin as part of AI Tools.
Additional Resources
Transcript
Are compute-intensive pandas operations causing bottlenecks in your AI data preparation and manipulation steps? It's probably because pandas is limited to using one core at a time.
Modin is a drop-in replacement for pandas that automatically parallelizes DataFrame operations across all available processor cores. It's open source and offers a choice of back-end compute engines.
To get started, just install the package with pip or conda. You can specify which back end to install, or install them all. Then change one line of code, importing modin.pandas instead of pandas, and all your existing pandas calls will use Modin's parallel processing.
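The one-line swap described above can be sketched as follows. This is a minimal illustration, not code from the video; the install command and the sample DataFrame are assumptions, and the try/except fallback simply lets the script run even where Modin is not installed.

```python
# Install Modin first, e.g.:
#   pip install "modin[ray]"    (or modin[dask], or modin[all])
try:
    import modin.pandas as pd  # drop-in replacement: same API, parallel execution
except ImportError:
    import pandas as pd        # fall back to stock pandas if Modin isn't installed

# Everything below is unchanged pandas code.
df = pd.DataFrame({"city": ["A", "A", "B"], "sales": [1, 2, 3]})
totals = df.groupby("city")["sales"].sum()
print(totals)
```

Because Modin mirrors the pandas API, the rest of the script needs no changes; only the import line differs.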
The dataset and script for this example are from this article, which you can access with this QR code or the link in the description. Note that you will also need to install NLTK and Intel® Extension for scikit-learn*, which speeds up the prediction operation.
I'm running on my Intel® Core™ i7 laptop with eight physical cores, with Modin executing the groupby call using the Ray execution engine. All the cores are utilized automatically. Jumping ahead to the runtime results:
The groupby call is about 6x faster overall, but some of the shorter operations actually slowed down. This is because there is upfront preparation required to parallelize an operation, and for short operations that overhead outweighs the benefit of parallelization.
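The tradeoff described above can be measured directly. The sketch below times a groupby in stock pandas; the dataset size and column names are assumptions, and the actual speedup or slowdown from switching the import to Modin will depend on your hardware and data.

```python
# Hypothetical micro-benchmark: time a groupby on a synthetic DataFrame.
import time

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "key": np.random.randint(0, 1000, size=n),
    "val": np.random.rand(n),
})

start = time.perf_counter()
out = df.groupby("key")["val"].mean()
elapsed = time.perf_counter() - start
print(f"pandas groupby over {n} rows: {elapsed:.3f}s")

# With `import modin.pandas as pd` instead, the same call is distributed
# across cores. Large operations like this one can speed up substantially,
# while very short operations may slow down because the fixed cost of
# setting up parallel execution dominates.
```

Benchmarking both imports on your own workload is the surest way to see which operations benefit.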
If you want to optimize every step, you can take finer-grained control and mix Modin processing with pandas. Here's the code running in mixed mode: it uses pandas by default, converts from pandas to Modin for the groupby operation, then converts back to pandas for the rest. As expected, the runtime results show that mixed mode gives you the best of both worlds.
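A mixed-mode workflow like the one described might look like the sketch below. This is not the video's script: the sample data is invented, and the conversion calls (constructing a Modin DataFrame from a pandas one, and `_to_pandas()` for the reverse) are assumptions about Modin's conversion helpers; check the Modin documentation for the recommended API in your version.

```python
import pandas as pd

try:
    import modin.pandas as mpd
    HAVE_MODIN = True
except ImportError:
    HAVE_MODIN = False  # fall back to pure pandas so the sketch still runs

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# Short, cheap steps stay in pandas, avoiding parallelization overhead.
df["val2"] = df["val"] * 10

if HAVE_MODIN:
    mdf = mpd.DataFrame(df)          # pandas -> Modin for the expensive step
    grouped = mdf.groupby("key").sum()
    result = grouped._to_pandas()    # Modin -> pandas for the remaining steps
else:
    result = df.groupby("key").sum()

print(result)
```

Keeping the quick steps in pandas and reserving Modin for the heavy groupby is what lets mixed mode beat either library used alone.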
So, for more details about how, when, and when not to use Modin, check out the link below to the full article or go to developer.intel.com/modin to learn how to get started.