How to process large datasets using the Polars GPU engine
An enormous amount of data is generated every day, and for businesses, maintaining and processing it has become a challenge. As data volume grows, processing slows down, which is a problem for data analysts and data scientists.
The pandas library is the default choice for data processing and analysis tasks, but it has limitations when handling large datasets. Fortunately, the Polars library is well suited to large and complex datasets.
Polars also supports GPUs, which makes it a suitable choice for handling massive datasets.
In this guide, we will learn why to use Polars, how to set up the Polars GPU engine, how to run SQL queries, and how to create visualizations with the Polars library.
Why use Polars?
Polars is a fast DataFrame library powered by an OLAP query engine and designed for efficient data handling on a single machine. Through its GPU engine (powered by RAPIDS cuDF), that query engine can use NVIDIA GPUs for higher performance.
Designed to make processing 10–100+ GB of data feel interactive on a single GPU, the engine is built directly into the Polars Lazy API: pass engine="gpu" to the collect operation.
Setting up the Polars GPU engine
To get started, you need Polars version 1.5 installed on your computer.
To use the built-in data visualization capabilities of Polars, you’ll need to install a few additional dependencies. We’ll also install pynvml to help us determine which dataset size to use.
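The install commands themselves are not shown here. The Polars documentation describes installing the GPU extra with `pip install polars[gpu]`, though the exact command and package index depend on your Polars and CUDA versions. As a minimal sketch, the snippet below checks which of the packages mentioned in this guide are already present. The helper `installed_version` and the package list are illustrative assumptions, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names assumed from the tools this guide mentions.
REQUIRED = ["polars", "hvplot", "pynvml"]

report = {name: installed_version(name) for name in REQUIRED}
for name, found in report.items():
    print(f"{name}: {found if found is not None else 'not installed'}")
```

Anything reported as "not installed" needs to be installed before running the rest of the guide.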
Data loading and testing with CPU vs GPU
Loading data: We are using a 22 GB Kaggle dataset. To speed up the download, we will fetch a copy of it from a GCS bucket hosted by NVIDIA, which should take about 30 seconds. If the GPU has less than 24 GB of memory, the code below downloads a smaller version of the dataset instead.
import pynvml

# Query the name and total memory (in GB) of the first GPU
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print(pynvml.nvmlDeviceGetName(handle))
mem = pynvml.nvmlDeviceGetMemoryInfo(handle).total / 1e9

# GPUs with less than 24 GB of memory get the smaller copy of the dataset
# (lines starting with "!" are notebook shell commands)
if mem < 24:
    !wget https://storage.googleapis.com/rapidsai/polars-demo/transactions-t4-20.parquet -O transactions.parquet
else:
    !wget https://storage.googleapis.com/rapidsai/polars-demo/transactions.parquet -O transactions.parquet

!wget https://storage.googleapis.com/rapidsai/polars-demo/rainfall_data_2010_2020.csv
Next, we import the libraries, read the Parquet file, and look at the schema of the dataset.
Reducing the time of data processing
Polars can switch between the CPU and GPU engines, so you can use the CPU for a small query and the GPU engine for a complex one. We can compare the time taken by the two engines on the same query.
As we can see, the wall time on the CPU is 7.22 seconds, whereas the GPU engine returned the result in only 497 milliseconds, i.e. about 93% less processing time (roughly 14.5 times faster).
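The percentage follows directly from the two wall times reported above. A quick check of the arithmetic:

```python
cpu_seconds = 7.22   # CPU wall time reported above
gpu_seconds = 0.497  # GPU wall time reported above (497 ms)

# Share of the original time that was saved, and the speedup factor.
reduction_pct = (1 - gpu_seconds / cpu_seconds) * 100
speedup = cpu_seconds / gpu_seconds

print(f"{reduction_pct:.1f}% less time, {speedup:.1f}x faster")
# prints "93.1% less time, 14.5x faster"
```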
Advanced Use: SQL Queries and Multiple Datasets
Polars also supports SQL-like queries, making it easy for users familiar with SQL to perform complex analyses without switching between languages. You can also work with multiple datasets, performing tasks like joins and group by operations, and see even more pronounced speed improvements on GPUs.
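# Assumed setup, not shown in this article: `transactions` is the LazyFrame
# loaded earlier, and gpu_engine is a GPU engine configuration, for example
# gpu_engine = pl.GPUEngine(device=0, raise_on_fail=True)
query = """
SELECT CUST_ID, SUM(AMOUNT) AS sum_amt
FROM transactions
GROUP BY CUST_ID
ORDER BY sum_amt DESC
LIMIT 5
"""

# Run the same query on the CPU engine, then on the GPU engine
%time pl.sql(query).collect()
%time pl.sql(query).collect(engine=gpu_engine)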
Visualization with Polars
Polars also has built-in plotting support. Combined with the GPU engine, this lets you aggregate a large dataset quickly and plot the result right away, which keeps visualization efficient even for large, high-dimensional data.
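# `res` is the result of an earlier query (not shown in this article);
# the .hvplot accessor assumes `import hvplot.polars` has been run
(
    res
    .with_columns(
        pl.date(pl.col("YEAR"), pl.col("MONTH"), 1).alias("date-month"),
        pl.col("Rainfall (inches)") * 100,
    )
    .hvplot.line(
        x="date-month", y=["AMOUNT", "Rainfall (inches)"],
        by=["EXP_TYPE"],
        rot=45,
    )
)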
Final Thoughts
If you’re looking to speed up data processing and analysis, especially with very large datasets, try Polars with GPU support. Because it can switch between CPU and GPU seamlessly, you can work with large data while keeping setup complexity low. To learn more about the Polars GPU engine, visit https://rapids.ai/polars-gpu-engine/.