Training time comparison
This section compares model training times for the conventional Pandas and scikit-learn workflow against the distributed Dask and RAPIDS workflow. Because Pandas processes data more slowly and is constrained by available memory, we trained the Pandas-based model on a smaller sample and then interpolated its results to offer a fair comparison.
The following table shows the raw training time comparison. The Pandas-based random forest model was trained on significantly less data: 50 million rows out of the 20 billion rows in day15 of the dataset, roughly 0.25% of the available data. The Dask-cuML random forest model, by contrast, was trained on all 20 billion available rows. The two approaches yielded comparable training times.
| Approach | Training time |
| --- | --- |
| Scikit-learn: using only 50M rows in day15 as the training data | 47 minutes and 21 seconds |
| RAPIDS-Dask: using all 20B rows in day15 as the training data | 1 hour, 12 minutes, and 11 seconds |
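The following is a minimal sketch of the two training paths being compared. The file paths, the `label` column name, and the hyperparameters are illustrative placeholders, not the original experiment's configuration.

```python
import time

# --- Pandas + scikit-learn on a 50M-row sample (single node, CPU) ---
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

sample_df = pd.read_parquet("day15_sample_50m.parquet")   # hypothetical sample file
X_cpu = sample_df.drop(columns=["label"])
y_cpu = sample_df["label"]

start = time.time()
RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X_cpu, y_cpu)
print(f"scikit-learn training time: {time.time() - start:.0f} s")

# --- RAPIDS + Dask-cuML on the full day15 data (multi-GPU) ---
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from cuml.dask.ensemble import RandomForestClassifier as cuRF

cluster = LocalCUDACluster()          # one Dask worker per visible GPU
client = Client(cluster)

full_df = dask_cudf.read_parquet("day15/*.parquet")       # hypothetical path to the full day
X_gpu = full_df.drop(columns=["label"])
y_gpu = full_df["label"]

start = time.time()
cuRF(n_estimators=100).fit(X_gpu, y_gpu)
print(f"Dask-cuML training time: {time.time() - start:.0f} s")
```

The key difference is that the second path partitions the data across GPU workers with dask_cudf, so the random forest is trained in parallel rather than on a single in-memory DataFrame.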
If we interpolate the training time results linearly, as shown in the following table, there is a significant advantage to using distributed training with Dask. It would take the conventional Pandas and scikit-learn approach about 13 days to process and train on the 45 GB of click logs for a single day, whereas the RAPIDS-Dask approach handles the same amount of data 262.39 times faster.
| Approach | Training time |
| --- | --- |
| Scikit-learn: using all 20B rows in day15 as the training data (interpolated) | 13 days, 3 hours, 40 minutes, and 11 seconds |
| RAPIDS-Dask: using all 20B rows in day15 as the training data | 1 hour, 12 minutes, and 11 seconds |
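The speedup figure follows directly from scaling the measured sample time linearly with the number of rows. A quick sanity check of the arithmetic:

```python
# Back-of-the-envelope linear extrapolation behind the table above.
sample_rows = 50_000_000           # rows used in the scikit-learn run
full_rows = 20_000_000_000         # rows in day15
sample_seconds = 47 * 60 + 21      # 47 min 21 s = 2,841 s

# Scale the measured time by the ratio of dataset sizes (400x more rows).
extrapolated_seconds = sample_seconds * (full_rows / sample_rows)   # ~1,136,400 s, about 13.15 days

rapids_seconds = 1 * 3600 + 12 * 60 + 11   # 1 h 12 min 11 s = 4,331 s
speedup = extrapolated_seconds / rapids_seconds
print(f"Estimated speedup: {speedup:.2f}x")   # ~262x
```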
The previous table shows that using RAPIDS with Dask to distribute the data processing and model training across multiple GPU instances yields a significantly shorter run time than conventional Pandas DataFrame processing with scikit-learn model training. This framework enables scaling both up and out, in the cloud as well as on-premises, on a multinode, multi-GPU cluster.
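To scale out beyond a single machine, the same training code can connect its Client to an existing Dask scheduler instead of starting a LocalCUDACluster; the scheduler address below is a placeholder, and each worker node would run a dask-cuda worker process to contribute its GPUs.

```python
from dask.distributed import Client

# Connect to an existing multinode, multi-GPU Dask cluster; the address is a placeholder.
client = Client("tcp://dask-scheduler.example.com:8786")

# The Dask-cuML training code from the earlier sketch runs unchanged against this client.
```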