Click-through rate prediction use case summary


This use case is based on the publicly available Terabyte Click Logs dataset from Criteo AI Lab. With recent advances in ML platforms, much attention has shifted to learning at scale, and Terabyte Click Logs has become a reference dataset for assessing the scalability of ML platforms and algorithms. By predicting the click-through rate, an advertiser can identify the visitors who are most likely to respond to an ad, analyze their browsing history, and show the most relevant ads based on each user's interests.
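To make the prediction target concrete, the following minimal sketch computes the empirical click-through rate (clicks divided by impressions) from tab-separated click-log rows. It assumes the row layout commonly described for the Criteo click logs: a 0/1 click label, followed by 13 integer features and 26 hashed categorical features, with empty fields for missing values. The function names (`parse_record`, `click_through_rate`, `make_row`) and the synthetic rows are hypothetical and for illustration only. They are not part of the solution described in this report, which uses RAPIDS and Dask rather than plain Python.

```python
NUM_INT_FEATURES = 13  # assumed count of integer feature columns
NUM_CAT_FEATURES = 26  # assumed count of hashed categorical columns


def parse_record(fields):
    """Split one row's fields into (label, integer features, categorical features)."""
    expected = 1 + NUM_INT_FEATURES + NUM_CAT_FEATURES
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    label = int(fields[0])
    # Empty strings mark missing values; represent them as None.
    ints = [int(v) if v else None for v in fields[1:1 + NUM_INT_FEATURES]]
    cats = [v if v else None for v in fields[1 + NUM_INT_FEATURES:]]
    return label, ints, cats


def click_through_rate(lines):
    """Return clicks / impressions over an iterable of tab-separated rows."""
    clicks = 0
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _, _ = parse_record(line.split("\t"))
        clicks += label
        total += 1
    return clicks / total if total else 0.0


def make_row(label):
    """Hypothetical helper: build one synthetic row for demonstration."""
    ints = ["1"] * NUM_INT_FEATURES
    cats = ["a1b2c3d4"] * NUM_CAT_FEATURES
    return "\t".join([str(label)] + ints + cats)


if __name__ == "__main__":
    sample = [make_row(1), make_row(0), make_row(0), make_row(0)]
    print(f"Empirical CTR: {click_through_rate(sample):.2f}")
```

A CTR model replaces this single aggregate ratio with a per-impression probability learned from the feature columns. At terabyte scale, the same parsing step is what GPU data frames and a distributed scheduler are meant to accelerate.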

The solution provided in this technical report highlights the following benefits:

  • Azure NetApp Files advantages in distributed or large-scale training

  • RAPIDS CUDA-enabled data processing (cuDF, cuPy, and so on) and ML algorithms (cuML)

  • Dask parallel computing framework for distributed training

The combination of Dask and Azure NetApp Files demonstrates a drastic improvement in random forest model training time, by two orders of magnitude. This improvement is relative to the conventional Pandas approach when dealing with real-world click log data that averages 45 GB of text per day.