本繁體中文版使用機器翻譯,譯文僅供參考,若與英文版本牴觸,應以英文版本為準。
Load Criteo按一下「日誌第15天」(日誌第15天)、然後訓練sci套 件學習隨機樹系模式
貢獻者
-
此文件 PDF 的網站
![](https://docs.netapp.com/common/images/pdf-zip.png)
個別的 PDF 文件集合
Creating your file...
This may take a few minutes. Thanks for your patience.
Your file is ready
本節說明我們如何使用Pandas和dask DataFrames從Criteo Terabyte資料集載入Click記錄資料。使用案例與數位廣告有關、可預測是否要點選廣告、或是Exchange未在自動化管道中使用準確的模式、藉此建立使用者的設定檔。
我們從Click Logs資料集載入第15天的資料、總計45GB。在Jupyter筆記型電腦「CTer-andasRF-collated .ipynb」中執行下列儲存格、會建立一個包含前5、000萬列的大圓子資料框架、並產生scipit-記憶 隨機樹系模型。
%%time import pandas as pd import numpy as np header = ['col'+str(i) for i in range (1,41)] #note that according to criteo, the first column in the dataset is Click Through (CT). Consist of 40 columns first_row_taken = 50_000_000 # use this in pd.read_csv() if your compute resource is limited. # total number of rows in day15 is 20B # take 50M rows """ Read data & display the following metrics: 1. Total number of rows per day 2. df loading time in the cluster 3. Train a random forest model """ df = pd.read_csv(file, nrows=first_row_taken, delimiter='\t', names=header) # take numerical columns df_sliced = df.iloc[:, 0:14] # split data into training and Y Y = df_sliced.pop('col1') # first column is binary (click or not) # change df_sliced data types & fillna df_sliced = df_sliced.astype(np.float32).fillna(0) from sklearn.ensemble import RandomForestClassifier # Random Forest building parameters # n_streams = 8 # optimization max_depth = 10 n_bins = 16 n_trees = 10 rf_model = RandomForestClassifier(max_depth=max_depth, n_estimators=n_trees) rf_model.fit(df_sliced, Y)
若要使用受過訓練的隨機樹系模型來執行預測、請執行本筆記型電腦的下一段。我們從第15天開始採用最後一百萬列作為測試集、以避免任何重複資料。儲存格也會計算預測的準確度、定義為模型準確預測使用者是否點選廣告的發生百分比。若要檢閱此筆記型電腦中任何不熟悉的元件、請參閱 "正式的sci套 件學習文件"。
# testing data, last 1M rows in day15 test_file = '/data/day_15_test' with open(test_file) as g: print(g.readline()) # dataFrame processing for test data test_df = pd.read_csv(test_file, delimiter='\t', names=header) test_df_sliced = test_df.iloc[:, 0:14] test_Y = test_df_sliced.pop('col1') test_df_sliced = test_df_sliced.astype(np.float32).fillna(0) # prediction & calculating error pred_df = rf_model.predict(test_df_sliced) from sklearn import metrics # Model Accuracy print("Accuracy:",metrics.accuracy_score(test_Y, pred_df))