Notebooks are not enough for ML at scale
Photo by Sylvain Mauroux on Unsplash
All images, unless otherwise noted, are by the author
There is a misunderstanding (not to say fantasy) which keeps coming back in companies whenever it comes to AI and Machine Learning. People often misjudge the complexity and the skills needed to bring Machine Learning projects to production, either because they do not understand the job or (even worse) because they think they understand it when they don’t.
Their first reaction when discovering AI might be something like “AI is actually pretty simple: I just need a Jupyter Notebook, some code copy-pasted from here and there (or from Copilot), and boom. No need to hire Data Scientists after all…” And the story always ends badly, with bitterness, disappointment and a feeling that AI is a scam: difficulty moving to production, data drift, bugs, unwanted behavior.
So let’s write it down once and for all: AI, Machine Learning, or any data-related job is a real job, not a hobby. It requires skills, craftsmanship, and tools. If you think you can do ML in production with notebooks, you are wrong.
This article aims to show, with a simple example, all the effort, skills and tools it takes to move from a notebook to a real pipeline in production. Because ML in production is mostly about being able to run your code automatically on a regular basis, with monitoring.
And for those who are looking for an end-to-end “notebook to vertex pipelines” tutorial, you might find this helpful.
Let’s imagine you are a Data Scientist working at an e-commerce company. Your company sells clothes online, and the marketing team asks for your help: they are preparing a special offer for specific products, and they would like to target customers efficiently by tailoring the email content pushed to them to maximize conversion. Your job is therefore simple: each customer should be assigned a score which represents the probability that they purchase a product from the special offer.
The special offer will specifically target the brands below, meaning that the marketing team wants to know which customers will buy their next product from one of them:
Allegra K, Calvin Klein, Carhartt, Hanes, Volcom, Nautica, Quiksilver, Diesel, Dockers, Hurley
We will, for this article, use a publicly available dataset from Google, the `thelook_ecommerce` dataset. It contains fake data with transactions, customer data, product data, everything we would have at our disposal when working at an online fashion retailer.
To follow this notebook, you will need access to Google Cloud Platform, but the logic can be replicated on other cloud providers or with third-party tools like Neptune, MLflow, etc.
As a respectable Data Scientist, you start by creating a notebook which will help you explore the data.
We first import libraries which we will use during this article:
<import catboost as cb>
<import pandas as pd>
<import sklearn as sk>
<import numpy as np>
<import datetime as dt>
<from dataclasses import dataclass>
<from sklearn.model_selection import train_test_split>
<from google.cloud import bigquery>
<%load_ext watermark>
<%watermark --packages catboost,pandas,sklearn,numpy,google.cloud.bigquery>
Getting and preparing the data
We will then load the data from BigQuery using the Python Client. Be sure to use your own project id:
<query = """
SELECT transactions.user_id, products.brand, products.category, products.department,
       products.retail_price, users.gender, users.age, users.country, users.city,
       users.created_at AS user_created_at,  -- aliased to avoid two columns named created_at
       transactions.created_at
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS transactions
LEFT JOIN `bigquery-public-data.thelook_ecommerce.users` AS users
ON transactions.user_id = users.id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.products` AS products
ON transactions.product_id = products.id
WHERE transactions.status <> 'Cancelled'
""">
client = bigquery.Client()
df = client.query(query).to_dataframe()
You should see something like this when looking at the dataframe:
These represent the transactions / purchases made by the customers, enriched with customer and product information.
Given our objective is to predict which brand customers will buy in their next purchase, we will proceed as follows:
Group purchases chronologically for each customer
If a customer has N purchases, we consider the Nth purchase as the target, and the first N-1 purchases as our features.
We therefore exclude customers with only 1 purchase
Let’s put that into code:
<# Compute the number of purchases per customer
recurrent_customers = df.groupby('user_id')['created_at'].count().to_frame("n_purchases")
# Merge with dataset and filter those with more than 1 purchase
df = df.merge(recurrent_customers, left_on='user_id', right_index=True, how='inner')
df = df.query('n_purchases > 1')
# Fill missing values
df.fillna('NA', inplace=True)
target_brands = ['Allegra K', 'Calvin Klein', 'Carhartt', 'Hanes', 'Volcom',
                 'Nautica', 'Quiksilver', 'Diesel', 'Dockers', 'Hurley']
aggregation_columns = ['brand', 'department', 'category']
# Group purchases by user chronologically
df_agg = (df.sort_values('created_at')
            .groupby(['user_id', 'gender', 'country', 'city', 'age'], as_index=False)[aggregation_columns]
            .agg({k: ";".join for k in aggregation_columns}))
# Create the target
df_agg['last_purchase_brand'] = df_agg['brand'].apply(lambda x: x.split(";")[-1])
df_agg['target'] = df_agg['last_purchase_brand'].isin(target_brands) * 1
df_agg['age'] = df_agg['age'].astype(float)
# Remove last item of sequence features to avoid target leakage
for col in aggregation_columns:
    df_agg[col] = df_agg[col].apply(lambda x: ";".join(x.split(";")[:-1]))
Notice how we removed the last item in the sequence features: this is very important, as otherwise we get what is called “data leakage”: the target is part of the features, so the model is given the answer when learning.
We now get this new df_agg dataframe:
Comparing with the original dataframe, we see that user_id 2 has indeed purchased IZOD, Parke & Ronen, and finally Orvis which is not in the target brands.
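To make the target construction concrete, here is a minimal pure-Python sketch of the same idea applied to a single customer. The helper name `split_history` is hypothetical (it is not part of the notebook); it only mirrors the logic above: the last brand becomes the target, and everything before it becomes the feature.

```python
# Brands targeted by the special offer (taken from the list in the article)
TARGET_BRANDS = {
    "Allegra K", "Calvin Klein", "Carhartt", "Hanes", "Volcom",
    "Nautica", "Quiksilver", "Diesel", "Dockers", "Hurley",
}


def split_history(brands, sep=";"):
    """Split a chronological, sep-joined purchase history into (features, target).

    Hypothetical helper: the last purchase defines the target and is removed
    from the feature string so that the answer does not leak into the input.
    """
    items = brands.split(sep)
    if len(items) < 2:
        raise ValueError("a customer needs at least two purchases")
    *history, last = items
    return sep.join(history), int(last in TARGET_BRANDS)


# The customer discussed above: IZOD, Parke & Ronen, then Orvis
features, target = split_history("IZOD;Parke & Ronen;Orvis")
print(features, target)
```

For this customer, the feature string keeps only the first two brands and the target is 0, since Orvis is not one of the target brands.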
Splitting into train, validation and test
As a seasoned Data Scientist, you will now split your data into different sets, as you obviously know that all three are required to perform rigorous Machine Learning. (Cross-validation is out of scope for today, folks; let’s keep it simple.)
One key thing when splitting the data is to use the not-so-well-known stratify parameter of the scikit-learn train_test_split() function. The reason is class imbalance: if the target distribution (% of 0 and 1 in our case) differs between training and testing, we might get frustrated with poor results when deploying the model.
ML 101, kids: keep your data distributions as similar as possible between training data and test data:
<# Remove unnecessary features
df_agg.drop('last_purchase_brand', axis=1, inplace=True)
df_agg.drop('user_id', axis=1, inplace=True)
# Split the data into train and eval
df_train, df_val = train_test_split(df_agg, stratify=df_agg['target'], test_size=0.2)
print(f"{len(df_train)} samples in train")  # 30950 samples in train
# Split eval in half to get the validation and test sets
df_val, df_test = train_test_split(df_val, stratify=df_val['target'], test_size=0.5)
print(f"{len(df_val)} samples in val")
print(f"{len(df_test)} samples in test")
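To see what stratification does, here is a small pure-Python sketch. It is not scikit-learn's implementation: `stratified_split` is a hypothetical helper that shuffles and cuts each class separately, so that both sides of the split keep the same share of positives. The 10% positive rate is made up for the illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=42):
    """Return (train_indices, test_indices), preserving per-class proportions."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    # Shuffle and cut each class independently
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced target: 100 positives out of 1000 rows
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels, test_size=0.2)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)  # 0.1 0.1
```

A purely random split would only match these rates on average; with a rare positive class, a single unlucky draw can leave the test set with noticeably more or fewer positives than the training set.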
Now that this is done, we will gracefully split our dataset between features and targets:
<X_train, y_train = df_train.iloc[:, :-1], df_train['target']
X_val, y_val = df_val.iloc[:, :-1], df_val['target']
X_test, y_test = df_test.iloc[:, :-1], df_test['target']
The features are of different types. We usually separate them into:
numerical features: they are continuous, and reflect a measurable, or ordered, quantity.
categorical features: they are usually discrete, and are often represented as strings (e.g. a country, a color, etc.)
text features: they are usually sequences of words.
Of course there can be more, such as image, video, audio, etc.
The model: introducing CatBoost
For our classification problem (you already knew we were in a classification framework, didn’t you?), we will use a simple yet very powerful library: CatBoost. It is built and maintained by Yandex, and provides a high-level API to easily play with boosted trees. It is close to XGBoost, though it does not work exactly the same under the hood.
CatBoost offers a nice wrapper to deal with features of different kinds. In our case, some features can be considered as “text” as they are the concatenation of words, such as “Calvin Klein;BCBGeneration;Hanes”. Dealing with this type of feature can sometimes be painful, as you need to handle them with text splitters, tokenizers, lemmatizers, etc. Fortunately, CatBoost can manage everything for us!
<# Define features
# Note: only 'static' and 'dynamic' are passed to the Pools below,
# and 'retail_price' is not a column of df_agg as built above
features = {'numerical': ['retail_price', 'age'],
            'static': ['gender', 'country', 'city'],
            'dynamic': ['brand', 'department', 'category']}
# Build CatBoost "pools", which are datasets
train_pool = cb.Pool(X_train, y_train,
                     cat_features=features.get("static"),
                     text_features=features.get("dynamic"))
validation_pool = cb.Pool(X_val, y_val,
                          cat_features=features.get("static"),
                          text_features=features.get("dynamic"))
# Specify text processing options to handle our text features
text_processing_options = {
    "tokenizers": [{"tokenizer_id": "SemiColon", "delimiter": ";", "lowercasing": "false"}],
    "dictionaries": [{"dictionary_id": "Word", "gram_order": "1"}],
    "feature_processing": {
        "default": [{"dictionaries_names": ["Word"],
                     "feature_calcers": ["BoW"],
                     "tokenizers_names": ["SemiColon"]}],
    },
}
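To give an intuition of what these options ask for, here is a rough pure-Python illustration of a semicolon tokenizer followed by a unigram bag-of-words. It is not CatBoost's internal code, and `semicolon_bow` is a hypothetical helper; it assumes each token simply becomes one presence flag (1 if the customer bought that brand at least once).

```python
def semicolon_bow(values, delimiter=";"):
    """Tokenize strings on the delimiter and build presence (0/1) features.

    Hypothetical illustration: no lowercasing, single-token (gram order 1)
    dictionary, one column per distinct token.
    """
    tokenized = [[tok for tok in value.split(delimiter) if tok] for value in values]
    # The "dictionary": every distinct token seen in the data
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    # One row per input string, one 0/1 column per dictionary entry
    rows = [[int(tok in set(tokens)) for tok in vocabulary] for tokens in tokenized]
    return vocabulary, rows


vocabulary, rows = semicolon_bow(["Calvin Klein;BCBGeneration;Hanes", "Hanes;Hanes"])
print(vocabulary)  # ['BCBGeneration', 'Calvin Klein', 'Hanes']
print(rows)        # [[1, 1, 1], [0, 0, 1]]
```

Splitting on “;” rather than on whitespace matters here: a brand such as “Calvin Klein” must stay a single token.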
We are now ready to define and train our model. Going through each and every parameter is out of today’s scope as the number of parameters is quite impressive, but feel free to check the API yourself.
And for brevity, we will not perform hyperparameter tuning today, but this is obviously a large part of the Data Scientist’s job!
<# Train the model
model = cb.CatBoostClassifier(iterations=200,
                              loss_function="Logloss",
                              random_state=42,
                              verbose=1,
                              auto_class_weights="SqrtBalanced",
                              use_best_model=True,
                              text_processing=text_processing_options,
                              eval_metric='AUC')
model.fit(train_pool, eval_set=validation_pool, verbose=10)
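One parameter worth a quick look, given our class imbalance, is auto_class_weights. As I read the CatBoost documentation, “Balanced” weights each class by the size of the largest class divided by its own size, and “SqrtBalanced” takes the square root of that ratio. The sketch below reproduces that arithmetic on a toy target; treat the formula as an assumption to check against the documentation, and note that `class_weights` is a hypothetical helper, not a CatBoost function.

```python
import math
from collections import Counter


def class_weights(labels, mode="SqrtBalanced"):
    """Compute per-class weights from label counts (assumed formulas, see text)."""
    counts = Counter(labels)
    largest = max(counts.values())
    if mode == "Balanced":
        return {label: largest / n for label, n in counts.items()}
    if mode == "SqrtBalanced":
        return {label: math.sqrt(largest / n) for label, n in counts.items()}
    raise ValueError(f"unknown mode: {mode}")


# Toy target with 10% positives
toy_labels = [0] * 900 + [1] * 100
print(class_weights(toy_labels, "Balanced"))      # {0: 1.0, 1: 9.0}
print(class_weights(toy_labels, "SqrtBalanced"))  # {0: 1.0, 1: 3.0}
```

Under this reading, the square-root variant still gives the rare class more weight, but less aggressively than full balancing.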
And voilà, our model is trained. Are we done?
No. We need to check that our model’s performance between training and testing is consistent. A huge gap between training and testing means our model is overfitting (i.e. “learning the training data by heart and not good at predicting unseen data”).
For our model evaluation, we will use the ROC-AUC score. I will not deep-dive into this one either, but in my experience it is generally a robust metric, and way better than accuracy.
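For intuition, ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way, by brute force over all positive/negative pairs. It is a teaching aid with a hypothetical `roc_auc` helper and made-up scores, not what you would run on a large dataset.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count for half a win. Quadratic in the number of samples.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute ROC-AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

In the notebook itself, scikit-learn's roc_auc_score would be the natural tool; computing the score on both the training and validation sets and comparing the two is a simple way to spot the overfitting gap mentioned above.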
A quick side note on accuracy: I usually do not recommend using this as your evaluation metric. Think of an imbalanced dataset where you have 1% of…