In this tutorial, we'll cover the basics of using the LabSDK to write features and models.
Install the LabSDK
In order to use the LabSDK, you need to install it. The recommended way is to use the pip utility:
pip install --upgrade raptor-labsdk
To use the LabSDK, you need to import it:
from raptor import *
Hello world feature
The very first feature we'll build is the Hello world feature. It's a simple feature that returns the string "hello world":
@feature(keys=[])
@freshness(max_age='1m', max_stale='10h')
def hello_world(_, ctx: Context) -> str:
    """this is a hello world feature"""
    return "hello world"
Pretty simple, right? Let's go through what we did here, line by line:

- We started by declaring the feature with the @feature decorator. For now, we're not using any keys, so we're passing an empty list.
- We defined the feature's freshness:
  - We set the max_age to 1m - that means a feature value calculated 1 minute ago is considered fresh, and doesn't need to be recalculated.
  - We set the max_stale to 10h - that means that if we don't have a fresh value, and can't calculate a fresh one fast enough, we'll use a stale value as long as it's not older than 10 hours.

  Freshness: The freshness defines a contract between the feature's author and the production engine. By setting the freshness, you specify how long a value is considered fresh, and when it must be discarded and recalculated (see the sketch after this list). For more information about feature freshness, check out the Freshness section.

- We defined the feature's logic, naming our feature hello_world and returning the str primitive type.
- We defined the feature's description in the docstring. This is a good place to describe the feature's logic in a way that helps our colleagues understand it.
- We wrote our business logic, which simply returns the string "hello world".
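To make the freshness contract concrete, here's a hedged sketch (not part of the original tutorial) of the same feature with a much tighter contract, for a value we'd want recalculated more aggressively:

@feature(keys=[])
@freshness(max_age='30s', max_stale='5m')
def hello_world_fast(_, ctx: Context) -> str:
    """hello world again, but recalculated much more aggressively"""
    # fresh for only 30 seconds; a stale value may be served for up to 5 minutes
    return "hello world"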
User age feature
The next feature we'll build is the user_age feature. It's a simple feature that returns the user's age in years. It uses the birthdate of the given user_id to calculate the feature value.
@feature(keys=['user_id'], data_sources=[User])
@freshness(max_age='10d', max_stale='30d')
def user_age(this_row, ctx: Context) -> int:
    """age of the user in years"""
    return (ctx.timestamp - this_row['birthdate']).days // 365
The calculation context provides information about the calculation process. It contains the following fields:

- fqn - the feature's fully qualified name.
- keys - a dictionary of the feature's keys.
- timestamp - the timestamp of the calculation.
For more information about the calculation context, check out the Context section.
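For example, here's a hypothetical feature (not part of the tutorial) that reads the context's fields directly; it reuses the User data source that we'll define in the next section:

@feature(keys=['user_id'], data_sources=[User])
@freshness(max_age='1d', max_stale='7d')
def greeting(this_row, ctx: Context) -> str:
    """a greeting that demonstrates the calculation context"""
    # ctx.keys maps key names to values, e.g. {'user_id': '42'}
    # ctx.fqn is this feature's fully qualified name
    # ctx.timestamp is the time of this calculation
    return f"hello {this_row['first_name']}! (calculated by {ctx.fqn} at {ctx.timestamp})"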
Connecting the feature to a data source
You might have noticed that we're using the User data source in the user_age feature. The data source is how we connect the feature to the data it needs to calculate the feature value.
from datetime import datetime

from typing_extensions import TypedDict

@data_source(
    training_data=df,  # df is a pandas DataFrame with our historical data
    keys=['user_id'],  # Optional
    production_config=StreamingConfig()  # Optional
)
class User(TypedDict):
    user_id: str
    first_name: str
    last_name: str
    birthdate: datetime
Data sources are composed of two parts:

- Declaring the data source using the @data_source decorator:
  - We configure the training_data to be a pandas DataFrame that contains the data we'll use to train the model. We can use pandas to import the data from any format that pandas supports, such as CSV, Parquet, JSON, etc. (see the sketch after this list).
  - We configure the keys of the data source. The keys are the fields we'll use to identify the data source's rows; in this case, the user_id field. Although the keys are optional, it's highly recommended to set them, because they help the Raptor engine optimize the feature calculations.
  - We configure the production_config to be a StreamingConfig. This configuration is used to create a configuration stub for the production environment; it's basically a hint to our DevOps team on how to configure and connect our data source in production.
- Defining the data source's schema using the TypedDict class. This is a Python class that defines the data source's schema. In this case, we're defining the User data source to have the following fields:
  - user_id - a string that contains the user's ID.
  - first_name - a string that contains the user's first name.
  - last_name - a string that contains the user's last name.
  - birthdate - a datetime that contains the user's birthdate.

Defining the schema is optional, but it's highly recommended, since it helps your IDE provide auto-completion and type hints, and it can help you make the most of Raptor in the future (e.g. to discover drifts in the data).
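As a hedged sketch (the file names here are hypothetical), the df we passed as training_data could be prepared with pandas like this:

import pandas as pd

# Load the historical data for the User data source; any format that
# pandas supports works (CSV, Parquet, JSON, etc.)
df = pd.read_parquet('users.parquet')
# or, from a CSV, parsing the birthdate column into a datetime:
# df = pd.read_csv('users.csv', parse_dates=['birthdate'])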
Replaying the feature calculations
When we're in development, we need to replay the feature calculations on the historical data we have, to get a sense of how the feature will behave in production.
To do that, we can use the replay function. It allows us to replay the feature calculations on the DataSource's training data.
ages_df = user_age.replay()
# Let's see the result
ages_df.head()
Model training
Now that we have the feature values, we can train a model.
To do that, we'll define a training function and specify its input features and its label (which is also a feature).
@model(
    keys=['customer_id'],
    # 'total_spend+sum' and `amount` are features defined elsewhere
    input_features=['total_spend+sum'],
    input_labels=[amount],
    model_framework='sklearn',
    model_server='sagemaker-ack',
)
@freshness(max_age='1h', max_stale='100h')
def amount_prediction(ctx: TrainingContext):
    from sklearn.linear_model import LinearRegression

    df = ctx.features_and_labels()
    trainer = LinearRegression()
    trainer.fit(df[ctx.input_features], df[ctx.input_labels])
    return trainer
In this example, we're defining a linear regression model that predicts the amount feature, using the total_spend feature.
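Note that total_spend and amount are features defined elsewhere, outside this tutorial. As a hedged sketch, the amount label feature could look something like this, assuming a hypothetical Deal data source with an amount field:

@feature(keys=['customer_id'], data_sources=[Deal])
@freshness(max_age='1m', max_stale='1h')
def amount(this_row, ctx: Context) -> float:
    """the purchase amount of a deal"""
    return this_row['amount']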
You can see that we're also specifying the model_framework and model_server we're using. This instructs Raptor how to deploy the model in production.
The training context provides information about the training process. It contains the following fields:

- fqn - the model's fully qualified name.
- keys - a dictionary of the model's keys.
- input_features - a list of the model's input features.
- input_labels - a list of the model's input labels.
- features_and_labels() - a function that returns a pandas DataFrame containing the features and labels.
For more information about the model training, check out the Models section.
Exporting our model
Once we're done with our model, we can export it, and all of its dependencies, to the out/ directory.
This will encapsulate our model, and transform our features and data sources into production artifacts.
It will also create a Makefile, so it's easy for DevOps to integrate it into their regular CI/CD pipeline.
amount_prediction.export()