In this tutorial, we'll cover the basics of using the LabSDK to write features and models.
Install the LabSDK
In order to use the LabSDK, you need to install it. The recommended way is to use pip:

```bash
pip install --upgrade raptor-labsdk
```
To use the LabSDK, you need to import it:
```python
from raptor import *
```
Hello world feature
The very first feature we'll build is the `hello_world` feature. It's a simple feature that returns the string `hello world`:
```python
@feature(keys=[])
@freshness(max_age='1m', max_stale='10h')
def hello_world(_, ctx: Context) -> str:
    """this is a hello world feature"""
    return 'hello world'
```
Pretty simple, right? Let's go through what we did here line by line:
- We started by declaring the feature with the `@feature` decorator. For now, we're not using any keys, so we're passing an empty list (`keys=[]`).
- We defined the feature's freshness using the `@freshness` decorator:
  - We set the `max_age` to `1m`. That means a feature value that was calculated up to 1 minute ago is considered fresh, and doesn't need to be recalculated.
  - We set the `max_stale` to `10h`. That means that if we don't have a fresh value, and can't calculate a fresh one fast enough, we'll use a stale value as long as it's not older than 10 hours.
The freshness defines a contract between the feature's author and the production engine. By setting the freshness, you specify how long a value is considered fresh, and when it must be discarded and recalculated.
For more information about feature freshness, check out the Freshness section.
- We defined the feature's logic as a plain Python function, which also names our feature `hello_world`.
- We defined the feature's description in the docstring. This is a good place to put a description of the feature's logic that can help our colleagues understand it.
- We wrote our business logic, and simply returned the string `hello world`.
User age feature
The next feature we'll build is the `user_age` feature. It's a simple feature that returns the user's age in years. It uses the `birthdate` of the user identified by `user_id` to calculate the feature value (the freshness values in the snippet below are illustrative):
```python
@feature(keys=['user_id'], data_source=User)
@freshness(max_age='1d', max_stale='5d')  # illustrative freshness values
def user_age(this_row, ctx: Context) -> int:
    """age of the user in years"""
    return (ctx.timestamp - this_row['birthdate']).days // 365
```
The calculation context (the `ctx` argument, of type `Context`) provides information about the calculation process. It contains the following fields:
- `fqn` - the feature's fully qualified name.
- `keys` - a dictionary of the feature's keys.
- `timestamp` - the timestamp of the calculation.
For more information about the calculation context, check out the Context section.
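To make these fields concrete, here's a minimal sketch of a feature whose only job is to echo its calculation context back; the feature name and logic are hypothetical, for illustration only:

```python
@feature(keys=['user_id'])
@freshness(max_age='1m', max_stale='10h')  # illustrative freshness values
def calc_details(_, ctx: Context) -> str:
    """describes the current calculation using the Context fields"""
    return f"{ctx.fqn} calculated for {ctx.keys} at {ctx.timestamp}"
```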
Connecting the feature to a data source
You might have noticed that we're using the `User` data source in the `user_age` feature. The data source is the way we connect the feature to the data it needs to calculate its value. Here's how we define the `User` data source (the training-data path below is a placeholder):
```python
from datetime import datetime

import pandas as pd
from typing_extensions import TypedDict

@data_source(
    training_data=pd.read_csv('users.csv'),  # placeholder path; any format pandas supports works
    keys=['user_id'],  # Optional
    production_config=StreamingConfig()  # Optional
)
class User(TypedDict):
    user_id: str
    first_name: str
    last_name: str
    birthdate: datetime
```
Data sources are composed of a few parts:
- We declared the data source using the `@data_source` decorator.
- We configured the `training_data` to be a pandas DataFrame that contains the data we'll use to train the model. We can use pandas to import the data from any format that pandas supports, such as CSV, Parquet, JSON, etc.
- We configured the `keys` of the data source. The keys are the fields we'll use to identify the data source's rows. In this case, we're using the `user_id` field.
  Although the keys are optional, it's highly recommended to set them, because they help the Raptor engine optimize the feature calculations.
- We configured the `production_config` to be a `StreamingConfig`. This configuration is used to create a configuration stub for the production environment, basically hinting to our DevOps team how to configure and connect our data source in production.
Then, we defined the data source's schema using the `TypedDict` class. This is a Python class that defines the data source's schema. In this case, we're defining the `User` data source to have the following fields:
- `user_id` - a string that contains the user's ID.
- `first_name` - a string that contains the user's first name.
- `last_name` - a string that contains the user's last name.
- `birthdate` - a datetime that contains the user's birthdate.
Defining the schema is optional, but it's highly recommended, since it helps your IDE provide auto-completion and type hints, and helps you make the most of Raptor in the future (e.g. to discover drifts in the data).
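For example, once the schema is defined, we can annotate the handler's row argument with the `User` class. This is the same `user_age` feature from above, assuming the decorated schema class can still be used as an ordinary type annotation:

```python
@feature(keys=['user_id'], data_source=User)
@freshness(max_age='1d', max_stale='5d')  # illustrative freshness values
def user_age(this_row: User, ctx: Context) -> int:
    """age of the user in years"""
    # 'this_row' is typed as User, so IDEs can auto-complete and type-check 'birthdate'
    return (ctx.timestamp - this_row['birthdate']).days // 365
```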
Replaying the feature calculations
When we're in development, we need to replay the feature calculations over historical data to get a sense of how the feature will behave in production.
To do that, we can use the `replay` function. It allows us to replay the feature calculations against the `DataSource`'s training data:
```python
ages_df = user_age.replay()

# Let's see the result
ages_df.head()
```
Training a model
Now that we have the feature values, we can train a model.
To do that, we'll define a training function, and specify its input features and its label (which is also a feature). The concrete feature and label choices below are illustrative:
```python
@model(keys=['user_id'], input_features=[user_age], input_labels=[amount], model_server='sagemaker-ack')  # arguments are illustrative; 'amount' is assumed to be a feature defined elsewhere
def amount_prediction(ctx: TrainingContext):
    """trains a linear regression model that predicts 'amount'"""
    from sklearn.linear_model import LinearRegression
    df = ctx.features_and_labels()
    trainer = LinearRegression()
    trainer.fit(df[ctx.input_features], df[ctx.input_labels])
    return trainer
```
In this example, we're defining a linear regression model that predicts the `amount` feature from the model's input features.
You can see that we're also specifying the `model_server` that we're using. This will instruct Raptor how to deploy the model in production.
The training context (the `ctx` argument, of type `TrainingContext`) provides information about the training process. It contains the following fields:
- `fqn` - the model's fully qualified name.
- `keys` - a dictionary of the model's keys.
- `input_features` - a list of the model's input features.
- `input_labels` - a list of the model's input labels.
- `features_and_labels()` - a function that returns a pandas DataFrame containing the features and labels.
For more information about the model training, check out the Models section.
Exporting our model
Once we're done with our model, we can export it, along with all of its dependencies.
This will encapsulate our model, and transform our features and data sources into production artifacts.
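Exporting is a single call; a minimal sketch, assuming the trained model object exposes the LabSDK's `export()` helper:

```python
# Bundle the model, together with its features and data sources, as production artifacts
amount_prediction.export()
```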
Also, it will create a `Makefile`, so it will be easy for DevOps to integrate it into their regular CI/CD pipeline.