Data sources connect features to the data needed to calculate their values.
A data source consists of two parts:
- The declaration, made with the @data_source decorator.
- The data source class itself, which represents the data source's schema (specifying the schema is optional; you
can simply use pass as the class body).
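As a minimal sketch of the second part, the schema can be written as a TypedDict class. The class names, field names, and types below are assumptions made for illustration, and typing.TypedDict is used here in place of the typing_extensions import so the snippet runs on the standard library alone.

```python
from typing import TypedDict, get_type_hints


# Hypothetical schema: the field names and types are illustrative assumptions.
class Transaction(TypedDict):
    user_id: str
    amount: float
    timestamp: str


# Schema omitted: the class body is just `pass`.
class UntypedTransaction(TypedDict):
    pass


# Inspect the declared fields of each class.
schema_fields = get_type_hints(Transaction)
empty_fields = get_type_hints(UntypedTransaction)
```

Here schema_fields maps each field name to its type, while empty_fields is an empty mapping because no schema was specified.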
Declaring the data source
To declare a data source, we use the @data_source decorator:
from typing_extensions import TypedDict
keys=['user_id'], # Optional
timestamp='timestamp', # Optional
production_config=StreamingConfig() # Optional
The @data_source decorator accepts the following arguments:
training_data - the data used to train the model. It can be loaded with pandas from any format that pandas supports, such as CSV, Parquet, or JSON.
keys - the fields that identify the data source's rows. In this case, we're using the user_id field.
Although keys are optional, setting them is highly recommended because it helps the Raptor engine optimize the feature calculations.
timestamp - the timestamp field of the data source. The argument is optional, so it can be left out if the data source doesn't have a timestamp field.
production_config - the production configuration of the data source, which defines how the data source is used in production. In the example above, it is set to StreamingConfig().
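To show how these arguments fit together, here is a self-contained sketch. The data_source function below is a hypothetical stand-in that only records its arguments; it is not Raptor's implementation. The DECLARED_SOURCES registry, the Transaction class, and the sample rows are likewise assumptions. The inline rows stand in for data that would normally be loaded with pandas, and production_config is left as None instead of StreamingConfig() so that the snippet has no external dependencies.

```python
from typing import Any, Dict, List, Optional, TypedDict

# Hypothetical registry used only by this sketch.
DECLARED_SOURCES: Dict[str, Dict[str, Any]] = {}


def data_source(training_data: Any,
                keys: Optional[List[str]] = None,
                timestamp: Optional[str] = None,
                production_config: Any = None):
    """Stand-in decorator (not Raptor's implementation): records its arguments."""
    def register(cls):
        DECLARED_SOURCES[cls.__name__] = {
            "training_data": training_data,
            "keys": keys or [],
            "timestamp": timestamp,
            "production_config": production_config,
            "schema": dict(cls.__annotations__),
        }
        return cls
    return register


# Inline rows stand in for data that would normally be loaded with pandas.
rows = [
    {"user_id": "u1", "amount": 12.5, "timestamp": "2023-01-01T10:00:00Z"},
    {"user_id": "u2", "amount": 3.0, "timestamp": "2023-01-01T10:05:00Z"},
]


@data_source(
    training_data=rows,
    keys=["user_id"],        # Optional
    timestamp="timestamp",   # Optional
    production_config=None,  # Optional; the original example passes StreamingConfig()
)
class Transaction(TypedDict):
    user_id: str
    amount: float
    timestamp: str


declared = DECLARED_SOURCES["Transaction"]
```

The decorator leaves the class unchanged and stores the declaration next to it, so the keys, the timestamp field, and the schema can all be looked up from one place.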