Data Sources
Data sources connect features to the data needed to calculate their values.
Data sources are composed of:
- A declaration using the @data_source decorator.
- The data source class itself, which represents the data source's schema (specifying the schema is optional; you can simply pass the class body).
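To illustrate the optional schema, here is a minimal sketch that compares a class with declared fields to one whose body is just pass. It uses the standard library's TypedDict rather than typing_extensions, and leaves out the @data_source decorator so that it runs on its own; the name UntypedUser is purely illustrative.

```python
from datetime import datetime
from typing import TypedDict, get_type_hints


class User(TypedDict):
    # Explicit schema: each field name is mapped to its expected type.
    user_id: str
    first_name: str
    last_name: str
    birthdate: datetime


class UntypedUser(TypedDict):
    # No schema declared: the class body is just `pass`.
    pass


# Inspect the declared fields of each class.
schema_fields = list(get_type_hints(User))
empty_fields = list(get_type_hints(UntypedUser))
```

The first class carries four typed fields, while the second declares none, which is what "passing" the class body amounts to.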
Declaring the data source
In order to declare a data source, we use the @data_source decorator.
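from datetime import datetime
from typing_extensions import TypedDict
# Assumed import path for the Raptor SDK names used below
from raptor import data_source, StreamingConfig

# df is a pandas DataFrame holding the training data, loaded beforehand
@data_source(
    training_data=df,
    keys=['user_id'],  # Optional
    timestamp='timestamp',  # Optional
    production_config=StreamingConfig()  # Optional
)
class User(TypedDict):
    user_id: str
    first_name: str
    last_name: str
    birthdate: datetime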
The @data_source decorator accepts the following arguments:
- training_data - the data that we'll use to train the model. We can use pandas to import the data from any format that pandas supports, such as CSV, Parquet, or JSON.
- keys - the fields that identify the data source's rows. In this case, we're using the user_id field. Although keys are optional, using them is highly recommended because it helps the Raptor engine optimize the feature calculations.
- timestamp - the timestamp field of the data source. If the data source doesn't have a timestamp field, you can pass None or omit the argument.
- production_config - the production configuration of the data source, which defines how the data source is used in production.
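As a rough sketch of preparing training_data, the snippet below loads a small CSV into a pandas DataFrame and checks that the columns named by keys and timestamp are present. The sample rows and the in-memory CSV are made up for illustration; in practice the data would likely come from a file, for example via pd.read_parquet or pd.read_json.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real CSV file.
csv_text = """user_id,first_name,last_name,birthdate,timestamp
u1,Ada,Lovelace,1815-12-10,2023-01-01T10:00:00
u2,Alan,Turing,1912-06-23,2023-01-02T11:30:00
"""

# Parse the date columns so the timestamp field has a datetime dtype.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["birthdate", "timestamp"])

# Sanity checks before handing df to the decorator: the key and
# timestamp columns we intend to reference must exist in the data.
keys = ["user_id"]
timestamp = "timestamp"
missing = [col for col in keys + [timestamp] if col not in df.columns]
```

If missing is empty, df can be passed as training_data together with the same keys and timestamp values.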