DataSources

DataSources are the glue that connects your Feature Definitions to production data sources (such as streaming platforms, databases, and CRM systems).

DataSources take care of the production concerns of handling high-volume data. They are responsible for many related tasks, such as authentication, rate limiting, schema normalization, and retries.

DataSource definition

DataSources are usually configured by DevOps and are defined as a Kubernetes resource:

apiVersion: k8s.raptor.ml/v1alpha1
kind: DataSource
metadata:
  name: clicks
spec:
  kind: streaming
  config:
    - name: kind
      value: kafka
    - name: brokers
      value: :9093
    - name: topics
      value: clickstream
    - name: consumer_group
      value: clicks-consumer-group
    - name: tls_disable
      value: "true"
  keyFields:
    - client_id
  timestampField: timestamp
  schema: https://raw.githubusercontent.com/raptor-ml/massivedynamic-protos/master/click.proto#Click

A DataSource definition is composed of the metadata (which defines its name), the kind of the connector, and the config for that particular kind.
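
The config section is written as a list of name/value entries. As a rough illustration only, the sketch below shows how such a list can be read as a flat mapping. The `flatten_config` helper is hypothetical and is not part of Raptor; the entries are copied from the `clicks` DataSource above.

```python
# Hypothetical helper, not part of Raptor: flatten a list of
# {"name": ..., "value": ...} entries into a plain dict.
def flatten_config(entries):
    config = {}
    for entry in entries:
        name = entry["name"]
        # Reject repeated names so a later entry cannot silently win.
        if name in config:
            raise ValueError(f"duplicate config entry: {name}")
        config[name] = entry["value"]
    return config


# The config entries from the `clicks` DataSource definition.
clicks_config = [
    {"name": "kind", "value": "kafka"},
    {"name": "brokers", "value": ":9093"},
    {"name": "topics", "value": "clickstream"},
    {"name": "consumer_group", "value": "clicks-consumer-group"},
    {"name": "tls_disable", "value": "true"},
]

flat = flatten_config(clicks_config)
print(flat["topics"])
```

Note that every value stays a string here, including `tls_disable`, which is quoted as "true" in the definition.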

For more information, see the relevant DataConnector documentation.

DataSource usage

DataSources are then referenced by name in your Feature Definitions:

apiVersion: k8s.raptor.ml/v1alpha1
kind: Feature
metadata:
  name: clicks
  namespace: default #production
  annotations:
    a8r.io/owner: "@AlmogBaku"
    a8r.io/description: "Demonstration of a simple aggr function"
spec:
  primitive: int
  freshness: 10s
  staleness: 1m
  dataSource:
    name: clicks
  keys:
    - client_id
  builder:
    aggr:
      - sum
      - count
    code: |
      def handler(data, ctx) -> int:
        return 1, ctx.timestamp, ctx.keys["client_id"].split(":")[1]
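
To see what the handler in this Feature returns, it can be called locally with a stand-in context. This is only a sketch: the `SimpleNamespace` object below is a hypothetical substitute for the context that Raptor supplies at runtime, and the `client:42` key value is an assumed format, implied by the `split(":")` call in the handler.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


# Copied verbatim from the Feature definition above.
def handler(data, ctx) -> int:
    return 1, ctx.timestamp, ctx.keys["client_id"].split(":")[1]


# Hypothetical stand-in for the runtime context (an assumption for illustration).
ctx = SimpleNamespace(
    timestamp=datetime(2022, 1, 1, tzinfo=timezone.utc),
    keys={"client_id": "client:42"},
)

# The handler returns three items: the constant 1, the context timestamp,
# and the part of client_id after the first colon.
value, ts, entity = handler({}, ctx)
print(value, ts.isoformat(), entity)
```

Because every event yields the constant 1, the `sum` and `count` aggregations listed under `aggr` would both reflect the number of click events.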

tip

If you do not define a namespace for the DataSource reference, the Feature's namespace is used.
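
The fallback described in this tip can be sketched as follows. The `resolve_data_source` function is hypothetical, and the optional `namespace` key on the reference is an assumption based on the tip; this is not Raptor's implementation.

```python
# Hypothetical sketch of the namespace fallback, not Raptor's code.
def resolve_data_source(ref, feature_namespace):
    # Use the reference's own namespace when it is set and non-empty,
    # otherwise fall back to the namespace of the Feature.
    namespace = ref.get("namespace") or feature_namespace
    return namespace, ref["name"]


# The reference from the Feature above has no namespace of its own.
implicit = resolve_data_source({"name": "clicks"}, "default")

# A reference with an explicit (assumed) namespace keeps it.
explicit = resolve_data_source({"name": "clicks", "namespace": "production"}, "default")

print(implicit, explicit)
```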
