Tecton has two data source classes:
BatchSource: Stores the parameters needed to connect to a batch source, such as a Hive table, a data warehouse table, or a file.
StreamSource: Stores the parameters needed to connect to a stream source (such as a Kafka topic or a Kinesis Stream). Also stores the parameters for a batch source, which contains the stream's historical event log.
Instances of these data source classes are used by Tecton Feature Views to generate feature values from raw data in the sources.
A BatchFeatureView definition specifies one or more data sources, which indicate the sources from which the feature view generates feature values.
The batch_config object specified in a BatchSource object definition may optionally contain a timestamp column representing the time of each record.
Values in the timestamp column must be in one of the following formats:
- A native TimestampType object.
- A string representing a timestamp that can be parsed by Spark SQL's default parser.
- A customized string representing a timestamp, for which you can provide a custom timestamp_format to parse the string. The format must follow Spark's datetime pattern guidelines.
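As an illustrative sketch of the third case (the database, table, column, and format values below are hypothetical), a custom timestamp_format can be supplied alongside the timestamp field in a batch configuration so that string timestamps are parsed correctly:

```python
from tecton import HiveConfig

# Hypothetical Hive table whose "event_ts" column stores strings
# such as "2023-01-15-23:59:59" rather than a native TimestampType.
transactions_config = HiveConfig(
    database="demo",          # illustrative database name
    table="transactions",     # illustrative table name
    timestamp_field="event_ts",
    # Custom pattern telling Tecton how to parse the timestamp strings
    timestamp_format="yyyy-MM-dd-HH:mm:ss",
)
```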
A timestamp column must be specified in the batch_config object if any BatchFeatureView uses a FilteredSource with that BatchSource specified.
Declare a configuration object that is an instance of a configuration class specific to your source. Tecton supports these configuration classes:
FileConfig: File source (such as a file on S3)
HiveConfig: Hive (or Glue) Table
RedshiftConfig: Redshift Table or Query
SnowflakeConfig: Snowflake Table or Query
Tecton on Snowflake only supports SnowflakeConfig.
The complete list of configurations can be found in the API Reference.
As an alternative to using a configuration object, you can use a Data Source Function, which offers more flexibility. However, if you do not require the additional flexibility, using a configuration object is recommended.
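For instance, a batch Data Source Function is a Python function, decorated with Tecton's spark_batch_config, that returns a Spark DataFrame; the S3 path below is hypothetical, and this is a sketch rather than a complete definition:

```python
from tecton import spark_batch_config

@spark_batch_config()
def click_stream_data_source_function(spark):
    # Read raw click-stream events from a hypothetical S3 location
    # and return them as a Spark DataFrame. Arbitrary logic (filtering,
    # renaming, joins) can run here before the DataFrame is returned.
    return spark.read.parquet("s3://example-bucket/click_stream/")
```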
Declare a BatchSource object that references the configuration defined in the previous step:
name: A unique identifier for the batch source. For example, "click_stream_snowflake_ds".
batch_config: The configuration created in the step above.
See the Data Source API reference for detailed descriptions of Data Source attributes.
The following example declares a BatchSource object that contains a SnowflakeConfig configuration for connecting to Snowflake. The connection parameter values shown are illustrative:

```python
click_stream_snowflake_ds = SnowflakeConfig(
    url="https://<your-account>.snowflakecomputing.com/",  # illustrative account URL
    database="CLICK_STREAM_DB",      # illustrative database
    schema="CLICK_STREAM_SCHEMA",    # illustrative schema
    warehouse="COMPUTE_WH",          # illustrative warehouse
    table="CLICK_STREAM",            # illustrative table
)

clickstream_snowflake_ds = BatchSource(
    name="click_stream_snowflake_ds",
    batch_config=click_stream_snowflake_ds,
)
```
A StreamSource contains these configurations:
stream_config: The configuration for a stream source, which contains parameters for connecting to Kinesis or Kafka.
batch_config: The configuration for a batch source that backs the stream source; the batch source contains the stream's historical data.
The value of these configs can be an object (such as a KafkaConfig) or a Data Source Function. A Data Source Function offers more flexibility than an object. However, if you do not require the additional flexibility, using an object is recommended.
A StreamSource is used by a StreamFeatureView to generate feature values using data from both the stream and batch sources. The StreamFeatureView applies the same transformation to both data sources. This is possible because the StreamFeatureView uses a post processor function referenced in the StreamConfig definition, which maps the fields of the stream source to those of the batch source.
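To sketch this (the broker address, topic, table, and field names below are all hypothetical), a post processor is a function that reshapes the raw stream DataFrame to match the batch source's schema, and is passed to the stream configuration:

```python
from tecton import StreamSource, KafkaConfig, HiveConfig
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema of the JSON payload carried on the Kafka topic.
raw_schema = StructType([
    StructField("user", StringType()),
    StructField("ts", TimestampType()),
])

def click_stream_post_processor(df):
    # Parse the raw Kafka "value" payload and rename fields so the
    # stream's output columns match the batch source's columns.
    parsed = df.selectExpr("CAST(value AS STRING) AS value")
    return parsed.select(from_json(col("value"), raw_schema).alias("e")).select(
        col("e.user").alias("user_id"),
        col("e.ts").alias("timestamp"),
    )

click_stream_source = StreamSource(
    name="click_stream_source",
    stream_config=KafkaConfig(
        kafka_bootstrap_servers="broker.example.com:9092",  # hypothetical broker
        topics="click-stream-events",                        # hypothetical topic
        post_processor=click_stream_post_processor,
        timestamp_field="timestamp",
    ),
    batch_config=HiveConfig(
        database="demo",                # hypothetical historical event log
        table="click_stream_events",
        timestamp_field="timestamp",
    ),
)
```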
See Create a Streaming Data Source for a description of how to iteratively develop a StreamSource.