Data Sources
Tecton has two data source classes:
- BatchSource: Stores the parameters needed to connect to a batch source, such as a Hive table, a data warehouse table, or a file.
- StreamSource: Stores the parameters needed to connect to a stream source, such as a Kafka topic or a Kinesis stream. Also stores the parameters for a batch source that contains the stream's historical event log.
Tecton Feature Views use instances of these data source classes to generate feature values from the raw data in the sources.
Batch Sources
A BatchFeatureView definition specifies one or more BatchSource objects, which indicate the sources from which the feature view generates feature values.
The batch_config object specified in a BatchSource definition may optionally contain a timestamp column representing the time of each record.
Values in the timestamp column must be in one of the following formats:
- A native TimestampType object.
- A string representing a timestamp that can be parsed by the default Spark SQL format, yyyy-MM-dd'T'hh:mm:ss.SSS'Z'.
- A customized string representing a timestamp, for which you provide a custom timestamp_format to parse the string. The format has to follow this guideline.
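The three accepted cases can be illustrated with a minimal sketch in plain Python. This is not Tecton code: `to_timestamp` is a hypothetical helper, the `strptime` directives are only a rough stand-in for the pattern syntax that timestamp_format actually expects, and treating parsed strings as UTC is an assumption made for the example.

```python
from datetime import datetime, timezone

# Python strptime pattern used as a rough analogue of the default pattern
# quoted above (an assumption made for illustration only).
DEFAULT_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def to_timestamp(value, timestamp_format=None):
    """Normalize one timestamp column value to a timezone-aware datetime."""
    if isinstance(value, datetime):
        # Case 1: the value is already a native timestamp object.
        return value if value.tzinfo else value.replace(tzinfo=timezone.utc)
    # Case 2 (default pattern) or case 3 (caller-supplied pattern).
    parsed = datetime.strptime(value, timestamp_format or DEFAULT_FORMAT)
    # Assumption: string timestamps are interpreted as UTC.
    return parsed.replace(tzinfo=timezone.utc)


native = to_timestamp(datetime(2023, 5, 1, 12, 30, tzinfo=timezone.utc))
default = to_timestamp("2023-05-01T12:30:00.000Z")
custom = to_timestamp("01/05/2023 12:30", timestamp_format="%d/%m/%Y %H:%M")
```

All three calls yield the same instant, which is the point of a custom format: the stored representation differs, but the parsed record time does not.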
A timestamp column must be specified in the batch_config object if any BatchFeatureView uses a FilteredSource with a BatchSource that uses that batch_config object.
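One way to see why the timestamp column is required: restricting a source to a time window is only possible when every record carries a time. The sketch below shows that idea in plain Python. It is not Tecton's implementation; `filter_by_time_range`, the record layout, and the half-open interval are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def filter_by_time_range(records, timestamp_field, start, end):
    """Keep records whose timestamp falls in the half-open range [start, end)."""
    return [r for r in records if start <= r[timestamp_field] < end]


def utc(year, month, day):
    """Build a UTC midnight datetime (hypothetical convenience helper)."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical click events; "timestamp" plays the role of the timestamp column.
events = [
    {"ad_id": "a1", "timestamp": utc(2023, 4, 30)},
    {"ad_id": "a2", "timestamp": utc(2023, 5, 1)},
    {"ad_id": "a3", "timestamp": utc(2023, 5, 2)},
]

# Select only the records for 2023-05-01.
window = filter_by_time_range(events, "timestamp", utc(2023, 5, 1), utc(2023, 5, 2))
```

Without a timestamp field there is nothing for the range comparison to act on, so the filter could not be applied.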
Defining a BatchSource
- Declare a configuration object that is an instance of a configuration class specific to your source. Tecton supports these configuration classes:
  - FileConfig: File source (such as a file on S3)
  - HiveConfig: Hive (or Glue) table
  - RedshiftConfig: Redshift table or query
  - SnowflakeConfig: Snowflake table or query

  Note: Tecton on Snowflake only supports SnowflakeConfig.

  The complete list of configurations can be found in API Reference.

  As an alternative to using a configuration object, you can use a Data Source Function, which offers more flexibility. However, if you do not require the additional flexibility, using a configuration object is recommended.
- Declare a BatchSource object that references the configuration defined in the previous step:
  - name: A unique identifier for the batch source. For example, "click_event_log".
  - batch_config: The configuration created in the step above.
See the Data Source API reference for detailed descriptions of Data Source attributes.
Example
The following example declares a BatchSource object that contains a configuration for connecting to Snowflake.
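from tecton import BatchSource, SnowflakeConfig

# Step 1: the configuration object holding the Snowflake connection parameters.
click_stream_snowflake_config = SnowflakeConfig(
    url="https://[your-cluster].eu-west-1.snowflakecomputing.com/",
    database="YOUR_DB",
    schema="CLICK_STREAM_SCHEMA",
    warehouse="COMPUTE_WH",
    table="CLICK_STREAM",
)

# Step 2: the BatchSource that references the configuration above.
clickstream_snowflake_ds = BatchSource(
    name="click_stream_snowflake_ds",
    batch_config=click_stream_snowflake_config,
)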
Stream Sources
A StreamSource contains these configurations:
- stream_config: The configuration for a stream source, which contains parameters for connecting to Kinesis or Kafka.
- batch_config: The configuration for a batch source that backs the stream source; the batch source contains the stream's historical data.
The value of each of these configs can be a configuration object (such as a HiveConfig or KafkaConfig) or a Data Source Function. A Data Source Function offers more flexibility than a configuration object. However, if you do not require the additional flexibility, using a configuration object is recommended.
A StreamSource is used by a StreamFeatureView to generate feature values using data from both the stream and batch sources.
A StreamFeatureView applies the same transformation to both data sources. This is possible because the StreamFeatureView uses a post processor function referenced in a StreamConfig definition, which maps the fields of the stream source to the batch source.
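The field-mapping idea can be sketched in plain Python. This is an illustration only, not Tecton's API: the records are ordinary dictionaries, and `post_process`, `count_clicks_per_user`, and all field names are hypothetical.

```python
def post_process(stream_record):
    """Flatten a raw stream record into the batch source's column layout."""
    payload = stream_record["payload"]
    return {
        "user_id": payload["userId"],
        "ad_id": payload["adId"],
        "timestamp": stream_record["eventTime"],
    }


def count_clicks_per_user(records):
    """A single transformation, written once against the batch column layout."""
    counts = {}
    for record in records:
        counts[record["user_id"]] = counts.get(record["user_id"], 0) + 1
    return counts


# Historical records already use the batch column names.
batch_records = [
    {"user_id": "u1", "ad_id": "a1", "timestamp": "2023-05-01T12:30:00.000Z"},
    {"user_id": "u1", "ad_id": "a2", "timestamp": "2023-05-01T12:31:00.000Z"},
]

# Live records arrive in a different, nested shape.
stream_records = [
    {"eventTime": "2023-05-02T08:00:00.000Z", "payload": {"userId": "u1", "adId": "a3"}},
    {"eventTime": "2023-05-02T08:01:00.000Z", "payload": {"userId": "u2", "adId": "a1"}},
]

batch_counts = count_clicks_per_user(batch_records)
stream_counts = count_clicks_per_user([post_process(r) for r in stream_records])
```

Because the mapping runs before the transformation, `count_clicks_per_user` never needs to know which of the two sources a record came from.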
See Create a Streaming Data Source for a description of how to iteratively develop a StreamSource.