# Connecting to a Snowflake Data Source Using Spark
Tecton can use Snowflake as a source of batch data for feature materialization with Spark. This page explains how to set up Tecton to use Snowflake as a data source.
This page is for connecting to a Snowflake data source via Spark. It does not apply to Tecton on Snowflake. If you are using Tecton on Snowflake, see Data Sources for information on using a Snowflake data source.
## Prerequisites
To set up Tecton to use Snowflake as a data source, you need the following:
- A notebook connection to Databricks or EMR.
- The URL for your Snowflake account.
- The name of the virtual warehouse Tecton will use for querying data from Snowflake.
- A Snowflake username and password. We recommend creating a new user in Snowflake configured to give Tecton read-only access. This user needs access to the warehouse. See the Snowflake documentation on how to configure this access.
- A Snowflake read-only role for Spark, granted to the user created above (a scripted sketch follows this list). See the Snowflake documentation for the required grants.
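If you'd rather script this Snowflake-side setup, here is a minimal sketch using the snowflake-connector-python package. The role, user, warehouse, and database names (`TECTON_READ_ONLY`, `TECTON_USER`, `COMPUTE_WH`, `CLICK_STREAM_DB`) are placeholders, and the exact grant set your account requires may differ, so treat the Snowflake documentation as authoritative:

```python
# A minimal sketch, assuming admin credentials and placeholder object names.
import snowflake.connector

conn = snowflake.connector.connect(
    user="<admin-user>",
    password="<admin-password>",
    account="<your-cluster>.<your-snowflake-region>",
)
cur = conn.cursor()

# Create a read-only role and grant it the access Tecton needs: usage on the
# warehouse plus read access to the relevant database objects.
for stmt in [
    "CREATE ROLE IF NOT EXISTS TECTON_READ_ONLY",
    "GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE TECTON_READ_ONLY",
    "GRANT USAGE ON DATABASE CLICK_STREAM_DB TO ROLE TECTON_READ_ONLY",
    "GRANT USAGE ON ALL SCHEMAS IN DATABASE CLICK_STREAM_DB TO ROLE TECTON_READ_ONLY",
    "GRANT SELECT ON ALL TABLES IN DATABASE CLICK_STREAM_DB TO ROLE TECTON_READ_ONLY",
    # Create the dedicated user Tecton will authenticate as.
    "CREATE USER IF NOT EXISTS TECTON_USER PASSWORD = '<strong-password>' "
    "DEFAULT_ROLE = TECTON_READ_ONLY",
    "GRANT ROLE TECTON_READ_ONLY TO USER TECTON_USER",
]:
    cur.execute(stmt)

conn.close()
```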
If you're using different warehouses for different data sources, the username and password above need access to each warehouse. Otherwise, you'll run into the following exception when running `get_historical_features()` or `run()`:

```
net.snowflake.client.jdbc.SnowflakeSQLException: No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command.
```
## Configuring Secrets
To enable the Spark jobs managed by Tecton to read data from Snowflake, you will configure secrets in your secret manager.
For EMR users, follow the instructions to add a secret to the AWS Secrets Manager. For Databricks users, follow the instructions for creating a secret with Databricks secret management.
- Add a secret named `tecton-<deployment-name>/SNOWFLAKE_USER`, and set its value to the Snowflake username you configured above.
- Add a secret named `tecton-<deployment-name>/SNOWFLAKE_PASSWORD`, and set its value to the Snowflake password you configured above.

The deployment name is typically the name used to access Tecton, i.e. `https://<deployment-name>.tecton.ai`. Note that if your deployment name already starts with `tecton-`, the prefix is simply your deployment name (don't prepend another `tecton-`).
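As one illustration of the EMR path, the sketch below uses boto3 to create both secrets in AWS Secrets Manager; the user and password values are the placeholders from the prerequisites above. Databricks users would instead create the equivalent secrets with the Databricks CLI or secrets API.

```python
# A minimal sketch, assuming an AWS environment with boto3 and credentials
# that can write to AWS Secrets Manager. Replace <deployment-name> and the
# placeholder values with your own.
import boto3

client = boto3.client("secretsmanager")

client.create_secret(
    Name="tecton-<deployment-name>/SNOWFLAKE_USER",
    SecretString="TECTON_USER",  # the read-only user created above
)
client.create_secret(
    Name="tecton-<deployment-name>/SNOWFLAKE_PASSWORD",
    SecretString="<strong-password>",
)
```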
## Verifying
To verify the connection, add a Snowflake-backed Data Source. Do the following:
- Add a `SnowflakeConfig` Data Source Config object in your feature repository. Here's an example:

```python
from tecton import SnowflakeConfig, BatchSource

# Declare a SnowflakeConfig instance object that can be used as an argument in BatchSource
snowflake_config = SnowflakeConfig(
    url="https://<your-cluster>.<your-snowflake-region>.snowflakecomputing.com/",
    database="CLICK_STREAM_DB",
    schema="CLICK_STREAM_SCHEMA",
    warehouse="COMPUTE_WH",
    table="CLICK_STREAM_FEATURES",
)

# Use the SnowflakeConfig in a BatchSource
snowflake_ds = BatchSource(name="click_stream_snowflake_ds", batch_config=snowflake_config)
```

- Run `tecton plan`.
If the plan succeeds, the Data Source is added to Tecton; a misconfiguration results in an error message.
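As a final check, you can read from the new Data Source in a connected notebook. This is a sketch assuming a Tecton SDK version that exposes `get_dataframe()` on data sources and a workspace named `prod`; adjust both to match your setup:

```python
# A quick interactive check from a connected notebook; the workspace name
# "prod" is an assumption, and the data source name matches the example above.
import tecton

# Fetch the applied data source from the workspace it was planned into.
ws = tecton.get_workspace("prod")
ds = ws.get_data_source("click_stream_snowflake_ds")

# Pull a few rows through the Snowflake connection to confirm it works.
ds.get_dataframe().to_spark().show(5)
```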