Connecting to an S3 Data Source
Overview
To grant Tecton access to your S3 data sources, use AWS IAM role-based permissions. This requires setting bucket policies for the S3 buckets you want to use with Tecton.
To add S3 buckets:
- Add bucket policies
- Register the S3 data source
- Test the S3 data source
If your S3 data source is partitioned in a directory structure, we recommend that you register the data source with your AWS Glue Data Catalog and add the data as a Hive data source.
Adding a Bucket Policy
This AWS blog post explains how to configure bucket policies using IAM roles. An example bucket policy is shown below. The bucket policy gives permissions to the IAM role that Tecton uses to run Spark jobs.
The Principal element in the policy below is the one to use when adding S3 data sources to Tecton's Free Trial.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantResourceAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::{YOUR-TECTON-AWS-ACCOUNT}:root"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}/*",
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}"
      ]
    }
  ]
}
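The placeholders in the policy above are easy to mistype when edited by hand. The following is a minimal sketch that generates the same policy document from two inputs, using only the Python standard library. The helper name build_bucket_policy, the account ID, and the bucket name are hypothetical and chosen for illustration only.

```python
import json

# The four read-only S3 actions listed in the example bucket policy.
READ_ACTIONS = [
    "s3:GetObject",
    "s3:GetObjectVersion",
    "s3:ListBucket",
    "s3:GetBucketLocation",
]


def build_bucket_policy(tecton_account_id: str, bucket_name: str) -> dict:
    """Return the example bucket policy as a dict (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GrantResourceAccess",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{tecton_account_id}:root"},
                "Action": list(READ_ACTIONS),
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",  # objects in the bucket
                    f"arn:aws:s3:::{bucket_name}",    # the bucket itself
                ],
            }
        ],
    }


# Placeholder values; substitute your own account ID and bucket name.
policy = build_bucket_policy("123456789012", "my-example-bucket")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The printed JSON can be pasted into the bucket policy editor in the S3 console. If you manage policies with boto3, the same string could be passed as the Policy argument of the S3 client's put_bucket_policy call.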
Adding permissions to the Spark Role
If you have a paid version of Tecton, you must also grant the Spark role that you configured Tecton with (Databricks or EMR) read access to your S3 bucket. To do so, create a policy with the following permissions and attach it to the role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantRoleAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}/*",
        "arn:aws:s3:::{YOUR-BUCKET-NAME-HERE}"
      ]
    }
  ]
}
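Before attaching a policy to the role, it can help to confirm that the document lists every action and both resource ARNs shown above. The sketch below is a naive check under stated assumptions: missing_permissions is a hypothetical helper, it pools all Allow statements together, and it compares strings literally, so it does not evaluate wildcards such as s3:* or any Condition blocks the way AWS would.

```python
# Actions the example policies grant for read access.
REQUIRED_ACTIONS = {
    "s3:GetObject",
    "s3:GetObjectVersion",
    "s3:ListBucket",
    "s3:GetBucketLocation",
}


def _as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)


def missing_permissions(policy: dict, bucket_name: str) -> dict:
    """Report required actions/resources absent from the Allow statements.

    Hypothetical helper: literal string comparison only.
    """
    required_resources = {
        f"arn:aws:s3:::{bucket_name}",
        f"arn:aws:s3:::{bucket_name}/*",
    }
    actions, resources = set(), set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions.update(_as_list(statement.get("Action", [])))
        resources.update(_as_list(statement.get("Resource", [])))
    return {
        "actions": sorted(REQUIRED_ACTIONS - actions),
        "resources": sorted(required_resources - resources),
    }


# Example: a policy that forgot s3:ListBucket (placeholder bucket name).
incomplete = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GrantRoleAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:GetObjectVersion", "s3:GetBucketLocation"],
            "Resource": [
                "arn:aws:s3:::my-example-bucket/*",
                "arn:aws:s3:::my-example-bucket",
            ],
        }
    ],
}
print(missing_permissions(incomplete, "my-example-bucket"))
```

An empty list under both keys means the document names everything the example policy grants. It does not prove that AWS will allow the request.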
Registering an S3 Data Source
Once Tecton has access, register the data sources with Tecton as part of the data_sources.py file in your Feature Repository.
Create a config object using FileConfig and place it in a BatchSource object, along with the metadata that makes the new data source discoverable. For example:
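from tecton import BatchSource, FileConfig

# Points at a single Parquet file in the bucket that the policy grants access to.
sample_data_config = FileConfig(uri="s3://{YOUR-BUCKET-NAME-HERE}/{YOUR-FILENAME}.pq", file_format="parquet")
sample_data_vds = BatchSource(
    name="sample_data",
    batch_config=sample_data_config,
)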
After you have created these objects in your local Feature Repository, call tecton apply to submit them to the production Feature Store.
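The bucket named in the FileConfig URI has to be one that the bucket policy covers, so a quick sanity check on the URI can catch a mismatch before you run tecton apply. The following is a minimal sketch using only the Python standard library. split_s3_uri is a hypothetical helper, not part of the Tecton SDK, and the bucket and file names are placeholders.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path component starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://my-example-bucket/data/sample.pq")
print(bucket)  # compare this against the bucket name in your policy
print(key)
```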
Testing an S3 Data Source
To test that the connection to the S3 data source has been made correctly, open the interactive notebook that you use for Tecton development and preview the data:
import tecton
ds = tecton.get_data_source("sample_data")
# Preview the first 10 rows; a notebook renders the resulting pandas DataFrame.
ds.get_dataframe().to_pandas().head(10)
If you get a 403 error when calling get_dataframe, Tecton does not have permission to access the data. Check the bucket policy and the AWS setup.
If you continue to get errors, contact Tecton support.