Version: 0.8

Offline Retrieval Methods

Public Preview

This feature is currently in Public Preview.

This feature has the following limitations:
  • Available for Tecton on Databricks and EMR. Coming to Tecton Rift in a future release.
  • Output format is subject to change based on user feedback.
If you have questions or want to share feedback, please file a feature request.

Background

Tecton 0.8+ includes three methods that improve the offline feature retrieval experience:

  • get_features_in_range(start_time, end_time, ...)
  • get_features_for_events(events, ...)
  • run_transformation(start_time, end_time, mock_inputs)

In this section, we will explore the new behavior of these methods with some examples and learn how to leverage them for training using point-in-time joins, feature data analytics, and monitoring.

These methods are in Public Preview in 0.8 and will replace the existing methods for offline feature retrieval (.get_historical_features() and .run()) in a future version of Tecton's SDK.

get_features_in_range(start_time, end_time, ...)

For BatchFeatureView, StreamFeatureView and FeatureTable.

Overview

This method retrieves feature values that are valid between the input start_time (inclusive) and end_time (exclusive).

It returns a Dataframe containing the following:

  • Entity Join Key Columns
  • Feature Value Columns
  • The columns _valid_from and _valid_to, which specify the time range for which the row of feature values is valid. The time range defined by [_valid_from, _valid_to) will never overlap with any other rows for the same join keys.
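
The non-overlap guarantee can be checked on a retrieved result. The following is a minimal sketch in plain Python, assuming the rows have already been collected into dictionaries. The helper name `find_overlaps` and the sample rows are hypothetical and are not part of the Tecton SDK.

```python
from collections import defaultdict
from datetime import datetime


def find_overlaps(rows, join_keys):
    """Return consecutive row pairs (per join key) whose validity ranges overlap.

    An empty result means no [_valid_from, _valid_to) ranges overlap at all.
    """
    by_key = defaultdict(list)
    for row in rows:
        by_key[tuple(row[k] for k in join_keys)].append(row)

    overlaps = []
    for group in by_key.values():
        group.sort(key=lambda r: r["_valid_from"])
        # After sorting by start, half-open ranges overlap somewhere only if
        # some range starts before its predecessor ends.
        for current, following in zip(group, group[1:]):
            if following["_valid_from"] < current["_valid_to"]:
                overlaps.append((current, following))
    return overlaps


# Hypothetical sample rows shaped like the method's output.
rows = [
    {"user_id": "user_1", "amount": 5,
     "_valid_from": datetime(2022, 1, 2), "_valid_to": datetime(2022, 1, 3)},
    {"user_id": "user_1", "amount": 10,
     "_valid_from": datetime(2022, 1, 3), "_valid_to": datetime(2022, 1, 5)},
]
overlaps = find_overlaps(rows, join_keys=["user_id"])
print(overlaps)  # an empty list means no ranges overlap
```

Because the ranges are half-open, a row that ends exactly where the next one starts does not count as an overlap.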

When is a feature value "valid"?

A feature value is considered to be valid at a specific point in time if the Online Store would have returned that value if queried at that moment in time.

When does a feature value change or stop being valid?

Consider an entity A that has the following transaction events:

[Figure: Events]

For Non-Aggregate Feature Views, the feature value will cease to be valid if:

  • It is overwritten by a new event with a different value.
  • ttl is set and the event expires because it has been in the Online Store for longer than ttl.

[Figure: Non-Aggregate Example]

Note: If an entity has multiple events within the same batch_schedule interval, Tecton will write the last value to the online store and this is the value that is valid.

fv.get_features_in_range(start_time=datetime(day 1), end_time=datetime(day 9))

| entity_id | amount | _valid_from | _valid_to | Notes |
|---|---|---|---|---|
| A | 5 | Day 2 | Day 3 | Expires due to a new event. |
| A | 10 | Day 3 | Day 5 | Expires due to TTL. |
| A | 20 | Day 7 | Day 8 | Expires due to a new event. |
| A | 10 | Day 8 | Day 9 | May continue to be valid beyond Day 9. |
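
The rows in the non-aggregate example can be reproduced with a few lines of plain Python. This is only a sketch of the validity rules described here, not Tecton's implementation. It assumes integer day numbers at which each value becomes valid (read off the example rows) and a ttl of 2 days (inferred from the row that is valid from Day 3 to Day 5). The function name `non_aggregate_validity` is hypothetical.

```python
def non_aggregate_validity(events, ttl_days, start, end):
    """Compute validity ranges for a non-aggregate feature.

    events: list of (day_value_becomes_valid, value), sorted by day.
    Returns (value, valid_from, valid_to) tuples clipped to [start, end).
    """
    ranges = []
    for i, (day, value) in enumerate(events):
        # Without any newer event, the value expires once it is older than ttl.
        expiry = day + ttl_days
        if i + 1 < len(events):
            # A newer event overwrites the value, possibly before ttl expiry.
            # (This sketch starts a new range on every event, even if the
            # new value happens to equal the old one.)
            expiry = min(expiry, events[i + 1][0])
        valid_from, valid_to = max(day, start), min(expiry, end)
        if valid_from < valid_to:
            ranges.append((value, valid_from, valid_to))
    return ranges


# Assumed inputs, taken from the example rows for entity A.
events = [(2, 5), (3, 10), (7, 20), (8, 10)]
ranges = non_aggregate_validity(events, ttl_days=2, start=1, end=9)
for value, valid_from, valid_to in ranges:
    print(f"amount={value} valid from Day {valid_from} to Day {valid_to}")
```

The last range is cut off at Day 9 only because the query ends there, which is why the example notes that the value may continue to be valid beyond Day 9.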

For Aggregate Feature Views, the feature value will cease to be valid if:

  • A new event enters the sliding aggregation window and changes the aggregated feature value.
  • An old event exits the sliding aggregation window and stops contributing to the aggregated feature value.

[Figure: Aggregate Example]

fv.get_features_in_range(start_time=datetime(day 1), end_time=datetime(day 9))

| entity_id | amount_sum_5d_1d | _valid_from | _valid_to | Notes |
|---|---|---|---|---|
| A | 5 | Day 2 | Day 3 | Expires due to a new event entering the sliding window. |
| A | 15 | Day 3 | Day 7 | Expires due to a new event entering and an old event exiting the sliding window. |
| A | 30 | Day 7 | Day 9 | A new event with the value 10 enters the window on Day 8, but an old event with the same value exits the window and the value remains unchanged till Day 9. |
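
The aggregate rows can be reproduced the same way. The sketch below is an illustration of the sliding-window rule, not Tecton's implementation. It assumes the same four events as the non-aggregate example, a 5-day window evaluated once per day, and that an event contributes on the day it enters the window and on the following four days. The function name `sliding_sum_validity` is hypothetical.

```python
def sliding_sum_validity(events, window_days, start, end):
    """Compute validity ranges for a daily sliding-window sum.

    events: list of (day_event_enters_window, amount).
    Returns (sum, valid_from, valid_to) tuples covering [start, end).
    """
    ranges = []
    for day in range(start, end):
        # An event contributes while day is in [event_day, event_day + window_days).
        in_window = [
            amount
            for event_day, amount in events
            if event_day <= day < event_day + window_days
        ]
        total = sum(in_window) if in_window else None
        if ranges and ranges[-1][0] == total and ranges[-1][2] == day:
            # Value unchanged: extend the previous range instead of opening a new one.
            ranges[-1] = (total, ranges[-1][1], day + 1)
        elif total is not None:
            ranges.append((total, day, day + 1))
    return ranges


# Assumed inputs, taken from the example rows for entity A.
events = [(2, 5), (3, 10), (7, 20), (8, 10)]
ranges = sliding_sum_validity(events, window_days=5, start=1, end=9)
for total, valid_from, valid_to in ranges:
    print(f"amount_sum_5d_1d={total} valid from Day {valid_from} to Day {valid_to}")
```

On Day 8 one event with value 10 leaves the window and another with value 10 enters it, so the sum stays at 30 and the range starting on Day 7 is extended rather than split.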

How does this method differ from passing start_time and end_time into get_historical_features() instead of a spine?

The .get_historical_features(start_time, end_time, ...) method on Feature Views and Feature Tables returns feature values as of each event in the raw data between start_time and end_time. This means that feature values are updated when there is a new event in the raw data.

The .get_features_in_range(start_time, end_time) method returns feature values that are valid between start_time and end_time, independently of when events occurred in the raw data. For example, it returns an updated feature value when a previous event exits an aggregation window or expires due to ttl, even if there is no new event in the raw data.

Tecton recommends using .get_features_in_range. In 0.8, this method is in Public Preview and will replace .get_historical_features(start_time, end_time) in a future version of Tecton.

Example Feature Views

For Example 1 and Example 2, we will use the following Feature Views, which compute features based on User Transaction Data.

user_transaction_amount is a Batch Feature View that stores the Transaction Amount as a feature.

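from datetime import datetime, timedelta

from tecton import batch_feature_view


# `transactions` (a batch data source) and `user` (an entity) are assumed to be defined elsewhere.
@batch_feature_view(
    description="User transaction metrics",
    sources=[transactions],
    entities=[user],
    mode="spark_sql",
    batch_schedule=timedelta(days=1),
    feature_start_time=datetime(2010, 1, 1),
)
def user_transaction_amount(transactions):
    return f"""
        SELECT user_id, timestamp, amount
        FROM {transactions}
        """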

user_transaction_amount_agg is a Batch Feature View that uses Tecton-Managed Aggregations to compute the sum and max of the transaction amount over 1, 7, and 30 day time windows.

from datetime import datetime, timedelta

from tecton import Aggregation, batch_feature_view


# `transactions` (a batch data source) and `user` (an entity) are assumed to be defined elsewhere.
@batch_feature_view(
    description="User transaction metrics over 1, 7 and 30 days",
    sources=[transactions],
    entities=[user],
    mode="spark_sql",
    aggregation_interval=timedelta(days=1),
    feature_start_time=datetime(2010, 1, 1),
    aggregations=[
        Aggregation(function="max", column="amount", time_window=timedelta(days=1)),
        Aggregation(function="max", column="amount", time_window=timedelta(days=7)),
        Aggregation(function="max", column="amount", time_window=timedelta(days=30)),
        Aggregation(function="sum", column="amount", time_window=timedelta(days=1)),
        Aggregation(function="sum", column="amount", time_window=timedelta(days=7)),
        Aggregation(function="sum", column="amount", time_window=timedelta(days=30)),
    ],
)
def user_transaction_amount_agg(transactions):
    return f"""
        SELECT user_id, timestamp, amount
        FROM {transactions}
        """

get_features_in_range() provides accurate feature values for a given time range. Each row in the result includes the time period for which the feature values are valid. This lets us observe how feature values change over time, which can be useful for testing, monitoring, and analytical use cases.

# Fetch features for a 1 month period
start = datetime(2021, 1, 1)
end = datetime(2021, 2, 1)
user_transaction_amount_results = user_transaction_amount.get_features_in_range(start_time=start, end_time=end)
user_transaction_amount_agg_results = user_transaction_amount_agg.get_features_in_range(start_time=start, end_time=end)
display(user_transaction_amount_results)

| user_id | amount | _valid_from | _valid_to |
|---|---|---|---|
| user_1 | 28.26 | 2021-01-01T00:00:00 | 2021-01-10T00:00:00 |
| user_1 | 35.44 | 2021-01-10T00:00:00 | 2021-01-15T00:00:00 |
| user_1 | 35.44 | 2021-01-15T00:00:00 | 2021-02-01T00:00:00 |
| user_2 | 42.26 | 2021-01-01T00:00:00 | 2021-01-02T00:00:00 |
| user_2 | 1.13 | 2021-01-02T00:00:00 | 2021-01-28T00:00:00 |
| user_2 | 401.44 | 2021-01-28T00:00:00 | 2021-02-01T00:00:00 |

display(user_transaction_amount_agg_results)

| user_id | amount_max_1d_1d | amount_max_7d_1d | amount_max_30d_1d | amount_sum_1d_1d | amount_sum_7d_1d | amount_sum_30d_1d | _valid_from | _valid_to |
|---|---|---|---|---|---|---|---|---|
| user_1 | 10 | 10 | 10 | 10 | 10 | 10 | 2022-01-01T00:00:00 | 2022-01-02T00:00:00 |
| user_1 | null | 10 | 10 | null | 10 | 10 | 2022-01-02T00:00:00 | 2022-01-08T00:00:00 |
| user_1 | null | null | 10 | null | null | 10 | 2022-01-08T00:00:00 | 2022-01-10T00:00:00 |
| user_1 | 20 | 20 | 20 | 20 | 20 | 30 | 2022-01-10T00:00:00 | 2022-01-11T00:00:00 |
| user_1 | null | 20 | 20 | null | 20 | 30 | 2022-01-11T00:00:00 | 2022-01-15T00:00:00 |
| user_1 | 30 | 30 | 30 | 30 | 50 | 60 | 2022-01-15T00:00:00 | 2022-01-16T00:00:00 |
| user_1 | null | 30 | 30 | null | 50 | 60 | 2022-01-16T00:00:00 | 2022-01-17T00:00:00 |
| user_1 | null | 30 | 30 | null | 30 | 60 | 2022-01-17T00:00:00 | 2022-01-22T00:00:00 |
| user_1 | null | null | 30 | null | null | 60 | 2022-01-22T00:00:00 | 2022-01-31T00:00:00 |
| user_1 | null | null | 30 | null | null | 50 | 2022-01-31T00:00:00 | 2022-02-01T00:00:00 |
| user_2 | 5 | 5 | 5 | 5 | 5 | 5 | 2022-01-01T00:00:00 | 2022-01-02T00:00:00 |
| user_2 | 15 | 15 | 15 | 15 | 20 | 20 | 2022-01-02T00:00:00 | 2022-01-03T00:00:00 |
| user_2 | null | 15 | 15 | null | 20 | 20 | 2022-01-03T00:00:00 | 2022-01-08T00:00:00 |
| user_2 | null | 15 | 15 | null | 15 | 20 | 2022-01-08T00:00:00 | 2022-01-09T00:00:00 |
| user_2 | null | null | 15 | null | null | 20 | 2022-01-09T00:00:00 | 2022-01-28T00:00:00 |
| user_2 | 25 | 25 | 25 | 25 | 25 | 45 | 2022-01-28T00:00:00 | 2022-01-29T00:00:00 |
| user_2 | null | 25 | 25 | null | 25 | 45 | 2022-01-29T00:00:00 | 2022-01-31T00:00:00 |
| user_2 | null | 25 | 25 | null | 25 | 40 | 2022-01-31T00:00:00 | 2022-02-01T00:00:00 |

The results from get_features_in_range() can now be visualized to observe feature trends or tracked for monitoring purposes.

[Figure: Feature Data Graph]
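
One way to prepare such a result for plotting is to expand each validity range into one point per day. The sketch below is plain Python and assumes the rows have already been collected into dictionaries. The helper name `to_daily_series` is hypothetical, and the sample rows are the user_2 rows of the user_transaction_amount result shown earlier.

```python
from datetime import datetime, timedelta


def to_daily_series(rows, value_column):
    """Expand each [_valid_from, _valid_to) row into one (day, value) point per day."""
    points = []
    for row in rows:
        day = row["_valid_from"]
        while day < row["_valid_to"]:
            points.append((day, row[value_column]))
            day += timedelta(days=1)
    return points


# user_2 rows copied from the user_transaction_amount result.
rows = [
    {"user_id": "user_2", "amount": 42.26,
     "_valid_from": datetime(2021, 1, 1), "_valid_to": datetime(2021, 1, 2)},
    {"user_id": "user_2", "amount": 1.13,
     "_valid_from": datetime(2021, 1, 2), "_valid_to": datetime(2021, 1, 28)},
    {"user_id": "user_2", "amount": 401.44,
     "_valid_from": datetime(2021, 1, 28), "_valid_to": datetime(2021, 2, 1)},
]
series = to_daily_series(rows, "amount")
print(len(series))  # 31 points, one per day in January 2021
```

The resulting list of (day, value) pairs can be passed to any plotting or alerting tool. A daily step is used here because the example Feature Views update once per day.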

Example 2: Point-in-time Joins

You can also use the results of get_features_in_range() to implement your own custom point-in-time join.

We can implement this by building a dataframe with join key columns and timestamps for which we would like to fetch values and performing a join operation against the results of get_features_in_range().

Assume you have a table called events with the join keys and timestamps for which we would like to retrieve features.

| user_id | timestamp |
|---|---|
| user_1 | 2022-01-10T12:00:00 |
| user_1 | 2022-01-15T01:00:00 |
| user_2 | 2022-01-02T10:00:00 |
| user_2 | 2022-01-28T20:00:00 |

We can now join this table against the results of get_features_in_range() to get point-in-time accurate feature values.

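# `events` is the "input" Spark DataFrame described above.
# `aggregate_results` is the output of `get_features_in_range()` from Example 1
# as a Spark DataFrame (for example, `user_transaction_amount_agg_results.to_spark()`).
result = (
    events.join(
        aggregate_results,
        (events.timestamp >= aggregate_results._valid_from)
        & (events.timestamp < aggregate_results._valid_to)
        & (events.user_id == aggregate_results.user_id),
        "left",
    )
    # Drop the duplicate join key that comes from the feature side of the join.
    .drop(aggregate_results.user_id)
    .drop("_valid_from", "_valid_to")
)
display(result)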
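
| user_id | timestamp | amount_max_1d_1d | amount_max_7d_1d | amount_max_30d_1d | amount_sum_1d_1d | amount_sum_7d_1d | amount_sum_30d_1d |
|---|---|---|---|---|---|---|---|
| user_1 | 2022-01-10T12:00:00 | 20 | 20 | 20 | 20 | 20 | 30 |
| user_1 | 2022-01-15T01:00:00 | 30 | 30 | 30 | 30 | 50 | 60 |
| user_2 | 2022-01-02T10:00:00 | 15 | 15 | 15 | 15 | 20 | 20 |
| user_2 | 2022-01-28T20:00:00 | 25 | 25 | 25 | 25 | 25 | 45 |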

This operation is functionally equivalent to get_features_for_events().
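
The join condition is easiest to see without a DataFrame engine. The sketch below restates it in plain Python: an event matches the feature row with the same join key whose half-open range [_valid_from, _valid_to) contains the event timestamp. The helper name `point_in_time_lookup` is hypothetical, and the sample rows are a small subset of the aggregate result (only the amount_sum_30d_1d column is kept for brevity).

```python
from datetime import datetime

RANGE_COLUMNS = ("_valid_from", "_valid_to")


def point_in_time_lookup(events, feature_rows, join_key):
    """Attach to each event the feature values that were valid at its timestamp."""
    results = []
    for event in events:
        match = next(
            (
                row
                for row in feature_rows
                if row[join_key] == event[join_key]
                and row["_valid_from"] <= event["timestamp"] < row["_valid_to"]
            ),
            None,
        )
        # Like a left join, events without a matching range are kept (here
        # simply without feature columns).
        features = {} if match is None else {
            k: v for k, v in match.items() if k != join_key and k not in RANGE_COLUMNS
        }
        results.append({**event, **features})
    return results


feature_rows = [
    {"user_id": "user_1", "amount_sum_30d_1d": 30,
     "_valid_from": datetime(2022, 1, 10), "_valid_to": datetime(2022, 1, 11)},
    {"user_id": "user_2", "amount_sum_30d_1d": 20,
     "_valid_from": datetime(2022, 1, 2), "_valid_to": datetime(2022, 1, 3)},
]
events = [
    {"user_id": "user_1", "timestamp": datetime(2022, 1, 10, 12)},
    {"user_id": "user_2", "timestamp": datetime(2022, 1, 2, 10)},
]
joined = point_in_time_lookup(events, feature_rows, join_key="user_id")
for row in joined:
    print(row["user_id"], row["timestamp"], row["amount_sum_30d_1d"])
```

Since validity ranges for the same join keys never overlap, at most one feature row can match each event, so taking the first match is sufficient.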

get_features_for_events(events, ...)

For BatchFeatureView, StreamFeatureView, OnDemandFeatureView, FeatureTable and FeatureService.

Overview

This method is used to retrieve historical feature values by joining an input DataFrame events with the feature data.

The events DataFrame contains all the join key columns and a timestamp column. The join key combinations identify the entities for which we would like to retrieve features. The timestamp column specifies the timestamp to which we would like to time-travel and compute feature values.

For more details on how this method works, see Construct Training Data. This method is functionally equivalent to get_historical_features(spine, ...) and will replace it in a future version of Tecton.

Example

We will reuse the user_transaction_amount_agg Feature View and events DataFrame from the examples above.

# Using the same `events` dataframe described above
result = user_transaction_amount_agg.get_features_for_events(events).to_spark()
display(result)
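
| user_id | timestamp | amount_max_1d_1d | amount_max_7d_1d | amount_max_30d_1d | amount_sum_1d_1d | amount_sum_7d_1d | amount_sum_30d_1d |
|---|---|---|---|---|---|---|---|
| user_1 | 2022-01-10T12:00:00 | 20 | 20 | 20 | 20 | 20 | 30 |
| user_1 | 2022-01-15T01:00:00 | 30 | 30 | 30 | 30 | 50 | 60 |
| user_2 | 2022-01-02T10:00:00 | 15 | 15 | 15 | 15 | 20 | 20 |
| user_2 | 2022-01-28T20:00:00 | 25 | 25 | 25 | 25 | 25 | 45 |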

run_transformation(start_time, end_time, mock_inputs)

For BatchFeatureView, StreamFeatureView and OnDemandFeatureView.

Overview

This is a simpler version of the .run() method that runs a Feature View's transformation and returns the result. It is useful for quickly iterating on and testing a Feature View's transformation logic, since it returns the transformation output before any Tecton-Managed Aggregation is applied.

Examples

Example running an On-Demand Feature View with mock data

from tecton import on_demand_feature_view


# `transaction_request`, `user_transaction_amount_metrics`, and `output_schema`
# are assumed to be defined elsewhere.
@on_demand_feature_view(
    sources=[transaction_request, user_transaction_amount_metrics],
    mode="python",
    schema=output_schema,
    description="The transaction amount is higher than the 1 day average.",
)
def transaction_amount_is_higher_than_average(request, user_metrics):
    return {"higher_than_average": request["amt"] > user_metrics["daily_average"]}

You can retrieve and run the Feature View in a notebook using mock data:

import tecton

fv = tecton.get_workspace("prod").get_feature_view("transaction_amount_is_higher_than_average")
input_data = {"request": {"amt": 100}, "user_metrics": {"daily_average": 1000}}

result = fv.run_transformation(input_data=input_data)
print(result)  # {'higher_than_average': False}
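
Because a transformation in python mode is an ordinary function over dictionaries, its logic can also be exercised without a workspace. The sketch below re-declares the function body without the Tecton decorator and runs it against a couple of mock cases. It is a plain-Python illustration and neither uses nor replaces run_transformation; the `mock_cases` list is hypothetical test data.

```python
def transaction_amount_is_higher_than_average(request, user_metrics):
    # Same body as the On-Demand Feature View above, without the Tecton decorator.
    return {"higher_than_average": request["amt"] > user_metrics["daily_average"]}


# Hypothetical mock inputs: (request, user_metrics, expected feature value).
mock_cases = [
    ({"amt": 100}, {"daily_average": 1000}, False),
    ({"amt": 1500}, {"daily_average": 1000}, True),
]

for request, user_metrics, expected in mock_cases:
    result = transaction_amount_is_higher_than_average(request, user_metrics)
    assert result == {"higher_than_average": expected}
    print(request, user_metrics, result)
```

Keeping the mock inputs in the same shape as the `input_data` dictionary used with run_transformation makes it easy to move cases between the two styles of test.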

Example running a Batch Feature View with mock data

import tecton
import pandas
from datetime import datetime

feature_view = tecton.get_workspace("my_workspace").get_feature_view("my_feature_view")

mock_fraud_user_data = pandas.DataFrame(
    {
        "user_id": ["user_1", "user_2", "user_3"],
        "timestamp": [datetime(2022, 5, 1, 0), datetime(2022, 5, 1, 2), datetime(2022, 5, 1, 5)],
        "credit_card_number": [1000, 4000, 5000],
    }
)

# `fraud_users_batch` is the name of this Feature View's data source parameter.
result = feature_view.run_transformation(
    start_time=datetime(2022, 5, 1),
    end_time=datetime(2022, 5, 2),
    mock_inputs={"fraud_users_batch": mock_fraud_user_data},
)
