Version: Beta 🚧

Secondary Key Aggregations

By default, Tecton's Aggregation Engine groups the raw data by the join keys of a Feature View's entity (or group of entities).

While this approach works effectively when retrieving features for a known set of keys, it is insufficient in scenarios where you need to fetch features for an unknown (and possibly indefinite) set of keys. This situation often arises in use cases such as Recommendation Systems.

With Secondary Key Aggregation, you can instruct Tecton to aggregate not only over a Feature View's entity's join keys, but also over a secondary key. At feature request time, you will only need to specify the entity's join keys.
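
To make the grouping concrete, here is a minimal sketch in plain Python that contrasts aggregating by the entity join key alone with aggregating by the entity join key plus a secondary key. It is an illustration only and not how Tecton's Aggregation Engine is implemented; the event tuples and variable names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw events: (user_id, ad_id, seconds_watched)
events = [
    ("user_1", "ad_1", 1),
    ("user_1", "ad_1", 2),
    ("user_1", "ad_2", 4),
    ("user_2", "ad_3", 5),
]

# Default behavior: one aggregate per entity join key (user_id).
per_user = defaultdict(int)
for user_id, _ad_id, seconds in events:
    per_user[user_id] += seconds

# With a secondary key: one aggregate per (user_id, ad_id) pair,
# stored under the user so that a lookup needs only the user_id.
per_user_per_ad = defaultdict(lambda: defaultdict(int))
for user_id, ad_id, seconds in events:
    per_user_per_ad[user_id][ad_id] += seconds

print(per_user["user_1"])               # 7
print(dict(per_user_per_ad["user_1"]))  # {'ad_1': 3, 'ad_2': 4}
```

In both cases the lookup key is just the user_id; the secondary key only changes how finely the values under that key are broken down.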

Example

Suppose you are modeling a use case that recommends advertisements to show to a given user, and that your data source is a historical event log of ad impressions. You want to develop the following two features:

For a given UserID, how many times have they watched each AdId in the last (1 day, 7 days)?
For a given UserID, what is the total number of seconds they have watched each AdId in the last (1 day, 7 days)?
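
The two features above can be sketched in plain Python for a single user and request time. This is an illustration rather than Tecton's computation: it assumes a window covers events with `request_time - window <= timestamp < request_time`, and the helper name `per_ad_features` and the small impression log are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical impression log: (user_id, ad_id, timestamp, seconds_watched)
impressions = [
    ("user_1", "ad_1", datetime(2022, 5, 14, 0, 0), 1),
    ("user_1", "ad_1", datetime(2022, 5, 14, 12, 0), 2),
    ("user_1", "ad_2", datetime(2022, 5, 12, 9, 0), 4),
    ("user_2", "ad_1", datetime(2022, 5, 14, 8, 0), 9),
]


def per_ad_features(user_id, request_time, window):
    """Return {ad_id: (impression_count, total_seconds)} for one user and window."""
    start = request_time - window
    result = defaultdict(lambda: [0, 0])
    for uid, ad_id, ts, seconds in impressions:
        # Assumed semantics: window start inclusive, request time exclusive.
        if uid == user_id and start <= ts < request_time:
            result[ad_id][0] += 1
            result[ad_id][1] += seconds
    return {ad: tuple(vals) for ad, vals in result.items()}


now = datetime(2022, 5, 15)
print(per_ad_features("user_1", now, timedelta(days=1)))  # {'ad_1': (2, 3)}
print(per_ad_features("user_1", now, timedelta(days=7)))  # {'ad_1': (2, 3), 'ad_2': (1, 4)}
```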

Example mocked data source

from tecton import pandas_batch_config, BatchSource
from datetime import datetime, timedelta


@pandas_batch_config(timestamp_field="timestamp")
def mock_data(context):
    import pandas as pd

    cols = ["user_id", "ad_id", "timestamp", "seconds_watched"]
    data = [
        ["user_1", "ad_1", "2022-05-14 00:00:00", 1],
        ["user_1", "ad_1", "2022-05-14 00:00:00", 1],
        ["user_1", "ad_1", "2022-05-14 12:00:00", 2],
        ["user_1", "ad_1", "2022-05-14 23:59:59", 3],
        ["user_1", "ad_2", "2022-05-15 00:00:00", 4],
        ["user_1", "ad_3", "2022-05-15 12:00:00", 5],
        ["user_1", "ad_4", "2022-05-15 23:59:59", 6],
        ["user_1", "ad_5", "2022-05-16 00:00:00", 7],
        ["user_1", "ad_5", "2022-05-16 12:00:00", 8],
        ["user_1", "ad_5", "2022-05-16 23:59:59", 9],
        ["user_1", "ad_5", "2022-05-17 00:00:00", 10],
        ["user_1", "ad_6", "2022-05-17 00:00:00", 10],
        ["user_1", "ad_7", "2022-05-17 12:00:00", 11],
        ["user_1", "ad_8", "2022-05-17 23:59:59", 12],
        ["user_1", "ad_9", "2022-05-18 00:00:00", 13],
        ["user_1", "ad_9", "2022-05-18 12:00:00", 14],
        ["user_1", "ad_9", "2022-05-18 23:59:59", 15],
        ["user_1", "ad_10", "2022-05-19 00:00:00", 16],
        ["user_1", "ad_11", "2022-05-19 12:00:00", 17],
        ["user_1", "ad_12", "2022-05-19 23:59:59", 18],
        ["user_2", "ad_13", "2022-05-19 23:59:59", 20],
    ]

    df = pd.DataFrame(data, columns=cols)
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # Return the DataFrame; without this the function returns None.
    return df


ds = BatchSource(name="mock_data", batch_config=mock_data)

Example Feature View

from tecton import Entity, batch_feature_view, Aggregation
from tecton.types import Field, String, Timestamp, Int64

user_entity = Entity(name="user", join_keys=["user_id"])

# Leverage Tecton's Secondary Key Aggregations to get per-ad metrics
@batch_feature_view(
    mode="pandas",
    sources=[ds],
    entities=[user_entity],
    aggregation_secondary_key="ad_id",
    aggregation_interval=timedelta(days=1),
    timestamp_field="timestamp",
    offline=True,
    online=True,
    feature_start_time=datetime(2022, 5, 1),
    aggregations=[
        Aggregation(
            column="impression", function="count", time_window=timedelta(days=1), name="impression_count_per_ad_1d"
        ),
        Aggregation(
            column="seconds_watched",
            function="sum",
            time_window=timedelta(days=1),
            name="sum_seconds_watched_per_ad_1d",
        ),
        Aggregation(
            column="impression", function="count", time_window=timedelta(days=7), name="impression_count_per_ad_7d"
        ),
        Aggregation(
            column="seconds_watched",
            function="sum",
            time_window=timedelta(days=7),
            name="sum_seconds_watched_per_ad_7d",
        ),
    ],
    schema=[
        Field("user_id", String),
        Field("ad_id", String),
        Field("timestamp", Timestamp),
        Field("seconds_watched", Int64),
        Field("impression", Int64),
    ],
)
def user_ad_watched_features(input_table):
    input_table["impression"] = 1
    return input_table[["user_id", "ad_id", "timestamp", "seconds_watched", "impression"]]

Pay attention to the aggregation_secondary_key parameter. This parameter instructs Tecton to group the raw data not only by user_id, but also by ad_id.

note

You may wonder why you would not just specify two entities on the Feature View:

user_entity = Entity(name="user", join_keys=["user_id"])
ad_entity = Entity(name="ad", join_keys=["ad_id"])

entities=[user_entity, ad_entity]

The difference is in how you want to retrieve the feature data at request time. If you want to retrieve an aggregation for a (user, ad) tuple, you don't need secondary key aggregates.

If you want to retrieve the aggregations for all ads for a given user, you do want to use secondary key aggregates.
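
The two access patterns can be sketched with plain dictionaries. This only illustrates the shape of the lookup under assumed data; it is not Tecton's storage layout, and the feature names are shortened for the example.

```python
# Two entities: features are keyed by the full (user_id, ad_id) tuple,
# so the caller must already know which ad to ask about.
by_user_and_ad = {
    ("user_1", "ad_1"): {"impression_count_1d": 4},
    ("user_1", "ad_2"): {"impression_count_1d": 1},
}
print(by_user_and_ad[("user_1", "ad_1")])  # {'impression_count_1d': 4}

# Secondary key aggregation: features are keyed by user_id only,
# and the result lists every ad found in the window.
by_user = {
    "user_1": {"ad_id_keys_1d": ["ad_1", "ad_2"], "impression_count_1d": [4, 1]},
}
print(by_user["user_1"]["ad_id_keys_1d"])  # ['ad_1', 'ad_2']
```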

Example Output

At request time, you can now query feature values only by user_id, without having to specify an ad_id.

Tecton will return the aggregations for every single ad_id that the user you specify has interacted with in the specified time window.

import pandas as pd

training_events = pd.DataFrame(
    {
        "user_id": ["user_1", "user_1", "user_2"],
        "timestamp": [datetime(2022, 5, 19), datetime(2022, 5, 15), datetime(2022, 5, 20)],
    }
)

df = user_ad_watched_features.get_historical_features(training_events).to_pandas()
display(df)

Output Format

The output includes a "keys" column for each aggregation window length, containing a list of all secondary keys found in that window. The remaining columns hold the corresponding aggregate feature values for those keys. Together these form a map of keys to values.

If needed, these columns can easily be zipped into a map.

	user_id	timestamp	user_ad_watched_features__ad_id_keys_1d	user_ad_watched_features__ad_id_keys_7d	user_ad_watched_features__impression_count_per_ad_1d	user_ad_watched_features__sum_seconds_watched_per_ad_1d	user_ad_watched_features__impression_count_per_ad_7d	user_ad_watched_features__sum_seconds_watched_per_ad_7d
0	user_1	2022-05-15 00:00:00	['ad_1']	['ad_1']	4	7	[4]	[7]
1	user_1	2022-05-19 00:00:00	['ad_9']	['ad_1' 'ad_2' 'ad_3' 'ad_4' 'ad_5' 'ad_6' 'ad_7' 'ad_8' 'ad_9']	3	42	[4 1 1 1 4 1 1 1 3]	[ 7 4 5 6 34 10 11 12 42]
2	user_2	2022-05-20 00:00:00	['ad_13']	['ad_13']	1	20	[1]	[20]
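
As a minimal sketch of the zipping step mentioned above, the following uses values copied from the 7-day columns of the user_1 row at 2022-05-19. The variable names are illustrative; in practice you would read these lists from the returned DataFrame.

```python
# Values copied from the 7-day columns of the user_1 / 2022-05-19 row.
ad_id_keys_7d = ["ad_1", "ad_2", "ad_3", "ad_4", "ad_5", "ad_6", "ad_7", "ad_8", "ad_9"]
impression_count_per_ad_7d = [4, 1, 1, 1, 4, 1, 1, 1, 3]
sum_seconds_watched_per_ad_7d = [7, 4, 5, 6, 34, 10, 11, 12, 42]

# Pair each ad_id with its aggregate values; position i in every list
# refers to the same ad.
impressions_by_ad = dict(zip(ad_id_keys_7d, impression_count_per_ad_7d))
seconds_by_ad = dict(zip(ad_id_keys_7d, sum_seconds_watched_per_ad_7d))

print(impressions_by_ad["ad_5"])  # 4
print(seconds_by_ad["ad_9"])      # 42
```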
