FAQ: get_historical_features vs. run

Feature Views expose `get_historical_features` and `run` methods.
Method: get_historical_features

`get_historical_features` should be used to compute or retrieve pre-computed offline feature data. This method will always produce accurate feature values for a requested time range or spine. `get_historical_features` will selectively retrieve pre-computed features from the offline store or compute them from raw event data, depending on whether offline materialization is enabled. This can be explicitly overridden using `from_source=True`.

`get_historical_features` can be used for the following workflows:
- Generating historical training data using `get_historical_features(spine=training_events)`, where `training_events` is a dataframe containing historical timestamps for specific entities. This produces feature values as of a particular time for each requested entity, which can be used for model training.
- Generating batch inference data using `get_historical_features(spine=inference_join_keys)`, where `inference_join_keys` is a dataframe containing entities and the current timestamp. This produces the most recent feature data for the requested entities.
- Inspecting offline data for a time range using `get_historical_features(start_time=t1, end_time=t2)`.
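
The sketch below illustrates these three workflows. The workspace name, Feature View name, and spine columns are illustrative assumptions rather than part of this FAQ; adapt them to your own objects and verify the exact call signatures against your SDK version.

```python
from datetime import datetime

import pandas as pd
import tecton

# Assumed names: a "prod" workspace and a "user_transaction_features" Feature View.
ws = tecton.get_workspace("prod")
fv = ws.get_feature_view("user_transaction_features")

# 1) Training data: a spine of historical (entity, timestamp) rows.
training_events = pd.DataFrame(
    {
        "user_id": ["u1", "u2"],
        "timestamp": pd.to_datetime(["2023-01-05", "2023-02-10"]),
    }
)
training_df = fv.get_historical_features(spine=training_events).to_pandas()

# 2) Batch inference data: the same entities, but with the current timestamp.
inference_join_keys = pd.DataFrame(
    {"user_id": ["u1", "u2"], "timestamp": pd.Timestamp.now(tz="UTC")}
)
inference_df = fv.get_historical_features(spine=inference_join_keys).to_pandas()

# 3) Inspect offline data for a time range.
range_df = fv.get_historical_features(
    start_time=datetime(2023, 1, 1), end_time=datetime(2023, 2, 1)
).to_pandas()
```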
Method: run

`run` should only be used when interactively testing or debugging a Feature View. `run` quite literally runs a Feature View transformation. `run` operates on raw event data, but also provides the option to specify mocked data sources.
Do not use `run` to generate training data, since it is not guaranteed to produce accurate feature values.

`test_run` is nearly identical to `run`, but is intended for unit testing: it explicitly requires mocked data sources and a local Spark session, and it does not make any network requests. Most of this document focuses on `run`, but the concepts extend to `test_run`.
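
For illustration, here is a hedged sketch of interactively debugging a Feature View with `run`, with and without a mocked data source. The mock-source keyword argument, the data source name, and the return-type handling are assumptions about the SDK's interface and may differ between versions; consult your SDK reference for the exact signatures of `run` and `test_run`.

```python
from datetime import datetime

import pandas as pd
import tecton

fv = tecton.get_workspace("prod").get_feature_view("user_transaction_features")

# Run the Feature View transformation over one time window against raw event data.
debug_df = fv.run(
    start_time=datetime(2023, 1, 1),
    end_time=datetime(2023, 1, 2),
).to_pandas()

# Optionally substitute a mocked data source. Here the keyword name
# ("transactions") is assumed to match the Feature View's data source name.
mock_transactions = pd.DataFrame(
    {
        "user_id": ["u1"],
        "amount": [42.0],
        "timestamp": [pd.Timestamp("2023-01-01 12:00:00")],
    }
)
mocked_df = fv.run(
    start_time=datetime(2023, 1, 1),
    end_time=datetime(2023, 1, 2),
    transactions=mock_transactions,
).to_pandas()
```

`test_run` follows the same pattern over the same kind of time range, but requires mocks for every data source and executes against a local Spark session without making network requests, which makes it suitable for unit tests.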
Key Concept: get_historical_features' one-to-many relationship with run
Here's another way of considering the differences between the two methods: in order to materialize offline data for a Feature View, the Feature View pipeline is run on a scheduled interval (based on `batch_schedule` or `aggregation_interval`) in a materialization job. **`run` mimics the query that would be run for a single materialization job over some time range.** This is why `run` requires a `start_time` and `end_time`, which should be aligned to one scheduled interval (the SDK will emit warnings if a specified time range does not align with one scheduled interval).
Finally, training data produced by `get_historical_features` is based on the results of one or more materialization job runs.
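
As a hedged illustration of that one-to-many relationship, the sketch below assumes a hypothetical Feature View with a daily `batch_schedule`, so each `run` call covers exactly one day-aligned interval (one materialization job's worth of data), while a single `get_historical_features` call spans several such intervals. The names are assumptions; check the signatures against your SDK version.

```python
from datetime import datetime

import tecton

# Hypothetical Feature View assumed to have a batch_schedule of 1 day.
fv = tecton.get_workspace("prod").get_feature_view("user_transaction_features")

# Each run covers exactly one scheduled interval, mimicking one materialization job.
# A misaligned range (e.g. ending at 2023-01-01 12:00) would trigger an SDK warning.
day1 = fv.run(start_time=datetime(2023, 1, 1), end_time=datetime(2023, 1, 2)).to_pandas()
day2 = fv.run(start_time=datetime(2023, 1, 2), end_time=datetime(2023, 1, 3)).to_pandas()

# One get_historical_features call over the same two days draws on the output of
# both materialization jobs (or recomputes it from raw data with from_source=True).
combined = fv.get_historical_features(
    start_time=datetime(2023, 1, 1),
    end_time=datetime(2023, 1, 3),
).to_pandas()
```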