get_historical_features() Runs Slowly or Fails
Scope
This troubleshooting article covers how to diagnose slow or failing `get_historical_features()` (GHF) calls. Some of the resolutions apply only to Spark-based retrieval, while others apply only to Snowflake-based retrieval.
Symptoms
This issue can manifest via the following symptoms:
- Materialization jobs are cancelled after an hour if you use the default cluster configuration (which relies on spot instances).
- PySpark times out in an EMR notebook (the Livy connection fails).
- Out-of-memory errors.
- Timeout failures in Snowflake or Athena.
Prerequisites for troubleshooting
Review Methods for calling `get_historical_features()`, which also helps you determine whether you are running GHF using pre-materialized feature data (`offline=True`).
Resolution
If you run into a slow `get_historical_features()` call, here are some possible causes, ways to test them, and resolutions. We've sorted them from most to least common, so we suggest investigating these possible causes in order.
Isolating a slow feature view (Not pre-materialized feature data only)
If you are running GHF from a feature service, it may be that only one of your feature views is executing slowly, causing the whole feature service GHF to run slowly. By isolating the slow feature view, you can focus your troubleshooting.
- Testing: Instead of running `<feature_service>.get_historical_features()`, run `<feature_view>.get_historical_features()` for each feature view that is contained in the feature service, as in the sketch below.
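For example, a rough way to time each feature view separately in a notebook. This is a minimal sketch: the workspace name, the feature view names, and `spine_df` are placeholders for objects from your own session, and `from_source` should match how you normally call GHF.

```python
import time
import tecton

ws = tecton.get_workspace("prod")  # placeholder workspace name

# List the feature views contained in the slow feature service (placeholder names).
for fv_name in ["user_transaction_counts", "merchant_fraud_rate"]:
    fv = ws.get_feature_view(fv_name)
    start = time.time()
    # spine_df is the same spine you pass to the feature service's GHF call.
    fv.get_historical_features(spine_df, from_source=True).to_pandas()
    print(f"{fv_name}: {time.time() - start:.1f}s")
```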
Feature view transformation logic (Not pre-materialized)
Your feature view transformation logic may be written in such a way that it performs expensive joins or scans across a large dataset. This can cause GHF to run very slowly or run out of memory.
- Testing:
  - Inspect your transformation logic for joins or expensive reads from large tables.
- Resolution:
  - We recommend simplifying your feature view logic as much as possible to make it clear where you may be doing expensive joins. For complex pipeline transformations, it can be difficult to assess what is happening. You can also use `.explain()` on the resulting dataframe from GHF to inspect the physical plan that Spark will execute and look for inefficiencies (see the sketch after this list).
  - (v0.3 SDK with BatchFeatureViews): If you are using `tecton_sliding_window()` and joining one or more other batch tables, run `tecton_sliding_window()` outside of the join, as it will explode the number of rows.
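A minimal sketch of inspecting the physical plan in a notebook, assuming `fv` and `spine_df` are already defined in your session and that you convert the Tecton DataFrame to a Spark DataFrame with `to_spark()`:

```python
# Run GHF against source data and inspect the plan Spark will execute.
tecton_df = fv.get_historical_features(spine_df, from_source=True)
spark_df = tecton_df.to_spark()  # plain Spark DataFrame

# Look for wide shuffles, cartesian products, or full scans of large tables.
spark_df.explain()
```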
Very large or slow data source
If you are running GHF on non-materialized feature views (`from_source=True`), then you may be running a well-written feature view against a very large and/or slow data source that takes time to process. This will be exacerbated if you have non-optimized Feature View logic. Note that Snowflake and Redshift tend to be faster than Hive and, especially, File data sources.
- Testing:
  - Tecton on Spark: Try replacing your data source with a smaller sample `FileDSConfig` consisting of a single parquet file.
  - Tecton on Snowflake: Try selecting a smaller sample of data from your Snowflake table. You can do this by setting the query param for `SnowflakeConfig` to `SELECT * FROM some_table LIMIT 10`.
- Resolution: If you are not able to speed up the data source, we recommend using the smaller data source mentioned above when developing features in a notebook, as it can significantly speed up Tecton commands while iterating. You can scale up to the larger, production data source when your features are ready. A sketch of a sampled Snowflake source follows this list.
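For illustration, a sampled Snowflake source might look like the sketch below. This assumes a recent Tecton SDK where batch sources are declared with `BatchSource` and `SnowflakeConfig`; the database, schema, warehouse, and timestamp column names are placeholders.

```python
from tecton import BatchSource, SnowflakeConfig

# Sampled source for notebook development only -- the LIMIT keeps GHF fast.
transactions_sampled = BatchSource(
    name="transactions_sampled",
    batch_config=SnowflakeConfig(
        database="MY_DB",             # placeholder
        schema="MY_SCHEMA",           # placeholder
        warehouse="MY_WAREHOUSE",     # placeholder
        query="SELECT * FROM some_table LIMIT 10",
        timestamp_field="TIMESTAMP",  # placeholder timestamp column
    ),
)
```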
Tecton on Snowflake Only: Using a very large Feature Service
If you are running Snowflake-based GHF on a Feature Service with many features, you may be hitting "query length limit" or "compilation memory exhausted" errors from Snowflake. Tecton generates a single SQL string for Snowflake to execute, but in the case of large Feature Services this SQL string may be too large or complex for Snowflake to handle.
- Testing: Check your Snowflake query history to see the reason your GHF call failed. Confirm the failure is not due to a syntax error, but rather a resource issue or an internal Snowflake limit.
- Resolution: Run `tecton.conf.set('QUERYTREE_SHORT_SQL_ENABLED', True)` in your notebook and try running GHF again. Setting this conf to True causes Tecton to break the long SQL string up into multiple queries. A short sketch follows this list.
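A minimal sketch of the retry, assuming the failing call was made from a notebook; the workspace name, feature service name, and `spine_df` are placeholders for your own objects.

```python
import tecton

# Break Tecton's generated SQL into multiple shorter queries.
tecton.conf.set("QUERYTREE_SHORT_SQL_ENABLED", True)

# Re-run the same GHF call that previously failed (placeholder names).
fs = tecton.get_workspace("prod").get_feature_service("my_feature_service")
df = fs.get_historical_features(spine_df).to_pandas()
```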
Tecton on Spark Only: Using a “File” data source (Not pre-materialized)
We include the `FileConfig` data source only for development and testing, as it does not include many of the basic speed improvements that `HiveConfig` includes. For example, it does not understand directory partitioning, and Spark scans each file in the file source to infer the schema of the source. While Tecton will work with a `FileConfig`, it will run slowly if you attempt to use it on a large collection of files.
- Testing: Try changing your `uri` parameter to a single parquet file if it is pointed at a large directory of files.
- Resolution: Add a Glue catalog entry (via a Glue crawler) for this file source, and convert your `FileSource` to a `HiveSource`. Ensure that you specify any file partitions in your `HiveConfig`. A sketch of both configs follows this list.
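For illustration, the before/after might look like the sketch below. This assumes a recent Tecton SDK where batch sources are declared with `BatchSource`; the bucket, file path, database, table, and timestamp column names are placeholders.

```python
from tecton import BatchSource, FileConfig, HiveConfig

# Testing: point the file source at a single parquet file instead of a directory.
transactions_file = BatchSource(
    name="transactions_file_sample",
    batch_config=FileConfig(
        uri="s3://my-bucket/transactions/part-00000.parquet",  # placeholder
        file_format="parquet",
        timestamp_field="timestamp",
    ),
)

# Resolution: after a Glue crawler has cataloged the data, switch to a Hive-backed source.
transactions_hive = BatchSource(
    name="transactions",
    batch_config=HiveConfig(
        database="my_glue_database",  # placeholder Glue database
        table="transactions",         # placeholder Glue table
        timestamp_field="timestamp",
    ),
)
```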
Tecton on Spark Only: Hive partitions not specified (Not pre-materialized)
If you are using a `HiveConfig` data source, Tecton does not assume a partition scheme by default; however, most data lake tables are partitioned by date/time.
- Testing: Check whether you have passed in the date/time partition structure via the `DatetimePartitionColumn` option in your feature repository.
- Resolution: Add the partition columns via the `DatetimePartitionColumn` option. Here is an example.
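A minimal sketch, assuming a recent Tecton SDK and a table laid out as `.../year=2023/month=05/day=01/`; the column names, dateparts, database, table, and timestamp field are placeholders that must match your actual partition layout.

```python
from tecton import BatchSource, DatetimePartitionColumn, HiveConfig

# Describe the table's date/time partition columns so Tecton can prune partitions.
partition_columns = [
    DatetimePartitionColumn(column_name="year", datepart="year", zero_padded=True),
    DatetimePartitionColumn(column_name="month", datepart="month", zero_padded=True),
    DatetimePartitionColumn(column_name="day", datepart="day", zero_padded=True),
]

transactions_source = BatchSource(
    name="transactions",
    batch_config=HiveConfig(
        database="my_glue_database",   # placeholder Glue database
        table="transactions",          # placeholder Glue table
        timestamp_field="timestamp",
        datetime_partition_columns=partition_columns,
    ),
)
```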