tsfresh

functime has rewritten most of the time-series feature extractors from tsfresh in Polars. Approximately 80% of the implementations are optimized lazy queries.

The rest are eager implementations. Overall performance improvements compared to tsfresh range from 5x to 50x. Speedups depend on the size of the input, the feature, and whether common subplan elimination is invoked (i.e. multiple lazy features are collected together). Moreover, windowed / grouped features in functime can be a further 100x faster than tsfresh.

Usage Example

import numpy as np
import polars as pl

from functime.feature_extraction.tsfresh import (
    approximate_entropy,
    benford_correlation,
    binned_entropy,
    c3,
)

sin_x = np.sin(np.arange(120))

# Pass series directly
entropy = approximate_entropy(
    x=pl.Series("ts", sin_x),
    run_length=5,
    filtering_level=0.5,  # must be positive; 0.5 is an illustrative value
)

# Lazy operations
features = (
    pl.LazyFrame({"ts": sin_x})
    .select(
        # approximate_entropy is omitted here: it currently supports Series input only
        benford_correlation=benford_correlation(pl.col("ts")),
        binned_entropy=binned_entropy(pl.col("ts"), bin_count=10),
        c3=c3(pl.col("ts"), n_lags=2),  # n_lags=2 is an illustrative value
    )
    .collect()
)

absolute_energy(x)

Compute the absolute energy of a time series.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

absolute_maximum(x)

Compute the absolute maximum of a time series.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

absolute_sum_of_changes(x)

Compute the absolute sum of changes of a time series.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

approximate_entropy(x, run_length, filtering_level, scale_by_std=True)

Computes the approximate entropy of a time series for a given filtering level. This currently works only for Series input.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
run_length int

Length of the compared run of data. This is m in the Wikipedia article.

required
filtering_level float

Filtering level, must be positive. This is r in the Wikipedia article.

required
scale_by_std bool

Whether to scale the filtering level by the standard deviation of the data. Scaling is the usual choice in most applications, hence the default, but some use cases require an unscaled level.

True

Returns:

Type Description
float

augmented_dickey_fuller(x, n_lags)

Calculates the Augmented Dickey-Fuller (ADF) test statistic. This only works for Series input right now.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
n_lags int

The number of lags to include in the test.

required

Returns:

Type Description
float

autocorrelation(x, n_lags)

Calculate the autocorrelation for a specified lag.

The autocorrelation measures the linear dependence between a time-series and a lagged version of itself.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
n_lags int

The lag at which to calculate the autocorrelation. Must be a non-negative integer.

required

Returns:

Type Description
float | Expr

Autocorrelation at the given lag. Returns None if the lag is less than 0.
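To make the definition concrete, here is a minimal pure-Python sketch of one common autocorrelation estimator (lagged covariance divided by the population variance). The name `autocorrelation_ref` is hypothetical, and this is an assumption about the formula, not functime's Polars implementation, which may normalise differently.

```python
from statistics import fmean


def autocorrelation_ref(x, n_lags):
    """Autocorrelation of x at lag n_lags; None for a negative lag."""
    if n_lags < 0:
        return None
    n = len(x)
    mean = fmean(x)
    # Population variance of the whole series.
    var = sum((v - mean) ** 2 for v in x) / n
    # Average product of deviations for pairs that are n_lags apart.
    cov = sum(
        (x[i] - mean) * (x[i + n_lags] - mean) for i in range(n - n_lags)
    ) / (n - n_lags)
    return cov / var


print(autocorrelation_ref([1, 2, 3, 4, 5], 1))
```

The sketch does not guard against a constant series (zero variance) or a lag that is at least as long as the series; both would raise ZeroDivisionError.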

autoregressive_coefficients(x, n_lags)

Computes coefficients for an AR(n_lags) process. This only works for Series input right now. Caution: any null value in the Series will be replaced by 0!

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
n_lags int

The number of lags in the autoregressive process.

required

Returns:

Type Description
list of float

benford_correlation(x)

Returns the correlation between the first digit distribution of the input time series and the Newcomb-Benford's Law distribution.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

benford_correlation2(x)

Returns the correlation between the first digit distribution of the input time series and the Newcomb-Benford's Law distribution. This version may hit floating-point precision issues for some rare numbers.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

binned_entropy(x, bin_count=10)

Calculates the entropy of a binned histogram for a given time series.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
bin_count int

The number of bins to use in the histogram. Default is 10.

10

Returns:

Type Description
float | Expr
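As an illustration of what "entropy of a binned histogram" means, the following pure-Python sketch splits the value range into equal-width bins and computes the Shannon entropy (natural log) of the bin frequencies. `binned_entropy_ref` is a hypothetical name, and the equal-width binning is an assumption; functime's own binning may differ in edge handling.

```python
import math


def binned_entropy_ref(x, bin_count=10):
    """Shannon entropy of an equal-width histogram of x."""
    lo, hi = min(x), max(x)
    if lo == hi:
        # All values fall in one bin, so there is no uncertainty.
        return 0.0
    width = (hi - lo) / bin_count
    counts = [0] * bin_count
    for v in x:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), bin_count - 1)
        counts[idx] += 1
    n = len(x)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


print(binned_entropy_ref([0, 0, 1, 1], bin_count=2))
```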

c3(x, n_lags)

Measure of non-linearity in the time series using c3 statistics.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
n_lags int

The lag that should be used in the calculation of the feature.

required

Returns:

Type Description
float | Expr
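For reference, the c3 statistic as defined in tsfresh is the mean of the products x[i + 2*lag] * x[i + lag] * x[i]. The sketch below is a plain-Python rendering of that definition; `c3_ref` is a hypothetical name, and returning 0.0 for series that are too short follows tsfresh, which functime may or may not mirror.

```python
def c3_ref(x, n_lags):
    """Mean of x[i + 2*lag] * x[i + lag] * x[i] over all valid i."""
    n = len(x)
    if 2 * n_lags >= n:
        # Not enough points to form a single triple (tsfresh convention).
        return 0.0
    terms = [
        x[i + 2 * n_lags] * x[i + n_lags] * x[i]
        for i in range(n - 2 * n_lags)
    ]
    return sum(terms) / len(terms)


print(c3_ref([1, 2, 3, 4], 1))
```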

change_quantiles(x, q_low, q_high, is_abs)

First fixes a corridor given by the quantiles q_low and q_high of the distribution of x. It then returns a list of changes between consecutive values that both lie within the corridor. The user may optionally take the absolute value of the changes and compute statistics from them. If q_low >= q_high, it returns null.

Parameters:

Name Type Description Default
x Expr | Series

A single time-series.

required
q_low float

The lower quantile of the corridor. Must be less than q_high.

required
q_high float

The upper quantile of the corridor. Must be greater than q_low.

required
is_abs bool

If True, takes absolute difference.

required

Returns:

Type Description
list of float | Expr

cid_ce(x, normalize=False)

Computes an estimate of time-series complexity[^1].

A more complex time series has more peaks and valleys. This feature is calculated as the square root of the sum of squared differences between consecutive values.

Parameters:

Name Type Description Default
x Expr | Series

A single time-series.

required
normalize bool

If True, z-normalizes the time-series before computing the feature. Default is False.

False

Returns:

Type Description
float | Expr
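The complexity estimate from tsfresh (CID-CE) is the square root of the sum of squared consecutive differences. A minimal pure-Python sketch follows; `cid_ce_ref` is a hypothetical name, and using the population standard deviation for z-normalisation is an assumption rather than a statement about functime's implementation.

```python
import math
from statistics import fmean, pstdev


def cid_ce_ref(x, normalize=False):
    """Square root of the sum of squared consecutive differences."""
    if normalize:
        mean, std = fmean(x), pstdev(x)
        if std == 0:
            # A constant series has no complexity.
            return 0.0
        x = [(v - mean) / std for v in x]
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(x, x[1:])))


print(cid_ce_ref([1, 2, 4, 1]))
```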

count_above(x, threshold=0.0)

Calculate the percentage of values above or equal to a threshold.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
threshold float

The threshold value for comparison.

0.0

Returns:

Type Description
float | Expr

count_above_mean(x)

Count the number of values that are above the mean.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
int | Expr

count_below(x, threshold=0.0)

Calculate the percentage of values below or equal to a threshold.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
threshold float

The threshold value for comparison.

0.0

Returns:

Type Description
float | Expr

count_below_mean(x)

Count the number of values that are below the mean.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
int | Expr

cwt_coefficients(x, widths=(2, 5, 10, 20), n_coefficients=14)

Calculates a continuous wavelet transform (CWT) for the Ricker wavelet.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
widths Sequence[int]

The widths of the Ricker wavelet to use for the CWT. Default is (2, 5, 10, 20).

(2, 5, 10, 20)
n_coefficients int

The number of CWT coefficients to return. Default is 14.

14

Returns:

Type Description
list of float

energy_ratios(x, n_chunks=10)

Calculates the sum of squares of each of n_chunks equally sized segments of the time series, expressed as a ratio of the sum of squares over the whole series. E.g. if n_chunks = 10 and the values are [0, 1, 2, 3, ..., 999], the first chunk will be [0, ..., 99].

Parameters:

Name Type Description Default
x list of float

The time-series to be segmented and analyzed.

required
n_chunks int

The number of equally segmented parts to divide the time-series into. Default is 10.

10

Returns:

Type Description
list of float | Expr
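The chunking can be sketched in plain Python as follows. `energy_ratios_ref` is a hypothetical name; the way leftover elements are spread over the first chunks when the length is not divisible by n_chunks is an assumption (it mirrors NumPy's array_split), not a documented functime behaviour.

```python
def energy_ratios_ref(x, n_chunks=10):
    """Energy (sum of squares) of each chunk divided by the total energy."""
    total = sum(v * v for v in x)
    size, extra = divmod(len(x), n_chunks)
    ratios, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional element.
        end = start + size + (1 if i < extra else 0)
        ratios.append(sum(v * v for v in x[start:end]) / total)
        start = end
    return ratios


print(energy_ratios_ref([1, 1, 2, 2], n_chunks=2))
```

An all-zero series has zero total energy and would raise ZeroDivisionError in this sketch.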

fft_coefficients(x)

Calculates Fourier coefficients and phase angles of the 1-D discrete Fourier Transform. This only works for Series input right now.

Parameters:

Name Type Description Default
x Expr | Series

Input time series.

required
n_threads int

Number of threads to use. If None, uses all threads available. Defaults to None.

required

Returns:

Type Description
dict of list of floats | Expr

first_location_of_maximum(x)

Returns the first location of the maximum value of x. The position is calculated relative to the length of x.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

first_location_of_minimum(x)

Returns the first location of the minimum value of x. The position is calculated relative to the length of x.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

fourier_entropy(x, n_bins=10)

Calculate the Fourier entropy of a time series. This only works for Series input right now.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
n_bins int

The number of bins to use for the entropy calculation. Default is 10.

10

Returns:

Type Description
float

friedrich_coefficients(x, polynomial_order=3, n_quantiles=30)

Calculate the Friedrich coefficients of a time series.

Parameters:

Name Type Description Default
x TIME_SERIES_T

The time series to calculate the Friedrich coefficients of.

required
polynomial_order int

The order of the polynomial to fit to the quantile means. Default is 3.

3
n_quantiles int

The number of quantiles to use for the calculation. Default is 30.

30

Returns:

Type Description
list of float

harmonic_mean(x)

Returns the harmonic mean of the time series.

Parameters:

Name Type Description Default
x Expr | Series

Input time series.

required

Returns:

Type Description
float | Expr

has_duplicate(x)

Check if the time-series contains any duplicate values.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
bool | Expr

has_duplicate_max(x)

Check if the time-series contains any duplicate values equal to its maximum value.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
bool | Expr

has_duplicate_min(x)

Check if the time-series contains duplicate values equal to its minimum value.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
bool | Expr

index_mass_quantile(x, q)

Calculates the relative index i of time series x where q% of the mass of x lies left of i. For example, for q = 50%, this feature calculator returns the mass center of the time series.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
q float

The quantile.

required

Returns:

Type Description
float | Expr
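One way to read this definition, following tsfresh, is to accumulate the absolute values of x and report the first position at which the cumulative share reaches q. The sketch below assumes that reading, with q given as a fraction between 0 and 1; `index_mass_quantile_ref` is a hypothetical name.

```python
def index_mass_quantile_ref(x, q):
    """Relative index at which the cumulative |x| mass first reaches q."""
    abs_x = [abs(v) for v in x]
    total = sum(abs_x)
    if total == 0:
        # No mass to distribute, so the quantile index is undefined.
        return None
    running = 0.0
    for i, v in enumerate(abs_x):
        running += v
        if running / total >= q:
            return (i + 1) / len(x)
    return 1.0


print(index_mass_quantile_ref([1, 1, 1, 1], 0.5))
```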

large_standard_deviation(x, ratio=0.25)

Checks if the time-series has a large standard deviation: std(x) > ratio * (max(x) - min(x)).

As a heuristic, the standard deviation should be a fourth of the range of the values.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
ratio float

The ratio of the interval to compare with.

0.25

Returns:

Type Description
bool | Expr
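The check above translates almost directly into code. This is a small pure-Python sketch with a hypothetical name, `large_standard_deviation_ref`; it uses the population standard deviation, which is an assumption, since the degrees of freedom used by functime are not stated here.

```python
from statistics import pstdev


def large_standard_deviation_ref(x, ratio=0.25):
    """True if std(x) exceeds ratio times the range of x."""
    return pstdev(x) > ratio * (max(x) - min(x))


# Two well-separated values: std is half the range.
print(large_standard_deviation_ref([0, 10]))
# One outlier among many identical values: std is small relative to the range.
print(large_standard_deviation_ref([0] * 99 + [10]))
```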

last_location_of_maximum(x)

Returns the last location of the maximum value of x. The position is calculated relative to the length of x.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

last_location_of_minimum(x)

Returns the last location of the minimum value of x. The position is calculated relative to the length of x.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

lempel_ziv_complexity(x, threshold, as_ratio=True)

Calculate a complexity estimate based on the Lempel-Ziv compression algorithm. The implementation here is currently taken from Lilian Besson. See the reference section below. By default (as_ratio=True), the result is the complexity value divided by the length of the input series rather than the raw complexity value.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
threshold Union[float, Expr]

Either a number, or an expression representing a comparable quantity. Values with x > threshold are binarized as 1, and all others as 0. If x is eager, then threshold must also be eager.

required
as_ratio bool

If True, return the complexity divided by the length of the sequence.

True

Returns:

Type Description
float

Reference

https://github.com/Naereen/Lempel-Ziv_Complexity/tree/master

https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv_complexity

linear_trend(x)

Compute the slope, intercept, and RSS of the linear trend.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
Mapping[str, float] | Expr

longest_losing_streak(x)

Returns the longest losing streak of the time series. A loss is counted when (x_t+1 - x_t) <= 0

Parameters:

Name Type Description Default
x Expr | Series

Input time series.

required

Returns:

Type Description
float | Expr

longest_streak_above(x, threshold)

Returns the longest streak of changes >= threshold of the time series. A change is counted when (x_t+1 - x_t) >= threshold. Note that the streaks here are about the changes for consecutive values in the time series, not the individual values.

Parameters:

Name Type Description Default
x Expr | Series

Input time series.

required
threshold float

The threshold value for comparison.

required

Returns:

Type Description
float | Expr

longest_streak_above_mean(x)

Returns the length of the longest consecutive subsequence in x that is > mean of x. If all values in x are null, 0 will be returned. Note: this does not measure consecutive changes in time series, only counts the streak based on the original time series, not the differences.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
int | Expr

longest_streak_below(x, threshold)

Returns the longest streak of changes <= threshold of the time series. A change is counted when (x_t+1 - x_t) <= threshold. Note that the streaks here are about the changes for consecutive values in the time series, not the individual values.

Parameters:

Name Type Description Default
x Expr | Series

Input time series.

required
threshold float

The threshold value for comparison.

required

Returns:

Type Description
float | Expr

longest_streak_below_mean(x)

Returns the length of the longest consecutive subsequence in x that is < mean of x. If all values in x are null, 0 will be returned. Note: this does not measure consecutive changes in time series, only counts the streak based on the original time series, not the differences.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
int | Expr

longest_winning_streak(x)

Returns the longest winning streak of the time series. A win is counted when (x_t+1 - x_t) >= 0

Parameters:

Name Type Description Default
x Expr | Series

Input time series.

required

Returns:

Type Description
float | Expr

max_abs_change(x)

Compute the maximum absolute change from X_t to X_t+1.

Parameters:

Name Type Description Default
x Expr | Series

A single time-series.

required

Returns:

Type Description
float | Expr

mean_abs_change(x)

Compute mean absolute change.

Parameters:

Name Type Description Default
x Expr | Series

A single time-series.

required

Returns:

Type Description
float | Expr

mean_change(x)

Compute mean change.

Parameters:

Name Type Description Default
x Expr | Series

A single time-series.

required

Returns:

Type Description
float | Expr

mean_n_absolute_max(x, n_maxima)

Calculates the arithmetic mean of the n absolute maximum values of the time series.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
n_maxima int

The number of maxima to consider.

required

Returns:

Type Description
float | Expr

mean_second_derivative_central(x)

Returns the mean value of a central approximation of the second derivative.

Parameters:

Name Type Description Default
x Series

A time series to calculate the feature of.

required

Returns:

Type Description
Series

number_crossings(x, crossing_value=0.0)

Calculates the number of crossings of x on m, where m is the crossing value.

A crossing is defined as two sequential values where the first value is lower than m and the next is greater, or vice-versa. If you set m to zero, you will get the number of zero crossings.

Parameters:

Name Type Description Default
x Expr | Series

A single time-series.

required
crossing_value float

The crossing value. Defaults to 0.0.

0.0

Returns:

Type Description
float | Expr
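A crossing can be detected by flagging each value as above or not above m and counting how often the flag flips between neighbours. The sketch below shows this in plain Python; `number_crossings_ref` is a hypothetical name, and treating values exactly equal to m as "not above" is an assumption borrowed from tsfresh.

```python
def number_crossings_ref(x, crossing_value=0.0):
    """Count sign flips of (x > crossing_value) between consecutive values."""
    above = [v > crossing_value for v in x]
    return sum(a != b for a, b in zip(above, above[1:]))


print(number_crossings_ref([1, -1, 1, -1]))
```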

number_cwt_peaks(x, max_width=5)

Number of different peaks in x.

To estimate the number of peaks, x is smoothed by a Ricker wavelet for widths ranging from 1 to max_width. This feature calculator returns the number of peaks that occur at enough width scales and with a sufficiently high signal-to-noise ratio (SNR).

Parameters:

Name Type Description Default
x Series

A single time-series.

required

max_width int

Maximum width to consider.

5

Returns:

Type Description
float

number_peaks(x, support)

Calculates the number of peaks of at least support n in the time series x. A peak of support n is defined as a subsequence of x where a value occurs, which is bigger than its n neighbours to the left and to the right.

Hence in the sequence

x = [3, 0, 0, 4, 0, 0, 13]

4 is a peak of support 1 and 2 because in the subsequences

[0, 4, 0] and [0, 0, 4, 0, 0]

4 is still the highest value. Here, 4 is not a peak of support 3 because 13 is the 3rd neighbour to the right of 4 and it is bigger than 4.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
support int

Support of the peak

required

Returns:

Type Description
int | Expr
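The worked example above can be checked with a short pure-Python sketch of the definition. `number_peaks_ref` is a hypothetical name; positions closer than `support` to either end are skipped because they lack a full set of neighbours, which is an assumption matching tsfresh.

```python
def number_peaks_ref(x, support):
    """Count values strictly larger than `support` neighbours on each side."""
    count = 0
    for i in range(support, len(x) - support):
        left = x[i - support:i]
        right = x[i + 1:i + support + 1]
        if all(x[i] > v for v in left) and all(x[i] > v for v in right):
            count += 1
    return count


x = [3, 0, 0, 4, 0, 0, 13]
for support in (1, 2, 3):
    print(support, number_peaks_ref(x, support))
```

With this input, 4 is counted for support 1 and 2 but not for support 3, in line with the explanation above.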

percent_reoccurring_points(x)

Returns the percentage of non-unique data points in the time series. Non-unique data points are those that occur more than once in the time series.

The percentage is calculated as follows:

# of data points occurring more than once / # of all data points

This means the ratio is normalized to the number of data points in the time series, in contrast to the percent_reoccurring_values function.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float

percent_reoccurring_values(x)

Returns the percentage of values that are present in the time series more than once.

The percentage is calculated as follows:

# (distinct values occurring more than once) / # of distinct values

This means the percentage is normalized to the number of unique values in the time series, in contrast to the percent_reoccurring_points function.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr
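The difference between the two normalisations is easiest to see side by side. The following pure-Python sketch implements both ratios exactly as written in the formulas above; the `_ref` function names are hypothetical and this is not functime's Polars code.

```python
from collections import Counter


def percent_reoccurring_points_ref(x):
    """Share of data points whose value occurs more than once."""
    counts = Counter(x)
    return sum(c for c in counts.values() if c > 1) / len(x)


def percent_reoccurring_values_ref(x):
    """Share of distinct values that occur more than once."""
    counts = Counter(x)
    return sum(1 for c in counts.values() if c > 1) / len(counts)


x = [2, 2, 2, 2, 1]
print(percent_reoccurring_points_ref(x))  # 4 of 5 points repeat
print(percent_reoccurring_values_ref(x))  # 1 of 2 distinct values repeats
```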

permutation_entropy(x, tau=1, n_dims=3, base=math.e)

Computes permutation entropy.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
tau int

The embedding time delay which controls the number of time periods between elements of each of the new column vectors.

1
n_dims int, > 1

The embedding dimension which controls the length of each of the new column vectors

3
base float

The base for log in the entropy computation

e

Returns:

Type Description
float | Expr
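To show how tau and n_dims shape the computation, here is a minimal pure-Python sketch of the standard permutation entropy procedure: build delay-embedded vectors, map each to its ordinal pattern, and take the Shannon entropy of the pattern frequencies. `permutation_entropy_ref` is a hypothetical name, and tie handling (stable sort order) is an assumption.

```python
import math
from collections import Counter


def permutation_entropy_ref(x, tau=1, n_dims=3, base=math.e):
    """Shannon entropy of ordinal patterns of delay-embedded vectors."""
    n_vectors = len(x) - (n_dims - 1) * tau
    patterns = Counter()
    for i in range(n_vectors):
        # Embedded vector: n_dims values spaced tau steps apart.
        window = [x[i + j * tau] for j in range(n_dims)]
        # Ordinal pattern: the index order that sorts the window.
        pattern = tuple(sorted(range(n_dims), key=window.__getitem__))
        patterns[pattern] += 1
    probs = [c / n_vectors for c in patterns.values()]
    return -sum(p * math.log(p, base) for p in probs)


print(permutation_entropy_ref([1, 2, 1, 2, 1], n_dims=2, base=2))
```

A strictly increasing series yields a single pattern and therefore zero entropy; an alternating series with n_dims=2 yields two equally likely patterns.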

range_change(x, percentage=True)

Returns the maximum value range. If percentage is true, will compute (max - min) / min, which only makes sense when x is always positive.

Parameters:

Name Type Description Default
x Expr | Series

Input time series.

required
percentage bool

compute the percentage if set to True

True

Returns:

Type Description
float | Expr

range_count(x, lower, upper, closed='left')

Counts the values of the input that lie between lower and upper. With the default closed='left', lower is inclusive and upper is exclusive.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
lower float

The lower bound, inclusive

required
upper float

The upper bound, exclusive

required
closed ClosedInterval

Whether or not the boundaries should be included/excluded

'left'

Returns:

Type Description
int | Expr
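The effect of the closed argument can be illustrated with a small pure-Python sketch. `range_count_ref` is a hypothetical name, and the accepted values ('left', 'right', 'both', 'none') are assumed to follow Polars' ClosedInterval type.

```python
def range_count_ref(x, lower, upper, closed="left"):
    """Count values of x inside the interval defined by lower and upper."""
    def inside(v):
        # The lower bound is inclusive for 'left' and 'both'.
        lo_ok = v >= lower if closed in ("left", "both") else v > lower
        # The upper bound is inclusive for 'right' and 'both'.
        hi_ok = v <= upper if closed in ("right", "both") else v < upper
        return lo_ok and hi_ok

    return sum(inside(v) for v in x)


x = [1, 2, 3, 4]
for closed in ("left", "right", "both", "none"):
    print(closed, range_count_ref(x, 2, 4, closed=closed))
```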

range_over_mean(x)

Returns the range (max - min) over mean of the time series.

Parameters:

Name Type Description Default
x Expr | Series

Input time series.

required

Returns:

Type Description
float | Expr

ratio_beyond_r_sigma(x, ratio=0.25)

Returns the ratio of values in the series that are more than ratio * std away from the mean, on either side.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required
ratio float

The scaling factor for std

0.25

Returns:

Type Description
float | Expr

ratio_n_unique_to_length(x)

Calculate the ratio of the number of unique values to the length of the time-series.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

root_mean_square(x)

Calculate the root mean square.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

sample_entropy(x, ratio=0.2, m=2)

Calculate the sample entropy of a time series. This only works for Series input right now.

Parameters:

Name Type Description Default
x Expr | Series

The input time series.

required
ratio float

The tolerance parameter. Default is 0.2.

0.2
m int

Length of a run of data. Most common run length is 2.

2

Returns:

Type Description
float | Expr

spkt_welch_density(x, n_coeffs=None)

This estimates the cross power spectral density of the time series x at different frequencies. This only works for Series input right now.

Parameters:

Name Type Description Default
x Expr | Series

The input time series.

required
n_coeffs Optional[int]

The number of coefficients you want to take. If none, will take all, which will be a list as long as the input time series.

None

Returns:

Type Description
list of floats

streak_length_stats(x, above, threshold)

Returns some statistics of the length of the streaks of the time series. Note that the streaks here are about the changes for consecutive values in the time series, not the individual values.

The statistics include: min length, max length, average length, std of length, 10-percentile length, median length, 90-percentile length, and mode of the length. If input is Series, a dictionary will be returned. If input is an expression, the expression will evaluate to a struct with the fields ordered by the statistics.

Parameters:

Name Type Description Default
x Expr | Series

Input time series.

required
above bool

Above (>=) or below (<=) the given threshold

required
threshold float

The threshold for the change (x_t+1 - x_t) to be counted

required

Returns:

Type Description
float | Expr

sum_reoccurring_points(x)

Returns the sum of all data points that are present in the time series more than once.

For example, sum_reoccurring_points(pl.Series([2, 2, 2, 2, 1])) returns 8, as 2 is a reoccurring value, so all 2's are summed up.

This is in contrast to the sum_reoccurring_values function, where each reoccurring value is only counted once.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr

sum_reoccurring_values(x)

Returns the sum of all values that are present in the time series more than once.

For example, sum_reoccurring_values(pl.Series([2, 2, 2, 2, 1])) returns 2, as 2 is a reoccurring value, so it is summed up with all other reoccurring values (there are none), so the result is 2.

This is in contrast to the sum_reoccurring_points function, where each reoccurring value is counted as often as it is present in the data.

Parameters:

Name Type Description Default
x Expr | Series

Input time-series.

required

Returns:

Type Description
float | Expr
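The two sums can be reproduced with a short pure-Python sketch that follows the descriptions above. The `_ref` names are hypothetical, and the code is a reference rendering of the definitions rather than functime's implementation.

```python
from collections import Counter


def sum_reoccurring_points_ref(x):
    """Sum every occurrence of each value that appears more than once."""
    counts = Counter(x)
    return sum(v * c for v, c in counts.items() if c > 1)


def sum_reoccurring_values_ref(x):
    """Sum each value that appears more than once, counted a single time."""
    counts = Counter(x)
    return sum(v for v, c in counts.items() if c > 1)


x = [2, 2, 2, 2, 1]
print(sum_reoccurring_points_ref(x))  # 8
print(sum_reoccurring_values_ref(x))  # 2
```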

symmetry_looking(x, ratio=0.25)

Check if the distribution of x looks symmetric.

A distribution is considered symmetric if: | mean(X)-median(X) | < ratio * (max(X)-min(X))

Parameters:

Name Type Description Default
x Series

Input time-series.

required
ratio float

Multiplier on distance between max and min.

0.25

Returns:

Type Description
bool | Expr
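The symmetry criterion above maps onto a one-line check. The sketch below is plain Python with a hypothetical name, `symmetry_looking_ref`, and is meant only to illustrate the inequality.

```python
from statistics import fmean, median


def symmetry_looking_ref(x, ratio=0.25):
    """True if |mean - median| is below ratio times the range of x."""
    return abs(fmean(x) - median(x)) < ratio * (max(x) - min(x))


print(symmetry_looking_ref([1, 2, 3, 4, 5]))
print(symmetry_looking_ref([0, 0, 0, 0, 100], ratio=0.1))
```

In the second call the mean (20) is far from the median (0) relative to a tenth of the range (10), so the series does not look symmetric.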

time_reversal_asymmetry_statistic(x, n_lags)

Returns the time reversal asymmetry statistic.

Parameters:

Name Type Description Default
x Series

Input time-series.

required
n_lags int

The lag that should be used in the calculation of the feature.

required

Returns:

Type Description
float | Expr
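In tsfresh, this statistic is the mean of x[i + 2*lag]^2 * x[i + lag] - x[i + lag] * x[i]^2 over all valid i. A pure-Python sketch of that definition follows; `time_reversal_asymmetry_statistic_ref` is a hypothetical name, and returning 0.0 for series that are too short is the tsfresh convention, assumed here.

```python
def time_reversal_asymmetry_statistic_ref(x, n_lags):
    """Mean of x[i+2l]**2 * x[i+l] - x[i+l] * x[i]**2 over valid i."""
    n = len(x)
    if 2 * n_lags >= n:
        # Not enough points to form a single term (tsfresh convention).
        return 0.0
    terms = [
        x[i + 2 * n_lags] ** 2 * x[i + n_lags] - x[i + n_lags] * x[i] ** 2
        for i in range(n - 2 * n_lags)
    ]
    return sum(terms) / len(terms)


print(time_reversal_asymmetry_statistic_ref([1, 2, 3, 4], 1))
```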

var_gt_std(x, ddof=1)

Is the variance >= std? In other words, is var >= 1?

Parameters:

Name Type Description Default
x Expr | Series

Input time series.

required
ddof int

Delta Degrees of Freedom used when computing var.

1

Returns:

Type Description
bool | Expr

variation_coefficient(x)

Calculate the coefficient of variation (CV).

Parameters:

Name Type Description Default
x Expr | Series

Input time series.

required

Returns:

Type Description
float | Expr