databrickslabs / tempo

API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc), AS OF joins, downsampling, and interpolation

Home Page: https://pypi.org/project/dbl-tempo

License: Other

Python 9.68% Jupyter Notebook 90.32%

timeseries timeseries-data timeseries-analysis python scala time-series data-science pandas data-analysis

tempo's Introduction

tempo - Time Series Utilities for Data Teams Using Databricks

Project Description

Welcome to Tempo: timeseries manipulation for Spark. This project builds upon the capabilities of PySpark to provide a suite of abstractions and functions that make operations on timeseries data easier and highly scalable.

NOTE that the Scala version of Tempo is now deprecated and no longer in development.

Tempo Project Documentation

tempo's Issues

Rename module to tempo

We should probably do a big search-and-replace from the existing 'tca' name to the new module name 'tempo'

Unit tests for window builder functions

The window builder functions on TSDF should have test cases

Functions to transform into label / feature-vector format for model training & scoring

We need some helper functions that can easily group timeseries and roll them into a tensor/vector format suitable as a feature vector for model training & scoring.

The functions should allow for collection of values / aggregation across 2 dimensions:

group & aggregate across multiple timeseries (multivariate inputs)
build up tensor / vector representations across a window over timeseries
- eg: a 60-day lookback feature tensor across 5 related timeseries (producing 60x5 tensors for each training sample)

This function may be a special case of a more generic rolling-window type aggregation.
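The 60x5 case can be pictured with the following minimal sketch in NumPy. It is not a tempo API: the function name lookback_tensors is hypothetical, and it assumes the related timeseries have already been aligned into one 2D array (rows are time steps, columns are series) that fits in memory.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def lookback_tensors(values: np.ndarray, lookback: int) -> np.ndarray:
    """Return one (lookback x n_series) tensor per complete window.

    `values` has shape (n_steps, n_series). The result has shape
    (n_steps - lookback + 1, lookback, n_series).
    """
    # sliding_window_view appends the window axis last: (n_windows, n_series, lookback)
    windows = sliding_window_view(values, window_shape=lookback, axis=0)
    # Reorder to (n_windows, lookback, n_series)
    return windows.transpose(0, 2, 1)


# 100 days of 5 related timeseries (synthetic data, for illustration only)
series = np.random.default_rng(0).normal(size=(100, 5))
features = lookback_tensors(series, lookback=60)
print(features.shape)
```

Each entry of features is a 60x5 tensor ending at a given day. A distributed version would need the same windowing expressed as a rolling-window aggregation per partition, which is the more generic case mentioned above.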

Refactor test code into unit tests

We should refactor the test code using the unittest framework. This will integrate better with IDEs and CI/CD tools. We should have tests automatically run on PRs in GitHub

Interpolate: Support "limit"

Pandas supports a parameter limit: "Maximum number of consecutive NaNs to fill. Must be greater than 0."

It would be useful if tempo supported something similar. The pandas version is a bit weird (imo) in that it still interpolates up to that number of NaNs and then just stops. So if you have a resampling of 1 min, a gap of 1.5 hours (90 buckets) and a limit of 1 hour (60 buckets), it will interpolate 60 buckets into the gap and leave 30 NaNs. To me, a reasonable implementation of "limit" would skip the whole stretch and interpolate none of the NaNs, since the reasonable interpretation of the limit is something like "if the gap is this large, we have too little information".
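The difference between the two interpretations can be shown in pandas itself. This is a minimal sketch, not tempo code: the helper interpolate_skip_long_gaps is a hypothetical name, and the series is a toy example with one short gap and one gap longer than the limit.

```python
import numpy as np
import pandas as pd


def interpolate_skip_long_gaps(s: pd.Series, limit: int) -> pd.Series:
    """Linearly interpolate NaN runs of at most `limit` values; leave longer runs untouched."""
    is_nan = s.isna().astype(int)
    # Give every run of consecutive NaN / non-NaN values its own id
    run_id = is_nan.diff().ne(0).cumsum()
    # Length of the NaN run each element belongs to (0 for non-NaN runs)
    run_len = is_nan.groupby(run_id).transform("sum")
    filled = s.interpolate(method="linear", limit_area="inside")
    # Put NaN back wherever the original gap was longer than the limit
    return filled.mask((is_nan == 1) & (run_len > limit))


s = pd.Series([0.0, np.nan, 2.0, np.nan, np.nan, np.nan, 6.0])

pandas_style = s.interpolate(limit=2)                 # fills 2 of the 3 NaNs in the long gap
skip_style = interpolate_skip_long_gaps(s, limit=2)   # leaves the long gap entirely NaN

print(pandas_style.tolist())
print(skip_style.tolist())
```

With limit=2, the pandas behaviour fills the first two values of the three-value gap, while the helper fills only the one-value gap and leaves the three-value gap as it was.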

[Brainstorm] Support Waterfall join?

Consider two spark dataframes:
df_a =

timestamp	partition_col	val_col_a
t0	val_id1	val_a1
t2	val_id1	val_a2
t3	val_id1	val_a3

df_b =

timestamp	partition_col	val_col_b
t1	val_id1	val_b1
t4	val_id1	val_b2
t5	val_id1	val_b3

The desired result is
df_ab =

timestamp	partition_col	val_col_a	val_col_b
t0	val_id1	val_a1	null
t1	val_id1	val_a1	val_b1
t2	val_id1	val_a2	val_b1
t3	val_id1	val_a3	val_b1
t4	val_id1	val_a3	val_b2
t5	val_id1	val_a3	val_b3

Effectively, this is like merging two time series, where each row in the result dataframe reflects the latest unioned state of both time series, as of that timestamp.
Is there an existing way in spark to do this efficiently? Or is this something tempo could help solve?
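For reference, the desired result can be reproduced on a small scale with an outer join followed by a per-partition forward fill. The sketch below uses pandas rather than Spark or tempo, and represents t0..t5 as the integers 0..5; it only illustrates the semantics, not an efficient distributed implementation.

```python
import pandas as pd

df_a = pd.DataFrame({
    "timestamp": [0, 2, 3],
    "partition_col": ["val_id1"] * 3,
    "val_col_a": ["val_a1", "val_a2", "val_a3"],
})
df_b = pd.DataFrame({
    "timestamp": [1, 4, 5],
    "partition_col": ["val_id1"] * 3,
    "val_col_b": ["val_b1", "val_b2", "val_b3"],
})

# Union of all timestamps from both series, ordered in time within each partition
df_ab = (
    pd.merge(df_a, df_b, on=["timestamp", "partition_col"], how="outer")
    .sort_values(["partition_col", "timestamp"])
    .reset_index(drop=True)
)

# Carry the latest known value of each column forward within each partition
value_cols = ["val_col_a", "val_col_b"]
df_ab[value_cols] = df_ab.groupby("partition_col")[value_cols].ffill()

print(df_ab)
```

In PySpark the same idea would presumably be a union of the two dataframes followed by last(..., ignorenulls=True) over a window partitioned by partition_col and ordered by timestamp; whether that is efficient enough at scale is exactly the open question here.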

withRangeStats references hard-coded

Noticed something I missed in the review for PR #14 -
there are 2 hardcoded references to the timeseries column name "EVENT_TS":

line 187 in the construction of the Window
line 198 in the filter on the columns to summarize

Implement `show` and other data-preview methods on TSDF

As a convenience, we should have show and other data-preview methods available directly on TSDF

Use logging library

We should use the python logging library throughout our code, and log messages at various levels (DEBUG, ERROR, WARN, etc.) https://docs.python.org/3/library/logging.html

Generalize rolling window calculations

We've already implemented a number of different rolling-window calculations, such as EMA, withLookbackFeatures, withRangeStats. I expect there will be many more in the future.

I think we should refactor the rolling window functions into a more generic and extensible API. In this model, there would be a generic rollApply(...) function on the TSDF class, which would take a specific rolling-window function to apply over the timeseries data.

Examples:

tsdf.rollApply( EMA(...) )
tsdf.rollApply( LookbackFeatures(...) )
tsdf.rollApply( RangeStats(...) )

We would provide collections of such functions, but end-users could also define their own.
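The shape of such an API can be sketched in plain Python. Everything here is hypothetical: roll_apply, RollingMean and this window-truncated EMA are illustrative names operating on a simple list, not the TSDF class or its existing functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# A rolling function maps the values in one window to a single number
RollingFunction = Callable[[Sequence[float]], float]


class RollingMean:
    def __call__(self, window: Sequence[float]) -> float:
        return sum(window) / len(window)


@dataclass
class EMA:
    alpha: float  # weight given to the newest value

    def __call__(self, window: Sequence[float]) -> float:
        # Exponential moving average restarted at the beginning of the window
        ema = window[0]
        for value in window[1:]:
            ema = self.alpha * value + (1 - self.alpha) * ema
        return ema


def roll_apply(values: Sequence[float], size: int, fn: RollingFunction) -> List[float]:
    """Apply `fn` to each trailing window of at most `size` values."""
    return [fn(values[max(0, i - size + 1): i + 1]) for i in range(len(values))]


data = [1.0, 2.0, 3.0, 4.0]
means = roll_apply(data, 2, RollingMean())
emas = roll_apply(data, 3, EMA(alpha=0.5))
maxima = roll_apply(data, 2, max)  # a user-defined function is just another callable
```

The point of the design is that the windowing logic lives in one place and each calculation only has to describe what to do with a single window.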

Standardize parameter names

For example, the asofJoin parameters do not follow the same convention. Fix this across the project.

More flexible constructor functions

We should make the constructor functions more flexible, applying fewer assumptions about the structure of the input Dataframe (and/or configurable to accept other dataframe layouts).

implement in scala

We want to implement the existing code in Scala, and continue core development there, with Python as a wrapper around the Scala implementation

Build script not correctly packaging up module

There seems to be a regression somewhere, and the built wheel files are not correctly loading the tempo module.

ForwardFill for subsampling

Hi,

I have a use case, where I want to regularize irregular sensor data, so that I have a reading for every minute.

Currently I use

tsdf.resample(freq='min', func='ceil', fill = True)

and get this result

I would like to have the previous value instead of the NULL. As a workaround I can now use the ffill operator on the result, but this is a bit inconvenient. Any suggestions?

Greetings

Support Resampled Summary stats

Fourier Transform functionality

Tempo is a time series-focused library.

So we need the ability to transform time-series data sequences to their Fourier equivalents in the frequency domain.
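As an illustration of the requested transformation on a single, evenly sampled series, here is a minimal NumPy sketch. It is not a tempo feature; the sample rate and the 5 Hz test signal are made-up values chosen so the result is easy to check.

```python
import numpy as np

sample_rate_hz = 100                      # assumed sampling frequency of the series
t = np.arange(100) / sample_rate_hz       # one second of evenly spaced samples
signal = np.sin(2 * np.pi * 5 * t)        # synthetic 5 Hz sine wave

# Real-input FFT: complex amplitudes for the non-negative frequencies
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1 / sample_rate_hz)

# The frequency bin with the largest magnitude
dominant_hz = freqs[np.argmax(np.abs(spectrum))]
print(dominant_hz)
```

A timeseries library would additionally have to apply this per partition and make sure the data is resampled to a regular grid first, since the FFT assumes evenly spaced samples.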

Remove joining back columns for narrow dataframes

Define a heuristic for deciding "what is a narrow dataframe?"
Keep columns when dataframe is decided to be narrow, not when wide.

Integrate unit tests into github

We should have github actions that automatically run the unit tests for every new PR

[Question] support metrics of different frequency

Hi,
This is great work. I'm interested in using Delta Lake to store time series data. I work in the autonomous driving field. Currently we use ROS bags and store them in an S3 bucket after each data collection.

A characteristic of autonomous driving data is that data emitted from different devices (say, from different sensors on the car) have different frequencies. Some are 100 Hz while others are 10 Hz. Usually, for analytics, we need to find all the data belonging to one specific collection.

My question is: if we store timestamps at millisecond level, will that be an issue? Any suggestions on how many Delta tables need to be created?
Further, I'm interested in analytics performance if we store billions or even trillions of rows from all the collections in Delta Lake. What would be the recommended way to do that in order to achieve the best performance?

Thanks a lot,

Weide

AsOfJoin with subset of partitioning columns on right side

Hi,

I have a timeseries with sensor readings for many different motors of several machines and a timeseries with the products that run on the respective machine. The sensor readings are partitioned by machine and motor and the products are partitioned by machine only.

Currently I have to transform the tsdf object of the sensor readings back to a normal data frame object and make a new tsdf object that is only partitioned by the machine to make the asof join. I assume this is a common use case, and allowing fewer partition columns on the right side of the asof join would be very convenient.

Cheers,
Martin

Support for structured streaming

Hi team, I just watched a talk from Ricardo and Tristan on Mar 16, 2021. At the end there was a question from the audience about support for streaming. It seemed the answer was not at this time, and Tempo was primarily intended for batch use cases at the time. Does Tempo now support structured streaming? If not, is it on the road map?

refactor as-of join

Break down the internal methods used for the asof join into simpler methods, and make them more readable.
See if some of the timestamp column names can be generalised. Currently I am using 3 different timestamp columns; can this be brought back to 1?

Support higher resample frequency

The smallest supported resample frequency unit is seconds. Luckily it is possible to use "0.001 seconds" to represent e.g. milliseconds, but that is a bit clunky, so it would be nice if at least milliseconds (and maybe microseconds (and nanoseconds?)) were supported as units.
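A sub-second unit table could look roughly like the sketch below. The function freq_to_micros and its unit names are hypothetical and are not taken from tempo's resample code; the sketch only shows that "0.001 seconds" and "1 millisecond" can be normalised to the same integer value.

```python
import re
from fractions import Fraction

# Hypothetical unit table, expressed in whole microseconds
_UNIT_MICROS = {
    "us": 1, "microsecond": 1,
    "ms": 1_000, "millisecond": 1_000,
    "sec": 1_000_000, "second": 1_000_000,
    "min": 60_000_000, "minute": 60_000_000,
    "hr": 3_600_000_000, "hour": 3_600_000_000,
}
_FREQ_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)?\s*([A-Za-z]+)\s*$")


def freq_to_micros(freq: str) -> int:
    """Convert strings such as '5 milliseconds' or '0.001 seconds' to microseconds."""
    match = _FREQ_RE.match(freq)
    if match is None:
        raise ValueError(f"Cannot parse frequency: {freq!r}")
    amount = Fraction(match.group(1) or "1")   # exact arithmetic, no float rounding
    unit = match.group(2).lower()
    if unit not in _UNIT_MICROS and unit.endswith("s"):
        unit = unit[:-1]                       # accept plural forms like 'seconds'
    if unit not in _UNIT_MICROS:
        raise ValueError(f"Unknown frequency unit: {unit!r}")
    micros = amount * _UNIT_MICROS[unit]
    if micros.denominator != 1:
        raise ValueError(f"{freq!r} is finer than one microsecond")
    return int(micros)


print(freq_to_micros("0.001 seconds"), freq_to_micros("1 millisecond"))
```

Nanosecond support would need a finer base unit, and Spark timestamps have microsecond precision, so that part of the request may not be achievable with a plain timestamp column.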

Join Asof dropping null timestamps

I am testing asofJoin and it works great, with just one problem:
rows in the left dataframe with null values in the timestamp column are dropped.
Is there any way of fixing this?
One workaround would be to replace nulls with some dummy value.

left = spark.createDataFrame(
    [
        [dt.datetime(2021, 1, 1, 10, 30), "x", 1],
        [dt.datetime(2021, 1, 1, 10, 30, 10), "x", 2],
        [None, "x", 3],
        [dt.datetime(2021, 1, 1, 10, 40, 10), "x", 3],
    ],
    "ts timestamp, col1 string, col2 int"
)
right = spark.createDataFrame(
    [
        [dt.datetime(2021, 1, 1, 10, 29), "x", "a"],
        [dt.datetime(2021, 1, 1, 10, 40, 20), "x", "b"],
    ],
    "ts timestamp, col1 string, col3 string"
)
left_ts = TSDF(left, ts_col="ts", partition_cols=["col1"])
right_ts = TSDF(right, ts_col="ts", partition_cols=["col1"])
left_ts.asofJoin(right_ts).df.show()
+-------------------+----+----+-------------------+----------+
|                 ts|col1|col2|           right_ts|right_col3|
+-------------------+----+----+-------------------+----------+
|2021-01-01 10:30:00|   x|   1|2021-01-01 10:29:00|         a|
|2021-01-01 10:30:10|   x|   2|2021-01-01 10:29:00|         a|
|2021-01-01 10:40:10|   x|   3|2021-01-01 10:29:00|         a|
+-------------------+----+----+-------------------+----------+

Interpolating without averaging first

This is exactly the same question as this stackoverflow question (named "Pandas timeseries resampling and interpolating together"), except I am wondering if it is solvable with tempo instead of pandas.

Here is the gist of it:
Data looks like this

    tstamp               val
0  2016-09-01 00:00:00  57
1  2016-09-01 00:01:00  57
2  2016-09-01 00:02:23  57
3  2016-09-01 00:03:04  57
4  2016-09-01 00:03:58  58
5  2016-09-01 00:05:00  60

We want to resample to every minute. Notice that 00:04:00 is missing, so interpolation is needed. BUT we want to use the fact that 2 seconds before (00:03:58) the value was 58, and 60 seconds later it is 60, so the value at 00:04:00 should be 58+((2/62)*2) = 58.064516.
So we do not want to first resample with e.g. mean into 1-min buckets and then interpolate between them, we instead want to find the "correct" value (by interpolating the values we have) at every minute point.

The pandas solution is relatively easy:

import pandas as pd
from datetime import datetime

df = pd.DataFrame({"tstamp": [
    datetime(2016, 9, 1, 0, 0, 0),
    datetime(2016, 9, 1, 0, 1, 0),
    datetime(2016, 9, 1, 0, 2, 23),
    datetime(2016, 9, 1, 0, 3, 4),
    datetime(2016, 9, 1, 0, 3, 58),
    datetime(2016, 9, 1, 0, 5, 0)], 
    "val": [57, 57, 57, 57, 58, 60]})


d = df.set_index('tstamp')
t = d.index

r = pd.date_range(t.min(), t.max(), freq='T')

d = d.reindex(t.union(r)).interpolate('index').loc[r]

                           val
2016-09-01 00:00:00  57.000000
2016-09-01 00:01:00  57.000000
2016-09-01 00:02:00  57.000000
2016-09-01 00:03:00  57.000000
2016-09-01 00:04:00  58.064516
2016-09-01 00:05:00  60.000000
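The value at 00:04:00 can be checked by hand with a plain linear interpolation between the two neighbouring readings. The helper interpolate_at is a hypothetical name used only for this check.

```python
from datetime import datetime


def interpolate_at(t: datetime, t0: datetime, v0: float, t1: datetime, v1: float) -> float:
    """Linear interpolation of the value at time t between (t0, v0) and (t1, v1)."""
    fraction = (t - t0).total_seconds() / (t1 - t0).total_seconds()
    return v0 + fraction * (v1 - v0)


value = interpolate_at(
    datetime(2016, 9, 1, 0, 4, 0),          # target minute boundary
    datetime(2016, 9, 1, 0, 3, 58), 58,     # reading 2 seconds before
    datetime(2016, 9, 1, 0, 5, 0), 60,      # reading 60 seconds after
)
print(round(value, 6))  # 58 + (2 / 62) * 2
```

This matches the 58.064516 in the pandas output, which is the per-point behaviour being asked for, as opposed to averaging into buckets first.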

Error on display TSDF

Hey team,

I was following the tutorial from the README with private data (stored as Delta) and got the following error on display:

Command:

data_tsdf = TSDF(data, ts_col="used_at")
display(data_tsdf)

Output:

Exception: Cannot call display(<class 'tempo.tsdf.TSDF'>)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<command-1016310932093716> in <module>
      1 data_tsdf = TSDF(data, ts_col="used_at")
----> 2 display(data_tsdf)

/databricks/python_shell/scripts/PythonShellImpl.py in display(self, input, *args, **kwargs)
   1216             self.displayHTML(input._repr_html_())
   1217         else:
-> 1218             raise Exception(genericErrorMsg)
   1219 
   1220     def displayHTML(self, html):

Exception: Cannot call display(<class 'tempo.tsdf.TSDF'>)
 Call help(display) for more info.

But it works when I run: display(data_tsdf.df)!

Package version: 0.1.1
Databricks cluster: DBR 8.3 | Spark 3.1.1

Thanks for the support and great work!

Refactor partition cols as class-level attribute

Partition columns are currently an argument to most functions. These are really a way of describing the structure of the underlying data frame, and as such really should be a class-level attribute. We should refactor these in this way, as well as providing some internal utility functions to construct appropriate windows based on all factors relevant to the data structure (ts_cols, partition columns, etc.)

Avoid expensive collect calls for skew optimized AsOfJoin

https://github.com/databrickslabs/tempo/blob/master/python/tempo/tsdf.py#L127
When doing asOfJoin for a very wide large table, the collect of the min is actually a bit expensive - caused OOM errors
java.lang.OutOfMemoryError: Java heap space

Since this collect is solely for warning about potential null values when the window size is rather tight, proposal is to make this collect() call configurable.

Two approaches:

1. Add a flag ignore_null_warning to the asOfJoin method that defaults to False, so users who are well aware of the risk of nulls can set the flag to True and tempo can skip the collect calls.
2. Check the logging level for the tempo module, and avoid collect calls when the logging level is above warning. https://docs.python.org/3/howto/logging.html#optimization

I have my own branch that implements approach 1: https://github.com/CTCC1/tempo/pull/1/files
(I will add approach 2 after the holiday too.)
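The logging-level approach could look like the sketch below, using the isEnabledFor check described in the linked logging HOWTO. The logger name "tempo.tsdf" and the function warn_about_nulls are assumptions for illustration; count_nulls stands in for the expensive collect() call.

```python
import logging

logger = logging.getLogger("tempo.tsdf")  # assumed module-level logger name


def warn_about_nulls(count_nulls) -> bool:
    """Run the expensive null check only if a WARNING would actually be emitted.

    Returns True if the check ran, False if it was skipped.
    """
    if not logger.isEnabledFor(logging.WARNING):
        return False
    null_count = count_nulls()
    if null_count > 0:
        logger.warning("%d rows had no values within the lookback window", null_count)
    return True


calls = []


def expensive_count() -> int:
    calls.append("collect")  # stand-in for the Spark collect()
    return 3


logger.setLevel(logging.ERROR)
skipped = warn_about_nulls(expensive_count)   # level above WARNING: no collect
logger.setLevel(logging.WARNING)
ran = warn_about_nulls(expensive_count)       # WARNING enabled: one collect
```

Users who do not care about the warning could then avoid the cost by raising the log level for the tempo logger, without a new method argument.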

Optimization on event_time fails if number of columns>34

Unable to alias left and right tables, causing inability to disambiguate column names in projection

Steps to reproduce: https://demo.cloud.databricks.com/#notebook/7807112/command/7818936

When the left and right tables have the same column name, it's impossible to use a select statement to select that column. There is no way to alias the tables like in a standard join (e.g. select a.columnname from foo a join bar b).

Offer to Connect on APIs & Forecasting Workflows

@rportilla-databricks @tnixon @MaxDBX

This isn't an issue, but an offer to connect - I wasn't sure of the best way to get in touch.

I just watched the YouTube video on tempo after learning about it at the Data & AI Summit, and I'm really excited about this. It was mentioned that you'd be open to connecting on other APIs for future development, as well as incorporating tempo into existing forecasting workflows. For APIs, I'm an R-focused data scientist, so I'd love to connect on an R API. And for forecasting, I'm a heavy user of the modeltime workflow, which basically centralizes various time-based algorithms (arima, exponential smoothing, prophet, xgboost, random forest, etc.) and heavily leverages the tidymodels system (tidymodels is R's version of scikit).

Tempo with LSTM

I have a usage question.

If I have a (2D) time series that I want to use for, e.g., an LSTM model, I first convert it to a 3D array and then pass it to the model. This is normally done in memory with numpy. But what happens when I manage a BIG file with Spark? The solutions I've seen so far all work with Spark and then convert the data to a 3D numpy array at the end, which puts everything in memory... or am I thinking about this wrong?

A common Spark LSTM solution looks like this:

# create fake dataset
import random
import numpy as np  # needed for np.array below
from keras import models
from keras import layers
 
 
 
data = []
for node in range(0,100):
    for day in range(0,100):
        data.append([str(node),
                     day,
                     random.randrange(15, 25, 1),
                     random.randrange(50, 100, 1),
                     random.randrange(1000, 1045, 1)])
        
df = spark.createDataFrame(data,['Node', 'day','Temp','hum','press'])
 
# transform the data
df_trans  = df.groupBy('day').pivot('Node').sum()
df_trans = df_trans.orderBy(['day'], ascending=True)
 
# make train/test data (>= so that day 70 is not dropped from both sets)
trainDF = df_trans[df_trans.day < 70]
testDF = df_trans[df_trans.day >= 70]
 
 
################## we lost the SPARK #############################
# create train/test array
trainArray = np.array(trainDF.select(trainDF.columns).collect())
testArray = np.array(testDF.select(trainDF.columns).collect())
 
# drop the target columns
xtrain = trainArray[:, 0:-1]
xtest = testArray[:, 0:-1]
# take the target column
ytrain = trainArray[:, -1:]
ytest = testArray[:, -1:]
 
# reshape 2D to 3D
xtrain = xtrain.reshape((xtrain.shape[0], 1, xtrain.shape[1]))
xtest = xtest.reshape((xtest.shape[0], 1, xtest.shape[1]))
 
# build the model
model = models.Sequential()
model.add(layers.LSTM(1, input_shape=(1, xtrain.shape[2])))  # feature count taken from the data instead of a hard-coded 400
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
 
 
# train the model
loss = model.fit(xtrain, ytrain, batch_size=10, epochs=100)

My problem with this is: if my Spark data has millions of rows and thousands of columns, then the "create train/test array" step runs out of memory when it tries to collect and transform the data.

How do I use Tempo with LSTM to solve the 2D -> 3D problem?
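Independent of tempo, one common way around the memory problem is to build the 3D windows one batch at a time instead of all at once. The sketch below is only an illustration in NumPy under stated assumptions: window_batches is a hypothetical helper, the last column is treated as the target, and the 2D input is a small in-memory array purely for the demo.

```python
import numpy as np


def window_batches(rows: np.ndarray, lookback: int, batch_size: int):
    """Yield (x, y) batches where x has shape (batch, lookback, n_features).

    `rows` is a 2D array ordered by time; the last column is the target.
    Only one batch of 3D windows exists in memory at any time.
    """
    n_windows = rows.shape[0] - lookback  # each window needs a following target row
    for start in range(0, n_windows, batch_size):
        stop = min(start + batch_size, n_windows)
        x = np.stack([rows[i:i + lookback, :-1] for i in range(start, stop)])
        y = rows[start + lookback:stop + lookback, -1]
        yield x, y


rows = np.arange(60, dtype=float).reshape(20, 3)  # 20 time steps, 2 features + 1 target
batches = list(window_batches(rows, lookback=4, batch_size=5))
print([x.shape for x, _ in batches])
```

Keras models accept Python generators in fit, so batches like these can be fed without ever holding the full 3D array. With Spark, the ordered rows would still have to be streamed to the driver in chunks (for example with DataFrame.toLocalIterator()), which this sketch does not cover.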

Coding style?

Wondering if we should adopt & enforce any coding style rules

Not sure if there is an internal Databricks style. I believe the Spark project has one?

I just saw this article on Palantir's style guide for PySpark:
https://medium.com/palantir/a-pyspark-style-guide-for-real-world-data-scientists-1727fda397e9

Enhancement - Include BYO Function for resampling

Documentation for AsOfJoin is wrong

Examples of the AsOfJoin function in the README show an argument partitionCols that does not exist in this function.
These docs should be rewritten anyway based on what happens in #35

Typo in interpolation code

event_ts is hardcoded here

tempo/python/tempo/interpol.py

Lines 303 to 313 in 9213797

# Generate surrogate timestamps for each target column
# This is required if multuple columns are being interpolated and may contain nulls
add_column_time: DataFrame = time_series_filled
for column in target_cols:
    add_column_time = add_column_time.withColumn(
        f"event_ts_{column}",
        when(col(column).isNull(), None).otherwise(col(ts_col)),
    )
    add_column_time = self.__generate_column_time_fill(
        add_column_time, partition_cols, ts_col, column
    )

But later it is requested as f"{ts_col}_{target_col}"

tempo/python/tempo/interpol.py

Lines 357 to 364 in 9213797

interpolated_result = interpolated_result.drop(
    f"previous_timestamp_{target_col}",
    f"next_timestamp_{target_col}",
    f"previous_{target_col}",
    f"next_{target_col}",
    f"next_null_{target_col}",
    f"{ts_col}_{target_col}",
)

This results in an error if the ts_col is different from event_ts, such as:

AnalysisException: cannot resolve 'timestamp2_value' given input columns: [meta.asset_property_name, event_ts_value, next_timestamp, previous_timestamp, timestamp2, value];

Custom partitioner for underlying Dataframes

Since we understand the structure of the data better than is typical for arbitrary Dataframes, we can apply a hybridized partitioning scheme:

Hash partitioning on partition columns
Range partitioning on time index and ordering columns

In principle this will give us optimal performance of various expensive operations. However, there can be an added cost of repartitioning. We should test the performance improvements of various operations (as-of joins, rolling window calculations, etc.) both with and without such a custom partitioning scheme on a range of data sizes.

partitioned ASOF benchmarking

@rportilla-databricks reported potential sub-par performance of the partitioned asof join on a benchmark time-series dataset that should be investigated.

Notebook: https://field-eng.cloud.databricks.com/#notebook/2404179/command/2404207

comparison with Flint

I am aware of Flint (https://github.com/twosigma/flint) and its internal optimizations for time series (how the data is laid out in memory). How does tempo compare to Flint? Is a similar restructuring performed to allow for fast temporal operations? Or do you rely solely on Delta Lake partitioning & z-ordering?

Add optional prefix for as of join

Currently, when the as-of joins are computed, duplicate fields may end up in the returned schema. Add an option to prefix the returned as of fields.

Example Error:

AnalysisException: Reference 'trade_dt' is ambiguous, could be: trade_dt, trade_dt.;

range stats does not correctly identify columns to summarize

I discovered a few issues in the withRangeStats function:

ignores the provided colsToSummarize argument and auto-discovers columns even when it is provided
will attempt to summarize partition columns, when provided (if not of string type). It should never summarize a partition column
will attempt to summarize various non-string types that result in failures (e.g. timestamp columns)
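The selection rules implied by these three points can be written down as a small pure-Python sketch. It is not the withRangeStats implementation: columns_to_summarize is a hypothetical helper, and dtypes is assumed to have the same (name, type string) shape as a Spark DataFrame's dtypes attribute.

```python
from typing import List, Optional, Sequence, Tuple

# Spark type strings treated as summarizable (decimal is matched by prefix below)
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def columns_to_summarize(
    dtypes: Sequence[Tuple[str, str]],
    ts_col: str,
    partition_cols: Sequence[str],
    cols_to_summarize: Optional[Sequence[str]] = None,
) -> List[str]:
    """Pick the columns to summarize, never the timestamp or a partition column."""
    excluded = {ts_col, *partition_cols}
    if cols_to_summarize is not None:
        # An explicit list always wins over auto-discovery
        return [c for c in cols_to_summarize if c not in excluded]
    return [
        name
        for name, dtype in dtypes
        if name not in excluded
        and (dtype in NUMERIC_TYPES or dtype.startswith("decimal"))
    ]


dtypes = [
    ("symbol", "string"),
    ("event_ts", "timestamp"),
    ("trade_id", "bigint"),
    ("price", "double"),
    ("qty", "int"),
    ("updated_at", "timestamp"),
]
auto = columns_to_summarize(dtypes, "event_ts", ["symbol", "trade_id"])
explicit = columns_to_summarize(dtypes, "event_ts", ["symbol", "trade_id"], ["price"])
```

Here the numeric partition column trade_id and the extra timestamp column updated_at are both skipped during auto-discovery, and an explicit list is honoured as given.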

Functions for missing data / up-scaling interpolation

We should have some functions to help interpolate data to both:

fill in missing data
up-scaling timeseries to finer scales
- eg. going from daily to hourly data

We should support interpolation along time axis under different window constraints (forward-backward looking or trailing only). Also perhaps methods that look across series (ie PCA)

other_cols in init method?

Issue to make sure we have a discussion on whether we want to add subset_columns argument to init method. If we do not I'll create a quick PR to address this. @tnixon @rportilla-databricks

asofJoin extremely slow when result has 99 or more columns

asofJoins that result in a dataframe with >=99 columns seem to cause unnecessary spark operations compared to asofJoins that result in <99 columns. These extraneous operations are associated with a massive increase in runtime. Below is an exported source from a databricks notebook that describes and replicates the issue. With 10,000 rows, the example with 99 columns took more than 10x the time to run compared to the example with 98 columns. Performing asofJoins with larger datasets (billions of rows, hundreds of columns) is pretty much infeasible. Tests were conducted on a databricks cluster with the following: DBR 9.1 LTS ML | Spark 3.1.2 | Scala 2.12.

# Databricks notebook source
# MAGIC %md
# MAGIC 
# MAGIC # Tempo Inefficiency Bug
# MAGIC 
# MAGIC There seems to be a bug where performing an asofJoin with too many columns runs differently under the hood in a way that is more than 10x slower than an asofJoin with fewer columns. This notebook replicates the bug.

# COMMAND ----------

# MAGIC %md
# MAGIC ### Install tempo

# COMMAND ----------

# MAGIC %pip install pip==20.2.4

# COMMAND ----------

# MAGIC %pip install -e git+https://github.com/databrickslabs/tempo.git#"egg=tempo&#subdirectory=python"

# COMMAND ----------

import re

import numpy as np
import pandas as pd

import tempo

# COMMAND ----------

# MAGIC %md
# MAGIC 
# MAGIC ### Define Utility Functions

# COMMAND ----------

def to_tsdf(df):
  return tempo.TSDF(df.withColumnRenamed('packet_seq', 'event_ts'), ts_col="event_ts", partition_cols=["partition_col"])

# COMMAND ----------

def make_df(nrows, ncols, name):
  """
  Utility to make dataframes of arbitrary size.
  We will need partition_col as a partition column and event_ts as the time-series column, mimicking the schema that we typically use.
  """
  mat = np.random.randint(low=0, high=3, size=(nrows, ncols))
  df = pd.DataFrame(mat)
  df = df.rename(columns={0:'event_ts', 1:'partition_col'})
  df = df.rename(columns={x:f"{str(x)}_{name}" for x in range(2,ncols)})
  df = df.sort_values(by=['partition_col', 'event_ts'])
  return spark.createDataFrame(df)

# COMMAND ----------

# MAGIC %md
# MAGIC ### Example 1 (no bug)
# MAGIC We asofJoin a 50-column df with a 49-column df for a result containing 50 + 49 - 1 = 98 columns (the event_ts column is counted twice). Look at the spark ui to see what a normal asofJoin does under the hood.

# COMMAND ----------

# NUM_ROWS = 10  # runs fast and lets you see the difference in spark ui
NUM_ROWS = 10000  # noticeable runtime difference with this many rows

# COMMAND ----------

tsdf1 = to_tsdf(make_df(NUM_ROWS, 50, 'left'))
tsdf2 = to_tsdf(make_df(NUM_ROWS, 49, 'right'))

# COMMAND ----------

tsdf3 = tsdf1.asofJoin(tsdf2)
print(len(tsdf3.df.columns))
df = tsdf3.df.toPandas()

# COMMAND ----------

# MAGIC %md
# MAGIC ### Example 2 (bugged)
# MAGIC Now we try two 50-feature dataframes for a result with 99 columns. The runtime for this will be much longer than the previous asofJoin, despite only having one additional column. (This difference is more noticeable with higher values of NUM_ROWS.)
# MAGIC 
# MAGIC My guess is that tempo hits some threshold of 100 under the hood, at which point something dumb happens. The spark ui should show something quite different compared to before. (Note the giant stack of `Project` and `RunningWindowFunction` blocks in the second spawned job.)

# COMMAND ----------

tsdf1 = to_tsdf(make_df(NUM_ROWS, 50, 'left'))
tsdf2 = to_tsdf(make_df(NUM_ROWS, 50, 'right'))  # we now have one more feature column

# COMMAND ----------

tsdf3 = tsdf1.asofJoin(tsdf2)
print(len(tsdf3.df.columns))
df = tsdf3.df.toPandas()

# COMMAND ----------

# MAGIC %md
# MAGIC 
# MAGIC A potential workaround for a tempo user might be to split the right dataframe column-wise and perform two asofJoins using only the time-series + partition cols of the left dataframe. This would pick the needed rows out of the right dataframe, at which point we could perform normal joins to get the final result.

WARNING:root:Column _c had no values within the lookback window. Consider using a larger window to avoid missing values. If this is the first record in the data frame, this warning can be ignored.

Note the "root" here.

https://docs.python.org/3.9/howto/logging.html#advanced-logging-tutorial
Based on my own experience, instead of logging every line directly using logging.info, using a module-level logger is preferred, such as:

logger = logging.getLogger(__name__)

Then the log lines will be clearer and say something like WARNING:tempo.tsdf, and users can also set the logging level in their code, e.g.

import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger("py4j").setLevel(logging.WARNING)
logging.getLogger("tempo").setLevel(logging.WARNING)

Originally posted by @CTCC1 in #83 (comment)

Add coverage reports to unit tests

We should have coverage reports for our test suites

Tempo Java support

Hi Team,
We want to use tempo on the Java platform. Can you please confirm how to achieve this?
