Running large-window aggregations in Databricks Spark Structured Streaming with a small slide duration

Sergey A. Volkov 21 Reputation points
2022-06-10T08:34:23.28+00:00

I want to run aggregations over large windows (90 days) with a small slide duration (5 minutes).

The straightforward solution leads to a giant state, on the order of hundreds of gigabytes, which doesn't look acceptable.
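To illustrate why the state grows so large: a sliding window assigns each event to every window that overlaps it, so the number of open windows per key scales with the ratio of window length to slide. A rough back-of-the-envelope sketch (the helper name `windows_per_event` is made up for illustration, and it assumes the window is an exact multiple of the slide):

```python
from datetime import timedelta

def windows_per_event(window: timedelta, slide: timedelta) -> int:
    """Number of overlapping sliding windows that contain a single event.

    Assumes `window` is an exact multiple of `slide`.
    """
    # Dividing two timedeltas with // yields a plain int.
    return window // slide

n = windows_per_event(timedelta(days=90), timedelta(minutes=5))
print(n)  # 25920
```

Each key therefore carries on the order of 25,920 partial aggregates at any time, which is where the large state comes from.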

Are there any best practices for doing this?

I am currently considering the following scenarios:

  1. Use flatMapGroupsWithState and implement an EWMA (exponentially weighted moving average) instead of a plain average to reduce state. Is there a good library for EWMA?
  2. Somehow join data from two streams, e.g. a 90-day window with a 1-day slide and a 1-day window with a 5-minute slide.

Any other ideas?

Azure Databricks

Accepted answer
  1. KranthiPakala-MSFT 46,422 Reputation points Microsoft Employee
    2022-06-13T17:18:49.017+00:00

    Hello @Sergey A. Volkov ,

    Thanks for the question and for using the MS Q&A platform.

    Per our discussion with the Databricks team, their public time series library, tempo, has EWMA capabilities: https://github.com/databrickslabs/tempo. If your use case can't be met with what's there today, they suggest posting a request on the repo, as their team is eager to expand its use cases.
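For reference, the reason an EWMA keeps state small is that it needs only the previous value and timestamp per key, regardless of how far back the effective window reaches. The following is a minimal plain-Python sketch of a time-decayed EWMA update; it is not tempo's API, and the names `EwmaState`, `update_ewma`, and the time constant `tau` are illustrative assumptions. A function like this could serve as the per-key update logic inside a stateful operator such as flatMapGroupsWithState.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class EwmaState:
    last_ts: float  # event time of the last update, in seconds
    value: float    # current exponentially weighted moving average

def update_ewma(state: Optional[EwmaState], ts: float, x: float, tau: float) -> EwmaState:
    """Fold one observation (ts, x) into the state.

    tau is the decay time constant in seconds (an assumed parameter):
    the longer the gap since the last update, the more weight x gets.
    """
    if state is None:
        # The first observation initialises the average.
        return EwmaState(ts, x)
    # Late or out-of-order events get dt = 0 and therefore zero weight here.
    dt = max(ts - state.last_ts, 0.0)
    alpha = 1.0 - math.exp(-dt / tau)
    new_value = state.value + alpha * (x - state.value)
    return EwmaState(max(ts, state.last_ts), new_value)
```

The state is two floats per key, compared with one partial aggregate per open window in the sliding-window approach. Note that an EWMA is a different statistic from a 90-day average, so whether the substitution is acceptable depends on the use case.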

    I also broadly agree with your second point: sliding a 90-day window every 5 minutes doesn't make much sense, because the 89 days in the middle of the window don't change from one slide to the next. They can be represented as a single point with a weight.
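One way to read "a single point with a weight" is to keep a (sum, count) pair per bucket: coarse daily buckets for the unchanging middle of the window and fine 5-minute buckets only near the edges. The sketch below shows just the merge arithmetic in plain Python; the names `Partial`, `merge`, and `window_mean` are hypothetical, and how the buckets are produced (for example by two separate streaming aggregations) is left out.

```python
from typing import Iterable, Tuple

Partial = Tuple[float, int]  # (sum of values, number of values) for one bucket

def merge(partials: Iterable[Partial]) -> Partial:
    """Combine partial aggregates; sums and counts are simply additive."""
    total, n = 0.0, 0
    for s, c in partials:
        total += s
        n += c
    return total, n

def window_mean(middle_days: Iterable[Partial], edge_buckets: Iterable[Partial]) -> float:
    """Mean over the full window from daily partials plus 5-minute edge partials."""
    total, n = merge(list(middle_days) + list(edge_buckets))
    return total / n if n else float("nan")

# Two daily buckets plus two 5-minute buckets near the window edge.
print(window_mean([(100.0, 10), (50.0, 5)], [(6.0, 1), (4.0, 4)]))  # 8.0
```

With this layout the state per key is roughly 89 daily partials plus the 5-minute partials for the edge days, rather than one aggregate per 5-minute slide across all 90 days. This works for aggregates that can be merged from partials (sum, count, mean, min, max); it does not apply directly to, say, exact medians.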

    As your question relates more to core Databricks concepts, if you have further questions on this, I recommend reaching out on the Databricks community forum, which is better suited to this topic: https://community.databricks.com/s/

    But feel free to let us know if you have any specific queries related to the Azure side.

    Hope this info helps.
