Upcoming Changes to Cached Datasets

Here’s the tl;dr:

We all know that customer-facing metrics improve the user experience, but that experience can quickly be soured by long loading times. Cached Datasets pre-compute and index data for hundreds or thousands of entities (e.g. users or accounts), making them a great way to improve query efficiency, minimize costs, and enhance the user experience. Next month we’re planning to roll out two changes that should make the Cached Datasets experience easier and more flexible:

  • The limit on the number of sub-timeframes will increase from 500 to 2000, allowing daily datasets going back more than 5 years.
  • New fields in the “GET Dataset definition” response will provide clear feedback if/when any errors are encountered while building the cache data.

Read on for details, and let us know what you think of these changes!

Sub-timeframes Limit

One of the common frustrations we’ve heard from customers is that they want to display daily data going back multiple years, but a `this_500_days` timeframe covers only a little over one year. To address this feedback, we’re raising the limit to 2000 sub-timeframes, which at a daily interval translates to more than 5 years.

While this adds a lot of flexibility, there is an important caveat: each index_by value’s result must still fit within the DynamoDB limit of 400KB per row. More sub-timeframes means more data per row, so you’ll need to take extra care. While it can be difficult to calculate the exact row size ahead of time, here are some general guidelines:

  • For simple Cached Datasets (with no group_by and a scalar analysis type like `count` or `sum`), the result size per sub-timeframe is typically small, so the total row size should be well under the limit.
  • One or more `group_by` properties will greatly increase the result size, because each sub-timeframe has one result per unique group_by value.
  • Other factors such as multi_analysis, select_unique, or the presence of JSON arrays/objects in property values can all increase the result size as well.

We plan to publish some additional guidance on estimating result size as part of our rollout of these changes. Keep an eye on the Cached Dataset documentation.
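
In the meantime, a rough back-of-the-envelope calculation can help you judge whether a dataset definition will stay under the 400KB limit. The sketch below is purely illustrative and assumes hypothetical per-result byte sizes; actual row sizes depend on how results are serialized and how sparse your data is.

# Rough, illustrative sizing math for one Cached Dataset row (one index_by value).
# The byte figures are assumptions for planning only; actual sizes depend on
# serialization and on how sparse the data is.
DYNAMODB_ROW_LIMIT = 400 * 1024   # 400KB per row
SUB_TIMEFRAMES = 2000             # e.g. a daily interval spanning ~5.5 years

# Each sub-timeframe gets roughly this many bytes before the row hits the limit:
print(DYNAMODB_ROW_LIMIT / SUB_TIMEFRAMES)   # ~205 bytes per sub-timeframe

BYTES_PER_SCALAR_RESULT = 50      # assumed: one numeric result plus framing
BYTES_PER_GROUP_RESULT = 40       # assumed: one group label plus its result

def estimate_row_size(sub_timeframes, groups_per_subtimeframe=0):
    """Very rough upper-bound estimate of a cached row's size in bytes."""
    if groups_per_subtimeframe == 0:
        per_subtimeframe = BYTES_PER_SCALAR_RESULT
    else:
        per_subtimeframe = groups_per_subtimeframe * BYTES_PER_GROUP_RESULT
    return sub_timeframes * per_subtimeframe

# A simple daily count over 2000 days stays comfortably under the limit:
print(estimate_row_size(2000))                              # ~100,000 bytes (~98KB)

# The same dataset with a group_by producing ~25 groups per day blows past it:
print(estimate_row_size(2000, groups_per_subtimeframe=25))  # ~2,000,000 bytes (~1.9MB)

If an estimate like this lands anywhere near the limit, consider fewer sub-timeframes, a coarser interval, or fewer group_by values.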

Cached Dataset Status

Another challenge many of you have faced is knowing if/when your Cached Dataset is hitting a limit or edge case and failing to update fully. Example scenarios that can cause updates to fail include:

  • One or more index_by values are exceeding the 400KB DynamoDB row size limit.
  • One or more sub-timeframes are exceeding the maximum number of groups (1M).
  • The queries for one or more sub-timeframes are consistently timing out.

Currently these scenarios, while rare, can result in a Cached Dataset missing data for some index_by values and/or sub-timeframes. This can be hard to discover without careful review. To alleviate this pain point we’re making the bootstrapping process (wherein a newly created Cached Dataset’s cache data is populated asynchronously by our background workers) explicit. We’re also exposing new information in the “GET Dataset definition” operation that will help diagnose bootstrapping errors, as well as provide warnings for Cached Datasets that start encountering errors after bootstrapping has been completed.

Status for New Cached Datasets

When a Cached Dataset is created, it will be initialized with status=Created. Shortly after (typically within a minute), it will be picked up by our scheduler and start its bootstrapping process (status=Bootstrapping), which may last anywhere from a few seconds to multiple hours depending on the amount of data to crunch. Here is an example response showing the new “status” field:

{
  "project_id":"5011efa95f546f2ce2000000",
  "organization_id":"4f3846eaa8438d17fb000001",
  "dataset_name":"count-purchases-gte-100-by-country-daily",
  "display_name":"Count Daily Product Purchases Over $100 by Country",
  "query": { … },
  "index_by":["product.id"],
  "last_scheduled_date":"2016-11-04T18:52:36.323Z",
  "latest_subtimeframe_available":"2016-11-05T00:00:00.000Z",
  "milliseconds_behind": 3600000,
  "status": "Bootstrapping"
}

During the bootstrapping phase, the “GET Dataset results” operation will return 503 SERVICE_UNAVAILABLE and provide an error message indicating that bootstrapping hasn’t finished yet:

{
  "message": "The Cached Dataset hasn't finished bootstrapping yet. The result is not available. Monitor the status using GET Dataset definition."
}

Only after the bootstrapping phase finishes successfully (status=OK) will the Cached Dataset be available for retrieving results. In the rare case of a bootstrapping failure (status=BootstrappingFailed), the “GET Dataset definition” operation provides a descriptive error message, and results for specific index_by values cannot be retrieved from the failed Cached Dataset.
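
If you create Cached Datasets programmatically, you may want to wait for bootstrapping to finish before requesting results. Here’s a minimal polling sketch in Python; the endpoint paths, Authorization header, and poll interval are assumptions for illustration, so check the Dataset API reference for the exact details.

# Minimal sketch: wait for a newly created Cached Dataset to finish bootstrapping
# before fetching results. The URL shapes and auth header below are assumptions;
# check the Dataset API reference for the exact endpoints.
import time
import requests

API_BASE = "https://api.keen.io/3.0"                     # assumed base URL
PROJECT_ID = "5011efa95f546f2ce2000000"
DATASET_NAME = "count-purchases-gte-100-by-country-daily"
READ_KEY = "YOUR_READ_KEY"

definition_url = f"{API_BASE}/projects/{PROJECT_ID}/datasets/{DATASET_NAME}"
headers = {"Authorization": READ_KEY}

# Poll the "GET Dataset definition" operation until the status leaves
# Created/Bootstrapping.
while True:
    definition = requests.get(definition_url, headers=headers).json()
    status = definition.get("status")
    if status not in ("Created", "Bootstrapping"):
        break
    time.sleep(60)  # arbitrary poll interval

if status == "OK":
    # Results are only retrievable once bootstrapping has succeeded.
    results_url = f"{definition_url}/results"            # assumed results path
    params = {"index_by": "some-product-id",             # placeholder index_by value
              "timeframe": "this_30_days"}
    print(requests.get(results_url, headers=headers, params=params).json())
else:
    # e.g. BootstrappingFailed: surface the descriptive error message instead.
    print(f"Cached Dataset is in status {status}: {definition}")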

Status for Existing Cached Datasets

The “GET Dataset definition” operation will also return the new “status” field for existing Cached Datasets, as well as for Cached Datasets that have successfully finished the bootstrapping phase. A status of “OK” indicates that the Cached Dataset did not encounter any errors during internal updates. A status of “Warn” indicates that the Cached Dataset is facing some issues; in this case a descriptive error message will be provided to help diagnose and fix them.

A Cached Dataset that is in the “Warn” state may still be queried, but care should be taken as some data might be incomplete.
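
If you rely on cached results in a customer-facing view, it’s worth checking the status before treating the data as complete. Here’s a tiny sketch reusing the definition_url and headers from the example above; the error_message field name is a placeholder, since the exact field carrying the descriptive message may differ.

# Guard against incomplete data: check the status before trusting cached results.
# "error_message" is a placeholder field name for the descriptive message.
import requests  # definition_url and headers as in the bootstrapping sketch above

definition = requests.get(definition_url, headers=headers).json()
if definition.get("status") == "Warn":
    # Results can still be queried, but some index_by values or sub-timeframes
    # may be missing data -- log it, alert, or fall back to a live query.
    print("Cached Dataset warning:", definition.get("error_message"))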

Fixing Warnings and Errors

If a Cached Dataset fails bootstrapping, or starts reporting warnings, you will need to review and address the underlying errors. This can be done by explicitly excluding certain index_by values and/or sub-timeframes using filters, or by removing group_by properties.
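
For example, if a single very high-volume index_by value keeps pushing its row past the 400KB limit, you could recreate the dataset with a query that filters it out. The sketch below only illustrates the idea; the property names, the excluded id, and the `this_2000_days` timeframe are hypothetical, and the filter format follows standard Keen query filters.

# Sketch of a revised dataset query that excludes a problematic index_by value.
# Property names, the excluded id, and the timeframe are hypothetical examples.
revised_query = {
    "analysis_type": "count",
    "event_collection": "purchases",
    "filters": [
        {"property_name": "price", "operator": "gte", "property_value": 100},
        # Exclude the product whose cached row exceeded the 400KB limit:
        {"property_name": "product.id", "operator": "ne",
         "property_value": "oversized-product-id"},
    ],
    "group_by": ["geo.country"],   # or drop the group_by entirely to shrink rows
    "interval": "daily",
    "timeframe": "this_2000_days",
}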

Summary

We hope these changes make Cached Datasets easier to work with and make building your customer-facing metrics with Keen an awesome experience. If you have any suggestions, feel free to share them on our Canny board.