Cached Datasets

This guide explains what caching is and when you may want to cache a query to speed up your analysis responses. It is the second part of a series; Part One is the guide on Cached Queries.

What is a Cached Dataset?

Keen IO automatically indexes all compute analyses against the property keen.timestamp. This means that any query you run is automatically optimized via the timestamp property.

Cached Datasets is a feature that lets you index by a property other than keen.timestamp, such as one of your own custom properties. In fact, you can specify more than one property to index_by!

Once you define the query and the property to index by as a Dataset, we begin pre-computing and optimizing your query results. Because this pre-computation runs on a regular time interval, you can quickly retrieve results for specific customers, products, campaigns, A/B test experiments, or any other segment you need. You can retrieve specific index values and arbitrary sub-timeframes on demand. So instead of getting one large result that covers every customer and includes dimensions you don’t care about, you receive exactly the results you need for each specific customer’s dashboard.
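
To make the idea concrete, here is a minimal local sketch of pre-computing a count per index value and per daily interval, then looking up one index value over a sub-timeframe. It is only an illustration of the concept, not how Keen IO implements it; the event list, the customer IDs, and the function names are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw events: (day, customer_id) pairs standing in for a collection.
EVENTS = [
    (date(2024, 1, 1), "acme"),
    (date(2024, 1, 1), "acme"),
    (date(2024, 1, 1), "globex"),
    (date(2024, 1, 2), "acme"),
    (date(2024, 1, 3), "globex"),
    (date(2024, 1, 3), "globex"),
]


def precompute(events):
    """Scan every event once and count per (index value, daily interval)."""
    table = defaultdict(lambda: defaultdict(int))
    for day, customer_id in events:
        table[customer_id][day] += 1
    return table


def lookup(table, index_value, start, end):
    """Return the daily counts for one index value within [start, end]."""
    days = table.get(index_value, {})
    return {day: n for day, n in sorted(days.items()) if start <= day <= end}


table = precompute(EVENTS)

# One customer, one sub-timeframe: no other customer's data is returned.
acme_first_two_days = lookup(table, "acme", date(2024, 1, 1), date(2024, 1, 2))
print(acme_first_two_days)
```

The expensive scan happens once in precompute; each dashboard request is then a cheap lookup by index value and date range.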

This flexibility allows you to build a highly interactive and responsive dashboard for each of your customers and gives them the power to explore their own data.

[Figure: Cached Dataset visualization with a slider]

Use Cases for Cached Datasets

Use Cached Datasets to power dashboards or applications that demand sub-second response times. You specify a query, and we regularly pre-compute it across a long timeframe.

Here are some cases where you would want to use a Cached Dataset:

  • You’re building a customer-facing dashboard and want a responsive, interactive experience. Customers will be able to interact with their data via time sliders, date pickers, and drill-downs for specific segments or timeframes, with intervals such as weekly, daily, or minutely.
  • You’re building an internal tool on top of your data to run comparison analysis between particular product, campaign, or headline performance. You’re looking for sub-second performance on data for specific IDs or identifiers.

Pros and Cons of Cached Query vs Cached Dataset

Cached Queries:

  • Supports funnels
  • Doesn’t require interval
  • Retrieving the result for an individual customer_id requires creating a unique cached query for each customer. Because all events must be scanned to return the result for each individual customer, this results in many more properties scanned.

Cached Datasets:

  • Requires “index_by”
  • Requires interval
  • Lets you use the index to retrieve results for specific property values
  • Lets you retrieve subsets of the overall timeframe
  • Optimizes Properties Scanned to minimize compute cost (scan events only once to retrieve the query result for each customer_id)
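
The difference in properties scanned can be sketched with some rough arithmetic. The model below is a deliberate simplification and an assumption on our part, not Keen IO's billing formula: it treats one refresh as one full pass over the events, and the function name and all input numbers are hypothetical.

```python
def properties_scanned_per_refresh(num_events, properties_per_event, num_customers):
    """Compare one refresh cycle under a simplified, assumed cost model."""
    one_pass = num_events * properties_per_event
    return {
        # One cached query per customer: every query scans all events.
        "cached_queries": one_pass * num_customers,
        # One cached dataset indexed by customer_id: events are scanned once.
        "cached_dataset": one_pass,
    }


estimate = properties_scanned_per_refresh(
    num_events=1_000_000,
    properties_per_event=2,
    num_customers=500,
)
print(estimate)
```

Under these assumptions the gap grows linearly with the number of customers, which is why the per-customer approach becomes costly as your customer base grows.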

How to Create a Cached Dataset

To create a Cached Dataset, specify the data you’d like to analyze, the overall timeframe to include, and the interval granularity you want. Most importantly, you will specify an “index_by” field.

Based on your defined “index_by” properties, we will pre-compute and optimize query performance for all existing values of your indexes.

Keep in mind that in addition to the properties you want to “index_by” you can still specify “group_by” on other properties. For example, you might want to index your purchases by product and group them by state or province.
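
As a rough illustration of the purchases example, the sketch below assembles a dataset definition as a plain Python dictionary. The field layout (display_name, query, index_by) reflects our understanding of the API and should be checked against the API Reference Guide before use; the helper name, the collection name, and the property names product.id and shipping.state are hypothetical.

```python
import json


def build_dataset_definition(display_name, event_collection, timeframe,
                             interval, index_by, group_by=None):
    """Assemble a dataset definition; field names are assumptions to verify."""
    query = {
        "analysis_type": "count",
        "event_collection": event_collection,
        "timeframe": timeframe,   # the overall timeframe to pre-compute
        "interval": interval,     # the interval granularity, e.g. "daily"
    }
    if group_by:
        # group_by is optional and independent of index_by.
        query["group_by"] = list(group_by)
    return {
        "display_name": display_name,
        "query": query,
        "index_by": list(index_by),  # one or more properties to index by
    }


definition = build_dataset_definition(
    display_name="Daily purchases by product",
    event_collection="purchases",
    timeframe="this_90_days",
    interval="daily",
    index_by=["product.id"],
    group_by=["shipping.state"],
)
print(json.dumps(definition, indent=2))
```

The sketch only builds and prints the definition; sending it to the API, and the endpoint and keys required to do so, are left to the code samples in the API Reference Guide.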

Ready to try it out? The code samples in our API Reference Guide show how to create a Cached Dataset and how to view any Cached Datasets you’ve already created.