Avatar photo

Keen Cached Datasets: A Primer

I’m going to assume if you’re reading this, you’re interested in cached datasets. Maybe you’re trying to figure out if Keen is the right tool for your data project. Maybe you already use Keen, and are wondering if this will help you get the most out of your implementation. Maybe you’ve just watched the movie Primer, and have stumbled here looking for an explanation of what the heck just happened. In that case, I will refer you (SPOILER ALERT) here.

What are Cached Datasets used for?

Essentially, Cached Datasets are a way to pre-compute data for hundreds or thousands of entities at once. Once setup, you can retrieve results for any one of those entities instantly.

We use it internally for usage reporting, billing, monitoring, and our customer-success dashboard.

Ok, that’s wonderful, but can you give me an example?

Let’s say you are tracking e-book downloads for lots of different authors in a storefront like Amazon. You want to create a report for each author showing how many people are viewing their books, like this:

 

1_p7qaza9k07VUswh9lP6yIA

To create this, you would first need to track events each time someone viewed a book like this:

book_download = {
   title: "On the Pulse of Morning",
   author: "Maya Angelou",
   timestamp: "2017-05-18T21:23:49:000Z"
}

Now, we could create a query for each author, where each one would have a filter for that specific author’s name. Here’s one that would retrieve data for Margaret Atwood’s dashboard.

$ curl https://api.keen.io/3.0/projects/PROJECT_ID/queries/count \
     -H "Authorization: READ_KEY" \
     -H 'Content-Type: application/json' \
     -d "{
         \"event_collection\": \"download_book\",
         \"timeframe\": \"this_10_months\",
         \"filters\": [
           {
             \"property_name\" : \"author\",
             \"operator\" : \"eq\",
             \"property_value\" : \"Margaret Atwood\"
           }
         ],
         \"interval\": \"monthly\",,
         \"group_by\": \"title\",
     }"

But that would require a lot of overhead. For one, you would have to get the results fresh each time an author requested their dashboard, causing them to have to wait for the dashboard to load. Lastly, the administrative cost is high, because you’d have to create a new query anytime you needed a dashboard for a new author.

Enter Cached Datasets! With Cached Datasets, we can show every author a dashboard that lets them see how each of their books are performing over time, all with a single Keen data structure. It will also just stay up to date behind the scenes, including when new authors are added, and scaling to thousands of authors.

That’s pretty neat, how do I set one up?

Step 1: First, define the query. The basics of the query would be simply counting these book_download events. You will also need to provide an indextimeframe, and interval.

  • index: In our example, the index would be author, since that’s the property we want to use to separate the results. (In an actual implementation, you would probably have both the author’s name, and some kind of ID, in which case, you would index on the ID, to keep your data cleaner. But, for the sake of keeping this example simple, we’re just going to use the name.)
  • timeframe: This is going to bound the results you can retrieve from the dataset. It should be as broad as you expect ever needing. (Anything outside this timeframe will never be retrievable out of this dataset). In our example, we’re going to go back 24 months.
  • interval: This defines how you want your data to be bucketed over time (ex: minutely, hourly, daily, monthly, yearly). In our example, we’re going to do monthly.

Step 2: Make an API request to kick off that query. Once you do that, Keen will run the query and update it every hour.

Here’s an example request that would create a dataset based on those book_download events. The group_by is what allows us to separate out the views by book title. The PROJECT_ID and all keys will be provided when you create a Keen account.

# PUT
$ curl https://api.keen.io/3.0/projects/PROJECT_ID/datasets/book-downloads-by-author \
    -H "Authorization: MASTER_KEY" \
    -H 'Content-Type: application/json' \
    -X PUT \
    -d '{
    "display_name": "Book downloads for each author",
    "query": {
        "analysis_type": "count",
        "event_collection" : "book_download",
        "group_by" : "title",
        "timeframe": "this_24_months",
        "interval": "monthly"
    },
    "index_by": ["author"]
}'

Step 3: Get lightning fast results. Now you can instantly get results for any author. For example, here’s the query you would use to retrieve results for Maya Angelou’s dashboard, showing the last 2 months of downloads:

# GET
$ curl https://api.keen.io/3.0/projects/PROJECT_ID/datasets/book-downloads-by-author/results?api_key=READ_KEY&index_by="Maya Angelou"&timeframe=this_2_months

The results you’d get would look like this (This query was run on April 19th, so the most recent “month” is only 19 days long)

{
  "result": [
    {
      "timeframe": {
        "start": "2017-03-01T00:00:00.000Z",
        "end": "2017-04-01T00:00:00.000Z"
      },
      "value": [
        {
          "title": "I Know Why the Caged Bird Sings",
          "result":10564
        },
        {
          "title": "And I Still Rise",
          "result":1823
        },
        {
          "title": "The Heart of a Woman",
          "result":5329
        },
        {
          "title": "On the Pulse of Morning",
          "result":9234
        }
      ]
    },
    {
      "timeframe": {
        "start":"2017-04-01T00:00:00.000Z",
        "end":"2017-04-19T00:00:00.000Z"
      },
      "value": [
        {
          "title": "I Know Why the Caged Bird Sings",
          "result":9184
        },
        {
          "title": "And I Still Rise",
          "result":7395
        },
        {
          "title": "The Heart of a Woman",
          "result":4637
        },
        {
          "title": "On the Pulse of Morning",
          "result":8571
        }
      ]
    }
  ]
}

And that’s it! This query, with a timeframe of this_10_months (and different authors) is exactly what was used (along with keen-dataviz.js) to create those awesome dashboards. Here they are again, in case you forgot:

 

1_p7qaza9k07VUswh9lP6yIA

For more information, and some implementation details, check the docs here.

Lastly, we’re still building out monitoring for this Early Release feature. If you have any questions, or run into any issues while using this feature, please drop us a line.