Compute Pricing Guide

At its core, the Keen Compute API is how you ask questions of the data you have collected. It is priced on a simple pay-as-you-go model. This guide explains how those prices are calculated and demonstrates the cost-saving capabilities of our advanced compute features: Cached Queries and Cached Datasets.

What does “Properties Scanned” mean?

In short, it represents the amount of data we had to process to answer the query you requested. For the rest of this document we’ll use the following definitions:

N: the total number of properties scanned.
E: the number of events within the timeframe you provided.
P: the number of properties per event required to compute the query.

The formula used to calculate the number of properties scanned per query is:

N = E * P

To calculate P for a given query, first find the set of unique properties referenced in the filters, group_by, or target_property parameters. If the query has a timeframe (which almost all should), include keen.timestamp in this set. P is then the cardinality (number of elements) of the set. This means that if you're filtering and grouping by the same property, it only increases P by 1.

Here are some examples of calculating P (all assuming that a timeframe is present):

  Query Definition                                              P
  count collection C                                            1
  count collection C, filter on property A                      2
  sum on property A, filter on property B                       3
  sum on property A, filter on property A, group_by property A  2
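The P calculation above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical parameter names, not a Keen client API:

```python
def properties_scanned_p(target_property=None, filters=(), group_by=(),
                         has_timeframe=True):
    """Estimate P, the number of unique properties a query must read.

    P is the cardinality of the set of referenced properties, so
    filtering and grouping on the same property counts it only once.
    """
    props = set(filters) | set(group_by)
    if target_property is not None:
        props.add(target_property)
    if has_timeframe:
        props.add("keen.timestamp")  # timeframes read keen.timestamp
    return len(props)
```

For example, `properties_scanned_p(target_property="A", filters=["A"], group_by=["A"])` returns 2, matching the last row of the table above.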

Working Example

To illustrate this formula let’s consider an example query: a sum of property x, with a filter on property y being equal (eq) to a specific value, a timeframe of this_90_days, and an interval of daily. Assume that the collection being queried gets a steady 10K events per day. For this query P will be 3 (the properties are x, y, and keen.timestamp) and E will be 900K so N = 900K * 3 = 2.7M. For the sake of example, if the cost is $1 per 10M properties scanned this query would cost $0.27. If this query is powering a KPI in a dashboard that is viewed 20 times per day, or 600 times per month, the monthly cost of that KPI would be 600 * 2.7M = 1.62B properties scanned or $162. Read on to see how caching can help bring that price down.
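The arithmetic in this example can be checked with a short Python sketch (the $1 per 10M price is purely illustrative, as above):

```python
# Worked example: sum of x, filter on y (eq), timeframe this_90_days, daily interval.
events_per_day = 10_000
days = 90

p = 3                       # properties read: x, y, keen.timestamp
e = events_per_day * days   # 900,000 events in the timeframe
n = e * p                   # 2,700,000 properties scanned per execution

price_per_property = 1 / 10_000_000     # illustrative rate: $1 per 10M
cost_per_run = n * price_per_property   # $0.27 per execution
monthly_cost = cost_per_run * 20 * 30   # dashboard viewed 20x/day: $162/month
```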

Why is Keen Compute usage priced based on properties scanned?

You can think of Keen as a columnar database indexed on keen.timestamp.

“Columnar database” means that when we are reading data to evaluate a query, we can read just the subset of properties (or columns) that are relevant to the query and skip the others. All other factors being equal, a query that reads 10 properties requires 5 times as much work as one that reads 2 properties.

“Indexed on keen.timestamp” means that we can efficiently look up the subset of events whose keen.timestamp falls within a given range, also known as a timeframe. All other factors being equal, a query whose timeframe includes 10 million events requires 10 times as much work as one whose timeframe includes 1 million events. Importantly, this is true even if filters on other properties dramatically reduce the number of events that are actually used to compute the result. A query with a timeframe that includes 10 million events and a filter that matches just 10 will still have to read all 10 million.

While these two factors are not the only contributors to how much it costs Keen to execute a query, they tend to dominate in most cases and provide a good approximation. We use these factors to compute your usage in order to align your costs with ours, which incentivizes efficient implementations and allows us to lower costs for everyone.

How are Extractions priced?

Giving you access to your raw data is very important to us at Keen. We provide the ability to extract chunks of your data in CSV or JSON format via our Extraction API.

Extraction pricing follows the same formula described above. P in this case equals the number of properties per event you want to extract. By default this is all of the properties in the schema for the given event collection (similar to a SELECT * SQL query).

You can limit the properties retrieved by an extraction using the property_names parameter. If you only need a small subset of the properties in the schema for your use case then this can result in a large cost savings (and performance improvement).
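To see the effect of property_names on extraction cost, here is a back-of-the-envelope Python sketch; the schema width and event count are made-up numbers:

```python
# Extractions use the same N = E * P formula. P defaults to the full
# schema width unless property_names narrows the extraction.
events_in_timeframe = 1_000_000   # hypothetical E
schema_properties = 40            # hypothetical full schema width

full_extract = events_in_timeframe * schema_properties  # SELECT *-style: 40M
narrow_extract = events_in_timeframe * 3  # property_names with 3 properties: 3M
savings = 1 - narrow_extract / full_extract  # 92.5% fewer properties scanned
```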

How are funnels priced?

Funnels are a powerful analysis tool: they let you follow a cohort's behavior across multiple events.

The calculations for E and P are slightly different for funnels than for other queries:

E is calculated as the sum of the number of events that matched the timeframe of each step in the funnel. For example, if step 1 is over collection foo with a timeframe of this_30_days, step 2 is over collection foo with a timeframe of this_10_days, and step 3 is over collection bar with a timeframe of this_30_days, then E will be equal to:

[# of events in foo in this_30_days]
  + [# of events in foo in this_10_days]
  + [# of events in bar in this_30_days]

P is calculated based on the set of properties that appear in any step: all filters, all actor_id properties, plus keen.timestamp. Note that properties with the same name but in different collections are currently considered to be the same property for the purposes of this calculation.

The total properties scanned is still just computed as N = E * P.
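The funnel rules above can be sketched as follows; the step structure (dicts with events_in_timeframe, filters, and actor_id keys) is an illustration, not the actual API shape:

```python
def funnel_properties_scanned(steps):
    """Estimate N for a funnel: E sums each step's events in its
    timeframe; P is the set of all filter and actor_id properties
    across steps, plus keen.timestamp."""
    e = sum(step["events_in_timeframe"] for step in steps)
    props = {"keen.timestamp"}
    for step in steps:
        props.update(step.get("filters", ()))  # same-named properties
        props.add(step["actor_id"])            # are counted only once
    return e * len(props)
```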

Finding the Cost of a Query at Execution Time

We provide an option to enrich query results with the detailed number of scanned events (E) and properties (P), as well as the total N. Please see the documentation on modifying your query to include these execution details.

How Caching Saves on Query Costs

By default our compute API calculates answers at the time of request. These “ad hoc” queries are great for exploration, but if similar or identical queries are made frequently it can drive up costs. Caching is an effective way to make common queries both cheaper and faster.

Cached Queries

A Cached Query is a query that Keen automatically re-runs on a specified refresh_rate (configurable between 4 and 48 hours). The result is kept in a cache, so retrieving it returns the cached result instead of recomputing the query.

Cached Query pricing is based purely on the queries that Keen runs to update the cache; there is no cost to you for retrieving the cached result.

If a hypothetical dashboard is viewed 100 times per day and its queries are all being calculated from scratch every time, the total properties scanned usage will rise very quickly. If instead the same dashboard uses Cached Queries they will only be calculated once per refresh_rate period, thus reducing the amount of compute that needs to be done. On top of that, the data required to power the dashboard will be served from the cache for increased speed.

To estimate the total monthly properties scanned for a Cached Query, simply compute the properties scanned for a single execution (using the N = E * P formula from above) and then multiply by R = the number of times the query will be run per month. With a refresh_rate of 4 hours and a 30-day month, for example, R will be around 180.
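This estimate is easy to express as a short Python sketch (assuming a 30-day month by default):

```python
def cached_query_monthly_scanned(n_per_run, refresh_rate_hours, days_in_month=30):
    """Monthly properties scanned for a Cached Query: one execution per
    refresh_rate period, regardless of how often the result is read."""
    runs_per_month = (24 * days_in_month) / refresh_rate_hours
    return n_per_run * runs_per_month
```

With the working example's 2.7M properties scanned per run and a 4-hour refresh_rate, this gives 486M per month.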

Working Example, revisited

In our example above we considered a query run 20 times per day that would generate 1.62B or $162 worth of properties scanned usage per month. If that same query was migrated to a Cached Query with a refresh_rate of 4 hours, it would generate 2.7M * 180 = 486M properties scanned per month, or $48.60, a 70% savings. The more frequently the query result is retrieved, the bigger the savings of a Cached Query over ad hoc. Conversely, if the ad hoc query is only run once or twice per day, then Cached Queries are probably not a good cost-saving opportunity.

Cached Datasets

Cached Datasets allow you to precompute results for a query for every value of an index_by property (or combination of properties) and then quickly look up the result of that query for a specific value.

Like Cached Queries, Cached Datasets are priced based on the queries that Keen runs in the background to build and update the cache. There is no cost for retrieving results. To estimate how much a Cached Dataset will cost per month, we need to understand a little more about how it is updated.

Under the hood Keen logically stores the cached dataset as a matrix whose rows are the values of the index_by property and whose columns are the intervals. Each cell in the matrix represents the result of the query for that index_by value in that interval. Once per hour Keen checks which columns are due to be refreshed by finding (a) any columns that have never been computed before; (b) the column that contains the current time, if any; and (c) the column that contains the time 48 hours ago, if any. (This 48-hour "trailing update" is there to catch any late-arriving data.) Note that (b) and (c) may be the same column, e.g. because the Cached Dataset has a monthly interval and it is the middle of the month.

To update a column Keen runs a query that is similar to the one in the Cached Dataset definition, but it is modified in the following ways:

  • The interval is removed.
  • The timeframe is set to be the absolute boundaries of the interval corresponding to the column being updated. For example if the Cached Dataset has a daily interval and we are updating the column for the current day, then the timeframe for the update query will be set to something like {"start":"2019-01-01T00:00:00Z","end":"2019-01-02T00:00:00Z"}.
  • The index_by property (or properties) are added to the list of group_by properties.

Once this query completes it is parsed into individual results for each index_by value and the appropriate cells in the matrix are updated.
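The three modifications can be sketched as a transformation on a query definition. The dict shape here is illustrative, not the exact Cached Dataset API:

```python
def column_update_query(definition, column_start, column_end):
    """Derive a column-update query from a Cached Dataset definition:
    drop the interval, pin the timeframe to the column's absolute
    boundaries, and fold index_by into group_by."""
    query = dict(definition)  # shallow copy; leave the definition intact
    query.pop("interval", None)
    query["timeframe"] = {"start": column_start, "end": column_end}
    index_by = query.pop("index_by", [])
    query["group_by"] = list(query.get("group_by", [])) + list(index_by)
    return query
```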

With that background in mind, the monthly cost of a Cached Dataset can be estimated as follows. First, compute the cost of a single column update query. To do this, follow the same N = E * P formula as before, using the typical number of events per unit of the Cached Dataset's interval for E and computing P including the index_by properties. Then, based on the interval and timeframe being used, look up how many column update queries will be run per month below (a 30-day month is assumed for simplicity):

  • minutely, this_N_minutes: 60 * 24 * 30 = 43200 updates per month. Each hour the latest ~60 minutes are computed. (Note: 48 hours = 2880 minutes > 500, so the "trailing update" will always be outside of the timeframe.)
  • minutely, previous_N_minutes: 60 * 24 * 30 = 43200. Same as this_N_minutes.
  • hourly, this_N_hours: 2 * 24 * 30 = 1440. Each hour is computed twice: once when it first enters the timeframe and once by the "trailing update" pass.
  • hourly, previous_N_hours: 2 * 24 * 30 = 1440. Same as this_N_hours.
  • daily, this_N_days: 2 * 24 * 30 = 1440. Each hour the current day and the day before yesterday are computed.
  • daily, previous_N_days: 30 + (24 * 30) = 750. Each day is computed once within the hour after it ends, then again every hour during the second day afterwards by the "trailing update" pass.
  • weekly, this_N_weeks: (24 * 30) + (4 * 48) = 912. Each hour the current week is computed. For the first 48 hours of a week, the previous week is also computed by the "trailing update" pass. (Note: some months will compute 5 weeks instead of 4.)
  • weekly, previous_N_weeks: 4 * 48 = 192. For the first 48 hours of a week, the previous week is computed by the "trailing update" pass. (Same 5-week caveat as above.)
  • monthly, this_N_months: (24 * 30) + 48 = 768. Each hour the current month is computed. For the first 48 hours of a month, the previous month is also computed by the "trailing update" pass.
  • monthly, previous_N_months: 48. For the first 48 hours of a month, the previous month is computed by the "trailing update" pass.
  • yearly, this_N_years: 24 * 30 = 720. Each hour the current year is computed. (Note: the previous year is also recomputed during the first 48 hours of the year, so January will be slightly higher.)
  • yearly, previous_N_years: 48 in January; 0 thereafter. The previous year is re-evaluated for the first 48 hours of January; after that no updates are necessary.

Then multiply the cost per update query by the number of column updates per month to get a rough estimate of total properties scanned usage. Note that for this_N_* timeframes with longer intervals (daily and up) this will usually be an over-estimate, because the column update query for an in-progress interval reads less data than one for an already-finished interval. For example, when first updating the column for the current day at a daily interval, there will only be 1/24th of a day's worth of events to scan. A precise estimate would need to account for this, but in most cases even the conservative estimate is good enough.
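The per-interval counts above can be captured in a small Python helper; the numbers assume a 30-day month, and previous_N_years is omitted because it only updates in January:

```python
# Approximate column updates per 30-day month, transcribed from above.
COLUMN_UPDATES_PER_MONTH = {
    ("minutely", "this"): 43_200, ("minutely", "previous"): 43_200,
    ("hourly",   "this"): 1_440,  ("hourly",   "previous"): 1_440,
    ("daily",    "this"): 1_440,  ("daily",    "previous"): 750,
    ("weekly",   "this"): 912,    ("weekly",   "previous"): 192,
    ("monthly",  "this"): 768,    ("monthly",  "previous"): 48,
    ("yearly",   "this"): 720,    # previous_N_years updates only in January
}

def cached_dataset_monthly_scanned(n_per_update, interval, timeframe_kind):
    """Conservative monthly estimate: properties scanned per column
    update times the number of column updates per month."""
    return n_per_update * COLUMN_UPDATES_PER_MONTH[(interval, timeframe_kind)]
```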

It is also important to note that there will be an initial bootstrapping phase when a Cached Dataset is first created. During this phase Keen will need to run queries for every column in the matrix, i.e. every interval in the timeframe. To estimate the cost of this bootstrapping phase just estimate the cost of a single column update query (as described above) and multiply by the number of columns/intervals (e.g. 500 for a this_500_days timeframe).
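The bootstrap estimate is a single multiplication; here is a sketch with made-up numbers:

```python
# Bootstrapping: one column-update query per interval in the timeframe.
n_per_update = 30_000  # hypothetical properties scanned per column update
columns = 500          # e.g. a this_500_days timeframe at a daily interval
bootstrap_scanned = n_per_update * columns  # 15M properties scanned, one-time
```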

Working Example, re-revisited

Building upon the example Query and Cached Query above, imagine we convert the query to a Cached Dataset by removing the filter on y and instead using y as the index_by property. We use a timeframe of this_90_days and a daily interval. The cost of a single column update query will be N = E * P = 10K * 3 = 30K (actually less in practice due to the partial-interval behavior mentioned above). The number of column updates per month will be 1440, as listed above for a daily interval. So the total number of properties scanned will be less than 30K * 1440 = 43.2M or $4.32 per month, a >10x reduction compared to the Cached Query version and a 37x reduction over the original ad hoc version.

Warning: avoid this_N_years Cached Datasets

When using a this modifier and a yearly interval, we have to run a column update query every hour that covers the entire current year-to-date. This becomes quite expensive as the year progresses and contains more events. For this reason we strongly discourage use of this_N_years with Cached Datasets.