 # Compute Pricing Guide

At its core, the Keen Compute API is your method of asking questions about the data you have collected. This legacy guide remains a useful and accurate pricing tool for customers on our legacy metered pricing model. For accounts created after July 2019, however, this is no longer the case: properties scanned, the metric this guide helps customers control, is not a billable component on our subscription tiers. This guide will remain public to all accounts because it can still help customers optimize the performance of their queries and datasets.

### What does “Properties Scanned” mean?

In short, it represents the amount of data we had to process to answer the query you requested. For the rest of this document we’ll use the following definitions:

• `N`: the total properties scanned.
• `E`: the number of events that exist within the `timeframe` you provided.
• `P`: the number of properties per event required to calculate the query.

The formula used to calculate the number of properties scanned per query is:

`N = E * P`

To calculate `P` for a given query, first find the set of unique properties referenced in the `filters`, `group_by`, or `target_property` parameters. If the query has a `timeframe` (which almost all should) then include `keen.timestamp` in this set. `P` is then equal to the cardinality, or number of elements, in the set. This means that if you’re filtering and grouping by the same property, that only increases `P` by 1.

Here are some examples of calculating `P` (all assuming that a `timeframe` is present):

| Query Definition | `P` |
| --- | --- |
| count collection C | 1 |
| count collection C, filter on property A | 2 |
| sum on property A, filter on property B | 3 |
| sum on property A, filter on property A, group_by property A | 2 |
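The rule for computing `P` can be sketched in a few lines of Python. The function signature below is an illustration of the rule, not the actual Keen API:

```python
def properties_scanned_per_event(filters=(), group_by=(), target_property=None,
                                 has_timeframe=True):
    """Compute P: the number of unique properties a query reads per event.

    `filters` and `group_by` are iterables of property names; a property
    referenced by several parameters only counts once.
    """
    props = set(filters) | set(group_by)
    if target_property is not None:
        props.add(target_property)
    if has_timeframe:
        props.add("keen.timestamp")  # a timeframe reads keen.timestamp
    return len(props)

# The example rows above:
properties_scanned_per_event()                                   # count -> 1
properties_scanned_per_event(filters=["A"])                      # -> 2
properties_scanned_per_event(filters=["B"], target_property="A") # -> 3
properties_scanned_per_event(filters=["A"], group_by=["A"],
                             target_property="A")                # -> 2
```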

#### Working Example

To illustrate this formula let’s consider an example query: a `sum` of property `x`, with a filter on property `y` being equal (`eq`) to a specific value, a `timeframe` of `this_90_days`, and an `interval` of `daily`. Assume that the collection being queried gets a steady `10K` events per day. For this query `P` will be `3` (the properties are `x`, `y`, and `keen.timestamp`) and `E` will be `900K` so `N = 900K * 3 = 2.7M`. For the sake of example, if the cost is \$1 per `10M` properties scanned this query would cost \$0.27. If this query is powering a KPI in a dashboard that is viewed 20 times per day, or 600 times per month, the monthly cost of that KPI would be `600 * 2.7M = 1.62B` properties scanned or \$162. Read on to see how caching can help bring that price down.
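The arithmetic in this example can be checked directly (the \$1 per `10M` rate is the same hypothetical figure used above):

```python
# Worked example: sum of `x`, filter on `y`, this_90_days, daily interval.
events_per_day = 10_000
days = 90
P = 3                          # x, y, and keen.timestamp
E = events_per_day * days      # 900_000 events in the timeframe
N = E * P                      # 2_700_000 properties scanned per run

cost_per_run = N / 10_000_000 * 1.00   # hypothetical $1 per 10M scanned -> $0.27
monthly_runs = 20 * 30                 # 20 dashboard views/day, 30-day month
monthly_cost = cost_per_run * monthly_runs   # $162.00
```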

### Why is Keen Compute usage priced based on properties scanned?

You can think of Keen as a columnar database indexed on `keen.timestamp`.

“Columnar database” means that when we are reading data to evaluate a query, we can read just the subset of properties (or columns) that are relevant to the query and skip the others. All other factors being equal, a query that reads 10 properties requires 5 times as much work as one that reads 2 properties.

“Indexed on `keen.timestamp`” means that we can efficiently look up the subset of events whose `keen.timestamp` falls within a given range, also known as a `timeframe`. All other factors being equal, a query whose `timeframe` includes 10 million events requires 10 times as much work as one whose `timeframe` includes 1 million events. Importantly, this is true even if filters on other properties dramatically reduce the number of events that are actually used to compute the result. A query with a `timeframe` that includes 10 million events and a `filter` that matches just 10 will still have to read all 10 million.

While these two factors are not the only contributors to how much it costs Keen to execute a query, they tend to dominate in most cases and provide a good approximation. We use these factors to compute your usage in order to align your costs with ours, which incentivizes efficient implementations and allows us to lower costs for everyone.

### How are Extractions priced?

Giving you access to your raw data is very important to us at Keen. We provide the ability to extract chunks of your data in CSV or JSON format via our Extraction API.

Extractions follow the same pricing formula described above. In this case `P` equals the number of properties per event you want to extract. By default this is all of the properties in the schema for the given event collection (similar to a `SELECT *` SQL query).

You can limit the properties retrieved by an extraction using the `property_names` parameter. If you only need a small subset of the properties in the schema for your use case then this can result in a large cost savings (and performance improvement).
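To see the effect, here is a quick comparison using hypothetical numbers (the schema size and property counts below are made up for illustration):

```python
# Extraction cost follows N = E * P, where P is the number of properties
# returned per event.
events_in_timeframe = 1_000_000   # E: events matching the timeframe (hypothetical)
schema_properties = 40            # full schema size: the SELECT * default (hypothetical)
needed_properties = 3             # what you'd request via property_names

full_extraction = events_in_timeframe * schema_properties      # 40M properties scanned
trimmed_extraction = events_in_timeframe * needed_properties   # 3M properties scanned
```

Trimming the extraction to three properties cuts the scan by more than a factor of 13 in this sketch.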

### How are funnels priced?

Funnels are a powerful tool for analyzing data. They allow you to analyze a cohort’s behavior across multiple events.

The calculations for `E` and `P` are slightly different for funnels than for other queries:

`E` is calculated as the sum of the number of events that matched the timeframe of each step in the funnel. For example, if step 1 is over collection `foo` with a timeframe of `this_30_days`, step 2 is over collection `foo` with a timeframe of `this_10_days`, and step 3 is over collection `bar` with a timeframe of `this_30_days`, then `E` will be equal to:

```
[# of events in foo in this_30_days]
+ [# of events in foo in this_10_days]
+ [# of events in bar in this_30_days]
```

`P` is calculated based on the set of properties that appear in any `step`: all filters, all `actor_id` properties, plus `keen.timestamp`. Note that properties with the same name but in different collections are currently considered to be the same property for the purposes of this calculation.

The total properties scanned is still just computed as `N = E * P`.
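The funnel rules above can be sketched as follows. The step shape and field names here are simplified stand-ins, not the exact Keen API payload:

```python
def funnel_properties_scanned(steps):
    """Estimate N = E * P for a funnel.

    E: sum of events matching each step's timeframe.
    P: unique property names across all steps (compared by name only, even
       across collections), plus keen.timestamp.
    """
    E = sum(step["events_in_timeframe"] for step in steps)
    props = {"keen.timestamp"}
    for step in steps:
        props.update(step.get("filter_properties", []))
        props.add(step["actor_id"])
    return E * len(props)

# Hypothetical event counts for the three-step example above:
steps = [
    {"events_in_timeframe": 300_000, "actor_id": "user.id"},  # foo, this_30_days
    {"events_in_timeframe": 100_000, "actor_id": "user.id"},  # foo, this_10_days
    {"events_in_timeframe": 200_000, "actor_id": "user.id",
     "filter_properties": ["country"]},                       # bar, this_30_days
]
# E = 600_000 and P = 3 (user.id, country, keen.timestamp) -> N = 1_800_000
```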

### Finding the Cost of a Query at Execution Time

We provide an option to enrich query results with the detailed number of scanned events (`E`) and properties (`P`), as well as the total `N`. Please read how to modify your query in order to see these execution details.

## How Caching Saves on Query Costs

By default our compute API calculates answers at the time of request. These “ad hoc” queries are great for exploration, but if similar or identical queries are made frequently it can drive up costs. Caching is an effective way to make common queries both cheaper and faster.

### Cached Queries

A Cached Query is a query that Keen automatically runs on a schedule, according to a specified `refresh_rate` (configurable between 4 and 48 hours). The result is kept in a cache, so retrieving it pulls the stored result instead of recomputing it.

Cached Query pricing is based purely on the queries that Keen runs to update the cache; there is no cost to you for retrieving the cached result.

If a hypothetical dashboard is viewed 100 times per day and its queries are all being calculated from scratch every time, the total properties scanned usage will rise very quickly. If instead the same dashboard uses Cached Queries they will only be calculated once per `refresh_rate` period, thus reducing the amount of compute that needs to be done. On top of that, the data required to power the dashboard will be served from the cache for increased speed.

To estimate the total monthly properties scanned for a Cached Query, simply compute the properties scanned for a single execution (using the `N = E * P` formula from above) and then multiply by `R` = the number of times the query will be run per month. With a `refresh_rate` of 4 hours and a 30-day month, for example, `R` will be around 180.

#### Working Example, revisited

In our example above we considered a query run 20 times per day that would generate `1.62B` or \$162 worth of properties scanned usage per month. If that same query was migrated to a Cached Query with a `refresh_rate` of 4 hours then it would generate `2.7M * 180 = 486M` properties scanned per month, or \$48.60, a 70% savings. The more frequently the query result is retrieved, the bigger the savings of a Cached Query over ad hoc. Conversely if the ad hoc query is only run once or twice per day then Cached Queries are probably not a good cost-saving opportunity.
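The comparison reduces to counting executions, using the same hypothetical \$1 per `10M` rate:

```python
# Ad hoc vs Cached Query usage for the working example query.
N_PER_RUN = 2_700_000                  # properties scanned per execution (from above)
PRICE_PER_10M = 1.00                   # hypothetical rate

def monthly_cost(runs_per_month):
    return runs_per_month * N_PER_RUN / 10_000_000 * PRICE_PER_10M

ad_hoc = monthly_cost(20 * 30)         # 20 views/day, each recomputed -> $162.00
cached = monthly_cost((24 // 4) * 30)  # refresh_rate of 4 hours -> R = 180 -> $48.60
```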

### Cached Datasets

Cached Datasets allow you to precompute results for a query for every value of an `index_by` property (or combination of properties) and then quickly look up the result of that query for a specific value.

Like Cached Queries, Cached Datasets are priced based on the queries that Keen runs in the background to build and update the cache. There is no cost for retrieving results. To estimate how much a Cached Dataset will cost per month, we need to understand a little more about how it is updated.

Under the hood, Keen logically stores the cached dataset as a matrix whose rows are the values of the `index_by` property and whose columns are the intervals. Each cell in the matrix represents the result of the query for that `index_by` value in that interval. Once per hour Keen checks which columns are due to be refreshed by finding (a) any columns that have never been computed before; (b) the column that contains the current time, if any; and (c) the column that contains the time 48 hours ago, if any. (This 48-hour “trailing update” is to catch any late-arriving data.) Note that (b) and (c) may be the same column, e.g. because the Cached Dataset has a `monthly` interval and it is the middle of the month.

To update a column Keen runs a query that is similar to the one in the Cached Dataset definition, but it is modified in the following ways:

• The `interval` is removed.
• The `timeframe` is set to be the absolute boundaries of the interval corresponding to the column being updated. For example if the Cached Dataset has a `daily` interval and we are updating the column for the current day, then the `timeframe` for the update query will be set to something like `{"start":"2019-01-01T00:00:00Z","end":"2019-01-02T00:00:00Z"}`.
• The `index_by` property (or properties) are added to the list of `group_by` properties.
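The three modifications above can be sketched as a transformation over a dict-shaped query definition. The field names are illustrative stand-ins, not the exact Keen API payload:

```python
def column_update_query(definition, column_start, column_end):
    """Sketch of the column-update rewrite described above."""
    # 1. The interval is removed.
    query = {k: v for k, v in definition.items() if k != "interval"}
    # 2. The timeframe becomes the absolute bounds of the column being updated.
    query["timeframe"] = {"start": column_start, "end": column_end}
    # 3. The index_by properties are appended to group_by.
    index_by = query.pop("index_by", [])
    query["group_by"] = list(query.get("group_by", [])) + list(index_by)
    return query
```

For a `daily` Cached Dataset indexed by `y`, updating the column for 2019-01-01 would produce a query with no `interval`, an absolute one-day `timeframe`, and `y` in `group_by`.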

Once this query completes it is parsed into individual results for each `index_by` value and the appropriate cells in the matrix are updated.

With that background in mind, the monthly cost of a Cached Dataset can be estimated as follows. First, compute the cost of a single column update query. To do this, follow the same `N = E * P` formula as before, using the typical number of events per unit of the Cached Dataset’s `interval` for `E` and including the `index_by` properties when computing `P`. Then, based on the `interval` and `timeframe` being used, look up how many column update queries will be run per month in the following table (a 30-day month is assumed for simplicity):

| Interval | Timeframe keyword | Column updates per month (approx.) | Explanation |
| --- | --- | --- | --- |
| `minutely` | `this_N_minutes` | `60 * 24 * 30 = 43200` | Each hour the latest ~60 minutes are computed. (Note: 48 hours = 2880 minutes > 2000 (the max value of `N`), so the “trailing update” will always fall outside the timeframe.) |
| `minutely` | `previous_N_minutes` | `60 * 24 * 30 = 43200` | Same as `this_N_minutes`. |
| `hourly` | `this_N_hours` | `2 * 24 * 30 = 1440` | Each hour is computed twice: once when it first enters the timeframe and once by the “trailing update” pass. |
| `hourly` | `previous_N_hours` | `2 * 24 * 30 = 1440` | Same as `this_N_hours`. |
| `daily` | `this_N_days` | `2 * 24 * 30 = 1440` | Each hour the current day and the day before yesterday are computed. |
| `daily` | `previous_N_days` | `30 + (24 * 30) = 750` | Each day is computed once in the hour after it ends, then again every hour during the second day afterwards by the “trailing update” pass. |
| `weekly` | `this_N_weeks` | `(24 * 30) + (4 * 48) = 912` | Each hour the current week is computed. For the first 48 hours of a week, the previous week is also computed by the “trailing update” pass. (Note: some months will compute 5 weeks instead of 4.) |
| `weekly` | `previous_N_weeks` | `4 * 48 = 192` | For the first 48 hours of a week, the previous week is computed by the “trailing update” pass. (Note: some months will compute 5 weeks instead of 4.) |
| `monthly` | `this_N_months` | `(24 * 30) + 48 = 768` | Each hour the current month is computed. For the first 48 hours of a month, the previous month is also computed by the “trailing update” pass. |
| `monthly` | `previous_N_months` | `48` | For the first 48 hours of a month, the previous month is computed by the “trailing update” pass. |
| `yearly` | `this_N_years` | `24 * 30 = 720` | Each hour the current year is computed. (Note: the previous year is also recomputed for the first 48 hours of the year, so this will be slightly higher in January.) |
| `yearly` | `previous_N_years` | `48` in January; `0` thereafter | The previous year is re-evaluated for the first 48 hours of January, but after that no updates are necessary. |

Then multiply the cost per update query by the number of column updates per month to get a rough estimate of total properties scanned usage. Note that for `this_N_*` timeframes with longer-duration intervals (`daily` and up) this will usually be an over-estimate, because the column update query for an in-progress interval won’t read as much data as one for an already-finished interval. For example, when first updating the column for the current day given a `this_N_days` timeframe, there will only be 1/24th of a day’s worth of events to scan. To get a precise estimate you will need to take this into account, but in most cases even the conservative estimate will be good enough.

It is also important to note that there will be an initial bootstrapping phase when a Cached Dataset is first created. During this phase Keen will need to run queries for every column in the matrix, i.e. every interval in the `timeframe`. To estimate the cost of this bootstrapping phase just estimate the cost of a single column update query (as described above) and multiply by the number of columns/intervals (e.g. `500` for a `this_500_days` timeframe).

#### Working Example, re-revisited

Building upon the example Query and Cached Query above, imagine we convert the query to a Cached Dataset by removing the filter on `y` and instead using `y` as the `index_by` property. We use a timeframe of `this_90_days` and a `daily` interval. The cost of a single column update query will be `N = E * P = 10K * 3 = 30K` (actually less in practice due to the partial-interval behavior mentioned above). The number of column updates per month will be 1440, from the table. So the total number of properties scanned will be less than `30K * 1440 = 43.2M` or \$4.32 per month, a >10x reduction compared to the Cached Query version and a 37x reduction over the original ad hoc version.
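The estimate, plus the one-time bootstrapping cost, works out like this (same hypothetical \$1 per `10M` rate):

```python
# Rough Cached Dataset usage for the re-revisited example.
EVENTS_PER_DAY = 10_000
P = 3                                  # x, y (the index_by property), keen.timestamp
UPDATES_PER_MONTH = 1440               # daily interval + this_N_days, from the table

per_update = EVENTS_PER_DAY * P        # 30K, a conservative upper bound per column
monthly = per_update * UPDATES_PER_MONTH   # 43.2M properties scanned -> $4.32/month
bootstrap = per_update * 90            # 90 columns for this_90_days -> 2.7M, one time
```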

#### Warning: avoid `this_N_years` Cached Datasets

When using a `this` modifier and a `yearly` interval, we have to run a column update query every hour that covers the entire current year-to-date. This becomes quite expensive as the year progresses and contains more events. For this reason we strongly discourage use of `this_N_years` with Cached Datasets.