Group By Ranges & Buckets

The Keen REST API fully supports the new group_by ranges and group_by buckets functionality. It’s possible to invoke an ad-hoc query, create and retrieve the results from a saved query, a cached query, or a cached dataset. The response structure is identical as the group_by property_name response, so it should work out-of-the-box with the visualisation framework of your choice.

However, we are still working on rolling out the support for the group_by ranges and the group_by buckets in:

  • Some SDKs
  • The Explorer
  • The Saved Queries view

Overview

The purpose of this guide is to show you how to use the new group_by parameter options: ranges and buckets.

The group_by parameter can be added to all analysis types (i.e. count, count_unique, select_unique, sum, min, max, average, percentile, …) but extractions. It groups results categorically. Up until now it was only possible to group by a specific value of a property_name. The new functionality introduces a possibility to group by a range of some property values.

Grouping by a range of values can be useful to:

  • build a Web Service response time histogram
  • charge for phone calls of different duration
  • analyze the results of an exam

An exam case study

The examples below will analyse an exam results using the group_by ranges and group_by buckets functionality.

Following event schema will be used in the examples:

{
  "score": 71,
  "age": 59, 
  "state": "ID"
}

Assumptions:

  • The max score at the exam is 100, the min score is 0.
  • For statistical reference the age and the state (2-letter code) are registered.
  • The exam is only for adults, so age >= 18.

Exam results analysis

To understand how students performed in the exam the group_by buckets feature can be used. The score will be divided into groups of size 10 as specified in the size parameter.

The query:

{
  "analysis_type": "count",
  "event_collection": "exam_results",
  "group_by": {
    "buckets": {
      "size": 10
    },
    "property_name": "score"
  },
  ...
}

The query explained:

  • The group_by buckets construct is perfect for an initial analysis of some data. The definition requires just the size and the property_name parameters.
  • The optional offset parameter will default to 0, so buckets will be created using the pattern: [0,10), [10, 20), [20, 30), ....
  • The last bucket [100, 110) aggregates only the score = 100 events, however, in group_by buckets there is no way to modify the key shown in the chart legend.

The result:

Improved exam result analysis

The chart above gives us some information, however, it’s still unknown how many students have passed (65 points or more), or what grade they’ve received. Let’s use the group_by ranges with the count analysis to build a more sophisticated exam score chart.

The query:

{
  "analysis_type": "count",
  "event_collection": "exam_results",
  "group_by": {
    "ranges": [
      {"to": 65, "key": "[0, 65) -> E"},
      {"from": 65, "to": 70, "key": "[65, 70) -> D"},
      {"from": 70, "to": 80, "key": "[70, 80) -> C"},
      {"from": 80, "to": 90, "key": "[80, 90) -> B"},
      {"from": 90, "key": "[90, 100] -> A"}
    ],
    "property_name": "score"
  },
  ...
}

The query explained:

  • The group_by ranges requires more manual work, as each range needs to be specified manually, but is more powerful:
    • each range can have different size
    • as edges can be specified as either inclusive or exclusive,
    • the key can be customised
  • Both the to and from parameters are optional.
    • If from is not provided then -Infinity is assumed.
    • If to is not provided then +Infinity is assumed.
  • The key is also optional. If not provided it will be calculated based on the from and to parameters.
  • The from parameter is inclusive by default. It can be changed by the from_closed boolean parameter.
  • The to parameter is exclusive by default. It can be changed by the to_closed boolean parameter.

The result:

Exam results in different age groups

We can drill down further to see in which age group the exam score average was the highest.

The query:

{
  "analysis_type": "average",
  "target_property": "score",
  "event_collection": "exam_results",
  "group_by": {
    "buckets": {
      "size": 10,
      "offset": 18
    },
    "property_name": "age"
  },
  ...
}

The query explained:

  • As mentioned in the Overview not only the count analysis type can be used with the group_by parameter.
  • In this example the average analysis will be performed on data subsets determined by the age property.
  • Since all participants are adults, the offset parameter is set to 18, so the buckets are built starting from this value.

The result:

Exam results grouped by US states

In this example more group_by ranges features will be presented. The most important one is the ability to group by not only number ranges but also by string ranges.

The query:

{
  "analysis_type": "average",
  "target_property": "score",
  "event_collection": "exam_results",
  "group_by": {
    "ranges": [
      {"from": "A", "to": "B", "key": "Starting with A"},
      {"from": "C", "to": "D", "key": "Starting with C"},
      {"from": "M", "to": "MI", "to_closed": true, "key": "States: MA, MD, ME, MI"},
      {"from": "MI", "to": "N", "from_closed": false, "key": "States: MN, MO, MS, MT"},
      {"to":"N", "key": "Starting with [A-M]"},
      {"from":"N", "key": "Starting with [N-Z]"},
      {"key": "All states"}
    ],
    "property_name": "state",
    "allow_range_overlap": true,
    "alias": "mark"
  },
  ...
}

The query explained:

  • String types are compared lexicographically, so you can define ranges for strings as well.
  • The from parameter is inclusive by default. This can be changed by the "from_closed" = false boolean parameter.
  • The to parameter is exclusive by default. Can be changed by the "to_closed" = true boolean parameter.
  • There are several overlapping range definitions, so the allow_range_overlap must be set to true, otherwise the query will error.
  • The key is the string that will be displayed in the chart legend.
  • The last range definition is interesting as neither from nor to is present. Such range collects all events and will display the average for all states (as the key suggests).
  • The alias is optional and only visible in the raw JSON response.

The result:

Exam pass rate evolution over time

The exam results statistic: how many passed and failed attempts per year, can be created using the interval keyword.

The query:

{
  "analysis_type": "count",
  "event_collection": "exam_results",
  "group_by": {
    "ranges": [
      {"to": 65, "key": "FAIL"},
      {"from": 65, "key": "PASS"}
    ],
    "property_name": "score",
    "alias": "Exam result"
  },
  "interval": "yearly",
  ...
}

The query explained:

  • Only two range definitions are present: PASS and FAIL.
  • Since to is exclusive, then all score values strictly less than 65 are marked as FAIL.
  • Since from is inclusive, then all score values grater than or equal to 65 are marked as PASS.
  • The yearly interval will additionally group the results by the year the exam has been completed.

The result: