Group By Ranges & Buckets
The Keen REST API fully supports the new group_by ranges and group_by buckets functionality.
It’s possible to invoke an ad-hoc query, create and retrieve the results from a saved query, a cached query, or a cached dataset.
The response structure is identical as the group_by property_name response, so it should work out-of-the-box with the visualisation framework of your choice.
However, we are still working on rolling out the support for the group_by ranges and the group_by buckets in:
- Some SDKs
- The Explorer
- The Saved Queries view
Overview
The purpose of this guide is to show you how to use the new group_by parameter options: ranges and buckets.
The group_by parameter can be added to all analysis types (i.e. count, count_unique, select_unique, sum, min, max, average, percentile, …) but extractions. It groups results categorically. Up until now it was only possible to group by a specific value of a property_name. The new functionality introduces a possibility to group by a range of some property values.
Grouping by a range of values can be useful to:
- build a Web Service response time histogram
- charge for phone calls of different duration
- analyze the results of an exam
An exam case study
The examples below will analyse an exam results using the group_by ranges and group_by buckets functionality.
Following event schema will be used in the examples:
{
"score": 71,
"age": 59,
"state": "ID"
}
Assumptions:
- The max
scoreat the exam is 100, the minscoreis 0. - For statistical reference the
ageand thestate(2-letter code) are registered. - The exam is only for adults, so
age >= 18.
Exam results analysis
To understand how students performed in the exam the group_by buckets feature can be used. The score will be divided into groups of size 10 as specified in the size parameter.
The query:
{
"analysis_type": "count",
"event_collection": "exam_results",
"group_by": {
"buckets": {
"size": 10
},
"property_name": "score"
},
...
}
The query explained:
- The
group_by bucketsconstruct is perfect for an initial analysis of some data. The definition requires just thesizeand theproperty_nameparameters. - The optional
offsetparameter will default to 0, so buckets will be created using the pattern:[0,10), [10, 20), [20, 30), .... - The last bucket
[100, 110)aggregates only thescore = 100events, however, ingroup_by bucketsthere is no way to modify thekeyshown in the chart legend.
The result:
Improved exam result analysis
The chart above gives us some information, however, it’s still unknown how many students have passed (65 points or more), or what grade they’ve received. Let’s use the group_by ranges with the count analysis to build a more sophisticated exam score chart.
The query:
{
"analysis_type": "count",
"event_collection": "exam_results",
"group_by": {
"ranges": [
{"to": 65, "key": "[0, 65) -> E"},
{"from": 65, "to": 70, "key": "[65, 70) -> D"},
{"from": 70, "to": 80, "key": "[70, 80) -> C"},
{"from": 80, "to": 90, "key": "[80, 90) -> B"},
{"from": 90, "key": "[90, 100] -> A"}
],
"property_name": "score"
},
...
}
The query explained:
- The
group_by rangesrequires more manual work, as each range needs to be specified manually, but is more powerful:- each
rangecan have different size - as edges can be specified as either inclusive or exclusive,
- the
keycan be customised
- each
- Both the
toandfromparameters are optional.- If
fromis not provided then-Infinityis assumed. - If
tois not provided then+Infinityis assumed.
- If
- The
keyis also optional. If not provided it will be calculated based on thefromandtoparameters. - The
fromparameter is inclusive by default. It can be changed by thefrom_closedboolean parameter. - The
toparameter is exclusive by default. It can be changed by theto_closedboolean parameter.
The result:
Exam results in different age groups
We can drill down further to see in which age group the exam score average was the highest.
The query:
{
"analysis_type": "average",
"target_property": "score",
"event_collection": "exam_results",
"group_by": {
"buckets": {
"size": 10,
"offset": 18
},
"property_name": "age"
},
...
}
The query explained:
- As mentioned in the
Overviewnot only the count analysis type can be used with thegroup_byparameter. - In this example the
averageanalysis will be performed on data subsets determined by theageproperty. - Since all participants are adults, the
offsetparameter is set to 18, so the buckets are built starting from this value.
The result:
Exam results grouped by US states
In this example more group_by ranges features will be presented. The most important one is the ability to group by not only number ranges but also by string ranges.
The query:
{
"analysis_type": "average",
"target_property": "score",
"event_collection": "exam_results",
"group_by": {
"ranges": [
{"from": "A", "to": "B", "key": "Starting with A"},
{"from": "C", "to": "D", "key": "Starting with C"},
{"from": "M", "to": "MI", "to_closed": true, "key": "States: MA, MD, ME, MI"},
{"from": "MI", "to": "N", "from_closed": false, "key": "States: MN, MO, MS, MT"},
{"to":"N", "key": "Starting with [A-M]"},
{"from":"N", "key": "Starting with [N-Z]"},
{"key": "All states"}
],
"property_name": "state",
"allow_range_overlap": true,
"alias": "mark"
},
...
}
The query explained:
- String types are compared lexicographically, so you can define ranges for strings as well.
- The
fromparameter is inclusive by default. This can be changed by the"from_closed" = falseboolean parameter. - The
toparameter is exclusive by default. Can be changed by the"to_closed" = trueboolean parameter. - There are several overlapping
rangedefinitions, so theallow_range_overlapmust be set totrue, otherwise the query will error. - The
keyis the string that will be displayed in the chart legend. - The last
rangedefinition is interesting as neitherfromnortois present. Suchrangecollects all events and will display the average for all states (as thekeysuggests). - The
aliasis optional and only visible in the raw JSON response.
The result:
Exam pass rate evolution over time
The exam results statistic: how many passed and failed attempts per year, can be created using the interval keyword.
The query:
{
"analysis_type": "count",
"event_collection": "exam_results",
"group_by": {
"ranges": [
{"to": 65, "key": "FAIL"},
{"from": 65, "key": "PASS"}
],
"property_name": "score",
"alias": "Exam result"
},
"interval": "yearly",
...
}
The query explained:
- Only two
rangedefinitions are present:PASSandFAIL. - Since
tois exclusive, then allscorevalues strictly less than65are marked asFAIL. - Since
fromis inclusive, then allscorevalues grater than or equal to65are marked asPASS. - The
yearlyinterval will additionally group the results by the year the exam has been completed.
The result: