Avatar photo

The What, How, and Why of OR Filters

Hello Keen community! Back in September, we announced the release of “or” filters. In this post, we’ll share what we learned in a deep dive on the what, how, and why.

Why, Part 1: Foundation for a Richer Query Language

Let’s start with the big question: why build “or” filters? One simple answer is that it has been among our most-requested features for a long time and that in and of itself was enough to justify building it. On our quest to make Keen ever more flexible, this capability allows our customers to quickly query their data in ways that up until now were either complicated and expensive, or outright impossible.

But the scenarios unlocked by the new architecture on which “or” filters were built are even more exciting. We’ll discuss the new model in detail below, but just as a teaser here are some of the enticing scenarios that it will enable us to build:

  • Filtering on functions of properties, e.g. (player.level mod 5) eq 0
  • Filtering on relationships between properties, e.g.
    player.level < (1.2 * monster.level)
  • Using expressions in the target_property of a query, e.g.
    sum (order.count * order.unit_price)
  • Using expressions in the group_by of a query, e.g.
    group_by (user.point_program exists true) or
    group_by bin(customer.age, [0, 20, 40, 60, 80, 100])

Adding these capabilities will open up whole new paradigms of querying in Keen (if you’ve got ideas, we’d love to hear them), and bring us one step closer to parity/compatibility with SQL and other mature query languages. But before we go any deeper let’s go back and cover the basics.

What, Part 1: A Simple Example

To satisfy those of you who arrived here just looking for some sample code, here’s what an “or” filter looks like:

[code lang=”php” gutter=”1″]
$ curl https://api.keen.io/3.0/projects/PROJECT_ID/queries/count \
-H "Authorization: READ_KEY" \
-H ‘Content-Type: application/json’ \
-d "{
\"event_collection\": \"purchases\",
\"timeframe\": \"this_14_days\",
\"filters\": [
{
\"operator\" : \"or\",
\"operands\" : [
{
\"property_name\" : \"customer.total_revenue\",
\"operator\" : \"gte\",
\"property_value\" : 100.00
},
{
\"property_name\" : \"customer.tier\",
\"operator\" : \"eq\",
\"property_value\" : \"premium\"
}
]
}
]
}"
[/code]

(You can see a similar example and read more in our API docs.)

How, Part 1: A Fundamentally New Concept

Looking at the sample request above, it may seem like “or” filters are a simple feature – and in many respects they are. They allow you to query against events that match any one of a set of conditions, rather than matching all conditions. But under the hood, the implementation was actually quite complex, and it’s worth going into why.

Prior to the introduction of “or” filters, all supported filter types conformed to the same pattern: a 3-tuple of `(property_name, operator, property_value)`. This was reflected in our implementation, which explicitly defined a filter as a POJO (Plain Old Java Object) with those properties:

[code]
public class Filter {
private final String propertyName;
private final Operator operator;
private final Object propertyValue;

}
[/code]

This allowed for simple and efficient code, but it lacked the flexibility necessary to introduce filters with fundamentally different structures. We could have hacked in support for “or” filters by just glomming onto that existing `Filter` class, but it would have degraded overall code quality and would have left us in even worse shape when we inevitably add the next filter operator with yet another structure. It seemed like there must be a better way…

How, Part 2: What is a Filter?

Good software architecture often starts with asking simple questions. What, really, is a filter? Logically speaking it’s just a predicate on an event, nothing more and nothing less. We could have modeled it that way and it would have worked, but there was an even broader generalization to make: an expression is a function that takes an event as input and produces some value (in the Keen type system) as output, and a filter is just an expression that always produces a boolean (i.e. true or false) output.

An expression can be a constant, or a reference to a property, or some function with one or more expressions as its operands. Since functions are themselves expressions, they can appear as operands to other functions and form a so-called Abstract Syntax Tree (AST). For now we have defined functions for the existing filter types (such as “eq”, “gte”, and now “or”) but the logic allows for expressions using any mathematical operations you can think of: addition, subtraction, multiplication, division, modulus, logarithms, exponentiation, or even binning etc.

What, Part 2: Using “or” Filters

The example from “What, Part 1” illustrates the mechanics of running an “or” filter query, but what problem is that query actually solving? Suppose that you want to add a graph to your embedded customer-facing analytics dashboard showing them how many of their clickthrough events were from their “high-value” customers, which you define to be customers who either (a) have a lifetime revenue over $100 or (b) explicitly subscribe to a premium tier. The example query above is solving this problem.

Without “or” filters that would be much trickier to accomplish. You could query the two parts separately and sum them, i.e.:

[code]
high_val_purchases = count(customer.total_revenue gte 100)
+ count(customer.tier eq "prem")
[/code]

Chart A: How many clickthroughs from high-value customers?

Double-counted sum of all customers with LTR > $100 and 2,135 premium tier customers

But this will end up double-counting customers who are subscribed to the premium tier and have a lifetime revenue of over $100. You could correct for this (using the inclusion-exclusion principle) by subtracting out this double-counted amount:

[code]
high_val_purchases = count(customer.total_revenue gte 100)
+ count(customer.tier eq "prem")
– count(customer.total_revenue gte 100 & customer.tier eq "prem")
[/code]

Chart B: How many clickthroughs from high-value customers?

7,863 total minus 1,601 (premium tier accounts with LTR > $100) = 6,262

This works, but now you are running three queries to get the result you want – which means three times the compute usage, plus extra load time on your dashboard. So even this simple case illustrates the value of native “or” filters, and in more involved cases (such as an “or” of three or more conditions) the savings in time, cost, and complexity can be great.

Why, Part 2: Making Keen a One-Stop Shop

Our mission is to make it as easy as possible for you to turn your data into a valuable resource for your users. Keen already provides a lightweight and low friction way to do simple analyses, but there are many more scenarios that can be enabled by increased expressivity. The more questions that Keen can answer for your users, and the more efficiently those questions can be structured, the higher the value it can provide. “Or” filters is just one such feature that we’ve recently implemented and we’ll share many more in the future.

Until Keen can efficiently solve all your analytics needs, we’ll always have our work cut out for us – but we’re making great progress, and we’re happy to have you along for the ride. Drop us a note at team@keen.io with any feedback about “or” filters, expressions, or anything else you’d like to see in the product.

Thanks!
Kevin Litwack
Chief Platform Architect