Event data is a powerful data format that allows us to track and analyze things that happen around us.
However, if you’re used to relational data modeling, event modeling can feel pretty foreign and awkward.
Let’s compare and contrast these data formats so we can better understand their weaknesses and their superpowers. It’s not a contest. Most businesses use both, and they each have their place.
The simplest way I can think to contrast the two is this:
Event Data = Verbs
Relational Data “Rows” = Nouns
Event data answers questions about what things have done. Actions that happened.
Relational data answers questions about the state of things.
Questions that are trivial to answer with a relational data model, like “what class is Sam currently enrolled in?”, can be maddeningly complicated to answer with event data. On the flip side, event data makes other types of questions suddenly very simple, like “how many classes has Sam enrolled in over the course of his life?”
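Here is a minimal sketch of the Sam example in Python. The student record, the event fields (`action`, `student`, `class`, `timestamp`), and all the values are made up for illustration; a real system would look different.

```python
# Hypothetical data: names, classes, and dates are made up for illustration.

# Relational view: one row per student, holding current state.
students = {"sam": {"current_class": "Chemistry"}}

# Event view: one record per action that happened.
enrollment_events = [
    {"action": "enroll", "student": "sam", "class": "Algebra", "timestamp": "2021-09-01"},
    {"action": "complete", "student": "sam", "class": "Algebra", "timestamp": "2021-12-15"},
    {"action": "enroll", "student": "sam", "class": "Biology", "timestamp": "2022-01-10"},
    {"action": "drop", "student": "sam", "class": "Biology", "timestamp": "2022-02-01"},
    {"action": "enroll", "student": "sam", "class": "Chemistry", "timestamp": "2022-09-01"},
]


def current_class_from_row(student):
    """State question, relational model: a single lookup."""
    return students[student]["current_class"]


def current_classes_from_events(student):
    """State question, event model: replay every event to rebuild state."""
    active = set()
    for event in sorted(enrollment_events, key=lambda e: e["timestamp"]):
        if event["student"] != student:
            continue
        if event["action"] == "enroll":
            active.add(event["class"])
        else:  # "drop" or "complete" ends the enrollment
            active.discard(event["class"])
    return active


def lifetime_enrollments(student):
    """History question, event model: just count the matching events."""
    return sum(
        1
        for event in enrollment_events
        if event["student"] == student and event["action"] == "enroll"
    )


print(current_class_from_row("sam"))
print(current_classes_from_events("sam"))
print(lifetime_enrollments("sam"))
```

The state question is one lookup against the row but a full replay against the events, while the history question is a simple count over the events and cannot be answered from the row at all.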
Something is Amiss
To illustrate why it’s important to get this right, let’s look at some real-world examples of trying to use the wrong data format. It’s like trying to use a screwdriver in place of a hammer.
Trying to use event data for a relational data problem
Let’s say you’re tracking an event every time new products are added to your inventory. You’re also tracking an event every time products are sold. This is useful data for many types of analysis. But it would be a nightmarish way to keep track of which products are currently in inventory, something that would be trivial with a relational table.
How many hammers are currently in our inventory?
# relational model: oh let's just go look at our inventory table
# event data model: uhhh, let's count every time a hammer was ever added since the store opened, then subtract every time a hammer was ever sold, and theoretically that should tell us how many we have.
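To make the hammer question concrete, here is a rough sketch in Python. The `inventory_table` dictionary stands in for a relational table, and the event fields (`action`, `product`, `quantity`) are assumptions chosen for this illustration.

```python
# Hypothetical data for illustration only.

# Relational model: the inventory table already holds the current count.
inventory_table = {"hammer": 12, "screwdriver": 30}

# Event model: every addition and sale since the store opened.
inventory_events = [
    {"action": "added", "product": "hammer", "quantity": 20},
    {"action": "added", "product": "screwdriver", "quantity": 30},
    {"action": "sold", "product": "hammer", "quantity": 5},
    {"action": "sold", "product": "hammer", "quantity": 3},
]


def on_hand_from_events(product, events):
    """Add up every addition and subtract every sale for one product."""
    on_hand = 0
    for event in events:
        if event["product"] != product:
            continue
        if event["action"] == "added":
            on_hand += event["quantity"]
        elif event["action"] == "sold":
            on_hand -= event["quantity"]
    return on_hand


hammers_relational = inventory_table["hammer"]
hammers_from_events = on_hand_from_events("hammer", inventory_events)
print(hammers_relational, hammers_from_events)
```

Both approaches agree here, but the event version has to read the entire history, and a single missed or duplicated event would leave the derived count wrong.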
Trying to use relational data for an event data problem
A recent example that comes to mind is a training company providing reporting on course completions.
The symptom they noticed was that when looking back at historical training course completion, the numbers would sometimes change. Like, the number of courses completed last March would inexplicably change from 3842 to 3833 a month later. What the heck!?
Turns out, they were counting the course completions by joining data across several tables. If courses were ever deleted, which occasionally they were, the completions for those courses dropped out of the join, and the historical totals shrank.
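Here is a small sketch of how a join-based count can shift after a deletion, using Python’s built-in `sqlite3` module. The table names, columns, and rows are hypothetical, and the mechanism shown (completions for a deleted course dropping out of an inner join) is an assumption about what happened, not the company’s actual schema.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE completions (
        id INTEGER PRIMARY KEY,
        course_id INTEGER,
        user_id INTEGER,
        completed_on TEXT
    );
    INSERT INTO courses (id, title) VALUES (1, 'Safety Basics'), (2, 'Forklift 101');
    INSERT INTO completions (course_id, user_id, completed_on) VALUES
        (1, 10, '2023-03-05'),
        (1, 11, '2023-03-12'),
        (2, 10, '2023-03-20');
    """
)


def march_completions_via_join():
    """Count March completions by joining completions to their courses."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM completions AS c
        JOIN courses AS co ON co.id = c.course_id
        WHERE c.completed_on BETWEEN '2023-03-01' AND '2023-03-31'
        """
    ).fetchone()
    return row[0]


count_before_delete = march_completions_via_join()
conn.execute("DELETE FROM courses WHERE id = 2")  # someone deletes a course later
count_after_delete = march_completions_via_join()

# Event model: each completion is recorded once, with the course details
# copied onto the event, so later deletions cannot change history.
completion_events = [
    {"action": "complete", "timestamp": "2023-03-05", "course": {"id": 1, "title": "Safety Basics"}},
    {"action": "complete", "timestamp": "2023-03-12", "course": {"id": 1, "title": "Safety Basics"}},
    {"action": "complete", "timestamp": "2023-03-20", "course": {"id": 2, "title": "Forklift 101"}},
]
count_from_events = sum(
    1 for event in completion_events if event["timestamp"].startswith("2023-03")
)

print(count_before_delete, count_after_delete, count_from_events)
```

The joined count shrinks once the course row is gone, while the count over the recorded events stays put.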
Relational Data (aka Entity Data)
If you’ve ever worked with an application database, you know about entity data. It’s the standard format for the most common type of database, the relational database. Here’s an example:
Relational data is stored in tables. Entities are things like users, products, accounts, posts, levels, etc. There is a separate table for each type of entity, and each table has columns to hold properties about the entities. There is one row in the table for each entity. In this example, the entities are enemies.
Relational databases are really good for capturing the current state of your application. Things like users, product inventories, accounts payable, etc. You can very quickly look up information about any entity.
One characteristic of entity databases is that they are normalized. Data is rarely duplicated. For example, you might have a table for Accounts, with attributes like the account name, type, category, etc. Accounts have many users associated with them, but you wouldn’t store information about those users in the Accounts table. Instead, you would include a key in each user record which links to its account. From a data storage (disk usage) perspective, this is very efficient.
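As a rough illustration of that normalization, here is a sketch using Python’s built-in `sqlite3` module. The `accounts` and `users` tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        account_id INTEGER REFERENCES accounts(id)  -- key linking user to account
    );
    INSERT INTO accounts (id, name, type) VALUES (1, 'Acme Corp', 'enterprise');
    INSERT INTO users (id, name, account_id) VALUES (1, 'Sam', 1), (2, 'Alex', 1);
    """
)


def users_with_account_names():
    """Look up each user's account name through the account_id key."""
    return conn.execute(
        """
        SELECT u.name, a.name
        FROM users AS u
        JOIN accounts AS a ON a.id = u.account_id
        ORDER BY u.id
        """
    ).fetchall()


rows_before_rename = users_with_account_names()

# The account name lives in exactly one row, so a rename is a single update.
conn.execute("UPDATE accounts SET name = 'Acme Inc' WHERE id = 1")
rows_after_rename = users_with_account_names()

print(rows_before_rename)
print(rows_after_rename)
```

Because the account name is stored only once, every user picks up the new name automatically. That is exactly the property that makes relational data good at describing current state, and also why it forgets what the name used to be.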
Event Data (aka Analytics Data)
Now let’s look at the characteristics of event data. Here’s an example:
Event data doesn’t just describe entities; it describes actions performed by entities. This example above describes the action of publishing this blog post. You can imagine we have a collection of events called “publishes” where we track an event for each new post.
What makes this “event data”? Event data has three key pieces of information: an action, a timestamp, and state. I first saw these identified by Ben Johnson in his Speaker Deck on Event Data (he calls it “Behavior Data”).
The action is the thing that’s happening (e.g. “publish”). The timestamp is self explanatory: the point in time the thing happened. The state refers to all of the other relevant information we know about this event, including information about entities related to the event, such as the author.
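One way to picture those three pieces is as a small Python data structure. This is only a sketch: the `Event` class and the field values (author name, post title, date) are placeholders invented for this illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class Event:
    """Hypothetical container for the three pieces of an event."""

    action: str          # the thing that happened, e.g. "publish"
    timestamp: datetime  # the point in time it happened
    state: Dict[str, Any] = field(default_factory=dict)  # everything else we know


# Placeholder values; not taken from any real post or author.
publish_event = Event(
    action="publish",
    timestamp=datetime(2020, 1, 15, 9, 30, tzinfo=timezone.utc),
    state={
        "author": {"name": "Example Author", "role": "editor"},
        "post": {"title": "Example Post", "word_count": 1200},
    },
)

print(publish_event.action)
print(publish_event.timestamp.isoformat())
print(publish_event.state["author"]["name"])
```

The class is frozen because an event describes something that already happened, so it should not change after it is recorded.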
Let’s look at a more complex event. This is a “death” type event:
Here is an example data point for a player death in the game Minecraft. Imagine we are recording every player death that happens in the game.
There are a lot of ways in which the player can experience death: falling from great heights, starvation from not eating enough pork chops, drowning, clumsily stumbling into lava, zombies scaring the crap out of you in a cave, etc.
Let’s say we want to analyze these player deaths in Minecraft. Perhaps we want to find out the most common type of death, the average player age at the time of death, the most lethal enemies, or any number of death-related questions. Perhaps we are trying to find out if the game is too difficult on certain levels, or if the new villain we introduced is causing way more destruction than we’d imagined, or if there is any correlation between types of users and types or frequency of deaths.
We can find out all of these things using the simple event data model shown above. The event data model has a few special qualities:
1. The data is rich (has data about lots of relevant entities)
2. The data is denormalized (we often store the same data repeatedly on all the relevant events)
Additional perks of event data: it can be nested, and it has a much more flexible schema than the rigid tables of entity databases.
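To show how these qualities pay off, here is a sketch of a few death events and the kinds of questions mentioned earlier, in Python. All of the field names (`cause`, `player`, `enemy`, `level`) and values are assumptions for illustration, not an actual Minecraft data format.

```python
from collections import Counter
from statistics import mean

# Hypothetical death events. Player and enemy details are nested and repeated
# on every relevant event (denormalized), and the "enemy" key is simply
# absent when no enemy was involved (flexible schema).
death_events = [
    {"action": "death", "timestamp": "2020-05-01T10:00:00Z", "cause": "lava",
     "player": {"id": 3, "age": 12}, "level": 2},
    {"action": "death", "timestamp": "2020-05-01T10:05:00Z", "cause": "enemy_attack",
     "player": {"id": 7, "age": 14}, "level": 4,
     "enemy": {"type": "zombie", "power": 10}},
    {"action": "death", "timestamp": "2020-05-01T10:09:00Z", "cause": "enemy_attack",
     "player": {"id": 5, "age": 15}, "level": 4,
     "enemy": {"type": "zombie", "power": 10}},
    {"action": "death", "timestamp": "2020-05-01T10:20:00Z", "cause": "fall",
     "player": {"id": 9, "age": 30}, "level": 1},
    {"action": "death", "timestamp": "2020-05-01T10:31:00Z", "cause": "enemy_attack",
     "player": {"id": 7, "age": 14}, "level": 5,
     "enemy": {"type": "creeper", "power": 25}},
]

# Most common type of death.
most_common_cause = Counter(e["cause"] for e in death_events).most_common(1)[0]

# Average player age at the time of death.
average_age_at_death = mean(e["player"]["age"] for e in death_events)

# Most lethal enemy, skipping events that have no enemy attached.
most_lethal_enemy = Counter(
    e["enemy"]["type"] for e in death_events if "enemy" in e
).most_common(1)[0]

print(most_common_cause, average_age_at_death, most_lethal_enemy)
```

No joins are needed for any of these answers, because everything relevant was copied onto the event when it was recorded.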