Data Modeling Tips & Best Practices

A well-designed data model makes your analytics more powerful, performant, and accessible. This 200-level Data Modeling Guide helps you avoid common beginner mistakes and save time. If you haven’t seen it yet, check out the 100-level Data Modeling Guide too.

This guide will cover:

  • Testing your analytics implementation
  • Including common index names (e.g. day_of_week)
  • Checking for data type mismatches
  • Avoiding systematic deletes
  • Avoiding variable collection and property names
  • Using client-side unique event IDs
  • Avoiding lists of objects
  • Modeling events with multiple items
  • Setting up attribution timeframes

We’d love to get your advice on what could make our product and this documentation simpler or more powerful, so please share your feedback with us.

Test Your Analytics Implementation

Just as you would test any other feature you create and deploy, test your analytics implementation.

A test should assess whether the full data volume you expected was received and whether that data is accurate. Also consider whether your data is structured in such a way that you can compute your key metrics. Run a few queries to understand how it would work.
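For example, here’s a quick volume sanity check in keen-js 3.x-style syntax (the project id, read key, and “pageviews” collection name are placeholders):

    var client = new Keen({
      projectId: "YOUR_PROJECT_ID",
      readKey: "YOUR_READ_KEY"
    });

    var receivedToday = new Keen.Query("count", {
      event_collection: "pageviews",
      timeframe: "today"
    });

    client.run(receivedToday, function (err, res) {
      if (err) throw err;
      // Compare this to the number of events you expect to have sent.
      console.log("events received today:", res.result);
    });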

We also recommend creating a separate project for testing your implementation. We write more about this in the Projects section of the Intro to Event Data guide.

Include Common Index Names (e.g. day_of_week)

If you’re interested in doing analysis by day of week, or any other index, querying your data becomes easier if you send the identifier with the event. The alternative is manually parsing timestamps, which can be a little painful at times.

There are two ways you can achieve this. You can use the Datetime Enrichment or you can manually add index names to your event as seen in the example below.

If you’re interested in doing analysis by day_of_week, month_of_year or even hour_of_day:

  {
    "hour_of_day": 14,   // 2pm
    "day_of_week": 0,    // Sunday
    "month_of_year": 12  // December
  }
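For example, here’s a minimal sketch of deriving these values client-side at send time (the “pageviews” collection name and url property are illustrative):

    var now = new Date();

    keen.addEvent("pageviews", {
      "url": document.location.href,
      "hour_of_day": now.getHours(),       // 0-23
      "day_of_week": now.getDay(),         // 0 = Sunday
      "month_of_year": now.getMonth() + 1  // getMonth() is zero-based
    });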

This would let you count “pageviews” for your blog, grouped by day of the week, over a given timeframe, to help pick the best day or time to publish your next post.

Why use numbers instead of strings? Numbers make sorting query results easier. You can then substitute them in the query response with whichever display-friendly string values you prefer (e.g. “Jan” vs. “January”).

Apply the same philosophy to any organizational dimension you want to group by, such as cohort.

Check for Data Type Mismatch

This is the most common mistake we see. Make sure each property is sent with the right data type. If you have a property such as “number of items” and you pass the value as “4” (a string) rather than 4 (a number), you will not be able to add the values to determine a “total number of items”.
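For example (an illustrative sketch; the “checkouts” collection name is made up):

    // Don't: the quotes make number_of_items a string, so a sum
    // or average over it won't work.
    // keen.addEvent("checkouts", { "number_of_items": "4" });

    // Do: send it as a number.
    keen.addEvent("checkouts", { "number_of_items": 4 });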

Tip: From the Keen UI, in the “Streams” tab of your project, you can view your “Event Streams” (pictured below) and inspect the event properties you’ve collected. Do a quality check to ensure each property has the data type you expect.

[Screenshot: View Your Events]

Avoid Using Deletes Systematically

Keen IO allows you to delete individual events; however, deletes should be reserved for one-off cases rather than regular use. As a best practice, don’t build any workflow that relies on individual deletes. Backtracking through your data is inefficient.

Examples where deletes should be used:

  • one-off events
  • corrupted events
  • unexpected bad data
  • removing sandbox data from production

Please note: Between June 15th and July 10th we will be undergoing a period of scheduled maintenance to make significant enhancements to our core infrastructure. During this time, the delete API will intermittently return a 202 response for valid requests without actually processing them. The Keen IO status page will be kept up to date whenever deletes are not working, but you should expect delete unavailability at any time during this period.

No Variable Event Collection Names!

Collection names and property names should be pre-defined and static; dynamically generated names are a no-no. Here’s an example of what we mean:

Say you are a SaaS company that has many subscribers. You want to track each time your customers publish new content on your site. Here are a couple of different ways you could model this:

Method 1 [WRONG]: One “post” event collection per customer. E.g. collection name = “Posts - Customer 2349283”.

    post = {
      "post": {
        "id": "19SJC039",
        "name": "All that Glitters in this World"
      }
    }
    // Don't do this!
    // Add your event to the "Posts - Customer 2349283" collection. // EW WRONG NO NO
    // keen.addEvent("Posts - Customer 2349283", post) // EW WRONG NO NO NO

Pros:

  • The only benefit to this method is very fast query performance, because you’ve broken your events into many small collections. However, it’s generally NOT worth it given the cons. The best implementations of Keen IO use server-side caching for their customer-facing metrics anyway, so a slightly longer query time isn’t a problem.

Cons:

  • You will have no way to do analysis across all of your customers, e.g. “count the number of posts in the last 7 days”. You would have to run X counts, where X is your number of customers, then add them all up.
  • The user experience of having to look through hundreds or thousands of collections in the Data Explorer would not be great.
  • Your schema will become bloated. Every query you run references your schema. By adding lots of unique collections to the schema, you increase the effort required each time it is referenced.

Method 2 [CORRECT]: One “post” event collection for all customers. E.g. collection = “posts”. Each event in the collection should contain customer_name and customer_id properties so that you can efficiently segment your events by customer. It’s also a great idea to include other info about the customer such as their starting cohort, lifetime number of posts, etc.

    post = {
      "customer": {
        "id": "020939382",
        "name": "Fantastic Fox Parade"
      },
      "post": {
        "id": "19SJC039",
        "name": "All that Glitters in this World"
      }
    }
    // Yes!
    // Add your event to the "posts" collection.
    keen.addEvent("posts", post) // HOORAY

No variable property names! Programmatically generated property names will similarly muck up your data model and lead to query inefficiencies. In the worst-case scenario, you will not be able to perform the queries you need.
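For example (an illustrative sketch with made-up names):

    // Don't: a programmatically generated property name per button.
    // keen.addEvent("clicks", { "clicked_signup_button": true });

    // Do: keep the property name static and move the variable part
    // into the value.
    keen.addEvent("clicks", { "button_id": "signup" });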

Use Client-Side Unique Event IDs

Particularly for mobile and smart device event tracking, where events are often stored offline and can have interesting posting scenarios, we recommend including your own unique event identifier. It’s as simple as adding a property like device_event_id: <generated GUID>.
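For example (a minimal sketch; the UUID generator below is a simple random-based stand-in for whatever GUID utility your platform provides):

    // Attach a client-generated unique id to every event.
    function generateUuid() {
      return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, function (c) {
        var r = Math.random() * 16 | 0;
        var v = c === "x" ? r : (r & 0x3 | 0x8);
        return v.toString(16);
      });
    }

    keen.addEvent("session_start", {
      "device_event_id": generateUuid()
    });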

By specifying a unique client-side id for each event, you can check that your device is not sending duplicate events and that you’re getting all of the events you expect. While you wouldn’t use this property day-to-day, it can be really handy for troubleshooting edge cases. For example, we’ve seen corner cases where batches of events were repeatedly reposted from the device, and instances where there were suspiciously more session_start events than session_end events. The device_event_id is invaluable for determining the root cause of issues like these.

Doesn’t Keen IO already do this?

Keen’s backend API goes to great lengths to ensure that the events that you send are recorded once and only once. That’s why you will find the property keen.id on every single event (internally used to ensure once-only writes). In addition, our open source client libraries, like Android & iOS, also include measures to make sure your batches of events aren’t written twice.

There is always the possibility that your code may generate duplicate events, or that a transmission edge case might cause the event to be sent more than once. In those cases, it’s nice to have a client-side event ID in addition to the keen.id.

Avoid Trapping Your Data: Lists

We recommend that you avoid using lists of objects.

Many filters behave differently when applied to list properties; the “in” and “eq” filters are noteworthy, and it’s worth taking a moment to review their behavior. Other filters and analyses simply won’t make sense for object values in a list.

Sending arrays of objects in an event causes the same kinds of problems.

In the section below, we describe a shopping cart application with multiple items per order and discuss alternatives to modeling the order as a list of objects. It can help you think about how to model similar events that would otherwise involve lists of objects.

Tips for Modeling Events with Multiple Items

A common question we see is how to model something like a shopping cart transaction that contains multiple items. The most obvious solution, one orders collection with one event per shopping cart transaction, is not the best approach: you will not be able to easily see which items are purchased most, because they’re trapped within a list on the shopping cart event.

To avoid this problem, avoid using lists of objects. The best way to model shopping cart transactions is to create two separate collections. One collection contains an event for each item purchased; let’s call it purchased_product. The second collection contains information summarizing the order itself, like the total transaction price, number of items, etc.; let’s call it completed_transaction.

Now a basic count of purchased_product will reveal the total number of items purchased, while also allowing you to obtain counts grouped by product. By creating two separate collections, one for each item and one for the transaction, you have made your data accessible and able to power strong metrics.

Splitting the data into two collections lets you very easily and intuitively run queries about individual products (e.g. “What were the most popular products purchased?”) as well as aggregate metrics on orders (e.g. “What is the average order size?”).

    purchased_product = {
      "product": {
        "id": "5426A",
        "description": "canvas shorts",
        "size": "6",
        "color": "blue",
        "price": 10.00
      },
      "order": {
        "id": "0000001",
        "amount": 49.95,
        "number_of_items": 4
      },
      "seller_id": "293840928343",
      "user": {
        "uuid": "2093859048203943",
        "name": {
          "first": "BoJack",
          "last": "Horseman",
          "full": "BoJack Horseman"
        }
      }
    }

    completed_transaction = {
      "order": {
        "id": "0000001",
        "amount": 49.95,
        "number_of_items": 4
      },
      "seller_id": "293840928343",
      "user": {
        "uuid": "2093859048203943",
        "name": {
          "first": "BoJack",
          "last": "Horseman",
          "full": "BoJack Horseman"
        }
      }
    }

The collection purchased_product should contain one event for each product that was purchased, with information about the product (e.g. product id, description, color, etc), in addition to some properties that link the item to the order it belonged to (e.g. order_id, payment_method, etc).

The second collection completed_transaction should contain one event for every purchase transaction. In these events you track information about the aggregate transaction, like total transaction amount, number of items, etc.
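To make this concrete, here is a sketch of the two queries mentioned above in keen-js 3.x-style syntax (assuming a configured client as shown earlier):

    // "What were the most popular products purchased in the last 30 days?"
    var popularProducts = new Keen.Query("count", {
      event_collection: "purchased_product",
      group_by: "product.id",
      timeframe: "this_30_days"
    });

    // "What is the average order size?"
    var avgOrderSize = new Keen.Query("average", {
      event_collection: "completed_transaction",
      target_property: "order.number_of_items",
      timeframe: "this_30_days"
    });

    client.run(popularProducts, function (err, res) {
      if (err) throw err;
      console.log(res.result); // counts grouped by product.id
    });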

Tips for Setting Up Attribution Timeframes

If you need to filter on “hours since step X”, include “hours since step X” as a property in your events. For example, you might include an event property such as hours_since_signup.

This allows you to run queries like “How many people sent data where hours_since_signup < 48”.
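For example, here’s a minimal sketch of computing and attaching this property at send time (the signup timestamp would come from your own user store; the collection name is illustrative):

    // signupTime comes from your own user record.
    var signupTime = new Date("2016-05-01T12:00:00Z");
    var hoursSinceSignup = Math.floor(
      (Date.now() - signupTime.getTime()) / (1000 * 60 * 60)
    );

    keen.addEvent("feature_used", {
      "hours_since_signup": hoursSinceSignup
    });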

You can then also run funnels like:

  • Step 1 must have been within 30 days of now.
  • Step 2 must be completed within 7 days of Step 1.
  • Step 3 must be completed within 7 days of Step 2.

Next Steps

Now that you are a data modeling champion, you are ready to continue sending events to Keen IO and to compute and visualize your events.

Also, once you have events streaming to Keen IO, you can learn more about the Data Explorer.