Data Modeling Tips & Best Practices

A well-designed data model helps you get the metrics you need from your data set. If your data is structured properly, querying is efficient and accessible. If you haven’t seen it yet, check out the 100-level Data Modeling Guide first.

This 200-level Data Modeling Guide helps you avoid common beginner mistakes and provides tips to save you time.

Test Your Analytics Implementation

Just as you would test any other feature you create and deploy, test your analytics implementation.

A test should include assessing whether the full expected data volume was received, and whether that data is accurate. Also consider whether your data is structured in such a way that you can get at a key metric. Run a few queries to understand how it would work.

Include Common Index Names (e.g. day_of_week)

If you’re interested in doing analysis by day of week, or any other index, querying your data becomes easier if you send the identifier with the event. The alternative is manually parsing timestamps, which can be a little painful at times.

For example, if you’re interested in doing analysis by day_of_week, month_of_year or even hour_of_day:

  {
    "hour_of_day": 14,   // 2pm
    "day_of_week": 0,    // Sunday
    "month_of_year": 12  // December
  }

This would let you count “pageviews” for your blog, grouped by day of the week, over a given timeframe, to help pick the best day or time to publish your next post.

Why use numbers instead of strings? Numbers make sorting query results easier. These values can then be substituted in the query response with whichever display-friendly string values you prefer (e.g. “Jan” vs. “January”).
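If you derive these properties at send time rather than parsing timestamps later, a sketch like the following works (plain JavaScript Date; note that getUTCDay() already returns 0 for Sunday, matching the convention above, while getUTCMonth() is zero-based):

```javascript
// Derive common index properties from a timestamp at send time,
// so queries can group by them without parsing dates later.
function timeIndexes(date) {
  return {
    hour_of_day: date.getUTCHours(),       // 0-23
    day_of_week: date.getUTCDay(),         // 0 = Sunday, as in the example above
    month_of_year: date.getUTCMonth() + 1  // getUTCMonth() is zero-based; 12 = December
  };
}

// Example: Sunday, December 1st 2024, 2:30pm UTC
const indexes = timeIndexes(new Date(Date.UTC(2024, 11, 1, 14, 30)));
// indexes → { hour_of_day: 14, day_of_week: 0, month_of_year: 12 }
```

Attach the returned properties to each event alongside its other data.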

This same philosophy should also be applied with any particular organizational metric you would want to group by, such as cohort.

Check for Data Type Mismatch

This is the most common mistake. Make sure your data is in the right format. If you have a property such as “number of items” and you pass the value as “4”, you will not be able to add the values to determine a “total number of items” because it is a string.
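A quick illustration of why the type matters (plain JavaScript; the number_of_items property name is illustrative):

```javascript
// With string values, "+" concatenates instead of adding:
const asStrings = "4" + "4";  // "44" -- not a usable total
const asNumbers = 4 + 4;      // 8

// Coerce at tracking time so the stored property is numeric:
const event = { number_of_items: Number("4") };
// typeof event.number_of_items → "number"
```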

Tip: In the Keen UI, under your project’s Settings, in the Schema section, you can open the Event Explorer (pictured below) and look at the event properties. Do a quality check to ensure each property has the data type you expect.

Event Explorer

Avoid Using Deletes Systematically

Keen IO allows deletion of individual events; however, deletes should be reserved for one-off cases rather than regular use. As a best practice, do not build any workflow that relies on individual deletes. Backtracking through your data is inefficient.

Examples where deletes should be used:

  • one-off events
  • corrupted events
  • unexpected bad data
  • removing sandbox data from production

No Variable Event Collection Names!

Best practice deems that collection names and property names should be pre-defined and static. Dynamic names are a no-no. Here’s an example of what we mean.

Say you are a SaaS company that has many subscribers. You want to track each time your customers publish new content on your site. Here are a couple of different ways you could model this.

Example: Variable vs Static Collection Names

Method 1 [WRONG]: One “post” event collection per customer. E.g. collection name = “Posts - Customer 2349283”.

    post = {
      "post": {
        "id": "19SJC039",
        "name": "All that Glitters in this World"
      }
    }
    // Don't do this!
    // Add your event to the "Posts - Customer 2349283" collection. // EW WRONG NO NO
    // keen.addEvent("Posts - Customer 2349283", post) // EW WRONG NO NO NO

Pros:

  • The only benefit to this method is that you achieve very fast performance by breaking all of your events into small collections. However, it’s generally NOT worth it given the cons. The best implementations of Keen IO use server-side caching for their customer-facing metrics anyway, so a slightly longer query time isn’t a problem.

Cons:

  • You will have no way to do analysis across all of your customers, e.g. “count the number of posts in the last 7 days”. You would have to run X counts, where X is your number of customers, then add them all up.
  • The Keen workbench will break. It’s simply not designed for customers with hundreds of collections.
  • Your schema will become bloated. Every query you run references your schema. By adding lots of unique collections to the schema, you increase the effort required each time it is referenced.

Method 2 [CORRECT]: One “post” event collection for all customers. E.g. collection = “posts”. Each event in the collection should contain customer_name and customer_id properties so that you can efficiently segment your events by customer. It’s also a great idea to include other info about the customer such as their starting cohort, lifetime number of posts, etc.

    post = {
      "customer": {
        "id": "020939382",
        "name": "Fantastic Fox Parade"
      },
      "post": {
        "id": "19SJC039",
        "name": "All that Glitters in this World"
      }
    }
    // Yes!
    // Add your event to the "posts" collection.
    keen.addEvent("posts", post) // HOORAY

No Variable Property Names!

Programmatically generated property names will similarly muck up your data model and lead to query inefficiencies. In the worst-case scenario, you will not be able to perform the queries you need.
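A minimal sketch of the difference, using a hypothetical button-click event (the property names here are illustrative, not from the original):

```javascript
// Don't: property names generated from data. Every new button adds a
// brand-new property to your schema, and you can't group across buttons.
function badClickEvent(buttonName) {
  return { ["clicked_" + buttonName]: true }; // e.g. clicked_signup
}

// Do: static property names, with the variable part stored as a value.
// Now "count clicks grouped by button_name" is a single query.
function goodClickEvent(buttonName) {
  return { action: "click", button_name: buttonName };
}
```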

Use Client-Side Unique Event IDs

Particularly for mobile and smart-device event tracking, where events are often stored offline and can have interesting posting scenarios, we recommend including your own unique event identifier. It’s as simple as adding a property like device_event_id. By specifying a unique client-side ID for each event, you can check that your device is not sending duplicate events, and that you’re getting all of the events you expect. While you wouldn’t really use this property day-to-day, it can be really handy for troubleshooting edge cases. For example, we’ve seen corner cases where batches of events were repeatedly reposted from a device, and also instances where there were suspiciously more session_start events than session_end events. The device_event_id is really handy for determining the root cause of these issues.

Doesn’t Keen IO already do this?
Keen’s backend API goes to great lengths to ensure that the events that you send are recorded once and only once. That’s why you will find the property “keen.id” on every single event (internally used to ensure once-only writes). In addition, our open source client libraries, like Android & iOS, also include measures to make sure your batches of events aren’t written twice.

There is always the possibility that your code may generate duplicate events, or that a transmission edge case might cause the event to be sent more than once. In those cases, it’s nice to have a client-side event ID in addition to the keen.id.

Avoid Trapping Your Data: Lists

Imagine a scenario where you own a shopping cart application. When you model a purchase event with four items in the cart, you decide to place all four items into the purchase event as a list, along with the number four to represent how many items there were.

With your data modeled this way, you can successfully run a basic count of how many purchase events occurred, and also sum the number of items purchased. However, you cannot easily see which items are purchased most, because they’re trapped within the shopping cart list.
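The “trapped” model might look like this (a hypothetical cart, for illustration):

```javascript
// One event holding a list of items. A count of these events gives
// purchases, and a sum of item_count gives items, but the items
// themselves can't be grouped or counted individually in queries.
const purchase = {
  item_count: 4,
  items: [
    { id: "A1", name: "shorts" },
    { id: "B2", name: "hat" },
    { id: "B2", name: "hat" },
    { id: "C3", name: "socks" }
  ]
};
```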

To avoid this problem: avoid using lists of objects. Create a separate purchase event for each item in the list. Now a basic count of purchase events reveals the total number of items purchased, while still allowing you to get counts grouped by a specific item.

You may still want to capture the fact that a single action resulted in the sale of four items. Create a second collection of purchase transaction events, each recording the total number of items (four in this case). Summing that property will confirm the total number of items purchased.

By creating two separate collections, one for each item and one for the transaction, you have made your data accessible with the ability to create powerful metrics.

Tips for Modeling Events with Multiple Items (e.g. shopping cart transactions)

A common question that we see is how to model something like a shopping cart transaction which contains multiple items. The most obvious solution, creating one collection ‘orders’ with one event per shopping cart transaction, is not quite the best one!

The best way to model shopping cart transactions is to create two separate collections. One collection contains an event for each product purchased. The second collection contains an event summarizing each order itself, with information like the total transaction volume, number of items, etc.

The collection purchased_product should contain one event for each product that was purchased, with information about the product (e.g. product id, description, color, etc), in addition to some properties that link the item to the order it belonged to (e.g. order_id, payment_method, etc).

The second collection, for orders, should contain one event for every purchase transaction. In these events you track information about the aggregate transaction, like total transaction amount, number of items, etc.

Splitting the data into two collections allows you to very easily and intuitively run queries about individual products (e.g. “What were the most popular products purchased?”) as well as aggregate metrics on orders (e.g. “What is the average order size?”). You have now gained more flexibility and power in your queries.

    purchased_product = {
       "product": {
          "id": "5426A",
          "description": "canvas shorts",
          "size": "6",
          "color": "blue",
          "price": 10.00
        },
        "order": {
           "id": "0000001",
           "amount": 49.95,
           "number_of_items": 4
        },
        "seller_id": "293840928343",
        "user": {
           "uuid": "2093859048203943",
           "name": {
             "first": "BoJack",
             "last": "Horseman",
             "full": "BoJack Horseman"
           }
        }
      }

      completed_transaction = {
        "order": {
           "id": "0000001",
           "amount": 49.95,
           "number_of_items": 4
         },
         "seller_id": "293840928343",
         "user": {
           "uuid": "2093859048203943",
           "name": {
             "first": "BoJack",
             "last": "Horseman",
             "full": "BoJack Horseman"
           }
        }
      }
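Putting the two collections together, the tracking call at checkout might look like this sketch (keen.addEvent as in the earlier examples; the function and argument shapes here are assumptions for illustration):

```javascript
// One purchased_product event per item in the cart, plus a single
// completed_transaction event summarizing the order.
// `keen` stands in for the client object used in earlier examples.
function trackCheckout(keen, order, products, user) {
  products.forEach(product => {
    keen.addEvent("purchased_product", { product, order, user });
  });
  keen.addEvent("completed_transaction", { order, user });
}
```

With a cart of four items, this sends four purchased_product events and one completed_transaction event.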

Tips for Setting Up Attribution Timeframes

If you need to filter on “hours since step X”, include “hours since step X” as a property of your events. For example, you might include an event property such as hours_since_signup.

This allows you to run queries like “How many people sent data where hours_since_signup < 48”.
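A sketch of computing the property at send time (plain JavaScript; the event shape and names follow the example above):

```javascript
// Compute hours_since_signup from the user's signup timestamp and
// attach it to the event before sending.
function hoursSinceSignup(signupTimeMs, nowMs) {
  return (nowMs - signupTimeMs) / (1000 * 60 * 60);
}

const signup = Date.UTC(2024, 0, 1, 0, 0);
const eventTime = Date.UTC(2024, 0, 2, 12, 0); // 36 hours later
const event = {
  type: "upgraded_plan",                        // hypothetical event type
  hours_since_signup: hoursSinceSignup(signup, eventTime) // 36
};
```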

You can then also run funnels like:

  • Step 1 must have been within 30 days of now.
  • Step 2 must be completed within 7 days of Step 1.
  • Step 3 must be completed within 7 days of Step 2.

What’s Next?

Sign up for a free account to get started.

Sign Up Free