Data Modeling Guide

Keen IO provides public REST APIs to perform both Data Collection and Data Analysis. We also offer full exporting abilities so you can pump data into your existing analysis workflow and flex your muscles in Excel, Tableau, or Hadoop.

We believe in the power of data to uncover new truths about what’s important in your application. However, in order to truly leverage your data and our analysis suite, you’ll need to put some thought into the types of things you want to record and how you’ll record them. We created this data modeling guide to help you get the most out of your data.

We’d love to get your advice on what could make our product and this documentation simpler or more powerful. Please share your feedback with contact@keen.io.

Projects

The first step in integrating your application with Keen IO is the creation of a project. You can think of a project as a data silo. The data in a project is completely separate from data in other projects.

There are a few scenarios where it makes sense to create multiple projects to logically separate data:

  • If you have more than one application, create a separate project for each app. For example, you might have a project called Eat My Shorts App and another one called CraftMine App.

  • You probably have a production environment and a test environment. It’s a good idea to separate that data to avoid accidentally polluting your production data store with test data. Continuing our example, you would have 4 projects:

    • Eat My Shorts App - Staging
    • Eat My Shorts App - Prod
    • CraftMine - Staging
    • CraftMine - Prod
  • If you run your app on multiple platforms — iOS and Android for example — we recommend storing that data in a shared project rather than creating separate projects. Having your iOS and Android events in a single project will make it much easier to do analysis across platforms. You’ll be able to ask questions like: “How many people are using our app? How many people are using our new feature?” You can always use filters to do comparisons between the platforms — just make sure you include a platform property when sending data.

  • If you have an application with many similar instances, for example an app for restaurants with a different version for each restaurant, email us and we will help you figure out the best way to structure your projects. There may be cases where you want to logically separate data for different companies while at the same time requiring cross-project analysis – we can help!

Events & Event Data

Our database is optimized to store event data. Events are actions that occur at a point in time. These actions can be performed by a user, an admin, a server, a program, etc. Events have properties. Properties are the juicy bits of data that describe what is happening and allow you to do in-depth analysis. When we talk about “event data” we mean events and all the properties that you send along with them.

Here is an example of a purchase event and its properties. There’s a timestamp property that’s automatically included at the top, plus a set of custom properties like item, cost, customer, and store.

{
    "keen": {
        "timestamp": "2012-06-06T19:10:39.205000"
    },
    "item": "sophisticated orange turtleneck with deer on it",
    "cost": 469.5,
    "payment_method": "Bank Simple VISA",
    "customer": {
        "name": "Francis Woodbury",
        "age": 28
    },
    "store": {
        "name": "Yupster Things",
        "city": "San Francisco",
        "address": "467 West Portal Ave"
    }
}

This event is sent to Keen IO using an HTTP POST request to a URL of the following format:

http://api.keen.io/3.0/projects/<project_id>/events/<event_collection>
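As a rough illustration of that URL format, here is a minimal Python sketch that builds (but does not send) the POST request for one event. The helper name `build_event_request` and the placeholder project ID are hypothetical, and authentication is left out because this guide does not describe it.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder; substitute your own project ID.
PROJECT_ID = "YOUR_PROJECT_ID"


def build_event_request(project_id, collection, event):
    """Build an HTTP POST request for a single event without sending it."""
    # Percent-encode the collection name so names with spaces stay valid in a URL.
    safe_collection = urllib.parse.quote(collection, safe="")
    url = f"http://api.keen.io/3.0/projects/{project_id}/events/{safe_collection}"
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


purchase = {"item": "sophisticated orange turtleneck with deer on it", "cost": 469.5}
request = build_event_request(PROJECT_ID, "purchases", purchase)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is omitted here so the sketch runs without network access or credentials.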

Read on for more info on Event Properties and Event Collections!

Event Collections

Event Collections are used to logically organize all the events happening in your application. Events belong in a collection together when they can be described by similar properties. For example, all logins share properties like first name, last name, app version, platform, and time since last login. It makes sense to store all of your logins in an Event Collection called “Logins”.

Logins are just one example of an Event Collection. Here are some more: purchases, social media shares, comments, saves, exits, upgrades, errors, levelups, interactive gestures, modifications, views, signups.

Event collections can have almost any name, but there are a few rules to follow:

  1. The name must be 64 characters or fewer.
  2. It must contain only ASCII characters.
  3. It cannot be a null value.
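
If you want to catch bad names before sending anything, the three rules above can be sketched as a small client-side check. The function name is hypothetical, and the check covers only the rules listed here; the API may enforce others.

```python
def is_valid_collection_name(name):
    """Check a collection name against the three rules listed above."""
    if name is None:                # rule 3: cannot be a null value
        return False
    if not isinstance(name, str):
        return False
    if len(name) > 64:              # rule 1: 64 characters or fewer
        return False
    return name.isascii()           # rule 2: ASCII characters only


print(is_valid_collection_name("Logins"))
print(is_valid_collection_name("x" * 65))
```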

How to Create an Event Collection

Event Collections are created automatically when you send an event to Keen. The event collection name is required in order to send an event. If the event collection name doesn’t exist yet, Keen will automatically create it when your first event is received.

As soon as an Event Collection’s first event is recorded, the collection will be immediately available for analysis via the Keen IO API.

Best Practices for Event Collections

Some things to consider when creating your event collections:

  1. Events in an Event Collection have similar properties. For example, all logins share properties like first name, last name, app version, platform, and time since last login.
  2. Event Collections for a given application share many “global properties”. For example, most events in your application probably share some properties like user ID, app version, and platform. It’s a good planning exercise to identify those properties that you want to include in every Event Collection so you can structure them the same way each time.
  3. When possible, minimize the number of distinct Event Collections. Let’s say you’re analyzing purchases across many devices and you want to compare them. You’ve got purchases from multiple versions of your iPhone app and multiple versions of your iPad app. It’s logical to think of creating separate event collections for each of them, but it’s not the best way. Instead, consider creating a single event collection called Purchases. Each purchase in your event collection shares many properties like item description, unit price, quantity, payment method, and customer. Additionally, you can include properties for DeviceType (iPhone, iPad, etc) and Version (2.4A, 2.4B, 1.3).

Since you’re now tracking those Device & Version properties for every purchase, it’s very easy to do the following:

  • count the total number of purchases across all devices
  • count the total number of purchases where DeviceType equals “iPhone”
  • count the total number of purchases for iPhone app version 2.4A.

Check out the filters page for more information on how to slice and dice your data.
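
To make the three counts above concrete, here is a small in-memory sketch. It only imitates the idea of filtering a single Purchases collection on its properties; it does not call the Keen IO API, and `count_where` is a hypothetical helper.

```python
# A handful of made-up purchase events in one shared collection.
purchases = [
    {"DeviceType": "iPhone", "Version": "2.4A"},
    {"DeviceType": "iPhone", "Version": "2.4B"},
    {"DeviceType": "iPad", "Version": "1.3"},
    {"DeviceType": "iPhone", "Version": "2.4A"},
]


def count_where(events, **equals):
    """Count events whose properties equal every given value."""
    return sum(
        1 for event in events
        if all(event.get(name) == value for name, value in equals.items())
    )


total = count_where(purchases)
iphone = count_where(purchases, DeviceType="iPhone")
iphone_24a = count_where(purchases, DeviceType="iPhone", Version="2.4A")
print(total, iphone, iphone_24a)
```

With separate collections per device and version, the first count would need one query per collection; with shared properties it is a single count with zero, one, or two filters.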

Event Properties

Properties are pieces of information that describe an event and the things related to that event.

When we talk about events and their properties, we are starting to dig into the art of data science. There is no prescription for what events you should record and what properties will be important for your unique application. Rather, you need to think creatively about what information is important to you now, and what might be important in the future.

While we believe that it can’t hurt to have too much information, we have put some practical limits in place. There cannot be more than 1,000 properties per Event Collection. Hitting this limit is usually caused by the dynamic naming of properties, for example creating a property whose name is the current time. That creates a new property for every event you send, since each event is recorded at a different time!
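
Here is a small sketch of why dynamic names are a problem. It counts the distinct property names produced by two ways of recording the same five events; the property names `action` and `occurred_at` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

start = datetime(2012, 6, 6, 19, 10, tzinfo=timezone.utc)
times = [start + timedelta(seconds=i) for i in range(5)]

# Anti-pattern: the property NAME is the current time, so every event adds a new property.
dynamic_events = [{t.isoformat(): "login"} for t in times]

# Preferred: fixed property names; the time goes in the VALUE.
static_events = [{"action": "login", "occurred_at": t.isoformat()} for t in times]


def distinct_property_names(events):
    """Collect every top-level property name used across a list of events."""
    names = set()
    for event in events:
        names.update(event.keys())
    return names


print(len(distinct_property_names(dynamic_events)))
print(len(distinct_property_names(static_events)))
```

The dynamic version grows by one property per event, so it would reach the 1,000 property limit after 1,000 events; the static version stays at two properties no matter how many events you send.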

Here are some things to consider capturing as event properties:

  • Information about the event itself. If your event is a phone call, what number is being called? How many times did the phone ring? Did someone answer?
  • Information about the actor performing the event. For example, if you’re recording a user action, what do you know about the user at that point in time? If possible, record their age, gender, location, favorite coffee shop, or whatever else you know that might be useful for analyzing their behavior later.
  • Information about other actors involved. For example, if your event is a user sharing content with another user, you could record the properties of the recipient. What is their name? To what groups do they belong?
  • Information about the session. How long had your app been running when this event occurred? Is this the user’s first session?
  • Information about the environment. What platform? What hardware? What version of your application?
  • Other relevant information about the “state of the universe”. If you think that sounds vague, I agree with you! Think about anything else that might be handy to know later. If you’re making a farming game, record the items in a user’s garden and their coordinates. You might find some interesting usage patterns. Maybe people who spend over $30 all have statues in their garden; maybe you could add more fancy decorations to the game to entice them to spend more?

Though it might seem counter-intuitive and redundant to send the same information (e.g. user info, platform info) with every event, it will make it much easier for you to segment your data later.
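
One way to keep that repetition manageable is to merge a shared set of properties into every event just before sending. This is a minimal sketch; the helper name and the sample property values are hypothetical.

```python
# Properties we want on every event, whatever the collection (sample values).
GLOBAL_PROPERTIES = {
    "user_id": "u-123",
    "app_version": "2.4A",
    "platform": "iOS",
}


def with_globals(event, global_properties=GLOBAL_PROPERTIES):
    """Return a new event containing the global properties plus the event's own.

    This is a shallow merge: if a name appears in both, the event's value wins.
    """
    merged = dict(global_properties)
    merged.update(event)
    return merged


login = with_globals({"time_since_last_login": 3600})
share = with_globals({"network": "email", "platform": "Android"})
print(login)
print(share)
```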

Feel free to add or remove events and properties from your code at any time. Keen will automatically keep track of whatever you send, and your new properties will be available for analysis immediately.

Properties all have a name and a value. While they can have almost any name, there are a few rules to follow.

Property Name Rules

  1. Must be less than 256 characters long.
  2. There cannot be any periods (.) in the name.
  3. They cannot be a null value.

Property Value Rules

  1. String values must be less than 10,000 characters long.
  2. Numeric values must be between -2^63 (-9223372036854775808) and 2^63 - 1 (9223372036854775807) (inclusive).
  3. Values in lists must themselves follow the above rules.
  4. Values in dictionaries must themselves follow the above rules.
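
The name and value rules above can be sketched as a recursive client-side check. The function names are hypothetical. Two choices here are assumptions rather than documented behavior: names inside nested dictionaries are checked with the same name rules, and value types the rules don't mention (booleans, nulls) are passed through.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1


def valid_name(name):
    """Name rules: not null, less than 256 characters, no periods."""
    return isinstance(name, str) and len(name) < 256 and "." not in name


def valid_value(value):
    """Value rules, applied recursively to lists and dictionaries."""
    if isinstance(value, bool) or value is None:
        return True  # assumption: not restricted by the rules above
    if isinstance(value, str):
        return len(value) < 10_000
    if isinstance(value, (int, float)):
        return INT64_MIN <= value <= INT64_MAX
    if isinstance(value, list):
        return all(valid_value(item) for item in value)
    if isinstance(value, dict):
        return all(valid_name(k) and valid_value(v) for k, v in value.items())
    return True  # other types are outside the scope of this sketch


print(valid_name("customer.name"))
print(valid_value({"customer": {"age": 28, "tags": ["a", "b"]}}))
```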

Property Hierarchy

The nice thing about using JSON as a data format is that you can include LOTS of properties with your events, and you can organize them into a hierarchy.

You can see in the example below that this purchase event has properties that describe the purchase, properties that describe the customer, and properties that describe the store.

The ability to store the properties in this hierarchy makes it much simpler to name the properties. Notice how the customer name and the store name are simply labeled “name”. When you look for these properties in a filter or in your data extract, you’ll find them labeled customer.name and store.name.

{
   "item": "sophisticated orange turtleneck with deer on it",
   "cost": 469.50,
   "payment_method": "Bank Simple VISA",
   "customer": {
       "id": 233255,
       "name": "Francis Woodbury",
       "age": 28,
       "address": {
           "city": "San Francisco",
           "country": "USA"
       }
   },
   "store": {
       "name": "Yupster Things",
       "city": "San Francisco",
       "address": "467 West Portal Ave"
   }
}

This is a simple example — your hierarchy can have as many levels and properties as you want!
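
To see how nested properties map to dotted names like customer.name and store.name, here is a minimal sketch that flattens an event locally. The `flatten` helper is hypothetical and only illustrates the naming; Keen builds these names for you.

```python
def flatten(event, prefix=""):
    """Flatten nested dictionaries into {"dotted.path": value} pairs."""
    flat = {}
    for key, value in event.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


purchase = {
    "item": "sophisticated orange turtleneck with deer on it",
    "customer": {
        "name": "Francis Woodbury",
        "address": {"city": "San Francisco", "country": "USA"},
    },
    "store": {"name": "Yupster Things"},
}

for name in sorted(flatten(purchase)):
    print(name)
```

This also shows why periods are not allowed inside property names: the period is what separates the levels of the hierarchy.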

Property Data Types

Keen IO supports a variety of data types (integer, string, array, etc). Keen automatically infers the data types of your event properties based on the data you send. Some properties, such as timestamp and geo-location, require you to use a specific property name. Arrays may only contain the supported primitive types, not additional JSON key value objects.

Inferred Data Types

Keen IO automatically infers your event property’s data type. The possible data types are:

  • string - string of characters
  • number - number or decimal
  • boolean - either true or false
  • array - collection of data points of like data types

You will have different filtering options for different properties. That’s because Keen automatically detects the relevant filtering operators based on your property’s data type. For example, you won’t have the option to apply a greater than or less than filter to a boolean property with only TRUE or FALSE property values. (That would be super confusing!) For a list of the possibilities, check out filters.

You can easily check your data’s property types using the event explorer in Keen IO, or you can do it via API.
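
If you’d like a quick local check before sending, the sketch below maps Python values to the four type names listed above. It is only an approximation based on JSON types; the actual inference happens on Keen’s side and may differ, and `inferred_type_name` is a hypothetical helper.

```python
def inferred_type_name(value):
    """Approximate the data type name a JSON-serialized value would map to."""
    # bool must be tested before int, because True and False are also ints in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "other"


event = {"cost": 469.5, "quantity": "4", "gift": False, "tags": ["sale", "new"]}
for name, value in event.items():
    print(name, inferred_type_name(value))
```

Note that `"4"` is reported as a string, which is exactly the kind of mismatch worth catching early.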

Arrays

You can store arrays as values in Keen events. There are a few things to know when using them.

Many filters behave differently when working with arrays. It’s worth taking a moment to look these over. The “in” and “eq” filters are noteworthy. Other filters and analyses will also not make sense for array values.

Arrays of objects are not recommended.

Note that group-by will group together identical arrays only. By identical we mean same elements in the same order.
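
Here is a small local illustration of that grouping rule. It does not call Keen; it simply groups on the exact sequence of elements, which is the behavior described above.

```python
from collections import Counter

events = [
    {"tags": ["red", "blue"]},
    {"tags": ["red", "blue"]},
    {"tags": ["blue", "red"]},  # same elements, different order
]

# Tuples compare element by element, in order, so they model "identical arrays".
groups = Counter(tuple(event["tags"]) for event in events)
for tags, count in groups.items():
    print(list(tags), count)
```

If order shouldn’t matter for your analysis, one option is to sort the array before sending the event so that equivalent arrays become identical.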

Timestamp Data Type

Two time-related properties are included in your event automatically. The properties “keen.timestamp” and “keen.created_at” are set at the time your event is recorded. You have the ability to overwrite the keen.timestamp property. This could be useful, for example, if you are backfilling historical data. Be sure to use ISO-8601 Format.

Note

Keen stores all date and time information in UTC!

Here’s an example “pageview” event showing the keen timestamp properties:

{
    "keen": {
        "created_at": "2012-12-14T20:24:01.123000+00:00",
        "timestamp": "2012-12-14T20:24:01.123000+00:00",
        "id": "asd9fadifjaqw9asdfasdf939"
    },
    "device": {
        "OS": "Mac",
        "name": "Chrome",
        "version": 23
    },
    "page": "Intro to Analytics Course Page"
}

ISO-8601 Format

ISO-8601 is an international standard for representing time data. The format is as follows:

{YYYY}-{MM}-{DD}T{hh}:{mm}:{ss}.{SSS}{TZ}
  • YYYY: Four digit year. Example: “2012”
  • MM: Two digit month. Example: January would be “01”
  • DD: Two digit day. Example: The first of the month would be “01”
  • hh: Two digit hour. Example: The hours for 12:01am would be “00” and the hours for 11:15pm would be “23”
  • mm: Two digit minute.
  • ss: Two digit seconds.
  • SSS: Milliseconds to the third decimal place.
  • TZ: Time zone offset. Specify a positive or negative offset from UTC. To specify UTC, add “Z” to the end. Example: To specify Pacific time (UTC-8 hours), you should append “-0800” or “-08:00” to the end of your date string.

Note

If no time zone is specified, the date/time is assumed to be in local time. At Keen, we’ll treat that as UTC.

Example ISO-8601 date strings:

2012-01-01T00:01:00-08:00
1996-02-29T15:30:00+12:00
2000-05-30T12:12:12Z
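
In Python, the standard library can produce strings in this format. The sketch below builds a timestamp with an explicit offset, converts it to UTC, and places it in keen.timestamp as you might when backfilling. The `page` value is sample data, and the fixed -8 hour offset ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# A fixed UTC-8 offset, used here to stand in for Pacific time.
pacific = timezone(timedelta(hours=-8))
local_time = datetime(2012, 1, 1, 0, 1, 0, tzinfo=pacific)

# ISO-8601 with the original offset.
iso_local = local_time.isoformat()

# The same instant expressed in UTC, with milliseconds.
iso_utc_ms = local_time.astimezone(timezone.utc).isoformat(timespec="milliseconds")

# Overwrite keen.timestamp when backfilling a historical event.
backfilled_event = {
    "keen": {"timestamp": iso_utc_ms},
    "page": "Intro to Analytics Course Page",
}

print(iso_local)
print(iso_utc_ms)
```

Always attaching a time zone to the datetime avoids the ambiguity described in the note above, where a string without an offset is treated as UTC.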

Geo Data Type

You can use the “keen.location” property to record latitude and longitude coordinates for your event. Geo coordinates should be specified as an array with longitude first and latitude second, both in decimal format (up to 6 decimal places).

Recording these coordinates enables you to do Geo Filtering.

By the way, the Keen IO iOS library automatically records geo coordinates for your events.

Here’s an example of a “checkin” event which includes the location property.

{
    "keen": {
        "timestamp": "2012-12-14T20:24:01.123000+00:00",
        "location": {
            "coordinates": [-88.21337, 40.11041]
        }
    },
    "user": {
        "name": "Smacko",
        "age": 21
    },
    "place": "Urbana party house"
}

Data Modeling Tips & Best Practices

A well-designed data model can help you access the metrics you need from your data set. If your data is structured properly, querying can be efficient and accessible. In this section you can find common mistakes to avoid and tips to save time.

Test Your Analytics Implementation

Just as you would test any other feature you create and deploy, test your analytics implementation.

A test should include assessing whether the full data volume expected was received, and whether that data is accurate. Also consider whether your data is structured in such a way that will allow you to get a key metric. Make a few queries to understand how it would work.

Client-Side Unique Event IDs

Particularly for mobile and smart device event tracking, where events are often stored offline and can have interesting posting scenarios, we recommend including your own unique event identifier. It’s as simple as adding a property like device_event_id: <generated GUID>. By specifying a unique client-side id for each event, you can check to make sure that your device is not sending duplicate events, and that you’re getting all of the events you expect. While you wouldn’t really use this property day-to-day, it can be really handy for troubleshooting edge cases. For example, we’ve seen corner cases where batches of events were repeatedly reposted from the device, and also instances where there were suspiciously more session_start events than session_end events. The device_event_id is really handy for determining root cause in these issues.

Doesn’t Keen IO already do this?

Keen’s backend API goes to great lengths to ensure that the events that you send are recorded once and only once (that’s why you will find the property “keen.id” on every single event - internally we use this to ensure once-only writes). Our open source client libraries, like Android & iOS, also include measures to make sure your batches of events aren’t written twice. However, there is always the possibility that your code may generate duplicate events, or that a transmission edge case might cause the event to be sent more than once. In those cases, it’s nice to have a client-side event ID in addition to the keen.id.
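
Here is a minimal sketch of the idea, using Python’s standard `uuid` module to generate the device_event_id and a simple counter to spot events that arrived more than once. The helper names are hypothetical.

```python
import uuid
from collections import Counter


def with_event_id(event):
    """Return a copy of the event with a freshly generated device_event_id."""
    tagged = dict(event)
    tagged["device_event_id"] = str(uuid.uuid4())
    return tagged


def duplicate_ids(events):
    """Return the device_event_id values that appear more than once."""
    counts = Counter(event["device_event_id"] for event in events)
    return sorted(event_id for event_id, n in counts.items() if n > 1)


start = with_event_id({"type": "session_start"})
end = with_event_id({"type": "session_end"})

# Simulate a batch that was reposted: the start event arrives twice.
received = [start, end, start]
print(duplicate_ids(received))
```

The ID should be generated once, when the event is created on the device, and stored with the event so that a retried upload carries the same value.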

Avoid Using Deletes Systematically

Keen IO allows for deletion of individual events; however, deletes should be used in one-off cases rather than in regular use. We don’t recommend building any workflow that relies on individual deletes. Backtracking through your data is inefficient.

Examples where deletes should be used:
  • one-off events
  • corrupted events
  • unexpected bad data
  • removing sandbox data from production

Avoid Data Type Mismatch

This is the most common mistake. Make sure your data is in the right format. If you have a property such as “number of items” and you pass the value as “4”, you will not be able to add the values to determine a “total number of items” because it is a string.
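
The sketch below shows the problem locally: summing string values fails, while the same values sent as numbers add up as expected. The `normalize` helper is hypothetical; the point is to convert values before the event is sent.

```python
raw_events = [{"number_of_items": "4"}, {"number_of_items": "2"}]

# Strings cannot be added up as numbers.
try:
    sum(event["number_of_items"] for event in raw_events)
    strings_summed = True
except TypeError:
    strings_summed = False


def normalize(event):
    """Return a copy of the event with number_of_items as an integer."""
    fixed = dict(event)
    fixed["number_of_items"] = int(fixed["number_of_items"])
    return fixed


clean_events = [normalize(event) for event in raw_events]
total_items = sum(event["number_of_items"] for event in clean_events)
print(strings_summed, total_items)
```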

Tip: From the Keen UI on the Project Overview tab (pictured below), you can go to the Event Explorer and then look at the event properties. Do a quality check to ensure the object has the data type you expect.

[Image: the Event Explorer in the Keen UI]

Include Common Index Names (e.g. day_of_week)

If you’re interested in doing analysis by day of week, or any other index, querying your data becomes easier if you send the identifier with the event. The alternative is manually parsing timestamps, which can be a little painful at times.

For example, if you’re interested in doing analysis by day_of_week, month_of_year or even hour_of_day:

{
  "hour_of_day": 14,   // 2pm
  "day_of_week": 0,    // Sunday
  "month_of_year": 12  // December
}

This would let you count “pageviews” for your blog, grouped by day of the week, over a given timeframe, to help pick the best day or time to publish your next post.

Why use numbers instead of strings? This makes sorting query results easier. These values can then be substituted in the query response with whichever display-friendly string values you prefer (e.g. “Jan” vs. “January”).

This same philosophy should also be applied with any particular organizational metric you would want to group by, such as cohort.
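
These index properties can be computed on the client when the event is created. The sketch below follows the numbering used in the example above, where Sunday is 0; the helper name is hypothetical.

```python
from datetime import datetime, timezone


def time_index_properties(moment):
    """Derive numeric time indexes from a datetime."""
    return {
        "hour_of_day": moment.hour,               # 0-23
        # isoweekday() is Monday=1 ... Sunday=7; modulo 7 makes Sunday=0 ... Saturday=6.
        "day_of_week": moment.isoweekday() % 7,
        "month_of_year": moment.month,            # 1-12
    }


# Sunday, 16 December 2012 at 2pm UTC.
moment = datetime(2012, 12, 16, 14, 0, tzinfo=timezone.utc)
print(time_index_properties(moment))
```

Decide up front which time zone these indexes refer to (UTC or the user’s local time), since the same instant can fall on different days in each.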

No Variable Event Collection Names!

As a best practice, collection names and property names should be pre-defined and static. Dynamic names are a no-no. Here’s an example of what we mean.

Say you are a SaaS company that has many subscribers. You want to track each time your customers publish new content on your site. Here are a couple of different ways you could model this.

Example: Variable vs Static Collection Names

Method 1 [WRONG]: One “post” event collection per customer. E.g. collection name = “Posts - Customer 2349283”.

import keen  # Keen IO Python client; assumes your credentials are already configured

post = {
  "post": {
    "id": "19SJC039",
    "name": "All that Glitters in this World"
  }
}
# Add your event to the "Posts - Customer 2349283" collection. EW WRONG NO NO
keen.add_event("Posts - Customer 2349283", post)  # EW WRONG NO NO NO

Pros:
  • The only benefit to this method is that you achieve very fast performance by breaking all of your events into small collections. However, it’s generally NOT worth it given the cons. The best implementations of Keen IO use server-side caching for their customer-facing metrics anyway, so slightly longer query time isn’t a problem.
Cons:
  • You will have no way to do analysis across all of your customers. E.g. “count the number of posts in the last 7 days”. You would have to run X counts where X is your number of customers, then add them all up.
  • The Keen workbench will break. It’s simply not designed for customers with hundreds of collections.
  • Your schema will become bloated. Every query you run references your schema. By adding lots of unique collections to the schema, you increase the effort required each time it is referenced.

Method 2 [CORRECT]: One “post” event collection for all customers. E.g. collection = “posts”. Each event in the collection should contain customer_name and customer_id properties so that you can efficiently segment your events by customer. It’s also a great idea to include other info about the customer such as their starting cohort, lifetime number of posts, etc.

import keen  # Keen IO Python client; assumes your credentials are already configured

post = {
  "customer": {
    "id": "020939382",
    "name": "Fantastic Fox Parade"
  },
  "post": {
    "id": "19SJC039",
    "name": "All that Glitters in this World"
  }
}
# Add your event to the "posts" collection.
keen.add_event("posts", post)  # HOORAY

Note

No variable property names! Programmatically generated property names will similarly muck up your data model and lead to query inefficiencies. In the worst-case scenario, you will not be able to perform the queries you need.

Avoid Trapping Your Data: Lists

Imagine a scenario where you own a shopping cart application. As you model a purchase event with four things in the cart, you decide to place all four items from the shopping cart into the purchase event as a list along with the number four to represent how many items there were. This one purchase event contains a list of all four objects and the number four.

As your data is modeled this way, you will successfully be able to do a basic count to count how many purchase events occur, and also sum the number of items purchased. However, you cannot easily see what the most purchased items are because they’re trapped within the shopping cart list object.

To avoid this problem: avoid using lists of objects. Create a separate purchase event for each item in the list. Now, a basic count of purchase events will reveal the total number of items purchased, while allowing you to obtain counts grouped by a specific item.

You may still want to capture the fact that there was a single action resulting in the sale of four items. Create a second collection for purchase transactions, with one event per transaction that records the number of items it contained (four in this case). Summing that number across purchase transactions will confirm the total number of items purchased.

By creating two separate collections, one for each item and one for the transaction, you have made your data accessible with the ability to create powerful metrics.

Tips for Modeling Events with Multiple Items (e.g. shopping cart transactions)

A common question that we see is how to model something like a shopping cart transaction which contains multiple items. The most obvious solution, to create one collection ‘orders’ with one event per shopping cart transaction, is not quite the best one!

The best way to model shopping cart transactions is to create two separate collections. One collection contains an event for each product purchased. The second collection contains information summarizing the order itself, like the total transaction volume, number of items, etc.

The collection purchased_product should contain one event for each product that was purchased, with information about the product (e.g. product id, description, color, etc), in addition to some properties that link the item to the order it belonged to (e.g. order_id, payment_method, etc).

The second collection, for orders, should contain one event for every purchase transaction. In these events you track information about the aggregate transaction, like total transaction amount, number of items, etc.

Splitting the data in two collections allows you to very easily and intuitively run queries regarding both individual products (e.g. What were the most popular products purchased) as well as aggregate metrics on orders like “what is the average order size?”. You have now gained more flexibility and power in your queries.

purchased_product = {
    "product": {
        "id": "5426A",
        "description": "canvas shorts",
        "size": "6",
        "color": "blue",
        "price": 10.00
    },
    "order": {
        "id": "0000001",
        "amount": 49.95,
        "number_of_items": 4
    },
    "seller_id": "293840928343",
    "user": {
        "uuid": "2093859048203943",
        "name": {
            "first": "Marge",
            "last": "Simpson",
            "full": "Marge Simpson"
        }
    }
}

completed_transaction = {
    "order": {
        "id": "0000001",
        "amount": 49.95,
        "number_of_items": 4
    },
    "seller_id": "293840928343",
    "user": {
        "uuid": "2093859048203943",
        "name": {
            "first": "Marge",
            "last": "Simpson",
            "full": "Marge Simpson"
        }
    }
}
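
As a rough sketch of how a cart could be split into these two kinds of events, the helper below takes a list of items and returns one event per product plus one event for the whole order. The function name and the sample items are hypothetical, and the order amount is simply assumed to be the sum of the item prices.

```python
from collections import Counter


def split_order(order_id, seller_id, user, items):
    """Return (purchased_product events, completed_transaction event) for one cart."""
    order = {
        "id": order_id,
        "amount": round(sum(item["price"] for item in items), 2),
        "number_of_items": len(items),
    }
    # Properties shared by every event that belongs to this order.
    shared = {"order": order, "seller_id": seller_id, "user": user}
    purchased_products = [{"product": item, **shared} for item in items]
    completed_transaction = dict(shared)
    return purchased_products, completed_transaction


user = {"uuid": "2093859048203943", "name": {"full": "Marge Simpson"}}
items = [
    {"id": "5426A", "description": "canvas shorts", "price": 10.00},
    {"id": "5426A", "description": "canvas shorts", "price": 10.00},
    {"id": "7731B", "description": "striped socks", "price": 9.95},
    {"id": "9012C", "description": "sun hat", "price": 20.00},
]

products, transaction = split_order("0000001", "293840928343", user, items)

# Per-product question: which product was purchased most often?
most_popular = Counter(event["product"]["id"] for event in products).most_common(1)
print(len(products), transaction["order"]["amount"], most_popular)
```

Each product event would go to the purchased_product collection and the single transaction event to the orders collection, so both per-product and per-order questions can be answered with simple counts and sums.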

So, what are you waiting for? It only takes a few minutes and a few lines of code to start collecting the events that really matter to you.
