So you’ve decided to take the plunge and build an in-house analytics system for your company. Maybe you’ve outgrown Google Analytics and Mixpanel, or maybe you’re an early-stage business with unique analytics needs that can’t be solved by existing software. Whatever your reasons, you’ve probably started to write up some requirements, fired up an IDE, and are ready to start cranking out some code.
At Keen we began this process several years ago and we’ve been iterating on it ever since, having successes and stumbles along the way. We wanted to share some of the lessons we learned to help you through the build process.
Today we’ll give an overview of key areas to consider when building an in-house analytics system. We’ll follow up with detailed posts on these areas in the weeks to come.
Before you build your in-house analytics system, you need to consider what inputs will be coming into it, both expected and unexpected. Assuming you already know what kinds of data you want to track and what your data model will look like, here are a few things to think about:
- Traffic variability
- Rate limiting and traffic management
- Good old-fashioned input validation
Each of these concerns needs to be addressed properly to make sure that your users get a solid experience. Most of them go quite a bit beyond checking inputs to a function.
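To make the input validation point concrete, here is a minimal sketch of checking an incoming event before it is accepted. The limits (`MAX_PROPERTIES`, `MAX_STRING_LENGTH`), the `timestamp` property, and the `validate_event` function are all assumptions made for illustration; a real system would choose its own rules based on its data model.

```python
from datetime import datetime

MAX_PROPERTIES = 100        # hypothetical cap on properties per event
MAX_STRING_LENGTH = 10_000  # hypothetical cap on the length of any string value


def validate_event(event):
    """Return a list of problems with an incoming event; an empty list means valid."""
    if not isinstance(event, dict):
        return ["event must be a JSON object"]

    errors = []
    if len(event) > MAX_PROPERTIES:
        errors.append(f"too many properties: {len(event)} > {MAX_PROPERTIES}")

    for key, value in event.items():
        # Property names must be non-empty strings.
        if not isinstance(key, str) or not key:
            errors.append(f"invalid property name: {key!r}")
        # Oversized strings are rejected rather than silently truncated.
        if isinstance(value, str) and len(value) > MAX_STRING_LENGTH:
            errors.append(f"value for {key!r} is too long")

    # The timestamp is optional here, but must parse if it is present.
    timestamp = event.get("timestamp")
    if timestamp is not None:
        try:
            datetime.fromisoformat(timestamp)
        except (TypeError, ValueError):
            errors.append("timestamp must be an ISO 8601 string")

    return errors
```

Returning a list of problems instead of raising on the first one lets the API report everything wrong with a payload in a single response.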
We’ve all heard about defensive programming, validating inputs, and script injection. When you build a public-facing analytics system there are many kinds of malicious input, and some are much harder to spot than others. Defending against a DDoS event requires architectural decisions about what counts as an acceptable load profile. Rate limiting is heavily informed by the sort of business or service you want to run, and by the level of service you want to give certain users.
Some questions to ask: Are all users equal? Do certain users somehow need to be treated differently from others? Considering these questions in advance will help you build the right system for your users’ needs.
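One common way to treat users differently is a token bucket per user, with the refill rate and burst size set by the user’s tier. The sketch below is one possible approach under assumed names, not a description of any particular production system: the tier names and numbers in `TIER_LIMITS` are hypothetical, and the state lives in process memory where a real deployment would likely use a shared store.

```python
import time

# Hypothetical tiers: (requests refilled per second, burst capacity).
TIER_LIMITS = {"free": (5.0, 10.0), "paid": (50.0, 100.0)}


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill based on elapsed time, never exceeding the burst capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Keeps one bucket per user, sized according to the user's tier."""

    def __init__(self, tier_limits=TIER_LIMITS):
        self.tier_limits = tier_limits
        self.buckets = {}

    def allow(self, user_id, tier, now=None):
        bucket = self.buckets.get(user_id)
        if bucket is None:
            rate, capacity = self.tier_limits[tier]
            bucket = self.buckets[user_id] = TokenBucket(rate, capacity, now)
        return bucket.allow(now)
```

The optional `now` argument exists only so the limiter can be exercised with a fake clock; callers would normally omit it and let `time.monotonic()` supply the time.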
Today, almost all web applications require developers to select at least one storage solution, and this is an especially important consideration for an in-house analytics system. Some key questions to consider are:
- What sort of scale are you looking to support?
- What is the relationship between reads/writes?
- Are you trying to build a forever solution or something for right now?
- How well do you know the technology?
- How supportive is the community?
The better prepared you are to answer these questions, the more successful your solution will be.
At Keen we use Cassandra as our primary data store and have a few other storage solutions for rate limiting, application data, and so on. We chose Cassandra as our primary store because of its performance and availability characteristics. Another deciding factor was how well it scales with writes when the data volume gets very large. We will discuss this in more depth in a future post.
There are more technologies available to developers today than ever before. How do you know which ones will work best for your analytics needs? What OS do you use? What caching technologies?
At Keen we have gone through this process numerous times as we built and scaled our analytics platform. One recent example was selecting the language for two of the systems in our middleware layer: caching and query routing. These are fairly well-studied problems that don’t require bleeding-edge technologies to solve well.
Here are the criteria we used to make our selection:
- We needed a mature toolchain that would allow us to predictably troubleshoot and deploy our software
- We needed a language that was statically typed and concise
- We did not need everyone to have prior knowledge of the language (since we didn’t have an existing codebase to build on top of)
With these factors in mind, we ended up eyeing the Java Virtual Machine (JVM). The toolset is mature, performance is adequate, and the platform is very predictable and has a large set of frameworks to solve common problems. However, we didn’t want to develop in Java as it tends to be overly verbose for our needs.
In the end we decided to use Scala. It runs on the JVM so we get all of the benefits of the mature toolchain, but we are able to avoid the extra verbosity of the Java language itself. We were able to build a few services with Scala with quick results and have been very happy with both the language and the tooling around it.
Querying + Visualization
Once you’ve figured out where your data will live, you will need to decide how to give your teams access to it. What will reporting look like? Will you build a query interface teams can use to run their own analysis? Will you create custom dashboards for individual teams: product, marketing, and sales?
Ok, so now your service is up and running, you are providing value to your teams, and business is up and to the right. Unfortunately you have a team member who isn’t particularly happy with query performance. “Why are my queries slow?” they ask.
You now have to dig in to understand why it is taking so long to serve a query. This feels odd because you specifically chose technologies that scale well, and a month ago performance was blazingly fast.
Where do you start? In most analytics solutions there are a number of systems involved in serving a request. There is usually an inbound write queue, some query dispatching mechanism, an HTTP API layer, various tiers for request processing, storage layers, and so on. It is critical to be able to trace a request end to end, to monitor the aggregate performance of each component of the system, and to understand total response times.
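As a rough illustration of tracing a request end to end, the sketch below tags a request with an ID and records how long it spends in each component. The `Trace` class and the component names in the usage comment are hypothetical, and the spans are assumed to run one after another rather than nested.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Records how long one request spends in each component it passes through."""

    def __init__(self, clock=time.perf_counter):
        self.request_id = uuid.uuid4().hex  # lets logs from every tier be correlated
        self.spans = []                     # list of (component, seconds) tuples
        self._clock = clock

    @contextmanager
    def span(self, component):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the component raised an exception.
            self.spans.append((component, self._clock() - start))

    def total(self):
        """Total time across all recorded spans (assumes spans are not nested)."""
        return sum(seconds for _, seconds in self.spans)

    def slowest(self):
        """The (component, seconds) pair that took the longest."""
        return max(self.spans, key=lambda span: span[1])


# Usage, with hypothetical component names:
#
#     trace = Trace()
#     with trace.span("http_api"):
#         ...  # parse and authorize the request
#     with trace.span("storage"):
#         ...  # read from the data store
#     print(trace.request_id, trace.slowest(), trace.total())
```

In a distributed system the request ID would also need to be passed along to downstream services so that each one can attach its own timings to the same trace.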
At Keen we have invested in all of these areas to ensure we have real-time visibility into performance of the service. Here’s an overview of our process:
- Monitor each physical server and each component
- Monitor end-to-end performance
- Build internal systems that trace requests throughout our stack
- Build auto-detection for performance issues that notifies a human Keen engineer to investigate further
This investigation process leverages our JVM tools, along with various custom tools and testing environments that help us quickly pinpoint and fix the problem when the system is underperforming.
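Auto-detection can start out very simple. The sketch below keeps a rolling window of recent query latencies and signals when the 95th percentile crosses a threshold. It is an illustrative assumption rather than a description of Keen’s tooling: the `LatencyMonitor` class, the window size, and the threshold are all hypothetical, and the actual notification step is left to the caller.

```python
from collections import deque


class LatencyMonitor:
    """Flags a component when its recent p95 latency exceeds a threshold."""

    def __init__(self, threshold_ms, window=100, min_samples=20):
        self.threshold_ms = threshold_ms
        self.min_samples = min_samples
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def p95(self):
        """Nearest-rank 95th percentile of the window, or None if it is empty."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
        return ordered[rank - 1]

    def should_alert(self):
        # Avoid alerting on a handful of samples right after startup.
        if len(self.samples) < self.min_samples:
            return False
        return self.p95() > self.threshold_ms
```

Watching a high percentile rather than the mean is a deliberate choice here: a small fraction of very slow queries can leave the average looking healthy while some users wait a long time.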
Yep, Murphy’s Law is actually a thing: “If something can go wrong, it will.” Inevitably pieces of your analytics solution will have issues, if not the whole system itself. We touched on this in the troubleshooting section, but there are much larger issues you will need to think through, such as:
- How are you laying out your servers in the network?
- How do you deal with data corruption or data loss?
- What is your backup and recovery timeline and strategy?
- What happens when a critical team member moves on to another role or company?
Imagine these scenarios. Maybe you were using FoundationDB, only to have it scooped up by Apple, and now you are trying to figure out how this impacts you. Maybe someone was expanding storage and took down all your load balancers because your machines weren’t labeled correctly. Maybe your sticks of memory went bad. Maybe Level3 just went down and took your whole service offline.
These represent just a few of the issues you will likely run into as you run your own service. How well you can deal with them will help define how well you can serve your customers.
Stay tuned for more details
Over the next few months we will release in-depth posts covering each of the areas above to help you build a successful in-house analytics system. We look forward to sharing our thoughts and lessons we learned building out our service.
Want an alternative to build-it-yourself analytics?
We went through all the work of building an analytics infrastructure so you don’t have to. Our APIs for collecting, querying, and visualizing data let you get in-house analytics up and running fast, so you can give your team and your customers the data they need.