An Update on Query Durations

We wanted to provide an update on the state of Keen’s query performance. After some rough patches in February and March, we’ve made significant progress in stabilizing queries.

However, query durations are still not as fast as they were, say, three or four months ago. We understand this continues to be frustrating for customers who built solutions that relied on those faster query times. We want queries to be faster too, and hold ourselves to a very high standard when it comes to reliability & performance. It pains us to limit your experience. As part of our commitment to transparent communication, we wanted to increase your awareness of what we’re doing to address the situation.

Why are my queries slow?

There is no single reason for these query duration issues, but they are generally related to the challenges of rapidly scaling our service. To be perfectly transparent, many of you are running fast-growing companies, and your data & query volumes are growing with you. On top of being fantastic, committed, and growing customers, many of you have also recommended Keen to new developers. As a result, Keen usage has consistently grown (and continues to grow) 20% month over month. Scaling to support you is the challenge we signed up for, and we’re happy to do it, but you wouldn’t be paying us if it weren’t indeed a challenge.
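To put that growth rate in perspective, here is a quick back-of-the-envelope sketch of how 20% month-over-month growth compounds. It assumes the rate holds steady every month, which is a simplification, and the function name `compound_growth` is only illustrative.

```python
def compound_growth(monthly_rate: float, months: int) -> float:
    """Return the multiple of the starting volume after `months` of compounding."""
    return (1 + monthly_rate) ** months


# Assumption: a constant 20% monthly growth rate, as a rough illustration only.
for months in (3, 6, 12):
    print(f"{months:>2} months: {compound_growth(0.20, months):.2f}x")
```

Under that assumption, volume roughly triples in six months and reaches close to nine times its starting level after a year, which is why capacity that felt comfortable a few months ago can become a bottleneck quickly.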

Although platform companies like ours would love to say it isn’t the case, another factor that leads to spikes in query duration is individual users dramatically exceeding the standard query load (aka noisy neighbors). We have already drastically improved, and continue to improve, our ability to detect and protect the platform from these types of use cases, and to work with these customers to find the right solution for their needs. It’s our job to ensure noisy neighbors don’t impact your experience, and we’re committed to that. We don’t want to pretend like that isn’t a challenge either, though.

For those interested in the technical challenges (and triumphs!) of building distributed systems, we plan to write more to explain individual bottlenecks we have encountered with various pieces of the pipeline infrastructure.

What are you doing to resolve this?

Currently, we are significantly strengthening nearly every major internal system we rely on. To get a bit more technical, enhancements to our ZooKeeper installation are wrapping up. Capacity expansion to our Storm cluster is underway. Our Cassandra data model is being reworked to address costly hot spots. And we’ve further rationalized our internal DNS, which will ease deployment and maintenance.

In addition, we now have even more powerful internal tools for performance profiling and benchmarking. We will also be rolling out better service protection in the coming weeks. Structurally, we are looking to significantly expand the size of the platform engineering team (there were only 6 of us until recently; now we have 8 and our team is still growing).

Finally, we didn’t set out to build a company just to see how fast we could grow it. There is no point in scaling Keen bigger and faster if it comes at your expense. The trust of our customers is our most precious asset. So, as another protection to our customers (and our team), we’ve decided to put on hold several new, very large potential customers. Longevity, stability, and sustainability are far more important to us than fast growth.

How can I stay up to date?

You can check our status page for regular updates on performance metrics and query durations.

We also suggest checking out our Query Performance Guide. The guide contains some great tips on how to optimize your queries. In addition, our beta Query Caching feature is now ready for general availability. If you’re interested in significantly increasing performance and consistency for queries that are used repeatedly, please reach out to us and we can enable this new feature for you.

What’s the timeline?

While we continue to work through this rough patch, queries will not improve in one fell swoop. All queries will continue to be slower than normal over the next few weeks. Please be aware that Extractions and Percentiles are particularly slow at present.

Achieving our performance standards in speed, dependability, and scalability is our top priority. We believe that the investments we are making over the next few weeks will pay off and your query performance will improve, not just in the near term but well into the future as our volume continues to grow.

As always, your patience and understanding are greatly appreciated. We can’t imagine building for a better community. Once again, our deepest apologies for providing you with less than stellar service. We will improve, and we are committed to transparent communication along the way.