Peter, Brad, and I hosted an AMA about recent platform instability on February 27th. It was a chance for our community to ask about what’s been up with the API lately and how we’re addressing it. Here are a few highlights from the conversation, edited and trimmed for readability. Editing notes are in brackets. The AMA itself contains the full, unedited versions.
Nima: We’re also an analytics company dealing with billions of events each month, and have 3 layers of redundancy that last up to 60 days in our system — I was surprised to see that you only keep your queues for a week.
Josh: Most data is consumed off of Kafka and persisted to Cassandra, in multiple replicas across multiple data centers, within 5 seconds of receipt. When the write path gets backed up that delay can turn into hours, but a week isn’t something we anticipated, because we hadn’t considered a failover scenario where data in the previous DC still needed to be ingested a week later. We haven’t decided exactly what we want to do about this. Storing Kafka logs longer is one option, but that’s a difference in degree, not kind. My hunch right now is that we’ll build warnings into the monitoring stack for this, a la “Hey dummy, this data is going to be reclaimed and isn’t permanently persisted yet!”
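[Editor’s note: to make the “warn before reclamation” idea concrete, here’s a rough, hypothetical sketch of the kind of check Josh describes, written against the modern Kafka Java clients. The consumer group name, retention window, and alert margin are assumptions, not Keen’s actual configuration.]

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class RetentionLagCheck {
    static final Duration RETENTION = Duration.ofDays(7);   // assumed broker log retention
    static final Duration MARGIN    = Duration.ofHours(12); // warn this far ahead of reclamation

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "retention-check");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (AdminClient admin = AdminClient.create(props);
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {

            // Committed offsets of the (hypothetical) group that persists events to Cassandra.
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                .listConsumerGroupOffsets("cassandra-writer")
                .partitionsToOffsetAndMetadata().get();

            for (Map.Entry<TopicPartition, OffsetAndMetadata> entry : committed.entrySet()) {
                if (entry.getValue() == null) continue;
                TopicPartition tp = entry.getKey();
                consumer.assign(Collections.singletonList(tp));
                consumer.seek(tp, entry.getValue().offset());

                // The first record returned is the oldest one not yet persisted.
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<byte[], byte[]> r : records.records(tp)) {
                    Instant reclaimedAt = Instant.ofEpochMilli(r.timestamp()).plus(RETENTION);
                    if (Instant.now().isAfter(reclaimedAt.minus(MARGIN))) {
                        System.err.printf(
                            "Hey dummy: %s still has unpersisted data due for reclamation around %s%n",
                            tp, reclaimedAt);
                    }
                    break;
                }
            }
        }
    }
}
```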
Nima: We’re having discussions on building even more of our analytics on top of Keen (client events -> engineering dashboarding), but it’s hard to make that decision due to the recent events, especially when no real plan has been shared with the customers around what you’ll do to address these issues.
Josh: Here are some of the plans we have in store:
- Index optimization, which will improve query performance and reduce Cassandra I/O generally. Expect to see a document about this on the dev group next week. We’re trying to share more of this work, earlier, for transparency but also for feedback (we know many of y’all are pretty smart about this stuff).
- Cassandra and Astyanax / Cassandra driver work. There are a few optimizations we’ve identified that will reduce the percentage of Cassandra reads & writes that get coordinated by nodes that don’t own the row key in question. There’s a big drop in I/O and a big gain in performance to be had there. [A sketch of the token-aware routing idea follows this list.]
- Hardware. Kafka, Zookeeper, Storm, and Cassandra are continuously getting new metal.
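[Editor’s note: one common way to cut down on that extra coordination hop is token-aware routing, where the client sends each request straight to a replica that owns the row key. Here’s a rough sketch of enabling that with Astyanax, based on its public getting-started examples; the cluster name, keyspace, seeds, and pool sizes are placeholders, and none of this is Keen’s actual configuration.]

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class TokenAwareClient {
    public static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("ExampleCluster")          // placeholder cluster name
            .forKeyspace("events")                 // placeholder keyspace
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                // Learn the ring layout so the client knows which node owns which token range.
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
                // Route each request to a replica for the row key, avoiding an extra hop.
                .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE))
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("ExamplePool")
                .setPort(9160)
                .setMaxConnsPerHost(10)
                .setSeeds("127.0.0.1:9160"))
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());

        context.start();
        return context.getClient();
    }
}
```

[The newer DataStax driver exposes the same idea as a TokenAwarePolicy load-balancing policy.]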
Nima: So I guess the question is: what are the plans? How will you scale up faster than your customers are scaling?
Josh: The question of how we scale up faster than customers is a favorite of mine. Fundamentally I think it comes down to people. Another kind of “monitoring” we have in place is this — “Is Keen a great place to work for distributed systems engineers?” If the answer is “Yes!” then we have the opportunity to address performance and scale challenges with the brightest minds out there. I think that gives us as good a shot as any to keep up with your (very impressive!) growth.
Jody: I’ve heard that to write good software, you have to write it twice. What are some of the major things you’d do differently if you had to rebuild Keen from the ground up?
Josh: I have heard that too, and I’d say at LEAST twice. Here are a few things I would have done differently [geared specifically toward our distributed Java stack]:
- Not use Storm’s DRPC component to handle queries. It wasn’t as battle-tested as I’d hoped and I missed that. We spent a lot of time chasing down issues with it, and it’s partially responsible for queries that seem to run forever but never return anything. Thankfully, due to the work of Cory et al., we have a new service, Pine, that replaces DRPC, and all queries are running on it now. It has queueing and more intelligent rate limiting and customer isolation.
- Greater isolation of write and read code & operational paths. The semantics of failure are very different for writes vs. reads (you can retry failed reads, but not uncaptured writes) [see the sketch after this list]. To be honest it took us a few bruises in 2013 & 2014 to really drill that into our design philosophy. Most of our stack is pretty well separated today, but we could have saved some headaches and customer impact by doing it earlier.
- More monitoring, more visibility, earlier. Anything to ease the operational burden across our team during incidents and prevent stress and burnout.
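[Editor’s note: a minimal sketch of that read/write asymmetry, with hypothetical helper names rather than anything from Keen’s codebase. Reads are side-effect-free and safe to retry; a timed-out write may or may not have been captured, so it needs idempotency or de-duplication rather than blind retries.]

```java
import java.util.concurrent.Callable;

public final class ReadRetry {
    // Reads are safe to retry: re-running a query has no side effects.
    public static <T> T retryRead(Callable<T> read, int attempts, long backoffMs) throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return read.call();
            } catch (Exception e) {
                last = e;                           // transient failure, try again
                Thread.sleep(backoffMs * (i + 1L)); // simple linear backoff
            }
        }
        throw last;
    }

    // Deliberately no retryWrite(): a write that times out may or may not have
    // landed, so blindly retrying it risks duplicate events. Writes need an
    // idempotency key or downstream de-duplication instead.
}
```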
Steve: I would be interested in your dev plans over the next 3–6 months, especially on the query API.
Peter: Over the next 3–6 months our dev plans revolve around performance, consistency, and scalability. We intend for these architectural changes to set us up for future feature expansion. Doors we’d like to open include custom secondary indices, greater filter complexity, chained queries, and increased ability to do data transformation. I’m happy to share more of my excitement around what the future holds if you’re curious 🙂
Rob: We’ve been using Keen for a while now and I recommend you guys to everyone I know. It looks like you’re having some success, which is awesome, but with that, issues scaling. How do you intend to improve performance even while you scale?
Brad: Thanks so much for your support, Rob. There are a few things we’re doing to improve performance as we scale. We’re making our systems more efficient now, and further down the road we’re looking at new capabilities in how we service queries and writes as they come into the system. We’re currently on the second major iteration of our architecture, which is the result of looking at how we can better serve our customers.
Nariman: Any updates on the read inconsistency issue? We first reported this in January and continue to see it in production as of yesterday. The problem hasn’t changed since it was first reported: it includes null timestamps and other null properties.
Brad: We have a number of areas which require repair. I’ve made it through about a quarter of the data; we ran into a few operational issues around doing the repairs. That said, we are now running full steam ahead on the repairs, monitoring them closely, and making sure we don’t run them so fast that they cause other issues in our infrastructure. This is my top priority and I have all the support I need to get this done.
Marios: You mention that you use Cassandra. Are there scenarios where Cassandra’s conflict resolution may result in customer data loss or corruption?
Peter: In the past we ran into issues with Cassandra’s concept of Last Write Wins (“doomstones” and clock drift). Currently, the majority of our consistency issues are due to unforeseen edge cases in node failure scenarios.
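[Editor’s note: for readers unfamiliar with last-write-wins, here’s a toy illustration (not Keen’s code) of how clock drift bites. It uses explicit write timestamps via the DataStax Java driver and assumes a demo.events table with an int id primary key and a text value column.]

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class LastWriteWinsDemo {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            // Node A (clock correct) writes 'first' with timestamp 2000.
            session.execute(
                "INSERT INTO events (id, value) VALUES (1, 'first') USING TIMESTAMP 2000");

            // Node B's clock runs behind, so its later (in wall-clock terms) write
            // carries the smaller timestamp 1000. Last-write-wins compares
            // timestamps only, so this update is silently discarded.
            session.execute(
                "INSERT INTO events (id, value) VALUES (1, 'second') USING TIMESTAMP 1000");

            // Prints 'first': the chronologically later write lost.
            System.out.println(
                session.execute("SELECT value FROM events WHERE id = 1").one().getString("value"));
        }
    }
}
```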
Lee: I have been thinking about querying Keen overnight for consistency and replaying (when possible) if this check fails. Do you have a recommended procedure for replaying events over a questionable period? Thoughts on this?
Josh: [the first part of the answer goes into how, this is the 2nd part] Would I recommend it? It depends. If individual events have a very high impact (e.g. financial, or you’re using them to bill customers) then if it were me I’d keep my own rolling record of events over some period of time. The Internet and natural disasters and the entropy of an expanding universe can never be ruled out. Most compliance directives & regulations mandate this anyway.
That said, one of our design goals for Keen is that in the general case you shouldn’t have to do that. The amount of write downtime Keen had last year can be stated in single-digit hours. We hate that we have a period with 1–2% inconsistent data, believe me, but mobile apps suffer connectivity issues, people run AdBlock on the web, batteries in sensors unexpectedly die — single-digit error percentages creep into the analytics equation way before data is sent to Keen. The question you have to ponder is a cost/benefit — the cost of building/maintaining verification vs. the cost to you / your business when more error creeps in than usual (because Keen or something else fails in an unexpected way).
There’s no one right answer. It really depends on what you’re tracking, your cost & time constraints, and in some sense your belief that the Internet and downstream services like Keen will get more reliable over time. I believe they (and we) will. Part of why I wanted to have this AMA, and more in the future, is to share the transparency I have into what’s going on at Keen so you have more information with which to form your own beliefs.
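[Editor’s note: the “rolling record” Josh mentions can be as simple as journaling each event locally before it’s sent, so a questionable window can be replayed later. A minimal sketch follows; EventJournal and the sendToKeen consumer are hypothetical stand-ins, not part of any Keen SDK.]

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.time.LocalDate;
import java.util.function.Consumer;
import java.util.stream.Stream;

public class EventJournal {
    private final Path dir;

    public EventJournal(Path dir) { this.dir = dir; }

    // Append one event (as a JSON string) to today's newline-delimited journal
    // before handing it to the normal send path.
    public void record(String eventJson) throws IOException {
        Files.createDirectories(dir);
        Path file = dir.resolve("events-" + LocalDate.now() + ".jsonl");
        Files.write(file, (eventJson + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // If an overnight consistency check flags a day, push that day's journal
    // back through whatever client normally sends events.
    public void replay(LocalDate day, Consumer<String> sendToKeen) throws IOException {
        Path file = dir.resolve("events-" + day + ".jsonl");
        if (!Files.exists(file)) return;
        try (Stream<String> lines = Files.lines(file)) {
            lines.forEach(sendToKeen);
        }
    }

    public static void main(String[] args) throws IOException {
        EventJournal journal = new EventJournal(Paths.get("/var/log/event-journal"));
        journal.record("{\"type\":\"purchase\",\"amount\":42}");
        // journal.replay(LocalDate.of(2015, 2, 20), json -> { /* resend via your client */ });
    }
}
```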
That’s it for the quick summaries. You can read all ~35 posts in full on the AMA itself.
I want to pre-emptively acknowledge that we didn’t get into as much of the nitty-gritty tech stuff as I had hoped. There is still a technical postmortem on the way and I’m happy to use that as a starter for another conversation on the developer group.
A special thanks to everyone who dropped by to chat with us. We learned a lot and your participation was really inspiring.