Building a High Performance Distributed System: Apache Kafka vs Amazon Kinesis

This article was originally published in February 2017 and has been updated.

At Keen IO, we’ve been running Apache Kafka in a pretty big production capacity for years, and are extremely happy with the technology. We also do some things with Amazon Kinesis and are excited to keep exploring it as we tweak our high performance distributed system.

Apache Kafka vs Amazon Kinesis

If you’ve narrowed it down to choosing between Kinesis and Kafka for the solution, the choice usually depends on these factors more than it does on your use case:

  • Company size
  • Stage
  • Funding
  • Culture

(Spoiler: for some use cases, the answer is obviously Kafka. I’ll get to that later.)

Predetermining Factors for a High Performance Distributed System

If you’re a Distributed Systems engineering practice and:

  • Have lots of distributed dev ops / cluster management / auto-scale / streaming processing / sysadmin chops
  • Prefer to interact with Linux vs. interacting with an API

You may choose Kafka regardless of other factors. The inverse is true if:

  • You’re more of a web, bot, or app development practice
  • You’re fans of services like Amazon RDS, Amazon EC2, Twilio, and SendGrid more than services like Apache ZooKeeper and Puppet

Going Head-to-Head

In somewhat-artificial tests of high performance distributed systems, Kafka today has more horsepower out of the box on rough numbers. Kafka can also presently be tuned to outperform Kinesis when it comes to raw numbers. But are you really going to do all that tuning? Or are there other pros and cons to consider? A Corvette can beat a Toyota Corolla in many tests, but maybe gas mileage is what matters most to you? Or perhaps longevity and interoperability? Or maybe, as with lots of business decisions, it is Total Cost of Ownership (TCO) that wins the day?

What follows is a bit of a side-by-side breakdown of the big chunks of the TCO for each technology.

High Performance Distributed System Performance (can it do what I want?)

For the vast majority of use cases, you really can’t go wrong with either of these technologies performance-wise. There are other great posts that point to Kafka shining in this department, and an in-between solution such as Amazon MSK is also available.

Advantage: Kafka — but performance is often a pass/fail question, and for nearly all cases, both pass.

High Performance Distributed System Setup (human costs)

Kinesis is more than just slightly easier to set up than Kafka. When compared with roll-your-own on Kafka, Kinesis mitigates a lot of problems. Cross-region issues have already been considered for you. With Kafka, you would otherwise have to learn and manage:

  • Apache ZooKeeper
  • Cluster management
  • Provisioning
  • Failover
  • Configuration management
  • And more

If you’re a first-time user of Kafka, it’s easy to sink days or weeks into making Kafka into a scale-ready production environment. Kinesis will take you a couple of hours max. And as it’s in AWS, it’s production-worthy from the start.

Advantage: Kinesis, by a mile.

Ongoing ops (human costs)

It also might be worth adding that there can be a big difference between the ongoing burden of running your own infrastructure vs. paying AWS to do it for you, especially considering the headaches that can come with running it yourself:

  • 24-hour pager rotation to deal with hiccups
  • Building a run book over time based on your experience
  • Other standard site reliability issues

In many Kafka deployments, the human costs related to this part of your stack alone could easily reach the high hundreds of thousands of dollars per year.

Ops work still has to be done by someone even if you’re outsourcing it to Amazon, but it’s probably fair to say that Amazon has more expertise running Kinesis than your company will ever have running Kafka. Plus the multi-tenancy of Kinesis gives Amazon’s ops team significant economies of scale.

Advantage: Kinesis, by a mile.

Ongoing ops (machine costs)

This one is hard to peg down. The only way to be certain for your use case is to build fully-functional deployments on Kafka and on Kinesis, then load-test them both for costs. This is worthwhile for some investments, but not others. But we can make an educated guess.

Time Investment

Because Kafka exposes low-level interfaces, and you have access to the Linux OS itself, Kafka is much more tunable. If you invest the human time, your costs can go down over time based on:

  • Your team’s learning
  • Seeing your workload in production
  • And optimizing for your particular usage

With Kinesis, your costs will probably go down over time automatically because that’s how AWS as a business tends to work. But that cost reduction curve won’t be tailored to your workload. Mathematically, it will work more like an averaging-out of the various ways Amazon’s other customers are using Kinesis. This means the more typical your workload is for them, the more you’ll benefit from AWS’s inevitable price reduction.

Cost of Utilization

Meanwhile (and this is quite like comparing cloud instance costs, e.g. EC2, to dedicated hardware costs), there’s the utilization question: to what degree are you paying for unused machine/instance capacity? On this front, Kinesis has the standard advantage of all multi-tenant services, from Heroku and SendGrid to commuter trains and HOV lanes. It is far less likely to be as over-provisioned as a single-tenant alternative would be, meaning a given project’s cost curve can much better match the shape of its usage curve. The vendor makes a profit margin on your usage, but AWS (and all of Amazon, really) is a classic example of penetration pricing, never focused on extracting big margins.

Advantage: Probably Kinesis, unless your project is a super special snowflake.
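To make the utilization question concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the hourly demand curve is invented, the `headroom` safety margin is a hypothetical parameter, and the comparison is in abstract capacity units rather than dollars, so it ignores the vendor’s margin and the actual pricing model of either technology.

```python
# Hypothetical hourly demand for one day, in arbitrary "capacity units".
# These numbers are invented purely for illustration, not measured.
hourly_demand = [2, 2, 1, 1, 1, 2, 4, 7, 9, 10, 10, 9,
                 8, 9, 10, 9, 8, 6, 5, 4, 3, 3, 2, 2]


def single_tenant_units(demand, headroom=1.5):
    """Units paid for when capacity is fixed at peak demand times a safety margin."""
    provisioned = max(demand) * headroom
    return provisioned * len(demand)


def usage_based_units(demand):
    """Units paid for in an idealized model where billing tracks demand hour by hour."""
    return sum(demand)


def utilization(demand, headroom=1.5):
    """Fraction of the fixed, single-tenant capacity that is actually used."""
    return usage_based_units(demand) / single_tenant_units(demand, headroom)


if __name__ == "__main__":
    print(f"fixed capacity paid for: {single_tenant_units(hourly_demand):.0f} unit-hours")
    print(f"capacity actually used:  {usage_based_units(hourly_demand)} unit-hours")
    print(f"utilization:             {utilization(hourly_demand):.0%}")
```

With this invented curve, the fixed cluster is used at roughly 35% of what is paid for. The spikier your real workload and the larger the safety margin you need, the bigger that gap gets, which is the shape of the advantage described above.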

Incident Risk

Your risks of production issues will be far lower with Kinesis. After your team has built up a few hundred engineer-years of managing your Kafka cluster (or if you can find a way to hire this rare and valuable expertise from the outside), these risks will decline significantly, as long as you’re also investing in really good monitoring, alerting, 24-hour pager rotations, etc. The learning curve will be less steep if your team also manages other heavy distributed systems.

But between go-live and when you have grown or acquired that expertise, can you afford outages and lost data in the meantime? The impact depends on your case and where it fits into your business. The risk is difficult to model mathematically: if you could predict a given service outage or data loss incident well enough to model its impact, you’d know enough to avoid the incident entirely.

Advantage: Kinesis
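The sections above can be added up into one rough yearly figure. The following Python sketch is only a way to organize your own estimates: the `YearlyCosts` class, its field names, and every number in the usage example are hypothetical placeholders, not measurements from our systems or from AWS. It treats incident risk as a simple probability times impact, which, as noted above, is a crude stand-in for something that is genuinely hard to model.

```python
from dataclasses import dataclass


@dataclass
class YearlyCosts:
    """Hypothetical cost inputs, in dollars per year unless noted otherwise."""
    setup_human: float           # setup effort, amortized over the year
    ops_human: float             # pager rotation, run books, cluster care
    machines: float              # instances you run, or fees for a managed service
    incident_probability: float  # rough chance (0 to 1) of a serious incident this year
    incident_impact: float       # rough cost in dollars if that incident happens

    def tco(self) -> float:
        """Direct costs only: setup + ongoing ops (human) + ongoing ops (machine)."""
        return self.setup_human + self.ops_human + self.machines

    def risk_adjusted_tco(self) -> float:
        """Direct costs plus the expected cost of incidents."""
        return self.tco() + self.incident_probability * self.incident_impact


if __name__ == "__main__":
    # Placeholder values only; substitute your own estimates.
    self_managed = YearlyCosts(50_000, 600_000, 100_000, 0.20, 500_000)
    managed = YearlyCosts(5_000, 50_000, 150_000, 0.05, 500_000)
    for name, costs in [("self-managed", self_managed), ("managed", managed)]:
        print(f"{name}: TCO {costs.tco():,.0f}, risk-adjusted {costs.risk_adjusted_tco():,.0f}")
```

The value of a sketch like this is less in the totals than in forcing each input to be written down, so you can see which estimate dominates the result and which ones you are least sure about.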

Conclusion

The TCO is probably significantly lower for Kinesis. So is the risk. In most projects, risk-adjusted TCO should be the final arbiter. So why do we use Kafka, despite the fact that the risk-adjusted TCO may be higher?

The first answer is historical: Kinesis was announced in November 2013, well after we had built on Kafka. But we would almost certainly choose Kafka even if we were making the call today. This is because of these core reasons:

Event Streaming

Event streaming is extremely core to what we do at our company. In the vast majority of use cases, data engineering is auxiliary to the product, but for us, it is the product: one of our products is called Keen Streams, and is itself a large-scale streaming event data input + transformation + enrichment + output service. Kafka helps power the backbone of the product, so tunability is key for our case.

Tunability

Nothing is more tunable than running an open source project on your own stack, where you can instrument and tweak any layer of the stack (on top of Kafka, within Kafka, code in the Linux boxes underneath, and configuration of those boxes to conform to a variety of workloads). And because what we sell is somewhere between PaaS and IaaS ourselves, and because performance is a product feature for us as opposed to an auxiliary nice-to-have on an internal tool, we’ve chosen to invest heavily into that tuning and into the talent base to perform that tuning.

Versatile Deployment

Because it’s open source and can be deployed anywhere, Apache Kafka is extremely versatile. Infrastructure cost is a key input to our gross margins, so we enjoy a lot of benefits by being able to deploy into various environments. Data location is a key input to some enterprise customers’ decision-making process, so it’s valuable to maintain control over where all of our services, including the event queue itself, are deployed.

Do More with Keen

At Keen IO, we built a massively scalable event database that allows you to stream, store, compute, and visualize all via our lovingly-crafted APIs. Keen’s platform uses a combination of Tornado, Apache Storm, Apache Kafka, and Apache Cassandra, which allows for a highly available and scalable, distributed database. Have an experience or content you’d like to share? We enjoy creating content that’s helpful and insightful.

Enjoyed the article? Check us out! Or email us. We would love to hear from you.