
The problem with ‘log everything’ and how to solve it

I posted an answer this weekend on Quora about a very familiar problem — how a startup with a small engineering team should approach data & data infrastructure.

I’ve played the data-engineer-by-necessity role at several startups now and it’s not easy. Deploying infrastructure, hand-rolling integrations, and building dashboards all take time. And as an engineer, that’s time you could be using to build features or fix bugs.

And to make it worse, you might treat your data build-out like everything else you do — you might do a really good job. You craftsman, you!

It’ll be fast, extensible, and well-designed. And because it is, you’ll add every piece of data in sight to it.

You’ll sleep well knowing that you’re tracking everything (what we call optimistic event collection). Then you’ll wake up a week later and realize you haven’t made any decisions yet using all of this new data. There’s already too much of it to sort through.

Your next temptation will be to retroactively generate & fit hypotheses to what you’ve been capturing. And just like that you’ve gone and broken science.

All is not lost. If you’re smart, or lucky, you’ll narrow your data down to what you care about and eventually swing your data investment into the black. But why start so deep in the red?

A better start might be to take a more ‘pessimistic’ approach. Identify precisely the metric you need to know, hypothesize about what influences it, and ignore the rest; tune out the noise. Build outward from there.

I suppose that’s the anecdotal, hits-close-to-home version of my Quora answer (below). If you’ve ever been in that situation, let’s get a beer and commiserate. And if you haven’t, let’s get a beer and celebrate!

Update: Kyle also answered this question and made some great points. Tracking too few events can result in slow iteration speeds or gaps in analysis. Read his answer for some tips on finding a suitable middle ground.

Logging (data): How should a web startup with a 2-person engineering team approach the “log everything” data infrastructure problem?

My disclaimer should follow from my topic bio — I work at Keen IO; we help customers think about this problem every day.

I agree with/encourage these points:

  • Do event-based analytics w/ flexible JSON. Think of your event properties as the ‘current state of the world’. Make payloads wide so you can look for correlations later (see the sketch after this list).
  • You should absolutely own your data from the get-go.
  • Even if you have ‘ownership’, your data shouldn’t be locked into generic or isolated portals and views. You need the ability to compose exactly the queries and views that are meaningful to your specific problem with minimal investment.
  • Your core competency is building your product, not wrangling data or building data infrastructure. That said, any investments in analytics you make should not be throwaway. They should represent the customization of what’s available generically to your exact domain.
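
To make that first point concrete, here’s a minimal sketch of what a ‘wide’ event payload might look like. The event name, properties, and collector endpoint are all hypothetical — the idea is just to snapshot the state of the world alongside the action itself, so you can slice by any of these dimensions later:

```typescript
// A hypothetical "signup" event with a wide payload: the action itself
// plus a snapshot of the current state of the world around it.
interface SignupEvent {
  timestamp: string;      // when it happened
  userId: string;         // who did it
  plan: string;           // what they chose
  referrer: string;       // where they came from
  abVariant: string;      // which experiment arm they saw
  device: string;         // what they were using
}

const event: SignupEvent = {
  timestamp: new Date().toISOString(),
  userId: "u_12345",
  plan: "pro",
  referrer: "hacker-news",
  abVariant: "new-onboarding",
  device: "mobile",
};

// Ship it to whatever collector you own — a made-up endpoint here.
fetch("https://collector.example.com/events/signup", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(event),
});
```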

I disagree with the track everything mentality. Having too much data is the best excuse for not using any of it. Track what you think matters, but rapidly adjust this as you home in on product-market fit.

Data FOMO causes more harm than good. Right now there are 5–10 metrics that represent 80–90% of what data can teach you about your product. Your job is to identify those 5–10 metrics and start tracking them.

All that said, we live in a world of cheap storage and powerful tools/APIs. Logging everything isn’t that expensive. Do it if it gives you peace of mind.

Just don’t trade depth-and-action for breadth-and-comfort. Broad data is more interesting than it is actionable. There are only a few specific metrics you’ll actually use to make business decisions, and you need to understand what influences those numbers inside and out.
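
As a rough illustration of that last point, here’s a sketch of going deep on a single metric — signups, sliced by a property you hypothesize influences it — instead of dashboarding every event you collect. The event shape and sample data are hypothetical:

```typescript
// Sketch: compute one metric you'd actually act on (signups by referrer)
// from a stream of wide events, ignoring everything else.
interface Event {
  type: string;
  referrer: string;
}

function signupsByReferrer(events: Event[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    if (e.type !== "signup") continue; // tune out the noise
    counts.set(e.referrer, (counts.get(e.referrer) ?? 0) + 1);
  }
  return counts;
}

const sample: Event[] = [
  { type: "signup", referrer: "hacker-news" },
  { type: "pageview", referrer: "google" },
  { type: "signup", referrer: "google" },
  { type: "signup", referrer: "hacker-news" },
];

console.log(signupsByReferrer(sample));
// Map { 'hacker-news' => 2, 'google' => 1 }
```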