Compute Performance Improvements & Magic of Caching

I’m happy to announce that we’ve rolled out some significant performance and reliability improvements to the Keen IO platform for all query types.

Improved Query Response Times

Overall query response times have improved: queries to Keen, whether made via the API or through the Explorer, should now return faster. The graph below shows the impact of the changes.

(95th Percentile Query Duration)

Improved Query Consistency

We have also made our query processing more robust by fixing a bug in our platform that could cause query results to fluctuate (different results for the same query) during certain operational incidents like this one.

The Magic of Caching

These dramatic improvements were made possible by more effective caching of data within our query platform.

We’ve been working on improving query response times for many months. To understand the most recent update, it helps to have a little background on how Keen uses caching and how that caching has evolved over time.

Query Caching Evolution

At the lowest level we have a fleet of workers (within Apache Storm) responsible for computing query results. Any query can be considered as a function that processes events.

Query = function(events)

Workers pull pending queries from a queue, load the relevant events from the database, and apply the appropriate computation to get the result. The amount of data needed to process a query varies a lot, but some of the larger queries need to iterate over hundreds of millions of events in just a few seconds.
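
To make the “query as a function of events” idea concrete, here is a tiny JavaScript sketch (illustrative only: the event array and property names are made up, and Keen’s workers obviously don’t operate on small in-memory arrays like this):

// Illustrative only: a count query with a filter, expressed as a
// plain function over a list of events.
function countQuery(events, matches) {
  return events.filter(matches).length;
}

// Hypothetical usage: count pageview events from mobile devices
var mobilePageviews = countQuery(pageviewEvents, function (event) {
  return event.device_type === 'mobile';
});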

If you want to know more about how we handle queries of varying complexity and ensure consistent response times, I wrote an earlier blog post on that topic which is available here.

(Simplified view of a Query being processed)

We started experimenting with caching about a year ago. Initially, we had a simple memcached-based cache running on each Storm worker for frequently accessed data. At this stage, the main problem we had to solve was invalidating data in the cache.

Cache Invalidation

We don’t store individual events as individual records in Cassandra because that would not be efficient. Instead, we group events (by collection and timestamp) into what we call ‘buckets’. These buckets sometimes get updated when new events come in, or when our background compaction process decides that the events need to be re-grouped for efficiency.

If we used a caching scheme that relied on a TTL or expiry, we would end up with queries showing stale or inconsistent results. Additionally, having one cache instance per worker means that different workers could have different views of the same data.

This was not acceptable: we needed to make sure the cache would never return data that had since been updated. To solve this problem, we:

  1. Added a last-updated-at timestamp to each cache entry, and
  2. Set up memcached to evict data based on an LRU algorithm.

The scheme we used to store events was something like the following:

Cache Key = collection_name + bucket_id + bucket_last_updated_at

Cache Value = bucket (or an array of events)

The important thing here is that the bucket_last_updated_at timestamp is part of the cache key. The query processing code first reads a master index in our DB that gives it the list of buckets to read for that particular query. We made sure that the index is also updated, with the latest timestamp, whenever a bucket is updated. This way the query execution code knows the expected timestamp for each bucket it reads; if the cache holds an older version, that entry is simply ignored and eventually evicted.
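
To make the versioned-key scheme concrete, here is a minimal JavaScript sketch of the read path (illustrative only: cacheGet, cacheSet, and readBucketFromCassandra are hypothetical helpers, and Keen’s actual workers run on the JVM inside Apache Storm):

// Illustrative sketch of the versioned cache-key read path.
// cacheGet/cacheSet and readBucketFromCassandra are hypothetical helpers.
function readBucket(collection, bucketRef) {
  // bucketRef comes from the master index and carries the bucket's
  // latest last-updated-at timestamp
  var key = collection + ':' + bucketRef.bucketId + ':' + bucketRef.lastUpdatedAt;

  var cached = cacheGet(key);
  if (cached) {
    // Guaranteed fresh: a stale copy would live under an older key
    return cached;
  }

  // Cache miss (or only an older version is cached): read from Cassandra
  // and write through. Entries under older keys are never read again and
  // eventually fall out of memcached via LRU eviction.
  var bucket = readBucketFromCassandra(collection, bucketRef.bucketId);
  cacheSet(key, bucket);
  return bucket;
}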

So our first iteration of the cache looked something like the following:

(Query Caching V1)

This was successful in reducing load on Cassandra and worked well for many months, but we weren’t able to fully utilize the potential of caching because we were limited by the memory of a single Storm machine.

We went on to create a distributed caching fleet, using Twitter’s Twemproxy to front a number of memcached servers. Twemproxy handles sharding the data, dealing with server failures, and so on.

This configuration allows us to pool the spare memory on all our Storm machines and create a big, distributed cache cluster.

(Query Caching V2)

Once we rolled out the new configuration, the impact was pretty dramatic. We saw a major increase in cache hit rate and improvements in query performance.

(Improved cache hit rate after distributed caching rollout)

Improving Query Consistency

Keen’s platform uses Apache Cassandra, a highly available, scalable, distributed database. Our architecture and usage of Cassandra had a limitation that made us susceptible to reading incomplete data for queries during operational issues with the database.

Improved cache hit rates meant that most query requests were served out of the cache, making us less sensitive to latency increases in our backend database. We used this opportunity to move to a higher consistency level for our Cassandra reads.

Previously we read just one copy (out of multiple replicas) of the data from Cassandra when evaluating queries. This was prone to errors caused by replication delays for new data and was also affected by hardware failures on individual servers. We now read at least two copies each time we read from Cassandra.

This way, if a particular server does not have the latest version of the data or is having problems, we are likely to get the latest version from another server, which improves the reliability of our query results.
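
For readers unfamiliar with Cassandra consistency levels, the change amounts to raising the read consistency from ONE to at least TWO. Here is a minimal sketch of what that looks like with the DataStax Node.js driver (purely illustrative: Keen’s query workers are JVM-based, and the keyspace, table, and column names below are made up):

// Illustrative only: reading from Cassandra at consistency level TWO
// using the DataStax Node.js driver (cassandra-driver).
var cassandra = require('cassandra-driver');

var client = new cassandra.Client({
  contactPoints: ['10.0.0.1', '10.0.0.2'],
  localDataCenter: 'dc1',
  keyspace: 'events'
});

// CL=TWO: at least two replicas must respond before the read succeeds,
// so a single stale or failing node no longer determines the result.
client.execute(
  'SELECT bucket FROM event_buckets WHERE collection = ? AND bucket_id = ?',
  ['pageviews', 'bucket-123'],
  { prepare: true, consistency: cassandra.types.consistencies.two }
).then(function (result) {
  console.log(result.rows.length + ' rows read at CL=TWO');
});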

Manu Mahajan

Backend Software Developer. Human.

Delivering embedded analytics to your customers

More and more companies are making embedded analytics a core part of their product offering. Companies like Medium, Spotify, Slack, and Intercom are leveraging data as a core product offering to drive results. This isn’t just happening within high-growth tech startups: in a recent survey, The Data Warehousing Institute found that around 30% of enterprises already have embedded analytics as a product offering in development or production, and this effort is expected to double by 2018.

Regardless of your industry or company size, you might have thought about ways to use data to engage your users, demonstrate product value, or create opportunities for upsell and differentiation. Whatever your objective, delivering embedded analytics to your customers can be a significant undertaking and addition to your product roadmap. You’ll need to tackle questions like:

  • What is the purpose of having analytics for our customers?
  • How will you display data to customers?
  • Will you let your customers run their own analysis on the data?
  • Will you build in-house or leverage existing technology?
  • How many engineering resources can you dedicate to this?
  • What is the timeline?

We’ve put together a framework for thinking through all the moving parts of delivering embedded analytics to your customers so you’ll be best set up for success. Click here to view the handy PDF version.

Define your analytics objective

  • Can data help drive customer engagement?
  • How will providing embedded analytics to your customers differentiate your product?
  • Do you have dedicated resources to help build out this product?
  • Do you have executive buy in?

Data Readiness

  • Do you currently have customer data stored?
  • What sources do you need to collect data from? Are there APIs you can utilize from third-party providers?
  • How clean is your data?
  • What format is your data in? Will you need to extract, load and transform it?
  • What are the key KPIs your customers care about?

Security & Access

  • How strict are the security requirements of your customers? What type of compliance do they require?
  • How granular do you want security permissions to be? Will you secure by company, by department, by role?
  • What are your hosting and infrastructure requirements?

Application UX

  • How do you want to display the analytics within your application?
  • How much control do you want customers to have over their analytics? Do you want to make it exportable? Do you want them to run their own queries?
  • Do you know where in the user flow you’d like to incorporate analytics?
  • Do you have a support structure set in place for customers who engage with your analytics service?

Performance

  • How real time do your customers need their data to be?
  • Do you have a sense of how many queries you’ll need to run per customer, and how often?

Engineering Resources

  • What are your current resource constraints?
  • Do you have data engineering and data modeling expertise?
  • Do you have a UI engineer to design the look and feel of the analytics within your application?
  • What additional resources will you need?

Delivery & Extensibility

  • Do you have a sense for the timeline to deliver an MVP?
  • How often do you expect your customer metrics to change?
  • Can you dedicate full time resources to build this?

Want to reference this list later? We’ve created this handy PDF checklist for you to print off. We also curated a list of 25 companies who are delivering analytics to their customers for fun inspiration.

Happy building! If you’d like to chat about how we’ve helped companies deliver analytics to their customers, give us a shout or request a demo.


We’ll be writing and sharing more content soon. Sign up to join thousands of builders and stay up to date on the latest tips for delivering analytics to your customers:

Alexa Meyer

Growth and UX. Cheese chaser. Aspiring behavioral economist.

Just Released: Download to CSV + Clone Queries

We have some very exciting news to share today! We’ve released some updates to Keen’s Data Explorer that we think you’ll enjoy. Keen IO users can now:

  • Download query results directly into CSV files
  • Clone saved queries

These two features have been widely requested by our community and we’re thrilled to make them available to everyone.

How to download query results to CSV

Now you can download query results displayed in the “Table” view as a CSV file from the Explorer. If you’ve entered a name for your query, that name will automatically be used as the CSV file name. If your query has not been named, we’ll provide a placeholder file name that you can update whenever you like.

To download a CSV:

  • Log in to your Keen IO account and run a query in the Explorer
  • Select the “Table” visualization type from the dropdown
  • Click “Download CSV”

How to clone a saved query

A cloned query is essentially a copy of a saved query. Once you’ve cloned a query, you can modify it without impacting the original query. This is especially handy when you want to build off of complex queries (like funnels with custom filters on each step) without having to enter all of the query parameters from scratch each time.

To clone a query:

  • Log in to your Keen IO account and select a saved query from the “Browse” tab
  • Click “Clone”
  • Enter a name for your cloned query and click “Save”

A note of thanks

A huge thank you goes out to Keen IO user and community member, Israel Menis, for their open source contributions to the Data Explorer. Their contributions helped make these features possible!

As always, if you have any questions or feedback, please reach out to us anytime. We hope cloned queries and CSV download help streamline your workflow.

Happy Exploring!

Sara Falkoff

Software Engineer

Announcing: Search on Keen Docs!

We’ve been spending time working on the Developer Experience of using Keen. Making the Keen documentation searchable is one of the first updates, with more to come.

Try it out here!

Searchable Docs

In the weeks to come, we’ll be publishing a technical blog post on how we implemented search in our docs with Algolia. At Keen IO, we are a developer-first company and believe in creating a world-class developer experience. We offer functional tools and APIs so developers can quickly build applications that show off their data. We also believe that the workflow on our site should be as easy to use as possible, and we’re committed to creating that positive Developer Experience.
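
While the full write-up is still to come, here is a rough sketch of what querying an Algolia index looks like in JavaScript (illustrative only: the app ID, API key, index name, and record attributes below are placeholders, and this is not necessarily how the Keen docs implement it):

// Illustrative only: querying an Algolia index with the algoliasearch client.
// The app ID, search-only API key, and index name are placeholders.
var algoliasearch = require('algoliasearch');

var client = algoliasearch('YOUR_APP_ID', 'YOUR_SEARCH_ONLY_KEY');
var index = client.initIndex('docs');

index.search('funnel analysis').then(function (response) {
  // Each hit is one indexed docs page or section (attributes are placeholders)
  response.hits.forEach(function (hit) {
    console.log(hit.title, hit.url);
  });
});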

Do you have feedback for our Developer Experience? Just drop us a comment or write to us at community@keen.io.

Happy Coding! 📊

–Developer Advocacy Team

Maggie Jan

Data Scientist, Engineer, Teacher & Learner

25 Examples of Native Analytics Data Designs in Modern Products

Data is so ubiquitous, we are sometimes oblivious to just how much of it we interact with—and how many companies are making it a core part of their product. Whether you’re aware of it or not, product leaders across industries are using data to drive engagement and prove value to their end-users. From Fitbit and Medium to Spotify and Slack, data is being leveraged not just for internal decision-making, but as an external product offering and differentiator.

These data-as-product features, often displayed as user-facing dashboards, are known as “native analytics” because they are offered natively within the context of the customer experience. We’ve gathered 25 examples of native analytics in modern software to highlight their power and hopefully inspire their further adoption.


Ahrefs Lets Website Owners Drill Down on Referrers

Every day, Ahrefs crawls 4 billion web pages, delivering a dense but digestible array of actionable insights from 12 trillion known links to website owners (and competitors), including referrers, social mentions, keyword searches, and a variety of site rankings.


AirBnB Helps Hosts Improve their Ratings and Revenue

In addition to providing intimate housing options in 161 countries to 60M+ guests, Airbnb also reminds its more than 600,000 hosts of the fruits of their labors—with earnings reports—and gently nudges them to provide positive guest experiences—with response rates and guest ratings.


Etsy Helps Build Dream Businesses

The go-to online shop Etsy, which boasts 35M+ products, provides its 1.5M+ sellers with engagement and sales data to help them turn their passion into the business of their dreams.


Eventbrite Alerts Organizers to Sales and Check-ins

Event organizers use Eventbrite to process 4M tickets a month to 2M events in 187 countries. They also turn to Eventbrite for real-time information, to stay up to date with ticket sales and revenue, to track day-of check-ins, and to understand how to better serve and connect with their attendees.


Facebook Expands Reach of Paid Services

With Facebook striving to take a bigger bite out of Google’s share of online ad sales, its strategic use of data has spread beyond the already robust Facebook Ads Manager to comprehensive metrics for Pages, including, of course, key opportunities to “boost” posts.


Fitbit Helps Users Reach Their Fitness Goals

Fitbit’s robust app, connected to any of its eight activity trackers, allows its 17M+ worldwide active users to track steps, distance, and active minutes to help them stay fit; track weight change, calories, and water intake to stay on pace with weight goals; and track sleep stats to help improve energy levels.


GitHub Tracks Evolving Code Bases

GitHub, the world’s largest host of source code with 35M+ repositories, allows its 14M+ users to gain visibility into their evolving code bases by tracking clones, views, visitors, commits, weekly additions and deletions, and team member activity.


Intercom Targets Tools—and Data—to Users’ Needs

Intercom, the “simple, personal, fun” customer communications platform, delivers targeted data-driven insights depending on which of the platform’s three products a team uses: Acquire tracks open, click, and reply rates; Engage tracks user profiles and activity stats; and Resolve tracks conversations, replies, and response times.


Jawbone UP Enables Ecosystem of Fitness Apps with Open API

Jawbone’s four UP trackers help users hit fitness goals by providing insights related to heart rate, meals, mood, sleep, and physical activity, both in its award-winning app and through an extensive ecosystem of apps that draw data from the platform’s open API.


LinkedIn Premium Tracks Funnel Conversions

LinkedIn’s Premium suite of networking and brand-building tools helps demonstrate the ROI of sponsored campaigns by providing users with visibility into their engagement funnel—from impression, to click, to interaction, to acquired follower.


Medium Provides Publishers with Key Reader Metrics

Though Medium’s model is sometimes murky—publishing platform, publication, or social network?—it provides clear insights to its writers (or is that publishers?) in the form of views, reads, recommends, and referrers for published stories.


Mint Helps Users Budget and Save

Mint encourages users to make better financial decisions and save for big goals by giving them visibility into their spending trends, especially as they relate to personalized budgets.


Pinterest Allows Pinners to Track Engagement

The internet’s favorite mood board, Pinterest provides its 110M monthly active users with traffic and engagement stats, including repins, impressions, reach, and clicks.


Pixlee Illuminates Its Unique Value Proposition

Pixlee helps brands build authentic marketing by making it easy to discover images shared by their customers, and then deploy them in digital campaigns. To help its clients understand the impact of this unique value proposition, Pixlee serves up an on-brand, real-time dashboard that presents custom metrics like “lightbox engagement” alongside traditional metrics like pageviews and conversions.


Shopkeep Improves Business Decision Making

Shopkeep’s all-in-one point-of-sale platform uses a wide range of data—from best-selling items to top-performing staff—to help businesses make fact-based decisions that improve their bottom line.


Slack Delivers Visibility Into Internal Communications

The messaging app of choice for more than 60,000 teams—including 77 of the Fortune 100 companies — Slack delivers stats related to message frequency, type, and amount, plus storage and integrations.


Spotify Shares Stats as Stunning Visuals

Spotify’s stream-anywhere music service turns data insights into beautiful, bold visuals, informing their listeners of how many hours of songs they listened to in a year and ranking most-listened-to artists. They also help artists get the most from the platform by highlighting listeners by location and discovery sources.

Fan insights by Spotify


Square Zeros In On Peak Hours and Favorite Items

Going beyond credit card payments to comprehensive business solutions, Square provides business owners with real-time reports that include hourly sales by location, which help them home in on peak hours and preferred products.


Strava Turns Everyday Activities Into Global Competitions

Strava turns everyday activities into athletic challenges by comparing its users’ performance stats against the community’s for a given walk, run, or ride. The app also used its 136B data points to create the Strava Insights microsite, providing insight into cycling trends in 12 cities across the globe.


Swarm Updates the Foursquare Experience with New Gamified Features

Swarm adds additional gamification and social features to the original Foursquare check-in experience, providing users with their popular check-ins broken out by type, as well as friend rankings and leaderboards for nationwide “challenges.”


Triptease Builds Strong Relationships with Hotels

The Triptease smart widget allows hotels to display real-time prices for rooms listed by competing sites like Hotels.com to help convince guests to book directly and help the hotel build richer customer relationships. To keep a strong relationship with their own hotel-users, Triptease shows the impact on revenue of widget-enabled conversions, as well as the hotel’s real-time price rankings compared to other websites.


Twitter Beefs Up Its Business Case

As the internet’s 140-character collective consciousness positions itself more decisively as a boon for businesses, it has beefed up and beautified its analytics dashboard. Twitter’s dashboard now includes impressions, profile visits, mentions, and follower change for the past month, plus cards for Top Tweet, Top Follower, and Top Mention.


Vimeo Provides “Power” Stats in a Straightforward Interface

“We basically wanted to give users a power tool, but didn’t want them to feel like they needed a license to operate it,” explains Vimeo senior product designer Anthony Irwin of the video-hosting platform’s analytics tool. Today, Vimeo’s 100M+ users can dig deep—or stay high-level—on traffic, engagement, and viewer demographics.


Yelp Extrapolates Conversion-Generated Revenue

More than a ratings site for local businesses, Yelp also helps its 2.8M businesses engage and grow relationships with their customers. To highlight this value proposition, the company provides business users with a tally of customer leads generated through the platform, as well as a calculation of estimated related revenue.


Zype Helps Users Track Video Revenue

With a single interface, Zype makes it easy to publish and monetize video content across various platforms. Core to its value is the ability to provide users with key stats including monthly earnings, new subscriptions, and successful revenue models.


Building analytics into your product? We can help with that. Check out Native Analytics.

Want to see your stats featured in our next post? Send us a note


We’ll be releasing more guides and examples in the coming months. Subscribe to join hundreds of other product leaders who are following native analytics trends:

Alexa Meyer

Growth and UX. Cheese chaser. Aspiring behavioral economist.

Introducing: Auto Collector for Web

Want to quickly test out Keen? Need to get a quick sense of the key interactions on your website or web app? You’re in luck! We just released an Auto Collector for Web.

What does it do?

  • Drop-in snippet that automatically collects key web interactions
  • Auto-tracks pageviews, clicks, and form submissions
  • Auto-enriches events with information like referrer, URL, geo location, device type, and more

Ready to get started? Just drop in the snippet and start seeing web events flow in within seconds. You can also add your own custom tracking.
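
If you also want to record custom events alongside the auto-collected ones, a minimal sketch with the keen-tracking.js client looks roughly like this (the collection and property names are just examples, and you’d substitute your own project ID and write key):

// Rough sketch: recording a custom event with keen-tracking.js,
// assuming the library is loaded on the page.
var client = new KeenTracking({
  projectId: 'YOUR_PROJECT_ID',
  writeKey: 'YOUR_WRITE_KEY'
});

// Example custom event; the collection name and properties are illustrative
client.recordEvent('signup_button_clicks', {
  page: window.location.pathname,
  plan: 'free-trial'
});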

Check out our guide to learn more or head on over to your project settings to get started. We’ve auto-populated your project ID and write key for you.

Happy Tracking!

Alexa Meyer

Growth and UX. Cheese chaser. Aspiring behavioral economist.

How companies are delivering reporting and analytics to their customers

Today’s users of technology expect stats and charts in each and every one of their favorite apps and websites. Many companies are turning advanced analytics into a paid feature, while others are bundling analytics into their core product to improve engagement and retention. Keen’s Native Analytics lets every company differentiate with data and analytics that are truly native to their product.

In this on-demand webcast you’ll learn:

  • Key applications of Native Analytics and how companies like Triptease, Bluecore, and SketchUp use Native Analytics to deliver analytics to their users right within their products and drive ROI
  • Why ease of use and the right capabilities are crucial to your success
  • Key considerations for a successful Native Analytics implementation

How companies deliver embedded analytics and real-time reporting for their customers

Don’t forget to download the Native Analytics checklist.

Thinking about adding Native Analytics to your product or want to improve your existing implementation? Contact us for a free consultation!

Alexa Meyer

Growth and UX. Cheese chaser. Aspiring behavioral economist.

Announcing our new podcast: Data Science Storytime!


We’re excited to announce the debut of Data Science Storytime, a podcast all about data, science, stories, and time.

In Episode 1, Kyle Wild (Keen IO Co-founder and CEO) and I brainstorm the concept of the show, debate the difference between data science and non-data science, and recount the story of the action-hero data scientist who skipped a meeting with Kyle to rescue a little girl trapped on a mountain (or so he assumes).

Tune in for all this and plenty more as we consider the many ways data shapes our lives and activates our imagination, today and in the future.

If you like what you hear, make sure to subscribe to get a new episode every two weeks. And follow us on Twitter @dsstorytime. Thanks, and enjoy the show!

Kevin Wofsy

Teacher, traveler, storyteller

An Open Source Conversation with OpenAQ

OpenAQ logo

Last month, I sat down with Christa Hasenkopf and Joe Flasher from OpenAQ, one of the first open, real-time air quality data platforms, to talk about open environmental data, community building, analytics, and open source. I hope you enjoy the interview!


Taylor: Could you both tell me a little bit about yourselves, and how y'all got interested in environmental data?

Christa: I’m an atmospheric scientist, and my background for my doctoral work was on ‘air quality’ on a moon of Saturn, Titan. As I progressed through my career, I got more interested in air pollution here on Earth, and realized I could apply the same skills I’d gained in my graduate training to do something more Earth-centric.

That took Joe, my husband, and me to Mongolia, where I was doing research in one of the most polluted places in the world: Ulaanbaatar, Mongolia. As a side project, Joe and I worked together with colleagues at the National University of Mongolia to launch a little open air quality data project that measured air quality and then sent out the data automatically to Twitter and Facebook. It was such a simple thing, but the impact of that work felt way more significant to me than my research. It also seemed more impactful to the community we were in, and that experience led us down this path of being interested in open air quality data across the world. As we later realized, there are about 5-8 million air quality data points produced each day around the world by official or government-level entities, in disparate and sometimes temporary forms that aren’t easily accessible in aggregate.

Joe: I was trained as an astrophysicist, but I quickly moved into software development. So when Christa and I were living in Mongolia, I think we just sort of looked around, saw things that didn’t exist that we could make, and went ahead and made them. Open data was always something that seemed like the right thing to do, especially when it’s data that affects everyone, like air quality data. I think together we had the tools: I had the software development skills and Christa had the atmospheric science background to put things in place that could really help people.

Taylor: That’s awesome. Could you tell me more about the OpenAQ Project?

Christa: Basically what we do is aggregate air quality data from across the world and put it in one format in one place, so that anyone can access that data. And the reason we do that is because there is still a huge access gap between all of the real-time air quality data publicly produced across the world and the many sectors for the public good that could use these data. Sectors like public health research or policy applications, or an app developer who wants to make an app of global air quality data. Or even a low-cost sensor group that wants to measure indoor air quality and also know what the outdoor air quality is like, so you know when to open your windows if you live in a place like Dhaka, Bangladesh or Beijing, China. And so by putting the data in this universal format, many people can do all kinds of things with them.

Joe: Yeah, I think we’re just focused on two things. One is getting all the underlying air quality data collected in one place and making it accessible, and the main way to do that is with an API that people can build upon. And then we also have some of these other tools that Christa mentioned to help groups examine and look at the data, but meshing that with tools built by people in the community. Because I think the chances of building the best thing right away are very small. What we’re trying to do is make the data openly available to as many people as possible, because a lot of these solutions are based in local context in a community.

Taylor: That’s really cool. I have heard from other organizations that when you open up the data, you democratize the data because it’s available for the people.

I read the Community Impact document for the project and you had mentioned that some researchers from NASA and NSF and UNICEF are using the data from OpenAQ. I was wondering, what are some other cool applications of the data that you are seeing?

Christa: I think when we first started the project it was all about the data. It was all about collecting the data, getting as much data as we could. And as we went on, we realized, pretty quickly, it’s actually about the community we are building around it and the stuff that people are building. And so there are a few different pieces.

One thing we have seen is journalists taking OpenAQ-aggregated data to analyze air quality in their local communities. There is a journalist in Ulaanbaatar, Mongolia, who has published a few data-driven articles about air quality in Ulaanbaatar relative to Beijing. There are some developers who have built packages that make the data more accessible to people using different programming languages.

There is a statistician in Barcelona, Spain, who has built a package in R that makes the data very accessible in R and makes cool visualizations. This person made a visualization where she analyzed fireworks across the US on the Fourth of July. She did a time series, and you could see a map of the US, and as 9pm rolled around in the various time zones you can see air quality change across the US as the fireworks went off.

There is a developer in New Delhi, India, who has made a global air quality app and Facebook bot that compares air quality in New Delhi to other places or will send you alerts. We feel these usages point to the power of democratizing data. No one person or one entity can come up with all the possible use cases themselves, but when it’s put out there on a global basis, you’re not sure where it’s going to go.

Joe: We have also been used by some commercial entities to do weather modeling and pollution forecasting. Christa, there was an education use case right… Was it Purdue?

Christa: Yeah, a professor there is using it for his classroom to bring outdoor air quality data into indoor air quality models. Students pick a place around the world. They use outdoor air quality data from there to model what indoor air quality would look like, so they are not just modeling air quality in Seattle, which has pretty good air quality. They are also pulling in places like Jakarta or Dhaka, to see what air quality would be like indoors, based on the outdoor parameters.

Low cost sensor groups have contacted us because they are interested in getting their air quality data shared on our platform. These groups would like their data to be accessible in universal ways so that more people can do cool stuff with it too. Right now, for our platform, we have government-level data, some research-grade data, and a future direction we are hoping to move is low-cost sensors, too.

Taylor: As you have touched on, I read that OpenAQ has community members across four continents and has aggregated 16 million data points from 24 countries. I am curious, how were you able to grow the project to have all that data coming in?

Christa: We have a couple of ways of getting the word out about OpenAQ and getting people interested in their local community and engaged with the OpenAQ global community. One way is in person: we visit places that are facing what our community calls “air inequality” - extremely poor air quality in a given location - and we hold a workshop that convenes various people, not just scientists, not just software developers, but also artists, policy makers, people working in air quality monitoring within a given government, and educators. We focus on getting them all in the same room, working on ways they can use open data to advance fighting air inequality in their area.

So far, we’ve held a workshop in Ulaanbaatar, and we have had meetups in San Francisco and DC, since that’s where we’re based. We have also given presentations in the UK, Spain, and Italy. We are about to hold our next workshop in Delhi in November. We’re getting the word out through the workshops, the meetups, Twitter, and our Slack channel. Participation in the OpenAQ community has been growing organically, whether on the development end, pulling in more data, or in the application of the data. We tend to get more people interested in using the data once they are aggregated than in helping build ways to add in more data, which makes sense. We are always in need of more people to help build and improve the platform.

Joe: In the beginning it was very interesting how we decided to add in new sources - there are so many possible ones to add from different places. You could look at a map and see where we had been, because whenever we would go somewhere to give a presentation we would want to make sure we had local air quality data. So before we would give a presentation in the UK, we would make sure we had some UK data. Data has been added like that, and according to community interest in particular locations.

An interesting thing that we are able to do now with the Keen analytics is that we can look at what data people are requesting most; even if we don’t have the data, they might still be requesting it. So we can see from the analytics where we should focus on bringing in new data. It has been a very helpful way for us to be more data-driven when deciding what data to bring in.

Taylor: When you have a project that is an open source or an open data platform, your time becomes very valuable. You want to put your resources where they are needed most.

Joe: We want to be as data-driven as possible. And it’s hard for us to talk directly to all of the people who are using the data. I think we have a similar problem to anyone who opens up data completely. We don’t require anyone to sign up for anything. We have a lot more people using the data than we know about. We can see just from how many times the data is getting grabbed that it is popular. The analytics really help us tell something about those use cases, even if we don’t know of them specifically.

Taylor: Could you explain your use of Keen for everyone so they can understand how you are figuring that out?

Joe: The API is powered by a Node.js application that includes the Keen library. Every request that comes in goes to Keen and so we have a way to sift through it.

We don’t track any use, any sign-ups, any API keys or anything at the moment. We don’t see the addresses that the requests come from; they are anonymous. But we do get tons of data that we can look through, and that has been super helpful. It was just two lines of code that go into my API, and then all my requests come into Keen and I can handle all the queries there.

We do all the normal things that you would do: total counts of requests that are coming in, and endpoint usage statistics. This is also very interesting, we were looking at this the other day: not all our endpoints are equal, and our system has some that are much heavier computationally and have taken a lot more work to create. It’s interesting to look at how much they are getting hit versus how much effort we put into making them. We can see the most popular endpoints that we have, and then we can also see ones that aren’t used as much. This helps me figure out what to prioritize and how. We have a very database-request-heavy system. Knowing specifically the sort of queries that are coming in really helps us optimize the database to get the most out of it and make it more cost efficient.
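
(For readers curious what that kind of instrumentation looks like, here is a minimal sketch of recording each incoming API request to Keen from a Node.js service. It is illustrative only; OpenAQ’s actual middleware, event collection, and property names will differ.)

// Illustrative sketch: logging each API request to Keen from an Express app.
// The collection and property names are made up; no IP addresses are recorded.
var express = require('express');
var KeenTracking = require('keen-tracking');

var keen = new KeenTracking({
  projectId: 'YOUR_PROJECT_ID',
  writeKey: 'YOUR_WRITE_KEY'
});

var app = express();

app.use(function (req, res, next) {
  // Record the endpoint and query parameters for every request, anonymously
  keen.recordEvent('apiRequests', {
    endpoint: req.path,
    query: req.query
  });
  next();
});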

Taylor: That’s interesting that you were able to gauge how much effort you put into some of those endpoints and then look at their usage. When you don’t have that data, you are just guessing. It can also help you see that maybe there should be more education on some endpoints.

Why was it important to y'all for this platform to be open source?

Christa: So one of the major reasons we built this platform and made it open source is that we noticed a few groups who were gathering this sort of data, and the data themselves weren’t open, nor was it clear how they were gathered. There were a few efforts - some commercial, some unclear whether they were commercial or public, and some by researchers. Everyone was doing it in a different way, or it wasn’t entirely clear how it was being done. We saw a lot of efforts having to duplicate what another effort was doing because their work wasn’t open. So we thought that if someone just makes the data open, and also makes the platform itself open source and transparent, so it’s clear how we’re grabbing the data - that’s a huge reason to do it. The other reason was that when we first started this, there were just two of us in our little basement apartment. It’s a big project, and we knew we would need help. So making it open source was an obvious route to finding folks around the world interested in helping us.

Joe: I think the other piece here is that open source and free aren’t the same thing, but they are oftentimes lumped together. Beyond just open source, what we wanted was for the data to be freely available, because air pollution disproportionately affects people in developing countries. They are the ones who would generally have to pay for this data or wouldn’t have access to it at all. And so we wanted to break down that barrier and let everyone have access to the data and to making tools, without that being a roadblock.

Taylor: To end things, what is the most exciting thing about the project to each of y'all?

Christa: I think for me it’s definitely interacting with people in specific communities and sharing the data in the open. I love that, it’s the best.

Joe: For me it is definitely having people build something on top of it. As a developer, that’s the best feeling. In fact, at the first workshop we did in Mongolia, there was a developer who, just over the weekend, built a much better exploration interface for the data than what I had initially made. Which was great, right? So we used that, and pointed people to it over and over again, because it took us probably, I don’t know, six months until we finally rolled out a different exploration interface for the data. And that was made by one community member, and that was awesome.


I wanted to thank Christa and Joe for taking the time to talk to me about OpenAQ. I don’t know about you, but I learned a lot! It is a wonderful project that you should definitely check out.

Keen IO has an open source software discount that is available to any open source or open data project. We’d love to hear more about your project of any size and share more details about the discount. We’d especially like to hear about how you are using Keen IO or any analytics within your project. Please feel free to reach out to opensource@keen.io for more info.

(A cat typing email really quickly)

Taylor Barnett

developer, community builder, and huge fan of tacos

IoT Analytics over time with Keen and Scriptr

This is a guest post written by Ed Borden, Evangelist at Scriptr.io, VP Ads at Soofa.

A large part of Internet of Things applications typically involves management operations; you want to know what your assets are doing right now and if you need to react in some way.

I think about that ‘realtime’ domain as an area with a particular set of tools, challenges, and thought processes, and the ‘historical’ domain as another. In the historical domain of IoT, I think about what the high-value information will be in the form of questions, like:

  • How long was this parking space unoccupied last week?
  • Which truck in my fleet was in service the longest?
  • How long was this machine in power-saving mode?
  • What are the 5 best and worst performers in this group?

For these types of questions, Keen is my go-to. However, answering them takes a bit of a shift in your architecture design.

You might typically push events to Keen as they are happening, but if you are only pushing data based on the changes in state of a thing (as is the common model for asset tracking/management-type scenarios), you won’t have enough information to ask these types of questions since you need to know how long the thing has been in each state. So, when an event comes in:

  1. you need to cache the timestamp and state the thing is going into, and
  2. create an event based on the previous cached state that was just transitioned out of, which must include the “duration” of that state.

Once this is done, Keen really shines at the rest! You can simply do a “sum” query on the durations of events, filtering by groups of devices and timeframes.

The below snippet using Keen IO will tell you how long a parking space was occupied:

var timeOccupied = new Keen.Query("sum", {
   event_collection: "deviceUpdates",
   target_property: "duration",
   timeframe: "this_7_days",
   filters: [ 
      { operator: "eq",
        property_name: "hardwareId",
        property_value: hardwareId
      },        
      { 
        operator: "eq",
        property_name: "deviceState",
        property_value: "occupied"
      }
   ]
});

If you want to sum all of the parking spots on the street, give each event a “streetId” and filter by that instead of “hardwareId”.

The below snippet will tell you how many parking spaces were occupied longer than an hour (because street parking is limited to one hour and you want to know where the most violations are occurring):

var violationsOccurred = new Keen.Query("count", {
   event_collection: "deviceUpdates",
   target_property: "duration",
   timeframe: "this_7_days",
   filters: [ 
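      // Note: this assumes the "duration" property is stored in minutes;
      // if you record epoch-second differences (as in the Scriptr snippet
      // below), one hour would be 3600 instead of 60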
      { operator: "gt",
        property_name: "duration",
        property_value: 60
      },
      { 
        operator: "eq",
        property_name: "deviceState",
        property_value: "occupied"
      }
   ]
});

I could do this all day! That’s because once you have this sort of infrastructure in place, the sky really is the limit on the types of high-value information you can extract. And you did this all without managing any database infrastructure or API surface of your own?!

So, how do we implement the complete system? Here Keen can use a little help from an IoT service called Scriptr.io. Scriptr.io has a web-based IDE which lets you write some JavaScript, hit “Save”, and that code instantly becomes a hosted webservice with a URL. Using Scriptr.io’s fast local storage module and Keen connector, we can do some caching and light processing on that ‘in-flight’ datastream in a simple and expressive way that ALSO requires no devops/infrastructure! A match made in #NoDevOps heaven. It would look like this:

//Any POST body to your Scriptr script's URL can be accessed
//with the 'request' object
var eventData = JSON.parse(request.rawBody);

//The 'storage' object is a key/value store which we access with
//the current device's ID
var lastEventData = storage.local[eventData.hardwareId];

//Only record a duration event if we have a previously cached state for
//this device (the very first event has nothing to compare against)
if (lastEventData) {

  //In this example, we'll assume these are epoch times, otherwise we'd convert
  var eventDuration = eventData.timestamp - lastEventData.timestamp;

  //Add the duration to the last event data object which we'll push to Keen
  //(this becomes the "duration" property used in the queries above)
  lastEventData.duration = eventDuration;

  //This is the Scriptr.io -> Keen.io connector
  var keenModule = require('../../modules/keenio/keenioclient');
  var keen = new keenModule.Keenio("my_Keen_credentials");

  //Next, record the Keen event
  keen.recordEvent({
    collection: "deviceUpdates",
    data: lastEventData
  });
}

//Cache the current event by the device's ID
storage.local[eventData.hardwareId] = eventData;

Below, you can see this in the Scriptr IDE:


There you go – Big IoT Data! You can learn more about the Scriptr.io platform here or the Scriptr -> Keen connector here.

Ed Borden

Evangelist at Scriptr.IO, VP Ads at Soofa