Streaming to Amazon S3

When you activate this integration, we stream your full-resolution, enriched event data to an Amazon Simple Storage Service (S3) bucket in your own AWS account.

This is useful for a few reasons:

  • Backups
    • This gives you an extremely cheap backup of your data, which is stored in gzip-compressed text files.
  • Routing events to third parties
  • SQL-based analysis
    • While Keen Compute is great for building analytics features into your products and analysis-driven automation into your workflows, nothing beats SQL for data exploration. We recommend combining our S3 Integration with Amazon Athena for this (a minimal example is sketched below).
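
Once events are landing in S3, you can point Amazon Athena at the bucket and query them with standard SQL. The sketch below uses boto3 and assumes you have already defined an Athena table over the streamed files (for example with an AWS Glue crawler); the table name "keen_events.pageviews", the bucket name "MyBucket", the results prefix, and the region are placeholders, not part of this integration.

import boto3

# Region and names below are placeholders -- adjust to your own setup.
athena = boto3.client("athena", region_name="us-east-1")

# Assumes an Athena table already exists over s3://MyBucket/<project_id>/
# (e.g. created by a Glue crawler); "keen_events.pageviews" is hypothetical.
response = athena.start_query_execution(
    QueryString="SELECT count(*) AS pageview_count FROM keen_events.pageviews",
    ResultConfiguration={"OutputLocation": "s3://MyBucket/athena-results/"},
)

# Athena runs queries asynchronously; poll the execution status before
# reading results from the output location.
status = athena.get_query_execution(QueryExecutionId=response["QueryExecutionId"])
print(status["QueryExecution"]["Status"]["State"])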

Note: If there are other cloud file systems you would like your enriched data streamed to, we would love to hear from you. We may already be in private beta with those services.

Configuring your Keen IO project

  1. If this feature isn’t already enabled for your project, reach out to us to have it turned on.
  2. Find the project you’d like to stream data to S3 from and navigate to its Streams page.
  3. Click on the “Configure S3 Streaming” button.
  4. Configure your S3 bucket permissions as described in the Keen to S3 Instructions below.
  5. Enter the name of the bucket you’d like your Keen IO data to flow into.
  6. Click the “Update” button.

Keen to S3 Instructions

  1. Sign in to the AWS Console and navigate to the S3 Console.
  2. Select the bucket you wish to use and ensure the Properties tab is selected.
  3. Expand the Permissions section and click “Add more permissions.”
  4. In the Grantee field enter ad6a62a1f25789760c5a581938a7ee06a865d0b95cc5b1b900d31170da42a48c
  5. Ensure List, Upload/Delete, and View Permissions are selected.
  6. Click the “Save” button.
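
If you prefer to script this grant rather than click through the console, the same access can be applied with the AWS SDK. A minimal sketch using boto3 is shown below, assuming your bucket is named "MyBucket"; the console’s List, Upload/Delete, and View Permissions checkboxes correspond to the READ, WRITE, and READ_ACP ACL permissions.

import boto3

s3 = boto3.client("s3")
bucket = "MyBucket"  # placeholder -- use your own bucket name

# The canonical user ID from step 4 above.
KEEN_CANONICAL_ID = "ad6a62a1f25789760c5a581938a7ee06a865d0b95cc5b1b900d31170da42a48c"

# put_bucket_acl replaces the whole ACL, so start from the existing grants.
acl = s3.get_bucket_acl(Bucket=bucket)
grants = acl["Grants"]
grantee = {"Type": "CanonicalUser", "ID": KEEN_CANONICAL_ID}

# List -> READ, Upload/Delete -> WRITE, View Permissions -> READ_ACP
for permission in ("READ", "WRITE", "READ_ACP"):
    grants.append({"Grantee": grantee, "Permission": permission})

s3.put_bucket_acl(
    Bucket=bucket,
    AccessControlPolicy={"Grants": grants, "Owner": acl["Owner"]},
)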

How your data streams to S3

Data is streamed to S3 in fixed time increments (by default, every 5 minutes). Assuming your S3 bucket is named “MyBucket”, the bucket/key structure looks like this:

MyBucket/<project_id>/<ISO-8601_timestamp>/<event_collection>/<project_id>-<event_collection>-<ISO-8601_timestamp>.json.gz

An example structure looks like this:

MyBucket
└── 530a932c36bf5a2d230
    ├── 2014-01-01T00:05:00.000Z
    │   ├── pageviews
    │   │   └── 530a932c36bf5a2d230-pageviews-2014-01-01T00:05:00.000Z.json.gz
    │   ├── signups
    │   │   └── 530a932c36bf5a2d230-signups-2014-01-01T00:05:00.000Z.json.gz
    │   └── logins
    │       └── 530a932c36bf5a2d230-logins-2014-01-01T00:05:00.000Z.json.gz
    └── 2014-01-01T00:10:00.000Z
        ├── pageviews
        │   └── 530a932c36bf5a2d230-pageviews-2014-01-01T00:10:00.000Z.json.gz
        ├── signups
        │   └── 530a932c36bf5a2d230-signups-2014-01-01T00:10:00.000Z.json.gz
        └── logins
            └── 530a932c36bf5a2d230-logins-2014-01-01T00:10:00.000Z.json.gz
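
Given this layout, downstream jobs can list and read a window of events directly. The sketch below uses boto3 and assumes each .json.gz file contains gzip-compressed, newline-delimited JSON events (check a sample file from your own bucket to confirm); the bucket, project ID, timestamp, and collection are the example values from the tree above.

import boto3
import gzip
import json

s3 = boto3.client("s3")
bucket = "MyBucket"  # example values from the tree above -- substitute your own
prefix = "530a932c36bf5a2d230/2014-01-01T00:05:00.000Z/pageviews/"

# List every file written for this collection in this 5-minute window.
listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in listing.get("Contents", []):
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
    # Assumes gzip-compressed, newline-delimited JSON events per file.
    for line in gzip.decompress(body).splitlines():
        print(json.loads(line))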

Error Scenarios

There may be times when our system cannot write all of the events for a given time period. This can be caused by network latency, a third-party system failure, or a complication within our own system. To account for this, we need the ability to update previously written folders. When this happens, we add new keys to the bucket with an incremental suffix (.1, .2, and so on) and record a pointer to each additional key in the batches folder, shown in the example below.

An example of a bucket after such an update:

MyBucket
└── 530a932c36bf5a2d230
    ├── 2014-01-01T00:05:00.000Z
    │   ├── pageviews
    │   │   ├── 530a932c36bf5a2d230-pageviews-2014-01-01T00:05:00.000Z.json.gz
    │   │   └── 530a932c36bf5a2d230-pageviews-2014-01-01T00:05:00.000Z.json.gz.1
    │   ├── signups
    │   │   └── 530a932c36bf5a2d230-signups-2014-01-01T00:05:00.000Z.json.gz
    │   └── logins
    │       └── 530a932c36bf5a2d230-logins-2014-01-01T00:05:00.000Z.json.gz
    ├── 2014-01-01T00:10:00.000Z
    │   ├── pageviews
    │   │   └── 530a932c36bf5a2d230-pageviews-2014-01-01T00:10:00.000Z.json.gz
    │   ├── signups
    │   │   └── 530a932c36bf5a2d230-signups-2014-01-01T00:10:00.000Z.json.gz
    │   └── logins
    │       └── 530a932c36bf5a2d230-logins-2014-01-01T00:10:00.000Z.json.gz
    └── batches
        └── 2014-01-01T00:12:12.123Z-pageviews

The name of the file placed in the “batches” folder contains the timestamp at which the new file was added to an existing timeframe, along with the event collection name. The file’s contents are the fully qualified key of the additional data.

In this example, the contents would be:

530a932c36bf5a2d230/2014-01-01T00:05:00.000Z/pageviews/530a932c36bf5a2d230-pageviews-2014-01-01T00:05:00.000Z.json.gz.1
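
If you process these files downstream, the batches folder is how you pick up late-arriving data. A minimal sketch, assuming the layout shown above (batch markers under the project ID, each containing the fully qualified key of the extra file) and newline-delimited JSON contents:

import boto3
import gzip

s3 = boto3.client("s3")
bucket = "MyBucket"  # placeholder -- use your own bucket name

# Batch markers sit alongside the timestamp folders, as in the example above.
markers = s3.list_objects_v2(Bucket=bucket, Prefix="530a932c36bf5a2d230/batches/")

for marker in markers.get("Contents", []):
    # Each marker's contents are the fully qualified key of the late file.
    late_key = s3.get_object(Bucket=bucket, Key=marker["Key"])["Body"].read().decode().strip()
    late_file = s3.get_object(Bucket=bucket, Key=late_key)["Body"].read()
    # Assumes newline-delimited JSON, as in the earlier sketch.
    events = gzip.decompress(late_file).splitlines()
    print(late_key, len(events), "late events")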