Amazon S3 Integration

Stream inbound event data into an S3 bucket.


When you activate this integration, we will stream your full-resolution, enriched event data to an Amazon Simple Storage Service (S3) bucket in your own AWS account.

This is useful for a few reasons:

  • Backups
    • This gives you an extremely cheap backup of your data, which is stored in gzip-compressed text files.
  • Routing events to third parties
  • SQL-based analysis
    • While Keen Compute is great for building analytics features into your products and analysis-driven automation into your workflows, nothing beats SQL for data exploration. We recommend combining our S3 Integration with Amazon Athena to accomplish this (a sketch of this approach follows this list).
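
As a minimal sketch of the Athena approach, assuming you have already created an Athena database and table over the streamed bucket (the keen_events database, pageviews table, and results prefix below are hypothetical and must match your own setup), a query can be started with boto3:

import boto3

# Hypothetical names: the "keen_events" database, "pageviews" table, and the
# results prefix must match your own Athena setup over the streamed bucket.
athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS events FROM pageviews",
    QueryExecutionContext={"Database": "keen_events"},
    ResultConfiguration={"OutputLocation": "s3://MyBucket/athena-results/"},
)

# Athena runs asynchronously; poll the execution ID until the query finishes,
# then read the results from the output location.
execution_id = response["QueryExecutionId"]
status = athena.get_query_execution(QueryExecutionId=execution_id)
print(status["QueryExecution"]["Status"]["State"])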

Note: If there are other cloud storage services you would like us to stream your enriched data to, we would love to hear from you. We may already be in a private beta with those services.

Configuring your Keen IO project

  1. If this feature isn’t already enabled for your project, reach out to us to have it turned on.
  2. Find the project you’d like to stream data from and navigate to its Streams page.
  3. Click on the “Configure S3 Streaming” button.
  4. Configure your S3 bucket as described in the Keen to S3 Instructions below.
  5. Enter the bucket you’d like your Keen IO data to flow into.
  6. Click the “Update” button.

Keen to S3 Instructions

Please ensure that your S3 bucket resides in a region that supports Signature Version 2. If you prefer to script the console steps below, a sketch using the S3 API follows this list.

  1. Sign in to the AWS Console and navigate to the S3 Console.
  2. Select the bucket you wish to use.
  3. Select the Permissions section and click “Add users.”
  4. In the ID field enter ad6a62a1f25789760c5a581938a7ee06a865d0b95cc5b1b900d31170da42a48c
  5. Ensure “Read” and “Write” are checked for Object Access, and “Read” and “Write” are checked for Permissions Access.
  6. Click the “Save” button. The new user should show as dan.
  7. Enter your bucket name below and click “Save”.
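
If you would rather script these steps, here is a minimal sketch using boto3 (the bucket name is hypothetical; the canonical user ID is the one from step 4). It mirrors steps 2–6 by appending the grants to the bucket’s existing ACL:

import boto3

# Canonical user ID from step 4 above; the bucket name is hypothetical.
KEEN_CANONICAL_ID = "ad6a62a1f25789760c5a581938a7ee06a865d0b95cc5b1b900d31170da42a48c"
BUCKET = "MyBucket"

s3 = boto3.client("s3")

# Fetch the current ACL first: put_bucket_acl replaces the bucket ACL, so the
# existing owner and grants must be carried over.
acl = s3.get_bucket_acl(Bucket=BUCKET)
grants = acl["Grants"]

# "Read" + "Write" for Object Access map to READ/WRITE; "Read" + "Write" for
# Permissions Access map to READ_ACP/WRITE_ACP.
for permission in ("READ", "WRITE", "READ_ACP", "WRITE_ACP"):
    grants.append({
        "Grantee": {"Type": "CanonicalUser", "ID": KEEN_CANONICAL_ID},
        "Permission": permission,
    })

s3.put_bucket_acl(
    Bucket=BUCKET,
    AccessControlPolicy={"Grants": grants, "Owner": acl["Owner"]},
)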

(Screenshot: AWS Console and S3 user interface)

How your data streams to S3

Data is streamed to S3 in fixed time increments (default is every 5 minutes). Assuming your S3 bucket is called “MyBucket”, the bucket/key structure will look as follows:

MyBucket/<project_id>/<ISO-8601_timestamp>/<event_collection>/<project_id>-<event_collection>-<ISO-8601_timestamp>.json.gz

An example structure looks like this:

MyBucket
└── 530a932c36bf5a2d230
    ├── 2014-01-01T00:05:00.000Z
    │   ├── pageviews
    │   │   └── 530a932c36bf5a2d230-pageviews-2014-01-01T00:05:00.000Z.json.gz
    │   ├── signups
    │   │   └── 530a932c36bf5a2d230-signups-2014-01-01T00:05:00.000Z.json.gz
    │   └── logins
    │       └── 530a932c36bf5a2d230-logins-2014-01-01T00:05:00.000Z.json.gz
    └── 2014-01-01T00:10:00.000Z
        ├── pageviews
        │   └── 530a932c36bf5a2d230-pageviews-2014-01-01T00:10:00.000Z.json.gz
        ├── signups
        │   └── 530a932c36bf5a2d230-signups-2014-01-01T00:10:00.000Z.json.gz
        └── logins
            └── 530a932c36bf5a2d230-logins-2014-01-01T00:10:00.000Z.json.gz
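
To read one of these files back, a minimal sketch with boto3 might look like the following (the bucket, project ID, and timestamp are the example values from the tree above, and each file is assumed to contain one JSON event per line):

import boto3
import gzip
import json

# Example values from the tree above; adjust to your own bucket and project.
BUCKET = "MyBucket"
PREFIX = "530a932c36bf5a2d230/2014-01-01T00:05:00.000Z/pageviews/"

s3 = boto3.client("s3")

# List the file(s) written for this time increment and event collection,
# then decompress each one and parse the events.
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in listing.get("Contents", []):
    body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
    for line in gzip.decompress(body).splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        print(event)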

Error Scenarios

There may be times when our system cannot write all of the events for a given time period. This can be caused by network latency, a third-party system failure, or a complication within our own system. To account for this, we need the ability to update previously written folders. When this happens, we will add new keys to the bucket with an additional incremental suffix, and we will also update the batches folder (shown in the example below) with the additional key.

An example of a bucket in this state:

MyBucket
└── 530a932c36bf5a2d230
    ├── 2014-01-01T00:05:00.000Z
    │   ├── pageviews
    │   │   ├── 530a932c36bf5a2d230-pageviews-2014-01-01T00:05:00.000Z.json.gz
    │   │   └── 530a932c36bf5a2d230-pageviews-2014-01-01T00:05:00.000Z.json.gz.1
    │   ├── signups
    │   │   └── 530a932c36bf5a2d230-signups-2014-01-01T00:05:00.000Z.json.gz
    │   └── logins
    │       └── 530a932c36bf5a2d230-logins-2014-01-01T00:05:00.000Z.json.gz
    ├── 2014-01-01T00:10:00.000Z
    │   ├── pageviews
    │   │   └── 530a932c36bf5a2d230-pageviews-2014-01-01T00:10:00.000Z.json.gz
    │   ├── signups
    │   │   └── 530a932c36bf5a2d230-signups-2014-01-01T00:10:00.000Z.json.gz
    │   └── logins
    │       └── 530a932c36bf5a2d230-logins-2014-01-01T00:10:00.000Z.json.gz
    └── batches
        └── 2014-01-01T00:12:12.123Z-pageviews

The file placed in the “batches” folder is named with the timestamp at which the new file was added to an existing timeframe. Its contents are the fully qualified key for the additional data.

In this example, the contents would be:

530a932c36bf5a2d230/2014-01-01T00:05:00.000Z/pageviews/530a932c36bf5a2d230-pageviews-2014-01-01T00:05:00.000Z.json.gz.1
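
A downstream consumer that needs to pick up this late-arriving data might watch the batches folder and fetch the key each marker points to. A minimal sketch, using the example bucket and marker key above:

import boto3

# Example values from the scenario above; adjust to your own bucket and project.
BUCKET = "MyBucket"
MARKER_KEY = "530a932c36bf5a2d230/batches/2014-01-01T00:12:12.123Z-pageviews"

s3 = boto3.client("s3")

# The marker file's contents are the fully qualified key of the additional file.
marker = s3.get_object(Bucket=BUCKET, Key=MARKER_KEY)
late_key = marker["Body"].read().decode("utf-8").strip()

# Fetch the late-arriving batch (here, the .json.gz.1 file) for reprocessing.
late_file = s3.get_object(Bucket=BUCKET, Key=late_key)["Body"].read()
print(f"Fetched {late_key} ({len(late_file)} bytes)")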