During re:Invent 2016, Amazon launched Athena, a service that simplifies querying data stored in S3. It stood out as something that could be used to quickly get to grips with large log data sets or to carry out a quick hunt. It’s also just a really good idea.
To test Athena I used AWS VPC Flow Logs. CloudWatch allows you to stream logs to a Lambda function, and that function can write the log stream to a file in your bucket. I’m not going to describe the setup in detail; in summary: after you have enabled flow logging for your VPC, the logs start to appear in the CloudWatch console, and from there you can select to stream them to Lambda (select VPC Flow Logs as the format). I have included some example Python code at the end of this note to help you get a working function.
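A minimal sketch of such a function, assuming the standard CloudWatch Logs subscription event format (a base64-encoded, gzipped JSON payload); the bucket name and key prefix are placeholders you would replace with your own:

```python
import base64
import gzip
import json

# Assumption: placeholder bucket name; replace with your own.
BUCKET = "your-flow-log-bucket"

def decode_cloudwatch_event(event):
    # CloudWatch Logs delivers its payload base64-encoded and gzip-compressed.
    payload = gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
    data = json.loads(payload)
    # Each logEvent's message is one raw flow log record.
    return [e["message"] for e in data["logEvents"]]

def lambda_handler(event, context):
    # boto3 is available in the Lambda runtime; imported here so the
    # decode helper above stays usable outside AWS.
    import boto3
    lines = decode_cloudwatch_event(event)
    key = "flowlogs/%s.log" % context.aws_request_id
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=key, Body=("\n".join(lines) + "\n").encode("utf-8")
    )
```

Each invocation writes one object of newline-delimited flow records, which is a layout Athena can query directly.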
Now we can load up the Athena console and get started. The query editor accepts ANSI SQL (and Hive DDL), making for a shallow learning curve for anyone who has written SQL queries before.
First, let’s create a new database:
CREATE DATABASE goduncan;
Now we can create a table over our flow data in S3 (note: the LOCATION field needs to point at your S3 bucket):
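A sketch of what the DDL might look like, with the column names chosen to match the queries further down (src, dst, out_bytes) and a placeholder LOCATION; the field order follows the standard VPC Flow Log record, assumed here to be space-delimited:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS flow_logs (
  version INT,
  account STRING,
  interface_id STRING,
  src STRING,
  dst STRING,
  src_port INT,
  dst_port INT,
  protocol INT,
  packets INT,
  out_bytes BIGINT,
  start_time INT,
  end_time INT,
  action STRING,
  log_status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://your-flow-log-bucket/flowlogs/';
```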
This statement should create a table ‘flow_logs’ listed under the database in the left-hand column. If you select the eye icon to the right of the table name, Athena runs a simple select-all statement showing your flow records.
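That preview is equivalent to running something like:

```sql
SELECT * FROM flow_logs LIMIT 10;
```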
Here are a few example statements you can run on your flow logs:
Top destination hosts:
select dst, count(dst) as CountOf from flow_logs group by dst order by CountOf desc
Top source hosts:
select src, count(src) as CountOf from flow_logs group by src order by CountOf desc
Top flows by outbound bytes:
SELECT * FROM flow_logs ORDER BY out_bytes DESC limit 20