Amazon Athena
Table of contents
General
Amazon Athena is a serverless interactive query service, built on top of Presto, for accessing any data (relational, non-relational, structured, semi-structured or unstructured) directly from S3 (in different S3-tier levels) via Glue Catalog, on-premises or from multicloud environment.
Supports many data formats, such as: CSV and JSON as human readable formats; ORC and Parquet as columnar and splittable (parallelization) formats and Avro a non-columnar but splittable format.
It is compatible with QuickSight and other tools via ODBC/JDBC and it is possible to do cross-account data access.
Athena Workgroups
It is a feature for better management of Athena Users/Teams/Applications, as it can control query access, track costs, set IAM policies, Data limits, Encryption settings and keep a query history per Workgroup.
Workgroups integrate with IAM, SNS and CloudWatch.
ACID Transactions
Thanks to Apacha Iceberg, Athena supports transactional tables with time travel features. To create a table using Iceberg, just add 'table_type' = 'ICEBERG'
.
These tables are compatible with EMR, Spark, Glue, etc.
With time these tables may get slower, so it is recommended to perform a periodic compaction. To do that, run the command:
OPTIMIZE table_name REWRITE DATA
USING BIN_PACK
WHERE catalog = 'catalog_name';
CTAS - Create Table as Select
Used to create a table from the result of a SQL Query. Can be useful for creating a Backup table, a smaller table with a subset of data or even for converting the file data type.
For example:
CREATE TABLE new_table_name
WITH (
format='Parquet',
write_compression='SNAPPY',
external_location='s3://path/to/file'
)
AS SELECT *
FROM old_table_name;
Fine-Grained Access Control
The access control in Athena is not as restrictive as if you were using AWS Lake Formation. For example, it is not possible to restrict column/row access.
It is possible to allow or disallow some operations DDL/DB Management instructions though, such as ALTER
or CREATE
, DROP
, MSCK REPAIR TABLE
etc. However, it is not straight forward, as you will need to block/allow some/all the IAM Actions required for the particular operation.
Cost and Performance
It is a pay-as-you-go service, charged based on bytes of Data. This means failed queries or DDL instructions are free, but successful or cancelled queries are not.
S3 data transfer charges apply if the S3 Bucket is in a different region than the Athena Query. The data transferred data may be bigger than the query result.
The best way of saving on cost and increasing performance is compressing the data (with GZIP, for example), using columnar and splittable file formats, such as ORC and Parquet, partitioning the data and using a smaller number of large files instead of a large number of small files.
Anti-patterns
Athena is not recommended for:
- Highly formatted reports;
- ETL.
Athena Federated Query
Athena federated query uses a custom connector, which is basically a Lambda Function, to connect to any data source. It can also be used for creating pipelines and storing data in S3.
Write operations are not supported with Federated Queries.
Amazon Athena for Apache Spark
It is selectable as an alternate analytics engine and allows Jupyter notebooks with Spark within the Athena Console.
It is serverless and is priced based on compute usage and DPU per hour.