Wednesday, August 8, 2018

AWS Athena Vs Redshift...

In recent years data warehouse architecture has a huge shift towards Cloud-based.

Here are the few benefits of cloud-based data warehousing comparing to traditional on-premise: Scalability, Cost, Performance and Time to market.

I will try to cover the concepts/considerations between cloud vs on-premise in another blog post.

I am already working on Google BigQuery and thought of spending some time in AWS offerings in data warehousing arena and here is what I have learned so far.

AWS offers Redshift as their main data warehousing capability. But also comes Athena.

Athena is a serverless service and does not need any infrastructure to create, manage, or scale data sets. It works directly on top of Amazon S3 data sets. It creates external tables and therefore does not manipulate S3 data sources, working as a read-only service from an S3 perspective. Athena uses Presto and ANSI SQL to query on the data sets. It also uses HiveQL for DDL statements.

On the other hand, Redshift is a petabyte-scale data warehouse used together with business intelligence tools for modern analytical solutions. Unlike Athena, Redshift requires a cluster for which we need to upload the data extracts and build tables before we can query. Redshift is based on PostgreSQL 8.0.2.

Comparing Athena to Redshift is not simple. Athena has an edge in terms of portability and cost, whereas Redshift stands tall in terms of performance and scale.


Component Athena Redshift
Serverless  Yes. Athena works with the data hosted on S3. No. Need to setup AWS Cluster(s)
Query  Athena uses Presto and ANSI SQL to query on the data sets. It also uses HiveQL for DDL. Redshift is based on Postgres 8.0.2
Datafile Format CSV, JSON (Both simple and nested),
Columnar Storage, ORC and Parquet.
CSV, JSON (Both simple and nested), TSV and Apache logs.
Compression Supports all compressed formats except LZO (Snappy can be used instead) Supports All compression methods
Data Types Supports complex data types like:
Arrays, maps, and structures.
No Support for Complex data types.
Partitioning Limit on Singly key or multi-column partitioning is available.
20,000 per Table
Distribution Style. There are 3 styles available when creating a table:
1] Even - This is the Default Distribution style.
The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.
EVEN distribution is appropriate when a table does not participate in joins or when there is not a clear choice between KEY distribution and ALL distribution.
2] Key - The rows are distributed according to the values in one column.
The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together.
3] All - A copy of the entire table is distributed to every node.
Where EVEN distribution or KEY distribution place only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for every join that the table participates in.
ALL distribution is appropriate only for relatively slow moving tables; that is, tables that are not updated frequently or extensively.
UDF
(User Defined Functions)
No Yes (but not with network calls using UDF's)
Primary Key Yes but only as logical as it depends on the data loaded on S3. NO
Pricing Charges for the amount data scanned during Query Execution. Scanned data is rounded to the nearest 10MB. No charge for Failed Queries. Charges depends on the cluster(s) hosted.
Row Size No Limit 4 MB is the max size of any row in a data source
Max Number of Tables One Database can contain max of 100 tables.
Max number of Database is capped at 100.
Max number of tables per cluster is 9900 (including temporary tables).
Max Number of Schemas   9900 per cluster
SerDe (Serializer and Deserializer) Yes No.
External Table Only External Table is allowed when creating the table as the data is only referred from S3 as a table. Allowed in Redshift Spectrum.
A table can be either External (from S3) or local to Redshift.
Integration with BI Tools Using JDBC it can integrate with BI Tools or SQL Clients. Or with QuickSight for easy visualizations. Redshift can be integrated with Tableau, Informatica, Microstrategy, Pentaho, SAS and other BI Tools.
QuickSight to connect to Redshift instance, you must create a new security group for that instance.