Support Questions

Adda_Fuentes2 · ‎10-03-2016

Hi, I recently began looking into HAWQ and I have seen on a few sites that its performance, based on its MPP implementation, has been compared against Presto. Could someone help explain how the two compare with one another?

gkeys · ‎10-03-2016

Both are MPP (massively parallel processing) databases designed to query large volumes of data (e.g. petabytes) with relatively fast response times (e.g. seconds), though performance depends heavily on server scaling, query type, database design and table design.

HAWQ is an open source Apache project -- its roadmap is driven by the Apache community and it can be deployed on the Hortonworks HDP Hadoop platform as well as on other Hadoop platforms. You can download HAWQ and install it on the Hortonworks sandbox (or on a full cluster).

Presto is Apache-licensed but not an Apache project -- its roadmap is driven by Teradata and not the community.

Presto has the advantage of being able to query data both inside and outside HDFS, whereas HAWQ is confined to HDFS or to tables built on HDFS, which are optimized using the Parquet file format.
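To make the "inside and outside HDFS" point concrete: Presto addresses every table as catalog.schema.table, so one query can join sources that live in different systems. Here is a minimal sketch that only builds such a query string. The catalog, schema, table and column names (hive, mysql, web, crm, clicks, customers, customer_id, region) are hypothetical and depend entirely on how your Presto connectors are configured.

```python
def qualified(catalog, schema, table):
    """Return a Presto-style fully qualified table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join_sql():
    """Build a query joining an HDFS-backed table with a table outside HDFS.

    All names are placeholders chosen for illustration only.
    """
    clicks = qualified("hive", "web", "clicks")          # assumed: data on HDFS via a Hive catalog
    customers = qualified("mysql", "crm", "customers")   # assumed: data in an external RDBMS catalog
    return (
        "SELECT c.region, count(*) AS click_count\n"
        f"FROM {clicks} k\n"
        f"JOIN {customers} c ON k.customer_id = c.customer_id\n"
        "GROUP BY c.region"
    )


query = federated_join_sql()
print(query)
```

The resulting string would be submitted through whichever Presto client you already use; nothing here talks to a cluster.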

For queries against Hadoop, Presto is not natively YARN-enabled, but you can integrate it with YARN via Apache Slider. HAWQ is natively YARN-enabled.

HAWQ is 100% PostgreSQL compliant (e.g. you can use pgAdmin against it), whereas Presto offers extensive ANSI SQL support but is not 100% compatible.
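Since HAWQ accepts PostgreSQL clients such as pgAdmin, a stock PostgreSQL driver should be able to connect to it as well. This is a minimal sketch, assuming the psycopg2 driver is installed. The host, database and user names are hypothetical, and the port 5432 is an assumption too, so use whatever your HAWQ master actually listens on.

```python
def build_dsn(host, dbname, user, port=5432):
    """Build a libpq-style connection string (all values are placeholders)."""
    return f"host={host} port={port} dbname={dbname} user={user}"


def server_version(dsn):
    """Connect with a standard PostgreSQL driver and return the server version string."""
    import psycopg2  # assumed installed; imported lazily so build_dsn works without it

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT version()")
            return cur.fetchone()[0]
    finally:
        conn.close()


# Hypothetical connection details for illustration only.
dsn = build_dsn("hawq-master.example.com", "analytics", "analyst")
```

Calling server_version(dsn) against a reachable cluster would return whatever version banner the server reports.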

HAWQ is generally faster than Presto, though results vary with the factors noted above (scaling, query type, database and table design).

HAWQ has a MADlib data science and machine learning plugin that lets you run complex data science as functions inside your SQL queries, directly against your database.
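As an illustration of what "functions inside your SQL queries" looks like, here is a small sketch that builds a call to MADlib's linear regression training function, madlib.linregr_train(source_table, out_table, dependent_varname, independent_varname). The table and column names (houses, houses_model, price, tax, bath, size) are hypothetical, and the sketch assumes MADlib is installed in the madlib schema of your database.

```python
def linregr_train_sql(source_table, out_table, dependent, independent):
    """Build a SQL statement that trains a linear regression model with MADlib.

    Values are interpolated directly into the statement, so pass only
    trusted identifiers here; this is an illustration, not a safe query builder.
    """
    return (
        "SELECT madlib.linregr_train("
        f"'{source_table}', '{out_table}', '{dependent}', '{independent}');"
    )


# Hypothetical table and column names.
sql = linregr_train_sql("houses", "houses_model", "price", "ARRAY[1, tax, bath, size]")
print(sql)
```

You would run the resulting statement through any PostgreSQL client connected to HAWQ, then query the output table for the fitted coefficients.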

Diving deeper to compare the two is a much more complex topic. For example they differ significantly in their architecture and scaling strategies.

Note that Hive has made great strides in recent years and months and is approaching HAWQ in its query response times. This is largely due to the focus the Apache community (and Hortonworks) has given to optimizing Hive for the ORC file format, LLAP in-memory caching and cost-based optimization.

Use the following for general resources and for taking a much deeper dive. You could probably stay up all night discussing this question from a technical deep-dive perspective.

HAWQ

http://hawq.incubator.apache.org/
http://hortonworks.com/apache/hawq/
https://blog.pivotal.io/big-data-pivotal/products/pivotal-hawq-benchmark-demonstrates-up-to-21x-fast...

Presto

http://www.teradata.com/products-and-services/presto-download/?pcid=Google_Presto-US-EN-GGL-BMM_paid...
http://siliconangle.com/blog/2015/06/09/teradata-adopts-presto-for-hadoop-sql-queries/

Hive

http://hortonworks.com/apache/hive/
https://cwiki.apache.org/confluence/display/Hive/Home
