Support Questions


Which way is best when using Hive to analyse S3 data?

Explorer

I am working on a project to analyse S3 data with Hive. I've found there are different ways to let Hive operate on S3 data:

  1. Use S3 as the default file system, replacing HDFS.
  2. Create Hive tables pointing directly to the S3 data.
  3. Load the S3 data into HDFS first, create Hive tables for analysis, then load the results back to S3.

I am wondering which approach is most commonly used to analyse S3 data when performance is a big concern, because the S3 data might be very large.

1 ACCEPTED SOLUTION


If your data is already in S3, then I would suggest you use Spark to analyze it. You can use the S3A filesystem client to process the data from S3 and use SparkSQL for SQL analysis on the data, similar to how you would with Hive.

Using S3 over HDFS is for very specific use cases:

1. You want very cheap storage

2. You don't have a long running Hadoop cluster with HDFS

3. You want to access the data anytime and from anywhere

The S3A filesystem client provides very fast access to data in S3 buckets. The Hortonworks Data Cloud in the AWS Marketplace has an enhanced S3A filesystem client for even faster performance. There are different types of clusters you can spin up in the Hortonworks Data Cloud based on your workload: ETL, Data Warehouse, or Data Science. These clusters are pre-configured and tuned to run Apache Spark and Hive, so you don't have to set them up yourself. Just pick the cluster type you want and your EC2 instance types, and the cluster is launched for you. You use S3 storage with the Hortonworks Data Cloud to separate storage from compute, since clusters launched in the Hortonworks Data Cloud are meant to be on-demand/ephemeral rather than long running.
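For reference, a minimal sketch of that pattern with SparkSQL might look like the following (Spark 2.x API; the bucket, paths, and column names are placeholders, and it assumes S3A credentials are already configured on the cluster):

```scala
import org.apache.spark.sql.SparkSession

// Bucket, paths, and column names below are placeholders.
val spark = SparkSession.builder()
  .appName("S3ASparkSQLExample")
  .getOrCreate()

// Read directly from S3 through the S3A client (credentials assumed configured).
val events = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/raw/events/")

// Register a view and analyse it with SparkSQL, much as you would in Hive.
events.createOrReplaceTempView("events")
spark.sql(
  """SELECT event_date, COUNT(*) AS cnt
    |FROM events
    |GROUP BY event_date""".stripMargin
).show()
```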


7 REPLIES

Explorer

Thank you very much for your valuable suggestion. I am going to try SparkSQL. Actually, I am now facing a couple of problems getting Hadoop tools (Sqoop, Hive) to work with S3 data. Hopefully, I can get some further guidance from you.

My project is to integrate data from different sources (i.e. S3 and databases) in order to analyse it. My customer wants to store everything in S3 and use Hadoop as an ephemeral cluster. I am now trying to figure out the following things: 1) how to transfer data from a database to S3, and 2) how to analyse data that is already in S3.

I've already done some experiments as listed below:

1) Using Sqoop to transfer data from a database to S3 directly (but it failed). My question is: "Is Sqoop a good choice in this scenario?"

2) Querying data and writing it into a Hive table pointing to S3 (but this is very slow when there are many partitions). My test was based on the Hive benchmark data "tpcds_bin_partitioned_orc_10". I guess it is because there are many sub-directories and many small files in each directory, which makes writing the data to S3 very costly. I am going to try SparkSQL as you suggested, but I am still curious which way performs better when using Hive with S3 data: loading the S3 data into HDFS for Hive analysis and then saving the results back to S3, or analysing a Hive table pointing directly at the S3 data.

Thank you very much for your time.


Performance will be optimized when you colocate storage and compute - i.e. store in HDFS on a cluster using direct-attached storage and access it via Spark or Hive. In this type of architecture, Spark or Hive on HDFS will outperform Spark or Hive on S3, because the data does not travel far when being read from disk and loaded into memory. On the other hand, there are economies of scale with S3 (an object store) that can outweigh the performance benefits of a block-storage file system like HDFS. Doing analysis directly on the data in HDFS will give better performance. In fact, I have tested this; please see here: https://community.hortonworks.com/content/kbentry/53271/performance-of-spark-on-hdphdfs-vs-spark-on-...

You could use Sqoop or Spark to extract from a relational database and store the data in S3; both will work. Sqoop by default will launch 4 parallel tasks (mappers) against the source table. With Spark, you could run in local mode and specify the number of threads to use, local[n], or use as many threads as there are processors available to the JVM, local[*].
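For illustration, a Spark version of such an extract might look roughly like this (the JDBC URL, table, credentials, and bucket below are made-up placeholders; the matching JDBC driver jar needs to be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// local[*] uses as many threads as there are processors available to the JVM.
val spark = SparkSession.builder()
  .appName("JdbcToS3Example")
  .master("local[*]")
  .getOrCreate()

// Placeholder JDBC URL, table, and credentials. numPartitions plays the same
// role as Sqoop's parallel mappers.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales")
  .option("dbtable", "public.orders")
  .option("user", "etl_user")
  .option("password", "etl_password")
  .option("partitionColumn", "order_id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "4")
  .load()

// Land the extracted table in S3 (ORC keeps it efficient for later Hive/Spark queries).
orders.write.mode("overwrite").orc("s3a://my-bucket/landing/orders/")
```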

Apache NiFi is something that could also work. There are processors to extract from a relational database either single threaded or multi-threaded and write to S3. With NiFi you create a DataFlow via a GUI for rapid application development.

Explorer

Thank you very much!

Explorer

Hi Binu,

Thank you for your advice. I've done some experiments based on Hortonworks' Hive benchmark to compare the performance of Hive and Spark for analysing S3 data. I assume that both methods need to load the S3 data into HDFS and create Hive tables pointing to the HDFS data. The reason I also create Hive tables for Spark is that I want to use HiveQL, and I don't want to write too much code to register temp tables for Spark.

I observed the following things:

  • For tpcds_10GB, loading the S3 text table into HDFS took 233 seconds, which is acceptable.
  • Creating the ORC table and analysing the tables took a very long time (more than one hour, so I terminated it manually), which is unacceptable.
  • Executing query12.sql: Hive text table (17.64 secs), Hive ORC table (6.013 secs), Spark (45 secs). There are also some cases where Spark outperforms Hive (e.g. query15.sql).

My questions:

  • As analysing the ORC tables takes a very long time, is there a way to avoid re-analysing the tables when loading data from S3 into HDFS? If there is no way to avoid this long table-optimisation step, I might not be able to use the Hive method, because my project has many tables and all of them are very large.
  • Should I always use HiveContext rather than SQLContext? I find that when I use the SQLContext class, some of my Hive scripts can't execute (see the sketch below for the kind of script I mean).
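To make the second question concrete, this is roughly the pattern I have in mind (Spark 1.x API, with a placeholder query against the benchmark schema); with SQLContext, queries against the Hive metastore tables fail, while HiveContext accepts them:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Spark 1.x style; HiveContext can see the Hive metastore, SQLContext cannot.
val sc = new SparkContext(new SparkConf().setAppName("HiveContextExample"))
val hiveContext = new HiveContext(sc)

// Placeholder database/query from the TPC-DS benchmark schema.
hiveContext.sql("USE tpcds_bin_partitioned_orc_10")
val result = hiveContext.sql(
  """SELECT i_item_id, SUM(ws_ext_sales_price) AS total_sales
    |FROM web_sales JOIN item ON ws_item_sk = i_item_sk
    |GROUP BY i_item_id""".stripMargin)
result.show()
```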

Looking forward to your reply! Thank you very much!


1. You should use the latest version of HDP or HDCloud that you can, to get the speedups on S3A reads and writes. HDP 2.5 has the read-pipeline speedup, but not the listing code (used in partitioning) or the write pipeline.

2. Write your data back to HDFS, then, at the end of the work, copy it to S3. That gives significantly better performance in the downstream jobs, and avoids the fundamental mismatch between how work is committed (Hive uses renames) and how S3 works (there are no renames, only slow copies).
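As a rough sketch of that pattern (the paths and query below are placeholders; a distcp at the end of the workflow would do equally well for the final copy):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HdfsThenS3Example")
  .enableHiveSupport()
  .getOrCreate()

// 1. Do the heavy work against HDFS, where task/job commits are cheap renames.
//    (Placeholder query against a TPC-DS-style table.)
val result = spark.sql(
  "SELECT ss_store_sk, SUM(ss_net_paid) AS total FROM store_sales GROUP BY ss_store_sk")
result.write.mode("overwrite").orc("hdfs:///warehouse/staging/store_totals")

// 2. Only at the very end of the workflow, copy the final output to S3 once.
spark.read.orc("hdfs:///warehouse/staging/store_totals")
  .write.mode("overwrite").orc("s3a://my-bucket/published/store_totals/")
```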

Have a look at this document on Hive on S3 for more advice, including which options to set for maximum IO speedup.
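For example, the S3A options can be passed through the Spark or Hive configuration; the values below are only illustrative, and the exact property names and defaults depend on your Hadoop/HDP version, so check that document for the recommended settings:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative S3A tuning values only; property names/defaults vary by Hadoop version.
val spark = SparkSession.builder()
  .appName("S3ATuningExample")
  .config("spark.hadoop.fs.s3a.connection.maximum", "200")   // more parallel HTTP connections
  .config("spark.hadoop.fs.s3a.fast.upload", "true")         // incremental multipart uploads on write
  .config("spark.hadoop.fs.s3a.multipart.size", "67108864")  // 64 MB multipart chunks
  .getOrCreate()

// Placeholder path; reads now go through the tuned S3A client.
spark.read.orc("s3a://my-bucket/warehouse/store_sales/").createOrReplaceTempView("store_sales")
spark.sql("SELECT COUNT(*) AS cnt FROM store_sales").show()
```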

Explorer

Thanks Stevel.

My project will have lots of tables, and some of them will have partitions. In addition, I need to analyse the tables to improve query performance. My question is: "If I save these data to S3 and load them back into HDFS the next time a new cluster starts, is there a way to avoid repairing the partitions and analysing the tables again?" I find that analysing tables usually takes a lot of time.

Another question is: "Can I use HDCloud if my S3 data is located in a region that doesn't support HDCloud? If I can, what about the performance and price?"