Member since: 03-21-2017
Posts: 13
Kudos Received: 0
Solutions: 0
03-27-2017
03:33 AM
Hi Binu, thank you for your advice. I've run some experiments based on the Hortonworks Hive benchmark to compare the performance of Hive and Spark for analysing S3 data. I assume that both methods need to load the S3 data into HDFS and create Hive tables pointing to the HDFS data. The reason I also create Hive tables for Spark is that I want to use HiveQL and don't want to write much code to register temp tables in Spark. I observed the following for tpcds_10GB:
1) Loading the S3 text table into HDFS took 233 seconds, which is acceptable.
2) Creating the ORC table and analysing the tables took a very long time (more than one hour, so I terminated it manually), which is unacceptable.
3) Running query12.sql: Hive text table 17.64 secs, Hive ORC table 6.013 secs, Spark 45 secs. There are also some examples where Spark outperforms Hive (e.g. query15.sql).
My questions: Since analysing the ORC tables takes so long, is there a way to avoid re-analysing tables when loading S3 data into HDFS? If there is no way around this long optimisation step, I might not be able to use the Hive approach, because my project has many tables and all of them are very large. Also, should I always use HiveContext rather than SQLContext? I find that when I use the SQLContext class, some of my Hive scripts can't execute. Looking forward to your reply! Thank you very much!
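For reference, this is roughly how I run HiveQL through a HiveContext (a minimal Spark 1.x sketch; the database, table and query are only illustrative). As far as I understand, HiveContext accepts the full HiveQL dialect and can see the metastore tables, while a plain SQLContext only supports a smaller SQL subset, which would explain why some of my Hive scripts fail under it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Minimal Spark 1.x sketch; database/table names are illustrative only.
val sc = new SparkContext(new SparkConf().setAppName("tpcds-query"))

// HiveContext picks up hive-site.xml, so the Hive metastore tables are
// visible and full HiveQL can be used.
val hiveContext = new HiveContext(sc)

// The same statement may fail under a plain SQLContext, which has no access
// to the metastore and supports a smaller SQL dialect.
val result = hiveContext.sql(
  "SELECT ws_item_sk, sum(ws_ext_sales_price) AS total " +
  "FROM tpcds_10gb.web_sales GROUP BY ws_item_sk")
result.show()
```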
03-26-2017
10:14 PM
Hi, can you give me some suggestions on where I should start learning GC tuning? Is there a good tutorial? Thank you very much!
03-23-2017
10:53 PM
Thanks for your reply! The key requirements are as follows:
1) Each PDF document is very small (less than 1 MB), but the number of documents is huge; the total size will be 10 TB or more.
2) They are archive files. Once they are indexed, updates will rarely happen.
3) They are only used for querying and downloading.
Any suggestions for this scenario? Thanks
03-23-2017
05:19 AM
I am investigating how to index and search a huge number of PDF documents using the Hadoop technology stack. My data has two parts: 1) the raw PDF documents, and 2) field data about the PDF documents, which has already been extracted by external applications. I find that Solr is a good tool for indexing the PDF documents based on the field data (part 2), but where should I store the raw PDF documents (part 1)? My initial plan is to store the PDF documents in HDFS and add the HDFS path to the field data when building the index with Solr, but I've seen some websites mention that HDFS is not good at storing a huge number of small files. Can someone give me some suggestions for this scenario? Should I store the PDF documents in HBase, or use another document-oriented database like MongoDB?
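To make the HBase option concrete, this is roughly what I have in mind: keep the raw bytes in HBase, one row per document, and put only the row key (plus the extracted fields) into the Solr index. A sketch only, using the standard HBase 1.x client API; the table name, column family and file path are hypothetical.

```scala
import java.nio.file.{Files, Paths}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical names: table "pdf_archive", column family "f", qualifier "raw".
val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("pdf_archive"))

// The same id would be stored in the Solr document as a reference field.
val docId = "doc-000001"
val pdfBytes = Files.readAllBytes(Paths.get("/data/incoming/doc-000001.pdf"))

// One row per PDF; values under ~1 MB fit comfortably in a single HBase cell.
val put = new Put(Bytes.toBytes(docId))
put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("raw"), pdfBytes)
table.put(put)

table.close()
connection.close()
```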
Labels: Apache Solr
03-22-2017
03:30 AM
I solved the problem by removing "-D fs.s3a.fast.upload=true", but I still don't know the reason. Does anyone know why "fs.s3a.fast.upload=true" causes a Java heap space problem? Thanks
03-22-2017
03:17 AM
I encountered an "Error: Java heap space" problem when using distcp to copy data from HDFS to S3. Can someone tell me how to solve it? The file size on HDFS is 1.5 GB. Command: hadoop distcp -D fs.s3a.fast.upload=true 'hdfs://<address>:8020/tmp/tpcds-generate/10/web_sales' s3a://<address>/tpcds-generate/web_sales I found that the Hadoop job ran with mapreduce.map.memory.mb=1024, which might have caused the problem, but I have actually set the memory from the Ambari UI. Below is a capture of the relevant parameters.
Labels: Apache Hadoop
03-21-2017
11:28 PM
Thanks Stevel. My project will have lots of tables and some of them will have partitions. In addition, I need to analyse the tables to improve query performance. My question is: if I save the data to S3 and load it back to HDFS the next time a new cluster starts, is there a way to avoid repairing the partitions and analysing the tables again? I find that table analysis usually takes a lot of time. Another question: can I use HDCloud if my S3 data is located in a region that doesn't support HDCloud? And if I can, what about the performance and the price?
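For clarity, by "repairing partitions and analysing tables" I mean statements like the ones below. This is only a sketch with a hypothetical table name; I'm issuing the HiveQL through a HiveContext here and assuming it passes these statements through to Hive, but the same statements apply in the Hive shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical database/table names; the statements themselves are standard HiveQL.
val sc = new SparkContext(new SparkConf().setAppName("table-maintenance"))
val hiveContext = new HiveContext(sc)

// Re-registers partition directories that already exist under the table's location.
hiveContext.sql("MSCK REPAIR TABLE tpcds.store_sales")

// Recomputes the optimizer statistics; this is the long-running step I would like
// to avoid repeating every time the data is loaded back from S3.
hiveContext.sql(
  "ANALYZE TABLE tpcds.store_sales PARTITION (ss_sold_date_sk) COMPUTE STATISTICS")
```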
03-21-2017
05:33 AM
Thank you very much!
03-21-2017
02:54 AM
Thank you very much for your valuable suggestion; I am going to try SparkSQL. Actually, I am now facing a couple of problems getting Hadoop tools (Sqoop, Hive) to work with S3 data, and hopefully I can get some further guidance from you. My project is to integrate data from different sources (i.e. S3 and databases) so that I can analyse it. My customer wants to store everything in S3 and use Hadoop as an ephemeral cluster. I am now trying to figure out: 1) how to transfer data from a database to S3, and 2) how to analyse data that is already in S3. I've already done some experiments: 1) using Sqoop to transfer data from a database to S3 directly (this failed), so my question is whether Sqoop is a good choice in this scenario; 2) querying data and writing it into a Hive table pointing to S3 (this is very slow when there are many partitions). My test was based on the Hive benchmark data "tpcds_bin_partitioned_orc_10". I guess the slowness comes from the many sub-directories and the many small files in each directory, which makes writing the data to S3 very costly. I am going to try SparkSQL as you suggest, but I am still curious which way of using Hive with S3 data performs better: loading the S3 data into HDFS for the Hive analysis and then saving the results back to S3, or running the analysis directly on a Hive table pointing to the S3 data. Thank you very much for your time.
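P.S. The SparkSQL experiment I have in mind looks roughly like the sketch below (Spark 1.x; the bucket name and path are made up, and I assume the s3a credentials are already configured on the cluster).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Spark 1.x sketch; the bucket/path are hypothetical and s3a credentials are
// assumed to be set in core-site.xml or via fs.s3a.* properties.
val sc = new SparkContext(new SparkConf().setAppName("s3-orc-query"))
val hiveContext = new HiveContext(sc)

// Read the ORC data directly from S3 without copying it into HDFS first.
val webSales = hiveContext.read
  .format("orc")
  .load("s3a://my-bucket/tpcds_bin_partitioned_orc_10/web_sales")

// Expose it to SQL without creating a permanent Hive table.
webSales.registerTempTable("web_sales")
hiveContext.sql("SELECT count(*) FROM web_sales").show()
```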
03-21-2017
12:39 AM
I am working on a project to analyse S3 data with Hive. I've found there are different ways to let Hive operate on S3 data:
1) use S3 as the default file system, replacing HDFS;
2) create a Hive table pointing directly to the S3 data (sketched below);
3) load the S3 data into HDFS first, create a Hive table for the analysis, and load the results back to S3.
I am wondering which approach is most popular for analysing S3 data when performance is a big concern, because the S3 data might be very large.
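The sketch of option 2 that I have in mind is roughly the following. The bucket, columns and table name are made up; I'm issuing the DDL through a HiveContext so the snippet is self-contained, but the same CREATE EXTERNAL TABLE statement can be run from the Hive CLI.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Option 2: an external Hive table whose data stays in S3.
// Bucket, column list and table name are hypothetical.
val sc = new SparkContext(new SparkConf().setAppName("s3-external-table"))
val hiveContext = new HiveContext(sc)

hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS web_sales_s3 (
    |  ws_item_sk BIGINT,
    |  ws_ext_sales_price DECIMAL(7,2)
    |)
    |STORED AS ORC
    |LOCATION 's3a://my-bucket/tpcds/web_sales'""".stripMargin)

// Queries then read straight from S3; no copy into HDFS is required.
hiveContext.sql("SELECT count(*) FROM web_sales_s3").show()
```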
Labels: Apache Hive