Member since: 03-21-2017
Posts: 13
Kudos Received: 0
Solutions: 0
08-05-2019
05:47 AM
Hi @Ai, you will find a complete guide to JVM performance tuning here.
03-23-2017
10:53 PM
Thanks for your reply! The key requirements are as follows: each PDF document is very small (less than 1 MB), but there is a huge number of them; the total size will be 10 TB or more. They are archive files: once indexed, they are rarely updated and are only used for querying and downloading. Any suggestions for this scenario? Thanks!
03-24-2017
11:19 AM
Yes, I'm afraid that fast upload can overload the buffers in HDP 2.5, as it uses JVM heap to store blocks while it uploads them. The bigger the mismatch between the rate the data is generated (i.e. how fast things can be read) and the upload bandwidth, the more heap you need. On a long-haul upload you usually have limited bandwidth, and the more distcp workers there are, the more that bandwidth is divided between them, so the bigger the mismatch.

In HDP 2.5 you can get away with tuning the fast uploader to use less heap, but it's tricky enough to configure that in the HDP 2.5 docs we chose not to mention the fs.s3a.fast.upload option at all. It was just too confusing, and we couldn't come up with good defaults that would work reliably, which is why I rewrote it completely for HDP 2.6. The HDP 2.6 / Apache Hadoop 2.8 block output stream (already in HDCloud) can buffer on disk (the default) or in byte buffers, as well as on the heap, and it does better queueing of writes.

For HDP 2.5, the tuning options are documented in the Hadoop 2.7 docs. Essentially, a lower value of fs.s3a.threads.core and fs.s3a.threads.max keeps the number of buffered blocks down, while setting fs.s3a.multipart.size to something like 10485760 (10 MB) and setting fs.s3a.multipart.threshold to the same value reduces the buffer size before the uploads begin. As I warned, you can end up spending time tuning, because the heap consumed increases with the threads.max value and decreases with the multipart threshold and size values. And over a remote connection, the more workers you have in the distcp operation (controlled by the -m option), the less bandwidth each one gets, so again: more heap overflows. You will invariably find on the big uploads that there are limits.

As a result, in HDP 2.5 I'd recommend avoiding the fast upload except in the special case where you have a very high-speed connection to an S3 server in the same infrastructure, and you use it for code generating data, rather than for big distcp operations, which can read data as fast as it can be streamed off multiple disks.
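As a rough illustration of the knobs above, here is a minimal sketch of setting those properties on a Hadoop client Configuration. The specific values are only the examples from the paragraph above (the thread counts are illustrative guesses, not recommendations), and the same keys can equally be passed to distcp as generic -Dkey=value options instead of editing core-site.xml.

```scala
import org.apache.hadoop.conf.Configuration

// Sketch only: property names as discussed above; values are illustrative.
// Lower thread counts and smaller multipart size/threshold reduce the heap
// consumed by buffered blocks during fast upload.
val conf = new Configuration()
conf.setBoolean("fs.s3a.fast.upload", true)            // HDP 2.5 fast uploader
conf.setInt("fs.s3a.threads.core", 5)                  // fewer threads -> fewer buffered blocks
conf.setInt("fs.s3a.threads.max", 10)                  // heap use grows with this value
conf.setLong("fs.s3a.multipart.size", 10485760L)       // 10 MB parts
conf.setLong("fs.s3a.multipart.threshold", 10485760L)  // start multipart uploads at 10 MB
```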
03-27-2017
03:33 AM
Hi Binu, thank you for your advice. I've done some experiments based on Hortonworks' Hive Benchmark to compare the performance of Hive and Spark for analysing S3 data. I assume both methods need to load the S3 data into HDFS and create Hive tables pointing to the HDFS data. The reason I also create Hive tables for Spark is that I want to use HiveQL and don't want to write much code for registering temp tables in Spark.

I observed the following for tpcds_10GB:
1. Loading the S3 text table into HDFS took 233 seconds, which is acceptable.
2. Creating the ORC table and analysing the tables took a very long time (more than one hour, so I terminated it manually), which is unacceptable.
3. Executing query12.sql took 17.64 secs on the Hive text table, 6.013 secs on the Hive ORC table, and 45 secs on Spark. There are also some examples where Spark outperforms Hive (e.g. query15.sql).

My questions:
1. Since analysing the ORC tables takes a very long time, is there a way to avoid re-analysing tables when loading S3 data into HDFS? If there is no way to avoid this long table-optimisation step, I might not be able to use the Hive method, because my project has many tables and all of them are very large.
2. Should I always use HiveContext rather than SQLContext? I find that when I use the SQLContext class, some Hive scripts can't execute (see the sketch below).

Looking forward to your reply! Thank you very much!