Member since: 09-29-2015
Posts: 30
Kudos Received: 16
Solutions: 5

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 6218 | 12-26-2016 10:50 PM |
 | 5254 | 12-22-2016 11:36 PM |
 | 28363 | 12-15-2016 10:59 AM |
 | 3285 | 09-27-2016 01:04 AM |
 | 6894 | 09-20-2016 06:39 AM |
12-22-2016
11:06 PM
Is it anonymous access or an IAM policy? Which version of HDP/Hadoop are you using? Debug logs would show which credential provider it is trying to use.
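As a rough sketch of how those debug details could be captured (the bucket path is a placeholder and the exact log lines depend on the Hadoop version):

```bash
# Illustrative only: raise the Hadoop log level to DEBUG for this shell,
# then re-run an S3A access and look for the credential provider chain.
export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop fs -ls s3a://my-bucket/some/path 2>&1 | grep -i credential
```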
12-17-2016
09:25 AM
Can you check for any errors in hive.log (/tmp/<user>/hive.log)? If it launched the ApplicationMaster, could you share the AM logs?
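A quick sketch of how those logs could be gathered, assuming the application id is visible in the Hive client output (the id below is a placeholder):

```bash
# Illustrative: inspect the Hive client log, then pull the Tez AM/container
# logs for the application id printed by the client.
tail -n 200 /tmp/$USER/hive.log
yarn logs -applicationId application_1480000000000_0001 > am_logs.txt
```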
12-15-2016
10:59 AM
1 Kudo
A couple of things you can check:
- Check whether the dataset has changed between the previous run and the current run.
- It is not clear how you are running your query. For example, if you are using the Hive CLI, you can run "hive --hiveconf hive.tez.exec.print.summary=true". This prints the pre-execution times (compilation, job submission) and the DAG execution times after the job completes, which can hint at where the time is spent (see the sketch below).
- If you have Tez UI, that is the best place to start checking where the time is spent.
- It would be good to share the query and the "explain <sql>" output with "--hiveconf hive.explain.user=false". If possible, share the "explain formatted <sql>" output, which dumps the plan information in JSON format.
- Check whether vertices are running slowly due to resource constraints (i.e., some tasks have started while others are waiting because resources are not available in the queue or the cluster).
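As a sketch, the summary and the plan could be captured like this (query.sql and plan.json are placeholder names, and the query file is assumed to hold a single statement):

```bash
# Illustrative: run the query with the Tez execution summary enabled,
# then dump the optimizer plan in JSON form for sharing.
hive --hiveconf hive.tez.exec.print.summary=true -f query.sql
hive --hiveconf hive.explain.user=false \
     -e "EXPLAIN FORMATTED $(cat query.sql)" > plan.json
```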
12-10-2016
12:43 AM
Can you take a "jstack" dump of the Hive CLI while it is stuck and share it here? It would also be helpful if you could share hive.log.
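A minimal sketch of how the thread dump could be taken (the grep pattern assumes the standard CliDriver main class; replace <pid> with the actual process id):

```bash
# Illustrative: find the Hive CLI JVM and capture a thread dump while it hangs.
jps -l | grep -i CliDriver        # note the pid printed on this line
jstack <pid> > hive_cli_jstack.txt
```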
11-11-2016
03:26 AM
2 Kudos
s3n is effectively deprecated; please use "s3a". Which version of HDP are you using? Check that the relevant s3a libraries (aws-java-sdk-s3*.jar) are on the Hadoop classpath and add "-Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem".
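A sketch of how this could be verified from the command line; the jar location and bucket name are illustrative and depend on your install layout:

```bash
# Illustrative: confirm the AWS SDK jars are present, then force the s3a
# scheme onto the S3A filesystem implementation for a quick listing.
ls /usr/hdp/current/hadoop-client/lib/aws-java-sdk-s3*.jar
hadoop fs -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem -ls s3a://my-bucket/
```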
09-27-2016
01:04 AM
1 Kudo
This is not specific to "hive.tez.exec.print.summary=true", which only prints the summary details of the DAG. In this case the DAG ran much faster, and the delay you are observing comes from the file movement from S3 to S3 as part of the final cleanup of the job. Hive moves the job output to its final location, and this happens in the Hive client. In S3, a rename is a "copy + delete" operation, so even though the rename is carried out on the AWS side, it takes time proportional to the amount of data produced by the job. In HDFS, rename is a much cheaper operation, which is why you do not see this delay on HDFS. An alternative is to write the data to local HDFS and move it to S3 via distcp.
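A sketch of the distcp alternative mentioned above (paths and bucket name are placeholders):

```bash
# Illustrative: write the job output to HDFS first, then copy it to S3 in one
# pass with distcp instead of relying on the S3 rename at job commit time.
hadoop distcp hdfs:///tmp/hive_job_output s3a://my-bucket/warehouse/final_table
```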
09-24-2016
01:30 AM
Are there any exceptions reported in the client log or the metastore log?
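A quick sketch of how this could be checked (the metastore log location is a typical HDP default and may differ on your cluster):

```bash
# Illustrative: scan the Hive client and metastore logs for recent exceptions.
grep -i "exception" /tmp/$USER/hive.log | tail -n 50
grep -i "exception" /var/log/hive/hivemetastore.log | tail -n 50
```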
09-22-2016
09:17 AM
Check whether you have provided the AWS access keys correctly and whether any exceptions are reported in the Hive client log (e.g., /tmp/<user>/hive.log).
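One way to rule out a credentials problem is to pass the keys explicitly for a single session; a sketch with placeholder values, and my_s3_table is a hypothetical S3-backed table (in practice a credential provider is preferable to keys on the command line, and whether this takes effect depends on how your cluster resolves credentials):

```bash
# Illustrative only: supply the S3A keys for this Hive session.
hive --hiveconf fs.s3a.access.key=YOUR_ACCESS_KEY \
     --hiveconf fs.s3a.secret.key=YOUR_SECRET_KEY \
     -e "SELECT COUNT(*) FROM my_s3_table"
```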
09-20-2016
06:39 AM
5 Kudos
Most optimizations applied for S3 apply to HDFS as well; they are simply more visible on S3, because operations such as getFileStatus/listFiles are much cheaper on HDFS than on S3.
If you are using ORC on S3, it would be good to use the latest S3A connectors available in HDP 2.4/2.5 (or, even easier, try HDP Cloud, which has the latest S3A patches). A couple of things specific to ORC:
- ORC is a columnar format and can incur random reads (for example, it reads the end of the file before it starts reading the data blocks). With the recent S3A connector you can set "fs.s3a.experimental.input.fadvise=random", which helps with random reads; without it, the HTTPS connection to S3 is broken every time a backwards seek is performed. Other internal optimizations also help reduce connection aborts.
- Split computation can take a lot longer on S3 than on HDFS. Fortunately, ORC internally has a thread pool to compute splits in parallel; "hive.orc.compute.splits.num.threads" (default 10) can be tuned based on the amount of data processed. Again, this is not specific to S3, but tuning these parameters can make a significant performance difference on S3.
- If the ORC ETL split strategy is chosen (the default is HYBRID; see "hive.exec.orc.split.strategy"), you can avoid re-reading the footer in each task by enabling "hive.orc.splits.include.file.footer=true". In earlier versions of Hive this caused memory pressure on the AM side, but it has since been fixed. This piggybacks the ORC metadata onto the split payload so the task does not need to read the metadata again, which reduces the number of calls to S3. There are also a couple of fixes that reduced how often the footer information is read.

Some additional changes, not specific to ORC, that can have an impact in AWS/S3 environments:
- There have been a couple of fixes on the Tez side that improve the split grouping logic. S3 always reports "localhost" as its locality information, which could have an adverse impact with Tez due to its grouping behaviour in earlier versions. This is fixed in recent versions, where Tez does not aggressively group when it does not have enough information about data locality. This is not specific to ORC and helps other formats as well.
- There is no concept of racks in AWS. When the capacity scheduler is used, it is good to set "yarn.scheduler.capacity.node-locality-delay=0" to avoid container launch delays.
- When using Hive, it is good to set "hive.metastore.pre.event.listeners=" (empty value), as there is no concept of user/group permissions in S3.
- For ETL operations in Hive, "hive.metastore.fshandler.threads" and "hive.mv.files.thread" can be tuned to improve the performance of file moves and metastore-related activities.
- Set "hive.warehouse.subdir.inherit.perms=false" when using S3 data with Hive (several of these settings are pulled together in the sketch after this list).
- If the job is specific to MR, it is good to enable "mapreduce.fileoutputcommitter.algorithm.version=2", which helps reduce the amount of data movement at the end of the job.
- From the connector perspective, there is a lot of work happening in HADOOP-11694, HADOOP-13204 and HADOOP-13345.
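Purely as an illustration, here is how some of the Hive-side settings above could be applied to a single session; the query file name and the thread count are placeholders and should be tuned per workload:

```bash
# Illustrative only: apply the S3A/ORC-related tunings discussed above to one
# Hive session. Values are examples, not recommendations.
hive --hiveconf fs.s3a.experimental.input.fadvise=random \
     --hiveconf hive.orc.compute.splits.num.threads=20 \
     --hiveconf hive.orc.splits.include.file.footer=true \
     --hiveconf hive.warehouse.subdir.inherit.perms=false \
     -f my_etl_query.sql
```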
04-04-2016
09:35 AM
Can you try with "set hive.optimize.sort.dynamic.partition=true;"?
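A sketch of enabling this for a single run (insert_query.sql is a placeholder for the actual dynamic-partition insert):

```bash
# Illustrative: turn on sorted dynamic partitioning for this session only.
hive --hiveconf hive.optimize.sort.dynamic.partition=true -f insert_query.sql
```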