Member since: 09-29-2015
Posts: 30
Kudos Received: 16
Solutions: 5

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 6218 | 12-26-2016 10:50 PM |
 | 5254 | 12-22-2016 11:36 PM |
 | 28363 | 12-15-2016 10:59 AM |
 | 3285 | 09-27-2016 01:04 AM |
 | 6894 | 09-20-2016 06:39 AM |
12-22-2016
11:06 PM
Is it anonymous access or an IAM policy? Which version of HDP/Hadoop are you using? Debug logs would show which credential provider it is trying to use.
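As a rough sketch of how those debug details could be captured (the bucket path is a placeholder and the exact log lines depend on the Hadoop version):

```bash
# Illustrative only: raise the Hadoop log level to DEBUG for this shell,
# then re-run an S3A access and look for the credential provider chain.
export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop fs -ls s3a://my-bucket/some/path 2>&1 | grep -i credential
```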
12-17-2016
09:25 AM
Can you check for any errors in hive.log (/tmp/<user>/hive.log)? If it launched the ApplicationMaster, could you share the AM logs?
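A quick sketch of how those logs could be gathered, assuming the application id is visible in the Hive client output (the id below is a placeholder):

```bash
# Illustrative: inspect the Hive client log, then pull the Tez AM/container
# logs for the application id printed by the client.
tail -n 200 /tmp/$USER/hive.log
yarn logs -applicationId application_1480000000000_0001 > am_logs.txt
```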
12-15-2016
10:59 AM
1 Kudo
A couple of things you can check:
- Check whether the dataset has changed between the previous run and the current run.
- It is not clear how you are running your query. For example, if you are using the Hive CLI, you can run "hive --hiveconf hive.tez.exec.print.summary=true". This prints the pre-execution times (compilation, job submission) and the DAG execution times after the job completes, which can hint at where the time is spent (see the sketch below).
- If you have Tez UI, that is the best place to start checking where the time is spent.
- It would be good to share the query and the "explain <sql>" output with "--hiveconf hive.explain.user=false". If possible, share the "explain formatted <sql>" output, which dumps the plan information in JSON format.
- Check whether vertices are running slowly due to resource constraints (i.e., some tasks have started while others are waiting because resources are not available in the queue or the cluster).
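As a sketch, the summary and the plan could be captured like this (query.sql and plan.json are placeholder names, and the query file is assumed to hold a single statement):

```bash
# Illustrative: run the query with the Tez execution summary enabled,
# then dump the optimizer plan in JSON form for sharing.
hive --hiveconf hive.tez.exec.print.summary=true -f query.sql
hive --hiveconf hive.explain.user=false \
     -e "EXPLAIN FORMATTED $(cat query.sql)" > plan.json
```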
12-10-2016
12:43 AM
Can you take a "jstack" dump of the Hive CLI while it is stuck and share it here? It would also be helpful if you could share hive.log.
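A minimal sketch of how the thread dump could be taken (the grep pattern assumes the standard CliDriver main class; replace <pid> with the actual process id):

```bash
# Illustrative: find the Hive CLI JVM and capture a thread dump while it hangs.
jps -l | grep -i CliDriver        # note the pid printed on this line
jstack <pid> > hive_cli_jstack.txt
```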
11-11-2016
03:26 AM
2 Kudos
s3n is effectively deprecated; please use "s3a". Which version of HDP are you using? Check that the relevant s3a libraries (aws-java-sdk-s3*.jar) are on the Hadoop classpath and add "-Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem".
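A sketch of how this could be verified from the command line; the jar location and bucket name are illustrative and depend on your install layout:

```bash
# Illustrative: confirm the AWS SDK jars are present, then force the s3a
# scheme onto the S3A filesystem implementation for a quick listing.
ls /usr/hdp/current/hadoop-client/lib/aws-java-sdk-s3*.jar
hadoop fs -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem -ls s3a://my-bucket/
```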
09-27-2016
01:04 AM
1 Kudo
This is not specific to "hive.tez.exec.print.summary=true", which only prints the summary details of the DAG. In this case the DAG ran much faster, and the delay you are observing comes from the file movement from S3 to S3 as part of the final cleanup of the job. Hive moves the job output to its final location, and this happens in the Hive client. In S3, a rename is a "copy + delete" operation, so even though the rename is carried out on the AWS side, it takes time proportional to the amount of data produced by the job. In HDFS, rename is a much cheaper operation, which is why you do not see this delay on HDFS. An alternative is to write the data to local HDFS and move it to S3 via distcp.
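A sketch of the distcp alternative mentioned above (paths and bucket name are placeholders):

```bash
# Illustrative: write the job output to HDFS first, then copy it to S3 in one
# pass with distcp instead of relying on the S3 rename at job commit time.
hadoop distcp hdfs:///tmp/hive_job_output s3a://my-bucket/warehouse/final_table
```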
09-24-2016
01:30 AM
Are there any exceptions reported in the client log or the metastore log?
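A quick sketch of how this could be checked (the metastore log location is a typical HDP default and may differ on your cluster):

```bash
# Illustrative: scan the Hive client and metastore logs for recent exceptions.
grep -i "exception" /tmp/$USER/hive.log | tail -n 50
grep -i "exception" /var/log/hive/hivemetastore.log | tail -n 50
```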
09-22-2016
09:17 AM
Check whether you have provided the AWS access keys correctly and whether any exceptions are reported in the Hive client log (e.g., /tmp/<user>/hive.log).
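One way to rule out a credentials problem is to pass the keys explicitly for a single session; a sketch with placeholder values, and my_s3_table is a hypothetical S3-backed table (in practice a credential provider is preferable to keys on the command line, and whether this takes effect depends on how your cluster resolves credentials):

```bash
# Illustrative only: supply the S3A keys for this Hive session.
hive --hiveconf fs.s3a.access.key=YOUR_ACCESS_KEY \
     --hiveconf fs.s3a.secret.key=YOUR_SECRET_KEY \
     -e "SELECT COUNT(*) FROM my_s3_table"
```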
09-20-2016
06:39 AM
5 Kudos
Most optimizations applied for S3 apply to HDFS as well; they are simply more visible on S3, because operations such as getFileStatus/listFiles are much cheaper on HDFS than on S3.
If you are using ORC on S3, it would be good to use the latest S3A connectors available in HDP 2.4/2.5 (or, even easier, try HDP Cloud, which has the latest S3A patches). A couple of things specific to ORC:
- ORC is a columnar format and can incur random reads (for example, it reads the end of the file before it starts reading the data blocks). With the recent S3A connector you can set "fs.s3a.experimental.input.fadvise=random", which helps with random reads; without it, the HTTPS connection to S3 is broken every time a backwards seek is performed. Other internal optimizations also help reduce connection aborts.
- Split computation can take a lot longer on S3 than on HDFS. Fortunately, ORC internally has a thread pool to compute splits in parallel; "hive.orc.compute.splits.num.threads" (default 10) can be tuned based on the amount of data processed. Again, this is not specific to S3, but tuning these parameters can make a significant performance difference on S3.
- If the ORC ETL split strategy is chosen (the default is HYBRID; see "hive.exec.orc.split.strategy"), you can avoid re-reading the footer in each task by enabling "hive.orc.splits.include.file.footer=true". In earlier versions of Hive this caused memory pressure on the AM side, but it has since been fixed. This piggybacks the ORC metadata onto the split payload so the task does not need to read the metadata again, which reduces the number of calls to S3. There are also a couple of fixes that reduced how often the footer information is read.

Some additional changes, not specific to ORC, that can have an impact in AWS/S3 environments:
- There have been a couple of fixes on the Tez side that improve the split grouping logic. S3 always reports "localhost" as its locality information, which could have an adverse impact with Tez due to its grouping behaviour in earlier versions. This is fixed in recent versions, where Tez does not aggressively group when it does not have enough information about data locality. This is not specific to ORC and helps other formats as well.
- There is no concept of racks in AWS. When the capacity scheduler is used, it is good to set "yarn.scheduler.capacity.node-locality-delay=0" to avoid container launch delays.
- When using Hive, it is good to set "hive.metastore.pre.event.listeners=" (empty value), as there is no concept of user/group permissions in S3.
- For ETL operations in Hive, "hive.metastore.fshandler.threads" and "hive.mv.files.thread" can be tuned to improve the performance of file moves and metastore-related activities.
- Set "hive.warehouse.subdir.inherit.perms=false" when using S3 data with Hive (several of these settings are pulled together in the sketch after this list).
- If the job is specific to MR, it is good to enable "mapreduce.fileoutputcommitter.algorithm.version=2", which helps reduce the amount of data movement at the end of the job.
- From the connector perspective, there is a lot of work happening in HADOOP-11694, HADOOP-13204 and HADOOP-13345.
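Purely as an illustration, here is how some of the Hive-side settings above could be applied to a single session; the query file name and the thread count are placeholders and should be tuned per workload:

```bash
# Illustrative only: apply the S3A/ORC-related tunings discussed above to one
# Hive session. Values are examples, not recommendations.
hive --hiveconf fs.s3a.experimental.input.fadvise=random \
     --hiveconf hive.orc.compute.splits.num.threads=20 \
     --hiveconf hive.orc.splits.include.file.footer=true \
     --hiveconf hive.warehouse.subdir.inherit.perms=false \
     -f my_etl_query.sql
```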
04-04-2016
09:35 AM
Can you try with "set hive.optimize.sort.dynamic.partition=true;"?
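A sketch of enabling this for a single run (insert_query.sql is a placeholder for the actual dynamic-partition insert):

```bash
# Illustrative: turn on sorted dynamic partitioning for this session only.
hive --hiveconf hive.optimize.sort.dynamic.partition=true -f insert_query.sql
```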