Hi, we are performing some tests copying HDFS data to AWS S3 using S3A. Copying 136 small files totalling 7 GB is taking about 3 hours, and we are seeing multiple connection timeouts in all our mapper tasks:

    2016-05-20 10:40:43,148 INFO [s3a-transfer-shared--pool1-t3] com.cloudera.com.amazonaws.http.AmazonHttpClient: Unable to execute HTTP request: Connection timed out
    java.net.SocketException: Connection timed out

Here is the command I'm using. Please note that I already set fs.s3a.connection.timeout, but we still seem to be hitting timeouts:

    hadoop distcp -D mapred.task.timeout=1800000 -Dfs.s3a.awsAccessKeyId=xxxx -Dfs.s3a.awsSecretAccessKey=xxx -Dfs.s3a.connection.timeout=1800000 -log /grp/cai_dba/dev/core/pawsdistcplogs hdfs://nameservice1/grp/cai_dba/dev/core/pawsdistcptests/ s3a://ah-distcp-poc-task/weblogs

Here are other tests that worked fine:

1) I was able to copy a single 4 GB file successfully within 4 minutes using the same distcp/S3A method from the same Hadoop cluster.
2) I copied all 136 files (7 GB total) from the test case above onto the local filesystem of a host on the same network as our Hadoop cluster and performed a direct copy to S3 using the AWS CLI copy command. The entire 7 GB copied over successfully in about 8 minutes.

So it seems to me there is some bottleneck in how distcp/S3A handles multiple files. Has anyone experienced this issue, and does anyone have ideas for tuning the S3A parameters?
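For reference, this is the direction I was planning to try next: widening the S3A connection pool and thread pool and capping the number of distcp maps. The specific values below are guesses on my part, not settings we have validated, and I'm assuming the CDH-shaded S3A client honors these standard properties:

    hadoop distcp \
      -Dmapreduce.task.timeout=1800000 \
      -Dfs.s3a.connection.timeout=1800000 \
      -Dfs.s3a.connection.maximum=100 \
      -Dfs.s3a.threads.max=64 \
      -Dfs.s3a.attempts.maximum=20 \
      -m 20 \
      hdfs://nameservice1/grp/cai_dba/dev/core/pawsdistcptests/ s3a://ah-distcp-poc-task/weblogs

The thinking is that many small files means many concurrent S3 PUTs competing for a small default connection pool, so requests queue up and time out; I have not confirmed this is what's happening in our case.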
Hi, we have a similar issue and are wondering whether the steps listed are the resolution.

Our cluster is Kerberized and we also deployed Sentry; as part of the setup we disabled impersonation in Hive, so all Hive queries are executed as the hive user.

We configured Dynamic Resource Pools with three queues: HighPriority, LowPriority and Default. Everybody can submit jobs to the Default queue, and that is working as expected. Access to HighPriority and LowPriority is managed by membership in two different AD groups. I assigned a test user to both groups so it could submit jobs to both queues (HighPriority and LowPriority). When I submitted a job, we got the following error message:

    ERROR : Job Submission failed with exception 'java.io.IOException(Failed to run job : User hive cannot submit applications to queue root.HighPriority)'
    java.io.IOException: Failed to run job : User hive cannot submit applications to queue root.HighPriority

This is correct, because the hive user is not a member of either of those groups. I modified the pool's submission access control to add the hive user, and this time the job completed. However, that breaks the access-control model we are trying to implement, because now every Hive user can make use of both pools even if they don't belong to any of the AD groups that are supposed to control who can submit jobs to the pool.

Is there a way to control which users can submit to specific resource pools in Hive and leverage the AD groups created for this purpose?
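For context, here is roughly the access-control model we are trying to implement, expressed as Fair Scheduler allocations. The AD group names are illustrative placeholders, and in our cluster this is configured through the Cloudera Manager Dynamic Resource Pools UI rather than a hand-edited fair-scheduler.xml:

    <allocations>
      <queue name="root">
        <queue name="Default">
          <!-- anyone may submit -->
          <aclSubmitApps>*</aclSubmitApps>
        </queue>
        <queue name="HighPriority">
          <!-- leading space = no users listed, only members of this AD group (illustrative name) -->
          <aclSubmitApps> ad-high-priority</aclSubmitApps>
        </queue>
        <queue name="LowPriority">
          <aclSubmitApps> ad-low-priority</aclSubmitApps>
        </queue>
      </queue>
    </allocations>

The problem is that with impersonation disabled, YARN only ever sees the hive user as the submitter, so these group-based ACLs never get evaluated against the actual end user.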