
Loading to S3 Fails - CDH 5.3.0


New Contributor

Since upgrading our cluster from 5.1.2 to 5.3.0, we have been unable to load data to a Hive table that points to S3. It fails with the following error:


Loading data to table schema.table_name partition (dt=null)
Failed with exception Wrong FS: s3n://<s3_bucket>/converted_installs/.hive-staging_hive_2015-01-26_11-05-32_849_2677145287515034575-1/-ext-10000/dt=2015-01-25/000000_0.gz, expected: hdfs://<name_node>:8020
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

The table itself was created using the following DDL (I removed the columns, since they are not very important):


...
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE
LOCATION 's3n://<s3_bucket>/data/warehouse_v1/converted_installs';
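
For reference, the failing statement is just a dynamic-partition insert into this table; simplified, and with placeholder names (not our real ones), it looks roughly like this:

# Simplified, hypothetical version of the failing load: a dynamic-partition
# insert of one day of data into the S3-backed table defined above.
hive -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE schema_name.table_name PARTITION (dt)
SELECT col1, col2, dt FROM staging_table WHERE dt = '2015-01-25';
"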

We don't have any issues writing to tables that reside on HDFS locally, but for some reason, writing to S3 fails. Anyone have an idea how to fix this?

27 REPLIES

Re: Loading to S3 Fails - CDH 5.3.0

Contributor

Can you please check the HiveServer2 (HS2) logs and paste any relevant exception here? Thanks.


Also, did this work before in 5.1.2?

Re: Loading to S3 Fails - CDH 5.3.0

New Contributor

I'm afraid I do not see anything in the HiveServer2 logs related to this problem.


Loading to S3 worked fine in CDH4.

Re: Loading to S3 Fails - CDH 5.3.0

New Contributor

We tried loading data to S3 on a second cluster and encountered the same problem. Here's the HiveServer2 log output:


2015-01-30 18:04:15,497 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Moving tmp dir: s3n://<s3_bucket>/test_table/.hive-staging_hive_2015-01-30_18-03-09_871_2770339221568578012-1/_tmp.-ext-10000 to: s3n://<s3_bucket>/test_table/.hive-staging_hive_2015-01-30_18-03-09_871_2770339221568578012-1/-ext-10000
2015-01-30 18:04:20,350 INFO org.apache.hadoop.hive.ql.log.PerfLogger: <PERFLOG method=task.MOVE.Stage-0 from=org.apache.hadoop.hive.ql.Driver>
2015-01-30 18:04:20,359 INFO org.apache.hadoop.hive.ql.exec.Task: Loading data to table test.test_table from s3n://<s3_bucket>/test_table/.hive-staging_hive_2015-01-30_18-03-09_871_2770339221568578012-1/-ext-10000
2015-01-30 18:04:20,564 INFO org.apache.hive.service.cli.CLIService: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=0ad4737c-4d4a-4585-86c1-5717fc72fd40]: getLog()
2015-01-30 18:04:24,727 ERROR org.apache.hadoop.hive.ql.exec.Task: Failed with exception Wrong FS: s3n://<s3_bucket>/test_table/.hive-staging_hive_2015-01-30_18-03-09_871_2770339221568578012-1/-ext-10000/000000_0, expected: hdfs://vm-cluster-node1:8020
java.lang.IllegalArgumentException: Wrong FS: s3n://<s3_bucket>/test_table/.hive-staging_hive_2015-01-30_18-03-09_871_2770339221568578012-1/-ext-10000/000000_0, expected: hdfs://vm-cluster-node1:8020
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:192)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1877)
	at org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262)
	at org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:961)
	at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2280)
	at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2356)
	at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:686)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1493)
	at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:284)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1554)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1321)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1139)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:962)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:957)
	at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:145)
	at org.apache.hive.service.cli.operation.SQLOperation.access$000(SQLOperation.java:69)
	at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:200)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502)
	at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:213)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

2015-01-30 18:04:24,727 ERROR org.apache.hadoop.hive.ql.Driver: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
2015-01-30 18:04:24,728 INFO org.apache.hadoop.hive.ql.log.PerfLogger: </PERFLOG method=Driver.execute start=1422640997058 end=1422641064728 duration=67670 from=org.apache.hadoop.hive.ql.Driver>
2015-01-30 18:04:24,728 INFO org.apache.hadoop.hive.ql.Driver: MapReduce Jobs Launched: 
2015-01-30 18:04:24,728 INFO org.apache.hadoop.hive.ql.Driver: Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.29 sec   HDFS Read: 5598 HDFS Write: 0 SUCCESS
2015-01-30 18:04:24,730 INFO org.apache.hadoop.hive.ql.Driver: Total MapReduce CPU Time Spent: 5 seconds 290 msec
2015-01-30 18:04:24,730 INFO org.apache.hadoop.hive.ql.log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
2015-01-30 18:04:24,730 INFO ZooKeeperHiveLockManager:  about to release lock for test/test_table
2015-01-30 18:04:24,738 INFO ZooKeeperHiveLockManager:  about to release lock for test
2015-01-30 18:04:24,744 INFO ZooKeeperHiveLockManager:  about to release lock for default/clicks
2015-01-30 18:04:24,749 INFO ZooKeeperHiveLockManager:  about to release lock for default
2015-01-30 18:04:24,753 INFO org.apache.hadoop.hive.ql.log.PerfLogger: </PERFLOG method=releaseLocks start=1422641064730 end=1422641064753 duration=23 from=org.apache.hadoop.hive.ql.Driver>
2015-01-30 18:04:24,756 ERROR org.apache.hive.service.cli.operation.Operation: Error running hive query: 
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
	at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:147)
	at org.apache.hive.service.cli.operation.SQLOperation.access$000(SQLOperation.java:69)
	at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:200)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502)
	at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:213)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
2015-01-30 18:04:24,760 INFO org.apache.hive.service.cli.CLIService: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=0ad4737c-4d4a-4585-86c1-5717fc72fd40]: getLog()

Is this a bug in CDH 5.3.0, or did we set up our clusters incorrectly?

Re: Loading to S3 Fails - CDH 5.3.0

New Contributor

Hello,

I am experiencing a similar issue with the Google Cloud Storage connector for Hadoop + Hive on CDH 5.3.

It appears as though Hive expects to be able to write only to the local HDFS, even though it is able to read from and write to the remote filesystem. I get the same error when reading from or writing to a gs://<bucket> location.


Not sure if this is a Hive bug or a configuration issue. 


Any progress on determining if it is a bug with Hive/S3 on CDH 5.3?

Re: Loading to S3 Fails - CDH 5.3.0

New Contributor

We were unable to get Hive tables pointing to S3 to work in CDH 5.3.0, so we downgraded to CDH 5.2.0, which works fine.

Re: Loading to S3 Fails - CDH 5.3.0

Explorer

I experienced this problem, and it has proven disastrous for a number of workflows that use S3 locations for external tables in Hive. As a workaround I tried a number of configuration changes, as well as manually uploading freshly compiled hadoop-s3a binaries to the cluster.


There is no documentation or information on how to use the S3A filesystem bundled inside 5.3.0; it just says that the new S3A filesystem is supported. Quick tests proved that untrue. I'm guessing I may have to move items onto the classpath for it to work, but with zero reference documentation on how to get this feature working, I manually copied the hadoop-s3a project binaries onto the path.
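
For reference, a quick check along these lines shows whether S3A resolves at all outside of Hive (the bucket name is a placeholder, and the AWS credentials are assumed to already be configured in core-site.xml):

# List the bucket and write a small test object through S3A, bypassing Hive
# entirely; if these fail, the problem is classpath/config rather than Hive.
hadoop fs -ls s3a://<s3_bucket>/
hadoop fs -put /etc/hosts s3a://<s3_bucket>/tmp/s3a_write_test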


The stack trace I get happens inside HiveServer2 during the moveFile/copyFiles step, the same as above. This is either a regression or a new bug that breaks non-HDFS filesystems for external tables, which largely defeats the purpose of having external tables.


Is there any solution in 5.3.2, or should I follow the example of the user above and essentially nuke my cluster and install an old version? That would be a lot of wasted time.

Re: Loading to S3 Fails - CDH 5.3.0

Explorer

Here's the relevant stack trace:


15/03/13 18:08:15 INFO log.PerfLogger: <PERFLOG method=task.MOVE.Stage-4 from=org.apache.hadoop.hive.ql.Driver>
15/03/13 18:08:15 INFO exec.Task: Moving data to: s3a://datapipe-usage/tmp/hive-staging_hive_2015-03-13_18-07-13_821_8375490564348952212-1/-ext-10000 from s3a://datapipe-usage/tmp/hive-staging_hive_2015-03-13_18-07-13_821_8375490564348952212-1/-ext-10002
15/03/13 18:08:15 INFO s3a.S3AFileSystem: Getting path status for s3a://datapipe-usage/tmp/hive-staging_hive_2015-03-13_18-07-13_821_8375490564348952212-1/-ext-10002 (tmp/hive-staging_hive_2015-03-13_18-07-13_821_8375490564348952212-1/-ext-10002)
15/03/13 18:08:16 INFO s3a.S3AFileSystem: Getting path status for s3a://datapipe-usage/tmp/hive-staging_hive_2015-03-13_18-07-13_821_8375490564348952212-1 (tmp/hive-staging_hive_2015-03-13_18-07-13_821_8375490564348952212-1)
15/03/13 18:08:16 INFO s3a.S3AFileSystem: Delete path s3a://datapipe-usage/tmp/hive-staging_hive_2015-03-13_18-07-13_821_8375490564348952212-1/-ext-10000 - recursive true
15/03/13 18:08:16 INFO s3a.S3AFileSystem: Getting path status for s3a://datapipe-usage/tmp/hive-staging_hive_2015-03-13_18-07-13_821_8375490564348952212-1/-ext-10000 (tmp/hive-staging_hive_2015-03-13_18-07-13_821_8375490564348952212-1/-ext-10000)
15/03/13 18:08:16 ERROR exec.Task: Failed with exception Wrong FS: s3a://datapipe-usage/tmp/hive-staging_hive_2015-03-13_18-07-13_821_8375490564348952212-1/-ext-10002, expected: hdfs://smq-cloudera-04.dpcloud.local:8020
java.lang.IllegalArgumentException: Wrong FS: s3a://datapipe-usage/tmp/hive-staging_hive_2015-03-13_18-07-13_821_8375490564348952212-1/-ext-10002, expected: hdfs://smq-cloudera-04.dpcloud.local:8020
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:192)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1877)
	at org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262)
	at org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:961)
	at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2280)
	at org.apache.hadoop.hive.ql.exec.MoveTask.moveFile(MoveTask.java:92)
	at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:209)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1554)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1321)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1139)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:962)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:957)
	at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:145)
	at org.apache.hive.service.cli.operation.SQLOperation.access$000(SQLOperation.java:69)
	at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:200)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502)
	at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:213)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)


Re: Loading to S3 Fails - CDH 5.3.0

New Contributor

This thread seems to have trailed off. Has there been any resolution for this? We are experiencing this exact issue in CDH 5.4.3.

Re: Loading to S3 Fails - CDH 5.3.0

Explorer

I reverted to 5.2.4 for a period, which didn't have this issue. Unfortunately, I had to bring the cluster up to 5.4.2 for other reasons, so I implemented a workaround using a staging HDFS location and DistCp.
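
Roughly, the workaround looks like the sketch below (paths, table names, and the partition value are placeholders, and SerDe/output-format details are glossed over; it is not the exact job):

# 1. Let Hive write the day's data to a staging directory on local HDFS,
#    where MoveTask works fine.
hive -e "
INSERT OVERWRITE DIRECTORY '/staging/my_table/dt=2015-07-01'
SELECT * FROM source_table WHERE dt = '2015-07-01';
"

# 2. Copy the staged files up to the external table's S3 location
#    (assumes S3 credentials are already configured for DistCp).
hadoop distcp \
  /staging/my_table/dt=2015-07-01 \
  s3a://<s3_bucket>/warehouse/my_table/dt=2015-07-01

# 3. Register the new partition against the S3-backed external table so Hive
#    can query it.
hive -e "
ALTER TABLE my_db.my_table
  ADD IF NOT EXISTS PARTITION (dt='2015-07-01')
  LOCATION 's3a://<s3_bucket>/warehouse/my_table/dt=2015-07-01';
"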


The issue remains. Reading the stack trace and looking around online, I'm fairly sure it has to do with the new HDFS encryption support: the move path runs an encryption-zone check (HdfsAdmin.getEncryptionZoneForPath) against the S3 staging path, and that check only accepts paths on the local HDFS.