New Contributor
Posts: 5
Registered: 08-18-2013

hive s3 and hive.exec.skips3scratch

Running Ubuntu 12.04.2 LTS. Installed CDH4 with MRv1.

Using Hive with Derby. All services are functioning normally.

 

I want to use Hive to query large tables I have on S3. I set fs.s3.awsAccessKeyId, fs.s3n.awsAccessKeyId, the corresponding secret keys, etc.
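
For reference, the keys were set roughly like this (placeholder values, obviously):

set fs.s3n.awsAccessKeyId=<my access key>;
set fs.s3n.awsSecretAccessKey=<my secret key>;
set fs.s3.awsAccessKeyId=<my access key>;
set fs.s3.awsSecretAccessKey=<my secret key>;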

 

TL;DR: I can only read a table from S3 with hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; I can only write from Hive to S3 with hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; and hive.exec.skips3scratch has no effect.

 

 

By default, Hive has hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.
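
This can be confirmed in the CLI by querying the property, which prints something like:

hive> set hive.input.format;
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat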

 

Now, if I point Hive at my external data:

 

hive> create external table test (mp string) location 's3://my_stuff/test';
OK
Time taken: 1.103 seconds

 

However,

 

hive> select count(1) from test; 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
java.io.FileNotFoundException: File does not exist: /test/000000_0

....

....

 

But if I SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; it works fine. However, I then can't write to S3:

 

hive> select count(1) from test; 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201308180808_0016, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201308180808_0016
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201308180808_0016
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2013-08-18 09:52:34,833 Stage-1 map = 0%, reduce = 0%
2013-08-18 09:52:42,940 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.37 sec
2013-08-18 09:52:43,967 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.39 sec
2013-08-18 09:52:44,987 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.39 sec
2013-08-18 09:52:45,997 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.39 sec
2013-08-18 09:52:47,029 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.32 sec
2013-08-18 09:52:48,048 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.32 sec
2013-08-18 09:52:49,072 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.32 sec
2013-08-18 09:52:50,090 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.32 sec
MapReduce Total cumulative CPU time: 9 seconds 320 msec
Ended Job = job_201308180808_0016
MapReduce Jobs Launched: 
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 9.32 sec HDFS Read: 378 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 320 msec
OK
67612
Time taken: 21.617 seconds

 

Now if I try the following (xxx is a table with some strings on HDFS):

 

hive> insert overwrite table test select * from xxx;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201308180808_0017, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201308180808_0017
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201308180808_0017
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2013-08-18 09:53:52,546 Stage-1 map = 0%, reduce = 0%
2013-08-18 09:54:01,598 Stage-1 map = 50%, reduce = 0%
2013-08-18 09:54:02,606 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.53 sec
2013-08-18 09:54:03,616 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.08 sec
2013-08-18 09:54:04,630 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.08 sec
2013-08-18 09:54:05,640 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.08 sec
MapReduce Total cumulative CPU time: 5 seconds 80 msec
Ended Job = job_201308180808_0017
Ended Job = -1167089537, job is filtered out (removed at runtime).
Ended Job = -386132227, job is filtered out (removed at runtime).
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
java.io.FileNotFoundException: File does not exist: /tmp/hive-ds/hive_2013-08-18_09-53-44_714_504498166730819372/-ext-10002/000000_0
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:824)
at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:411)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:377)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.processPaths(CombineHiveInputFormat.java:419)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:390)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1090)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1082)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:992)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:919)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:448)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:138)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1374)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1160)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:973)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:893)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /tmp/hive-ds/hive_2013-08-18_09-53-44_714_504498166730819372/-ext-10002/000000_0)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 2 Cumulative CPU: 5.08 sec HDFS Read: 1243032 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 80 msec

 

The query tries to create temp files on S3, which I try to suppress with set hive.exec.skips3scratch=true, but it has no effect.

Could someone please help me with the settings so that I can read from and write to S3 in a single query?

Posts: 1,903
Kudos: 435
Solutions: 307
Registered: 07-31-2013

Re: hive s3 and hive.exec.skips3scratch

 


@rckjhn wrote:

The query tries to create temp files on S3, which I try to suppress with set hive.exec.skips3scratch=true, but it has no effect.

Could someone please help me with the settings so that I can read from and write to S3 in a single query?



The property hive.exec.skips3scratch=true seems to only be available in Amazon's Hive offering, and was not contributed upstream by them.

 

What you'll perhaps need to do is explicitly set the full URI for hive.exec.scratchdir. That is, try setting it to "hdfs://<namenode or nameservice>/tmp/hive-${user.name}" instead of just "/tmp/hive-${user.name}", which is its default.
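
For example, in a pseudo-distributed setup where the NameNode listens on localhost:8020 (adjust the host and port for your cluster), something along these lines:

set hive.exec.scratchdir=hdfs://localhost:8020/tmp/hive-${user.name};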

New Contributor
Posts: 5
Registered: 08-18-2013

Re: hive s3 and hive.exec.skips3scratch

Harsh, thank you for the quick reply!

 

I am running in pseudo-distributed mode, so I set hive.exec.scratchdir=hdfs://localhost:8020/tmp/hive-rj

Unfortunately, the tmp files are still created on S3.

 

I am less concerned about the tmp files if all else works.

 

How do I get around having to use

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

for reading from S3, and

SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

for writing?
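
Right now the only workaround I have is toggling within the session, along these lines (which obviously does not help a single query that must both read from and write to S3):

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select count(1) from test;   -- reading the S3 table works
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
insert overwrite table test select * from xxx;   -- writing to the S3 table works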

 

I would really like to use CDH on top of EC2; I just need to get CDH Hive to talk to S3. We are trying to get into the CDH ecosystem to eventually use Impala.

Expert Contributor
Posts: 63
Registered: 08-06-2013

Re: hive s3 and hive.exec.skips3scratch

File does not exist: /test/000000_0
File does not exist: /tmp/hive-ds/hive_2013-08-18_09-53-44_714_504498166730819372/-ext-10002/000000_0

 


The files that were not found are extension-less.

 

"Hive doesn’t play well with extension-less files in s3. Make sure each file has an extension (eg by default it can’t read file-00000, but it can read file-0000.tar.gz)."

http://blog.matthewrathbone.com/2012/03/09/tips-for-using-cdhs-hadoop-distribution-with-amazons-s3.h...

New Contributor
Posts: 5
Registered: 08-18-2013

Re: hive s3 and hive.exec.skips3scratch

How do I force Hive to create its normal/temp files with extensions? Do I have to force it to compress them?

Expert Contributor
Posts: 63
Registered: 08-06-2013

Re: hive s3 and hive.exec.skips3scratch
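
One approach, as a sketch (I have not verified this on CDH4 specifically): have Hive compress its output, so that the files it writes carry a .gz extension:

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;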

New Contributor
Posts: 5
Registered: 08-18-2013

Re: hive s3 and hive.exec.skips3scratch

Thank you very much for this. This works fine for compressed .gz files; however, it will not read .csv files. That is not too big a deal, as it may be better to have everything compressed on S3 anyway.

 

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
java.io.FileNotFoundException: File does not exist: /myprefixt/Myfile.csv

 

Weird that Hive has this problem.

 

Thanks again.