
Pig OrcStorage read from different directories


New Contributor

I am trying to load ORC files from subdirectories in Pig using Hadoop file glob patterns, but it does not seem to work:

/user/test/testdata/20160901/orcfile
/user/test/testdata/20160902/orcfile
/user/test/testdata/20160903/orcfile
/user/test/testdata/20160904/orcfile
/user/test/testdata/20160905/orcfile

With non-ORC loaders, the directories are expanded just fine:

data = LOAD '/user/test/testnoonorcdata/201609{01..05}' USING NonOrcLoader(); 

But the same does not work with OrcStorage:

data = LOAD '/user/test/testdata/201609{01..05}' USING OrcStorage(); 

Thanks for looking into this.

3 Replies

Re: Pig OrcStorage read from different directories

Guru

Some questions to help troubleshoot:

  1. Could you post the error message and any other output that would help?
  2. Does loading work for a single ORC file? (Try with an explicit path, and also with a glob that matches only that one ORC file.)
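For reference, the two sanity checks in question 2 could look like this, using the example paths from the original post (the `{1}` glob is just one way to write a pattern that matches a single directory):

```pig
-- explicit path to one ORC file
data = LOAD '/user/test/testdata/20160901/orcfile' USING OrcStorage();

-- glob that matches only that same file
data = LOAD '/user/test/testdata/2016090{1}/orcfile' USING OrcStorage();
```

Comparing the two results tells you whether the failure is with ORC reading itself or specifically with glob expansion.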

Re: Pig OrcStorage read from different directories

New Contributor

I observe exactly the same issue.

Loading multiple files using globs and PigStorage() works fine, but the same using OrcStorage() does not work.

If pointing to a single ORC file explicitly, the loading works fine.

If pointing to a single ORC file using globs, the loading does not work (whatever the glob).

The following is the error trace when using the glob '?' in the file path:

2017-02-01 12:44:20,905 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob - PigLatin:DefaultJobName got an error while submitting
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: serious problem
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
  at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
  at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
  at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
  at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
  at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
  at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
  at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
  at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:194)
  at java.lang.Thread.run(Thread.java:745)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: java.lang.RuntimeException: serious problem
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1172)
  at org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat.getSplits(OrcNewInputFormat.java:121)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:265)
  ... 18 more
Caused by: java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File hdfs://bigdata-current/user/gstat/private/data/temp/w1/hourly/201701150?/resolved_w1_logs does not exist.
  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  at java.util.concurrent.FutureTask.get(FutureTask.java:192)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1149)
  ... 20 more
Caused by: java.io.FileNotFoundException: File hdfs://bigdata-current/user/gstat/private/data/temp/w1/hourly/201701150?/resolved_w1_logs does not exist.
  at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1006)
  at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:985)
  at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:930)
  at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:926)
  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:926)
  at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1694)
  at org.apache.hadoop.hive.shims.Hadoop23Shims.listLocatedStatus(Hadoop23Shims.java:690)
  at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:375)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.callInternal(OrcInputFormat.java:742)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.access$600(OrcInputFormat.java:710)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:732)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:729)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:729)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:710)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

Re: Pig OrcStorage read from different directories

@Sayan Dasgupta The issue is that OrcInputFormat does not extend FileInputFormat, which is the class that handles globbing. This would be a feature request for Hive.
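Until that is addressed in Hive, one workaround sketch in Pig Latin is to load each directory explicitly and combine the relations with UNION (directory names below follow the example paths from the question; extend the pattern for the remaining days):

```pig
-- load each daily directory without any glob
day1 = LOAD '/user/test/testdata/20160901' USING OrcStorage();
day2 = LOAD '/user/test/testdata/20160902' USING OrcStorage();
day3 = LOAD '/user/test/testdata/20160903' USING OrcStorage();

-- combine the relations; this assumes all days share the same ORC schema
data = UNION day1, day2, day3;
```

UNION requires compatible schemas, which should hold if all directories contain ORC files written by the same job. Alternatively, if the data is registered as a partitioned Hive table, HCatLoader can read it and resolves partitions itself, sidestepping the glob entirely.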
