
Pig LZO Inputsplits

Solved

Contributor

I've sqooped a fairly large table into a CDH 4.5 cluster. To save space and still have splittable files, I've used LZO compression and set up LZO as per the Cloudera instructions [1].

 

Sqoop produces a single large LZO file (plus index) for the initial import and adds smaller LZO + index files for the subsequent incremental Sqoop imports.

 

My problem is that Pig doesn't seem to split the LZO files on load, while Hive does so perfectly. Running a simple Pig script gives me exactly 7 mappers, which matches the number of LZO files in the HDFS directory. Six complete quickly (the incremental files) but one takes a very long time. I can't verify it, but that has to be the one large file (90+ GB).
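To illustrate, a minimal sketch of the kind of script I mean (the path and schema are placeholders, not the real table):

```pig
-- Default PigStorage load of the Sqoop'd directory. Each .lzo file
-- becomes a single input split, hence 7 mappers for 7 files.
raw = LOAD '/user/sqoop/mytable' USING PigStorage('\u0001')
      AS (id:long, name:chararray, updated:chararray);

-- Trivial downstream work, just to force a full scan.
grouped = GROUP raw ALL;
counted = FOREACH grouped GENERATE COUNT(raw);
DUMP counted;
```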

 

At the same time, Hive can run a query on the exact same table and gets 700+ mappers, no problem. So the LZO files are splittable, but Pig doesn't seem to split them.

 

LZO in the Pig script seems to be enabled since I'm getting these messages:

INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - loaded native gpl library

INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev null]

 

It seems like Pig can read the LZO files, but does not read the accompanying index files to determine split points.

 

Any suggestions?

 

 

[1] https://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Insta...

1 ACCEPTED SOLUTION

Re: Pig LZO Inputsplits

Master Guru
Pig's default PigStorage loader may not understand how to use the index files created alongside. You'll need to use the ElephantBird loader functions available at https://github.com/kevinweil/elephant-bird to properly load them in a scalable way (you need its com.twitter.elephantbird.pig.load.LzoTextLoader loader specifically, for indexed LZO text files).
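A sketch of what that looks like in a Pig script. The jar names/paths are placeholders for whatever your Elephant Bird build produces:

```pig
-- Register the Elephant Bird jars (placeholder names; match your build).
REGISTER 'elephant-bird-core.jar';
REGISTER 'elephant-bird-pig.jar';
REGISTER 'elephant-bird-hadoop-compat.jar';

-- LzoTextLoader consults the .lzo.index files, so one large .lzo file
-- is broken into many input splits instead of getting a single mapper.
-- Note it loads each line as a single chararray field; it does no
-- delimiter splitting.
raw = LOAD '/user/sqoop/mytable'
      USING com.twitter.elephantbird.pig.load.LzoTextLoader();
```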


3 REPLIES

Re: Pig LZO Inputsplits

Contributor

Is there another way to sqoop data into a compressed container format (any) and have both Hive and Pig understand its splits?

 

AFAIK:

- Sqooping with Snappy will not result in splittable files

- Sqooping to Hive with --as-avrodatafile + Snappy is not a compatible combination

 

Aside from the EB libs, the only option I see is to not use compression. Is this correct?

Re: Pig LZO Inputsplits

Contributor

Thanks for the hint. To be complete:

 

Sqooped data in Hive uses '\u0001' as the field delimiter. LzoTextLoader does not support a custom delimiter; use LzoTokenizedLoader for that. Works like a charm after that.
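For anyone landing here later, a sketch of the final working load (jar names are placeholders, and the exact escaping of the delimiter argument may vary by Elephant Bird version):

```pig
REGISTER 'elephant-bird-core.jar';
REGISTER 'elephant-bird-pig.jar';

-- Hive's default field delimiter is ^A (\u0001). LzoTokenizedLoader
-- splits fields on it while still honoring the .lzo.index split
-- points, so the big file fans out across many mappers.
raw = LOAD '/user/hive/warehouse/mytable'
      USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\u0001')
      AS (id:long, name:chararray, updated:chararray);
```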
