
Pig LZO Inputsplits

Explorer

I've sqooped a fairly large table into a CDH 4.5 cluster. To save space while keeping splittable files, I used LZO compression and set up LZO per the Cloudera instructions [1].

 

Sqoop produces a single large LZO file (plus index) for the initial import and adds smaller LZO+index files for each subsequent incremental import.
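For reference, the import and indexing were done roughly as follows. This is a sketch, not my exact commands: the connection string, table name, paths, and jar location are placeholders; the codec and indexer class names come from the hadoop-lzo project.

```
# Import with LZO compression (LzopCodec writes .lzo files)
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --table mytable \
  --target-dir /data/mytable \
  --compress \
  --compression-codec com.hadoop.compression.lzo.LzopCodec

# Build the .index files so the .lzo files become splittable
hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /data/mytable
```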

 

My problem is that Pig doesn't seem to split the LZO files on load, while Hive does so perfectly. Running a simple Pig script gives me exactly 7 mappers, matching the number of LZO files in the HDFS directory. Six complete quickly (the incremental files), but one takes a very long time. I can't verify it, but that must be the one large file (90+ GB).

 

At the same time, Hive can run a query on the exact same table and gets 700+ mappers with no problem. So the LZO files are splittable, but Pig does not split them.

 

LZO in the Pig script seems to be enabled, since I'm getting these messages:

INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - loaded native gpl library

INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev null]

 

It seems Pig can read the LZO files but does not read the accompanying index files to determine split points.

 

Any suggestions?

 

 

[1] https://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Insta...

1 ACCEPTED SOLUTION

Mentor
Pig's default PigStorage loader does not know how to use the index files created alongside the LZO files. You'll need the ElephantBird loader functions available at https://github.com/kevinweil/elephant-bird to load them in a scalable way; specifically, you need its com.twitter.elephantbird.pig.load.LzoTextLoader for indexed LZO text files.
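In a Pig script that looks roughly like the following. The jar paths are illustrative and depend on where ElephantBird and hadoop-lzo are installed on your cluster; the loader class name is from the ElephantBird project.

```
-- Register the ElephantBird jars and hadoop-lzo (paths are illustrative)
REGISTER /usr/lib/pig/lib/elephant-bird-core.jar;
REGISTER /usr/lib/pig/lib/elephant-bird-pig.jar;
REGISTER /usr/lib/hadoop/lib/hadoop-lzo.jar;

-- LzoTextLoader reads .lzo files and uses the .index files for split points
rows = LOAD '/data/mytable' USING com.twitter.elephantbird.pig.load.LzoTextLoader();
```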


3 REPLIES


Explorer

Is there another way to sqoop data into a compressed container format (any) and have both Hive and Pig understand its splits?

 

AFAIK:

- Sqooping with Snappy will not result in splittable files

- Sqooping to Hive with --as-avrodatafile plus Snappy is not compatible

 

Aside from the ElephantBird libs, the only way I see is not using compression. Is this correct?

Explorer

Thanks for the hint. To be complete:

 

Data sqooped to Hive uses '\u0001' as the field delimiter. LzoTextLoader does not support specifying a custom delimiter; use LzoTokenizedLoader for that. Works like a charm after that.
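For completeness, the working load statement looks something like this. The warehouse path and jar locations are placeholders, and the exact escaping of the delimiter argument may vary by Pig version; LzoTokenizedLoader takes the delimiter as a constructor argument.

```
REGISTER /usr/lib/pig/lib/elephant-bird-core.jar;
REGISTER /usr/lib/pig/lib/elephant-bird-pig.jar;
REGISTER /usr/lib/hadoop/lib/hadoop-lzo.jar;

-- Hive's default field delimiter is \u0001 (Ctrl-A)
rows = LOAD '/user/hive/warehouse/mytable'
       USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\u0001');
```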