Support Questions

RobV · ‎02-04-2014

I've sqooped a fairly large table into a CDH4.5 cluster. To save space and still have splittable files, I used LZO compression and set up LZO as per the Cloudera instructions [1].

Sqoop creates a single large LZO file (plus index) for the initial import and adds smaller LZO and index files for each subsequent incremental import.

My problem is that Pig doesn't seem to split the LZO files when loading them, while Hive does so perfectly. Running a simple Pig script gives me exactly 7 mappers, which matches the number of LZO files in the HDFS directory. Six complete quickly (the incremental files), but one takes a very long time. I can't verify it, but that has to be the one large file (90+ GB).

At the same time, Hive can run a query on the exact same table and get 700+ mappers, no problem. So the LZO files are splittable, but Pig does not seem to split them.

LZO in the Pig script seems to be enabled since I'm getting these messages:

INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - loaded native gpl library

INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev null]

It seems Pig can read the LZO files but does not read the accompanying index files to determine split points.

Any suggestions?

[1] https://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Insta...

Harsh J · ‎02-04-2014

Pig's default PigStorage loader may not understand how to use the index files created alongside the LZO files. You'll need to use the ElephantBird loader functions available at https://github.com/kevinweil/elephant-bird to load them in a scalable way (specifically, you need its com.twitter.elephantbird.pig.load.LzoTextLoader loader for indexed LZO text files).
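
As a rough sketch of what that could look like in a Pig script: the jar locations, jar names, and input path below are placeholders (they are not from this thread and depend on your Elephant Bird version and install), so treat this as an illustration rather than a tested recipe.

```pig
-- Placeholder jar locations; adjust to wherever the Elephant Bird
-- and hadoop-lzo jars live on your cluster.
REGISTER /path/to/elephant-bird.jar;
REGISTER /path/to/hadoop-lzo.jar;

-- Placeholder input path. LzoTextLoader returns each input line as a
-- single field and should honor the .index files when computing splits.
raw = LOAD '/path/to/sqooped/table'
      USING com.twitter.elephantbird.pig.load.LzoTextLoader();

-- Simple row count, just to see how many mappers the job gets.
grouped = GROUP raw ALL;
total   = FOREACH grouped GENERATE COUNT(raw);
DUMP total;
```

If the index files are picked up, the mapper count for the large file should be well above one.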

RobV · ‎02-04-2014

Is there another way to sqoop data into a compressed container format (any) and have both Hive and Pig understand its splits?

AFAIK:

- Sqooping with Snappy will not result in splittable files

- Sqooping to Hive and using --as-avrofile + Snappy is not compatible

Aside from the EB libs, the only way I see is not using compression. Is this correct?

RobV · ‎02-05-2014

Thanks for the hint. For completeness:

Data sqooped to Hive uses '\u0001' as the field delimiter. LzoTextLoader does not support a custom delimiter; use LzoTokenizedLoader for that. Works like a charm after that.
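
For anyone landing here later, a minimal sketch of that load statement. The jar paths and input path are placeholders, and I'm assuming LzoTokenizedLoader takes the delimiter as its single constructor argument, written the same way you would pass it to PigStorage; the exact escaping may differ between versions.

```pig
-- Placeholder jar locations; adjust for your install.
REGISTER /path/to/elephant-bird.jar;
REGISTER /path/to/hadoop-lzo.jar;

-- Placeholder input path. The '\u0001' argument is the Hive default
-- field delimiter (Ctrl-A) that Sqoop writes for Hive imports.
rows = LOAD '/path/to/sqooped/table'
       USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\u0001');

-- Project the first two fields positionally, since no schema is declared.
firstcols = FOREACH rows GENERATE $0, $1;
```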

Cloudera Community

Pig LZO Inputsplits