Support Questions

RobV · ‎02-04-2014

I've sqooped a fairly large table into a CDH4.5 cluster. To save space and still have splittable files I've used LZO compression and setup LZO as per the cloudera instructions [1].

Sqoop makes a single large LZO file(+index) for the initial import and adds smaller LZO+index files for the subsequent incremental sqoop imports.

My problem is that Pig doesn't seem to split the LZO files for its import, while Hive does so perfectly. Running a simple pig script gives me exactly 7 mappers, which is identical to the lzo files in the HDFS dir. 6 complete quickly(the incremental files) but 1 takes a very long time. Can't verify but that has to be the one large file(90+ Gigs)

At the same time Hive can do a query on the exact same table and get 700+ mappers no problem. So the LZO is splittable, but Pig seems not split them.

LZO in the Pig script seems to be enabled since I'm getting these messages:

INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - loaded native gpl library

INFO com.hadoop.compression.lzo.lzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev null]

Seems like Pig can read the LZO files, but does not read the index files with those filed to determine split points.

Any suggestions?

[1] https://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Insta...

Harsh J · ‎02-04-2014

Pig's default PigStorage loader may not understand how to use the
index files created alongside. You'll need to use the ElephantBird
loader functions available at
https://github.com/kevinweil/elephant-bird to properly load them in a
scalable way (you need its
com.twitter.elephantbird.pig.load.LzoTextLoader loader specifically,
for indexed LZO text files).

View solution in original post

Harsh J · ‎02-04-2014

Pig's default PigStorage loader may not understand how to use the
index files created alongside. You'll need to use the ElephantBird
loader functions available at
https://github.com/kevinweil/elephant-bird to properly load them in a
scalable way (you need its
com.twitter.elephantbird.pig.load.LzoTextLoader loader specifically,
for indexed LZO text files).

RobV · ‎02-04-2014

Is there another way to sqoop data into a compressed container format(any) and have Hive and Pig understand its splits?

afik:

- Sqooping with snappy will not result in splittable files

- Sqooping to hive and using --as-avrofile + snappy is not compatible

Asside from the EB libs, the only way I see is not using compression, is this correct?

RobV · ‎02-05-2014

Thnx for the hint, to be complete:

Sqooped data to Hive uses '\u0001' as a field delimiter. LzoTextLoader does not support adding a custom delimiter, use LzoTokenizedLoader for that. Works like a charm after that

Cloudera Community

Support Questions

Pig LZO Inputsplits

Unable to use lzo codec

Pig Doing Yoga: How to Build Superflexible Pig Scr...

Apache Pig IN operator, placeholder until PIG-4931...

Getting started with Pig-Eclipse

Installing LZO compression broke Hive completely

Enabling LZO compression using NiFi PutHDFS

Order by Operator in Pig

How to use pig in zeppelin

Import HBase data in csv format using pig

How to access data files stored in AWS S3 buckets ...