Pig LZO Inputsplits
Labels: Apache Pig
Created on 02-04-2014 05:37 AM - edited 09-16-2022 01:53 AM
I've sqooped a fairly large table into a CDH 4.5 cluster. To save space and still have splittable files, I've used LZO compression and set up LZO as per the Cloudera instructions [1].
Sqoop makes a single large LZO file (plus index) for the initial import and adds smaller LZO+index files for the subsequent incremental Sqoop imports.
My problem is that Pig doesn't seem to split the LZO files when it loads them, while Hive does so perfectly. Running a simple Pig script gives me exactly 7 mappers, which matches the number of LZO files in the HDFS directory. Six complete quickly (the incremental files), but one takes a very long time. I can't verify it, but that has to be the one large file (90+ GB).
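For illustration, the script is essentially just a load and a count, something like this (the path is made up; PigStorage is Pig's default loader):

-- Minimal sketch: load the LZO files with Pig's default loader and count rows.
A = LOAD '/user/sqoop/mytable' USING PigStorage();
B = GROUP A ALL;
C = FOREACH B GENERATE COUNT(A);
DUMP C;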
At the same time, Hive can run a query on the exact same table and get 700+ mappers, no problem. So the LZO files are splittable, but Pig does not seem to split them.
LZO seems to be enabled in the Pig script, since I'm getting these messages:
INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - loaded native gpl library
INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev null]
It seems like Pig can read the LZO files, but does not read the index files alongside them to determine split points.
Any suggestions?
Created 02-04-2014 05:51 AM
Pig's default loaders do not use the LZO index files created alongside. You'll need to use the ElephantBird loader functions available at https://github.com/kevinweil/elephant-bird to properly load them in a scalable way (you need its com.twitter.elephantbird.pig.load.LzoTextLoader loader specifically, for indexed LZO text files).
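For example, a minimal script using that loader could look like this (jar names and path are illustrative; use whatever versions are on your cluster):

-- Register the ElephantBird and hadoop-lzo jars (names are illustrative).
REGISTER elephant-bird-core.jar;
REGISTER hadoop-lzo.jar;
-- LzoTextLoader honors the .index files, so a large .lzo file is split
-- across many mappers; each record is one line of text.
A = LOAD '/user/sqoop/mytable' USING com.twitter.elephantbird.pig.load.LzoTextLoader();
DUMP A;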
Created 02-04-2014 06:52 AM
Is there another way to sqoop data into a compressed container format (any) and have both Hive and Pig understand its splits?
AFAIK:
- Sqooping with Snappy will not result in splittable files
- Sqooping to Hive with --as-avrodatafile + Snappy is not compatible
Aside from the ElephantBird libs, the only way I see is to not use compression. Is this correct?
Created 02-05-2014 01:33 AM
Thanks for the hint. To be complete: data sqooped to Hive uses '\u0001' as the field delimiter. LzoTextLoader does not support specifying a custom delimiter; use LzoTokenizedLoader for that. Works like a charm after that.
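In other words, something like this (path illustrative; I'm assuming the delimiter is passed as a constructor argument, and the escaping may vary by Pig version):

-- LzoTokenizedLoader splits each line on the given delimiter;
-- Sqoop's Hive imports use Ctrl-A ('\u0001') between fields.
A = LOAD '/user/sqoop/mytable' USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\u0001');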
