I've Sqooped a fairly large table into a CDH 4.5 cluster. To save space and still have splittable files, I've used LZO compression and set up LZO per the Cloudera instructions.
Sqoop produces a single large LZO file (plus index) for the initial import and adds smaller LZO+index files for each subsequent incremental import.
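For context, the import was done along these lines. This is only a sketch of the kind of command described above; the connection string, table name, and target directory are placeholders, not the actual values used:

```shell
# import the table compressed with the LZOP codec (placeholder connection/table/paths)
sqoop import \
  --connect jdbc:mysql://db.example.com/mydb \
  --table mytable \
  --target-dir /user/sqoop/mytable \
  --compress \
  --compression-codec com.hadoop.compression.lzo.LzopCodec

# index the LZO files so they become splittable (jar path varies per install)
hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /user/sqoop/mytable
```

The indexer writes a `.lzo.index` file next to each `.lzo` file, which is what makes split points discoverable.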
My problem is that Pig doesn't seem to split the LZO files on load, while Hive does so perfectly. Running a simple Pig script gives me exactly 7 mappers, which matches the number of LZO files in the HDFS directory. Six complete quickly (the incremental files), but one takes a very long time. I can't verify it, but that has to be the one large file (90+ GB).
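The script in question was a minimal one of roughly this shape (the path is a placeholder, and the use of Elephant Bird's LzoTextLoader is an assumption):

```pig
-- load the Sqooped table via the LZO-aware text loader (placeholder path)
A = LOAD '/user/sqoop/mytable'
    USING com.twitter.elephantbird.pig.load.LzoTextLoader();

-- trivial downstream work, just enough to force a map phase
B = LIMIT A 10;
DUMP B;
```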
At the same time, Hive can run a query on the exact same table and gets 700+ mappers, no problem. So the LZO files are splittable, but Pig doesn't seem to split them.
LZO seems to be enabled in the Pig script, since I'm getting these messages:
INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - loaded native gpl library
INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev null]
It seems Pig can read the LZO files but does not read the accompanying index files to determine split points.
Is there another way to Sqoop data into a compressed container format (any) such that both Hive and Pig understand its splits?
- Sqooping with Snappy does not result in splittable files
- Sqooping to Hive with --as-avrodatafile + Snappy is not compatible
Aside from the Elephant Bird (EB) libs, the only option I see is to skip compression. Is this correct?
Thanks for the hint. To be complete:
Data Sqooped to Hive uses '\u0001' as the field delimiter. LzoTextLoader does not support a custom delimiter; use LzoTokenizedLoader for that. Works like a charm after that ;)
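For anyone hitting the same issue, the fix above can be sketched like this (the jar path and table location are placeholders):

```pig
-- register the Elephant Bird jar that provides LzoTokenizedLoader (placeholder path)
REGISTER '/path/to/elephant-bird.jar';

-- load the Sqooped Hive table, passing Hive's default field delimiter '\u0001';
-- LzoTokenizedLoader reads the .lzo.index files, so the large file is split properly
A = LOAD '/user/hive/warehouse/mytable'
    USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\u0001');
```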