
Pig and Parquet

All,

Please forgive me if the answer is obvious or has already been answered. I promise I spent considerable time searching for an answer before posting here.

 

Problem:

Ingest multiple fixed-width files into HDFS (4 GB compressed with bz2; 35 GB uncompressed). At the end of the day, the data must be available to analysts via Hive or Impala. The ideal solution would be performant and storage-friendly.

 

Attempts:

1. Create an external table over the bz2 files. Queries return NO DATA.

 

2. Use Pig to load the files with PigStorage and store them as Parquet.

This works for one file, but once I point the load statement at a directory, it fails after processing for an hour or so. The error message in Hue doesn't shed much light on the issue.

 

Is it possible to create a Hive or Impala table over bz2-compressed files?

 

Is it possible to convert fixed-width bz2 files to Parquet using Pig?

 

Also, if I'm chasing the wrong approach, suggestions would be greatly appreciated.

 

Code:

Hive Create Table

CREATE EXTERNAL TABLE bzcompressed (col1 STRING)
STORED AS TEXTFILE
LOCATION '/user/bzipped/';

SELECT SUBSTR(col1, 1, 50) FROM bzcompressed;  -- returns 0 rows
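
For completeness, here is the same DDL with the storage clauses spelled out. As I understand it, Hive's default text SerDe decompresses bzip2 transparently as long as the files keep their .bz2 extension; the table name below is just a scratch name for illustration.

-- Same table with row format and file format made explicit.
-- TextInputFormat should decompress the files automatically,
-- provided they end in .bz2.
CREATE EXTERNAL TABLE bzcompressed_explicit (
  col1 STRING                    -- the whole fixed-width line
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'    -- no \001 in the data, so each line lands in col1
STORED AS TEXTFILE
LOCATION '/user/bzipped/';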

 

Pig

A = LOAD '/user/bzipped/' USING PigStorage();
B = FOREACH A GENERATE ...;   -- per-row transformations elided
STORE B INTO '/user/parquet/' USING parquet.pig.ParquetStorer();

-- this works if I explicitly name a single file, but fails when I reference just a directory
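
To make the intent concrete, here is the full conversion as I picture it. This is only a sketch: the column positions, field names, and output path are invented placeholders for my real layout, TextLoader and SUBSTRING are Pig builtins, and the ParquetStorer class name is the one I believe parquet-pig registers (my jar may differ).

-- Assumes the parquet-pig bundle jar is registered first, e.g.
-- REGISTER /path/to/parquet-pig-bundle.jar;

-- Load every .bz2 file in the directory; Hadoop's TextInputFormat
-- decompresses bzip2 transparently based on the file extension.
raw = LOAD '/user/bzipped/*.bz2' USING TextLoader() AS (line:chararray);

-- Carve fixed-width columns out of each line.
-- The positions below are placeholders for the real record layout.
cols = FOREACH raw GENERATE
    SUBSTRING(line, 0, 10)  AS field_a,
    SUBSTRING(line, 10, 25) AS field_b,
    SUBSTRING(line, 25, 33) AS field_c;

-- Write the result out as Parquet for Hive/Impala to read.
STORE cols INTO '/user/parquet_out/' USING parquet.pig.ParquetStorer();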