Pig and Parquet

All,

Please forgive me if the answer is obvious or has already been answered. I promise I spent considerable time searching for the answer before posting here.

 

Problem:

Ingest multiple fixed-width files into HDFS (4 GB compressed with bz2, 35 GB uncompressed). At the end of the day, the data must be available to analysts via Hive or Impala. The ideal solution would be performant and storage friendly.

 

Attempts:

1. Create an external table over the bz2 files. Queries return no data.

 

2. Using Pig, load the files with PigStorage and store them as Parquet.

This works for one file, but once I point the load statement at a directory, it fails after processing for an hour or so. The error message in Hue doesn't shed much light on the issue.

 

Is it possible to create a Hive or Impala table using bz2 compressed files?

 

Is it possible to convert fixed-width bz2 files to Parquet using Pig?

 

Also, if I'm chasing the wrong approach, suggestions would be greatly appreciated.

 

Code:

Hive Create Table

CREATE EXTERNAL TABLE bzcompressed (col1 string)
LOCATION '/user/bzipped/';

 

 

SELECT substr(col1, 1, 50) FROM bzcompressed; -- returns 0 rows
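
For illustration, the substr slicing above could be extended into a conversion step once the external table actually returns rows. This is only a sketch: the target table name parquet_converted, the column names field_a and field_b, and the offsets are hypothetical placeholders, not the real record layout, and it assumes a Hive version that supports STORED AS PARQUET in a CREATE TABLE AS SELECT.

```sql
-- Sketch only: table name, column names, and offsets are hypothetical.
-- Reads each fixed-width line from the bz2-backed text table and
-- writes the sliced fields into a new Parquet-backed managed table.
CREATE TABLE parquet_converted
STORED AS PARQUET
AS
SELECT
  substr(col1, 1, 50)  AS field_a,  -- characters 1-50 (substr is 1-based)
  substr(col1, 51, 20) AS field_b   -- characters 51-70
FROM bzcompressed;
```

This does not by itself explain the 0-row result; it only shows the shape the conversion would take if the source table can be read.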

 

Pig

A = Load '/user/bzipped/' using PigStorage();

B = do some foreach row stuff....

STORE B INTO '/user/parquet/' USING parquet.pig.ParquetStorer();

-- This code works if I explicitly name a single file, but fails when I reference just a directory.
