Support Questions

Find answers, ask questions, and share your expertise

Pig ParquetStorer is not working

New Contributor

Hi There,

We are getting the following error when using ParquetStorer in Pig:

ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Cannot instantiate class org.apache.pig.builtin.ParquetStorer (parquet.pig.ParquetStorer) 

We are using HDP-2.3.4.0-3485 version.

We'd appreciate it if anyone has any pointers on this.

Thank you,

Ibrahim


5 REPLIES 5

Master Mentor

You need to download the Parquet jars, upload them to the cluster, and register them in your Pig script. HDP doesn't ship with Parquet support for Pig out of the box. @Ibrahim Jarrar
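A rough sketch of those steps from the shell (the jar version, Maven URL, and HDFS path here are illustrative assumptions, not from this thread):

```shell
# Download a Parquet-for-Pig bundle jar from Maven Central (version is an example)
wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-pig-bundle/1.8.1/parquet-pig-bundle-1.8.1.jar

# Upload it to HDFS so it is visible to all nodes running the job
hdfs dfs -put parquet-pig-bundle-1.8.1.jar /user/myuser/

# Then, at the top of the Pig script, register it before using ParquetStorer:
#   REGISTER hdfs:///user/myuser/parquet-pig-bundle-1.8.1.jar;
```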

here's an example.

Expert Contributor

-- Register the jars
REGISTER lib/parquet-pig-1.3.1.jar;
REGISTER lib/parquet-column-1.3.1.jar;
REGISTER lib/parquet-common-1.3.1.jar;
REGISTER lib/parquet-format-2.0.0.jar;
REGISTER lib/parquet-hadoop-1.3.1.jar;
REGISTER lib/parquet-encoding-1.3.1.jar;

-- Store in Parquet format
SET parquet.compression gzip; -- or snappy
STORE mydata INTO '/path/to/table' USING parquet.pig.ParquetStorer;

-- Options you might want to tune
SET parquet.page.size 1048576;    -- default; this is your minimum read/write unit
SET parquet.block.size 134217728; -- default; your memory budget for buffering data
SET parquet.compression lzo;      -- or none, gzip, snappy

-- Reading the data back
mydata = LOAD '/some/path' USING parquet.pig.ParquetLoader AS (x:int, y:int);

Master Mentor

@Ibrahim Jarrar has this been resolved? Can you provide your solution or accept the best answer?

Master Mentor

Here's a much cleaner working example, tested with HDP 2.6:

wget http://central.maven.org/maven2/org/apache/parquet/parquet-pig-bundle/1.8.1/parquet-pig-bundle-1.8.1...
hdfs dfs -put parquet-pig-bundle-1.8.1.jar .
pig -x tez
REGISTER hdfs://dlm3ha/user/centos/parquet-pig-bundle-1.8.1.jar;
-- words is a CSV file with five fields
data = LOAD 'words' USING PigStorage(',') AS (f1:chararray,f2:chararray,f3:chararray,f4:chararray,f5:chararray);
STORE data INTO 'hdfs://dlm3ha/user/centos/output' USING org.apache.parquet.pig.ParquetStorer;

New Contributor

I used:

register parquet-pig-1.10.1.jar;
register parquet-encoding-1.8.2.jar;
register parquet-column-1.8.2.jar;
register parquet-common-1.8.2.jar;
register parquet-hadoop-1.8.2.jar;
register parquet-format-2.3.1.jar;

base = LOAD '/XXX/yyy/archivo.parquet' USING org.apache.parquet.pig.ParquetLoader AS (
    xxx:chararray,
    yyyy:chararray,
    ...
);

and it worked.