
Create Hive table to read parquet files from parquet/avro schema

Rising Star

Hello Experts!

We are looking for a way to create an external Hive table that reads data from Parquet files according to a Parquet/Avro schema.

In other words, how can we generate a Hive table from a Parquet/Avro schema?

Thanks 🙂

tazimehdi.com
1 ACCEPTED SOLUTION

Rising Star

The solution is to dynamically create a table from the Avro schema, and then create a new Parquet-format table from the Avro one.

Here is the Hive code; hopefully it helps you:

CREATE TABLE avro_test
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS AVRO
  TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');

CREATE EXTERNAL TABLE parquet_test LIKE avro_test
  STORED AS PARQUET
  LOCATION 'hdfs://myParquetFilesPath';
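
For reference, the file pointed to by 'avro.schema.url' is just an Avro schema in JSON (an .avsc file). A made-up minimal example (record and field names are illustrative only, not the actual schema from this post):

{
  "type": "record",
  "name": "myRecord",
  "fields": [
    {"name": "id",   "type": "int"},
    {"name": "name", "type": "string"}
  ]
}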
tazimehdi.com


10 REPLIES


@Mehdi TAZI

Avro

CREATE EXTERNAL TABLE table_avro
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/user/table/table_avro'
  TBLPROPERTIES ('avro.schema.url'='hdfs://user/schemas/table_avro.avsc');
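
As a quick sanity check (a suggested step, not from the reply above), you can confirm that Hive derived the columns from the .avsc file:

-- The listed columns should match those defined in table_avro.avsc
DESCRIBE table_avro;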

From Hive 0.14 and Up (Easier DDL)

CREATE TABLE kst (
    string1 string,
    string2 string,
    int1 int,
    boolean1 boolean,
    long1 bigint,
    float1 float,
    double1 double,
    inner_record1 struct<int_in_inner_record1:int,string_in_inner_record1:string>,
    enum1 string,
    array1 array<string>,
    map1 map<string,int>,
    union1 uniontype<float,boolean,string>,
    fixed1 binary,
    null1 void,
    unionnullint int,
    bytes1 binary)
  PARTITIONED BY (ds string)
  STORED AS AVRO;
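
For completeness (not part of the original answer), with STORED AS AVRO the Avro schema can also be supplied inline through the 'avro.schema.literal' table property instead of a column list or 'avro.schema.url'. A minimal sketch, with a made-up record for illustration:

CREATE TABLE avro_literal_test
  STORED AS AVRO
  TBLPROPERTIES ('avro.schema.literal'='{
    "type": "record",
    "name": "avro_literal_test",
    "fields": [
      {"name": "id",   "type": "int"},
      {"name": "name", "type": "string"}
    ]
  }');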
	

See the Apache Hive Language Manual for more examples on Avro and Parquet.

However, to get the full performance benefits of Hive with cost-based optimization and vectorization, you should consider storing your Hive tables in the ORC format.
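
If you do go that route, here is a minimal sketch (not from the original answer; table names are placeholders based on the examples in this thread) of converting a Parquet-backed table to ORC with a CTAS:

-- Hypothetical example: materialize the Parquet-backed table as ORC
CREATE TABLE orc_test
  STORED AS ORC
  AS SELECT * FROM parquet_test;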

Rising Star

Thanks for your answer. Actually, this is what I'm trying to do: I already have Parquet files, and I want to dynamically create an external Hive table that reads from the Parquet files (not Avro ones), according to either an Avro or a Parquet schema. (The Parquet files were created from Avro.)

tazimehdi.com

Master Mentor

@Mehdi TAZI can you accept the best answer to close this thread?

Rising Star

Actually, there is no answer to my question yet; I'll publish the answer soon and accept it.

tazimehdi.com

Rising Star

The work to generically create a table by reading a schema from ORC, Parquet, and Avro files is tracked in HIVE-10593.


Explorer

I am also facing the same problem as you. I tried your solution, but my Parquet table does not get refreshed when I modify the Avro schema. The avro_test table picks up any schema changes, but parquet_test does not. I dropped and re-created it, but the changes are still not reflected. Any idea?

Explorer

Thanks for this - it works for Parquet, but how does one do this for a table backed by CSV? Let's say a CSV schema changes; I want to be able to use Avro schema evolution to create the table.

I tried the same CREATE statement, but using STORED AS TEXTFILE and with ROW FORMAT DELIMITED, etc. I end up getting null values.
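
For reference, the statement meant here would be roughly along these lines (a sketch only; column names, delimiter, and path are illustrative, not from the original post):

-- Illustrative sketch of the TEXTFILE variant that returns NULLs
CREATE EXTERNAL TABLE csv_test (
    id int,
    name string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/user/table/csv_test';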

Expert Contributor

As @Jean-Philippe Player mentions, reading a Parquet directory as a table (i.e., inferring the table schema from the Parquet files) is not yet supported by Hive. Source: http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html. You are able to do it in Impala:

-- Using Impala:
CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat'
  STORED AS PARQUET
  LOCATION '/user/etl/destination';

With some Spark/Scala code you can generate the CREATE TABLE statement based on a Parquet file:

// Register the Parquet file as a temporary view
spark.read.parquet("/user/etl/destination/datafile1.dat").createOrReplaceTempView("mytable")

// DESCRIBE returns one row per column: (col_name, data_type, comment)
val df = spark.sql("describe mytable")

// Build "colname (space) data-type" strings
val columns = df.collect().map(row => row(0) + " " + row(1))

// Print the Hive create table statement:
println("CREATE EXTERNAL TABLE mytable")
println(s"  (${columns.mkString(", ")})")
println("STORED AS PARQUET")
println("LOCATION '/user/etl/destination';")