Create Hive table to read parquet files from parquet/avro schema



Contributor

Hello experts!

We are looking for a way to create an external Hive table that reads data from Parquet files according to a Parquet/Avro schema.

In other words, how can we generate a Hive table from a Parquet/Avro schema?

Thanks :)

tazimehdi.com
1 ACCEPTED SOLUTION

Re: Create Hive table to read parquet files from parquet/avro schema

Contributor

The solution is to dynamically create a table from the Avro schema, and then create a new Parquet-format table from the Avro one.

Here is the Hive DDL; hopefully this helps:

CREATE TABLE avro_test
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS AVRO
  TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');

CREATE EXTERNAL TABLE parquet_test LIKE avro_test
  STORED AS PARQUET
  LOCATION 'hdfs://myParquetFilesPath';
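For reference, the avro.schema.url property above points at a plain Avro schema (.avsc) file on a reachable host or HDFS path. A minimal hypothetical myAvroSchema.avsc (record and field names invented for illustration) could look like:

```json
{
  "type": "record",
  "name": "MyRecord",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "price", "type": ["null", "double"], "default": null}
  ]
}
```

Hive derives the table columns (id bigint, name string, price double) from these fields.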
tazimehdi.com

Re: Create Hive table to read parquet files from parquet/avro schema

@Mehdi TAZI

Avro

CREATE EXTERNAL TABLE table_avro
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/user/table/table_avro'
  TBLPROPERTIES ('avro.schema.url'='hdfs://user/schemas/table_avro.avsc');

From Hive 0.14 and Up (Easier DDL)

CREATE TABLE kst (
    string1 string,
    string2 string,
    int1 int,
    boolean1 boolean,
    long1 bigint,
    float1 float,
    double1 double,
    inner_record1 struct<int_in_inner_record1:int, string_in_inner_record1:string>,
    enum1 string,
    array1 array<string>,
    map1 map<string,int>,
    union1 uniontype<float,boolean,string>,
    fixed1 binary,
    null1 void,
    unionnullint int,
    bytes1 binary)
  PARTITIONED BY (ds string)
  STORED AS AVRO;
	

See the Apache Hive language docs for more examples on Avro and Parquet.

However, to get the full performance benefits of Hive with cost-based optimization (CBO) and vectorization, you should consider storing your Hive tables in the ORC format.

Re: Create Hive table to read parquet files from parquet/avro schema

Contributor

Thanks for your answer. Actually, this is what I'm trying to do: I already have Parquet files, and I want to dynamically create an external Hive table that reads from the Parquet files (not Avro ones), according to either an Avro or a Parquet schema. (The Parquet files were created from Avro.)

tazimehdi.com

Re: Create Hive table to read parquet files from parquet/avro schema

Mentor

@Mehdi TAZI can you accept the best answer to close this thread?

Re: Create Hive table to read parquet files from parquet/avro schema

Contributor

Actually, there is no answer to my question yet. I'll publish the answer soon and accept it.

tazimehdi.com

Re: Create Hive table to read parquet files from parquet/avro schema

Contributor

The work to generically create a table by reading a schema from ORC, Parquet, or Avro files is tracked in HIVE-10593.


Re: Create Hive table to read parquet files from parquet/avro schema

New Contributor

I am facing the same problem. I tried your solution, but my Parquet table does not get refreshed when I modify the Avro schema. The avro_test table picks up any schema changes, but parquet_test does not. I dropped and recreated it, but the changes are still not reflected. Any idea?

Re: Create Hive table to read parquet files from parquet/avro schema

New Contributor

Thanks for this - it works for Parquet, but how does one do this for a table backed by CSV? Let's say a CSV schema changes; I want to be able to use Avro schema evolution to create the table.

I tried the same CREATE statement, but with STORED AS TEXTFILE and ROW FORMAT DELIMITED etc.; I end up getting null values.

Re: Create Hive table to read parquet files from parquet/avro schema

Contributor

As @Jean-Philippe Player mentions, creating a table generically by reading the schema from a Parquet directory is not yet supported by Hive. Source: http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html. You can, however, do it in Impala:

-- Using Impala:
CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat'
  STORED AS PARQUET
  LOCATION '/user/etl/destination';

With some Spark/Scala code you can generate the CREATE TABLE statement based on a Parquet file:

// Read the Parquet file and register it as a temporary table
sqlContext.read.parquet("/user/etl/destination/datafile1.dat").registerTempTable("mytable")

// Each row of DESCRIBE is "colname (space) data-type"
val df = sqlContext.sql("describe mytable")
val columns = df.map(row => row(0) + " " + row(1)).collect()

// Print the Hive create table statement (point LOCATION at the directory, not a single file):
println("CREATE EXTERNAL TABLE mytable")
println(s"  (${columns.mkString(", ")})")
println("STORED AS PARQUET")
println("LOCATION '/user/etl/destination';")
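The same idea can be sketched in plain Python without Spark: once you have the column name/type pairs (for example, from the output of DESCRIBE), assembling the DDL is just string formatting. The table name, columns, and location below are invented for illustration:

```python
# Build a Hive CREATE EXTERNAL TABLE statement from (name, type) pairs.
# All names and paths here are hypothetical examples.
def build_create_table(table, columns, location):
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"  {cols})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )

ddl = build_create_table(
    "mytable",
    [("id", "bigint"), ("name", "string"), ("price", "double")],
    "/user/etl/destination",
)
print(ddl)
```

This only covers DDL generation; you would still obtain the column list from whatever tool can read the Parquet schema (Spark, Impala, or parquet-tools).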