Created 12-10-2015 01:02 PM
Hello Experts!
We are looking for a way to create an external Hive table that reads data from Parquet files according to a Parquet/Avro schema.
In other words, how can we generate a Hive table from a Parquet/Avro schema?
Thanks 🙂
Created 12-10-2015 03:29 PM
Avro
CREATE EXTERNAL TABLE table_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/table/table_avro'
TBLPROPERTIES ('avro.schema.url'='hdfs://user/schemas/table_avro.avsc');
From Hive 0.14 and Up (Easier DDL)
CREATE TABLE kst (
  string1 string,
  string2 string,
  int1 int,
  boolean1 boolean,
  long1 bigint,
  float1 float,
  double1 double,
  inner_record1 struct<int_in_record1:int, string_in_record1:string>,
  enum1 string,
  array1 array<string>,
  map1 map<string,int>,
  union1 uniontype<float, boolean, string>,
  fixed1 binary,
  null1 void,
  unionnullint int,
  bytes1 binary)
PARTITIONED BY (ds string)
STORED AS AVRO;
See the Apache Hive Language Manual for more examples on Avro and Parquet.
However, to get the full performance benefits of Hive with cost-based optimization (CBO) and vectorization, you should consider storing your Hive tables in the ORC format.
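For example, an existing Avro or Parquet table can be copied into ORC with a CREATE TABLE AS SELECT. A minimal sketch, assuming the table_avro table defined above (the ORC table name is just a placeholder):
-- Hypothetical example: materialize an ORC copy of an existing table.
CREATE TABLE table_orc STORED AS ORC AS SELECT * FROM table_avro;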
Created 12-10-2015 05:01 PM
Thanks for your answer. Actually, this is what I'm trying to do: I already have Parquet files, and I want to dynamically create an external Hive table that reads from the Parquet files (not Avro ones), according to either an Avro or a Parquet schema. (The Parquet files were created from Avro.) A sketch of the manual DDL I want to avoid is below.
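For reference, the manual workaround would look something like this, with the columns typed out by hand (the column names here are purely hypothetical); the whole point is to generate such DDL from the Avro/Parquet schema instead:
-- Hypothetical columns, only to illustrate the manual DDL being avoided.
CREATE EXTERNAL TABLE parquet_manual (
  id BIGINT,
  name STRING,
  created_at TIMESTAMP)
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath';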
Created 02-02-2016 01:51 AM
@Mehdi TAZI can you accept the best answer to close this thread?
Created 02-02-2016 03:04 PM
Actually, there is no answer to my question yet. I'll publish the answer soon and accept it.
Created 12-10-2015 05:43 PM
The work to generically create a table by reading a schema from ORC, Parquet, and Avro is tracked in HIVE-10593.
Created 02-02-2016 03:08 PM
The solution is to dynamically create a table from the Avro schema, and then create a new Parquet-format table from the Avro one.
Here is the Hive DDL; hopefully this helps:
CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';
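To double-check that the Parquet table really picked up the columns defined in the Avro schema, a quick sanity check (just a sketch):
-- Optional: confirm that parquet_test inherited the Avro-defined columns.
SHOW CREATE TABLE parquet_test;
DESCRIBE FORMATTED parquet_test;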
Created 02-17-2016 05:59 AM
I am also facing the same problem. I tried your solution, but my Parquet table is not getting refreshed when I modify the Avro schema. The avro_test table picks up the schema changes, but parquet_test does not. I dropped and re-created it, but the changes are still not reflected. Any idea?
Created 12-02-2016 04:32 AM
Thanks for this - it works for Parquet, but how does one do this for a table backed by CSV? Say the CSV schema changes; I want to be able to use Avro schema evolution to create the table.
I tried the same CREATE statement, but using STORED AS TEXTFILE with ROW FORMAT DELIMITED, etc., and I end up getting null values.
Created 09-07-2016 10:14 AM
As @Jean-Philippe Player mentions, reading the schema from a Parquet directory to create a table is not yet supported by Hive. Source: http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html. You can, however, do it in Impala:
-- Using Impala:
CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat' STORED AS PARQUET LOCATION '/user/etl/destination';
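Since Impala and Hive share the same metastore, the table created above should then be queryable from Hive as well; a small sketch, assuming the Impala statement above was run first:
-- Back in Hive: the table created by Impala is visible through the shared metastore.
DESCRIBE ingest_existing_files;
SELECT COUNT(*) FROM ingest_existing_files;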
With some Spark/Scala code you can generate the CREATE TABLE statement based on a Parquet file:
// Spark shell (Scala): derive the column list from the Parquet file's own schema.
spark.read.parquet("/user/etl/destination/datafile1.dat").registerTempTable("mytable")
val df = spark.sql("describe mytable")                      // each row: col_name, data_type, comment
val columns = df.collect().map(row => row(0) + " " + row(1))
// Print the Hive create table statement:
println("CREATE EXTERNAL TABLE mytable")
println(s"  (${columns.mkString(", ")})")
println("STORED AS PARQUET")
println("LOCATION '/user/etl/destination';")   // LOCATION should point at the directory containing the Parquet file(s)