<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Create Hive table to read parquet files from parquet/avro schema in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98553#M11933</link>
    <description>&lt;P&gt;Thanks for this - it works for Parquet, but how does one do this for a table built from CSV? Let's say a CSV schema changes; I want to be able to use Avro schema evolution to create the table.&lt;/P&gt;&lt;P&gt;I tried the same CREATE statement, but using STORED AS TEXTFILE and ROW FORMAT DELIMITED etc.; I end up getting null values.&lt;/P&gt;</description>
    <pubDate>Fri, 02 Dec 2016 12:32:24 GMT</pubDate>
    <dc:creator>jastang</dc:creator>
    <dc:date>2016-12-02T12:32:24Z</dc:date>
    <item>
      <title>Create Hive table to read parquet files from parquet/avro schema</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98544#M11924</link>
      <description>&lt;P&gt;Hello experts!&lt;/P&gt;&lt;P&gt;We are looking for a way to create an external Hive table that reads data from Parquet files according to a Parquet/Avro schema.&lt;/P&gt;&lt;P&gt;
In other words, how can we generate a Hive table from a Parquet/Avro schema?&lt;/P&gt;&lt;P&gt;
thanks &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 10 Dec 2015 21:02:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98544#M11924</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2015-12-10T21:02:11Z</dc:date>
    </item>
    <item>
      <title>Re: Create Hive table to read parquet files from parquet/avro schema</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98545#M11925</link>
      <description>&lt;P&gt;
	&lt;STRONG&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;
	&lt;STRONG&gt;Avro &lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;CREATE EXTERNAL TABLE table_avro
	ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
	STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
	OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
	LOCATION '/user/table/table_avro'
	TBLPROPERTIES ('avro.schema.url'='hdfs://user/schemas/table_avro.avsc');
&lt;/PRE&gt;
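&lt;P&gt;For Parquet there is no schema-file table property analogous to avro.schema.url, so the columns have to be listed explicitly. As a sketch (the table and column names here are just placeholders):&lt;/P&gt;
&lt;PRE&gt;CREATE EXTERNAL TABLE table_parquet (
	id bigint,
	name string)
	STORED AS PARQUET
	LOCATION '/user/table/table_parquet';
&lt;/PRE&gt;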
&lt;P&gt;
	&lt;STRONG&gt;From Hive 0.14 and Up (Easier DDL)&lt;/STRONG&gt;
	&lt;/P&gt;&lt;PRE&gt;CREATE TABLE kst (
    string1 string,
    string2 string,
    int1 int,
    boolean1 boolean,
    long1 bigint,
    float1 float,
    double1 double,
    inner_record1 struct&lt;int_in_inner_record1:int,string_in_inner_record1:string&gt;,
    enum1 string,
    array1 array&lt;string&gt;,
    map1 map&lt;string,int&gt;,
    union1 uniontype&lt;float,boolean,string&gt;,
    fixed1 binary,
    null1 void,
    unionnullint int,
    bytes1 binary)
  PARTITIONED BY (ds string)
  STORED AS AVRO;
	&lt;/PRE&gt;&lt;P&gt;
See the &lt;A href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual"&gt;Apache Hive Language Manual&lt;/A&gt; for more examples of &lt;A href="https://cwiki.apache.org/confluence/display/Hive/AvroSerDe"&gt;Avro&lt;/A&gt; and &lt;A href="https://cwiki.apache.org/confluence/display/Hive/Parquet"&gt;Parquet&lt;/A&gt; DDL.&lt;/P&gt;&lt;P&gt;However, to get the full performance benefits of Hive with cost-based optimization and vectorization, you should consider storing your Hive tables in the &lt;A href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC"&gt;ORC&lt;/A&gt; format.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Dec 2015 23:29:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98545#M11925</guid>
      <dc:creator>amcbarnett</dc:creator>
      <dc:date>2015-12-10T23:29:42Z</dc:date>
    </item>
    <item>
      <title>Re: Create Hive table to read parquet files from parquet/avro schema</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98546#M11926</link>
      <description>&lt;P&gt;Thanks for your answer. Actually, this is what I'm trying to do: I already have Parquet files, and I want to dynamically create an external Hive table that reads from the Parquet files (not Avro ones), according to either an Avro or a Parquet schema (the Parquet was generated from Avro).&lt;/P&gt;</description>
      <pubDate>Fri, 11 Dec 2015 01:01:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98546#M11926</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2015-12-11T01:01:29Z</dc:date>
    </item>
    <item>
      <title>Re: Create Hive table to read parquet files from parquet/avro schema</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98547#M11927</link>
      <description>&lt;P&gt;The work to generically create a table by reading its schema from ORC, Parquet, and Avro files is tracked in &lt;A target="_blank" href="https://issues.apache.org/jira/browse/HIVE-10593"&gt;HIVE-10593&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Dec 2015 01:43:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98547#M11927</guid>
      <dc:creator>jpp</dc:creator>
      <dc:date>2015-12-11T01:43:31Z</dc:date>
    </item>
    <item>
      <title>Re: Create Hive table to read parquet files from parquet/avro schema</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98548#M11928</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1227/gtmehdi.html" nodeid="1227"&gt;@Mehdi TAZI&lt;/A&gt; can you accept the best answer to close this thread?&lt;/P&gt;</description>
      <pubDate>Tue, 02 Feb 2016 09:51:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98548#M11928</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-02T09:51:29Z</dc:date>
    </item>
    <item>
      <title>Re: Create Hive table to read parquet files from parquet/avro schema</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98549#M11929</link>
      <description>&lt;P&gt;Actually, there is no answer to my question yet; I'll publish the answer soon and accept it.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Feb 2016 23:04:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98549#M11929</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-02-02T23:04:38Z</dc:date>
    </item>
    <item>
      <title>Re: Create Hive table to read parquet files from parquet/avro schema</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98550#M11930</link>
      <description>&lt;P&gt;The solution is to dynamically create a table from the Avro schema, and then create a new Parquet-format table from the Avro one.&lt;/P&gt;&lt;P&gt;Here is the HiveQL that does it; hope this helps:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc'); 
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';
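-- Note: CREATE TABLE ... LIKE copies avro_test's schema at creation time only;
-- parquet_test will not track later changes to the Avro schema.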
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 02 Feb 2016 23:08:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98550#M11930</guid>
      <dc:creator>TAZIMehdi</dc:creator>
      <dc:date>2016-02-02T23:08:11Z</dc:date>
    </item>
    <item>
      <title>Re: Create Hive table to read parquet files from parquet/avro schema</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98551#M11931</link>
      <description>&lt;P&gt;I am also facing the same problem as you. I tried your solution, but my Parquet table is not refreshed if I modify the Avro schema. The avro_test table picks up any schema changes, but parquet_test does not. I dropped and created it again, but the changes are still not reflected. Any idea?&lt;/P&gt;</description>
      <pubDate>Wed, 17 Feb 2016 13:59:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98551#M11931</guid>
      <dc:creator>adhumal</dc:creator>
      <dc:date>2016-02-17T13:59:22Z</dc:date>
    </item>
    <item>
      <title>Re: Create Hive table to read parquet files from parquet/avro schema</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98552#M11932</link>
      <description>&lt;P&gt;As &lt;A rel="user" href="https://community.cloudera.com/users/152/jp.html" nodeid="152"&gt;@Jean-Philippe Player&lt;/A&gt; mentions, creating a table by reading the schema from a Parquet file is not yet supported by Hive. Source: &lt;A href="http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html" target="_blank"&gt;http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html&lt;/A&gt;. You can, however, do it in Impala:&lt;/P&gt;&lt;PRE&gt;# Using Impala:
CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat'
  STORED AS PARQUET
  LOCATION '/user/etl/destination';
&lt;/PRE&gt;&lt;P&gt;With some Spark/Scala code you can generate the CREATE TABLE statement from a Parquet file:&lt;/P&gt;&lt;PRE&gt;spark.read.parquet("/user/etl/destination/datafile1.dat").registerTempTable("mytable")
val df = sqlContext.sql("describe mytable")
// "colname (space) data-type"
val columns = df.map(row =&amp;gt; row(0) + " " + row(1)).collect()

// Print the Hive create table statement:
println("CREATE EXTERNAL TABLE mytable")
println(s"  (${columns.mkString(", ")})")
println("STORED AS PARQUET ")
println("LOCATION '/user/etl/destination';")
&lt;/PRE&gt;</description>
      <pubDate>Wed, 07 Sep 2016 17:14:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98552#M11932</guid>
      <dc:creator>MrBee</dc:creator>
      <dc:date>2016-09-07T17:14:17Z</dc:date>
    </item>
    <item>
      <title>Re: Create Hive table to read parquet files from parquet/avro schema</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98553#M11933</link>
      <description>&lt;P&gt;Thanks for this - it works for Parquet, but how does one do this for a table built from CSV? Let's say a CSV schema changes; I want to be able to use Avro schema evolution to create the table.&lt;/P&gt;&lt;P&gt;I tried the same CREATE statement, but using STORED AS TEXTFILE and ROW FORMAT DELIMITED etc.; I end up getting null values.&lt;/P&gt;</description>
      <pubDate>Fri, 02 Dec 2016 12:32:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/98553#M11933</guid>
      <dc:creator>jastang</dc:creator>
      <dc:date>2016-12-02T12:32:24Z</dc:date>
    </item>
    <item>
      <title>Re: Create Hive table to read parquet files from parquet/avro schema</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/290134#M11934</link>
      <description>&lt;P&gt;With newer versions of Spark, the sqlContext is not loaded by default; you have to create it explicitly:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;scala&amp;gt; val sqlContext = new org.apache.spark.sql.SQLContext(sc)&lt;BR /&gt;warning: there was one deprecation warning; re-run with -deprecation for details&lt;BR /&gt;sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@6179af64&lt;BR /&gt;&lt;BR /&gt;scala&amp;gt; import sqlContext.implicits._&lt;BR /&gt;import sqlContext.implicits._&lt;BR /&gt;&lt;BR /&gt;scala&amp;gt; sqlContext.sql("describe mytable")&lt;BR /&gt;res2: org.apache.spark.sql.DataFrame = [col_name: string, data_type: string ... 1 more field]&lt;BR /&gt;&lt;BR /&gt;&lt;/PRE&gt;&lt;P&gt;I'm working with Spark 2.3.2.&lt;/P&gt;</description>
      <pubDate>Thu, 20 Feb 2020 06:49:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Create-Hive-table-to-read-parquet-files-from-parquet-avro/m-p/290134#M11934</guid>
      <dc:creator>obrobecker</dc:creator>
      <dc:date>2020-02-20T06:49:10Z</dc:date>
    </item>
  </channel>
</rss>

