- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Does Parquet support a notion of defining and managing schemas externally?
- Labels:
-
HDFS
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
, in a similar way to Avro with avsc schema files which can be referenced in CREATE TABLE statements?
Thanks,
Martin
Created on ‎08-14-2016 11:35 AM - edited ‎08-14-2016 11:41 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The whole support around Parquet is documented at http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_parquet.html
Impala's support for Parquet is ahead of Hive at this moment, while https://issues.apache.org/jira/browse/HIVE-8950 will help it catch up in future. In Hive you will still need to manually specify a column, but you may alternatively create the table in Impala and use it then in Hive.
Parquet's loader in Pig supports reading the schema off the file [1] [2], as does Spark's Parquet support [3]. None of the eco system approaches use an external schema file as was the case with Avro storages.
[1] - https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/main/java/parquet/pig/ParquetLoade...
[2] - https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/test/java/parquet/pig/TestParquetL...
[3] - http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
Created ‎08-14-2016 03:24 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
there's no separate schema file concept in the Parquet storage
implementation today.
The LIKE 'FILE' feature is described further at
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet.html#parquet_ddl,
after which if you want to evolve the schema you can read on at
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet.html#parquet_schema_e...
Created ‎08-14-2016 06:07 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Harsh for confirming there is no external schema file concept in Parquet and for sharing the link for CREATE TABLE ... LIKE PARQUET ... syntax.
This seems to be specific to Impala however, is there a generic approach to use across a stack of tools including Spark, Pig, Hive as well as Impala (and with Spark and Pig not using HCatalog)?
Many thanks,
Martin
Created on ‎08-14-2016 11:35 AM - edited ‎08-14-2016 11:41 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The whole support around Parquet is documented at http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_parquet.html
Impala's support for Parquet is ahead of Hive at this moment, while https://issues.apache.org/jira/browse/HIVE-8950 will help it catch up in future. In Hive you will still need to manually specify a column, but you may alternatively create the table in Impala and use it then in Hive.
Parquet's loader in Pig supports reading the schema off the file [1] [2], as does Spark's Parquet support [3]. None of the eco system approaches use an external schema file as was the case with Avro storages.
[1] - https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/main/java/parquet/pig/ParquetLoade...
[2] - https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/test/java/parquet/pig/TestParquetL...
[3] - http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
