Created on 12-28-2016 04:12 PM
PROBLEM:
1. Create a source external Hive table as below:
CREATE EXTERNAL TABLE `casesclosed`(
  `number` int,
  `manager` string,
  `owner` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://sumeshhdp/tmp/casesclosed'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'totalSize'='3693',
  'transient_lastDdlTime'='1478557456');

2. Create an ORC table with CTAS from the source table as below:

CREATE TABLE casesclosed_mod
STORED AS ORC tblproperties("orc.compress"="ZLIB", "orc.compress.size"="8192")
AS SELECT
  cast(number as int) as number,
  cast(manager as varchar(40)) as manager,
  cast(owner as varchar(40)) as owner
FROM casesclosed;

3. On creating Spark DataFrames against both the non-ORC (source) table and the ORC table, the column names of the ORC table cannot be listed:

scala> val df = sqlContext.table("default.casesclosed")
df: org.apache.spark.sql.DataFrame = [number: int, manager: string, owner: string]

scala> val df = sqlContext.table("default.casesclosed_mod")
16/11/07 22:41:48 INFO OrcRelation: Listing hdfs://sumeshhdp/apps/hive/warehouse/casesclosed_mod on driver
df: org.apache.spark.sql.DataFrame = [_col0: int, _col1: string, _col2: string]

TWO WORKAROUNDS:
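One possible workaround, shown only as a sketch (it assumes a spark-shell session with the same sqlContext, and that the table's files live at the warehouse path printed in the OrcRelation log above): since the ORC data itself reads fine and only the logical names are lost, load the files directly and re-apply the column names with toDF.

// Sketch of a workaround, not a definitive fix: bypass the metastore
// table and read the ORC files directly, then restore the logical
// column names positionally. Path assumed from the log line above.
val raw = sqlContext.read.format("orc")
  .load("hdfs://sumeshhdp/apps/hive/warehouse/casesclosed_mod")

// toDF renames the positional columns _col0, _col1, _col2 in order.
val df = raw.toDF("number", "manager", "owner")
df.printSchema()  // now reports number, manager, owner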
ROOT CAUSE:
The table "casesclosed_mod" is created with STORED AS ORC tblproperties("orc.compress"="ZLIB", "orc.compress.size"="8192"). Spark supports the ORC data source format internally and has its own logic for handling ORC, which differs from Hive's. Because of this bug, Spark cannot "understand" the format of the ORC files created by Hive. If the table is created in Hive without the STORED AS ORC tblproperties("orc.compress"="ZLIB", "orc.compress.size"="8192") clause, everything works fine.

In Hive:

hive> CREATE TABLE casesclosed_mod0007
    > AS
    > SELECT
    >   cast(number as int) as number,
    >   cast(manager as varchar(40)) as manager,
    >   cast(owner as varchar(40)) as owner
    > FROM casesclosed007;

In spark-shell:

scala> val df = sqlContext.table("casesclosed_mod0007")
df: org.apache.spark.sql.DataFrame = [number: int, manager: string, owner: string]

This is a known bug, tracked in the Apache JIRA: https://issues.apache.org/jira/browse/SPARK-16628
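The mismatch can be seen by inspecting the physical schema of the files that Hive's CTAS wrote: the Hive versions involved here record only positional names (_col0, _col1, ...) inside the ORC files, so Spark's own ORC reader has nothing but those to report. A minimal sketch, assuming the same warehouse path as above:

// Sketch: read the ORC files directly (no metastore involved) and
// print the physical schema. The positional names confirm that the
// logical column names were never written into the files themselves.
val physical = sqlContext.read.format("orc")
  .load("hdfs://sumeshhdp/apps/hive/warehouse/casesclosed_mod")
physical.printSchema()
// root
//  |-- _col0: integer (nullable = true)
//  |-- _col1: string (nullable = true)
//  |-- _col2: string (nullable = true)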