PROBLEM: 1. Create a source external Hive table as below:
CREATE EXTERNAL TABLE `casesclosed`(
  `number` int,
  `manager` string,
  `owner` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://sumeshhdp/tmp/casesclosed'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'totalSize'='3693',
  'transient_lastDdlTime'='1478557456');
2. Create an ORC table with CTAS from the source table as below:
CREATE TABLE casesclosed_mod
STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB", "orc.compress.size"="8192")
AS
SELECT
  cast(number as int) as number,
  cast(manager as varchar(40)) as manager,
  cast(owner as varchar(40)) as owner
FROM casesclosed;
3. Create a Spark DataFrame against each table: the column names resolve correctly for the non-ORC (source) table, but the ORC table's columns come back as _col0, _col1, _col2:
scala> val df = sqlContext.table("default.casesclosed")
df: org.apache.spark.sql.DataFrame = [number: int, manager: string, owner: string]
scala> val df = sqlContext.table("default.casesclosed_mod")
16/11/07 22:41:48 INFO OrcRelation: Listing hdfs://sumeshhdp/apps/hive/warehouse/casesclosed_mod on driver
df: org.apache.spark.sql.DataFrame = [_col0: int, _col1: string, _col2: string]
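As a quick sanity check (illustrative, using the same table as above), DESCRIBE through the metastore still returns the real column names, which narrows the loss to Spark's ORC read path:

scala> sqlContext.sql("DESCRIBE default.casesclosed_mod").show()
// The metastore reports number / manager / owner for the very same table.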

TWO WORKAROUNDS (both sketched below):

  • Use Spark to create the tables instead of Hive.
  • Set: sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")
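In spark-shell, a sketch of both workarounds (the table name "casesclosed_mod_spark" is illustrative, not from the original repro):

// Workaround 1: let Spark write the ORC table, so the real column names
// land in the ORC file footers.
scala> sqlContext.table("default.casesclosed").write.format("orc").saveAsTable("casesclosed_mod_spark")

// Workaround 2: disable Spark's native ORC conversion; the table is then
// read through Hive's SerDe, which takes the schema from the metastore.
scala> sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")
scala> val df = sqlContext.table("default.casesclosed_mod")
scala> df.printSchema()   // number / manager / owner instead of _col0 / _col1 / _col2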
ROOT CAUSE: The table "casesclosed_mod" is created with "STORED AS ORC tblproperties("orc.compress"="ZLIB", "orc.compress.size"="8192")". Spark supports ORC as a native data source and has its own logic for reading it, which differs from Hive's: it resolves column names from the schema stored in the ORC file footers. ORC files written by Hive's CTAS record placeholder field names (_col0, _col1, ...) in the footer, so Spark cannot recover the real column names from files created by Hive.
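The placeholder names can be seen by reading the warehouse directory directly (path taken from the log line above; this step is illustrative):

scala> val raw = sqlContext.read.format("orc").load("hdfs://sumeshhdp/apps/hive/warehouse/casesclosed_mod")
scala> raw.printSchema()   // _col0 / _col1 / _col2: the field names in the ORC footers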

In Hive, if the table "casesclosed_mod" is created without "STORED AS ORC tblproperties("orc.compress"="ZLIB", "orc.compress.size"="8192")", everything works fine.

In Hive:

hive> CREATE TABLE casesclosed_mod0007
    > AS
    > SELECT
    > cast(number as int) as number,
    > cast(manager as varchar(40)) as manager,
    > cast(owner as varchar(40)) as owner
    > FROM casesclosed007;

In Spark-shell:
scala> val df = sqlContext.table("casesclosed_mod0007") ;
df: org.apache.spark.sql.DataFrame = [number: int, manager: string, owner: string]
This is a known bug, tracked in Apache JIRA as SPARK-16628: https://issues.apache.org/jira/browse/SPARK-16628