PROBLEM: 1. Create a source external Hive table as below:
CREATE EXTERNAL TABLE `casesclosed`(
  `number` int,
  `manager` string,
  `owner` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://sumeshhdp/tmp/casesclosed'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'totalSize'='3693',
  'transient_lastDdlTime'='1478557456');
2. Create an ORC table with CTAS from the source table as below:
CREATE TABLE casesclosed_mod
STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB", "orc.compress.size"="8192")
AS
SELECT
  cast(number as int) as number,
  cast(manager as varchar(40)) as manager,
  cast(owner as varchar(40)) as owner
FROM casesclosed;
3. Create a Spark DataFrame against each table: the column names resolve correctly for the non-ORC (source) table, but the ORC table's columns come back as _col0, _col1, _col2:
scala> val df = sqlContext.table("default.casesclosed")
df: org.apache.spark.sql.DataFrame = [number: int, manager: string, owner: string]
scala> val df = sqlContext.table("default.casesclosed_mod")
16/11/07 22:41:48 INFO OrcRelation: Listing hdfs://sumeshhdp/apps/hive/warehouse/casesclosed_mod on driver
df: org.apache.spark.sql.DataFrame = [_col0: int, _col1: string, _col2: string]
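As a quick sanity check (illustrative, using the same table as above), DESCRIBE through the metastore still returns the real column names, which narrows the loss to Spark's ORC read path:

scala> sqlContext.sql("DESCRIBE default.casesclosed_mod").show()
// The metastore reports number / manager / owner for the very same table.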

TWO WORKAROUNDS (both sketched below):

  • Use Spark to create the tables instead of Hive.
  • Set: sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")
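In spark-shell, a sketch of both workarounds (the table name "casesclosed_mod_spark" is illustrative, not from the original repro):

// Workaround 1: let Spark write the ORC table, so the real column names
// land in the ORC file footers.
scala> sqlContext.table("default.casesclosed").write.format("orc").saveAsTable("casesclosed_mod_spark")

// Workaround 2: disable Spark's native ORC conversion; the table is then
// read through Hive's SerDe, which takes the schema from the metastore.
scala> sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")
scala> val df = sqlContext.table("default.casesclosed_mod")
scala> df.printSchema()   // number / manager / owner instead of _col0 / _col1 / _col2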
ROOT CAUSE: The table "casesclosed_mod" is created with "STORED AS ORC tblproperties("orc.compress"="ZLIB", "orc.compress.size"="8192")". Spark supports ORC as a native data source and has its own logic for reading it, which differs from Hive's: it resolves column names from the schema stored in the ORC file footers. ORC files written by Hive's CTAS record placeholder field names (_col0, _col1, ...) in the footer, so Spark cannot recover the real column names from files created by Hive.
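The placeholder names can be seen by reading the warehouse directory directly (path taken from the log line above; this step is illustrative):

scala> val raw = sqlContext.read.format("orc").load("hdfs://sumeshhdp/apps/hive/warehouse/casesclosed_mod")
scala> raw.printSchema()   // _col0 / _col1 / _col2: the field names in the ORC footers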

In Hive, if the table "casesclosed_mod" is created without "STORED AS ORC tblproperties("orc.compress"="ZLIB", "orc.compress.size"="8192")", everything works fine.

In Hive:

hive> CREATE TABLE casesclosed_mod0007
    > AS
    > SELECT
    > cast(number as int) as number,
    > cast(manager as varchar(40)) as manager,
    > cast(owner as varchar(40)) as owner
    > FROM casesclosed007;

In Spark-shell:
scala> val df = sqlContext.table("casesclosed_mod0007") ;
df: org.apache.spark.sql.DataFrame = [number: int, manager: string, owner: string]
This is a known bug, tracked in Apache JIRA as SPARK-16628: https://issues.apache.org/jira/browse/SPARK-16628