Support Questions

mqazi · ‎02-01-2016

I am using Spark 1.3.1 to create a Hive table from a CSV file whose first row is a header row. I have set the Hive table property to skip the header row:

TBLPROPERTIES ("skip.header.line.count"="1")

I validated with "show create table BOP" that the table property is set to ignore the header row. When I execute "select count(*) from mytable" from HUE/beeswax/beeline I get the correct count, but when I execute the same query via Spark I get a result that is count+1 (i.e. it counts the header row as a data row). Why is Spark reading the Hive metadata and still not ignoring the header row?

aervits · ‎02-01-2016

@Maleeha Qazi have you tried with Spark 1.4.1 in the latest sandbox environment?

mqazi · ‎02-02-2016

@Artem Ervits

Here are the steps to reproduce in the 1.4.1 sandbox. I am still getting the issue. (Posting as a separate reply because it was too many characters for a comment.) Thoughts?

STEPS:

$pyspark

sqlContext.sql("create table names(name string, age int) row format delimited fields terminated by ',' stored as textfile TBLPROPERTIES('skip.header.line.count'='1')")

sqlContext.sql("LOAD DATA INPATH '/user/root/test.txt' overwrite into table names")

sqlContext.sql("Select count(*) from names").show()
3

sqlContext.sql("Select * from names").show()
+-----+-----+
| name|  age|
+-----+-----+
| name| null|
|  aaa|    1|
|  bbb|    2|
+-----+-----+

hive> select count(*) from names;
2

INPUT FILE: "test.txt", with contents:

name,age
aaa,1
bbb,2
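
To make the mismatch concrete, here is a minimal pure-Python sketch of the two behaviors on the test.txt contents above. It is not Spark or Hive code: the helper names parse_row and read_table are made up for illustration, and it assumes that a value which cannot be parsed as an int comes back as null, which is consistent with the "name | null" row in the Spark output.

```python
# Illustrative model only: these helpers are hypothetical, not Spark or Hive APIs.
TEST_TXT = "name,age\naaa,1\nbbb,2\n"

def parse_row(line):
    """Split a comma-delimited line into (name, age); age is None if not an int."""
    name, raw_age = line.split(",")
    try:
        age = int(raw_age)
    except ValueError:
        age = None  # assumption: an unparseable int column is read as null
    return (name, age)

def read_table(text, skip_header_line_count=0):
    """Parse every line after the first `skip_header_line_count` lines."""
    lines = text.splitlines()
    return [parse_row(line) for line in lines[skip_header_line_count:]]

# Behavior seen from Hive: the table property is honored.
hive_rows = read_table(TEST_TXT, skip_header_line_count=1)
# Behavior seen from Spark in this thread: the property is ignored.
spark_rows = read_table(TEST_TXT)

print(len(hive_rows))   # 2
print(len(spark_rows))  # 3
print(spark_rows[0])    # ('name', None)
```

The extra row in the Spark count is just the header line parsed as data, which is why its age column shows up as null.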

aervits · ‎02-02-2016

@azeltov help

aervits · ‎02-03-2016

@vshukla ??

vshukla · ‎02-03-2016

Can you please try using HiveContext and report back? @Maleeha Qazi

jwiden · ‎02-03-2016

@vshukla I just recreated the scenario @Maleeha Qazi mentioned, using a HiveContext in both pyspark and spark-shell on the sandbox with Spark 1.4.1. I am still getting the same erroneous output that was mentioned. I created the table using the HiveContext, and the show create table output looks good in Hive. It looks like when spark-sql queries the table, it is not handling the header correctly, i.e. it is not respecting the table property when querying.

Hive is handling the header just fine.

jmeyer · ‎02-06-2016

It looks like this may be a bug: https://issues.apache.org/jira/browse/SPARK-11374
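
In the meantime, one possible workaround is to drop the leaked header row on the Spark side. This is only a rough sketch, assuming the header always shows up as a row whose name column is the literal string "name" and whose age is null, as in the output earlier in the thread. The helper is_leaked_header is a made-up name, not a Spark API, and the rows are plain tuples standing in for what Spark returned.

```python
def is_leaked_header(row):
    """True if a (name, age) row looks like the CSV header read as data."""
    name, age = row
    return name == "name" and age is None

# Rows as Spark returned them in this thread (header included).
rows = [("name", None), ("aaa", 1), ("bbb", 2)]

data_rows = [r for r in rows if not is_leaked_header(r)]
print(len(data_rows))  # 2
```

On a DataFrame the same predicate could probably be expressed as a filter, e.g. df.filter("NOT (name = 'name' AND age IS NULL)"), though I have not verified that against 1.3.1 or 1.4.1. Note that it would also drop a genuine data row that happens to have name "name" and a null age.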

Spark 1.3.1 not pulling all hive metadata when executing query - header row of CSV datafile not ignored
