Spark 1.3.1 not pulling all hive metadata when executing query - header row of CSV datafile not ignored

New Contributor

I am using Spark 1.3.1 to create a Hive table from a CSV file whose first row is the header row. I have set the Hive table property to skip the header row:

TBLPROPERTIES ("skip.header.line.count"="1")

I verified with "show create table BOP" that the table property is set to skip the header row. However, when I execute "select count(*) from mytable", I get the correct count from HUE/beeswax/beeline, but the same query via Spark returns count+1 (i.e., it counts the header row as a data row). Why does Spark read the Hive metadata yet still not skip the header row?

1 ACCEPTED SOLUTION

7 REPLIES

Master Mentor

@Maleeha Qazi Have you tried with Spark 1.4.1 in the latest sandbox environment?

New Contributor

@Artem Ervits

Here are the steps to reproduce in the 1.4.1 sandbox; they were too long to fit in a comment. I am still seeing the issue. Thoughts?

STEPS:

$ pyspark

sqlContext.sql("create table names(name string, age int) row format delimited fields terminated by ',' stored as textfile TBLPROPERTIES('skip.header.line.count'='1')")
sqlContext.sql("LOAD DATA INPATH '/user/root/test.txt' overwrite into table names")
sqlContext.sql("Select count(*) from names").show()
3
sqlContext.sql("Select * from names").show()
+----+----+
|name| age|
+----+----+
|name|null|
| aaa|   1|
| bbb|   2|
+----+----+

hive> select count(*) from names;
2

INPUT FILE: "test.txt", with contents:

name,age
aaa,1
bbb,2
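
The `name | null` row in the Spark output above is consistent with the header line being parsed as ordinary data: `age` is declared `int`, and the text `age` cannot be parsed as an integer, so it comes back as NULL. A rough pure-Python imitation of that behavior follows. The `parse_rows` helper and its `skip_header_count` parameter are hypothetical names used only for illustration; this is not Spark or Hive code.

```python
def parse_rows(lines, skip_header_count=0):
    """Split comma-delimited lines into (name, age) tuples.

    Ages that are not valid integers become None, imitating a NULL
    for a value that does not match the declared column type.
    """
    rows = []
    for line in lines[skip_header_count:]:
        name, age = line.split(",")
        try:
            rows.append((name, int(age)))
        except ValueError:
            rows.append((name, None))
    return rows


lines = ["name,age", "aaa,1", "bbb,2"]

# Header treated as data, as in the Spark result above.
header_as_data = parse_rows(lines)
# Header skipped, as in the Hive result above.
header_skipped = parse_rows(lines, skip_header_count=1)

print(len(header_as_data), len(header_skipped))
```

With the three-line input file, the first call yields three rows (the first with a missing age) and the second yields two, matching the counts reported by Spark and Hive respectively.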

Master Mentor

@azeltov help

Can you please try using HiveContext and report back? @Maleeha Qazi

Super Collaborator

@vshukla I just recreated the scenario @Maleeha Qazi described, using a HiveContext in both pyspark and spark-shell on the sandbox with Spark 1.4.1. I am still getting the same erroneous output. I created the table through the HiveContext, and the output of show create table looks good in Hive. It looks like when spark-sql queries the table, it is not handling the header correctly: it does not respect the table property when querying.

Hive is handling the header just fine.

Contributor

It looks like this may be a bug: https://issues.apache.org/jira/browse/SPARK-11374
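
Until that is resolved, one possible workaround is to drop the header line yourself before the data reaches the query. The sketch below is only an illustration under stated assumptions: the `skip_header` helper is a hypothetical name, the file path in the comment is a placeholder, and the approach assumes a single input file whose first lines land in partition 0. With several files, each carrying its own header, it would not be enough.

```python
from itertools import islice


def skip_header(partition_index, lines, header_count=1):
    """Drop the first header_count lines, but only from partition 0."""
    if partition_index == 0:
        return islice(lines, header_count, None)
    return lines


# Assumed usage in pyspark, where sc is the SparkContext (not executed here):
#   rows = sc.textFile("/path/to/test.txt").mapPartitionsWithIndex(skip_header)
#   rows.count()

# Local check with plain Python iterators standing in for partitions.
sample = iter(["name,age", "aaa,1", "bbb,2"])
data_rows = list(skip_header(0, sample))
print(len(data_rows))
```

The function has the `(index, iterator)` shape that `RDD.mapPartitionsWithIndex` expects, so it can be tested locally without a cluster and then passed to Spark unchanged.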