I am using Spark 1.3.1 to create a Hive table from a CSV file (in which the first row is the header row). I have set the `skip.header.line.count` table property so that Hive skips the header row.
I validated with a "show create table BOP" that the table property is set to ignore the header row. When I execute "select count(*) from mytable" I get the correct count from Hue/Beeswax/Beeline, but if I execute the same query via Spark I get count+1 (i.e. it counts the header row as a data row). Why does Spark read the Hive metadata but still not ignore the header row?
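To illustrate the discrepancy without a cluster, here is a minimal pure-Python sketch (the `test.txt` contents are taken from the reproduction below). It assumes the root cause is that Spark's text-file scan returns every line of the file as data, while Hive applies `skip.header.line.count` and drops the first line:

```python
import csv
import io

# Simulated contents of test.txt from the reproduction steps.
data = "name,age\naaa,1\nbbb,2\n"

rows = list(csv.reader(io.StringIO(data)))

# Spark (1.x) reads every line as a data row, so the header is counted.
spark_like_count = len(rows)

# Hive honours skip.header.line.count=1 and drops the first line.
hive_like_count = len(rows[1:])

print(spark_like_count, hive_like_count)
```

This reproduces the off-by-one: 3 rows seen by Spark versus 2 by Hive.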
Here are the steps to reproduce on a 1.4.1 sandbox. I'm still getting the issue (too many characters to post this as a comment). Thoughts?
sqlContext.sql("create table names(name string, age int) row format delimited fields terminated by ',' stored as textfile TBLPROPERTIES('skip.header.line.count'='1')")
sqlContext.sql("LOAD DATA INPATH '/user/root/test.txt' overwrite into table names")
sqlContext.sql("Select count(*) from names").show()
3

sqlContext.sql("Select * from names").show()
+-----+-----+
| name| age |
+-----+-----+
| name| null|
|  aaa|    1|
|  bbb|    2|
+-----+-----+

hive> select count(*) from names;
2
INPUT FILE: "test.txt", with contents:
name,age
aaa,1
bbb,2
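Since Spark 1.x does not honour the table property, one workaround is to strip the header from the file before running LOAD DATA. A hedged sketch of that pre-processing step (the `strip_header` helper is hypothetical, not part of any Spark or Hive API):

```python
def strip_header(lines, skip=1):
    """Drop the first `skip` lines (the header) from an iterable of CSV lines."""
    it = iter(lines)
    for _ in range(skip):
        next(it, None)  # tolerate files shorter than `skip`
    return list(it)

# Contents of test.txt from the reproduction above.
data = ["name,age", "aaa,1", "bbb,2"]
print(strip_header(data))  # -> ['aaa,1', 'bbb,2']
```

Alternatively, the header row could be filtered out in the query itself, but cleaning the input file once avoids repeating that filter everywhere.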
@vshukla I just recreated the scenario @Maleeha Qazi mentioned, using a HiveContext in both pyspark and spark-shell on the sandbox with Spark 1.4.1, and I'm still getting the same erroneous output. I created the table using the HiveContext, and the "show create table" output looks correct in Hive. It appears that when Spark SQL queries the table it doesn't handle the header correctly: the table property isn't respected at query time.
Hive is handling the header just fine.