About mqazi

mqazi · ‎02-02-2016

@Artem Ervits Here are the steps to reproduce in the 1.4.1 sandbox. I am still getting the issue. (Posting as a reply because this was too many characters for a comment.) Thoughts?

STEPS:

    $ pyspark
    sqlContext.sql("create table names(name string, age int) row format delimited fields terminated by ',' stored as textfile TBLPROPERTIES('skip.header.line.count'='1')")
    sqlContext.sql("LOAD DATA INPATH '/user/root/test.txt' overwrite into table names")
    sqlContext.sql("Select count(*) from names").show()
    3
    sqlContext.sql("Select * from names").show()
    +-----+-----+
    | name| age |
    | name| null|
    | aaa| 1|
    | bbb| 2|

    hive> select count(*) from names;
    2

INPUT FILE: "test.txt", with contents:

    name,age
    aaa,1
    bbb,2
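The two counts above are consistent with Spark reading every line of the file while Hive skips the first one. Below is a minimal pure-Python sketch of that difference. It does not use Spark or Hive; `read_rows` and `to_int_or_none` are hypothetical helpers written only for illustration, and the assumption is that a non-numeric value in an int column comes back as NULL, which matches the `name | null` row shown above.

```python
# Contents of test.txt, as given in the post.
TEST_TXT = "name,age\naaa,1\nbbb,2\n"


def to_int_or_none(text):
    """Mimic a lenient int cast: a non-numeric string becomes None (NULL)."""
    try:
        return int(text)
    except ValueError:
        return None


def read_rows(text, skip_header_lines=0):
    """Parse comma-delimited lines into (name, age) tuples.

    skip_header_lines plays the role of 'skip.header.line.count'.
    """
    rows = []
    for line in text.splitlines()[skip_header_lines:]:
        name, age = line.split(",")
        rows.append((name, to_int_or_none(age)))
    return rows


# Reader that ignores the table property (what the Spark output looks like).
rows_ignoring_property = read_rows(TEST_TXT)
# Reader that honors the table property (what the Hive output looks like).
rows_honoring_property = read_rows(TEST_TXT, skip_header_lines=1)

print(len(rows_ignoring_property), rows_ignoring_property[0])
print(len(rows_honoring_property), rows_honoring_property[0])
```

The first reader returns three rows, with the header parsed as a data row whose age is NULL; the second returns the two real data rows.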

mqazi · ‎02-01-2016

I am using Spark 1.3.1 to create a Hive table from a CSV file in which the first row is the header row. I have set the Hive table property to skip the header row:

TBLPROPERTIES ("skip.header.line.count"="1")

I validated with "show create table BOP" that the table property is set to ignore the header row. When I execute "select count(*) from mytable" from HUE/beeswax/beeline, I get the correct count. When I execute the same query via Spark, I get count+1 (i.e. it counts the header row as a data row).

Why does Spark read the Hive metadata but still not ignore the header row?
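One possible workaround, if Spark does not apply the table property, is to read the raw file and drop the header line before querying. The sketch below is an assumption-laden illustration, not a confirmed fix: `drop_header` is a hypothetical helper, it assumes a single input file whose header sits at the start of partition 0, and the path in the usage comment is the pre-load location from the reproduction steps.

```python
from itertools import islice


def drop_header(partition_index, lines, header_lines=1):
    """Skip the first `header_lines` lines, but only in partition 0.

    Intended to be passed to RDD.mapPartitionsWithIndex, which calls it
    with (partition index, iterator over that partition's elements).
    """
    if partition_index == 0:
        return islice(lines, header_lines, None)
    return lines


# Hypothetical usage in the pyspark shell, where `sc` is the shell's
# SparkContext and the file has not yet been moved by LOAD DATA INPATH:
#   rdd = sc.textFile("/user/root/test.txt").mapPartitionsWithIndex(drop_header)
#   rdd.count()

# Local check without Spark: partition 0 loses its header, others are untouched.
print(list(drop_header(0, iter(["name,age", "aaa,1", "bbb,2"]))))
print(list(drop_header(1, iter(["ccc,3"]))))
```

With multiple input files, each file carries its own header, so this per-partition approach would not be enough; filtering out lines equal to the header text is an alternative in that case.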

Last Visited	‎02-04-2016 05:02 PM

Member Since	‎02-01-2016 11:13 PM
Posts	2
Kudos received	5

Cloudera Community

Spark 1.3.1 not pulling all hive metadata when exe...