02-02-2016
10:12 PM
1 Kudo
@Artem Ervits Here are the steps to reproduce in the 1.4.1 sandbox. I am still getting the issue. (This was too many characters for a comment.) Thoughts?

STEPS:
$ pyspark
sqlContext.sql("create table names(name string, age int) row format delimited fields terminated by ',' stored as textfile TBLPROPERTIES('skip.header.line.count'='1')")
sqlContext.sql("LOAD DATA INPATH '/user/root/test.txt' overwrite into table names")
sqlContext.sql("Select count(*) from names").show()
3
sqlContext.sql("Select * from names").show()
+-----+-----+
| name|  age|
| name| null|
|  aaa|    1|
|  bbb|    2|

hive> select count(*) from names;
2

INPUT FILE: "test.txt", with contents:
name,age
aaa,1
bbb,2
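
One possible workaround, sketched here under assumptions rather than confirmed as a fix: drop the header on the Spark side before the rows are counted. The helper names skip_header and parse are hypothetical, and the sketch assumes a single input file, so the header only appears at the start of the first partition. The local check below uses the contents of test.txt and does not need a cluster.

```python
from itertools import islice

# Mirrors 'skip.header.line.count'='1' from the table definition above.
HEADER_LINES = 1


def skip_header(partition_index, lines):
    """Drop the header lines from the first partition only.

    Assumes a single input file, so only partition 0 starts with a header.
    """
    if partition_index == 0:
        return islice(lines, HEADER_LINES, None)
    return lines


def parse(line):
    """Split one 'name,age' line into a (name, age) tuple."""
    name, age = line.split(",")
    return name, int(age)


# Local check with the contents of test.txt, treated as one partition.
sample = ["name,age", "aaa,1", "bbb,2"]
rows = [parse(line) for line in skip_header(0, iter(sample))]
print(len(rows))  # 2, the same count Hive reports
```

In the pyspark shell, the same function could be passed to sc.textFile(...).mapPartitionsWithIndex(skip_header) and followed by .map(parse). That would have to run against the raw file before LOAD DATA INPATH moves it into the table location, and it has not been verified on the sandbox used above.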
02-01-2016
11:14 PM
4 Kudos
I am using Spark 1.3.1 to create a Hive table from a CSV file whose first row is a header row. I have set the Hive table property to skip the header row:
TBLPROPERTIES ("skip.header.line.count"="1")
I validated with "show create table BOP" that the table property is set to ignore the header row. When I execute "select count(*) from mytable" from HUE/beeswax/beeline, I get the correct count, but when I execute the same query via Spark, the result is count+1 (i.e. it counts the header row as a data row). Why does Spark read the Hive metadata and still not ignore the header row?
Labels: Apache Hive, Apache Spark