@Artem Ervits Here are steps to reproduce in the 1.4.1 sandbox. Still getting the issue; this was too many characters for a comment. Thoughts?
STEPS:
$ pyspark
sqlContext.sql("create table names(name string, age int) row format delimited fields terminated by ',' stored as textfile TBLPROPERTIES('skip.header.line.count'='1')")
sqlContext.sql("LOAD DATA INPATH '/user/root/test.txt' overwrite into table names")
sqlContext.sql("Select count(*) from names").show()
sqlContext.sql("Select * from names").show()
| name| age |
| name| null|
| aaa| 1|
| bbb| 2|
hive> select count(*) from names;
INPUT FILE: "test.txt", with contents:
I am using Spark 1.3.1 to create a Hive table from a CSV file (in which the first row is the header row). I have set the Hive table property to skip the header row:
TBLPROPERTIES ("skip.header.line.count"="1")
I validated with a "show create table BOP" that the table property is set to ignore the header row. When I execute "select count(*) from mytable" from HUE/beeswax/beeline, I get the correct count, but when I execute the same query via Spark, I get count+1 (i.e. it counts the header row as a data row). Why is Spark reading the Hive metadata and still not ignoring the header row?
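If Spark does not apply skip.header.line.count in your version, one possible workaround is to drop the header yourself before counting. The sketch below is only an illustration, not a confirmed fix: the helper name drop_header is hypothetical, and it assumes the table is backed by a single text file whose header is the first line of the first partition. The sample rows mirror the name/age output shown in this thread.

```python
from itertools import islice


def drop_header(partition_index, rows, header_lines=1):
    """Skip the first `header_lines` rows, but only in partition 0.

    Assumption: the data comes from one text file, so the header can
    only appear at the start of the first partition.
    """
    if partition_index == 0:
        # islice lazily skips the leading rows without loading the partition
        return islice(rows, header_lines, None)
    return rows


# Local check with plain Python iterators (no Spark needed):
sample = ["name,age", "aaa,1", "bbb,2"]
print(list(drop_header(0, iter(sample))))

# In the pyspark shell (assumes `sc` is the shell's SparkContext and
# `path` points at the table's data file; both are placeholders here):
#   lines = sc.textFile(path).mapPartitionsWithIndex(drop_header)
#   lines.count()
```

With multiple input files, each carrying its own header, this approach would not be enough; filtering out rows equal to the header line may be a simpler option in that case.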