Spark 1.3.1 not pulling all hive metadata when executing query - header row of CSV datafile not ignored
Labels: Apache Hive, Apache Spark
Created ‎02-01-2016 11:14 PM
I am using Spark 1.3.1 to create a Hive table from a CSV file whose first row is a header row. I have set the Hive table property to skip the header row:
TBLPROPERTIES ("skip.header.line.count"="1")
I confirmed with "show create table BOP" that the table property is set to skip the header row. When I execute "select count(*) from mytable" from HUE/beeswax/beeline I get the correct count, but the same query via Spark returns count+1 (i.e. it counts the header row as a data row). Why does Spark read the Hive metadata yet still not skip the header row?
Created ‎02-06-2016 12:29 AM
It looks like this may be a bug: https://issues.apache.org/jira/browse/SPARK-11374
Created ‎02-01-2016 11:18 PM
@Maleeha Qazi have you tried with Spark 1.4.1 in the latest sandbox environment?
Created ‎02-02-2016 10:12 PM
Here are steps to reproduce in the 1.4.1 sandbox. I am still getting the issue. (Posted as a reply because it was too many characters for a comment.) Thoughts?
STEPS:
$ pyspark
sqlContext.sql("create table names(name string, age int) row format delimited fields terminated by ',' stored as textfile TBLPROPERTIES('skip.header.line.count'='1')")
sqlContext.sql("LOAD DATA INPATH '/user/root/test.txt' overwrite into table names")
sqlContext.sql("Select count(*) from names").show()
3
sqlContext.sql("Select * from names").show()
+-----+-----+
| name|  age|
| name| null|
|  aaa|    1|
|  bbb|    2|
hive> select count(*) from names;
2
INPUT FILE: "test.txt", with contents:
name,age
aaa,1
bbb,2
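To make the discrepancy concrete, here is a minimal plain-Python sketch (no Spark or Hive involved) that counts the rows of the three-line test.txt above with and without skipping the header. The helper name `count_rows` and its `skip_header_line_count` parameter are illustrative assumptions that only mimic the meaning of the table property; they are not part of any Hive or Spark API.

```python
import csv
import io

# Contents of test.txt from the reproduction steps: one header line, two data rows.
TEST_TXT = "name,age\naaa,1\nbbb,2\n"


def count_rows(text, skip_header_line_count=0):
    """Count rows after dropping the first N lines (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text)))
    return len(rows[skip_header_line_count:])


# Header skipped: matches the count Hive reported.
print(count_rows(TEST_TXT, skip_header_line_count=1))  # 2
# Header not skipped: matches the count Spark reported.
print(count_rows(TEST_TXT))  # 3
```

The extra row in the Spark result is exactly the header line, which is consistent with the "name / null" row in the select output (the string "age" cannot be read as an int).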
Created ‎02-02-2016 10:22 PM
@azeltov help
Created ‎02-03-2016 02:17 AM
@vshukla ??
Created ‎02-03-2016 02:27 AM
Can you please try using HiveContext and report back? @Maleeha Qazi
Created ‎02-03-2016 05:48 AM
@vshukla I just recreated the scenario @Maleeha Qazi described, using a HiveContext in both pyspark and spark-shell, on the sandbox with Spark 1.4.1. I am still getting the same erroneous output. I created the table using the HiveContext, and the show create table output looks correct in Hive. It looks like spark-sql does not respect the skip.header.line.count table property when it queries the table, so the header row is returned as data.
Hive is handling the header just fine.
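One possible workaround, not suggested by anyone in this thread and offered only as a sketch, is to read the file as text in Spark and drop the header yourself. The function below is written so it could be passed to `RDD.mapPartitionsWithIndex`, which calls it with a partition index and an iterator of rows. It assumes a single input file whose header is the first line of partition 0; with several files that each carry a header, it would only remove the first one. The function name `drop_header` and the path in the comment are hypothetical.

```python
from itertools import islice


def drop_header(partition_index, rows, header_lines=1):
    """Skip the first header_lines rows, but only in partition 0 (assumption:
    one input file, so the header can only appear at the start of partition 0)."""
    if partition_index == 0:
        return islice(rows, header_lines, None)
    return rows


# Hypothetical use inside pyspark (not executed here; the path is a placeholder):
#   lines = sc.textFile("/path/to/test.txt")
#   data = lines.mapPartitionsWithIndex(drop_header)
#   data.count()

# Local check, with plain iterators standing in for Spark partitions.
print(list(drop_header(0, iter(["name,age", "aaa,1", "bbb,2"]))))  # ['aaa,1', 'bbb,2']
print(list(drop_header(1, iter(["ccc,3"]))))  # ['ccc,3']
```

Filtering in the query itself (for example, excluding rows where name equals the literal header text) is another option, but it would also drop any genuine data row with that value.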
