Hi. I have three Spark Streaming applications. One of them saves data to a Hive table (Parquet format). The other two read data from that table and cache it every hour, at the same moment. Both use the same code to read the data. Writing to the table and reading from it never happen at the same time. After a couple of hours, one of the readers seems to read only part of the data.
You can see below how it looks in the Storage tab. The first application reads all the data.
The second one omitted one partition.
Do you know what causes this issue?
The two reading applications share the same code, which caches the table:

    sqlContext.clearCache()
    df = sqlContext.sql('select timestamp, col2 from table where timestamp > time')
    df.cache()

This code is executed every hour.
The writing application inserts data like this:

    df.registerTempTable('temp')
    sqlContext.sql("INSERT INTO TABLE table SELECT * FROM temp")

The code above is also executed every hour.
I checked, and I am sure that both pieces of code finish properly.
Every hour, when I insert new data into the table, Hive creates a new partition for it, so every partition holds data for only one hour. When I read the table, I want the data from the last hour (the most recently added partition). The problem is that Spark Streaming does not seem to update the number of partitions in the table. If the desired data sits in a single partition, there is a chance Spark will not read it. Maybe I should change the way I insert data into Hive?
Spark version: 1.6
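One thing that might be worth trying, given the symptom that new partitions are not picked up: Spark SQL caches metadata for Hive Parquet tables, and in Spark 1.6 `HiveContext` exposes `refreshTable` to invalidate it. The sketch below assumes `sqlContext` is a `HiveContext`. The function name `reload_hourly_cache` and its parameters `table` and `cutoff` are hypothetical, introduced only for illustration. It is a guess at a possible cause, not a confirmed fix.

```python
def reload_hourly_cache(sql_context, table, cutoff):
    """Refresh table metadata, then re-read and cache the recent rows.

    Assumptions (not from the original post): `sql_context` is a
    HiveContext, and `cutoff` is a numeric timestamp.
    """
    # Drop everything cached so far, as the original code does.
    sql_context.clearCache()
    # Invalidate Spark's cached metadata for the table so that
    # partitions added since the last read can become visible.
    sql_context.refreshTable(table)
    query = "select timestamp, col2 from {0} where timestamp > {1}".format(
        table, cutoff)
    df = sql_context.sql(query)
    # cache() is lazy; the data is materialized on the next action.
    df.cache()
    return df
```

In the reading applications this would replace the three-line snippet, called once per hour as `reload_hourly_cache(sqlContext, 'table', time)`.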
Can you share the portions of your source code from all three applications that read from or write to the Hive table?
Can you clarify how you are seeing different results in your data query? Are you seeing different row counts, or is the problem the number of partitions cached in memory?
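To tell those two cases apart, a small check along these lines could be run in both reading applications right after caching. It is only a sketch: `describe_cached_frame` is a hypothetical helper name, and it assumes `df` is the DataFrame returned by `sqlContext.sql(...)`. Note that the partition count here refers to the partitions of the underlying RDD, which are not necessarily the same thing as Hive table partitions.

```python
def describe_cached_frame(df):
    """Return (row_count, rdd_partition_count) for a DataFrame.

    `df` is assumed to be a Spark DataFrame; this helper is illustrative
    and not part of the original applications.
    """
    # count() is an action, so it also materializes a cached DataFrame.
    row_count = df.count()
    # Number of partitions of the underlying RDD.
    rdd_partitions = df.rdd.getNumPartitions()
    return row_count, rdd_partitions
```

Logging the returned pair every hour in both applications would show whether the row counts actually diverge or only the number of cached partitions does.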