I am using Spark streaming to consume Kafka data and write it into a Hive table on HDFS. The Hive table is partitioned by month/day, which I derive from a timestamp field in the Kafka data. I compute the month and day from it as below:
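import java.text.SimpleDateFormat
import java.util.Date

val sdf = new SimpleDateFormat("yyyy_MM")  // month partition value, e.g. 2018_09
val sdf1 = new SimpleDateFormat("dd")      // day partition value, e.g. 30
val adjustTime = data.getLong(12)          // timestamp read from field 12 of the record
val month = sdf.format(new Date(adjustTime))
val day = sdf1.format(new Date(adjustTime))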
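For reference, here is a minimal self-contained sketch of the same month/day derivation written with java.time instead of SimpleDateFormat. This is a swapped-in technique, not my original code: SimpleDateFormat is not thread-safe, and an instance shared between concurrent tasks can return corrupted values, but whether that is what happens in my job is only an assumption. The names `PartitionKeys` and `monthDay` are hypothetical, and the sketch assumes that adjustTime is epoch milliseconds and that the partitions should follow UTC.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object PartitionKeys {
  // Assumption: partitions follow UTC; change the zone if local time is wanted.
  private val zone = ZoneId.of("UTC")

  // DateTimeFormatter is immutable and thread-safe, so sharing these is fine.
  private val monthFmt = DateTimeFormatter.ofPattern("yyyy_MM").withZone(zone)
  private val dayFmt = DateTimeFormatter.ofPattern("dd").withZone(zone)

  // Assumption: epochMillis is milliseconds since 1970-01-01T00:00:00Z.
  // Returns the (month, day) pair used as partition values.
  def monthDay(epochMillis: Long): (String, String) = {
    val instant = Instant.ofEpochMilli(epochMillis)
    (monthFmt.format(instant), dayFmt.format(instant))
  }
}
```

With this, `val (month, day) = PartitionKeys.monthDay(adjustTime)` would replace the two `format` calls above.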
After parsing the data, I repartition by these two columns before writing:
repartition($"month",$"day").write.mode(SaveMode.Append).partitionBy("month","day")
When I checked the data in HDFS, I found two problems:
1. Some partitions have impossible values, such as month="2018_09" day="31", even though September has no 31st.
2. Some day partitions contain data that does not belong to that day. For example, the partition for 2018_09_30 should only hold records whose adjustTime is 2018-09-30, but it also contains records from other dates such as 2018_03_08 and ... So the data in that partition is not correct.
I would appreciate any suggestions or ideas for solving these problems. Thanks!