I am new to the Hadoop environment and need some expert suggestions. I have a requirement to put data from a Kafka topic into a Hive table.
Here is how I designed it:
ConsumeKafka => EvaluateJsonPath (to retrieve the required attributes) => AttributesToJSON (to flatten the JSON) => MergeContent (to merge ~1000 JSON flow files into one file) => PutHDFS (with a Hive table on top of the target directory) => ReplaceText (to build the insert statement for the final table) =>
I am using MergeContent because I don't want every single record from the Kafka topic going into its own file in HDFS.
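Roughly, the MergeContent configuration is as follows; the values are from memory and everything not listed is left at its default:

```
Merge Strategy            : Bin-Packing Algorithm
Merge Format              : Binary Concatenation
Minimum Number of Entries : 1000
Maximum Number of Entries : 1000
```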
The MergeContent processor does concatenate the JSON flow files into one large file and put it to HDFS, but when I query the Hive table I get only the data from the first flow file. What am I doing wrong here?
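To illustrate with made-up field names: each flow file coming out of AttributesToJSON holds a single JSON record, and after MergeContent the file that lands in HDFS is those records concatenated back to back, along the lines of:

```
{"id":"1","name":"a"}{"id":"2","name":"b"}{"id":"3","name":"c"}
```

A select on the Hive table then returns only the first of those records.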
Here is my Hive table DDL, truncated to the serde/storage part:
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION XXXXXXXX
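For completeness, the full create statement is along these lines, with the table and column names replaced by placeholders:

```sql
-- table/column names are placeholders for the real ones
CREATE EXTERNAL TABLE kafka_json (
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'XXXXXXXX';
```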