I am running HDP with Spark 1.4.1 and 1.6.1.
I have to process rapidly arriving XML from a Kafka topic with Spark Streaming. I am able to call print() and see that my data is indeed coming into Spark. The batches are currently set to 10 seconds. Now I need to know:
1) Is there a way to delimit each XML message?
2) How can I apply a JAXB-like schema function to each message? I have a process already doing this in plain Java, and it works fine using the standard Kafka APIs and JAXB.
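Since each Kafka record arrives in the stream as one complete string, a per-message schema step can be applied inside a map over the DStream. Below is a minimal sketch of just the parsing core in plain Java, using the JDK's built-in DOM parser where a JAXB Unmarshaller would slot in once generated schema classes are available; the class name XmlMessageParser and the sample XML are illustrative, not from the original post.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlMessageParser {

    // Parse one complete Kafka message payload and return its root tag.
    // With a generated schema class, a JAXB Unmarshaller.unmarshal(...)
    // call would replace this DOM parse.
    public static String rootTag(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        // Each Kafka record value is assumed to hold one whole XML document.
        System.out.println(rootTag("<order id=\"1\"><item>widget</item></order>"));
        // prints "order"
    }
}
```

In Spark the same method would be invoked per record, e.g. from a map over the message values, so each batch yields parsed objects rather than raw strings.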
Sample output, where I write the data with saveAsTextFiles(), shows broken messages: they appear to be split on whitespace, and large XML messages are spread across more than one file.
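One plausible cause of the broken-message symptom: saveAsTextFiles() writes records as text lines (one directory per batch, one part file per partition), so any newlines inside a pretty-printed XML payload split a single message across lines and files. A hedged workaround, assuming each record already holds one whole document, is to collapse the inter-tag whitespace so one message survives as one line (this helper is a sketch, not from the original post):

```java
public class XmlOneLiner {
    // Collapse whitespace runs between tags and trim the ends, so each
    // XML message stays a single text line in saveAsTextFiles() output.
    // Whitespace inside element text content is left untouched.
    public static String toSingleLine(String xml) {
        return xml.replaceAll(">\\s+<", "><").trim();
    }

    public static void main(String[] args) {
        String pretty = "<order>\n  <item>widget</item>\n</order>";
        System.out.println(toSingleLine(pretty));
        // prints "<order><item>widget</item></order>"
    }
}
```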
I have not worked on this issue since then, but I will resurrect it and try some things out.
First, you could start with the Databricks libraries. I did try one library from GitHub, but it was too difficult to work with; the schema I am using is quite complex. What schema do you have for your data? Some ideas I have heard of, but not tried, include pre-converting the XML to CSV or Avro before consuming it into Spark, and then using the Databricks CSV library (or another) to process it in the streaming portion. Let me know how you are ingesting the XML. I still need to do this at some point.
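The pre-conversion idea could be sketched in plain Java as a hypothetical flattener that pulls a fixed set of child-element values out of each message and emits a CSV row. The class name, field names, and sample XML are illustrative, and a real pipeline would also need CSV quoting/escaping:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XmlToCsv {

    // Flatten one XML message into a CSV row by extracting the text of
    // the named child elements. Missing elements become empty fields.
    public static String toCsvRow(String xml, String... fields) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        StringJoiner row = new StringJoiner(",");
        for (String field : fields) {
            NodeList nodes = doc.getElementsByTagName(field);
            row.add(nodes.getLength() > 0 ? nodes.item(0).getTextContent() : "");
        }
        return row.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><id>42</id><item>widget</item></order>";
        System.out.println(toCsvRow(xml, "id", "item", "price"));
        // prints "42,widget,"
    }
}
```

Rows produced this way could then be handed to a CSV-aware Spark library in the streaming stage instead of parsing XML there.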