kafka to spark streaming with XML


Explorer

I run HDP with Spark 1.4.1 and 1.6.1.

I have to process rapidly arriving XML from a Kafka topic with Spark Streaming. I can call .print() and see that my data is indeed coming into Spark, batched at 10-second intervals. Now I need to know:

1) Is there a way to delimit each XML message?

2) How can I apply a JAXB-like schema function to each message? I already have a process doing this in plain Java, and it works fine using the standard Kafka APIs and JAXB.

Sample output, where I write the data with saveAsTextFiles(), shows broken messages: they seem to be split on spaces, and large XML messages are spread across more than one file.
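On question 2, here is a minimal sketch of applying JAXB per message. It assumes each Kafka record value carries one complete XML document; the Order class and its id/item fields are purely hypothetical stand-ins for your real schema. In the streaming job you would call a helper like parse() inside map() or mapPartitions() on the DStream, ideally building the JAXBContext once per partition since it is expensive to create.

```java
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.StringReader;

public class JaxbPerMessage {

    // Hypothetical message schema: substitute your real JAXB-annotated classes.
    @XmlRootElement(name = "order")
    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Order {
        public String id;
        public String item;
    }

    // Unmarshal one complete XML message into an Order.
    // In a DStream, call this inside mapPartitions() and reuse one
    // JAXBContext per partition rather than building it per record.
    public static Order parse(String xml) throws Exception {
        Unmarshaller u = JAXBContext.newInstance(Order.class).createUnmarshaller();
        return (Order) u.unmarshal(new StringReader(xml));
    }

    public static void main(String[] args) throws Exception {
        Order o = parse("<order><id>42</id><item>widget</item></order>");
        System.out.println(o.id + "," + o.item);  // prints 42,widget
    }
}
```

Note that this only works if each Kafka record is a whole XML document; if your producer splits large messages across records, no per-record schema function can fix that, and the delimiting has to happen on the producer side.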

Thanks, M

2 REPLIES

Re: kafka to spark streaming with XML

New Contributor

Hi Mike @Mike Krauss,

I am having the same issue; could you please let me know how you solved this problem?

Thank you for your help,

Ankush Reddy


Re: kafka to spark streaming with XML

Explorer

At the moment I have not worked on this issue since, but I will resurrect it and try some things out.

First, you could start with the Databricks libraries. I did try one library off GitHub, but it was too difficult to work with; the schema I am using is quite complex. What schema do you have for your data? Some ideas I have learned about, but not tried, include pre-converting the XML to CSV or Avro before consuming it into Spark, then using the Databricks CSV library (or another) to process it in the stream portion. Let me know how you are ingesting the XML; I still need to do this at some point.
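The pre-convert-to-CSV idea above can be sketched with nothing beyond the JDK's built-in DOM parser. The <order> element and its id/item fields are hypothetical; with a complex schema you would pull out whichever fields your downstream job needs.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlToCsv {

    // Flatten one XML message into a single CSV line, so the stream can
    // be handled downstream with the Databricks CSV reader or a plain
    // split(",") instead of a full XML parser.
    public static String toCsv(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        String id = doc.getElementsByTagName("id").item(0).getTextContent();
        String item = doc.getElementsByTagName("item").item(0).getTextContent();
        return id + "," + item;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toCsv("<order><id>42</id><item>widget</item></order>"));
        // prints 42,widget
    }
}
```

A conversion like this could run either in the producer before publishing to Kafka, or as the first map() step of the streaming job.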
