How best to store near-real time data from Spark Streaming

Hi All,

I'm new to the Hadoop world and have a general question about how others store data from Spark Streaming jobs. I'm working on a concept that uses Spark Streaming to read data from Kafka and run a streaming ETL job. The job will process and store data in near-real time. Along the way I want to persist the data at different stages of the transformation and also do lookups against other tables. A basic example would be to take a record, check whether it already exists in the data store (which I originally thought might be a Hive table), and insert it if it doesn't.

I've looked at Hive Streaming, but I don't see any discussion of Spark Streaming integration, and all of the research I've done on inserting into Hive warns that it creates many small files, which cause problems.

My question is: what are other people doing to store their data from Spark Streaming? Should I be using HBase or something else for this instead of Hive?

Thanks in advance for your responses.

1 ACCEPTED SOLUTION

Re: How best to store near-real time data from Spark Streaming

HBase fits your use case, because you need to:

1. Quickly write streaming data that arrives at high velocity

2. Perform random lookups against the dataset that you are writing to
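
To make the "check whether the record exists, insert it if it doesn't" pattern from the question concrete, here is a minimal sketch in plain Python. It does not talk to HBase: the `KeyedStore` class is a hypothetical in-memory stand-in for a table addressed by row key, and `process_batch` is a hypothetical helper representing the work done on one micro-batch. In a real job the store would be an HBase table, and the conditional write would more likely use one of the HBase client's own check-and-write calls (for example `checkAndMutate` in the Java client) so that the existence check and the insert happen atomically on the server.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Record = Dict[str, str]


class KeyedStore:
    """Hypothetical in-memory stand-in for a row-key-addressed table (e.g. HBase)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Record] = {}

    def get(self, row_key: str) -> Optional[Record]:
        # Random lookup by row key; returns None when the row is absent.
        return self._rows.get(row_key)

    def put_if_absent(self, row_key: str, record: Record) -> bool:
        # Insert only when the row key is not already present.
        # Returns True if the record was written, False if it already existed.
        if row_key in self._rows:
            return False
        self._rows[row_key] = record
        return True


def process_batch(store: KeyedStore, batch: Iterable[Tuple[str, Record]]) -> List[str]:
    """Apply check-then-insert to one micro-batch; return the keys actually inserted."""
    inserted: List[str] = []
    for row_key, record in batch:
        if store.put_if_absent(row_key, record):
            inserted.append(row_key)
    return inserted
```

Calling `process_batch` twice with overlapping keys inserts each key only once, and the first version of a record is the one that stays in the store. Because every operation is a lookup or write by key, this access pattern maps naturally onto a key-value store and avoids the many-small-files problem that comes with frequent small inserts into Hive.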

Re: How best to store near-real time data from Spark Streaming

Thank you, Binu. I was thinking that was probably the answer, but I was hoping there was a way to get Hive to work for me. Now, off to figure out HBase...
