We have a requirement to read Avro messages from Kafka using Spark Streaming, convert them to JSON, and store them in a data store. I followed this awesome article by Timothy: https://community.hortonworks.com/articles/33275/receiving-avro-messages-through-kafka-in-a-spark-s.... However, in our case the schema is 116 lines long, and I'm not sure that using SchemaBuilder to build the schema for such a long file is the right approach. It would be of great help if you could guide us on the right approach.
Note: we are open to any version of Scala and Spark.
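For a schema this long, one alternative to SchemaBuilder is keeping the schema as a plain .avsc file and parsing it with Avro's Schema.Parser, so nothing has to be rebuilt field-by-field in code. A minimal sketch, assuming the schema lives in its own file; the record name and fields below are made up for illustration:

```scala
import org.apache.avro.Schema

// Made-up stand-in for the real 116-line schema; in practice this JSON
// would live in a .avsc file checked in next to the job.
val schemaJson =
  """{
    |  "type": "record",
    |  "name": "UserEvent",
    |  "fields": [
    |    {"name": "id", "type": "string"},
    |    {"name": "ts", "type": "long"}
    |  ]
    |}""".stripMargin

// Parse the full schema in one call, however many lines it is.
val schema: Schema = new Schema.Parser().parse(schemaJson)

// In a real job the schema would come straight from the file, e.g.:
// val schema = new Schema.Parser().parse(new java.io.File("user-event.avsc"))
```

This keeps the 116-line schema as data rather than code, so it can be updated without touching the Spark job.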
What version of Spark and Scala do you currently have? Case classes are limited to 22 fields in Scala 2.10.x, but that limit has been removed in Scala 2.11.x.
For a really big schema, it might make sense to do something like that in NiFi.
No long case class required.
Then store with PutHDFS, PutFile, etc.
Use the latest Spark 1.6.2 or 2.0 with Scala 2.11.
In Spark 1.6.2 or 2.0, it could be easiest to save the Avro to HDFS, then load it with Spark SQL and write it out as JSON. Something like this:
// import needed for the .avro method to be added
import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// The Avro records get converted to Spark types, filtered, and
// then written back out as JSON
val df = spark.read.avro("src/test/resources/episodes.avro")
df.filter("doctor > 5").write.format("json").save("namesAndAges.json")
That should work.
It's a really simple use case, better suited to something like NiFi.
If you are doing ML or more complex conversions and reductions, I would use Spark. For something as simple as streaming ingest of Avro and saving as JSON (or better yet, ORC), use Apache NiFi 1.0 / HDF 2.0. Don't write ETL code for simple use cases; save Spark for the complex stuff.
Thanks for the detailed answer with dependencies, Timothy. If our use case were as simple as reading from an Avro file, we would have used this solution or NiFi, but ours is a complex use case where we will be using ML. This solution works great when reading from Avro files, but we are reading Avro messages from Kafka with Spark Streaming and want to convert them to JSON to do some session transformations. I would appreciate it if you could suggest a solution that fits our use case.
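For the Kafka case, one option is to decode each message payload with Avro's GenericDatumReader and re-encode it with a JsonEncoder, so no case class is needed regardless of schema length. A sketch under those assumptions; the streaming wiring in the trailing comment uses a made-up topic name and Kafka parameters:

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Hypothetical helper: decode one Avro-encoded Kafka payload and re-encode
// it as a JSON string. `schema` would be parsed once from the .avsc file
// (e.g. via Schema.Parser) and shared with the executors.
def avroBytesToJson(payload: Array[Byte], schema: Schema): String = {
  val reader  = new GenericDatumReader[GenericRecord](schema)
  val decoder = DecoderFactory.get().binaryDecoder(payload, null)
  val record  = reader.read(null, decoder)

  val out     = new ByteArrayOutputStream()
  val encoder = EncoderFactory.get().jsonEncoder(schema, out)
  new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
  encoder.flush()
  out.toString("UTF-8")
}

// Inside the streaming job (sketch for Spark 1.6-era spark-streaming-kafka;
// the topic name and Kafka params are assumptions):
// KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
//     ssc, kafkaParams, Set("events"))
//   .map { case (_, bytes) => avroBytesToJson(bytes, schema) }
//   .foreachRDD(rdd => /* session transformations, then write to the data store */)
```

The JSON strings can then feed the session transformations directly, without ever writing intermediate Avro files to HDFS.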
Hi @Aditya Mamidala, you mentioned that you were storing the data into a data store. I was wondering whether you are writing the Spark Streaming output into a single file or multiple files, because I am working on a similar use case.