Support Questions

Read Avro messages consumed using Spark Streaming and convert them to JSON (for an Avro schema that is 116 lines long)

Explorer

We have a requirement to read Avro messages from Kafka using Spark Streaming, convert them to JSON, and store them in a data store. I followed this awesome article by Timothy: https://community.hortonworks.com/articles/33275/receiving-avro-messages-through-kafka-in-a-spark-s.... But in our case the schema is 116 lines long, and I'm not sure that using SchemaBuilder to build the schema for such a long file is the right approach. It would be of great help if you could guide us on the right approach.

Note: we are open to any version of Scala and Spark.
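
For reference, one alternative to hand-writing a SchemaBuilder definition is to parse the .avsc file directly with Avro's Schema.Parser. A minimal sketch (the file name event.avsc is hypothetical):

// Parse the 116-line schema straight from its .avsc file instead of
// rebuilding it field by field with SchemaBuilder
import org.apache.avro.Schema

val schema: Schema = new Schema.Parser().parse(new java.io.File("event.avsc"))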

6 REPLIES

Expert Contributor

What version of Spark and Scala do you currently have? Case classes are limited to 22 fields in Scala 2.10.x, but that limit has been removed in Scala 2.11.x.
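
To make that limit concrete, a wide case class like the hypothetical one below compiles on Scala 2.11+, while the 2.10 compiler rejects it:

// Hypothetical 23-field case class: fine on Scala 2.11+, but Scala 2.10 fails with
// "Implementation restriction: case classes cannot have more than 22 parameters"
case class WideEvent(
  f01: String, f02: String, f03: String, f04: String, f05: String, f06: String,
  f07: String, f08: String, f09: String, f10: String, f11: String, f12: String,
  f13: String, f14: String, f15: String, f16: String, f17: String, f18: String,
  f19: String, f20: String, f21: String, f22: String, f23: String)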

Explorer

Hi Joe, we are using Spark 1.6.1 and Scala 2.10.

Super Guru

With a really big schema, it might make sense to do something like that in NiFi.

Convert Avro -> JSON with the ConvertAvroToJSON processor:

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.avro.ConvertAvroToJSON/

No long case class required.

Then store the output with PutHDFS, PutFile, etc.

Use the latest Spark 1.6.2 or 2.0 with Scala 2.11.

In Spark 1.6.2 or 2.0, it could be easiest to save the Avro to HDFS, then load it with Spark SQL and write it out as JSON:

https://github.com/databricks/spark-avro

Something like this:

// import needed for the .avro method to be added
import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// The Avro records get converted to Spark SQL types, filtered, and
// then written back out as JSON
val df = spark.read.avro("src/test/resources/episodes.avro")

df.filter("doctor > 5").write.format("json").save("namesAndAges.json")

That should work.

It's a really simple use case, better suited to something like NiFi.

If you are doing ML or some more complex conversions and reductions, I would use Spark. For something as simple as streaming ingest of Avro and saving as JSON (or better yet ORC), use Apache NiFi 1.0 / HDF 2.0. Don't write ETL code for simple use cases. Save Spark for the complex stuff.

Explorer

Thanks for the detailed answer with dependencies, Timothy. If our use case were as simple as reading from Avro files we would have used this solution, or NiFi, but ours is a complex use case where we will be using ML. This solution works great when reading from Avro files, but we are reading Avro messages from Kafka using Spark Streaming and want to convert them to JSON to do some session transformations. Would appreciate it if you could suggest a solution that fits our use case.
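
For that use case, a minimal sketch along these lines may help (assuming plain binary-encoded Avro on the topic, Spark 1.6's Kafka 0.8 direct stream API, and hypothetical broker, topic, schema-file, and output-path names). The GenericDatumReader avoids SchemaBuilder entirely, and GenericRecord.toString renders each record as JSON text:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("KafkaAvroToJson")
val ssc = new StreamingContext(conf, Seconds(10))

// Ship the schema text to the executors; older Avro Schema objects are not serializable
val schemaString = scala.io.Source.fromFile("event.avsc").mkString
val schemaBc = ssc.sparkContext.broadcast(schemaString)

// Direct stream of (key, raw Avro bytes) from Kafka
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
  ssc, kafkaParams, Set("avro-topic"))

// Decode each message with a GenericDatumReader and emit the record as JSON text
val jsonStream = stream.mapPartitions { messages =>
  val schema = new Schema.Parser().parse(schemaBc.value)
  val reader = new GenericDatumReader[GenericRecord](schema)
  messages.map { case (_, bytes) =>
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null, decoder).toString // GenericRecord.toString is a JSON rendering
  }
}

// One output directory per batch; swap in your own sink / session transformations here
jsonStream.saveAsTextFiles("/tmp/avro-as-json")

ssc.start()
ssc.awaitTermination()

From there the per-record JSON strings can feed whatever session transformations or data store writes you need, instead of saving to text files.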

Super Guru

Thanks for liking the article!

Hi @Aditya Mamidala, you mentioned that you were storing the data into a data store. I'm wondering whether you are writing the Spark Streaming output into a single file or multiple files, because I am working on a similar use case too.