Read an Avro file and produce multiple pair RDDs

Sorry for the long problem description. Here it is.

I'm replicating Oracle databases to Parquet tables in S3. In S3 I have time-ordered Avro files that contain Oracle database transactions. Each Avro file contains a mix of metadata records and transaction records (inserts, updates, deletes) for all of the Oracle tables. I want to write a Spark job that reads the Avro files and produces multiple pair RDDs, one per Oracle table. Each pair RDD will have the primary key value as the key and a JSON blob as the value; the JSON blob includes a single transaction and its corresponding metadata (read earlier in the Avro file). I want the pair RDDs partitioned on the primary key value with a fixed number of partitions (say 10,000). Later I will join the pair RDDs with DataFrames from S3 that are also partitioned on the primary key value and have the same number of partitions.

I've done a bunch of reading and worked through some tutorials, but it's still not clear to me how to do this. Can you suggest how to get started? Some pseudocode would be awesome.
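To help frame the question, here is a rough sketch of the kind of thing I'm imagining (I'm not at all sure it's the right approach). The S3 path and the column names recordType, tableName, and primaryKey are just placeholders for whatever the real Avro schema actually contains:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

object AvroToPairRdds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("avro-to-pair-rdds").getOrCreate()

    // Read the time-ordered Avro files from S3 into one DataFrame.
    // Spark 2.4+ reads Avro with format "avro" when the spark-avro module is on
    // the classpath; older versions use the Databricks spark-avro package.
    val raw = spark.read.format("avro").load("s3a://my-bucket/oracle-cdc/*.avro")

    // Fixed partition count, matching the 10,000 partitions used on the
    // DataFrame side so the later join lines up.
    val numPartitions = 10000
    val partitioner = new HashPartitioner(numPartitions)

    // "recordType", "tableName", and "primaryKey" are placeholder column names;
    // substitute whatever the real Avro schema calls them.
    val txns = raw.filter(col("recordType") === "transaction")

    // The set of Oracle tables present in these files.
    val tableNames = txns.select("tableName").distinct().collect().map(_.getString(0))

    // One pair RDD per table: (primary key value, JSON blob of the whole record),
    // hash-partitioned on the key.
    val pairRdds: Map[String, RDD[(String, String)]] = tableNames.map { table =>
      val forTable = txns.filter(col("tableName") === table)
      val pairs = forTable
        .select(
          col("primaryKey").cast("string").as("pk"),
          to_json(struct(forTable.columns.map(col): _*)).as("json"))
        .rdd
        .map(row => (row.getString(0), row.getString(1)))
        .partitionBy(partitioner)
      table -> pairs
    }.toMap

    // pairRdds("MY_TABLE") is now an RDD[(String, String)] keyed by primary key.
  }
}

Two things I'm unsure about with this sketch: (1) it filters the transactions once per table, which seems wasteful if there are many Oracle tables, and (2) it doesn't attach the metadata records that appear earlier in each Avro file to the transactions, which is the part I understand least. My hope is that using the same HashPartitioner and partition count on both sides lets the later join with the co-partitioned DataFrames (after keying them the same way) avoid a full shuffle.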

Thanks and best,

Bill
