
How to filter a Spark RDD based on a particular field value in Java?

2 REPLIES

Guru

RDDs do not really have fields per se, unless, for example, you have an RDD of Row objects. You would usually filter on an index:

rdd.filter(x => x(1) == "thing")

(example in Scala for clarity; the same approach applies in Java)
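
Since the question asks for Java, here is a minimal sketch of the same index-based filter using Spark's Java API with Java 8 lambdas; the class name, app name, sample data, and String[] record type are invented for illustration:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FilterByIndex {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "FilterByIndex");

        // Each record is a String[]; fields[1] plays the role of x(1) in the Scala example
        JavaRDD<String[]> rdd = sc.parallelize(Arrays.asList(
                new String[] {"a", "thing"},
                new String[] {"b", "other"}));

        JavaRDD<String[]> matched = rdd.filter(fields -> "thing".equals(fields[1]));
        System.out.println(matched.count()); // prints 1

        sc.stop();
    }
}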

If you have an RDD of a typed object, the same approach applies, but you can, for example, call a getter inside the filter lambda.
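
A sketch of that in Java, assuming a hypothetical Person bean with a getName() getter (the class and sample data are invented for illustration):

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FilterByGetter {
    // Hypothetical POJO; it must be Serializable so Spark can ship it to executors
    public static class Person implements Serializable {
        private final String name;
        public Person(String name) { this.name = name; }
        public String getName() { return name; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "FilterByGetter");

        JavaRDD<Person> people = sc.parallelize(Arrays.asList(
                new Person("thing"), new Person("other")));

        // Call the getter inside the filter lambda
        JavaRDD<Person> matched = people.filter(p -> "thing".equals(p.getName()));
        System.out.println(matched.count()); // prints 1

        sc.stop();
    }
}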

If you want field references, you would do better with the DataFrame API. Compare, for example:

sqlContext.createDataFrame(rdd, schema).where($"fieldname" === "thing")

(see the official Spark DataFrames documentation for more on the schema)
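
For the Java side, a sketch of the same DataFrame filter, assuming the Spark 1.x SQLContext/DataFrame API, a programmatically built one-column schema, and the made-up column name "fieldname" from the example above:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;

public class DataFrameFilter {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "DataFrameFilter");
        SQLContext sqlContext = new SQLContext(sc);

        // Programmatic schema: a single string column called "fieldname"
        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("fieldname", DataTypes.StringType, true)));

        JavaRDD<Row> rdd = sc.parallelize(Arrays.asList(
                RowFactory.create("thing"), RowFactory.create("other")));

        // col("fieldname").equalTo(...) is the Java spelling of $"fieldname" === ...
        DataFrame matched = sqlContext.createDataFrame(rdd, schema)
                .where(col("fieldname").equalTo("thing"));
        matched.show(); // one row: "thing"

        sc.stop();
    }
}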


As Simon mentioned, RDDs don't have a schema attached. DataFrames (conceptually similar to a database table) do have an attached schema (column names, column types, etc.), so you can quite easily filter on a column.

You can also create a DataFrame from an RDD and then filter on it, as sketched below.
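
One way to do that from Java, sketched here under the assumption of the Spark 1.x API: pass a JavaBean class to createDataFrame so that Spark infers the schema from the bean's getters via reflection (the Person bean and sample data are hypothetical):

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import static org.apache.spark.sql.functions.col;

public class BeanToDataFrame {
    // Hypothetical JavaBean; Spark derives the column "name" from getName()/setName()
    public static class Person implements Serializable {
        private String name;
        public Person() { }
        public Person(String name) { this.name = name; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "BeanToDataFrame");
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<Person> rdd = sc.parallelize(Arrays.asList(
                new Person("thing"), new Person("other")));

        // Schema is inferred from the bean class, then the filter is a column expression
        DataFrame df = sqlContext.createDataFrame(rdd, Person.class);
        df.where(col("name").equalTo("thing")).show(); // one row: "thing"

        sc.stop();
    }
}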

See http://hortonworks.com/hadoop-tutorial/a-lap-around-apache-spark/, in particular the section on programmatically specifying a schema (which attaches a schema to an RDD to produce a DataFrame), and the Additional DataFrame API Example section for a DataFrame filter example.