In exercise 3 of the tutorial, there are the following imports:
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.avro.generic.GenericRecord
import parquet.hadoop.ParquetInputFormat
import parquet.avro.AvroReadSupport
import org.apache.spark.rdd.RDD
I would like to know what these are:
import parquet.hadoop.ParquetInputFormat
import parquet.avro.AvroReadSupport
Wherever I look, all the Spark/Hadoop-related libraries have the org.apache prefix. Is there something special about those two libraries? Are they treated differently by Cloudera?
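For context, here is a minimal sketch of how those two classes are typically wired together to read a Parquet file as Avro GenericRecords in Spark. This is my own illustration, not code from the tutorial; the path "/data/example.parquet" is a placeholder, and it assumes the parquet-mr and Spark jars are on the classpath:

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.avro.generic.GenericRecord
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import parquet.hadoop.ParquetInputFormat
import parquet.avro.AvroReadSupport

// sc is an existing SparkContext (e.g. from spark-shell)
def readParquetAsAvro(sc: SparkContext): RDD[GenericRecord] = {
  val job = Job.getInstance()

  // AvroReadSupport tells ParquetInputFormat to materialize each
  // Parquet row as an Avro GenericRecord instead of a raw Group.
  ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[GenericRecord]])

  // ParquetInputFormat produces (Void, GenericRecord) pairs; the key
  // is always null, so we keep only the values.
  sc.newAPIHadoopFile(
      "/data/example.parquet",                       // placeholder path
      classOf[ParquetInputFormat[GenericRecord]],
      classOf[Void],
      classOf[GenericRecord],
      job.getConfiguration)
    .map(_._2)
}
```

So ParquetInputFormat is the Hadoop InputFormat that reads Parquet files, and AvroReadSupport is the pluggable piece that decodes each row into the Avro object model.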
OK, apparently I posted my question too quickly. I have finally found some packages which start with parquet, like parquet.avro. For example here: https://github.com/stripe/parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroReadSup....
BTW - I might be doing something wrong in my googling, but I cannot find consistent documentation on the parquet modules, in Javadoc or any other format. Sure, there is parquet.apache.org, and there is some general documentation there, but I cannot see any API documentation. Take the AvroReadSupport class, for example: is there a place where official API documentation for it is available?