I saw your question related to this on the impala-user mailing list. I'll repeat the suggestion here for anyone who finds this thread.
What you'd change:
1. Configure Flume to read from Kafka
2. Set the format to Parquet when creating the dataset:
mvn kite:create-dataset \
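For illustration, a full invocation using the equivalent `kite-dataset` CLI would look something like the sketch below; the dataset name and schema file are placeholders for your setup:

```shell
# Create a dataset whose data files are stored as Parquet.
# "movies" and "movie.avsc" are placeholder names.
kite-dataset create movies \
  --schema movie.avsc \
  --format parquet
```

The `--format parquet` option is what switches the dataset's storage format from the Avro default.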
Is there more information you need to get started?
We inherit Avro's schema resolution rules for schema evolution:
You can update the schema used by a dataset using the Kite command line tool:
kite-dataset update movies --schema movies2.avsc
The tool will check compatibility before completing the update. Since we store a copy of the Avro schema in the Parquet files, we can resolve it against the current dataset schema when reading data, so no data migration is needed.
This also works with Hive/Impala backed datasets as we update the table definition when we update the dataset.
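As an illustration, one change that is compatible under Avro's resolution rules is adding a field with a default value. A hypothetical movies2.avsc might add an optional rating field to the original schema (record and field names here are placeholders):

```json
{
  "type": "record",
  "name": "Movie",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "year", "type": "int"},
    {"name": "rating", "type": ["null", "double"], "default": null}
  ]
}
```

Records written before the update simply read back with the default value for the new field.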
You would need to update your morphlines configuration to adjust the conversion from JSON to Avro if the JSON schema changes.
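For reference, the JSON-to-Avro part of a morphlines configuration is a sketch along these lines (the command names come from the Kite Morphlines library; the schema path is a placeholder):

```
morphlines : [
  {
    id : convertJsonToAvro
    importCommands : ["org.kitesdk.**"]
    commands : [
      # parse the incoming JSON event body
      { readJson {} }
      # convert the JSON record to Avro using the dataset schema
      { toAvro { schemaFile : /path/to/movie.avsc } }
      # serialize the Avro record for the Flume sink
      { writeAvroToByteArray { format : containerlessBinary } }
    ]
  }
]
```

If the JSON schema changes, the `schemaFile` referenced by `toAvro` is the piece that needs updating.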
I posted on the CDH-user list as well. I went through the steps described in kite-examples/json:
When trying to start the flume agent, even with FLUME_JAVA_OPTS=-Xmx1024m, I'm running into an OOM error:
Thanks, got this resolved by uncommenting this line in /etc/flume-ng/conf/flume-env.sh:
export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
Question regarding Parquet data file size: what are the options for ensuring large files (around 1 GB)? Is adjusting the batchSize an option? If not, how should compaction be handled?
I was able to get the JSON-to-Avro conversion working, but after deleting the dataset and recreating it with the create-dataset goal and the Parquet format option, I noticed this error in flume.log:
java.lang.IllegalArgumentException: Unsupported format: parquet
Also, I found the 'toAvro' morphline command but didn't find a 'toParquet'. Please let me know how to fix this.