JSON to Parquet

Expert Contributor

We have a Kafka-to-HDFS pipeline ingesting JSON, and we want to convert the data to Parquet format.

 

How do I do this using Kite SDK?

 

Thanks!


Re: JSON to Parquet

Contributor

Hi!

 

I saw your question related to this on the impala-user mailing list. I'll repeat the suggestion here for anyone who finds this thread.

 

https://github.com/kite-sdk/kite-examples/tree/master/json

What you'd change:

1. Configure Flume to read from Kafka[1]
2. Set the format to Parquet when creating the dataset:

mvn kite:create-dataset \
  -Dkite.rootDirectory=/tmp/data \
  -Dkite.datasetName=users \
  -Dkite.avroSchemaFile=/etc/flume-ng/schemas/user.avsc \
  -Dkite.format=parquet

 

Is there more information you need to get started?

-Joey

[1] https://issues.apache.org/jira/browse/FLUME-2250

Re: JSON to Parquet

Expert Contributor

Thanks Joey, will explore this option.

 

I'm also interested in how schema evolution is handled. Is it supported directly, or is it just handled through a data migration?

Re: JSON to Parquet

Contributor

We inherit Avro's schema resolution rules for doing evolution:

 

http://kitesdk.org/docs/current/guide/Schema-Evolution/

 

You can update the schema used by a dataset using the Kite command line tool:

 

kite-dataset update movies --schema movies2.avsc

The tool will check compatibility before completing the update. Since we store a copy of the Avro schema in the Parquet files, we can resolve the schema with the current dataset schema when reading data, so no data migration is needed.
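
To make the resolution idea concrete, here is a minimal Python sketch of two of the rules involved: a field added in the new schema needs a default, and a record written with the old schema is read by filling in that default. It is a simplified illustration only, not Kite's or Avro's actual implementation. The schemas MOVIES_V1 and MOVIES_V2 and both function names are hypothetical, invented for this example.

```python
# Simplified sketch of Avro-style schema resolution. This is NOT the real
# compatibility check that Kite or Avro performs; it covers only added fields.
# MOVIES_V1 and MOVIES_V2 are hypothetical schemas made up for illustration.

MOVIES_V1 = {
    "type": "record",
    "name": "Movie",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "title", "type": "string"},
    ],
}

MOVIES_V2 = {
    "type": "record",
    "name": "Movie",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "title", "type": "string"},
        # Newly added field: the default is what keeps old files readable.
        {"name": "year", "type": ["null", "int"], "default": None},
    ],
}


def added_fields_without_default(old_schema, new_schema):
    """Return names of fields that exist only in new_schema and have no default."""
    old_names = {field["name"] for field in old_schema["fields"]}
    return [
        field["name"]
        for field in new_schema["fields"]
        if field["name"] not in old_names and "default" not in field
    ]


def resolve(record, reader_schema):
    """Project a record written with an older schema onto reader_schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field is missing from the old data, so use the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields present in the record but absent from reader_schema are dropped.
    return resolved


if __name__ == "__main__":
    print("fields blocking the update:", added_fields_without_default(MOVIES_V1, MOVIES_V2))
    print("old record read with new schema:", resolve({"id": 1, "title": "Alien"}, MOVIES_V2))
```

If the new field had no default, the first function would report it and the second would raise an error for old records, which is roughly the kind of incompatibility a check before the update is meant to catch.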

 

This also works with Hive/Impala backed datasets as we update the table definition when we update the dataset.

 

You would need to update your morphlines configuration to change the conversion from JSON to Avro if the JSON schema changes.

Re: JSON to Parquet

Expert Contributor

Hi!

 

I posted on the CDH-user list as well. I went through the steps described in kite-examples/json:

 

 https://github.com/kite-sdk/kite-examples/tree/master/json

 

When I try to start the Flume agent, even with FLUME_JAVA_OPTS=-Xmx1024m, I run into an OutOfMemoryError:

 

~~~~~
ERROR [conf-file-poller-0] (org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run:149)  - Unhandled error
java.lang.OutOfMemoryError: Java heap space
at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:332)
at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:327)
at java.util.jar.Manifest.read(Manifest.java:195)
at java.util.jar.Manifest.<init>(Manifest.java:69)
at java.util.jar.JarFile.getManifestFromReference(JarFile.java:181)
at java.util.jar.JarFile.getManifest(JarFile.java:167)
at org.kitesdk.morphline.shaded.com.google.common.reflect.ClassPath$Scanner.scanJar(ClassPath.java:336)
~~~~~
 
Anything else I can try to start the flume agent?
 
Thanks!

Re: JSON to Parquet

Expert Contributor

Forgot to mention that I tried the steps on the CDH quickstart vm.

Re: JSON to Parquet

Expert Contributor

Usually, this means that you need to run Java with more PermGen memory. Try something like -XX:MaxPermSize=128M

Re: JSON to Parquet

Expert Contributor

Thanks, I got this resolved by uncommenting this line in /etc/flume-ng/conf/flume-env.sh:

 

  export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"

 

A question about Parquet data file size: what are the options for ensuring large files (around 1 GB)? Is adjusting the batchSize an option? If not, how should compaction be handled?

 

Thanks!

Re: JSON to Parquet

Expert Contributor

I was able to get the JSON-to-Avro conversion working, but after deleting the dataset and recreating it using the create-dataset goal with the Parquet format option, I see this error in flume.log:

 

~~~~~

java.lang.IllegalArgumentException: Unsupported format: parquet
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
at org.apache.flume.sink.kite.DatasetSink.configure(DatasetSink.java:182)
at org.apache.flume.conf.Configurables.configure(Configurables.java:41)
at org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:413)

~~~~~

 

Also, I found the 'toAvro' morphline command but didn't find a 'toParquet' one. Please let me know how to fix this.

 

Thanks!
