Created 07-05-2022 03:57 AM
Hello,
We are using CDP7 Streams Messaging and we are evaluating the Kafka -> Kafka Connect -> HDFS feature.
From the CDP7 Streams Messaging UI, the configuration options seem very limited, and so does the documentation. So we have the following questions:
1- Is it possible to configure a schema to use when storing data in HDFS as Parquet?
2- Is it possible to tune the partitioning of the Parquet output in HDFS, or does it rely on the Kafka topic partitioning?
Does anyone have examples?
Created 07-05-2022 05:17 AM
The following connector configuration worked for me. My schema was stored in Schema Registry and the connector fetched it from there.
{
  "connector.class": "com.cloudera.dim.kafka.connect.hdfs.HdfsSinkConnector",
  "hdfs.output": "/tmp/topics_output/",
  "hdfs.uri": "hdfs://nn1:8020",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "name": "asd",
  "output.avro.passthrough.enabled": "true",
  "output.storage": "com.cloudera.dim.kafka.connect.hdfs.HdfsPartitionStorage",
  "output.writer": "com.cloudera.dim.kafka.connect.hdfs.parquet.ParquetPartitionWriter",
  "tasks.max": "1",
  "topics": "avro-topic",
  "value.converter": "com.cloudera.dim.kafka.connect.converts.AvroConverter",
  "value.converter.passthrough.enabled": "false",
  "value.converter.schema.registry.url": "http://sr-1:7788/api/v1"
}
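If you prefer to deploy the connector outside the SMM UI, here is a minimal sketch that submits the same configuration through the standard Kafka Connect REST API. The host and port (connect-1:8083) and the absence of REST authentication are assumptions; adjust them for your cluster.
import json
import urllib.request

# Hypothetical Connect REST endpoint; replace with your worker's host/port.
connect_url = "http://connect-1:8083/connectors"

connector_config = {
    "connector.class": "com.cloudera.dim.kafka.connect.hdfs.HdfsSinkConnector",
    "hdfs.output": "/tmp/topics_output/",
    "hdfs.uri": "hdfs://nn1:8020",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "output.avro.passthrough.enabled": "true",
    "output.storage": "com.cloudera.dim.kafka.connect.hdfs.HdfsPartitionStorage",
    "output.writer": "com.cloudera.dim.kafka.connect.hdfs.parquet.ParquetPartitionWriter",
    "tasks.max": "1",
    "topics": "avro-topic",
    "value.converter": "com.cloudera.dim.kafka.connect.converts.AvroConverter",
    "value.converter.passthrough.enabled": "false",
    "value.converter.schema.registry.url": "http://sr-1:7788/api/v1",
}

# The Connect REST API expects {"name": ..., "config": {...}} on POST /connectors.
payload = json.dumps({"name": "asd", "config": connector_config}).encode("utf-8")
request = urllib.request.Request(
    connect_url,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.read().decode())
A successful call returns 201 Created with the connector's configuration echoed back, and the connector should then appear in the SMM UI as well.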
Cheers,
André