Support Questions

PedroGaVal · ‎05-09-2018

Hi,

We are evaluating Kudu for real-time ingestion in production, but we are not sure how much disk and RAM it needs. I understand that Kudu storage does not use HDFS, which means we need additional resources, and we don't know how much disk and RAM is required for each 1 GB of data ingested from Kafka and stored.

How can I control the quota (%) for each process if Kudu does not work with YARN?

Do you recommend a separate cluster for Kudu to control memory usage?

Is there a formula to calculate the additional hardware for Kudu? Disk = replication factor * data? And memory?

If Kudu does not need HDFS for storage, then HDFS is probably not necessary. Is that right?

Thanks in advance

mpercy · ‎05-09-2018

Kudu does not use HDFS at all. It requires its own storage space.

If you use 3x replication (the default) and no compression, then Kudu will take 3x the amount of space that you ingest. However, Kudu tends to encode and compress data efficiently, so you will have to evaluate how much space Kudu takes based on your schema and data ingestion patterns.
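As a rough illustration of that arithmetic, here is a minimal Python sketch. The function name and the `on_disk_ratio` parameter are hypothetical, not part of Kudu: `on_disk_ratio` stands for a ratio you would have to measure yourself on a sample of your own data (bytes on disk per replica divided by bytes ingested), and 1.0 means no gain from encoding or compression.

```python
def estimate_kudu_disk_gb(ingested_gb, replication_factor=3, on_disk_ratio=1.0):
    """Rough cluster-wide disk footprint in GB for a given ingested volume.

    replication_factor: number of tablet replicas (3 is the default mentioned above).
    on_disk_ratio: ASSUMPTION, a value measured on your own schema and data;
                   1.0 means the data is stored with no encoding/compression gain.
    """
    if ingested_gb < 0 or replication_factor < 1 or on_disk_ratio <= 0:
        raise ValueError("inputs must be positive")
    return ingested_gb * replication_factor * on_disk_ratio


# 1 GB ingested, 3 replicas, no compression: three times the ingested size.
print(estimate_kudu_disk_gb(1))
# 100 GB ingested, assuming the data shrinks to half its size on disk.
print(estimate_kudu_disk_gb(100, on_disk_ratio=0.5))
```

This only covers table data. It does not account for write-ahead logs, free space needed for compactions, or any headroom, so treat the result as a lower bound when planning.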

The more RAM you give Kudu the better it will perform... treat Kudu like a database (think MySQL or Vertica).

Right now there is no way to specify a quota. The only available settings related to that are --fs_wal_dir_reserved_bytes ( https://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_fs_wal_dir_reserved_bytes ) and --fs_data_dirs_reserved_bytes ( https://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_fs_data_dirs_reserved_bytes ).
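Since the original question asked about percentages, here is a small Python sketch that turns a percentage of each filesystem into values for those two flags. The helper name and its parameters are hypothetical. As I understand the configuration reference, these flags tell Kudu how many bytes to leave free on the filesystem for non-Kudu use, so this sets a floor on free space rather than a true per-process quota.

```python
def reserved_bytes_flags(wal_fs_total_bytes, data_fs_total_bytes, reserve_percent):
    """Build tablet server flag strings that reserve a percentage of each filesystem.

    wal_fs_total_bytes / data_fs_total_bytes: total size of the filesystem that
        holds the WAL directory and of each data directory filesystem.
    reserve_percent: ASSUMPTION, an integer percentage (0-99) chosen by the operator.
    """
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be between 0 and 99")
    # Integer arithmetic keeps the byte counts exact.
    wal_reserved = wal_fs_total_bytes * reserve_percent // 100
    data_reserved = data_fs_total_bytes * reserve_percent // 100
    return [
        f"--fs_wal_dir_reserved_bytes={wal_reserved}",
        f"--fs_data_dirs_reserved_bytes={data_reserved}",
    ]


# Example: keep 20% free on a 500 GB WAL disk and on 2 TB data disks.
for flag in reserved_bytes_flags(500 * 10**9, 2 * 10**12, 20):
    print(flag)
```

The totals could come from something like `shutil.disk_usage(path).total` on each tablet server. Check the linked reference for the exact semantics and defaults of both flags before relying on the output.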

If you need to closely control the amount of space Kudu uses, then you can consider putting it on its own partitions or machines. However, it is possible to put Kudu on the same machines that have HDFS running on them if you want to do that.

Hope that helps!

