05-09-2018 12:27 AM - edited 05-09-2018 12:38 AM
We are evaluating Kudu for real-time ingestion in production, but we are not sure how much disk/RAM it needs. I understand that Kudu storage does not run on HDFS, which means we need additional resources, and we don't know how much disk/RAM is required for each 1 GB of data ingested from Kafka.
How can I control the quota (%) for each process if Kudu does not work with YARN?
Do you recommend a separate cluster for Kudu to control memory usage?
Is there a formula to calculate the additional hardware for Kudu? disk = replication factor * data? And memory?
If Kudu does not need HDFS for storage, then HDFS is probably not necessary at all. Isn't that right?
Thanks in advance
05-09-2018 06:14 PM - edited 05-09-2018 06:16 PM
Kudu does not use HDFS at all. It requires its own storage space.
If you use 3x replication (the default) and no compression, then Kudu will take 3x the amount of space that you ingest. However, Kudu tends to encode and compress data efficiently, so you will have to evaluate how much space Kudu actually takes based on your schema and data ingestion patterns.
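To make that concrete, here is a rough back-of-envelope sketch of the sizing arithmetic. The function name and the 2x compression ratio are hypothetical placeholders; the real ratio depends entirely on your schema and data, so measure it on a sample before planning capacity:

```python
def estimated_kudu_disk_gb(raw_gb, replication=3, compression_ratio=2.0):
    """Estimate Kudu's on-disk footprint in GB.

    raw_gb            -- uncompressed data ingested (e.g. from Kafka)
    replication       -- Kudu tablet replication factor (default 3)
    compression_ratio -- assumed reduction from Kudu's columnar
                         encoding + compression (highly data-dependent)
    """
    return raw_gb * replication / compression_ratio

# Example: 100 GB raw, 3x replication, assumed 2x compression -> 150 GB
print(estimated_kudu_disk_gb(100))
```

With no compression at all (ratio = 1.0) this reduces to the plain "replication factor * data" formula from the question.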
The more RAM you give Kudu, the better it will perform; treat Kudu like a database (think MySQL or Vertica).
Right now there is no way to specify a quota; the only available settings related to that are --fs_wal_dir_reserved_bytes ( https://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_fs_wal_dir_reserved_bytes ) and --fs_data_dirs_reserved_bytes ( https://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_fs_data_dirs_reserved_bytes )
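For illustration, a tablet server invocation using those two flags might look like the sketch below. The flag names are from the Kudu configuration reference linked above; the directory paths and the 10 GiB reservation values are placeholders you would adjust for your own cluster:

```shell
# Hypothetical example: reserve 10 GiB (10737418240 bytes) of free
# space on both the WAL and data volumes so Kudu stops writing
# before it fills the disks other processes share.
kudu-tserver \
  --fs_wal_dir=/data/kudu/wal \
  --fs_data_dirs=/data/kudu/data \
  --fs_wal_dir_reserved_bytes=10737418240 \
  --fs_data_dirs_reserved_bytes=10737418240
```

Note this reserves free space rather than capping Kudu's usage, so it is a floor for other processes, not a quota for Kudu.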
If you need to closely control the amount of space Kudu uses, you can consider putting it on its own partitions or machines. However, it is possible to run Kudu on the same machines that run HDFS if you want to do that.
Hope that helps!