
Understanding: Kudu-Tserver data storing technique


Hello all, I want to understand how Kudu stores data on disk as log files. I have the following questions:


1) Which property decides how big a log file in the Kudu tserver data directory can grow before a new log file is created? Is it 


If not, then which property is it? What would be the ideal size to set?


2) In my production environment, the tserver data directory currently contains many 11GB log files. According to the documentation, the size shown by ls -l is the apparent size, and du -h shows the real size. To my surprise, the 11GB files do not all have the same real size: it ranges from 8KB to 600MB. No compression is currently enabled. Is this behaviour expected, and if so, why do the real file sizes differ?


3) As I understand it, each tablet has multiple log files (once a log file reaches the maximum size, writes continue in a new one). Is my understanding correct? If so, how can I tell which tablet a log file belongs to? I currently have close to 500 tablets and more than 4K data files (log files), and I would like to understand how the files map to tablets.




Cloudera Employee
Re 1: Kudu stores its data using what's called the log block manager. Data is stored in container files, which are sparse files whose maximum size is set by --log_container_max_size (10GiB by default). The default should be good and there's likely no reason to change it.

Re 2: It's expected. Container files will be populated with additional blocks over time. 10GiB is just the (sparse) allocated size and the maximum actual size the container file can reach.
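
To see the gap between apparent and real size without running ls and du by hand, something like the following Python sketch may help. It is only an illustration: the function names are made up for this example, it assumes a Linux/Unix filesystem where st_blocks is reported in 512-byte units, and it assumes the container files in the data directory end in ".data" (pass a different suffix if yours do not).

```python
import os


def file_sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a single file."""
    st = os.stat(path)
    # st_size is the logical file length, i.e. what `ls -l` reports.
    # st_blocks counts 512-byte units actually allocated on disk,
    # which is what `du` adds up. For a sparse file it can be far smaller.
    return st.st_size, st.st_blocks * 512


def summarize(data_dir, suffix=".data"):
    """List (name, apparent_bytes, allocated_bytes) for matching files.

    The ".data" suffix is an assumption about how container files are named.
    """
    rows = []
    for name in sorted(os.listdir(data_dir)):
        if name.endswith(suffix):
            apparent, allocated = file_sizes(os.path.join(data_dir, name))
            rows.append((name, apparent, allocated))
    return rows
```

Calling summarize() on one tserver data directory and printing the rows should show the same picture as above: a large apparent size per container file, with allocated sizes that vary from file to file.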

Re 3: No, I don't think container files are dedicated to specific tablets. You may be confusing them with WAL files, which are stored differently, and are not stored in the data directory.

What version of Kudu are you using? In earlier versions, it was more likely to end up with many container files, especially for certain workloads.


First of all, thanks for the reply.


Re Re1 - Okay, --log_container_max_size is currently at its default.


Re Re2 - So the size a log file in the data directory reaches can differ; what factors cause that? And what does a log file actually contain, if the WAL files hold the real table data?


Re Re3 - Okay, so WAL files and tablets can be mapped? Then why do the numbers of WAL files and tablets differ? There are more WAL files than tablets.


My installed Kudu version is 1.6.0.


