Explorer
Posts: 20
Registered: ‎02-23-2016

Why is the data size in Kudu much larger than in HDFS?

I wrote 100 million rows into both Kudu and HDFS (Parquet on HDFS). The data size in Kudu is 20 GB, while the data size in HDFS is 1 GB. Why? Is there any configuration for Kudu that explains this?
Posts: 642
Topics: 3
Kudos: 103
Solutions: 66
Registered: ‎08-16-2016

Re: Why is the data size in Kudu much larger than in HDFS?

This is kind of a side comment, but I do have some thoughts on it.

Kudu borrows ideas from both HDFS and HBase. My guess is that, just as in HBase, the row key matters, along with other factors that can cause the data to balloon. That is where I would start looking.
Posts: 376
Topics: 11
Kudos: 58
Solutions: 32
Registered: ‎09-02-2016

Re: Why is the data size in Kudu much larger than in HDFS?

@heyi

Interesting!

Kudu claims it will take less space than Parquet, HBase, Avro, etc., but it also comes with one limitation: "Kudu is primarily designed for analytic use cases. You are likely to encounter issues if a single row contains multiple kilobytes of data." So this could be the problem in your case.

Cloudera Employee
Posts: 47
Registered: ‎02-05-2016

Re: Why is the data size in Kudu much larger than in HDFS?

How are you measuring Kudu's data size? Currently every approach to measuring it has a few pitfalls. If you're using the tablet size from the web UI, the limitations are documented here. If you're using something like "du", your total probably includes roughly 32 MB of preallocated buffer space per non-full data container. And if you're on XFS, disk space consumption can be much higher than normal due to this bug.
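To illustrate why raw directory size can be misleading, here is a minimal Java sketch that just sums file sizes under a Kudu data directory (the path is a placeholder for one of your configured data dirs). Like "du", the number it reports includes preallocated container space, not only table data.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class KuduDataDirSize {
  public static void main(String[] args) throws IOException {
    // Placeholder path; point this at one of your Kudu data directories.
    Path dataDir = Paths.get(args.length > 0 ? args[0] : "/data/kudu/data");
    try (Stream<Path> files = Files.walk(dataDir)) {
      long totalBytes = files
          .filter(Files::isRegularFile)
          .mapToLong(p -> {
            try {
              return Files.size(p);
            } catch (IOException e) {
              return 0L;
            }
          })
          .sum();
      System.out.printf("Total file size under %s: %.2f GB%n",
          dataDir, totalBytes / (1024.0 * 1024.0 * 1024.0));
      // Like "du", this total counts the preallocated space in non-full data
      // containers, so it overstates the size of the table data itself.
    }
  }
}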

Cloudera Employee
Posts: 50
Registered: ‎09-28-2015

Re: Why is the data size in Kudu much larger than in HDFS?

Hi,

I would also note that Parquet uses compression and encoding by default,
whereas Kudu in version 1.2 (as included with CDH 5.10) does not.

When you create your table, you'll want to specify an encoding and
potentially compression codec for each column. As a starting point, I'd
recommend making the string columns use DICT encoding (assuming they
contain oft-repeating values) and SNAPPY or LZ4 compression (assuming they
are compressible). I'd recommend using BIT_SHUFFLE for the integer or
floating point columns. These encodings will be the new defaults in
upcoming releases of Kudu.
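To make that concrete, here is a minimal sketch using the Kudu Java client; the table name, column names, master address, and partitioning are placeholders for illustration only.

import java.util.Arrays;
import java.util.List;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class CreateEncodedTable {
  public static void main(String[] args) throws Exception {
    // Placeholder master address.
    KuduClient client =
        new KuduClient.KuduClientBuilder("kudu-master-host:7051").build();
    try {
      List<ColumnSchema> columns = Arrays.asList(
          // Integer primary key: bitshuffle encoding.
          new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64)
              .key(true)
              .encoding(ColumnSchema.Encoding.BIT_SHUFFLE)
              .build(),
          // String column with oft-repeating values: dictionary encoding plus LZ4.
          new ColumnSchema.ColumnSchemaBuilder("category", Type.STRING)
              .encoding(ColumnSchema.Encoding.DICT_ENCODING)
              .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.LZ4)
              .build(),
          // Floating point column: bitshuffle encoding.
          new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE)
              .encoding(ColumnSchema.Encoding.BIT_SHUFFLE)
              .build());

      Schema schema = new Schema(columns);
      CreateTableOptions options = new CreateTableOptions()
          .addHashPartitions(Arrays.asList("id"), 4);
      client.createTable("metrics_encoded", schema, options);
    } finally {
      client.close();
    }
  }
}

Note that bitshuffle-encoded blocks are already compressed with LZ4 internally, so there's usually no need to add another codec on top of them.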

With the above encodings, we usually find that Kudu's on-disk data is a
similar size to Parquet's. Sometimes it comes out a bit larger due to the
extra overhead of the primary key indexing Kudu adds, and sometimes it comes
out a bit smaller thanks to some of the more advanced encodings available.
But a 20x difference is not expected.

-Todd
Explorer
Posts: 20
Registered: ‎02-23-2016

Re: Why is the data size in Kudu much larger than in HDFS?

@Todd Lipcon Thank you! Kudu's on-disk data is now a similar size to Parquet's once we specify the encoding and compression codec for each column. BTW, we always get an "Authentication Failed" error when we click the "ACCEPT AS SOLUTION" link.