Created on 11-14-2017 09:11 AM - edited 09-16-2022 05:31 AM
Hi,
How is data replicated in Kudu? My understanding is that Kudu has one replica of all data and 2 replicas with operation logs. From the Apache docs I get this: "Kudu does not replicate the on-disk storage of a tablet, but rather just its operation log. The physical storage of each replica of a tablet is fully decoupled."
In case of a disk failure, if the disk contains the actual data and not the operation log, how is it recovered?
Created 11-15-2017 01:12 PM
Somewhat. Take a look here for more details about the relationship between tables and tablets: https://kudu.apache.org/docs/schema_design.html
There's an important distinction to be made: a tablet is a logical concept (it's a chunk of a table); a replica is a copy of a single tablet. There may be many replicas of a single tablet, depending on the user-specified properties of the table.
E.g. say I have "Table 1" with replication factor 3. This means that every tablet belonging to "Table 1" will always try to maintain 3 replicas/copies. Say "Table 1" has two tablets, "A" and "B"; each will have three replicas. A replica of "A" could fail due to a server failure or somesuch, in which case "A" will try to replicate back up to having 3 healthy replicas. This is completely independent of "B".
So yes, a tablet maintains its operational log, but also all of the data associated with it, because it is just a chunk of a table.
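To make the "Table 1" example concrete, here is a minimal sketch of re-replication as a toy placement model. It is not Kudu code: the function `re_replicate`, the server names (`ts1` to `ts5`), and the "pick the first spare server" policy are all assumptions made for illustration. The real placement logic is more involved.

```python
# Toy model only: every name here is hypothetical and none of it is a Kudu API.
def re_replicate(tablets, live_servers, replication_factor=3):
    """Return a placement in which each tablet is topped back up to
    `replication_factor` replicas, using only servers that are still alive."""
    placement = {}
    for tablet, servers in tablets.items():
        # Replicas hosted on failed servers are dropped.
        healthy = [s for s in servers if s in live_servers]
        # Candidates are live servers that do not already host this tablet.
        spares = [s for s in sorted(live_servers) if s not in healthy]
        missing = max(replication_factor - len(healthy), 0)
        placement[tablet] = healthy + spares[:missing]
    return placement


# "Table 1" has tablets "A" and "B", each with three replicas.
tablets = {"A": ["ts1", "ts2", "ts3"], "B": ["ts2", "ts3", "ts4"]}
live = {"ts2", "ts3", "ts4", "ts5"}  # ts1 has failed

new_placement = re_replicate(tablets, live)
print(new_placement)
```

Only tablet "A" had a replica on the failed server, so only "A" gets a new copy. The placement of "B" is unchanged.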
Hope this helped!
Created 11-14-2017 10:32 AM
Hi Rajesh,
Right, Kudu replicates data logically to multiple tservers based on each table's replication factor (typically 3), and in doing so, writes are only considered successful once durably written to a majority's write-ahead logs. From then on, each tserver can maintain the data via flushing and compactions, "decoupled" from the writes to the log.
Currently, in cases of disk failures, the single failed node will crash. Because the data is written to at least a majority, Kudu "re-replicates" back up to full replication, i.e. all of the tablets that lost a replica because of the crash will notice that one of the servers is down and make a new copy on another healthy server.
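As a rough illustration of the "majority" rule described above, the sketch below checks whether a write counts as successful given which replicas have durably logged it. The helper names `majority` and `write_succeeds` are made up for this example and are not part of Kudu.

```python
# Toy model only: hypothetical helpers, not Kudu APIs.
def majority(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def write_succeeds(wal_acks, replication_factor=3):
    """`wal_acks` holds one boolean per replica: True if that replica has
    durably written the operation to its write-ahead log."""
    return sum(1 for ok in wal_acks if ok) >= majority(replication_factor)


# With replication factor 3, two durable WAL writes are enough.
print(write_succeeds([True, True, False]))
# A single durable WAL write is not a majority.
print(write_succeeds([True, False, False]))
```

This is also why losing one of three replicas loses no data: every acknowledged write is in the WAL of at least two servers.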
Created on 01-01-2018 06:53 PM - edited 01-01-2018 11:20 PM
Hi Awong,
Right, Kudu replicates data logically to multiple tservers based on each table's replication factor (typically 3), and in doing so, writes are only considered successful once durably written to a majority's write-ahead logs. From then on, each tserver can maintain the data via flushing and compactions, "decoupled" from the writes to the log.
After flushing and compaction on the tservers, each tablet will have 2 physical replicas. And a subsequent CLOSEST_REPLICA scan doesn't have to compact the WAL, is this right?
Best regards,
Tony
Created 01-02-2018 01:56 PM
I'm not Andrew but I don't understand your question. Tablet replicas are flushed and/or compacted independently of one another, which means the physical layout of each one may be different; thus it would be incorrect to consider them "physically replicated".
Furthermore, all scans (whether CLOSEST_REPLICA or otherwise) are read-only operations and thus don't trigger WAL garbage collection or any other kind of read-write server-side action.
Created 01-02-2018 05:09 PM
Hi adar,
I think I already get the answer from your post. Maybe the following explanation can clarify my original question.
When I insert data into Kudu, the write only has to reach a majority's write-ahead logs. The internal flushing and/or compaction for each tablet then generates a set of CFiles as the replica.
And every scan only needs to read the replica (a set of CFiles which contain the base data and the delta data) and the MemRowSet to return the query result. Is this right?
But for tablet copying, is only the WAL copied, or are both the WAL and the replica copied?
Best regards,
Tony
Created 01-03-2018 01:07 PM
Quick note: Kudu calls the "set of CFiles which contain the base data and the delta data" a DiskRowSet.
But your understanding is correct: during a scan, the contents of the MemRowSet and some DiskRowSets are scanned for data. During a tablet copy, both the WAL segments and the CFiles are copied.
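A very simplified sketch of that scan path, for illustration only. The `DiskRowSet` class and `scan` function below are toy stand-ins, not Kudu's implementation. The sketch assumes each row key lives in exactly one rowset, models delta data as "latest value per key", and ignores deletes, snapshots, and column projection.

```python
# Toy model only: these classes mimic the concepts, not Kudu's real code.
class DiskRowSet:
    """Flushed rows: base data plus delta data (updates made after the flush)."""

    def __init__(self, base, deltas=None):
        self.base = dict(base)
        self.deltas = dict(deltas or {})

    def rows(self):
        # Apply the delta on top of the base value when one exists.
        for key, value in self.base.items():
            yield key, self.deltas.get(key, value)


def scan(mem_rowset, disk_rowsets):
    """Read the in-memory rows and every DiskRowSet, returning rows by key."""
    result = dict(mem_rowset)
    for drs in disk_rowsets:
        result.update(drs.rows())
    return dict(sorted(result.items()))


mem_rowset = {3: "c"}                                  # not yet flushed
drs = DiskRowSet(base={1: "a", 2: "b"}, deltas={2: "B"})  # row 2 updated after flush

scanned = scan(mem_rowset, [drs])
print(scanned)
```

Note that the WAL is not consulted anywhere in the scan. It matters for durability and for tablet copies, not for serving reads.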
Created 01-03-2018 04:57 PM
Created 11-25-2018 08:30 AM
Created 11-27-2018 02:54 AM
CFile is an on-disk columnar storage format which holds data and associated B-Tree indexes.
https://github.com/cloudera/kudu/blob/master/docs/design-docs/cfile.md
Created on 11-27-2018 01:02 PM - edited 11-27-2018 01:03 PM
A WAL file is a Kudu tablet write-ahead log file. You can read an overview of how the Kudu write path works here (it's a fairly technical blog post): https://blog.cloudera.com/blog/2017/04/apache-kudu-read-write-paths/
The WAL file location is controlled by the configuration parameter --fs_wal_dir which you can read about at https://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_fs_wal_dir