Created 02-14-2017 02:50 PM
My question is NOT about HIVE/HBASE replication across clusters.
Rather, it is about whether the default HDFS replication factor affects HIVE and HBASE data, since both sit on top of HDFS. Within a single cluster running HIVE or HBASE, are there three copies (the default replication factor) of each HIVE/HBASE table's data spread across HDFS?
Appreciate the insights.
Created 02-14-2017 03:15 PM
For files that Hive creates in /apps/hive/warehouse/<database name>/<tablename>/data, the dfs.replication factor is honored by default (unless a user explicitly sets a replication factor for a file, or for the files under a directory).
For example, I have a database testnumber with a table numberstringtest (stored as text), and its data consists of several files holding one row each. In the output below, column 2 is the replication factor, which is 3 in my case.
$ hdfs dfs -ls /apps/hive/warehouse/testnumber.db/numberstringtest/
Found 5 items
-rw-r--r-- 3 hadoopadmin hdfs 9 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0
-rw-r--r-- 3 hadoopadmin hdfs 9 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0_copy_1
-rw-r--r-- 3 hadoopadmin hdfs 10 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0_copy_2
-rw-r--r-- 3 hadoopadmin hdfs 10 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0_copy_3
-rw-r--r-- 3 hadoopadmin hdfs 10 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0_copy_4
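If you only want the replication factor and path from a listing like the one above, something along these lines should work. This is a minimal sketch: the helper name list_replication is made up for illustration, the sample text is two entries copied from the listing above, and it assumes paths contain no spaces. On a real cluster you would pipe the hdfs dfs -ls output for the table directory into the helper instead of the sample.

```shell
# Hypothetical helper: reads `hdfs dfs -ls` output on stdin and prints
# "<replication factor> <path>" for every file entry.
list_replication() {
  # File entries have 8 whitespace-separated fields: field 2 is the
  # replication factor and field 8 is the path. The "Found N items"
  # header has fewer fields, and directories show "-" in field 2,
  # so both are skipped.
  awk 'NF == 8 && $2 != "-" { print $2, $8 }'
}

# Sample input: two of the five entries from the listing above.
sample='Found 5 items
-rw-r--r--   3 hadoopadmin hdfs          9 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0
-rw-r--r--   3 hadoopadmin hdfs         10 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0_copy_2'

printf '%s\n' "$sample" | list_replication
```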
Below is the command I would use to see where the replicated blocks of a file are stored.
$ hdfs fsck /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0 -files -locations -blocks
Connecting to namenode via http://hdp-ranger-1.openstacklocal:50070/fsck?ugi=hdfs&files=1&locations=1&blocks=1&path=%2Fapps%2Fh...
FSCK started by hdfs (auth:KERBEROS_SSL) from /172.26.92.141 for path /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0 at Tue Feb 14 15:07:58 UTC 2017
/apps/hive/warehouse/testnumber.db/numberstringtest/000000_0 9 bytes, 1 block(s): OK
0. BP-1221127906-172.26.92.141-1485863848635:blk_1073744644_3922 len=9 repl=3 [DatanodeInfoWithStorage[172.26.92.142:1019,DS-bc1af702-2112-4c84-880c-506934af5309,DISK], DatanodeInfoWithStorage[172.26.92.141:1019,DS-51baba7f-9220-481c-9957-8a33fb1c1bb7,DISK], DatanodeInfoWithStorage[172.26.92.143:1019,DS-ef06873b-50e4-4c3b-a423-4b174cd465d8,DISK]]
Status: HEALTHY
Total size: 9 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 9 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Tue Feb 14 15:07:58 UTC 2017 in 3 milliseconds
The filesystem under path '/apps/hive/warehouse/testnumber.db/numberstringtest/000000_0' is HEALTHY
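The fsck report is long, so if you only care about the per-block replica count, a small filter like the following may help. It is only a sketch: block_replication is a hypothetical helper name, and the sample is three lines taken from the report above with the datanode location list left off the block line. In practice you would pipe the real hdfs fsck output into it.

```shell
# Hypothetical helper: reads `hdfs fsck <path> -files -locations -blocks`
# output on stdin and prints the number after "repl=" for each block line.
block_replication() {
  # -o prints only the matching part, e.g. "repl=3"; cut keeps the number.
  grep -o 'repl=[0-9]*' | cut -d= -f2
}

# Sample input: lines from the report above (location list omitted).
sample_fsck='0. BP-1221127906-172.26.92.141-1485863848635:blk_1073744644_3922 len=9 repl=3
Default replication factor: 3
Average block replication: 3.0'

printf '%s\n' "$sample_fsck" | block_replication
```

Only the block line carries a repl= field, so the two summary lines are ignored and one value is printed per block.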
I am not sure about HBase, but I would expect the dfs.replication factor to be honored by default there as well, unless a replication factor is explicitly set for a file in HDFS.
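For the "explicitly set" case mentioned above, the usual tools are hdfs getconf (to read the configured default), hdfs dfs -setrep (to change the factor of existing files) and hdfs dfs -stat %r (to read it back). The sketch below only prints the commands so that nothing is changed by accident; the helper name print_setrep_commands and the choice of 2 as the new factor are assumptions for illustration, and the path is the table directory from my example.

```shell
# Hypothetical helper: prints (does not run) the commands for checking the
# default, changing the replication factor of a path, and verifying a file.
# $1 = new replication factor, $2 = HDFS directory, $3 = a file in it
print_setrep_commands() {
  # Cluster-wide default taken from the client configuration.
  echo "hdfs getconf -confKey dfs.replication"
  # -w waits until the new replication level has actually been reached.
  echo "hdfs dfs -setrep -w $1 $2"
  # %r prints the replication factor recorded for the file.
  echo "hdfs dfs -stat %r $2/$3"
}

print_setrep_commands 2 /apps/hive/warehouse/testnumber.db/numberstringtest 000000_0
```

Review the printed commands and run them by hand as a user with write access to the path. Note that setrep changes only files that already exist; files written later pick up whatever dfs.replication the writing client is configured with.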
Created 02-14-2017 02:58 PM
Hello,
The short answer is yes. For example, HBase stores all of its files on HDFS, so these files are replicated according to the replication factor of the underlying HDFS configuration. HBase itself does not store the data multiple times, because that is the responsibility of the underlying file system.
Created 02-14-2017 06:11 PM
Thanks for the insights.
Created 02-14-2017 03:40 PM
Are 000000_0_copy_1, 000000_0_copy_2 and 000000_0_copy_3 the HDFS replication copies of 000000_0?
Or are they independent files that you created yourself?
Appreciate the feedback.
Created 02-14-2017 04:12 PM
Another related question: if cross-cluster replication is enabled for HBASE/HIVE for HA, is HDFS replication still required? In that case, isn't the default replication factor of 3 overkill? Is it possible to reduce the HDFS replication factor to 2 (one extra copy)?
Any insights on what the standard practice is across the industry?
Appreciate the feedback.