HIVE/HBASE hdfs replication

Rising Star

My question is NOT about HIVE/HBASE replication across clusters.

Rather, it is about whether the default HDFS replication factor affects Hive and HBase data, since both sit on top of HDFS. Within a single cluster, on a Hive or HBase setup, are there three copies (the default replication factor) of each Hive/HBase table stored across HDFS?

Appreciate the insights.

1 ACCEPTED SOLUTION

Super Collaborator

For Hive, data files created under /apps/hive/warehouse/<database name>/<tablename>/ honor the dfs.replication factor by default (unless a user explicitly sets a replication factor for a file, or for the files under a directory).

For example, I have a database testnumber with a table numberstringtest (stored as text format), and its data consists of several files with one row each. In the output below, column 2 shows the replication factor, which is 3 in my case.

$ hdfs dfs -ls /apps/hive/warehouse/testnumber.db/numberstringtest/
Found 5 items
-rw-r--r--   3 hadoopadmin hdfs          9 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0
-rw-r--r--   3 hadoopadmin hdfs          9 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0_copy_1
-rw-r--r--   3 hadoopadmin hdfs         10 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0_copy_2
-rw-r--r--   3 hadoopadmin hdfs         10 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0_copy_3
-rw-r--r--   3 hadoopadmin hdfs         10 2017-02-09 16:31 /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0_copy_4
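
To read the replication column without scanning the listing by eye, a small filter can pull it out. This is only a sketch: the function name list_replication is made up for illustration, and it assumes the listing format shown above (replication factor in column 2, path in the last column, no spaces in paths).

```shell
#!/bin/sh
# Hypothetical helper: print "<replication> <path>" for every regular file
# found in `hdfs dfs -ls` output read from stdin.
# Lines that do not start with "-" (the "Found N items" header, directories,
# which show "-" instead of a replication factor) are skipped.
list_replication() {
  awk '$1 ~ /^-/ { print $2, $NF }'
}

# Usage on a live cluster (path taken from the listing above):
#   hdfs dfs -ls /apps/hive/warehouse/testnumber.db/numberstringtest/ | list_replication
```

For a single file, hdfs dfs -stat %r <path> should also print just the replication factor.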

Below is the command I would use to find where the replicated blocks of a file are stored.

$ hdfs fsck /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0 -files -locations -blocks

Connecting to namenode via http://hdp-ranger-1.openstacklocal:50070/fsck?ugi=hdfs&files=1&locations=1&blocks=1&path=%2Fapps%2Fh...
FSCK started by hdfs (auth:KERBEROS_SSL) from /172.26.92.141 for path /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0 at Tue Feb 14 15:07:58 UTC 2017
/apps/hive/warehouse/testnumber.db/numberstringtest/000000_0 9 bytes, 1 block(s):  OK
0. BP-1221127906-172.26.92.141-1485863848635:blk_1073744644_3922 len=9 repl=3 [DatanodeInfoWithStorage[172.26.92.142:1019,DS-bc1af702-2112-4c84-880c-506934af5309,DISK], DatanodeInfoWithStorage[172.26.92.141:1019,DS-51baba7f-9220-481c-9957-8a33fb1c1bb7,DISK], DatanodeInfoWithStorage[172.26.92.143:1019,DS-ef06873b-50e4-4c3b-a423-4b174cd465d8,DISK]]


Status: HEALTHY
 Total size:	9 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 9 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	3.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		4
 Number of racks:		1
FSCK ended at Tue Feb 14 15:07:58 UTC 2017 in 3 milliseconds

The filesystem under path '/apps/hive/warehouse/testnumber.db/numberstringtest/000000_0' is HEALTHY
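
If only the per-block replica count matters, the block lines of the fsck report can be filtered out. The sketch below assumes the block-line layout shown in the output above ("N. <block id> len=... repl=..."); other Hadoop versions may label the field differently, and the function name block_replication is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<block id> <replica count>" for each block line
# in `hdfs fsck <path> -files -blocks -locations` output read from stdin.
# Assumes block lines look like: "0. <block id> len=9 repl=3 [...]".
# All other lines (summary, status) produce no output.
block_replication() {
  sed -n 's/^[0-9][0-9]*\. \([^ ]*\) len=[0-9]* repl=\([0-9]*\).*/\1 \2/p'
}

# Usage on a live cluster (path taken from the command above):
#   hdfs fsck /apps/hive/warehouse/testnumber.db/numberstringtest/000000_0 \
#     -files -locations -blocks | block_replication
```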

I am not sure about HBase. I would guess the dfs.replication factor is honored by default there too, unless a different factor is explicitly set for a file in HDFS.
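
As for explicitly setting the factor on existing files, hdfs dfs -setrep is the usual way; dfs.replication only supplies the default when a file is created. The wrapper below is a sketch, not a recommendation: the name set_replication and the DRY_RUN switch are made up for illustration, and by default it only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical wrapper around `hdfs dfs -setrep`.
# Usage: set_replication <factor> <path>
# With DRY_RUN=1 (the default in this sketch) the command is only printed;
# set DRY_RUN=0 to actually run it against the cluster.
set_replication() {
  _factor=$1
  _path=$2
  # Reject empty or non-numeric factors before touching HDFS.
  case $_factor in
    ''|*[!0-9]*)
      echo "replication factor must be a positive integer" >&2
      return 1
      ;;
  esac
  if [ "$_factor" -lt 1 ]; then
    echo "replication factor must be at least 1" >&2
    return 1
  fi
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "hdfs dfs -setrep -w $_factor $_path"
  else
    # -w waits until the new replication target has been reached.
    hdfs dfs -setrep -w "$_factor" "$_path"
  fi
}

# Example (dry run): set_replication 2 /apps/hive/warehouse/testnumber.db/numberstringtest
```

When the path is a directory, setrep applies to all files under it, which matches the "files under directory" case mentioned earlier in this answer.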

5 REPLIES 5

Explorer

Hello,

The short answer is yes. For example, HBase stores all of its files on HDFS, so these files are replicated according to the replication factor of the underlying HDFS configuration. HBase itself does not store data multiple times, because that is the responsibility of the underlying file system.

Rising Star

Thanks for the insights.

Rising Star

Are 000000_0_copy_1, 000000_0_copy_2, and 000000_0_copy_3 the HDFS replication copies of 000000_0?

Or are they independent tables that you had created?

Appreciate the feedback.

Rising Star

Another related question: if cluster replication is enabled for HBase/Hive for HA, is HDFS replication still required? In such cases, isn't the default replication factor of 3 overkill? Is it possible to reduce the HDFS replication factor to 2 (one additional copy) in such cases?

Any insights on what the standard practice across the industry is?

Appreciate the feedback.