Introduction

The purpose of this article is to compare the upload time of three different methods for loading the same semi-structured dataset: two methods that load into Hadoop and one that loads into MariaDB.

Assumptions & Design

A small environment is used to deploy a three-node Hadoop cluster (one master node and two worker nodes). The exercise is run from my laptop, which has the following specs:

Processor Name: Intel Core i7

Processor Speed: 2.5 GHz

Number of Processors: 1

Total Number of Cores: 4

Memory: 16 GB

The Hadoop cluster is virtualized on top of my Mac using Oracle VM VirtualBox Manager. The virtual Hadoop nodes have the following specs:

Table 1: Hadoop cluster node specifications

Specification   Namenode (master node)   Datanode #1 (worker node #1)  Datanode #2 (worker node #2)
Hostname        hdpnn.lab1.com           hdpdn1.lab1.com               hdpdn2.lab1.com
Memory          4646 MB                  3072 MB                       3072 MB
CPU number      3                        2                             2
Hard disk size  20 GB                    20 GB                         20 GB
OS              CentOS-7-x86_64-Minimal  CentOS-7-x86_64-Minimal       CentOS-7-x86_64-Minimal
IP address      192.168.43.15            192.168.43.16                 192.168.43.17

MariaDB was installed on a standalone virtual machine with the following specs:

Hostname: mariadb.lab1.com

IP Address: 192.168.43.55

Memory: 12 GB

Disk: 40 GB

OS: CentOS 7 (Linux)

The semi-structured dataset used is the mail archive of the Apache Software Foundation (ASF), which is around 200 GB in total. The archive contains the communications of more than 80 open-source projects, such as Hadoop, Hive, Sqoop, Zookeeper, HBase, Storm, Kafka and many more. It can be downloaded with the "wget" command or any other tool from this URL: http://mail-archives.apache.org/mod_mbox/

Results

Results collected from uploading mail files to the Hadoop cluster

The following results were collected from 14 distinct uploads. Each of the first 13 uploads loaded one sub-directory; the sub-directories vary in total size, number of files and size of the contained files. The last upload loaded all 13 sub-directories at once. Two upload methods were used, and their upload times differ significantly. The first method is a normal upload of all files directly from the local file system of the Hadoop cluster. The second method uses Hadoop Archive (HAR), a Hadoop capability that combines files into an archive before writing it to HDFS.
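
The two upload paths can be sketched as command lines plus a timer. This is a minimal sketch and not necessarily the exact procedure used for the measurements: all paths and the archive name are hypothetical, and it assumes the standard `hdfs dfs -put` and `hadoop archive` commands are on the PATH of the machine that runs it.

```python
import subprocess
import sys
import time


def put_command(local_dir, hdfs_dir):
    """Direct upload: copy a local directory tree into HDFS."""
    return ["hdfs", "dfs", "-put", local_dir, hdfs_dir]


def har_command(archive_name, parent, src_dir, dest_dir):
    """Hadoop Archive: pack parent/src_dir into dest_dir/archive_name."""
    return ["hadoop", "archive", "-archiveName", archive_name,
            "-p", parent, src_dir, dest_dir]


def timed_run(command):
    """Run a command and return (exit code, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    return completed.returncode, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(put_command("/data/mail/lucene-dev", "/user/hadoop/mail")))
    print(" ".join(har_command("lucene-dev.har", "/user/hadoop/staging",
                               "lucene-dev", "/user/hadoop/archives")))
    # timed_run accepts any command; a no-op Python process is used here
    # so the sketch runs on a machine without Hadoop installed.
    code, seconds = timed_run([sys.executable, "-c", "pass"])
    print(code, round(seconds, 3))
```

On a cluster node, passing the result of `put_command(...)` or `har_command(...)` to `timed_run` would give a wall-clock figure comparable to the load times reported in this article.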

Table 2: Results collected from uploading mail files to the Hadoop cluster

Loaded directory          Size (KB)  Files   Avg. file size (KB)  Direct load (1st)  Direct load (2nd)  HAR load (1st)  HAR load (2nd)
lucene-dev                1214084    53547   22.67                89m16.790s         70m43.092s         2m54.563s       2m28.416s
tomcat-users              1023156    61303   16.69                101m45.870s        86m48.927s         3m17.214s       2m59.006s
cxf-commits               612216     22173   27.61                36m30.333s         29m37.924s         1m28.457s       1m15.189s
usergrid-commits          325740     9838    33.11                14m50.757s         14m8.545s          0m54.038s       0m44.905s
accumulo-notifications    163596     14482   11.30                24m38.159s         24m49.356s         1m3.650s        0m27.550s
zookeeper-user            82116      8187    10.03                14m40.461s         14m34.136s         0m47.865s       0m40.913s
synapse-user              41396      3690    11.22                5m24.744s          4m47.196s          0m38.167s       0m29.043s
incubator-ace-commits     20836      1146    18.18                2m28.330s          2m4.168s           0m25.042s       0m23.401s
incubator-batchee-dev     10404      1086    9.58                 2m18.903s          2m19.044s          0m27.165s       0m23.166s
incubator-accumulo-user   5328       577     9.23                 1m10.201s          1m2.328s           0m26.572s       0m23.300s
subversion-announce       2664       255     10.45                0m50.596s          0m32.339s          0m29.247s       0m21.578s
www-small-events-discuss  1828       218     8.39                 0m45.215s          0m20.160s          0m22.847s       0m21.035s
openoffice-general-ja     912        101     9.03                 0m26.837s          0m7.898s           0m21.764s       0m19.905s
All previous directories  3504280    176603  19.84                224m49.673s        Not tested         8m13.950s       8m46.144s
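
To make the gap between the two Hadoop methods concrete, the load times can be converted to seconds and divided. The snippet below is a small worked calculation on four rows copied from the Hadoop results table (first attempts only); the helper name `parse_duration` is introduced here for illustration.

```python
import re


def parse_duration(text):
    """Convert an 'XmY.ZZZs' duration (the format used in the tables) to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# (directory, direct load 1st attempt, HAR load 1st attempt)
rows = [
    ("lucene-dev", "89m16.790s", "2m54.563s"),
    ("zookeeper-user", "14m40.461s", "0m47.865s"),
    ("openoffice-general-ja", "0m26.837s", "0m21.764s"),
    ("All previous directories", "224m49.673s", "8m13.950s"),
]

for name, direct, har in rows:
    speedup = parse_duration(direct) / parse_duration(har)
    print(f"{name}: HAR is {speedup:.1f}x faster")
```

The ratio is about 31x for lucene-dev and about 27x for the combined upload, but only about 1.2x for openoffice-general-ja, so the advantage of HAR grows with the number of files.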

Results collected from uploading mail files to MariaDB

The following results were collected from 14 distinct uploads. Each of the first 13 uploads loaded one sub-directory; the sub-directories vary in total size, number of files and size of the contained files. The last upload loaded all 13 sub-directories at once.

Table 3: Results collected from uploading mail files to MariaDB

Loaded directory          Size (KB)  Files   Avg. file size (KB)  Load time (1st)  Load time (2nd)
lucene-dev                1214084    53547   22.67                4m56.730s        4m57.884s
tomcat-users              1023156    61303   16.69                5m40.320s        5m38.747s
cxf-commits               612216     22173   27.61                2m4.504s         2m2.992s
usergrid-commits          325740     9838    33.11                2m30.519s        0m55.091s
accumulo-notifications    163596     14482   11.30                1m14.929s        1m16.046s
zookeeper-user            82116      8187    10.03                0m39.250s        0m40.822s
synapse-user              41396      3690    11.22                0m18.205s        0m18.580s
incubator-ace-commits     20836      1146    18.18                0m5.794s         0m5.733s
incubator-batchee-dev     10404      1086    9.58                 0m5.310s         0m5.276s
incubator-accumulo-user   5328       577     9.23                 0m2.869s         0m2.657s
subversion-announce       2664       255     10.45                0m1.219s         0m1.228s
www-small-events-discuss  1828       218     8.39                 0m1.045s         0m1.027s
openoffice-general-ja     912        101     9.03                 0m0.535s         0m0.496s
All previous directories  3504280    176603  19.84                46m55.311s       17m31.941s
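
One way to compare the three methods on equal terms is the average cost per file. The following is a small worked calculation, assuming only the first-attempt figures for the largest single-list directory and the smallest directory; the times are the table values converted to seconds.

```python
def ms_per_file(seconds, file_count):
    """Average wall-clock cost per file, in milliseconds."""
    return seconds * 1000.0 / file_count


# First-attempt load times in seconds, converted from the two results tables.
lucene_dev = {"files": 53547, "hdfs_direct": 5356.790,
              "hdfs_har": 174.563, "mariadb": 296.730}
openoffice_ja = {"files": 101, "hdfs_direct": 26.837,
                 "hdfs_har": 21.764, "mariadb": 0.535}

for label, row in (("lucene-dev", lucene_dev),
                   ("openoffice-general-ja", openoffice_ja)):
    for method in ("hdfs_direct", "hdfs_har", "mariadb"):
        cost = ms_per_file(row[method], row["files"])
        print(f"{label} {method}: {cost:.1f} ms/file")
```

For lucene-dev this gives roughly 100 ms per file for the direct HDFS upload, about 3 ms with HAR and about 6 ms for MariaDB. For the 101-file directory, HAR costs more than 200 ms per file, which suggests that a HAR load carries a fixed start-up cost of around 20 seconds regardless of how few files it contains.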

Figure 1: Uploaded data size in KB vs upload time in sec.


Figure 2: Number of uploaded files vs upload time in sec.


Conclusion

Traditional data warehouses can be tuned to store small semi-structured files, and this is a valid and applicable option for small uploads. As the number of files increases, it may no longer be the best option, especially when uploading a massive number of files (millions and above).

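Uploading small files into Hadoop is a resource-consuming process, and uploading a massive number of small files can degrade the performance of the Hadoop cluster dramatically. A normal upload to HDFS handles every file as a separate operation, with its own NameNode metadata requests and its own write pipeline to the DataNodes, so the fixed per-file overhead dominates when the files are small.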

Using the Hadoop Archive (HAR) tool is critical when loading a massive number of small files at once. The HAR concept is to pack many small files into a few large part files, together with index files that record where each original file is located, which reduces upload time significantly. It's important to note that query time against a HAR will not be equivalent to query time against files uploaded directly without HAR, because reading a file from a HAR requires an additional index lookup to locate it inside the archive.
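
The packing idea, and the extra lookup it adds at read time, can be illustrated with a toy model. This is only a sketch of the concept in plain Python: the functions `pack` and `read_member` and the sample file names are made up for illustration, and the real HAR layout (part files plus `_index` and `_masterindex` files stored in HDFS) is more involved.

```python
import io


def pack(files):
    """Concatenate file contents into one blob and build a name -> (offset, length) index."""
    blob = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (blob.tell(), len(data))
        blob.write(data)
    return blob.getvalue(), index


def read_member(blob, index, name):
    """Read one file back: an index lookup first, then a slice of the blob."""
    offset, length = index[name]
    return blob[offset:offset + length]


# Two hypothetical mail files standing in for thousands of small ones.
mails = {"0001.txt": b"Subject: hello\n", "0002.txt": b"Subject: re: hello\n"}
blob, index = pack(mails)
print(index)
print(read_member(blob, index, "0002.txt"))
```

Writing one large blob replaces many small writes, which is where the upload saving comes from, while every read now goes through the index first, which is the extra step that makes access to archived files slower than access to files stored directly.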

Future Work

I'll try doing the same exercise on a bigger cluster with higher hardware specs to check whether the same conclusions hold.

References:

https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.1.0/bk_ambari-installation/content/ch_Getting_R...

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_hdfs-administration/content/ch_hadoop_ar...

https://mariadb.org/
