Member since: 11-04-2016
Posts: 17
Kudos Received: 4
Solutions: 0
10-30-2017
11:29 AM
I did something similar for pushing data to Kafka from a few-million-row CSV file, using the same concept of multiple splits: https://community.hortonworks.com/content/kbentry/144771/ingesting-a-big-csv-file-into-kafka-using-a-multi.html
10-30-2017
08:05 AM
3 Kudos
Introduction
The purpose of this article is to use Apache NiFi to ingest a huge csv file (a few million rows) into a Kafka topic. The tricky point when dealing with a big file (millions of rows) is that processing it in NiFi at row level generates a flow file for every row, which can produce a Java out-of-memory error. As a workaround, we limit the number of flow files generated at once by splitting the rows in multiple stages. In this article, the file was around 7 million rows, and six splitting stages (100K, 10K, 1K, 100, 10, then 1) were used to limit the number of generated flow files. The other trick is to increase the “splits” queues' “Back Pressure Data Size Threshold” to a size adequate for the file; by default it is 1 GB. In this article, 2 GB is used.
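As an aside, the staged-split idea is easy to illustrate outside NiFi. The following shell sketch (illustration only, not part of the NiFi flow; file and prefix names are made up) shows why splitting in stages bounds how many pieces exist at any one step:

# Illustration only - not part of the NiFi flow. Splitting 7M rows straight into
# 1-row pieces would create ~7 million outputs at once; splitting in stages
# (100K, then 10K, ...) keeps each stage's fan-out small.
split -l 100000 -d huge.csv stage1_            # ~70 chunks of 100K rows each
for f in stage1_*; do
  split -l 10000 -d "$f" "${f}_stage2_"        # 10 chunks of 10K rows per 100K chunk
done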
Assumptions & Design A standalone NiFi and Kafka instance
is to be used for this exercise. The
following NiFi flow will be used to split the workload of the multi-million row
csv file to be ingested by dividing the ingestion into multi-stages. Figure 1: the NiFi flow Figure 2: Properties for “SplitText-100000” Figure 3: Properties for “SplitText-10000” Figure 4: Properties for “SplitText-1000” Figure 5: Properties for “SplitText-100” Figure 6: Properties for “SplitText-10” Figure 7: Properties for “SplitText-1” Figure 8: Properties for the six “splits” queues Results The
The csv rows were ingested properly into the Kafka topic. The only drawback of this flow is that it took almost 30 minutes to ingest a csv of around 7 million rows. This could be improved by using multiple NiFi instances and a clustered Kafka, which should be tested in the future.
Future Work
I'll try the same exercise with a bigger file, using bigger NiFi and Kafka clusters with higher hardware specs, to validate the same conclusion.
References:
https://community.hortonworks.com/questions/122858/nifi-splittext-big-file.html
https://kafka.apache.org/documentation/
Labels:
09-08-2017
05:29 AM
Introduction
The objective of this article is to describe how to use the “DataImport” tool inside Apache Solr to index an Oracle DB table.
Assumptions & Design
In this exercise, I've used two virtual machines:
1- Oracle Database App Development VM – 2 GB RAM
2- Linux CentOS VM – 3 GB RAM (Solr node)
Please note that the Hortonworks HDP Sandbox comes with an out-of-the-box Solr service that can easily be provisioned or enabled through the Ambari UI and used for this exercise, instead of installing the Solr service on a standalone node.
Oracle side
- Create a dummy table with the following structure: [1] (a hypothetical sketch is shown below)
- Insert some sample data into the created table.
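The exact table structure is in screenshot [1]; as a hypothetical stand-in, a table matching the column names used in the data-config.xml further below could be created from the Oracle side like this (user, password and sample values are assumptions):

# Hypothetical sketch only - the real structure is in screenshot [1]; the columns
# simply match the data-config.xml mapping used later in this article.
sqlplus myDBuser/myDBpass <<'SQL'
CREATE TABLE solr_test (
  emp_id     NUMBER PRIMARY KEY,
  first_name VARCHAR2(50),
  last_name  VARCHAR2(50),
  dob        DATE
);
INSERT INTO solr_test VALUES (1, 'John', 'Smith', DATE '1985-03-14');
INSERT INTO solr_test VALUES (2, 'Jane', 'Doe', DATE '1990-07-22');
COMMIT;
SQL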
On the Solr node
#yum install java-1.8.0-openjdk.x86_64
#java -version
#wget http://apache.org/dist/lucene/solr/6.6.0/solr-6.6.0.tgz
#tar xzf solr-6.6.0.tgz solr-6.6.0/bin/install_solr_service.sh --strip-components=2
#sudo bash ./install_solr_service.sh solr-6.6.0.tgz
#sudo service solr restart
You should see something like the following:
Started Solr server on port 8983 (pid=[….]). Happy searching!
From the Oracle machine, copy the Oracle JDBC driver (ojdbc) to the Solr server:
#scp ojdbc6.jar root@[Solr-IP-address]:/opt/solr/dist/
Create a new collection by invoking the “solr create –c” command from the path “/opt/solr/bin”, as follows: [2]
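The exact invocation is shown in screenshot [2]; a likely form of the command (core name as used throughout this article) is:

# Likely form of the command in screenshot [2], run as the solr service user:
su - solr -c "/opt/solr/bin/solr create -c Oracle_table"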
From the Solr portal (URL: http://[Solr-IP-Address]:8983/solr/#/), make sure that the new collection appears. [3]
From the left panel of the Solr home page, after selecting the “Oracle_table” core, select “Schema” and add the schema fields for the new table created in the Oracle DB. On the right side, press the “Add Field” button, and make sure not to delete any of the main “Fields”. [4][5] [6][7]
After creating the schema fields, they should appear in the “Fields” list. [8]
Create the “data-config.xml” file under “/var/solr/data/Oracle_table/conf/”. Make sure the column/field mapping between the Oracle DB table and Solr's
schema fields is configured properly.
<dataConfig>
  <dataSource name="jdbc" driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@//[DB-IP-Address]:[DB-Port]/[DBInstanceName]"
              user="myDBuser" password="myDBpass"/>
  <document>
    <entity name="solr_test" query="select * from solr_test">
      <field column="EMP_ID" name="id" />
      <field column="FIRST_NAME" name="first_name" />
      <field column="LAST_NAME" name="last_name" />
      <field column="DOB" name="dob" />
    </entity>
  </document>
</dataConfig>
Add the following DataImport handler to the “solrconfig.xml” file:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
Add the following <lib/> element in solrconfig.xml:
<lib dir="/opt/solr-6.6.0/dist/" regex=".*\.jar" />
From the Solr web UI, make sure that the “DataImport” page under the created collection “Oracle_table” looks like the following, without errors or warnings: [9]
Press the “Execute” button, and wait for a while or press the “Refresh Status” button until a green notification panel appears, like the following: [10]
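As a side note, the same import can also be triggered and monitored from the command line through the standard DataImport handler endpoint (host placeholder as used above); a minimal sketch:

# Optional command-line alternative to the "Execute" button:
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/dataimport?command=full-import&clean=true"
# Equivalent of the "Refresh Status" button:
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/dataimport?command=status"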
Results
Solr side
From the left panel in Solr, select “Query”, and make sure that you get results (on the right side) after pressing the “Execute Query” button, as follows: [11]
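The same check can also be done from the command line; a minimal sketch (host placeholder as above):

# Command-line equivalent of the "Execute Query" check:
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/select?q=*:*&rows=5&wt=json"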
Future Work
The future work will be to extend the standalone Solr node into a small cluster to maintain the cores' replication and high availability.
References
http://www.oracle.com/technetwork/database/enterprise-edition/databaseappdev-vm-161299.html
https://cwiki.apache.org/confluence/display/solr/Running+Solr
Labels:
08-29-2017
08:07 AM
Introduction
The purpose of this article is to compare the upload time between three different methods for uploading the same semi-structured datasets into Hadoop (two methods) and MariaDB (one method).
Assumptions & Design
A small environment is used to deploy a three-node Hadoop cluster (one master node, two worker nodes). The exercise is run from my laptop, which has the following specs:
Processor Name: Intel Core i7
Processor Speed: 2.5 GHz
Number of Processors: 1
Total Number of Cores: 4
Memory: 16 GB
The Hadoop cluster is virtualized on top of my Mac machine by “Oracle VM VirtualBox Manager”. The virtual Hadoop nodes have the following specs:
Table 1: Hadoop cluster – node specifications
Specification   | Namenode (master node)  | Datanode #1 (worker node #1) | Datanode #2 (worker node #2)
Hostname        | hdpnn.lab1.com          | hdpdn1.lab1.com              | hdpdn2.lab1.com
Memory          | 4646 MB                 | 3072 MB                      | 3072 MB
CPU number      | 3                       | 2                            | 2
Hard disk size  | 20 GB                   | 20 GB                        | 20 GB
OS              | CentOS-7-x86_64-Minimal | CentOS-7-x86_64-Minimal      | CentOS-7-x86_64-Minimal
IP address      | 192.168.43.15           | 192.168.43.16                | 192.168.43.17
A standalone virtual machine was used for installing the MariaDB database, with the following specs:
Hostname: mariadb.lab1.com
IP Address: 192.168.43.55
Memory: 12 GB
Disk: 40 GB
O.S: Linux CentOS 7
The semi-structured dataset used is the mail archive of the Apache Software Foundation (ASF), around 200 GB in total size. The mail archive contains the communications of more than 80 open-source projects (such as Hadoop, Hive, Sqoop, ZooKeeper, HBase, Storm, Kafka and many more). The mail archive can be downloaded simply using the "wget" command, or any other tool, from this URL: http://mail-archives.apache.org/mod_mbox/
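For example, a single list can be pulled month by month with wget; the sketch below is only an assumption about the mod_mbox URL pattern and the list/date range, not the exact script used here:

# Hypothetical example only - list name, date range and URL pattern are assumptions.
for month in 2017{01..08}; do
  wget -q "http://mail-archives.apache.org/mod_mbox/hadoop-user/${month}.mbox"
done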
Results
Results collected from uploading mail files to the Hadoop cluster
The following results table was collected after 14 distinct uploads: 13 uploads of different sub-directories that vary in size, number of contained files and sizes of contained files, plus a last upload of all previous 13 sub-directories at once. Two upload methods that differ significantly in upload time were used: the first method is a normal upload of all files directly from the local file system to the Hadoop cluster; the second method uses Hadoop Archive (HAR), a Hadoop capability used to combine files together into an archive before writing them back to HDFS.
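Roughly, the two methods correspond to the following commands (the actual paths, user and archive names used in the test are not shown in this article, so the ones below are assumptions):

# Method 1: normal upload of a mail sub-directory from the local file system
time hdfs dfs -put /data/mail-archives/lucene-dev /user/admin/mail/

# Method 2: Hadoop Archive (HAR) - pack the staged files into a single archive
time hadoop archive -archiveName lucene-dev.har -p /user/admin/staging lucene-dev /user/admin/mail/
# The archived files remain accessible through the har:// scheme
hdfs dfs -ls har:///user/admin/mail/lucene-dev.har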
Table 2: Results collected from uploading mail files to the Hadoop cluster
Loaded directory/directories | Size (KB) | Files uploaded | Avg. file size (KB) | Load time (1st attempt) | Load time (2nd attempt) | HAR load (1st attempt) | HAR load (2nd attempt)
lucene-dev | 1214084 | 53547 | 22.67 | 89m16.790s | 70m43.092s | 2m54.563s | 2m28.416s
tomcat-users | 1023156 | 61303 | 16.69 | 101m45.870s | 86m48.927s | 3m17.214s | 2m59.006s
cxf-commits | 612216 | 22173 | 27.61 | 36m30.333s | 29m37.924s | 1m28.457s | 1m15.189s
usergrid-commits | 325740 | 9838 | 33.11 | 14m50.757s | 14m8.545s | 0m54.038s | 0m44.905s
accumulo-notifications | 163596 | 14482 | 11.30 | 24m38.159s | 24m49.356s | 1m3.650s | 0m27.550s
zookeeper-user | 82116 | 8187 | 10.03 | 14m40.461s | 14m34.136s | 0m47.865s | 0m40.913s
synapse-user | 41396 | 3690 | 11.22 | 5m24.744s | 4m47.196s | 0m38.167s | 0m29.043s
incubator-ace-commits | 20836 | 1146 | 18.18 | 2m28.330s | 2m4.168s | 0m25.042s | 0m23.401s
incubator-batchee-dev | 10404 | 1086 | 9.58 | 2m18.903s | 2m19.044s | 0m27.165s | 0m23.166s
incubator-accumulo-user | 5328 | 577 | 9.23 | 1m10.201s | 1m2.328s | 0m26.572s | 0m23.300s
subversion-announce | 2664 | 255 | 10.45 | 0m50.596s | 0m32.339s | 0m29.247s | 0m21.578s
www-small-events-discuss | 1828 | 218 | 8.39 | 0m45.215s | 0m20.160s | 0m22.847s | 0m21.035s
openoffice-general-ja | 912 | 101 | 9.03 | 0m26.837s | 0m7.898s | 0m21.764s | 0m19.905s
All previous directories | 3504280 | 176603 | 19.84 | 224m49.673s | Not tested | 8m13.950s | 8m46.144s
Results collected from uploading mail files to MariaDB
The following results were collected after 14 distinct uploads: 13 uploads of different sub-directories that vary in size, number of contained files and sizes of contained files, plus a last upload of all previous 13 sub-directories at once.
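The MariaDB load script and table definition are not shown in this article; as a purely hypothetical sketch, a per-file load could look like the following (database, table, credentials and paths are all assumptions):

# Hypothetical sketch only - the actual load script is not part of this article.
mysql -u mailuser -p'secret' maildb -e "CREATE TABLE IF NOT EXISTS mail_files (
  id INT AUTO_INCREMENT PRIMARY KEY,
  file_name VARCHAR(512),
  content LONGTEXT)"
for f in /data/mail-archives/lucene-dev/*; do
  # LOAD_FILE() reads the file on the database server, so files must be readable there
  mysql -u mailuser -p'secret' maildb -e \
    "INSERT INTO mail_files (file_name, content) VALUES ('$f', LOAD_FILE('$f'))"
done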
Table 3: Results collected from uploading mail files to MariaDB
Loaded directory | Total size (KB) | Files loaded | Avg. file size (KB) | Load time (1st attempt) | Load time (2nd attempt)
lucene-dev | 1214084 | 53547 | 22.67 | 4m56.730s | 4m57.884s
tomcat-users | 1023156 | 61303 | 16.69 | 5m40.320s | 5m38.747s
cxf-commits | 612216 | 22173 | 27.61 | 2m4.504s | 2m2.992s
usergrid-commits | 325740 | 9838 | 33.11 | 2m30.519s | 0m55.091s
accumulo-notifications | 163596 | 14482 | 11.30 | 1m14.929s | 1m16.046s
zookeeper-user | 82116 | 8187 | 10.03 | 0m39.250s | 0m40.822s
synapse-user | 41396 | 3690 | 11.22 | 0m18.205s | 0m18.580s
incubator-ace-commits | 20836 | 1146 | 18.18 | 0m5.794s | 0m5.733s
incubator-batchee-dev | 10404 | 1086 | 9.58 | 0m5.310s | 0m5.276s
incubator-accumulo-user | 5328 | 577 | 9.23 | 0m2.869s | 0m2.657s
subversion-announce | 2664 | 255 | 10.45 | 0m1.219s | 0m1.228s
www-small-events-discuss | 1828 | 218 | 8.39 | 0m1.045s | 0m1.027s
openoffice-general-ja | 912 | 101 | 9.03 | 0m0.535s | 0m0.496s
All previous directories | 3504280 | 176603 | 19.84 | 46m55.311s | 17m31.941s
Figure 1: Uploaded data size in KB vs. upload time in seconds
Figure 2: Number of uploaded files vs. upload time in seconds
Conclusion
Traditional data warehouses can be tuned to store small-sized semi-structured data. This is valid and applicable for small-size uploads; as the number of files increases, they may not be the best option, especially when uploading a massive number of files (millions and above).
Uploading small files into Hadoop is a resource-consuming process, and uploading a massive number of small files can affect the performance of the Hadoop cluster dramatically; a normal file upload to HDFS creates a separate MapReduce process for every single file.
Using the Hadoop Archive (HAR) tool is critical when loading a massive number of small files at once. The HAR concept is to append the files together, using a special delimiter, before they are uploaded to HDFS, which reduces the upload time significantly. It's important to note that the query time of a HAR in Hadoop will not be equivalent to that of a direct upload without HAR, because querying a HAR requires an additional step to process the archive's internal index.
Future Work
I'll try the same exercise using a bigger cluster with higher hardware specs to validate the same conclusion.
References:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.1.0/bk_ambari-installation/content/ch_Getting_Ready.html
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_hdfs-administration/content/ch_hadoop_archives.html
https://mariadb.org/
Labels:
08-28-2017
05:22 AM
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.

Reference: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
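As a practical aside, the current Active/Standby roles can be checked with the hdfs haadmin tool; in the sketch below, nn1 and nn2 are assumed NameNode service IDs (the real IDs depend on the dfs.ha.namenodes.* setting in hdfs-site.xml):

# Check which NameNode currently holds the Active role (nn1/nn2 are assumed IDs):
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2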
03-28-2017
03:25 AM
Please check this: https://community.hortonworks.com/answers/22307/view.html
11-10-2016
08:20 AM
Hello, I'm wondering: when should we use Tez, and when should we use MR, as the execution engine for our queries running in Hive?
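For context, the engine is chosen by the hive.execution.engine property and can be switched per session without changing the query itself; a minimal sketch (connection URL and table name are placeholders):

# Run the same query on Tez, then on MR, to compare (URL and table are placeholders):
beeline -u jdbc:hive2://localhost:10000 -e "SET hive.execution.engine=tez; SELECT COUNT(*) FROM my_table;"
beeline -u jdbc:hive2://localhost:10000 -e "SET hive.execution.engine=mr;  SELECT COUNT(*) FROM my_table;"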
Labels:
- Apache Hadoop
- Apache Hive
- Apache Tez