11-06-2018
06:09 PM
A couple of things. 1. I have no idea what Kafka is. 2. 30 minutes for 7 million records sounds great, as my flow takes 40 minutes for a meager 70k records. Regarding the multi-SplitText usage above, my question is about the Settings tab for the SplitText processor: how should it be set? My flow still does not execute PutFile until everything has gone through. I currently have 2 SplitTexts and am about to add a 3rd to see if that helps, but the processing of the data is just slow. The overall flow I have is: Source -> SplitText (5000) -> SplitText (250) -> Processing -> PutFile. Any tips greatly appreciated.
09-08-2017
05:29 AM
Introduction
The objective of this article is to describe how to use the "DataImport" tool inside Apache Solr to index an Oracle DB table.

Assumptions & Design
In this exercise, I've used two virtual machines:
1- Oracle Database App Development VM – 2GB RAM
2- Linux CentOS VM – 3GB RAM (Solr node)
Please note that the Hortonworks HDP Sandbox comes with an out-of-the-box Solr service that can easily be provisioned or enabled through the Ambari UI and used for this exercise instead of installing Solr on a standalone node.

Oracle side
- Create a dummy table with the following structure: [1]
- Insert some sample data into the created table.
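For reference, here is one way to create such a table and a few sample rows from the shell using sqlplus. The table and column names match the data-config.xml mapping shown later in this article; the column types, connection details, and sample values are only illustrative assumptions.

sqlplus myDBuser/myDBpass@//[DB-IP-Address]:[DB-Port]/[DBInstanceName] <<'EOF'
-- Hypothetical definition; only the table/column names are taken from this article
CREATE TABLE solr_test (
  emp_id     NUMBER PRIMARY KEY,
  first_name VARCHAR2(50),
  last_name  VARCHAR2(50),
  dob        DATE
);
-- A couple of sample rows so the later full-import has something to index
INSERT INTO solr_test VALUES (1, 'John', 'Smith', DATE '1980-01-15');
INSERT INTO solr_test VALUES (2, 'Jane', 'Doe',   DATE '1985-06-30');
COMMIT;
EXIT;
EOF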
On Solr Node
# yum install java-1.8.0-openjdk.x86_64
# java -version
# wget http://apache.org/dist/lucene/solr/6.6.0/solr-6.6.0.tgz
# tar xzf solr-6.6.0.tgz solr-6.6.0/bin/install_solr_service.sh --strip-components=2
# sudo bash ./install_solr_service.sh solr-6.6.0.tgz
# sudo service solr restart
You should see something like the following:
Started Solr server on port 8983 (pid=[….]). Happy searching!
From the Oracle machine, copy the Oracle JDBC driver to the Solr server:
# scp ojdbc6.jar root@[Solr-IP-address]:/opt/solr/dist/
Create a new collection by invoking the "solr create -c" command from the path "/opt/solr/bin", as follows: [2]
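As a rough sketch, the command looks like the following; "Oracle_table" is the core name used in the rest of this article, and running it as the solr service user is an assumption about the setup.

cd /opt/solr/bin
# Create a core named "Oracle_table" (the name used throughout this article)
sudo -u solr ./solr create -c Oracle_table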
From the Solr portal (URL: http://[Solr-IP-Address]:8983/solr/#/), make sure that the new collection appears. [3]
From the left panel of the Solr home page, after selecting the "Oracle_table" core, select "Schema" and add the schema for the new table created in the Oracle DB: on the right side, press the "Add Field" button, and make sure not to delete any of the main "Fields". [4][5] [6][7]
After creating the schema fields, they should appear in the "Fields" list. [8]
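If you prefer the command line, the same fields can be added through Solr's Schema API; the field names come from the mapping used below, while the field types here are placeholder assumptions (pick a proper type, e.g. a date type, for dob).

for f in first_name last_name dob; do
  # Add one stored string field per Oracle column (adjust "type" to match your data)
  curl -X POST -H 'Content-type:application/json' \
    "http://[Solr-IP-Address]:8983/solr/Oracle_table/schema" \
    -d "{\"add-field\":{\"name\":\"$f\",\"type\":\"string\",\"stored\":true}}"
done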
Create the "data-config.xml" file under "/var/solr/data/Oracle_table/conf/", and make sure that the column/field mapping between the Oracle DB table and Solr's schema fields is properly configured:
<dataConfig>
  <dataSource name="jdbc" driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@//[DB-IP-Address]:[DB-Port]/[DBInstanceName]"
              user="myDBuser" password="myDBpass"/>
  <document>
    <entity name="solr_test" query="select * from solr_test">
      <field column="EMP_ID" name="id" />
      <field column="FIRST_NAME" name="first_name" />
      <field column="LAST_NAME" name="last_name" />
      <field column="DOB" name="dob" />
    </entity>
  </document>
</dataConfig>
Add the following DataImport handler in the "solrconfig.xml" file:
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
Add the following <lib/> element in solrconfig.xml:
<lib dir="/opt/solr-6.6.0/dist/" regex=".*\.jar" />
From the Solr web UI, make sure that the "DataImport" page under the created collection "Oracle_table" looks like the following, without errors or warnings: [9]
Press the "Execute" button and wait for a while, or press the "Refresh Status" button until a green notification panel appears, such as the following: [10]
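The same import can also be triggered and monitored from the shell through the DataImport handler configured above; the host placeholder and the clean=true option are assumptions about how you want to run it.

# Start a full import (equivalent to pressing "Execute"); clean=true wipes the index first
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/dataimport?command=full-import&clean=true"
# Check progress until the status response reports that indexing completed
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/dataimport?command=status"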
Results
Solr side: from the left panel in Solr, select "Query", and make sure that you get results (on the right side) after pressing the "Execute Query" button, as follows: [11]
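For a quick check without the UI, a match-all query against the core returns the indexed rows; the row count and response format parameters here are just example choices.

# Return up to 10 indexed documents as JSON (equivalent to "Execute Query" with q=*:*)
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/select?q=*:*&rows=10&wt=json"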
Future Work
Future work will be extending the Solr standalone node into a small cluster to maintain core replication and high availability.
References
http://www.oracle.com/technetwork/database/enterprise-edition/databaseappdev-vm-161299.html
https://cwiki.apache.org/confluence/display/solr/Running+Solr
08-29-2017
08:07 AM
Introduction
The purpose of this article is to compare the upload time between three different methods for uploading the same semi-structured datasets into Hadoop (two methods) and MariaDB (one method).

Assumptions & Design
A small environment is used to deploy a three-node Hadoop cluster (one master node, two worker nodes). The exercise is run from my laptop, which has the following specs:
Processor Name: Intel Core i7
Processor Speed: 2.5 GHz
Number of Processors: 1
Total Number of Cores: 4
Memory: 16 GB
The Hadoop cluster is virtualized on top of my Mac machine by "Oracle VM VirtualBox Manager". The virtual Hadoop nodes have the following specs:
Table 1: Hadoop cluster – nodes' specifications
| Specification | Namenode (Master node) | Datanode #1 (Worker node #1) | Datanode #2 (Worker node #2) |
|---|---|---|---|
| Hostname | hdpnn.lab1.com | hdpdn1.lab1.com | hdpdn2.lab1.com |
| Memory | 4646 MB | 3072 MB | 3072 MB |
| CPU Number | 3 | 2 | 2 |
| Hard disk size | 20 GB | 20 GB | 20 GB |
| OS | CentOS-7-x86_64-Minimal | CentOS-7-x86_64-Minimal | CentOS-7-x86_64-Minimal |
| IP Address | 192.168.43.15 | 192.168.43.16 | 192.168.43.17 |
A standalone virtual machine was used for installing the MariaDB database, with the following specs:
Hostname: mariadb.lab1.com
IP Address: 192.168.43.55
Memory: 12 GB
Disk: 40 GB
OS: Linux CentOS 7
The semi-structured dataset used is the mail archive of the Apache Software Foundation (ASF), around 200 GB in total size. The mail archive contains the communications of more than 80 open-source projects (such as Hadoop, Hive, Sqoop, Zookeeper, HBase, Storm, Kafka and many more). The mail archive can be downloaded simply using the "wget" command, or any other tool, from this URL: http://mail-archives.apache.org/mod_mbox/
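As a rough illustration, one of the per-list directories used below can be mirrored like this; the list name and the wget options (in particular the *.mbox accept pattern) are assumptions, not the exact commands used for this exercise.

# Recursively fetch the zookeeper-user archive, keeping only the monthly mbox files
wget -r -np -nH -A "*.mbox" http://mail-archives.apache.org/mod_mbox/zookeeper-user/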
Results
Results collected from uploading mail files to the Hadoop cluster
The following results were collected from 14 distinct uploads of 13 different sub-directories that vary in size, number of contained files, and size of the contained files. The last upload tested uploading all previous 13 sub-directories at once. Two upload methods were used, and they differ significantly in upload time: the first is the normal upload of all files directly from the local file system of the Hadoop cluster; the second is Hadoop Archive (HAR), a Hadoop capability used to combine files into an archive before writing it back to HDFS. A sketch of both methods is shown below.
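As a minimal sketch of the two methods, assuming one of the directories below and hypothetical HDFS paths (the HAR step also assumes the directory has already been staged in HDFS; the article's exact procedure is not shown):

# Method 1: plain upload of a mailing-list directory, timed
time hdfs dfs -put lucene-dev /data/mails/

# Method 2: pack the staged directory into a single Hadoop Archive, timed
time hadoop archive -archiveName lucene-dev.har -p /staging/mails lucene-dev /data/mails-har
# Files inside the archive remain accessible through the har:// scheme
hdfs dfs -ls har:///data/mails-har/lucene-dev.har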
Table 2: Results collected from uploading mail files to the Hadoop cluster
| Loaded directory/directories | Directory size (KB) | Number of uploaded files | Avg. size of uploaded files (KB) | Load time (1st attempt) | Load time (2nd attempt) | Load using Hadoop Archive (1st attempt) | Load using Hadoop Archive (2nd attempt) |
|---|---|---|---|---|---|---|---|
| lucene-dev | 1214084 | 53547 | 22.67 | 89m16.790s | 70m43.092s | 2m54.563s | 2m28.416s |
| tomcat-users | 1023156 | 61303 | 16.69 | 101m45.870s | 86m48.927s | 3m17.214s | 2m59.006s |
| cxf-commits | 612216 | 22173 | 27.61 | 36m30.333s | 29m37.924s | 1m28.457s | 1m15.189s |
| usergrid-commits | 325740 | 9838 | 33.11 | 14m50.757s | 14m8.545s | 0m54.038s | 0m44.905s |
| accumulo-notifications | 163596 | 14482 | 11.30 | 24m38.159s | 24m49.356s | 1m3.650s | 0m27.550s |
| zookeeper-user | 82116 | 8187 | 10.03 | 14m40.461s | 14m34.136s | 0m47.865s | 0m40.913s |
| synapse-user | 41396 | 3690 | 11.22 | 5m24.744s | 4m47.196s | 0m38.167s | 0m29.043s |
| incubator-ace-commits | 20836 | 1146 | 18.18 | 2m28.330s | 2m4.168s | 0m25.042s | 0m23.401s |
| incubator-batchee-dev | 10404 | 1086 | 9.58 | 2m18.903s | 2m19.044s | 0m27.165s | 0m23.166s |
| incubator-accumulo-user | 5328 | 577 | 9.23 | 1m10.201s | 1m2.328s | 0m26.572s | 0m23.300s |
| subversion-announce | 2664 | 255 | 10.45 | 0m50.596s | 0m32.339s | 0m29.247s | 0m21.578s |
| www-small-events-discuss | 1828 | 218 | 8.39 | 0m45.215s | 0m20.160s | 0m22.847s | 0m21.035s |
| openoffice-general-ja | 912 | 101 | 9.03 | 0m26.837s | 0m7.898s | 0m21.764s | 0m19.905s |
| All previous directories | 3504280 | 176603 | 19.84 | 224m49.673s | Not tested | 8m13.950s | 8m46.144s |
Results collected from uploading mail files to MariaDB
The following results were collected from 14 distinct uploads of 13 different sub-directories that vary in size, number of contained files, and size of the contained files. The last upload tested uploading all previous 13 sub-directories at once.
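The article does not show the exact MariaDB load procedure, so the following is only a hypothetical sketch: it assumes a pre-created mails table and uses MariaDB's LOAD_FILE() to insert each mail file (which requires the FILE privilege and a permissive secure_file_priv setting, and must run on the database host).

# Hypothetical target table: mails(list_name VARCHAR(100), file_name VARCHAR(255), body LONGBLOB)
for f in /data/mails/zookeeper-user/*; do
  mysql -u myuser -pmypass maildb -e \
    "INSERT INTO mails (list_name, file_name, body)
     VALUES ('zookeeper-user', '$(basename "$f")', LOAD_FILE('$f'));"
done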
Table 3: Results collected from uploading mail files to MariaDB
| Loaded directory | Total size (KB) | Number of loaded files | Avg. size of loaded files (KB) | Load time (1st attempt) | Load time (2nd attempt) |
|---|---|---|---|---|---|
| lucene-dev | 1214084 | 53547 | 22.67 | 4m56.730s | 4m57.884s |
| tomcat-users | 1023156 | 61303 | 16.69 | 5m40.320s | 5m38.747s |
| cxf-commits | 612216 | 22173 | 27.61 | 2m4.504s | 2m2.992s |
| usergrid-commits | 325740 | 9838 | 33.11 | 2m30.519s | 0m55.091s |
| accumulo-notifications | 163596 | 14482 | 11.30 | 1m14.929s | 1m16.046s |
| zookeeper-user | 82116 | 8187 | 10.03 | 0m39.250s | 0m40.822s |
| synapse-user | 41396 | 3690 | 11.22 | 0m18.205s | 0m18.580s |
| incubator-ace-commits | 20836 | 1146 | 18.18 | 0m5.794s | 0m5.733s |
| incubator-batchee-dev | 10404 | 1086 | 9.58 | 0m5.310s | 0m5.276s |
| incubator-accumulo-user | 5328 | 577 | 9.23 | 0m2.869s | 0m2.657s |
| subversion-announce | 2664 | 255 | 10.45 | 0m1.219s | 0m1.228s |
| www-small-events-discuss | 1828 | 218 | 8.39 | 0m1.045s | 0m1.027s |
| openoffice-general-ja | 912 | 101 | 9.03 | 0m0.535s | 0m0.496s |
| All previous directories | 3504280 | 176603 | 19.84 | 46m55.311s | 17m31.941s |
Figure 1: Uploaded data size (KB) vs. upload time (sec).
Figure 2: Number of uploaded files vs. upload time (sec).

Conclusion
Traditional data warehouses can be tuned to store small-sized semi-structured data. This is valid and applicable for small uploads; as the number of files increases, they may no longer be the best option, especially when uploading a massive number of files (millions and above).
Uploading small files into Hadoop is a resource-consuming process, and uploading a massive number of small files can affect the performance of the Hadoop cluster dramatically, since a normal upload to HDFS handles every single file as a separate operation.
Using the Hadoop Archive (HAR) tool is critical when loading a massive number of small files at once. The HAR concept is to append files together, using a special delimiter, before they are written to HDFS, which reduces upload time significantly. It is important to note that querying data stored in a HAR will not be as fast as querying data uploaded directly without HAR, because reading from a HAR requires an additional internal de-indexing step.

Future Work
I'll try the same exercise using a bigger cluster with higher hardware specs to validate the same conclusion.

References
https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.1.0/bk_ambari-installation/content/ch_Getting_Ready.html
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_hdfs-administration/content/ch_hadoop_archives.html
https://mariadb.org/
10-30-2017
06:23 PM
1 Kudo
In later versions of NiFi, you may also consider using the "record-aware" processors and their associated Record Readers/Writers. These were developed to avoid this multiple-split problem, as well as the volume of provenance data generated by each split flow file in the flow.
03-04-2019
01:00 PM
This is a fine answer that lists other aspects to consider when choosing Tez over MR for Hive: http://community.hortonworks.com/answers/83488/view.html