11-06-2018
06:09 PM
A couple of things. 1. I have no idea what Kafka is. 2. 30 minutes for 7 million records sounds great, as my flow takes 40 minutes for a meager 70k records. Regarding the multi-SplitText usage above, my question is about the Settings tab for the SplitText processor: how should it be set? My flow still does not execute PutFile until everything has gone through. I currently have 2 SplitTexts and am about to add a 3rd to see if that helps, but the processing of the data is just slow. The overall flow I have is: Source -> SplitText (5000) -> SplitText (250) -> Processing -> PutFile. Any tips greatly appreciated.
09-08-2017
05:29 AM
Introduction
The objective of this article is to describe how to use the "DataImport" tool inside Apache Solr to index an Oracle DB table.

Assumptions & Design
In this exercise, I've used two virtual machines:
1- Oracle Database App Development VM – 2GB RAM
2- Linux CentOS VM – 3GB RAM (Solr node)
Please note that the Hortonworks HDP Sandbox comes with an out-of-the-box Solr service that can easily be provisioned or enabled through the Ambari UI and used for this exercise instead of installing Solr on a standalone node.

Oracle side
- Create a dummy table with the following structure: [1]
- Insert some sample data into the created table.
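For reference, here is one way to create such a table and a few sample rows from the shell using sqlplus. The table and column names match the data-config.xml mapping shown later in this article; the column types, connection details, and sample values are only illustrative assumptions.

sqlplus myDBuser/myDBpass@//[DB-IP-Address]:[DB-Port]/[DBInstanceName] <<'EOF'
-- Hypothetical definition; only the table/column names are taken from this article
CREATE TABLE solr_test (
  emp_id     NUMBER PRIMARY KEY,
  first_name VARCHAR2(50),
  last_name  VARCHAR2(50),
  dob        DATE
);
-- A couple of sample rows so the later full-import has something to index
INSERT INTO solr_test VALUES (1, 'John', 'Smith', DATE '1980-01-15');
INSERT INTO solr_test VALUES (2, 'Jane', 'Doe',   DATE '1985-06-30');
COMMIT;
EXIT;
EOF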
On Solr Node
# yum install java-1.8.0-openjdk.x86_64
# java -version
# wget http://apache.org/dist/lucene/solr/6.6.0/solr-6.6.0.tgz
# tar xzf solr-6.6.0.tgz solr-6.6.0/bin/install_solr_service.sh --strip-components=2
# sudo bash ./install_solr_service.sh solr-6.6.0.tgz
# sudo service solr restart
You should see something like the following:
Started Solr server on port 8983 (pid=[….]). Happy searching!
From the Oracle machine, copy the Oracle JDBC driver to the Solr server:
# scp ojdbc6.jar root@[Solr-IP-address]:/opt/solr/dist/
Create a new collection by invoking the "solr create -c" command from the path "/opt/solr/bin", as follows: [2]
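As a rough sketch, the command looks like the following; "Oracle_table" is the core name used in the rest of this article, and running it as the solr service user is an assumption about the setup.

cd /opt/solr/bin
# Create a core named "Oracle_table" (the name used throughout this article)
sudo -u solr ./solr create -c Oracle_table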
From the Solr portal (URL: http://[Solr-IP-Address]:8983/solr/#/), make sure that the new collection appears. [3]
From the left panel of the Solr home page, after selecting the "Oracle_table" core, select "Schema" and add the schema for the new table created in the Oracle DB: on the right side, press the "Add Field" button, and make sure not to delete any of the main "Fields". [4][5] [6][7]
After creating the schema fields, they should appear in the "Fields" list. [8]
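If you prefer the command line, the same fields can be added through Solr's Schema API; the field names come from the mapping used below, while the field types here are placeholder assumptions (pick a proper type, e.g. a date type, for dob).

for f in first_name last_name dob; do
  # Add one stored string field per Oracle column (adjust "type" to match your data)
  curl -X POST -H 'Content-type:application/json' \
    "http://[Solr-IP-Address]:8983/solr/Oracle_table/schema" \
    -d "{\"add-field\":{\"name\":\"$f\",\"type\":\"string\",\"stored\":true}}"
done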
Create the "data-config.xml" file under "/var/solr/data/Oracle_table/conf/", and make sure that the column/field mapping between the Oracle DB table and Solr's schema fields is properly configured:
<dataConfig>
  <dataSource name="jdbc" driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@//[DB-IP-Address]:[DB-Port]/[DBInstanceName]"
              user="myDBuser" password="myDBpass"/>
  <document>
    <entity name="solr_test" query="select * from solr_test">
      <field column="EMP_ID" name="id" />
      <field column="FIRST_NAME" name="first_name" />
      <field column="LAST_NAME" name="last_name" />
      <field column="DOB" name="dob" />
    </entity>
  </document>
</dataConfig>
Add the following DataImport handler in the "solrconfig.xml" file:
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
Add the following <lib/> element in solrconfig.xml:
<lib dir="/opt/solr-6.6.0/dist/" regex=".*\.jar" />
From the Solr web UI, make sure that the "DataImport" page under the created collection "Oracle_table" looks like the following, without errors or warnings: [9]
Press the "Execute" button and wait for a while, or press the "Refresh Status" button until a green notification panel appears, such as the following: [10]
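The same import can also be triggered and monitored from the shell through the DataImport handler configured above; the host placeholder and the clean=true option are assumptions about how you want to run it.

# Start a full import (equivalent to pressing "Execute"); clean=true wipes the index first
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/dataimport?command=full-import&clean=true"
# Check progress until the status response reports that indexing completed
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/dataimport?command=status"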
Results
Solr side: from the left panel in Solr, select "Query", and make sure that you get results (on the right side) after pressing the "Execute Query" button, as follows: [11]
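For a quick check without the UI, a match-all query against the core returns the indexed rows; the row count and response format parameters here are just example choices.

# Return up to 10 indexed documents as JSON (equivalent to "Execute Query" with q=*:*)
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/select?q=*:*&rows=10&wt=json"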
Future Work
Future work will be extending the Solr standalone node into a small cluster to maintain core replication and high availability.
References
http://www.oracle.com/technetwork/database/enterprise-edition/databaseappdev-vm-161299.html
https://cwiki.apache.org/confluence/display/solr/Running+Solr
08-29-2017
08:07 AM
Introduction
The purpose of this article is to compare the upload time between three different methods for uploading the same semi-structured datasets into Hadoop (two methods) and MariaDB (one method).

Assumptions & Design
A small environment is used to deploy a three-node Hadoop cluster (one master node, two worker nodes). The exercise is run from my laptop, which has the following specs:
Processor Name: Intel Core i7
Processor Speed: 2.5 GHz
Number of Processors: 1
Total Number of Cores: 4
Memory: 16 GB
The Hadoop cluster is virtualized on top of my Mac machine by "Oracle VM VirtualBox Manager". The virtual Hadoop nodes have the following specs:
Table 1: Hadoop cluster – nodes' specifications
| Specification | Namenode (Master node) | Datanode #1 (Worker node #1) | Datanode #2 (Worker node #2) |
|---|---|---|---|
| Hostname | hdpnn.lab1.com | hdpdn1.lab1.com | hdpdn2.lab1.com |
| Memory | 4646 MB | 3072 MB | 3072 MB |
| CPU Number | 3 | 2 | 2 |
| Hard disk size | 20 GB | 20 GB | 20 GB |
| OS | CentOS-7-x86_64-Minimal | CentOS-7-x86_64-Minimal | CentOS-7-x86_64-Minimal |
| IP Address | 192.168.43.15 | 192.168.43.16 | 192.168.43.17 |
A standalone virtual machine was used for installing the MariaDB database, with the following specs:
Hostname: mariadb.lab1.com
IP Address: 192.168.43.55
Memory: 12 GB
Disk: 40 GB
OS: Linux CentOS 7
The semi-structured dataset used is the mail archive of the Apache Software Foundation (ASF), around 200 GB in total size. The mail archive contains the communications of more than 80 open-source projects (such as Hadoop, Hive, Sqoop, Zookeeper, HBase, Storm, Kafka and many more). The mail archive can be downloaded simply using the "wget" command, or any other tool, from this URL: http://mail-archives.apache.org/mod_mbox/
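As a rough illustration, one of the per-list directories used below can be mirrored like this; the list name and the wget options (in particular the *.mbox accept pattern) are assumptions, not the exact commands used for this exercise.

# Recursively fetch the zookeeper-user archive, keeping only the monthly mbox files
wget -r -np -nH -A "*.mbox" http://mail-archives.apache.org/mod_mbox/zookeeper-user/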
Results
Results collected from uploading mail files to the Hadoop cluster
The following results were collected from 14 distinct uploads of 13 different sub-directories that vary in size, number of contained files, and size of the contained files. The last upload tested uploading all previous 13 sub-directories at once. Two upload methods were used, and they differ significantly in upload time: the first is the normal upload of all files directly from the local file system of the Hadoop cluster; the second is Hadoop Archive (HAR), a Hadoop capability used to combine files into an archive before writing it back to HDFS. A sketch of both methods is shown below.
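As a minimal sketch of the two methods, assuming one of the directories below and hypothetical HDFS paths (the HAR step also assumes the directory has already been staged in HDFS; the article's exact procedure is not shown):

# Method 1: plain upload of a mailing-list directory, timed
time hdfs dfs -put lucene-dev /data/mails/

# Method 2: pack the staged directory into a single Hadoop Archive, timed
time hadoop archive -archiveName lucene-dev.har -p /staging/mails lucene-dev /data/mails-har
# Files inside the archive remain accessible through the har:// scheme
hdfs dfs -ls har:///data/mails-har/lucene-dev.har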
Table 2: Results collected from uploading mail files to the Hadoop cluster
| Loaded directory/directories | Directory size (KB) | Number of uploaded files | Avg. size of uploaded files (KB) | Load time (1st attempt) | Load time (2nd attempt) | Load using Hadoop Archive (1st attempt) | Load using Hadoop Archive (2nd attempt) |
|---|---|---|---|---|---|---|---|
| lucene-dev | 1214084 | 53547 | 22.67 | 89m16.790s | 70m43.092s | 2m54.563s | 2m28.416s |
| tomcat-users | 1023156 | 61303 | 16.69 | 101m45.870s | 86m48.927s | 3m17.214s | 2m59.006s |
| cxf-commits | 612216 | 22173 | 27.61 | 36m30.333s | 29m37.924s | 1m28.457s | 1m15.189s |
| usergrid-commits | 325740 | 9838 | 33.11 | 14m50.757s | 14m8.545s | 0m54.038s | 0m44.905s |
| accumulo-notifications | 163596 | 14482 | 11.30 | 24m38.159s | 24m49.356s | 1m3.650s | 0m27.550s |
| zookeeper-user | 82116 | 8187 | 10.03 | 14m40.461s | 14m34.136s | 0m47.865s | 0m40.913s |
| synapse-user | 41396 | 3690 | 11.22 | 5m24.744s | 4m47.196s | 0m38.167s | 0m29.043s |
| incubator-ace-commits | 20836 | 1146 | 18.18 | 2m28.330s | 2m4.168s | 0m25.042s | 0m23.401s |
| incubator-batchee-dev | 10404 | 1086 | 9.58 | 2m18.903s | 2m19.044s | 0m27.165s | 0m23.166s |
| incubator-accumulo-user | 5328 | 577 | 9.23 | 1m10.201s | 1m2.328s | 0m26.572s | 0m23.300s |
| subversion-announce | 2664 | 255 | 10.45 | 0m50.596s | 0m32.339s | 0m29.247s | 0m21.578s |
| www-small-events-discuss | 1828 | 218 | 8.39 | 0m45.215s | 0m20.160s | 0m22.847s | 0m21.035s |
| openoffice-general-ja | 912 | 101 | 9.03 | 0m26.837s | 0m7.898s | 0m21.764s | 0m19.905s |
| All previous directories | 3504280 | 176603 | 19.84 | 224m49.673s | Not tested | 8m13.950s | 8m46.144s |
Results collected from uploading mail files to MariaDB
The following results were collected from 14 distinct uploads of 13 different sub-directories that vary in size, number of contained files, and size of the contained files. The last upload tested uploading all previous 13 sub-directories at once.
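The article does not show the exact MariaDB load procedure, so the following is only a hypothetical sketch: it assumes a pre-created mails table and uses MariaDB's LOAD_FILE() to insert each mail file (which requires the FILE privilege and a permissive secure_file_priv setting, and must run on the database host).

# Hypothetical target table: mails(list_name VARCHAR(100), file_name VARCHAR(255), body LONGBLOB)
for f in /data/mails/zookeeper-user/*; do
  mysql -u myuser -pmypass maildb -e \
    "INSERT INTO mails (list_name, file_name, body)
     VALUES ('zookeeper-user', '$(basename "$f")', LOAD_FILE('$f'));"
done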
Table 3: Results collected from uploading mail files to MariaDB
| Loaded directory | Total size (KB) | Number of loaded files | Avg. size of loaded files (KB) | Load time (1st attempt) | Load time (2nd attempt) |
|---|---|---|---|---|---|
| lucene-dev | 1214084 | 53547 | 22.67 | 4m56.730s | 4m57.884s |
| tomcat-users | 1023156 | 61303 | 16.69 | 5m40.320s | 5m38.747s |
| cxf-commits | 612216 | 22173 | 27.61 | 2m4.504s | 2m2.992s |
| usergrid-commits | 325740 | 9838 | 33.11 | 2m30.519s | 0m55.091s |
| accumulo-notifications | 163596 | 14482 | 11.30 | 1m14.929s | 1m16.046s |
| zookeeper-user | 82116 | 8187 | 10.03 | 0m39.250s | 0m40.822s |
| synapse-user | 41396 | 3690 | 11.22 | 0m18.205s | 0m18.580s |
| incubator-ace-commits | 20836 | 1146 | 18.18 | 0m5.794s | 0m5.733s |
| incubator-batchee-dev | 10404 | 1086 | 9.58 | 0m5.310s | 0m5.276s |
| incubator-accumulo-user | 5328 | 577 | 9.23 | 0m2.869s | 0m2.657s |
| subversion-announce | 2664 | 255 | 10.45 | 0m1.219s | 0m1.228s |
| www-small-events-discuss | 1828 | 218 | 8.39 | 0m1.045s | 0m1.027s |
| openoffice-general-ja | 912 | 101 | 9.03 | 0m0.535s | 0m0.496s |
| All previous directories | 3504280 | 176603 | 19.84 | 46m55.311s | 17m31.941s |
Figure 1: Uploaded data size (KB) vs. upload time (sec).
Figure 2: Number of uploaded files vs. upload time (sec).

Conclusion
Traditional data warehouses can be tuned to store small-sized semi-structured data. This is valid and applicable for small uploads; as the number of files increases, they may no longer be the best option, especially when uploading a massive number of files (millions and above).
Uploading small files into Hadoop is a resource-consuming process, and uploading a massive number of small files can affect the performance of the Hadoop cluster dramatically, since a normal upload to HDFS handles every single file as a separate operation.
Using the Hadoop Archive (HAR) tool is critical when loading a massive number of small files at once. The HAR concept is to append files together, using a special delimiter, before they are written to HDFS, which reduces upload time significantly. It is important to note that querying data stored in a HAR will not be as fast as querying data uploaded directly without HAR, because reading from a HAR requires an additional internal de-indexing step.

Future Work
I'll try the same exercise using a bigger cluster with higher hardware specs to validate the same conclusion.

References
https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.1.0/bk_ambari-installation/content/ch_Getting_Ready.html
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_hdfs-administration/content/ch_hadoop_archives.html
https://mariadb.org/
10-30-2017
06:23 PM
1 Kudo
In later versions of NiFi, you may also consider using the "record-aware" processors and their associated Record Readers/Writers. These were developed to avoid this multiple-split problem, as well as the volume of provenance data generated by each split flow file in the flow.
03-04-2019
01:00 PM
This is a fine answer that lists other aspects to consider when choosing Tez over MR for Hive: http://community.hortonworks.com/answers/83488/view.html