Member since: 11-04-2016
Posts: 17
Kudos Received: 4
Solutions: 0
10-30-2017
11:29 AM
I did something similar for pushing data to Kafka from a few-million-row CSV file, using the same concept of multiple splits: https://community.hortonworks.com/content/kbentry/144771/ingesting-a-big-csv-file-into-kafka-using-a-multi.html
10-30-2017
08:05 AM
3 Kudos
Introduction
The purpose of this article is to use Apache NiFi to ingest a huge csv file (a few million rows) into a Kafka topic. The tricky point when dealing with a big file (millions of rows) is that processing it in NiFi at row level generates a flow file for every row, which can produce a Java out-of-memory error. As a workaround, we limit the number of flow files generated at once by splitting the rows in multiple stages. In this article, the file was around 7 million rows, and six splitting stages (100K, 10K, 1K, 100, 10, then 1) were used to limit the number of generated flow files. The other trick is to increase the “splits” queues' “Back Pressure Data Size Threshold” to a size adequate for the file; by default it is 1 GB. In this article, 2 GB is used.
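As an aside, the staged-split idea is easy to illustrate outside NiFi. The following shell sketch (illustration only, not part of the NiFi flow; file and prefix names are made up) shows why splitting in stages bounds how many pieces exist at any one step:

# Illustration only - not part of the NiFi flow. Splitting 7M rows straight into
# 1-row pieces would create ~7 million outputs at once; splitting in stages
# (100K, then 10K, ...) keeps each stage's fan-out small.
split -l 100000 -d huge.csv stage1_            # ~70 chunks of 100K rows each
for f in stage1_*; do
  split -l 10000 -d "$f" "${f}_stage2_"        # 10 chunks of 10K rows per 100K chunk
done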
Assumptions & Design A standalone NiFi and Kafka instance
is to be used for this exercise. The
following NiFi flow will be used to split the workload of the multi-million row
csv file to be ingested by dividing the ingestion into multi-stages. Figure 1: the NiFi flow Figure 2: Properties for “SplitText-100000” Figure 3: Properties for “SplitText-10000” Figure 4: Properties for “SplitText-1000” Figure 5: Properties for “SplitText-100” Figure 6: Properties for “SplitText-10” Figure 7: Properties for “SplitText-1” Figure 8: Properties for the six “splits” queues Results The
The csv rows were ingested properly into the Kafka topic. The only drawback of this flow is that it took almost 30 minutes to ingest a csv of around 7 million rows. This could be improved by using multiple NiFi instances and a clustered Kafka, which should be tested in the future.
Future Work
I'll try the same exercise with a bigger file, using bigger NiFi and Kafka clusters with higher hardware specs, to validate the same conclusion.
References:
https://community.hortonworks.com/questions/122858/nifi-splittext-big-file.html
https://kafka.apache.org/documentation/
Labels:
09-08-2017
05:29 AM
Introduction
The objective of this article is to describe how to use the “DataImport” tool inside Apache Solr to index an Oracle DB table.
Assumptions & Design
In this exercise, I've used two virtual machines:
1- Oracle Database App Development VM – 2 GB RAM
2- Linux CentOS VM – 3 GB RAM (Solr node)
Please note that the Hortonworks HDP Sandbox comes with an out-of-the-box Solr service that can easily be provisioned or enabled through the Ambari UI and used for this exercise, instead of installing the Solr service on a standalone node.
Oracle side
- Create a dummy table with the following structure: [1] (a hypothetical sketch is shown below)
- Insert some sample data into the created table.
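The exact table structure is in screenshot [1]; as a hypothetical stand-in, a table matching the column names used in the data-config.xml further below could be created from the Oracle side like this (user, password and sample values are assumptions):

# Hypothetical sketch only - the real structure is in screenshot [1]; the columns
# simply match the data-config.xml mapping used later in this article.
sqlplus myDBuser/myDBpass <<'SQL'
CREATE TABLE solr_test (
  emp_id     NUMBER PRIMARY KEY,
  first_name VARCHAR2(50),
  last_name  VARCHAR2(50),
  dob        DATE
);
INSERT INTO solr_test VALUES (1, 'John', 'Smith', DATE '1985-03-14');
INSERT INTO solr_test VALUES (2, 'Jane', 'Doe', DATE '1990-07-22');
COMMIT;
SQL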
On the Solr node
#yum install java-1.8.0-openjdk.x86_64
#java -version
#wget http://apache.org/dist/lucene/solr/6.6.0/solr-6.6.0.tgz
#tar xzf solr-6.6.0.tgz solr-6.6.0/bin/install_solr_service.sh --strip-components=2
#sudo bash ./install_solr_service.sh solr-6.6.0.tgz
#sudo service solr restart
You should see something like the following:
Started Solr server on port 8983 (pid=[….]). Happy searching!
From the Oracle machine, copy the Oracle JDBC driver (ojdbc) to the Solr server:
#scp ojdbc6.jar root@[Solr-IP-address]:/opt/solr/dist/
Create a new collection by invoking the “solr create –c” command from the path “/opt/solr/bin”, as follows: [2]
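The exact invocation is shown in screenshot [2]; a likely form of the command (core name as used throughout this article) is:

# Likely form of the command in screenshot [2], run as the solr service user:
su - solr -c "/opt/solr/bin/solr create -c Oracle_table"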
From the Solr portal (URL: http://[Solr-IP-Address]:8983/solr/#/), make sure that the new collection appears. [3]
From the left panel of the Solr home page, after selecting the “Oracle_table” core, select “Schema” and add the schema fields for the new table created in the Oracle DB. On the right side, press the “Add Field” button, and make sure not to delete any of the main “Fields”. [4][5] [6][7]
After creating the schema fields, they should appear in the “Fields” list. [8]
Create the “data-config.xml” file under “/var/solr/data/Oracle_table/conf/”. Make sure the column/field mapping between the Oracle DB table and Solr's
schema fields is configured properly.
<dataConfig>
  <dataSource name="jdbc" driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@//[DB-IP-Address]:[DB-Port]/[DBInstanceName]"
              user="myDBuser" password="myDBpass"/>
  <document>
    <entity name="solr_test" query="select * from solr_test">
      <field column="EMP_ID" name="id" />
      <field column="FIRST_NAME" name="first_name" />
      <field column="LAST_NAME" name="last_name" />
      <field column="DOB" name="dob" />
    </entity>
  </document>
</dataConfig>
Add the following DataImport handler to the “solrconfig.xml” file:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
Add the following <lib/> element in solrconfig.xml:
<lib dir="/opt/solr-6.6.0/dist/" regex=".*\.jar" />
From the Solr web UI, make sure that the “DataImport” page under the created collection “Oracle_table” looks like the following, without errors or warnings: [9]
Press the “Execute” button, and wait for a while or press the “Refresh Status” button until a green notification panel appears, like the following: [10]
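As a side note, the same import can also be triggered and monitored from the command line through the standard DataImport handler endpoint (host placeholder as used above); a minimal sketch:

# Optional command-line alternative to the "Execute" button:
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/dataimport?command=full-import&clean=true"
# Equivalent of the "Refresh Status" button:
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/dataimport?command=status"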
Results
Solr side
From the left panel in Solr, select “Query”, and make sure that you get results (on the right side) after pressing the “Execute Query” button, as follows: [11]
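The same check can also be done from the command line; a minimal sketch (host placeholder as above):

# Command-line equivalent of the "Execute Query" check:
curl "http://[Solr-IP-Address]:8983/solr/Oracle_table/select?q=*:*&rows=5&wt=json"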
Future Work
The future work will be to extend the standalone Solr node into a small cluster to maintain the cores' replication and high availability.
References
http://www.oracle.com/technetwork/database/enterprise-edition/databaseappdev-vm-161299.html
https://cwiki.apache.org/confluence/display/solr/Running+Solr
Labels:
08-29-2017
08:07 AM
Introduction
The purpose of this article is to compare the upload time between three different methods for uploading the same semi-structured datasets into Hadoop (two methods) and MariaDB (one method).
Assumptions & Design
A small environment is used to deploy a three-node Hadoop cluster (one master node, two worker nodes). The exercise is run from my laptop, which has the following specs:
Processor Name: Intel Core i7
Processor Speed: 2.5 GHz
Number of Processors: 1
Total Number of Cores: 4
Memory: 16 GB
The Hadoop cluster is virtualized on top of my Mac machine by “Oracle VM VirtualBox Manager”. The virtual Hadoop nodes have the following specs:
Table 1: Hadoop cluster – node specifications
Specification   | Namenode (master node)  | Datanode #1 (worker node #1) | Datanode #2 (worker node #2)
Hostname        | hdpnn.lab1.com          | hdpdn1.lab1.com              | hdpdn2.lab1.com
Memory          | 4646 MB                 | 3072 MB                      | 3072 MB
CPU number      | 3                       | 2                            | 2
Hard disk size  | 20 GB                   | 20 GB                        | 20 GB
OS              | CentOS-7-x86_64-Minimal | CentOS-7-x86_64-Minimal      | CentOS-7-x86_64-Minimal
IP address      | 192.168.43.15           | 192.168.43.16                | 192.168.43.17
A standalone virtual machine was used for installing the MariaDB database, with the following specs:
Hostname: mariadb.lab1.com
IP Address: 192.168.43.55
Memory: 12 GB
Disk: 40 GB
O.S: Linux CentOS 7
The semi-structured dataset used is the mail archive of the Apache Software Foundation (ASF), around 200 GB in total size. The mail archive contains the communications of more than 80 open-source projects (such as Hadoop, Hive, Sqoop, ZooKeeper, HBase, Storm, Kafka and many more). The mail archive can be downloaded simply using the "wget" command, or any other tool, from this URL: http://mail-archives.apache.org/mod_mbox/
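For example, a single list can be pulled month by month with wget; the sketch below is only an assumption about the mod_mbox URL pattern and the list/date range, not the exact script used here:

# Hypothetical example only - list name, date range and URL pattern are assumptions.
for month in 2017{01..08}; do
  wget -q "http://mail-archives.apache.org/mod_mbox/hadoop-user/${month}.mbox"
done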
Results
Results collected from uploading mail files to the Hadoop cluster
The following results table was collected after 14 distinct uploads: 13 uploads of different sub-directories that vary in size, number of contained files and sizes of contained files, plus a last upload of all previous 13 sub-directories at once. Two upload methods that differ significantly in upload time were used: the first method is a normal upload of all files directly from the local file system to the Hadoop cluster; the second method uses Hadoop Archive (HAR), a Hadoop capability used to combine files together into an archive before writing them back to HDFS.
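Roughly, the two methods correspond to the following commands (the actual paths, user and archive names used in the test are not shown in this article, so the ones below are assumptions):

# Method 1: normal upload of a mail sub-directory from the local file system
time hdfs dfs -put /data/mail-archives/lucene-dev /user/admin/mail/

# Method 2: Hadoop Archive (HAR) - pack the staged files into a single archive
time hadoop archive -archiveName lucene-dev.har -p /user/admin/staging lucene-dev /user/admin/mail/
# The archived files remain accessible through the har:// scheme
hdfs dfs -ls har:///user/admin/mail/lucene-dev.har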
Table 2: Results collected from uploading mail files to the Hadoop cluster
Loaded directory/directories | Size (KB) | Files uploaded | Avg. file size (KB) | Load time (1st attempt) | Load time (2nd attempt) | HAR load (1st attempt) | HAR load (2nd attempt)
lucene-dev | 1214084 | 53547 | 22.67 | 89m16.790s | 70m43.092s | 2m54.563s | 2m28.416s
tomcat-users | 1023156 | 61303 | 16.69 | 101m45.870s | 86m48.927s | 3m17.214s | 2m59.006s
cxf-commits | 612216 | 22173 | 27.61 | 36m30.333s | 29m37.924s | 1m28.457s | 1m15.189s
usergrid-commits | 325740 | 9838 | 33.11 | 14m50.757s | 14m8.545s | 0m54.038s | 0m44.905s
accumulo-notifications | 163596 | 14482 | 11.30 | 24m38.159s | 24m49.356s | 1m3.650s | 0m27.550s
zookeeper-user | 82116 | 8187 | 10.03 | 14m40.461s | 14m34.136s | 0m47.865s | 0m40.913s
synapse-user | 41396 | 3690 | 11.22 | 5m24.744s | 4m47.196s | 0m38.167s | 0m29.043s
incubator-ace-commits | 20836 | 1146 | 18.18 | 2m28.330s | 2m4.168s | 0m25.042s | 0m23.401s
incubator-batchee-dev | 10404 | 1086 | 9.58 | 2m18.903s | 2m19.044s | 0m27.165s | 0m23.166s
incubator-accumulo-user | 5328 | 577 | 9.23 | 1m10.201s | 1m2.328s | 0m26.572s | 0m23.300s
subversion-announce | 2664 | 255 | 10.45 | 0m50.596s | 0m32.339s | 0m29.247s | 0m21.578s
www-small-events-discuss | 1828 | 218 | 8.39 | 0m45.215s | 0m20.160s | 0m22.847s | 0m21.035s
openoffice-general-ja | 912 | 101 | 9.03 | 0m26.837s | 0m7.898s | 0m21.764s | 0m19.905s
All previous directories | 3504280 | 176603 | 19.84 | 224m49.673s | Not tested | 8m13.950s | 8m46.144s
Results collected from uploading mail files to MariaDB
The following results were collected after 14 distinct uploads: 13 uploads of different sub-directories that vary in size, number of contained files and sizes of contained files, plus a last upload of all previous 13 sub-directories at once.
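The MariaDB load script and table definition are not shown in this article; as a purely hypothetical sketch, a per-file load could look like the following (database, table, credentials and paths are all assumptions):

# Hypothetical sketch only - the actual load script is not part of this article.
mysql -u mailuser -p'secret' maildb -e "CREATE TABLE IF NOT EXISTS mail_files (
  id INT AUTO_INCREMENT PRIMARY KEY,
  file_name VARCHAR(512),
  content LONGTEXT)"
for f in /data/mail-archives/lucene-dev/*; do
  # LOAD_FILE() reads the file on the database server, so files must be readable there
  mysql -u mailuser -p'secret' maildb -e \
    "INSERT INTO mail_files (file_name, content) VALUES ('$f', LOAD_FILE('$f'))"
done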
Table 3: Results collected from uploading mail files to MariaDB
Loaded directory | Total size (KB) | Files loaded | Avg. file size (KB) | Load time (1st attempt) | Load time (2nd attempt)
lucene-dev | 1214084 | 53547 | 22.67 | 4m56.730s | 4m57.884s
tomcat-users | 1023156 | 61303 | 16.69 | 5m40.320s | 5m38.747s
cxf-commits | 612216 | 22173 | 27.61 | 2m4.504s | 2m2.992s
usergrid-commits | 325740 | 9838 | 33.11 | 2m30.519s | 0m55.091s
accumulo-notifications | 163596 | 14482 | 11.30 | 1m14.929s | 1m16.046s
zookeeper-user | 82116 | 8187 | 10.03 | 0m39.250s | 0m40.822s
synapse-user | 41396 | 3690 | 11.22 | 0m18.205s | 0m18.580s
incubator-ace-commits | 20836 | 1146 | 18.18 | 0m5.794s | 0m5.733s
incubator-batchee-dev | 10404 | 1086 | 9.58 | 0m5.310s | 0m5.276s
incubator-accumulo-user | 5328 | 577 | 9.23 | 0m2.869s | 0m2.657s
subversion-announce | 2664 | 255 | 10.45 | 0m1.219s | 0m1.228s
www-small-events-discuss | 1828 | 218 | 8.39 | 0m1.045s | 0m1.027s
openoffice-general-ja | 912 | 101 | 9.03 | 0m0.535s | 0m0.496s
All previous directories | 3504280 | 176603 | 19.84 | 46m55.311s | 17m31.941s
Figure 1: Uploaded data size in KB vs. upload time in seconds
Figure 2: Number of uploaded files vs. upload time in seconds
Conclusion
Traditional data warehouses can be tuned to store small-sized semi-structured data. This is valid and applicable for small-size uploads; as the number of files increases, they may not be the best option, especially when uploading a massive number of files (millions and above).
Uploading small files into Hadoop is a resource-consuming process, and uploading a massive number of small files can affect the performance of the Hadoop cluster dramatically; a normal file upload to HDFS creates a separate MapReduce process for every single file.
Using the Hadoop Archive (HAR) tool is critical when loading a massive number of small files at once. The HAR concept is to append the files together, using a special delimiter, before they are uploaded to HDFS, which reduces the upload time significantly. It's important to note that the query time of a HAR in Hadoop will not be equivalent to that of a direct upload without HAR, because querying a HAR requires an additional step to process the archive's internal index.
Future Work
I'll try the same exercise using a bigger cluster with higher hardware specs to validate the same conclusion.
References:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.1.0/bk_ambari-installation/content/ch_Getting_Ready.html
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_hdfs-administration/content/ch_hadoop_archives.html
https://mariadb.org/
Labels:
08-28-2017
05:22 AM
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.

Reference: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
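As a practical aside, the current Active/Standby roles can be checked with the hdfs haadmin tool; in the sketch below, nn1 and nn2 are assumed NameNode service IDs (the real IDs depend on the dfs.ha.namenodes.* setting in hdfs-site.xml):

# Check which NameNode currently holds the Active role (nn1/nn2 are assumed IDs):
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2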
03-28-2017
03:25 AM
Please check this: https://community.hortonworks.com/answers/22307/view.html
11-10-2016
08:20 AM
Hello, I'm wondering: when should we use Tez, and when should we use MR, as the execution engine for our queries running in Hive?
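For context, the engine is chosen by the hive.execution.engine property and can be switched per session without changing the query itself; a minimal sketch (connection URL and table name are placeholders):

# Run the same query on Tez, then on MR, to compare (URL and table are placeholders):
beeline -u jdbc:hive2://localhost:10000 -e "SET hive.execution.engine=tez; SELECT COUNT(*) FROM my_table;"
beeline -u jdbc:hive2://localhost:10000 -e "SET hive.execution.engine=mr;  SELECT COUNT(*) FROM my_table;"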
Labels:
- Apache Hadoop
- Apache Hive
- Apache Tez