Member since
05-30-2018
1322
Posts
715
Kudos Received
148
Solutions
02-22-2017
09:56 PM
3 Kudos
I find those who install NiFi via Ambari using local repos are generally required, for security purposes, to call out the ports that need to be opened. This is typical in cloud environments. I plan to update this list with community feedback to keep it fresh. These ports are not set in stone; NiFi ports are configured by simply changing the port in the properties file. Let's get to it:
Ambari: 8080
Zookeeper: 2181
Protocol port: 9088
HTTP port (ssl): 9091
HTTP port (non-ssl): 9090
Certificate Authority: 10443
nifi.remote.input.socket.port: 8022
nifi.cluster.node.protocol.port: 8021
Remote Process Group Raw: 8022
Remote Process Group HTTP: 8070
nifi.remote.input.socket.port: 9999
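As a minimal sketch of changing one of these ports in nifi.properties: the conf directory path varies by install, so a sample file stands in here so the commands can be run anywhere.

```shell
# Sketch: repoint the non-SSL UI port (9090 above) in nifi.properties.
# A sample file stands in for conf/nifi.properties, whose path varies by install.
cat > /tmp/nifi.properties <<'EOF'
nifi.web.http.port=8080
nifi.remote.input.socket.port=8022
EOF
# Edit the port in place, then confirm:
sed -i 's/^nifi.web.http.port=.*/nifi.web.http.port=9090/' /tmp/nifi.properties
grep '^nifi.web.http.port' /tmp/nifi.properties
```

Restart NiFi after editing for the change to take effect.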
12-25-2016
05:19 AM
good article
12-15-2016
11:20 PM
7 Kudos
A continuation of my IaaS Hadoop performance testing; my previous performance test was on BigStep.

Objective

Test 1 terabyte of data using the Tera suite (TeraGen, TeraSort, and TeraValidate) on similar hardware profiles, using core baseline settings, across multiple IaaS providers and Hadoop-as-a-Service offerings. Here we capture EMR performance statistics using EMRFS (S3), an object store.

AWS EMR

The natural next step is to test the Tera suite on AWS EMR, which is Amazon's Hadoop-as-a-Service offering. I used EMR with "EMRFS, which is an implementation of HDFS which allows EMR clusters to store data on Amazon S3". Object storage with Hadoop has not traditionally performed well, so I was very interested in testing the new EMRFS. EMRFS/S3 was chosen as the storage layer for this test because much of S3's allure (with EMR) is around EMR's ability to store and process data directly off S3. Using EMR's local storage (not S3) may increase performance.

Hardware

Instance type: i2.4xlarge (16 vCPU, 122 GB RAM, 4 x 800 GB SSD)
Cluster: 1 master and 3 data nodes

Observation

I have run the same core scripts on other platforms (hundreds of times) without any modification. That is the objective of these tests: run the same job/script on similar hardware profiles and numbers of nodes. With EMR that was not the case. I had to change various script settings, the MR jar file, and timeout settings for the scripts to work on EMR. Jobs on EMR failed using 1 terabyte of data (issue posted here on the AWS forum). I set mapred.task.timeout=12000000 to get around the EMRFS connection-reset issue. This issue did not occur for smaller datasets.

TeraGen results: 26 minutes, 45 seconds
TeraSort results: 2 hours, 57 minutes, 49 seconds
TeraValidate results: 23 minutes, 55 seconds

Performance Numbers

AWS EMR (EMRFS/S3): TeraGen 26 min 45 sec; TeraSort 2 hr 57 min 49 sec; TeraValidate 23 min 55 sec
BigStep/HDP (DAS): TeraGen 11 min 49 sec; TeraSort 51 min 12 sec; TeraValidate 4 min 42 sec

Note: the BigStep test used local disk, while the EMR test used EMRFS, so these numbers show the difference in performance between local storage and EMRFS (S3). Performance statistics for EMR using local storage (non-S3/EMRFS) are not provided here. The objective of the test was to capture performance statistics using the same jobs/scripts with the same configuration on similar hardware, and document the results. That's it. Keep it simple. This is not a reflection of the capabilities of a specific IaaS provider. All my scripts are located here. The EMR-specific scripts are here.
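The timeout workaround can be sketched as below. The examples jar path and HDFS output directory are assumptions (they vary by EMR AMI), so the command is echoed rather than executed here.

```shell
# Hypothetical: where the mapred.task.timeout override sits in a TeraGen invocation.
# Jar path and output dir are assumptions; the command is echoed, not executed.
JAR=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
echo "hadoop jar ${JAR} teragen -Dmapred.task.timeout=12000000 10000000000 /benchmarks/teragen-1T"
```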
12-13-2016
03:50 AM
6 Kudos
I have written articles in the past benchmarking Hadoop cloud environments such as BigStep and AWS. What I didn't dive into in those articles is how I ran the scripts. I built scripts to rapidly launch TeraGen, TeraSort, and TeraValidate. Why? I found myself running the same commands over and over and over again. Why not make it easier by simply executing a shell script. All the scripts I mentioned are located here. Grab the following files:

teragen.sh
terasort.sh
validate.sh

To run TeraGen, TeraSort, and TeraValidate, a determination of the volume of data and the number of records is required. For example, you can generate 500GB of data with 5000000000 rows. The script comes with the following predefined sets:

#SIZE=500G
#ROWS=5000000000
#SIZE=100G
#ROWS=1000000000
# This pair will be used, as it is the only one left uncommented
SIZE=1T
ROWS=10000000000
#SIZE=10G
#ROWS=100000000
#SIZE=1G
#ROWS=10000000

Above, SIZE=1T (for terabyte) and ROWS=10000000000 are uncommented, meaning this script will generate 1TB of data with 10000000000 rows. If you want a different dataset size and row count, simply comment out all the other SIZE and ROWS lines, leaving only the pair you want. Only one SIZE and one ROWS should be set (uncommented). This applies to all scripts (teragen.sh, terasort.sh, validate.sh), and all scripts must have the same SIZE and ROWS settings.

Logs

A logs directory is created based on where you run the script. Run output and stats are stored in it. For example, if you run /home/sunile/teragen.sh, it will create the logs directory at /home/sunile/logs. All the logs from TeraGen, TeraSort, and TeraValidate will reside there.

Parameters

This is an important piece for tuning. To benchmark your environment, parameters should be configured. Much of this is trial and error; I would say experience is required here, i.e. knowing how each parameter impacts a MapReduce job. Get help here. For tuning, change or add parameters here. For ease of first-time execution, use the ones set in the script. Run it as is and grab your stats. If the stats are acceptable, move on. What is acceptable? Take a look at the articles I published on BigStep and AWS. If the stats are not acceptable, start tuning.

Run the jobs in the following order:

TeraGen (teragen.sh)
TeraSort (terasort.sh)
TeraValidate (validate.sh)

Hope these scripts help you quickly benchmark your environment. Now go build some cool stuff!
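A minimal sketch of the pattern the scripts follow is below. The hadoop jar path and HDFS output directory are assumptions, so the job command is echoed into the log rather than executed; the real scripts are in the repo linked above.

```shell
# Minimal teragen.sh-style sketch: one uncommented SIZE/ROWS pair, a logs dir
# created where you run it, and the run recorded there.
SIZE=1T
ROWS=10000000000
LOGDIR="$(pwd)/logs"
mkdir -p "$LOGDIR"
# The real script would invoke hadoop here; echoed because jar/output paths vary.
echo "hadoop jar hadoop-mapreduce-examples.jar teragen ${ROWS} /benchmarks/teragen-${SIZE}" \
  | tee "$LOGDIR/teragen-${SIZE}.log"
```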
12-01-2016
10:07 PM
2 Kudos
Quick tips on how to find low-level hardware performance stats. I use these often for NiFi/Spark/Hadoop, though they are not limited to those services. Additionally, this is not an exhaustive list, nor am I advocating one tool over another; these are just a few I have run over the years during my implementation/POC experience. They give me insight into whether I have allocated enough physical resources to run the services. I highly recommend not assuming what your hardware can or can't do. Benchmark it! How? Read my article here. Let's get to it.

CPU stats

iostat -c 1 3 will provide CPU stats every 1 second, 3 times. Output of the report (sourced right from here):
CPU Utilization Report
The first report generated by the iostat command is the CPU Utilization Report. For multiprocessor systems, the CPU values are global averages among all processors. The report has the following format:
%user
Show the percentage of CPU utilization that occurred while executing at the user level (application).
%nice
Show the percentage of CPU utilization that occurred while executing at the user level with nice priority.
%system
Show the percentage of CPU utilization that occurred while executing at the system level (kernel).
%iowait
Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
%steal
Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
%idle
Show the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
vmstat -S M 1 5 will report in megabytes every 1 second, 5 times. The megabyte unit (-S M) is not important here, since we are only looking at CPU. Output of the report (sourced right from here): The us column reports the amount of time that the processor spends on userland tasks, or all non-kernel processes.
The sy column reports the amount of time that the processor spends on kernel related tasks.
The id column reports the amount of time that the processor spends idle.
The wa column reports the amount of time that the processor spends waiting for IO operations to complete before being able to continue processing tasks.
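The us/sy/id/wa percentages both tools report are derived from the jiffy counters in /proc/stat; a quick Linux-only sketch of reading them directly:

```shell
# Sketch (Linux only): the raw counters behind iostat/vmstat CPU percentages.
# First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
read -r label user nice system idle iowait _rest < /proc/stat
total=$((user + nice + system + idle + iowait))
echo "idle so far: $((100 * idle / total))%"
```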
Memory stats

glances is a tool I use for many stats, since its UI is much friendlier than most tools. Execute glances on the command line to view stats for disk, IO, and memory. You can also use it as client/server, grabbing stats from remote servers. Here you can see swap as well (sourced from here).
Another method is to run vmstat 1 5, which will read stats every 1 second, 5 times. Output of the memory stats (sourced from here):

swpd: the amount of virtual memory used.
free: the amount of idle memory.
buff: the amount of memory used as buffers.
cache: the amount of memory used as cache.
inact: the amount of inactive memory. (-a option)
active: the amount of active memory. (-a option)

Monitor the swpd column: if you are swapping too much, you will find your CPU runs hot.

Disk stats

glances is a good tool to monitor IO; take a look at the glances screenshot above and you will see the IO stats. iostat -d 1 5 will output disk stats every 1 second, 5 times. Output of the report (sourced right from here):

Device Utilization Report
The device report provides statistics on a per physical device or partition basis. Block devices for which statistics are to be displayed may be entered on the command line. Partitions may also be entered on the command line providing that option -x is not used. If no device nor partition is entered, then statistics are displayed for every device used by the system, and providing that the kernel maintains statistics for it. If the ALL keyword is given on the command line, then statistics are displayed for every device defined by the system, including those that have never been used. The report may show the following fields, depending on the flags used:
Device:
This column gives the device (or partition) name, which is displayed as hdiskn with 2.2 kernels, for the nth device. It is displayed as devm-n with 2.4 kernels, where m is the major number of the device, and n a distinctive number. With newer kernels, the device name as listed in the /dev directory is displayed.
tps
Indicate the number of transfers per second that were issued to the device. A transfer is an I/O request to the device. Multiple logical requests can be combined into a single I/O request to the device. A transfer is of indeterminate size.
Blk_read/s
Indicate the amount of data read from the device expressed in a number of blocks per second. Blocks are equivalent to sectors with kernels 2.4 and later and therefore have a size of 512 bytes. With older kernels, a block is of indeterminate size.
Blk_wrtn/s
Indicate the amount of data written to the device expressed in a number of blocks per second.
Blk_read
The total number of blocks read.
Blk_wrtn
The total number of blocks written.
kB_read/s
Indicate the amount of data read from the device expressed in kilobytes per second.
kB_wrtn/s
Indicate the amount of data written to the device expressed in kilobytes per second.
kB_read
The total number of kilobytes read.
kB_wrtn
The total number of kilobytes written.
MB_read/s
Indicate the amount of data read from the device expressed in megabytes per second.
MB_wrtn/s
Indicate the amount of data written to the device expressed in megabytes per second.
MB_read
The total number of megabytes read.
MB_wrtn
The total number of megabytes written.
rrqm/s
The number of read requests merged per second that were queued to the device.
wrqm/s
The number of write requests merged per second that were queued to the device.
r/s
The number of read requests that were issued to the device per second.
w/s
The number of write requests that were issued to the device per second.
rsec/s
The number of sectors read from the device per second.
wsec/s
The number of sectors written to the device per second.
rkB/s
The number of kilobytes read from the device per second.
wkB/s
The number of kilobytes written to the device per second.
rMB/s
The number of megabytes read from the device per second.
wMB/s
The number of megabytes written to the device per second.
avgrq-sz
The average size (in sectors) of the requests that were issued to the device.
avgqu-sz
The average queue length of the requests that were issued to the device.
await
The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
svctm
The average service time (in milliseconds) for I/O requests that were issued to the device. Warning! Do not trust this field any more. This field will be removed in a future sysstat version.
%util
Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.
vmstat -d 1 5 will run disk stats every 1 second, 5 times. Output of the report (sourced from here):

Reads
total: Total reads completed successfully
merged: grouped reads (resulting in one I/O)
sectors: Sectors read successfully
ms: milliseconds spent reading
Writes
total: Total writes completed successfully
merged: grouped writes (resulting in one I/O)
sectors: Sectors written successfully
ms: milliseconds spent writing
IO
cur: I/O in progress
s: seconds spent for I/O
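These per-disk counters come from /proc/diskstats on Linux; a quick sketch of pulling reads and writes completed per device directly:

```shell
# Sketch (Linux only): field 3 is the device name, field 4 is reads completed,
# and field 8 is writes completed in /proc/diskstats.
awk '{printf "%-12s reads=%s writes=%s\n", $3, $4, $8}' /proc/diskstats | head -5
```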
I like glances the best, due to its friendly output, for spotting when your disks are running too hot.
Network stats

Glances again provides an easy way to read network stats; take a look at the glances screenshot above. nload is a good utility to read current network throughput. Lastly, you can run sudo iftop -h, which will display tons of network stats. I have obviously hidden my IP address.
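The per-interface byte counters that nload and iftop visualize live in /proc/net/dev on Linux; a minimal sketch:

```shell
# Sketch (Linux only): cumulative rx/tx bytes per interface from /proc/net/dev.
# The first two lines are headers; field 1 is the interface, 2 is rx bytes, 10 is tx bytes.
awk 'NR>2 {gsub(":","",$1); printf "%s rx_bytes=%s tx_bytes=%s\n", $1, $2, $10}' /proc/net/dev
```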
11-30-2016
10:42 PM
3 Kudos
Apache NiFi 1.1.0 is now available, and once again I want to test it in an isolated environment. Docker! The steps are extremely similar to what has been detailed here (https://community.hortonworks.com/articles/69043/launching-a-nifi-docker-instance.html).

Pull the image:

docker pull sunileman/nifi1.1.0

You may find the mirror site is not optimal based on your location. Go here and grab your mirror site, then update the MIRROR_SITE parameter in the Dockerfile with it. The Dockerfile is available here. If you update the Dockerfile, you will have to build an image. Do this by running:

docker build --no-cache -t sunileman/nifi1.1.0 .

Whether you pulled the image or built a new one, run this to launch Apache NiFi 1.1.0:

docker run -it --rm -p 8080-8081:8080-8081 sunileman/nifi1.1.0

The NiFi UI should be available at http://localhost:8080/nifi/ Have fun!
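Swapping in your mirror can be scripted with sed. This is shown on a sample copy of the Dockerfile, since the exact ENV line format in the real Dockerfile is an assumption here, and the mirror URL is only an example.

```shell
# Sketch: update MIRROR_SITE before building. The ENV line format and mirror URL
# are assumptions; a sample file stands in for the real Dockerfile.
cat > /tmp/Dockerfile <<'EOF'
ENV MIRROR_SITE http://apache.mirrors.example.com
EOF
sed -i 's|^ENV MIRROR_SITE .*|ENV MIRROR_SITE https://archive.apache.org/dist|' /tmp/Dockerfile
grep MIRROR_SITE /tmp/Dockerfile
# then: docker build --no-cache -t sunileman/nifi1.1.0 .
```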
11-29-2016
07:56 PM
7 Kudos
This article describes how to launch Apache NiFi 1.0.0 on Docker. To launch Apache NiFi 1.1.0 on Docker, go here. During my development of the Json2CSV processor here, I quickly found a need for an environment to test my processor. I don't want to build and install NiFi on my laptop, since I need all my applications isolated from each other for ease of maintenance. Docker to the rescue! This is similar to how I launch a PyCharm IDE from a docker image here, which renders back to my laptop. Isolation! I like to keep it simple: put everything in a Dockerfile and allow myself to quickly launch a NiFi Docker image. Here are the steps to get you up and running.

Prerequisites

Download the latest VirtualBox from here. To run Docker containers or build images, a docker machine is required; download docker machine from here.

First pull the prebuilt and compiled docker image (https://hub.docker.com/r/sunileman/dockernifi/) by running this command:

docker pull sunileman/dockernifi

Now that you have the docker image, simply run it:

docker run -it --rm -p 8080-8081:8080-8081 sunileman/dockernifi

Here you are exposing ports 8080 and 8081 and mapping them to your local ports 8080 and 8081 respectively. During my development of this docker image, I found that sometimes VirtualBox will not create port-forwarding rules even though I created them during my docker run. To simplify this process, grab the portforward shell script from here and name it portforward.sh. Verify you can execute the script by issuing chmod on it, then run:

./portforward.sh 8080
./portforward.sh 8081

If you do not want to download and execute the script, simply go to VirtualBox and create port-forwarding rules for the 8080 and 8081 ports. You're done. Go to localhost:8080/nifi/ To shut down NiFi, simply hit control+c and NiFi will shut down gracefully. That is too easy. Now go build some cool stuff!
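If you're curious what such a portforward script boils down to, a sketch is below. The docker-machine VM name "default" and the VBoxManage rule format are assumptions (the real script is linked above), so the command is echoed rather than executed.

```shell
# Hypothetical portforward.sh sketch: add a NAT port-forwarding rule to the
# docker-machine VM ("default" is an assumed VM name). Echoed, not executed.
PORT="${1:-8080}"
echo "VBoxManage controlvm default natpf1 tcp-${PORT},tcp,,${PORT},,${PORT}"
```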
11-18-2016
04:07 PM
awesome @Binu Mathew
10-20-2016
03:30 PM
1 Kudo
HDP 2.5 GA'd Phoenix Query Server, which makes connecting to Phoenix much easier. This article will walk through the steps to connect to Phoenix Query Server via DBVisualizer.

Grab the Phoenix thin JDBC driver onto your desktop; on HDP 2.5, here is the location of the JDBC driver. Start up DBVisualizer. From the top menu bar, select Tools -> Driver Manager. Populate the fields:

Name: Apache Phoenix Thin Client (I used phoenixthin)
URL Format: jdbc:phoenix:thin:url=<scheme>://<server-hostname>:<port>[...]

Click on the folder icon and locate the Phoenix thin JDBC driver you downloaded to your desktop. Once selected, click OK, and now you have the driver loaded.

Let's connect to Phoenix. Click on the icon shown below, which creates a new DB connection, then select "Use Wizard". Enter a connection name; I used phoenix-QPS. Now select the driver you loaded in the previous steps; I named my driver phoenixthin. Next enter your userid and password. For this example I used the root user ID. Now a database connection has been created. Let's connect to Phoenix using that connection alias. Go to your connection alias shown on the left pane, right-click, and select "Connect". You are now connected! Start having fun and open up some namespaces. Select * from some tables. It is clear connecting to Phoenix is much easier now, thanks to the community building Phoenix Query Server. Happy Phoenix-ing!
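For reference, a filled-in example of that URL format. The hostname is a placeholder, 8765 is the Phoenix Query Server default port, and the serialization property shown is what the thin driver commonly expects; verify all three against your install.

```shell
# Example thin-client JDBC URL (host is a placeholder; 8765 is the PQS default port).
PQS_HOST=pqs.example.com
PQS_PORT=8765
echo "jdbc:phoenix:thin:url=http://${PQS_HOST}:${PQS_PORT};serialization=PROTOBUF"
```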