12-12-2017
08:49 PM
3 Kudos
Provisioning clusters with MiNiFi C++ Bootstrap.
This article references code from https://github.com/apache/nifi-minifi-cpp/pull/218
MiNiFi C++ has a new bootstrap process that enables you to select dependencies through a menu-guided approach. In this article, we’ll discuss using the bootstrap script to bypass the menu and install all dependencies that your system is capable of installing.
The major advantage of this is that you will be able to run MiNiFi C++, which doesn’t incur a startup cost and can be run within a very small memory footprint, on any device you can reach with SCP to distribute the source and SSH to run the bootstrap command.
Overview of the bootstrap process
The bootstrap script is a simple set of bash scripts that runs the dependency installation, the CMake setup, and the build. You can then use the bootstrap to run a make install. The bootstrap.sh script normally uses prompts to ensure agreement at every step, but for the purposes of this article we’ll supply the -n argument, which dictates that we want no prompts for the user. This runs the bootstrap in headless mode. We’ll also supply the -p argument, which runs a parallel make build and make package.
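For a single node, the resulting headless invocation is a one-liner run from the extracted source directory (the -e flag shown later in this article is added when running across the cluster):
$ cd nifi-minifi-cpp
$ ./bootstrap.sh -n -p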
Alternatives
Readers may wonder why we can’t run make package on a node of similar architecture and ship a portable binary. We can do this with GCC by supplying a generic -march value; however, the purpose of this article is to distribute the source and run the build on devices that may have different architectures, without the need for cross compilation. By using the bootstrap script, we are able to provision a cluster of any size containing different architectures without centralizing cross compilation. Bootstrapping is by no means a more efficient approach; instead, it supports various architectures across a myriad of OS distributions. Note that since the bootstrap script detects the OS version and type, it can self-limit the installation of features. With cross compilation, this becomes a greater task, as you must build separately for each target architecture.
Running
Running the bootstrap is very simple. To test, I built multiple VMs on AWS with Ubuntu 16, RHEL 7, and CentOS 6 installed across ten instances. I then used pdsh with a genders file to distribute commands to each host. I used this approach to minimize typing; you can just as easily use SSH to run the commands manually on each host.
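If you prefer plain SSH/SCP, distributing the source tarball can be as simple as the loop below (the hostnames are the ones used later in this article; the tarball name is illustrative):
$ for h in minifitest0{0..9}; do scp minifi.tar.gz "$h":~/ ; done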
Once I distributed the tar.gz file to each node, I set up a genders file that associated a name with each IP. I called these hosts minifitest00 through minifitest09, assigning them a group name of minifitestnodes.
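For reference, a genders file for this layout might look like the following; the hostname-to-IP mapping itself lives in /etc/hosts or DNS, so the file only assigns each host to the minifitestnodes group:
minifitest00 minifitestnodes
minifitest01 minifitestnodes
minifitest02 minifitestnodes
minifitest03 minifitestnodes
minifitest04 minifitestnodes
minifitest05 minifitestnodes
minifitest06 minifitestnodes
minifitest07 minifitestnodes
minifitest08 minifitestnodes
minifitest09 minifitestnodes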
I first extracted the gzipped tarball:
$ pdsh -g minifitestnodes 'tar -xzf ~/minifi.tar.gz'
This extracted to a directory named nifi-minifi-cpp in the connecting user’s home directory. I then ran a command to bootstrap each node:
$ pdsh -g minifitestnodes 'cd ~/nifi-minifi-cpp ; ./bootstrap.sh -p -n -e'
This will download the dependencies for each distribution, accepting all updates, and then run a make build and a make package. With MiNiFi C++ built and provisioned, we can run our agents. In future posts we'll discuss using C2 to control these agents.
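With the packages built, a make install can be pushed out the same way. The sketch below assumes the bootstrap created its CMake build tree in a directory named build under the source tree; adjust the path (and add sudo if installing to a system prefix) to match your setup:
$ pdsh -g minifitestnodes 'cd ~/nifi-minifi-cpp/build ; make install'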
Next Steps for bootstrapping
The next step in the bootstrapping process will be limiting the build nodes. Since many nodes will undoubtedly share the same build profile (architecture, OS version, and OS type), we should be able to use the bootstrap to identify this automatically, build only once, and distribute the executables within our cluster.
Conclusion
This article discusses bootstrapping nodes with little to no knowledge of the underlying distribution. Since the bootstrap script is bash based, no dependencies beyond bash are needed up front. The script will bootstrap the node with all of the components needed to build and run the agent.
12-04-2017
03:10 PM
2 Kudos
In this article, we’ll discuss tuning repository settings to achieve the highest possible throughput in MiNiFi C++.
We’ll begin by benchmarking the SSD that we have. Based on reported numbers, the SSD in this 15” MacBook Pro is rated at 676 MB/s sequential write and 728 MB/s sequential read. That is far from the fastest drive available, but our goal with this post is to achieve the highest possible throughput with the greatest guarantees. We’ll try different variations with and without volatile stores.
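Those rated numbers can be sanity-checked with a quick sequential-write test before tuning anything; the command below is only a rough illustration (write caching will inflate the result, and dd's block-size syntax varies slightly between macOS and Linux):
$ dd if=/dev/zero of=/tmp/ddtest bs=1m count=4096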
Our benchmark shows that we are achieving near-optimal speeds across various runs. We’ll create a MiNiFi C++ processor named ThroughputMeasure [1] that will be placed at the end of our graph to measure the bytes output by our flow. In our case, we’ll generate a flow file and simply measure the output. This won’t give us a completely accurate measure; however, it will be useful for tuning our setup.
The first thing we need to be cognizant of is having enough space in the connection queue from the GenerateFlowFile processor to the ThroughputMeasure processor to avoid backpressure. Therefore, we will set the max queue size to 0 (which makes it limitless) and the max queue data size to 1 GB. You may play with these settings to give yourself a relative measurement; the higher you go, the more likely you are to avoid starving the ThroughputMeasure processor [1] [2].
Configuring MiNiFi C++
The next step is to begin changing the maximum bytes and counts for all repositories. You may use the following values as I did:
nifi.volatile.repository.options.provenance.max.bytes=102439390
nifi.volatile.repository.options.flowfile.max.bytes=102439309
nifi.volatile.repository.options.content.max.bytes=1023999999
nifi.volatile.repository.options.content.max.count=500000
nifi.volatile.repository.options.content.minimal.locking=false
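These options only take effect for the repository implementations you have actually selected. As I recall, the implementations are chosen through class-name properties in conf/minifi.properties similar to the lines below; treat the exact key names as an assumption and verify them against the properties file shipped with your build:
nifi.flowfile.repository.class.name=VolatileFlowFileRepository
nifi.provenance.repository.class.name=VolatileProvenanceRepository
nifi.content.repository.class.name=VolatileContentRepository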
The reason we’re using these values as the baseline is to increase the maximum bytes in our volatile repositories. In a subsequent post, I’ll discuss the process for arriving at these numbers. Lowering them will naturally reduce the maximum amount of data a volatile repository may consume, but it will also dramatically lower the throughput, and in this post we’re interested in achieving the highest possible throughput.
The final option above, "content.minimal.locking", forces the volatile content repository to disable a feature that is better suited for higher concurrent access. The feature, titled "minimal locking", favors CAS operations over mutex-based locking and will perform better with higher concurrent access. For the purposes of this test we know we won't have elevated concurrency, so disabling this feature is to our benefit.
With repositories we have several options:

Repo Name                    | Repo type  | Storage Type   | Description
NoOpRepository               | no-op      | n/a            | Makes the repository a no-op, validating all puts and removals. Can be useful for testing.
FlowFileRepository           | flowfile   | RocksDB (disk) | Stores FlowFile records in a RocksDB database; a WAL is used to improve reliability and performance.
ProvenanceRepository         | provenance | RocksDB        | Stores provenance events in a RocksDB database; a WAL is used to improve reliability and performance.
VolatileFlowFileRepository   | flowfile   | memory         | Stores FlowFile records in memory, without persisting to disk.
VolatileProvenanceRepository | provenance | memory         | Stores provenance events in memory, without persisting to disk.
VolatileContentRepository    | content    | memory         | Stores content in memory, without persisting to disk.
DatabaseContentRepository    | content    | RocksDB        | Stores content in a RocksDB database, reducing the number of inodes on disk.
FileSystemRepository         | content    | disk           | Stores content on disk, in individual content files.

In measuring performance we can safely eliminate NoOpRepository as an option, since it provides no functional benefit. You can use it if you wish not to maintain provenance; however, for the rest of this article we'll be looking at maintaining our traditional flow of operations. The test plan is to run a series of writes with a clean repository for each configuration. We'll bundle the flow file and provenance repositories together as a single option, intersecting the results with the type of content repository. We'll run each configuration for an hour, reporting the median throughput.
Results

Repo Types      | FlowFile Provenance | FlowFile Volatile Provenance | Volatile FlowFile Provenance | Volatile FlowFile Volatile Provenance
FileSystem      | 667 MB/s            | 650 MB/s                     | 656 MB/s                     | 628 MB/s
VolatileContent | 524 MB/s            | 510 MB/s                     | 1027 MB/s                    | 1058 MB/s
DatabaseContent | 118 MB/s            | 132 MB/s                     | 150 MB/s                     | 202 MB/s

The results above show that the file system content repository responds at roughly equivalent speeds across configurations. The VolatileContentRepository was slower when using a non-volatile flow file repository. This is because the traditional flow file repository removes entries in batches: within ten seconds, the VolatileContentRepository would cause backpressure due to insufficient space while waiting on the flow file repository to clear flow files. In the case of the volatile flow file repository we see higher throughput, as would be expected with a volatile repository. The DatabaseContentRepository performed poorly across the board. This is likely because of the manner in which it flushes the write-ahead log: since we must guarantee the content, we enforce flushing on every write. Where the DatabaseContentRepository thrives is on file systems with limited resources.
When we pit it against the FileSystemRepository on a Raspberry Pi, we see vastly different results:

Repo Types      | FlowFile Provenance | FlowFile Volatile Provenance | Volatile FlowFile Provenance | Volatile FlowFile Volatile Provenance
FileSystem      | Fail                | Fail                         | 32 MB/s                      | 31 MB/s
DatabaseContent | 33 MB/s             | 32 MB/s                      | 33 MB/s                      | 32 MB/s

Analysis
The results above are quite intriguing. The throughput of our normal content repository was limited by the speed of our SSD. Configuring all repos to be volatile, while beneficial, may not be the most desirable choice given the environmental conditions. In the example here we are quite limited by the amount of data that GenerateFlowFile produces; if we change our data source to be a trove of already formed data, we may see different results. In my next article I will discuss using this information and employing more automated ways of controlling throughput.
[1] https://gist.github.com/phrocker/e7814920b5a724ace5016aa2ccd1ba7a
[2] https://github.com/phrocker/nifi-minifi-cpp/tree/ThroughputMeasure
09-19-2017
05:21 PM
7 Kudos
Traditional data warehousing models espoused by Apache Hadoop and NoSQL stores entrench themselves in the belief that it is better to bring the processing to the data. Data flow, a hallmark of the HDF platform, is the mechanism by which we originally bring that data to processing platforms and/or data warehouses. Data can be processed, minimally or fully, with products such as Apache NiFi for seamless ingestion into any number of data warehousing platforms. Apache MiNiFi is a natural extension of these paradigms, where processing can be performed with the benefits that NiFi provides within a much smaller memory footprint. This article will discuss using Apache MiNiFi as a data grooming and ingestion mechanism, coupled with Apache NiFi, to further leverage edge and IoT devices within an organization's infrastructure.
How we view this
This article is intended to look at
two problem statements. The first is using Apache MiNiFi in a slightly different way: as a mechanism to front-load some of the ingestion and processing framework. Second, we look to augment Apache NiFi using Apache MiNiFi’s capabilities on last-mile devices. In doing so we can mitigate problems sometimes seen on data warehousing and query platforms. We will do so using the example of Apache Accumulo. Apache Accumulo is based on the BigTable design and can be replaced here with Apache HBase for your purposes; however, we are using Apache Accumulo because we already have a C++ client built for this system. Coupled with Apache MiNiFi, we are able to create, in the MiNiFi agents, the data typically built by MapReduce/YARN/Spark, avoiding issues that may plague the NoSQL store. How much processing is front-loaded onto edge devices is up to the architecture, but for our purposes here we are using Apache MiNiFi and Apache NiFi to distribute processing amongst the worker nodes where the data originates.
Problem statement
Since Apache
Accumulo can be installed on highly distributed systems, it stands to reason that the problems that plague BigTable also plague installations with high amounts of ingest and query. One of these is compaction storms on highly loaded systems. A server that is receiving too much data, too quickly, may attempt to compact the data in its purview. One way to mitigate this is to combine the data to be ingested. While we can’t alleviate this entirely under sustained ingest, we can reduce compactions dramatically when servers are receiving disproportionate data from many sources. Using MiNiFi, we can leverage NiFi with a primary node (or set of primary nodes) to balance through a Remote Processing Group [1]. This enables many MiNiFi agents to create data, send it to NiFi, and then act on the created data that is sent to the Apache Accumulo instance. These problems stem primarily from the motivation to minimize the number of files read per tablet or region server. As the number of files increases, performance is impacted; the system may not be able to compact or reduce the number of files fast enough, hence the need to leverage edge devices to do some of the work if the capacity exists.
How it’s done
Using a C++ Apache Accumulo client [2], we’ve
created a branch of Apache MiNiFi that has extensions [3] built for Apache Accumulo. These extensions simply take content from a flow and place the textualized form of the attributes and content into the Apache Accumulo value. The data is then stored in an Accumulo-specific data format known as an RFile [4]. The keys and values are stored in the RFile and then sent to the NiFi instance via the Site-to-Site protocol. Figure 1, below, shows the input and output ports we have configured. Using this information, we are able to create a configuration YAML file for Apache MiNiFi that uses the Apache Accumulo writer [5]; a rough sketch of such a file appears at the end of this section. This YAML file simply takes input from a GetFile processor and ships it to the NiFi instance. In the example from Pastebin you can see that data is sent to the fromnifi RPG and then immediately shipped back to ToMiNiFi, where we bulk import the RFiles.
Figure 1: Apache NiFi view of input and output ports.
Does this make sense?
Since we're taking a slightly different approach by moving processing to the edge, some might question whether this makes sense. The answer depends on the use case. We're trying to avoid specific problems we see related to data ingestion on NoSQL systems like Apache Accumulo/HBase. Not having the capacity to keep up with compactions isn't necessarily related to cluster size; there could be a tablet or region server that doesn't have the capacity and has become a temporary hotspot, so leveraging your edge network to perform more of the ingest and warehousing process may make sense if that network exists and the capacity is sufficient. An edge device similar to a Raspberry Pi has sufficient capacity to read sensors, create RFiles, and compact data, especially when the data purview is specific to that device.
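As promised above, here is a rough sketch of what such a config.yml could look like: a GetFile processor feeding a Remote Processing Group input port. The schema keys follow the MiNiFi YAML configuration format as I recall it, the directory, URL, port id, and names are illustrative placeholders, and the Accumulo writer processor itself is omitted since its class name is specific to the branch in [3]; the actual file used for this article is linked at [5].
Flow Controller:
  name: Accumulo RFile flow
Processors:
  - name: GetFile
    class: GetFile
    scheduling strategy: TIMER_DRIVEN
    scheduling period: 1 sec
    Properties:
      Input Directory: /tmp/minifi-input
Connections:
  - name: GetFileToNiFi
    source name: GetFile
    source relationship name: success
    destination name: fromnifi
Remote Processing Groups:
  - name: NiFi
    url: http://nifi-host:8080/nifi
    timeout: 30 secs
    yield period: 10 sec
    Input Ports:
      - id: 00000000-0000-0000-0000-000000000000   # replace with the NiFi input port id
        name: fromnifi
        max concurrent tasks: 1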
Of course, to avoid the Apache Accumulo problems stated above, it would make more sense to combine and compact relevant RFiles. Doing this intelligently means knowing which tablet servers host what data and splicing the RFiles we are sent as needed. In the example in Figure 1, we can do this by simply bulk importing files that are directed at a single tablet and re-writing the RFile to the session to be handled by another MiNiFi agent. This requires significantly more coordination, hence the use of the C2 protocol defined for Apache MiNiFi [6]. Using the C2 protocol defined in the design proposal, we were able to coordinate which Apache MiNiFi agent should respond to which keys and which RPG queue hosts that data. Since this capability isn’t defined in the proposal we are not submitting it to Apache MiNiFi; however, the code should be available in [3] shortly. The purpose of this C2 coordination is to allow many Apache MiNiFi agents to work in a quorum without the need to access ZooKeeper. It makes sense for agents to handle the data they are given and for other agents to compact the data externally, but it does not make sense for them to access ZooKeeper. In the case of the BulkImport processor we do access ZooKeeper; however, it makes more sense for Apache NiFi or some other mechanism to do the import. Using a large set of Apache MiNiFi agents, we can perform compactions external to a tablet server in a much smaller footprint. With a minor increase in latency at ingest, we may be able to avoid a large query latency due to compactions falling behind. The fewer RFiles we create, and the more we combine, the better.
Apache MiNiFi at the front lines
Apache MiNiFi
creating an object to be directly ingested implies we are doing data grooming and access at the edge. While not every device has the processing power for the compression, or the memory to support the sorting of data, Apache MiNiFi can do this work using [2] and [3]. Apache MiNiFi works well in scalable environments, leveraging Apache NiFi’s Remote Processing Group capability. The design goal is to take small devices that have processing and memory capabilities beyond simple sensor-reading devices and use those capabilities to the fullest. In some ways this serves to minimize the infrastructure in data centers, but it also helps us scale, since we’re using our entire infrastructure as a whole instead of limited portions of it. The edge doesn’t necessarily have to be a place from which we send data; it can also be a place where we process data. Distributing computing beyond our warehouse and internal clusters gives us the ability to scale well beyond current sizes. Apache Accumulo simply serves as an example where we do more at the edge so that we do less in the warehouse. Doing so may avoid problems at the warehouse and speed up the entire infrastructure. Of course, all use cases are different, and this article is simply meant to provide an alternative focus on edge computing. The next steps will be to better show how we can front-load compactions into MiNiFi and use Apache NiFi and Apache MiNiFi to help replicate data to separate Apache Accumulo instances...
[1] https://community.hortonworks.com/questions/86731/load-balancing-in-nifi.html
[2] https://github.com/phrocker/sharkbite/tree/MINIFI
[3] https://github.com/phrocker/nifi-minifi-cpp/tree/MINIFI-SHARKBITE
[4] https://accumulo.apache.org/1.8/apidocs/org/apache/accumulo/core/client/rfile/RFile.html
[5] https://pastebin.com/nbeTqNpD
[6] https://cwiki.apache.org/confluence/display/MINIFI/C2+Design+Proposal
06-06-2017
06:43 PM
5 Kudos
This article will cover debugging MiNiFi C++. It is the first in a series on debugging, memory leak detection, and profiling MiNiFi C++. In my development of MiNiFi C++ I spent the bulk of my time using common GNU tools for debugging and ensuring we don't have memory leaks, while profiling allows us to establish a baseline. This guide will focus on debugging using the GNU Debugger, GDB.
Debugging
MiNiFi C++ uses CMake to generate builds. A precondition for debugging C++ code is including debug symbols in the executable. To do this you will need to specify a build type of DEBUG or RELWITHDEBINFO, both of which instruct CMake to include those symbols:
$ cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
$ cmake -DCMAKE_BUILD_TYPE=DEBUG ..
When you do this, CMake will add the compiler flags that include debug symbols (-g). When you build the project, the debug symbols will be included, allowing you to run GDB, the GNU debugger. When you've installed MiNiFi, you can debug minifi in place by running:
gdb ./minifi
GNU gdb (GDB) 7.12.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
Reading symbols from ./minifi...done.
(gdb) run
This will begin the debug session. To set a breakpoint on a piece of code, such as Site2SiteClientProtocol.cpp, you can use the following command with your desired line. In this example I'm setting a breakpoint in the function named initiateResourceNegotiation of Site2SiteClientProtocol.
(gdb) break Site2SiteClientProtocol.cpp:87
Breakpoint 1 at 0x5614db: file /opt/marc/code/nifi-minifi-cpp/libminifi/Site2SiteClientProtocol.cpp, line 87
(gdb) run
[Switching to Thread 0x7ffff83fff700 (LWP 7127)]
Thread 42 "minifi" hit Breakpoint 1 org::apache::nifi::minifi::Site2SiteClientProtocol::initiateResourceNegotiation at /opt/marc/code/nifi-minifi-cpp/libminifi/Site2SiteClientProtocol.cpp:87
int ret = peer_->write(_currentVersion);
If at this point you wish to print the value of _currentVersion, you can use the print command. You can see from the example output below that we have an unsigned integer _currentVersion whose value is 5.
(gdb) print _currentVersion
$1 = 5
(gdb) ptype _currentVersion
type = unsigned int
You can inspect the stack with commands such as bt (or backtrace) to print the backtrace of active functions in the execution stack, or frame <number> if you want to print a particular stack frame. In many cases, if a signal interrupts execution, you can use backtrace to get the current location while inspecting variables that may help lead you to the problem.
Typical problems that occur are segmentation faults, bus errors, or stack corruption. Segmentation faults are commonly caused by accessing addresses outside of the assigned address space. An example might be null pointers, in which your variables are attempting to access 0x00 (nullptr/NULL). In this case, printing variables or the stack frame may be helpful in finding which variable is null. Bus errors are typically caused by a misaligned instruction pointer or invalid pointer access, which will likely occur because of pointer mismanagement. In these cases you may need to use GDB to print variables. Pointer corruption is best avoided by using smart pointers, since we are using C++11; however, in the event that raw pointers are needed, GDB can be very helpful in locating memory issues.
In many cases, stack corruption can manifest as a general segfault. One common cause is calling a bogus function pointer, as your program counter (PC) will have a bogus value. Since we're using C++, you may see issues as a result of bogus virtual function calls. In these cases, Valgrind may be a better tool to debug the situation. To do this, simply run the following command:
valgrind ./minifi
When the error is reached, Valgrind will provide the offending lines along with the address of the function pointer. Verify that this is expected and/or trace the output with GDB to identify whether the value is expected. Using our example above with the function initiateResourceNegotiation, you can locate the address of a function. The info command (help info provides more information about this command) will allow you to get information about symbols.
(gdb) info address org::apache::nifi::minifi::Site2SiteClientProtocol::initiateResourceNegotiation
Symbol "org::apache::nifi::minifi::Site2SiteClientProtocol::initateResourceNegotiation" is a function at address 0x55a950. Info address will provide the address of where the function exists in memory. In this case you can correlate the function symbol address with your function pointer found in Valgrind. With this information you should be able to track down the invalid function pointer and alleviate your source of stack corruption.
Conclusion
This quick how-to discusses some common commands to debug the three most common issues I've come across. No guide can be an exhaustive tutorial; please comment with requests and I will attempt to demonstrate by example using MiNiFi C++. I will discuss common usage of Valgrind in the next how-to. While Valgrind is often used for memory leak detection, it can also be used to complement GDB in identifying memory access errors. I like to use both tools, since Valgrind's strong suit is providing contextual errors about memory access in addition to memory leak detection.
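As a small preview of that follow-up, a typical leak-detection run simply adds Valgrind's leak-check option to the same invocation used above:
valgrind --leak-check=full ./minifi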