Member since: 05-22-2019
Posts: 26
Kudos Received: 26
Solutions: 1

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2177 | 03-01-2017 09:38 PM |
09-13-2017 10:30 PM
3 Kudos
Hortonworks DataFlow (HDF) includes Apache NiFi with a wealth of processors that make it easy to ingest syslog data from multiple servers. Information collected from syslog can be stored on the HDFS distributed filesystem as well as forwarded to other systems such as Splunk. Furthermore, you can parse the stream and select which information should be stored on HDFS and which should be routed to a Splunk indexer.

To demonstrate this capability, let us first review the NiFi ListenSyslog processor. The processor above corresponds to the syslog configuration in /etc/rsyslog.conf, which includes the following line:

*.* @127.0.0.1:7780

This causes syslog messages to be streamed into the NiFi flow, which we can then direct to another processor, PutSplunk, configured as follows:

In the Splunk UI you configure a data input under Settings -> Data inputs -> TCP (Listen on a TCP port for incoming data, e.g. syslog). Use the port corresponding to the one configured in the PutSplunk processor above (516), then configure the source type as linux_syslog.

At this point you can start the flow and NiFi will ingest Linux syslog messages into Splunk. Once data is received, you can search it in Splunk as follows:

To retrieve information back from Splunk, you can use the GetSplunk processor and connect it to a PutFile or PutHDFS processor; as an example, I have used GetSplunk as follows:

For more details on HDF: https://hortonworks.com/products/data-center/hdf/
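As a quick sanity check of the rsyslog side of the flow above, the forwarding rule and a test message boil down to the lines below. This is only a sketch; the restart command and the host/port depend on your distribution and on how ListenSyslog is configured.

```
# /etc/rsyslog.conf -- forward all facilities and severities to the NiFi ListenSyslog port
# (a single @ means UDP; use @@ for TCP; 7780 matches the ListenSyslog processor above)
*.* @127.0.0.1:7780
```

```
# apply the change and emit a test message that should appear in the NiFi flow
sudo systemctl restart rsyslog    # or: sudo service rsyslog restart
logger "NiFi syslog ingestion test"
```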
03-02-2017 03:39 PM
1 Kudo
@eorgad To protect the S3A access/secret keys, it is recommended that you use either IAM role-based authentication (such as an EC2 instance profile) or the Hadoop Credential Provider Framework, which stores the keys securely and makes them available through configuration. The Credential Provider Framework allows secure "credential providers" to keep secrets outside Hadoop configuration files, storing them in encrypted files on the local or Hadoop filesystems and including them in requests. The Hadoop-AWS module documentation describes how to configure this properly.
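As a rough sketch of the credential-provider approach (the jceks path below is only an example location):

```
# store the S3A keys in an encrypted credential store; you are prompted for each value
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/admin/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/admin/s3.jceks
```

Then point Hadoop at the store by setting hadoop.security.credential.provider.path to jceks://hdfs/user/admin/s3.jceks in core-site.xml (or pass it per job with -D), and remove the plain-text keys from your configuration files.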
02-27-2017 06:58 PM
5 Kudos
This also includes an on-the-fly analysis showing the odds in a game of Craps. The example shows a simple use of NiFi (HDF) handling multiple streams of dice data, each one simulating a separate Craps table, and presents a Monte Carlo simulation of 1000 runs, emulating one throw per second.

To demonstrate this capability we generate some random dice data, with each stream generated by an independent thread. We throttle the threads to sleep for a second between throws, mainly to demonstrate an ongoing stream of data over time. Source for the data generation: https://github.com/eorgad/Dice-nifi-streams-example/tree/master/Dice-nifi-stream-example/Dice-nifi-streams/src
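For context, the generator in the linked repository boils down to something like the sketch below: one thread per table, two dice thrown once per second, and each throw appended to a file that NiFi can pick up. The class name, file names, and CSV layout here are illustrative, not the actual TwoCrapsTest source.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

// Simplified sketch of a per-table dice stream: each thread simulates one Craps table,
// throws two dice once per second, and appends the result to a file for NiFi to ingest.
public class DiceStreamSketch implements Runnable {
    private final String outFile;
    private final Random rnd = new Random();

    public DiceStreamSketch(String outFile) {
        this.outFile = outFile;
    }

    @Override
    public void run() {
        for (int i = 0; i < 1000; i++) {            // 1000 throws per table, as in the article
            int d1 = rnd.nextInt(6) + 1;
            int d2 = rnd.nextInt(6) + 1;
            try (FileWriter w = new FileWriter(outFile, true)) {
                w.write(System.currentTimeMillis() + "," + d1 + "," + d2 + "," + (d1 + d2) + "\n");
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            try {
                Thread.sleep(1000);                 // throttle: one throw per second
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public static void main(String[] args) {
        // two independent streams, each simulating a separate Craps table
        new Thread(new DiceStreamSketch("table1.csv")).start();
        new Thread(new DiceStreamSketch("table2.csv")).start();
    }
}
```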
We use NiFi to create a streaming flow of that data as it is being generated. The simulation uses the following NiFi processors:

- HandleHttpRequest (starts an HTTP server and listens for HTTP requests)
- RouteOnAttribute (routes FlowFiles based on their attributes using the Attribute Expression Language)
- ExecuteStreamCommand (executes an external command on the contents of a flow file and creates a new flow file with the results of the command)
- HandleHttpResponse (sends an HTTP response to the requestor that generated a FlowFile)
- Site-to-site (to send data from one instance of NiFi to another)

You can use a template that handles each stream with an individual NiFi flow: https://github.com/eorgad/Dice-nifi-streams-example/blob/master/Multi-stream-dice-example.xml
The NiFi flow looks as follows when you import the XML template:

Web services: We can use NiFi to host web services either on your HDP instance (you can use the edge node or the same host serving Ambari) or on a standalone server. However, in many cases organizations already run web servers internally and externally, so you can link the UI example to an existing instance or create one using the following steps.

Set up a local web service: You can set up the web service either on a server or on your local Mac for demo purposes.

2.1. Installation on a CentOS server: to install Apache, open a terminal and run: sudo yum install httpd

2.2. Make configuration changes for your web service: vi /etc/httpd/conf/httpd.conf and place the content of the UI folder in the DocumentRoot location (DocumentRoot "/var/www/html") so it can be served by the web server.

2.3. Start Apache by running: sudo service httpd start

Our simple architecture looks as follows:
3. You can import the Java project into Eclipse, or run TwoCrapsTest from the CLI, to generate the two files that NiFi streams to your web instance. The template includes a port that you can use to stream the feed via site-to-site to another NiFi instance, such as one running on the edge node of your HDP cluster (the HDP 2.5 sandbox VM was used for this example).
When you launch this example you will be able to view real-time streaming data from NiFi, handled by your web server, showing a real-time analysis of a game of Craps. Each stream represents one table. The bars show the accumulated dollars won or lost for a theoretical gamble on one of the options: pass line, six, eight, five, nine, and so on.

This simulation runs only 1000 iterations per thread (per table, in this case), so to get a better approximation of the odds you can scale the Monte Carlo simulation up to a million throws per thread.

The following is the result of launching index.html with the two streams displayed in real time as they arrive:

The following is a bell curve, with reference to UI/dice8.html
04-01-2016 06:08 PM
1 Kudo
@eorgadn You should wrap the geoDistance functions as Hive UDFs; that will be a lot friendlier for most people who want to use them in Hive.
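For illustration only, a wrapper along these lines would let people call the function straight from HiveQL. The package, class name, and haversine math below are my own sketch, not necessarily the original geoDistance implementation.

```java
package com.example.hive.udf;  // hypothetical package

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

// Simple Hive UDF sketch: great-circle (haversine) distance in kilometers.
@Description(name = "geo_distance",
             value = "_FUNC_(lat1, lon1, lat2, lon2) - distance in km between two points")
public class GeoDistanceUDF extends UDF {
    private static final double EARTH_RADIUS_KM = 6371.0;

    public Double evaluate(Double lat1, Double lon1, Double lat2, Double lon2) {
        if (lat1 == null || lon1 == null || lat2 == null || lon2 == null) {
            return null;  // propagate NULLs the way built-in Hive functions do
        }
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return EARTH_RADIUS_KM * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }
}
```

After packaging this into a jar, users would register it with ADD JAR and CREATE TEMPORARY FUNCTION geo_distance AS 'com.example.hive.udf.GeoDistanceUDF' and call it like any built-in function.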
03-11-2016 10:56 PM
1 Kudo
That certainly works, but going forward wouldn't it cause problems?
11-03-2015 08:57 PM
5 Kudos
Spark reads from HDFS and submits jobs to YARN, so the security that Ranger manages for both HDFS and YARN works with Spark. From a security point of view this is very similar to MapReduce jobs running on YARN. Since Spark reads from HDFS using the HDFS client, the HDFS TDE (transparent data encryption) feature is transparent to Spark: with the right key permissions for the user running the Spark job, there is nothing to configure in Spark itself. Knox isn't yet relevant to Spark; in the future, when we have a REST API for Spark, we will integrate Knox with it.
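To illustrate that transparency, a minimal sketch of a TDE setup follows; the key, zone, and file names are made up for the example, and creating the zone requires HDFS superuser privileges and a running Hadoop KMS.

```
# create an encryption key and an HDFS encryption zone
hadoop key create demo_key
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName demo_key -path /secure

# files written into the zone are encrypted/decrypted inside the HDFS client
hdfs dfs -put events.csv /secure/

# Spark reads the path like any other HDFS data -- no Spark-side encryption settings
spark-shell
scala> sc.textFile("/secure/events.csv").count()
```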