Member since
05-02-2016
Posts: 154
Kudos Received: 54
Solutions: 14
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2127 | 07-24-2018 06:34 PM |
 | 3327 | 09-28-2017 01:53 PM |
 | 790 | 02-22-2017 05:18 PM |
 | 7684 | 01-13-2017 10:07 PM |
 | 1923 | 12-15-2016 06:00 AM |
12-07-2022
09:46 PM
While converting from JSON to Avro format, how do I get a logical type in the Avro output? And what do we need to add to the JSON data to get the logical type in Avro?
... View more
01-07-2021
03:21 PM
I tried "java.arg.8=-Duser.timezone=America/New_York". It does not work for me. I posted one question earlier: https://stackoverflow.com/questions/65620632/why-do-executesqlrecord-and-csvrecordsetwriter-updated-the-time-zone-of-datetime
... View more
10-07-2020
05:07 AM
Hello Shishir, would you mind explaining how we migrate a standalone NiFi setup to cluster mode? Thanks, snm1523
... View more
07-09-2020
02:23 PM
@sajidiqubal Can you share more info? The solution is simple: the Spark streaming job needs to find the kafka-jaas file and the corresponding keytab. Make sure both paths are accessible on all machines, so the kafka-jaas file and the keytab need to be in a local folder and not in HDFS. If you need them in HDFS, they need to be shipped as part of the spark-submit --files and --keytab arguments (IIRC). In newer versions of Kafka, you can add the JAAS info as a Kafka parameter using sasl.jaas.config; see the example below. In that case you just need the keytab to be available on all machines.
sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
useKeyTab=true \
storeKey=true \
keyTab="/etc/security/keytabs/kafka_client.keytab" \
principal="kafkaclient1@EXAMPLE.COM";
You will also need the parameter sasl.kerberos.service.name=kafka. From your error it looks like the code is not able to find one of the files, either the jaas.conf or the keytab. Please check and make sure the file is in the right path on all YARN nodes.
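If it helps, here is a rough PySpark sketch of the sasl.jaas.config approach using the Structured Streaming Kafka source; the broker address, topic name and console sink are placeholders I made up, and only the keytab path and principal come from the config above.

# Rough sketch only; broker/topic are placeholders, keytab/principal taken from the config above.
from pyspark.sql import SparkSession

jaas = (
    'com.sun.security.auth.module.Krb5LoginModule required '
    'useKeyTab=true storeKey=true '
    'keyTab="/etc/security/keytabs/kafka_client.keytab" '
    'principal="kafkaclient1@EXAMPLE.COM";'
)

spark = SparkSession.builder.appName("kafka-sasl-example").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:6667")      # placeholder broker
      .option("subscribe", "my_topic")                        # placeholder topic
      .option("kafka.security.protocol", "SASL_PLAINTEXT")
      .option("kafka.sasl.kerberos.service.name", "kafka")
      .option("kafka.sasl.jaas.config", jaas)                 # avoids shipping a separate jaas.conf
      .load())

query = df.writeStream.format("console").start()
query.awaitTermination()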
... View more
03-03-2020
11:39 PM
Can you please post the template? I am trying to solve the same problem, and it would be a great help for me.
... View more
08-16-2018
05:48 AM
3 Kudos
I was trying to dig up some TPS benchmark info for NiFi ListenHTTP, which allows you to provide a REST proxy, with no luck, so I tried to create one myself. This is a very rough effort; I could improve it by capturing how many client instances I have running, so we could see the TPS drop and rise as the clients drop and rise. At some point I ran out of servers and resources to run more clients; you will see the chart drop off at the end before going back up. This is where I noticed some of my clients had crashed and I restarted them. Anyway, getting to the matter.

Purpose: The benchmark only measures how much load a ListenHTTP processor can handle when subjected to real-world traffic.

Setup: The NiFi cluster is set up on an m4.4xlarge (16 CPU cores, 32 GB RAM). The node is also hosting the Kafka broker and ZooKeeper. The HDF version is 3.1.1. The NiFi flow is a simple ListenHTTP processor forwarding to UpdateAttribute, and UpdateAttribute burns the flowfile. The idea was to only measure ListenHTTP performance for receiving a message, creating the flowfile, responding to the client and forwarding the message to the next processor; the benchmark tries to measure what kind of peak TPS could be achieved. The NiFi instance is running an S2S provenance reporting task, which forwards provenance events to another NiFi instance, which in turn forwards them to a Kafka topic. The data is then ingested into Druid using Kafka ingestion, and the timestampmillis column of the provenance event is used by Druid for indexing. For the client piece I have a simple Python script that constantly calls the REST service exposed by ListenHTTP, passing the JSON below. The timestamp in the JSON is just to ensure the messages are different: {"key":"client1","timestamp":<current_unix_time>}. The Python script is a simple infinite loop in the format below.

import requests
import time
import random
from multiprocessing import Process
import os
import json
import threading
from time import sleep

def call_rest():
    # post one message with the current epoch-millis timestamp
    values = ["client1"]
    value = random.choice(values)
    start = time.time()
    timestamp = round(time.time() * 1000)
    r = requests.post('http://nifi1.field.hortonworks.com:19192/test',
                      data=json.dumps({"key": value, "timestamp": timestamp}))

while True:
    # keep spawning batches of 5 request threads to generate load
    threads = []
    for i in range(5):
        t = threading.Thread(target=call_rest)
        threads.append(t)
        t.start()

I ran 5 instances of the script across 8 servers to help me generate the kind of volume I needed for this test.

Dashboard: Once the data is in Druid, I can use Superset to chart and aggregate the provenance events at an interval of one second. Since the provenance events can take a few minutes to arrive, I used a one-minute window from 5 minutes ago, meaning from t-5 to t-4. I also filtered the query to only look for componentType=ListenHTTP and eventType=RECEIVE. From the chart I could see that the rate fluctuates from a maximum of about 3000 TPS down to a minimum of around 600 TPS. To get a more even aggregation, I aggregated this over 5-minute intervals over an hour to see what we are doing on average, and the chart was pretty promising: on average we are looking at 300k messages per 5 minutes, which is around 1000 TPS.

Conclusion: The 1000 TPS we see from NiFi in this load test is probably not the maximum load it can handle; I could run my clients on more servers and see if we get higher numbers. But at 1000 TPS, NiFi should be able to handle most web-based traffic. Additionally, this is a cluster with a single NiFi node; we can scale linearly by adding more nodes to the cluster.
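As a side note, the script above imports multiprocessing but only uses threads. Below is a rough sketch, not part of the original test, of how several client processes could be launched on a single box instead; the endpoint URL is the same placeholder used in the script above.

# Rough sketch only: launch several independent client processes on one machine.
import json
import time
import requests
from multiprocessing import Process

URL = "http://nifi1.field.hortonworks.com:19192/test"

def client_loop():
    # each process posts messages in an infinite loop
    while True:
        payload = {"key": "client1", "timestamp": round(time.time() * 1000)}
        requests.post(URL, data=json.dumps(payload))

if __name__ == "__main__":
    procs = [Process(target=client_loop) for _ in range(5)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()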
... View more
- Find more articles tagged with:
- api
- benchmark
- Cloud & Operations
- FAQ
- listenhttp
- NiFi
- performance
08-07-2018
05:23 PM
3 Kudos
NiFi builds on HDF 3.1.2 and HDF 3.1.0 fail with a dependency issue when trying to push data to ADLS. This is because the new version of the ADLS client depends on a Hadoop 2.8 feature which is not available in the Hadoop 2.7.3 referenced by NiFi. To fix this you can build NiFi again, either against Hadoop 2.8 or against HDP 2.6.x, which should have the classes that ADLS depends on. To do that:

git clone the nifi repository
cd <nifi-repo-home>/nifi-nar-bundles/nifi-hadoop-libraries-bundle/nifi-hadoop-libraries-nar/
vi pom.xml

Add a hadoop.version property to the pom.xml as shown below; if it is already set, no change is needed. Change the nifi version in the parent to match the NiFi version you are running:

<parent>
    <groupId>org.apache.nifi</groupId>
    <artifactId>nifi-hadoop-libraries-bundle</artifactId>
    <version>1.5.0.3.1.2.0-7</version>
</parent>
<artifactId>nifi-hadoop-libraries-nar</artifactId>
<packaging>nar</packaging>
<properties>
    <maven.javadoc.skip>true</maven.javadoc.skip>
    <source.skip>true</source.skip>
    <curator.version>2.11.0</curator.version>
    <hadoop.version>2.7.3</hadoop.version>
</properties>

cd ..
vi pom.xml

Change the nifi-nar-bundles version to match your NiFi version as shown below:

<parent>
    <groupId>org.apache.nifi</groupId>
    <artifactId>nifi-nar-bundles</artifactId>
    <version>1.5.0.3.1.2.0-7</version>
</parent>

Run the Maven build using the command below, changing hadoop.version to match your version of Hadoop. The NAR will be available under the nifi-hadoop-libraries-nar/target folder; take that NAR and replace the existing NAR under nifi/lib.

mvn clean package -Dhadoop.version=2.7.3.2.6.5.0-292

Ensure you have the right JARs for ADLS downloaded into a folder accessible to NiFi, and add the folder path to the Additional Classpath Resources property in PutHDFS. For HDP 2.6.x you can find the needed JARs under /usr/hdp/2.6.x..../hadoop/. You will need the following JARs:

azure-data-lake-store-sdk-2.2.5.jar
hadoop-azure-2.7.3.2.6.5.0-292.jar
hadoop-azure-datalake-2.7.3.2.6.5.0-292.jar

Start NiFi and push files; hope it works.
... View more
07-30-2018
09:09 PM
So far I have been able to get this working. Traffic flows fine through the final NLB, but we want to do some better load testing. I have put together a post that explains: https://everymansravings.wordpress.com/2018/07/27/apache-nifi-behind-an-aws-load-balancer-w-minifi/
... View more
10-17-2018
05:01 AM
Sorry, just saw this message. Could you provide more details?
... View more
02-07-2018
04:22 AM
1 Kudo
The solution in this article was tested for DB2 and could work as well for other databases that are not available in the GenerateTableFetch Database Type dropdown. I was trying to develop a flow that extracted data from a few tables in a DB2 database, and the goal was to land that data in HDFS. Our flow was a basic ListDatabaseTables followed by a GenerateTableFetch followed by an ExecuteSQL. The idea was to generate multiple queries that could extract data from the DB2 nodes in parallel, utilizing all 3 NiFi nodes in the cluster. The problem we ran into was that there is no option for DB2 in the GenerateTableFetch Database Type property, which means we had to select the Generic option. Upon running the flow, the select query failed to execute on DB2. On investigating, we found that the query generated by GenerateTableFetch looked like this:

select id,name,update_dt,state,city,address from customers where update_dt<='01-01-2018 12:00:00' order by update_dt offset 10000 limit 10000

The limit and offset syntax does not work with DB2, so we basically could not use the query generated by GenerateTableFetch. Our first thought was to see if we could use ReplaceText to hack "offset 10000 limit 10000" into a format DB2 likes, "offset 10000 rows fetch first 10000 rows only", but I quickly realised that would not be easy. That's when NiFi Expression Language came to the rescue. I noticed that, apart from generating the query in the flowfile content, GenerateTableFetch also puts the different pieces that form the query into the flowfile's attributes: pieces like the columns to output, the column to order on, and the where clause. Using these attributes and EL, I could recreate the query. Update the ExecuteSQL "SQL select query" property to something like this:

select ${generatetablefetch.columnNames} from ${generatetablefetch.tableName} where ${generatetablefetch.whereClause} order by ${generatetablefetch.maxColumnNames} offset ${generatetablefetch.offset} rows fetch first ${generatetablefetch.limit} rows only

and voila, it works. For the example table above, this renders to a DB2-friendly query like: select id,name,update_dt,state,city,address from customers where update_dt<='01-01-2018 12:00:00' order by update_dt offset 10000 rows fetch first 10000 rows only. To see the different attributes GenerateTableFetch writes to the output flowfile, refer to the documentation here.
... View more
- Find more articles tagged with:
- db2
- expression-language
- How-ToTutorial
- NiFi
- Sandbox & Learning
09-28-2017
02:30 PM
https://community.hortonworks.com/articles/51343/two-approaches-to-perform-dynamic-list-filtering-w.html see if this helps.
... View more
10-09-2017
02:27 PM
I had to restart my NiFi processes, but that was just a band-aid. As such YMMV. I believe what is happening is that the TGT renewal isn't occurring properly and it causes the whole process to stop.
... View more
12-20-2017
06:47 AM
I was going through some OCR-based processing of PDFs. It looks like it works better if each page is split into a monochrome image, and examples online show Ghostscript as an option. I was able to leverage this processor to extract the images and, with a property change, convert them to grayscale if needed. I could then send these to the OCR processor for extraction. @Jeremy Dyer
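For reference, here is a rough standalone sketch of the Ghostscript step (not the NiFi processor mentioned above), splitting a PDF into one grayscale PNG per page before OCR; the paths and resolution are made-up examples and it assumes the gs binary is installed.

# Rough sketch: split a PDF into grayscale page images with Ghostscript.
import subprocess

def pdf_to_grayscale_pages(pdf_path, out_pattern="page-%03d.png", dpi=300):
    # pnggray renders each page as a grayscale PNG; %03d numbers the pages
    subprocess.run(
        ["gs", "-dNOPAUSE", "-dBATCH", "-dQUIET",
         "-sDEVICE=pnggray",
         "-r%d" % dpi,
         "-sOutputFile=" + out_pattern,
         pdf_path],
        check=True)

pdf_to_grayscale_pages("/tmp/input.pdf")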
... View more
09-26-2017
04:22 AM
With NiFi 1.3 you have the record-based processors, so you can forward the output of ExecuteSQL or your custom processor to QueryRecord. Set it up to read Avro, then add a query like: select col_a,col_b..'${query_id}','${query_time}','${query_end_time}' from flowfile. This will add the data you need to the query result set.
... View more
03-15-2019
03:04 PM
Use the ExecuteStreamCommand Processor and use the sed command something like this. This worked for me.
... View more
11-20-2017
12:36 PM
Folks, has anyone been able to get the parameter from SSRS sent over to Hive using Hortonworks? I am facing the same issue as posted above and have tried $ and ?, and both do not work; attached are the errors that pop up. Any suggestions? Goal to achieve: pick a parameter value from the parameters of the SSRS report and send it to Hive through Hortonworks.
... View more
02-03-2017
04:56 PM
3 Kudos
I was looking for a way to easily forward and analyze the provenance data that is available in NiFi. There were a couple of options available. You could use the NiFi REST API to search for provenance data and then use the results for analysis or for storing in a database. The other alternative was to set up a SiteToSiteProvenanceReportingTask in NiFi, which forwards provenance events to a flow in NiFi. Option 1 is a very techy option: you could point your UI directly at the REST API and present a nice provenance visual with bulk replay capabilities. But then it makes the developer responsible for keeping up with changes in the NiFi REST API, and it would be nice not to have that direct dependency. Also, you might want to lock down the REST API in production. Option 2 is very easy, but it is limited in where I can send those provenance events. The Apache NiFi engineering team resolved this situation with the ScriptedReportingTask. It gives you an easy way of setting up provenance reporting in NiFi and forwarding it to an endpoint of your choice. You also do not have a direct dependency between your application and the NiFi REST API, and you can use ScriptedReportingTask to massage the events into a format that works with your application/endpoint. I chose Groovy as the language for my script, but there are options for Python, JavaScript and a few others.

Once you are logged in to NiFi, click the menu in the top right corner and select the Controller Settings option. On the Controller Settings dialog, choose the Reporting Tasks tab and click the + in the top right corner to create a new reporting task. On the Add Reporting Task dialog, search for ScriptedReportingTask, then double-click the ScriptedReportingTask option in the results, or select the row and click Add. You will see a new ScriptedReportingTask in the reporting tasks list. Click the pencil icon to edit the reporting task. In the reporting task window, select Groovy as the Script Engine and paste the script below into Script Body. Make sure to change the location of the file your events will be written to.

import groovy.json.*;
import org.apache.nifi.components.state.StateManager;
import org.apache.nifi.reporting.ReportingContext;
import org.apache.nifi.reporting.EventAccess;
import org.apache.nifi.provenance.ProvenanceEventRepository;
import org.apache.nifi.provenance.ProvenanceEventRecord;
import org.apache.nifi.provenance.ProvenanceEventType;

// 'context' is the ReportingContext handed to the script by the ScriptedReportingTask
final StateManager stateManager = context.getStateManager();
final EventAccess access = context.getEventAccess();
final ProvenanceEventRepository provenance = access.getProvenanceRepository();

log.info("starting event id: " + Long.toString(1));
// fetch up to 100 provenance events starting from event id 1
final List<ProvenanceEventRecord> events = provenance.getEvents(1, 100);
log.info("ending event id: " + events.size());

// write each event to a local file as pretty-printed JSON
def outFile = new File("/tmp/provenance.txt");
outFile.withWriter('UTF-8') { writer -> events.each{ event -> writer.writeLine(new JsonBuilder(event).toPrettyString()) } }

Click OK and Apply, then click the "Play" button to activate the reporting task. I set the scheduling frequency for the task to 10 secs so I could see the results right away; you can set it to a higher value as needed. You should see the events appear in /tmp/provenance.txt in JSON format; you could use other formats if needed, and perhaps skip the pretty-printing for better performance. The ScriptedReportingTask is responsible for the ReportingContext, which is available to your script as the context object. You can log information to the NiFi log using the ComponentLog log object, which is also passed to you by the reporting task. If you need any other variables to be set from the NiFi task, you can define them as dynamic properties. My script is very simple: it looks at 100 provenance events starting from the first provenance event. You can use the StateManager to keep track of the last provenance event that you received; have a look at the implementation by @jfrazee to see how provenance events can be collected incrementally: https://github.com/jfrazee/nifi-provenance-reporting-bundle Thank you to @Matt Burgess for putting together this very useful reporting task component. Hope this is useful.
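Since the task also offers a python (Jython) Script Engine, here is a rough, untested sketch of the same idea in Jython; the JSON field names are my own choice for illustration and not what JsonBuilder produces.

# Rough Jython sketch for ScriptedReportingTask (Script Engine = python); untested.
# The task injects the same 'context' (ReportingContext) and 'log' variables.
import json

access = context.getEventAccess()
provenance = access.getProvenanceRepository()
events = provenance.getEvents(1, 100)   # first 100 provenance events

out = open("/tmp/provenance.txt", "w")
try:
    for e in events:
        record = {
            "eventId": e.getEventId(),
            "eventType": str(e.getEventType()),
            "componentId": e.getComponentId(),
            "componentType": e.getComponentType(),
            "flowFileUuid": e.getFlowFileUuid(),
            "timestampMillis": e.getEventTime(),
        }
        out.write(json.dumps(record) + "\n")
finally:
    out.close()
log.info("wrote " + str(len(events)) + " provenance events")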
... View more
- Find more articles tagged with:
- Data Ingestion & Streaming
- FAQ
- NiFi
- nifi-reporting
- provenance
- scriptedreportingtask
- site2siteprovenace
01-26-2017
05:36 PM
1 Kudo
@srini
The default "FlowFIle Policy" is "use clone" within the advanced UI. For an "and" operation, you will want change that to "use original". Just keep in mind (does not apply here since both rules are unique) that if more then one rule tries to perform the exact same "Actions", only the last rule in the list that will be applied to the original FlowFile on output. Both "Rules" are run against every FlowFile that passes through this UpdateAttribute processor, so it will work in cases where only one or the other is set and cases where both are not set. Rules are operated on in the order they are listed. You can drag rules up and down in the list to order them how you like. If this answer addresses your question, please click "accept". Thank you, Matt
... View more
01-23-2017
03:11 PM
Here is a very good blog post showing a performance comparison of Hive LLAP and Impala. It uses 10 TB of TPC-DS data warehouse data to get a solid apples-to-apples comparison of the two. http://hortonworks.com/blog/apache-hive-vs-apache-impala-query-performance-comparison/
... View more
12-15-2016
03:11 PM
5 Kudos
A usual question that comes up is about using Snappy and other compression codecs when loading data into HDFS with the NiFi PutHDFS processor. A common error that users come across is "java.lang.UnsatisfiedLinkError". This error occurs because Snappy and the other compression codecs are part of the native Linux binaries, which the Java libraries for these codecs use to do the actual compression, and those binaries are not on the JVM's library path. Since the JVM cannot find them, the binding fails and you get a java.lang.UnsatisfiedLinkError. Follow these steps to resolve the issue.

1. Copy the native folder containing the compression libraries from one of your Hadoop nodes:
cd /usr/hdp/x.x.x.x-xx/hadoop/lib/
tar -cf ~/native.tar native/

2. scp the native.tar from your Hadoop node to your NiFi node and untar it to a location of your choice; in my case I used /home/myuser/hadoop/:
cd ~
mkdir hadoop
cd hadoop
tar -xf /path/to/native.tar

3. Go to your NiFi folder, open conf/bootstrap.conf and add the following JVM argument, with java.library.path pointing to the folder containing the native Hadoop binaries (/home/myuser/hadoop/native in my case):
java.arg.15=-Djava.library.path=/home/myuser/hadoop/native/
I used 15 because that was the number of the last JVM argument in my bootstrap.conf. Alternately, you can edit bin/nifi-env.sh and add:
export LD_LIBRARY_PATH=/home/myuser/hadoop/native/

4. Restart NiFi.
... View more
- Find more articles tagged with:
- gethdfs
- HDFS
- How-ToTutorial
- NiFi
- nifi-processor
- Sandbox & Learning
- snappy
12-13-2016
04:44 PM
After running ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar everything works (restart & service check)! Thank you so much!
... View more
12-15-2016
06:16 PM
2 Kudos
Thanks @Karthik Narayanan. I was able to resolve the issue. Before diving into the solutions, I should make the statement below: with NiFi 1.0 and 1.1, LZO compression cannot be achieved using the PutHDFS processor; the only supported compressions are the ones listed in the Compression codec drop-down. With the LZO-related classes present in core-site.xml, the NiFi processor fails to run. The suggestion from the previous HCC post was to remove those classes, but they needed to be retained so that NiFi's copy and HDP's copy of core-site.xml stay in sync.
NiFi 1.0
I created the hadoop-lzo jar by building it from source, added it to the NiFi lib directory and restarted NiFi. This resolved the issue and I am able to proceed using PutHDFS without it erroring out.
NiFi 1.1
Configure the processor's additional classpath to point to the jar file. No restart required.
Note: this does not provide LZO compression; it just lets the processor run without an ERROR even when you have the LZO classes in core-site.xml.
UNSATISFIED LINK ERROR WITH SNAPPY
I also had an issue with the Snappy compression codec in NiFi. I was able to resolve it by setting the path to the .so files. This did not work on the ambari-vagrant boxes, but I was able to get it working on an OpenStack cloud instance; the issue on the VirtualBox VM could be systemic. To resolve the link error, I copied the .so files from the HDP cluster and recreated the links, and, as @Karthik Narayanan suggested, added the java library path pointing to the directory containing the .so files. Below is the list of .so files and links, and below that is the bootstrap configuration change.
... View more
12-07-2016
07:33 PM
Karthik, That is a good idea. I can manually install Maven! Thank you so much for answering my question.
... View more
12-07-2016
03:26 PM
@Baruch AMOUSSOU DJANGBAN
you can do this as well. If you have installed ClusterShell on the cluster, you can use the simple step below to stop and start the agents:

#!/bin/sh
clush -g all ambari-agent restart

Refer to the link below for more info about the open-source ClusterShell: https://github.com/cea-hpc/clustershell/downloads
... View more
11-30-2016
03:26 PM
1 Kudo
See the comment on the answer above for how to get the configs locally.
... View more
11-18-2017
01:09 PM
repo.txt @Karthik Narayanan Thank you. I tried 2.6.1 but still have issues, so I have now downloaded the HDP 2.5 sandbox. Can you please guide me on what to change in the /var/lib/ambari-server/resources/stacks/HDP/2.5/repos/repoinfo.xml file? I mean the exact change I need to make. I have attached that file to this reply. Thank you very much.
... View more