Member since: 05-02-2016
Posts: 154
Kudos Received: 54
Solutions: 14
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 2110 | 07-24-2018 06:34 PM |
|  | 3312 | 09-28-2017 01:53 PM |
|  | 790 | 02-22-2017 05:18 PM |
|  | 7661 | 01-13-2017 10:07 PM |
|  | 1910 | 12-15-2016 06:00 AM |
09-27-2017
04:10 PM
Can the NiFi user access that keytab? Try using the keytab with kinit, then connect with beeline and see if that works. You can also add the JVM argument -Dsun.security.krb5.debug=true to NiFi; that will give you detailed logs to figure out if there is anything wrong with the TGT.
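A minimal sketch of that check, assuming a hypothetical principal, keytab path, and HiveServer2 URL; substitute your own values:

# Verify the keytab as the NiFi service user (principal and path are examples)
sudo -u nifi kinit -kt /etc/security/keytabs/nifi.service.keytab nifi/host.example.com@EXAMPLE.COM
klist

# Then test connectivity with beeline (JDBC URL is an example)
beeline -u "jdbc:hive2://hive.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM"

# To enable the Kerberos debug flag, add a java.arg line to conf/bootstrap.conf
# (the index just needs to be unused):
# java.arg.16=-Dsun.security.krb5.debug=true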
09-26-2017
04:55 AM
5 Kudos
In this article we will go over how to use NiFi to ingest PDFs and, during ingestion, use a custom Groovy script with the ExecuteScript processor to extract images from each PDF. The images will be tagged with the PDF filename, page number, and image number, so they can be indexed in HBase/Solr for quick searching, machine learning, or other analytics.
The Groovy code below depends on the following jars. Download and copy them to a folder on your computer. In my case they were under /var/pdimagelib/.
pdfbox-2.0.7.jar fontbox-2.0.7.jar jai_imageio.jar commons-logging-1.1.1.jar
Copy the code below to a file on your computer; in my case the file was under /var/scripts/pdimage.groovy.

import java.nio.charset.*;
import org.apache.commons.io.IOUtils;
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.FileOutputStream;
import java.util.Iterator;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import javax.imageio.ImageIO;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.io.InputStreamCallback;
import org.apache.nifi.processor.io.OutputStreamCallback;
def flowFile = session.get()
if (flowFile == null) {
    return
}
def ffList = new ArrayList<FlowFile>()
try {
    session.read(flowFile, { inputStream ->
        PDDocument document = PDDocument.load(inputStream)
        int imageNum = 1
        int pageNum = 1
        // Walk every page of the PDF and inspect its resources for image XObjects
        document.getDocumentCatalog().getPages().each() { page ->
            PDResources pdResources = page.getResources()
            pdResources.getXObjectNames().each() { cosName ->
                if (cosName != null) {
                    PDXObject image = pdResources.getXObject(cosName)
                    if (image instanceof PDImageXObject) {
                        PDImageXObject pdImage = (PDImageXObject) image
                        BufferedImage bufferedImage = pdImage.getImage()
                        // Each extracted image becomes a child flow file; it inherits the
                        // parent's attributes, including the PDF filename
                        FlowFile imgFF = session.create(flowFile)
                        imgFF = session.write(imgFF, { outputStream ->
                            ImageIO.write((RenderedImage) bufferedImage, "png", outputStream)
                        } as OutputStreamCallback)
                        imgFF = session.putAttribute(imgFF, "imageNum", String.valueOf(imageNum))
                        imgFF = session.putAttribute(imgFF, "pageNum", String.valueOf(pageNum))
                        ffList.add(imgFF)
                        imageNum++
                    }
                }
            }
            pageNum++
        }
        document.close()
    } as InputStreamCallback)
    session.transfer(ffList, REL_SUCCESS)
    session.remove(flowFile)
} catch (Exception e) {
    log.error("Failed to extract images from PDF", e)
    session.remove(ffList)
    session.transfer(flowFile, REL_FAILURE)
}
Below is a screenshot of the ExecuteScript processor after it has been set up correctly. To ingest the PDF, I used a simple GetFile, though this approach should work for PDFs ingested with any other NiFi processor. Below is a snapshot of the NiFi flow. When a PDF is ingested, ExecuteScript will leverage Groovy and PDFBox to extract images. The images will be tagged with the PDF filename, page number, and image number. They can then be sent to HBase or any indexing solution for search or analytics. Hope you find the article useful. Please comment with your thoughts/questions.
09-26-2017
04:22 AM
With NiFi 1.3 you have the record-based processors, so you can forward the output of ExecuteSQL or your custom processor to QueryRecord. Set it up to read Avro, then add a query along these lines: select col_a, col_b, '${query_id}', '${query_time}', '${query_end_time}' from FLOWFILE. This will add the data you need to the query result set.
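A minimal sketch of such a QueryRecord query; col_a/col_b and the attribute names are examples, and FLOWFILE is the table name QueryRecord exposes for the incoming flow file:

SELECT
  col_a,
  col_b,
  '${query_id}'       AS query_id,
  '${query_time}'     AS query_time,
  '${query_end_time}' AS query_end_time
FROM FLOWFILE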
08-09-2017
04:44 PM
or select(MyDataFrame, MyDataFrame$well)
08-09-2017
04:43 PM
Maybe try select(MyDataFrame, "well")
06-14-2017
06:35 PM
@Jim Dolan Can you try the steps in this link: https://community.hortonworks.com/articles/87434/configure-hive-hplsql.html. Also, I would check that the HADOOP_HOME variable is set, along with HADOOP_CLASSPATH.
04-19-2017
09:54 PM
Is R installed and SparkR enabled on your cluster? If so, you can simply do sparkR.init(master="yarn-client").
02-24-2017
03:29 PM
Can you please give us some context? Where are you getting this file from? SFTP? Whether you explicitly do this or not, the flow file received in NiFi will always be saved to disk. If this can be done easily with ExecuteProcess, it is a good option and it really will not impact your flow's performance. NiFi is very efficient at file IO.
02-22-2017
05:18 PM
1 Kudo
Maybe a two-step approach should be used: in the first ListFile, just look for *.xml files, and then, when you have executed the script for the XML, trigger another listing for *.dat files.
02-13-2017
06:23 PM
@Michal R Do you have an attribute myfile.cat=somevalue in the flow file, where nifi.prefix=myfile and suffix=cat, that you want to retrieve? Are you trying to retrieve the value of myfile.cat and assign it to nifi.filename? I am not sure EL can be nested the way you did it, but instead of what you did, you can try: nifi.filename = ${${nifi.prefix}.${suffix}}
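To illustrate, a hypothetical set of flow file attributes and what the nested expression is intended to resolve to, assuming your NiFi version evaluates nested EL this way:

nifi.prefix = myfile
suffix      = cat
myfile.cat  = somevalue

${${nifi.prefix}.${suffix}}  => inner expressions resolve to "myfile.cat",
                                then that attribute is looked up => "somevalue"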
02-08-2017
03:56 AM
I think you can add the date as a parameter in your SSRS report and query; SSRS will replace it for you at runtime.
02-08-2017
03:38 AM
Do you mean query parameters, e.g. where transaction_dt = [date from parameter]? SSRS has the capability to create parameters on reports: https://technet.microsoft.com/en-us/library/aa337401(v=sql.105).aspx
02-07-2017
06:02 AM
@Mark Wallace df.withColumn("id", when(df.col("deviceFlag").equalTo(1), concat(df.col("device"), lit("#"), df.col("domain"))).otherwise(df.col("device")));
02-07-2017
05:50 AM
You can try using PutHDFS to push the CSV to HDFS, in a tmp location. Then you can INSERT OVERWRITE the CSV directly into an ORC table using the PutHiveQL processor.
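A minimal HiveQL sketch of that pattern, assuming hypothetical table names, schema, and staging path; adjust them to your data:

-- External table over the CSV files that PutHDFS dropped in the tmp location
CREATE EXTERNAL TABLE IF NOT EXISTS csv_staging (
  id STRING,
  value STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/csv_staging';

-- Rewrite the staged rows into the ORC-backed table
INSERT OVERWRITE TABLE events_orc
SELECT id, value FROM csv_staging;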
02-06-2017
04:27 PM
Try with a "1"; deviceFlag may be a string, so it needs to be compared with a string.
02-06-2017
03:36 AM
1 Kudo
Try
df.withColumn("id", when($"deviceFlag"===1, concat($"device", lit("#"), $"domain")).otherwise($"device"));
02-03-2017
04:56 PM
3 Kudos
I was looking for a way to easily forward and analyze the provenance data that is available in NiFi. There were a couple of options available.

You could use the NiFi REST API to search for provenance data and then use the results for analysis or storage in a database. The other alternative was to set up a SiteToSiteProvenanceReportingTask in NiFi, which forwards provenance events to a flow in NiFi.

Option 1 is a very techy option: you could point your UI directly at the REST API and present a nice provenance visual with bulk replay capabilities. But it makes the developer responsible for keeping up with changes in the NiFi REST API, and it would be nice not to have that direct dependency. Also, you might want to lock down the REST API in production. Option 2 is very easy, but it is limited in where I can send those provenance events.

The Apache NiFi engineering team resolved this situation with the ScriptedReportingTask. It gives you an easy way of setting up provenance reporting in NiFi and forwarding the events to an endpoint of your choice. You also do not have a direct dependency between your application and the NiFi REST API, and you can use ScriptedReportingTask to massage the events into a format that works with your application/endpoint. I chose Groovy as the language for my script, but there are options for Python, JavaScript, and a few others.

Once you are logged in to NiFi, click the menu in the top right corner and select the Controller Settings option. On the Controller Settings dialog, choose the Reporting Tasks tab and click the + in the top right corner to create a new reporting task. On the Add Reporting Task dialog, search for ScriptedReportingTask; double-click it in the results, or select the row and click Add. You will see a new ScriptedReportingTask in the reporting tasks list. Click the pencil icon to edit it. Select Groovy as the Script Engine and paste the script below into Script Body. Make sure to change the location of the file where your events will be written.

import groovy.json.*;
import org.apache.nifi.components.state.StateManager;
import org.apache.nifi.reporting.ReportingContext;
import org.apache.nifi.reporting.EventAccess;
import org.apache.nifi.provenance.ProvenanceEventRepository;
import org.apache.nifi.provenance.ProvenanceEventRecord;
import org.apache.nifi.provenance.ProvenanceEventType;
final StateManager stateManager = context.getStateManager();
final EventAccess access = context.getEventAccess();
final ProvenanceEventRepository provenance = access.getProvenanceRepository();
log.info("starting event id: 1");
// Fetch up to 100 provenance events, starting from the first event id
final List<ProvenanceEventRecord> events = provenance.getEvents(1, 100);
log.info("events retrieved: " + events.size());
def outFile = new File("/tmp/provenance.txt");
// Serialize each event as pretty-printed JSON
outFile.withWriter('UTF-8') { writer ->
    events.each { event -> writer.writeLine(new JsonBuilder(event).toPrettyString()) }
}
Click OK and Apply, then click the Play button to activate the reporting task. I set the scheduling frequency for the task to 10 secs so I could see the results right away; you can set it to a higher value as needed. You should see the events appear in /tmp/provenance.txt in JSON format. You could use other formats if needed, and perhaps skip the pretty-printing for better performance. The ScriptedReportingTask is responsible for the ReportingContext, which is available to your script as the context object. You can log information to the NiFi log using the ComponentLog log object, which is also passed to you by the reporting task. If you need any other variables to be set from the NiFi task, you can define them as dynamic properties. My script is very simple: it looks at 100 provenance events starting from the first provenance event. You can use the StateManager to keep track of the last provenance event that you received; see the sketch below. You can also look at the implementation by @jfrazee to see how to incrementally collect provenance events: https://github.com/jfrazee/nifi-provenance-reporting-bundle Thank you to @Matt Burgess for putting together this very useful reporting task component. Hope this is useful.
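A minimal sketch of that incremental pattern, assuming a hypothetical state key ("lastEventId"); it persists the last event id in the task's local state so each run resumes where the previous one stopped:

import org.apache.nifi.components.state.Scope;

// Read the last event id persisted on the previous run (default to 0)
final def stateMap = stateManager.getState(Scope.LOCAL);
final long lastId = Long.parseLong(stateMap.get("lastEventId") ?: "0");

// Fetch the next batch of events after that id
final def newEvents = provenance.getEvents(lastId + 1, 100);

if (!newEvents.isEmpty()) {
    // ... serialize/forward newEvents as in the script above ...
    // Persist the highest id we have seen, for the next run
    stateManager.setState(["lastEventId": String.valueOf(newEvents.last().getEventId())], Scope.LOCAL);
}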
01-26-2017
05:18 PM
Yeah, agree with that.
01-26-2017
06:52 AM
java.net.SocketTimeoutException: is port 2181 open? Are you running NiFi and the Phoenix server on the same machine?
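A quick way to check from the NiFi machine, assuming a hypothetical zookeeper-host:

# Test whether the ZooKeeper port is reachable
nc -vz zookeeper-host 2181

# Or send ZooKeeper's four-letter "ruok" command; a healthy server replies "imok"
echo ruok | nc zookeeper-host 2181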
01-26-2017
04:40 AM
3 Kudos
You can use the UpdateAttribute processor. For the grade attribute, use the EL ${grade:replaceNull("nograde")}.
01-22-2017
10:49 PM
4 Kudos
HDP 2.5 has a technical preview of LLAP. LLAP is a caching layer on top of YARN that allows improved Hive query performance. You can enable LLAP from the Ambari Hive config page.
01-19-2017
03:03 PM
How will you find out that all flow files from a source have been processed? If you can get that figured out, NiFi has several components, e.g. PutEmail, which can send an email to given recipients. You can also set up an SNMP agent and use SetSNMP to set a message, which the SNMP agent can forward to recipients.
01-14-2017
01:35 AM
Check the security group on AWS; make sure there is an inbound rule for port 8080.
01-13-2017
10:27 PM
Yeah, I would try a bigger instance. To quickly test this, go into conf/bootstrap.conf and maybe reduce the Xms and Xmx to something like 128m and see what happens. It could be that the OS is not able to allocate 512m to NiFi.
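For reference, the heap settings live in conf/bootstrap.conf and look like this (the java.arg indexes may differ in your version); dropping them to 128m is just a diagnostic:

# JVM heap settings in conf/bootstrap.conf
java.arg.2=-Xms128m
java.arg.3=-Xmx128m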
01-13-2017
10:07 PM
I am seeing that NiFi is trying to start and is getting killed for some reason; I don't see any logs that indicate an error. Are you being patient enough? There are a lot of packages that get unpacked and set up when you run NiFi for the first time. Also, entropy can sometimes block startup, so maybe try giving it some more time, around 15 minutes, and see if it comes up. I am just hoping you are not terminating it too early. If that is not the case, what kind of EC2 instance are you using, and how much memory, disk space, and how many cores?
01-13-2017
08:37 PM
@Ranjit S That is strange... can you show us your directory structure, a screenshot of it? Also make sure you downloaded the full tar and that it was not corrupted. There should be something in nifi-bootstrap.log at least.
01-13-2017
08:02 PM
@Ranjit S What about nifi-bootstrap.log and nifi-app.log, can you check those? Also, you may be getting stuck on the Java entropy issue. In your conf/bootstrap.conf file, add the following line: java.arg.15=-Djava.security.egd=file:/dev/./urandom Then try starting NiFi again.