Member since: 05-02-2016
Posts: 154
Kudos Received: 54
Solutions: 14
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4084 | 07-24-2018 06:34 PM
 | 5667 | 09-28-2017 01:53 PM
 | 1397 | 02-22-2017 05:18 PM
 | 13809 | 01-13-2017 10:07 PM
 | 3853 | 12-15-2016 06:00 AM
09-27-2017
04:34 PM
@Aneena Paul How much data is being moved as part of the Sqoop job? If the volume is not too high, why not simply use NiFi for moving the data from Oracle to Hive? NiFi can easily handle anything in the GB range for daily/hourly jobs. A simple flow would be GenerateTableFetch -> Remote Process Group -> ExecuteSQL -> PutHDFS; a configuration sketch follows.
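A minimal configuration sketch for that chain (the connection pool name, table name, and HDFS paths here are hypothetical placeholders):

GenerateTableFetch
    Database Connection Pooling Service: OracleConnectionPool  (a DBCPConnectionPool pointing at the Oracle source)
    Table Name: SOURCE_TABLE
    Maximum-value Columns: LAST_UPDATED  (enables incremental fetches)
ExecuteSQL
    Database Connection Pooling Service: OracleConnectionPool
    (leave the query property unset so it runs the SQL that GenerateTableFetch placed in the FlowFile content)
PutHDFS
    Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
    Directory: /data/landing/oracle

The Remote Process Group between the first two processors distributes the generated fetch statements across the NiFi cluster so ExecuteSQL can run them in parallel.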
09-27-2017
04:24 PM
com.amazonaws.services.securitytoken.model.AWSSecurityTokenServiceException: User: arn:aws:sts::7777777:assumed-role/role-hdf-node/i-03333330000 is not authorized to perform: sts:AssumeRole on resource: arn:aws:sts::7777777:role/role-hdf-node. That is probably the root cause; you may have to grant cross-role permission in AWS IAM to the credential that is set up on the EC2 node hosting NiFi.
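As a rough sketch (account ID and role name copied from the error above), the policy attached to the instance's role would need a statement like this before the AssumeRole call can succeed:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::7777777:role/role-hdf-node"
        }
    ]
}

The target role's trust policy also has to name the caller as a trusted principal; both sides are required for cross-role access.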
09-26-2017
04:55 AM
5 Kudos
In this article we will go over how to use NiFi to ingest PDFs and, while we ingest, use a custom Groovy script with the ExecuteScript processor to extract images from each PDF. The images will be tagged with the PDF filename, page number, and image number, so that they can be indexed in HBase/Solr for quick searching, machine learning, or other analytics.
The Groovy code below depends on the following jars. Download and copy them to a folder on your computer. In my case they were under /var/pdimagelib/.
pdfbox-2.0.7.jar
fontbox-2.0.7.jar
jai_imageio.jar
commons-logging-1.1.1.jar
Copy the code below to a file on your computer; in my case the file was at /var/scripts/pdimage.groovy.
import java.nio.charset.*;
import org.apache.commons.io.IOUtils;
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.FileOutputStream;
import java.util.Iterator;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import javax.imageio.ImageIO;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
// NiFi API classes used by the session read/write callbacks below
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.io.InputStreamCallback;
import org.apache.nifi.processor.io.OutputStreamCallback;
def flowFile = session.get();
if (flowFile == null) {
    return;
}
// One new FlowFile will be created per extracted image
def ffList = new ArrayList<FlowFile>()
try {
    session.read(flowFile, { inputStream ->
        PDDocument document = PDDocument.load(inputStream);
        int imageNum = 1;
        int pageNum = 1;
        document.getDocumentCatalog().getPages().each() { page ->
            PDResources pdResources = page.getResources();
            pdResources.getXObjectNames().each() { cosName ->
                if (cosName != null) {
                    PDXObject xObject = pdResources.getXObject(cosName);
                    if (xObject instanceof PDImageXObject) {
                        PDImageXObject pdImage = (PDImageXObject) xObject;
                        BufferedImage image = pdImage.getImage();
                        // Write each image out as the PNG content of a child FlowFile;
                        // session.write returns the latest FlowFile reference, so reassign it
                        FlowFile imgFF = session.create(flowFile);
                        imgFF = session.write(imgFF, { outputStream ->
                            ImageIO.write((RenderedImage) image, "png", outputStream);
                        } as OutputStreamCallback)
                        // Tag the image for downstream indexing; the PDF filename is inherited from the parent
                        imgFF = session.putAttribute(imgFF, "imageNum", String.valueOf(imageNum));
                        imgFF = session.putAttribute(imgFF, "pageNum", String.valueOf(pageNum));
                        ffList.add(imgFF);
                        imageNum++;
                    }
                }
            }
            pageNum++;
        }
        document.close();
    } as InputStreamCallback)
    // REL_SUCCESS and REL_FAILURE are bindings ExecuteScript provides to the script
    session.transfer(ffList, REL_SUCCESS)
    session.remove(flowFile);
} catch (Exception e) {
    log.warn("Failed to extract images from PDF", e);
    session.remove(ffList);
    session.transfer(flowFile, REL_FAILURE);
}
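For reference, the ExecuteScript setup should look roughly like this (the property names are the processor's standard ones; the paths come from the steps above):

Script Engine: Groovy
Script File: /var/scripts/pdimage.groovy
Module Directory: /var/pdimagelib/

Module Directory is what puts the PDFBox and imaging jars on the script's classpath.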
Below is a screenshot of ExecuteScript after it has been set up correctly. To ingest the PDF I used a simple GetFile, though this approach should work for PDFs ingested with any other NiFi processor. Below is a snapshot of the NiFi flow. When a PDF is ingested, ExecuteScript will leverage Groovy and PDFBox to extract its images. The images will be tagged with the PDF filename, page number, and image number, and can then be sent to HBase or any indexing solution for search or analytics. Hope you find the article useful. Please comment with your thoughts/questions.
09-26-2017
04:22 AM
With NiFi 1.3 you have the record-based processors. So you can forward the output of ExecuteSQL or your custom processor to QueryRecord, set it up to read Avro, and add a query like: select col_a, col_b, '${query_id}', '${query_time}', '${query_end_time}' from FLOWFILE. This will add the data you need to the query result set.
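Written out in full (col_a/col_b stand in for your real columns; the ${...} expressions are FlowFile attributes that NiFi Expression Language resolves before the query runs):

SELECT
    col_a,
    col_b,
    '${query_id}'       AS query_id,
    '${query_time}'     AS query_time,
    '${query_end_time}' AS query_end_time
FROM FLOWFILE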
02-24-2017
03:29 PM
Can you please give us some context? Where are you getting this file from: SFTP? Whether you explicitly do this or not, a FlowFile received in NiFi will always be saved to disk. If this can be done easily with ExecuteProcess, it is a good option and it really will not impact your flow's performance; NiFi is very efficient at file IO.
02-22-2017
05:18 PM
1 Kudo
Maybe a two-step approach should be used: in the first ListFile, just look for *.xml files, and then, once you have executed the script for the XML, trigger another listing for *.dat files.
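In ListFile terms that would be two processors differing only in their File Filter property (a Java regex, so the literal dot is escaped):

ListFile (step 1) - File Filter: .*\.xml
ListFile (step 2) - File Filter: .*\.dat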
02-13-2017
06:23 PM
@Michal R Do you have an attribute myfile.cat=somevalue in the FlowFile (where nifi.prefix=myfile and suffix=cat) whose value you want to retrieve and assign to nifi.filename? I am not sure EL can be nested the way you did it, but instead of what you have, you could try: nifi.filename = ${${nifi.prefix}.${suffix}}
02-07-2017
06:02 AM
@Mark Wallace df.withColumn("id", when(df.col("deviceFlag").equalTo(1), concat(df.col("device"), lit("#"), df.col("domain"))).otherwise(df.col("device")));
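For context, a self-contained version of that line (assumes a Dataset<Row> named df with device, domain, and a numeric deviceFlag column; when/concat/lit/col come from Spark's SQL functions class):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

// Build "id" as device#domain when deviceFlag == 1, else just device
Dataset<Row> withId = df.withColumn("id",
        when(col("deviceFlag").equalTo(1),
                concat(col("device"), lit("#"), col("domain")))
        .otherwise(col("device")));
withId.show();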
02-06-2017
04:27 PM
Try with a "1": deviceFlag may be a string, so it needs to be compared with a string, i.e. df.col("deviceFlag").equalTo("1").