Member since: 05-02-2016
Posts: 154
Kudos Received: 54
Solutions: 14
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4084 | 07-24-2018 06:34 PM
 | 5667 | 09-28-2017 01:53 PM
 | 1397 | 02-22-2017 05:18 PM
 | 13809 | 01-13-2017 10:07 PM
 | 3853 | 12-15-2016 06:00 AM
09-27-2017
04:34 PM
@Aneena Paul How much data is being moved as part of the Sqoop job? If the volume is not too high, why not simply use NiFi for moving the data from Oracle to Hive? NiFi can easily handle anything in the GB range for daily/hourly jobs. A simple flow would be GenerateTableFetch -> Remote Process Group -> ExecuteSQL -> PutHDFS; a configuration sketch follows.
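A minimal configuration sketch for that chain (the connection pool name, table name, and HDFS paths here are hypothetical placeholders):

GenerateTableFetch
    Database Connection Pooling Service: OracleConnectionPool  (a DBCPConnectionPool pointing at the Oracle source)
    Table Name: SOURCE_TABLE
    Maximum-value Columns: LAST_UPDATED  (enables incremental fetches)
ExecuteSQL
    Database Connection Pooling Service: OracleConnectionPool
    (leave the query property unset so it runs the SQL that GenerateTableFetch placed in the FlowFile content)
PutHDFS
    Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
    Directory: /data/landing/oracle

The Remote Process Group between the first two processors distributes the generated fetch statements across the NiFi cluster so ExecuteSQL can run them in parallel.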
09-27-2017
04:24 PM
com.amazonaws.services.securitytoken.model.AWSSecurityTokenServiceException: User: arn:aws:sts::7777777:assumed-role/role-hdf-node/i-03333330000 is not authorized to perform: sts:AssumeRole on resource: arn:aws:sts::7777777:role/role-hdf-node. That is probably the root cause; you may have to grant cross-role permission in AWS IAM to the credential that is set up on the EC2 node hosting NiFi.
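As a rough sketch (account ID and role name copied from the error above), the policy attached to the instance's role would need a statement like this before the AssumeRole call can succeed:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::7777777:role/role-hdf-node"
        }
    ]
}

The target role's trust policy also has to name the caller as a trusted principal; both sides are required for cross-role access.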
09-26-2017
04:55 AM
5 Kudos
In this article we will go over how to use NiFi to ingest PDFs and, while we ingest, use a custom Groovy script with the ExecuteScript processor to extract images from each PDF. The images will be tagged with the PDF filename, page number, and image number, so that they can be indexed in HBase/Solr for quick searching, machine learning, or other analytics.
The Groovy code below depends on the following jars. Download and copy them to a folder on your computer. In my case they were under /var/pdimagelib/.
pdfbox-2.0.7.jar
fontbox-2.0.7.jar
jai_imageio.jar
commons-logging-1.1.1.jar
Copy the code below to a file on your computer; in my case the file was at /var/scripts/pdimage.groovy.
import java.nio.charset.*;
import org.apache.commons.io.IOUtils;
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.FileOutputStream;
import java.util.Iterator;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import javax.imageio.ImageIO;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
// NiFi API classes used by the session read/write callbacks below
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.io.InputStreamCallback;
import org.apache.nifi.processor.io.OutputStreamCallback;
def flowFile = session.get();
if (flowFile == null) {
    return;
}
// One new FlowFile will be created per extracted image
def ffList = new ArrayList<FlowFile>()
try {
    session.read(flowFile, { inputStream ->
        PDDocument document = PDDocument.load(inputStream);
        int imageNum = 1;
        int pageNum = 1;
        document.getDocumentCatalog().getPages().each() { page ->
            PDResources pdResources = page.getResources();
            pdResources.getXObjectNames().each() { cosName ->
                if (cosName != null) {
                    PDXObject xObject = pdResources.getXObject(cosName);
                    if (xObject instanceof PDImageXObject) {
                        PDImageXObject pdImage = (PDImageXObject) xObject;
                        BufferedImage image = pdImage.getImage();
                        // Write each image out as the PNG content of a child FlowFile;
                        // session.write returns the latest FlowFile reference, so reassign it
                        FlowFile imgFF = session.create(flowFile);
                        imgFF = session.write(imgFF, { outputStream ->
                            ImageIO.write((RenderedImage) image, "png", outputStream);
                        } as OutputStreamCallback)
                        // Tag the image for downstream indexing; the PDF filename is inherited from the parent
                        imgFF = session.putAttribute(imgFF, "imageNum", String.valueOf(imageNum));
                        imgFF = session.putAttribute(imgFF, "pageNum", String.valueOf(pageNum));
                        ffList.add(imgFF);
                        imageNum++;
                    }
                }
            }
            pageNum++;
        }
        document.close();
    } as InputStreamCallback)
    // REL_SUCCESS and REL_FAILURE are bindings ExecuteScript provides to the script
    session.transfer(ffList, REL_SUCCESS)
    session.remove(flowFile);
} catch (Exception e) {
    log.warn("Failed to extract images from PDF", e);
    session.remove(ffList);
    session.transfer(flowFile, REL_FAILURE);
}
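For reference, the ExecuteScript setup should look roughly like this (the property names are the processor's standard ones; the paths come from the steps above):

Script Engine: Groovy
Script File: /var/scripts/pdimage.groovy
Module Directory: /var/pdimagelib/

Module Directory is what puts the PDFBox and imaging jars on the script's classpath.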
Below is a screenshot of ExecuteScript after it has been set up correctly. To ingest the PDF I used a simple GetFile, though this approach should work for PDFs ingested with any other NiFi processor. Below is a snapshot of the NiFi flow. When a PDF is ingested, ExecuteScript will leverage Groovy and PDFBox to extract its images. The images will be tagged with the PDF filename, page number, and image number, and can then be sent to HBase or any indexing solution for search or analytics. Hope you find the article useful. Please comment with your thoughts/questions.
09-26-2017
04:22 AM
With NiFi 1.3 you have the record-based processors. So you can forward the output of ExecuteSQL or your custom processor to QueryRecord, set it up to read Avro, and add a query like: select col_a, col_b, '${query_id}', '${query_time}', '${query_end_time}' from FLOWFILE. This will add the data you need to the query result set.
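Written out in full (col_a/col_b stand in for your real columns; the ${...} expressions are FlowFile attributes that NiFi Expression Language resolves before the query runs):

SELECT
    col_a,
    col_b,
    '${query_id}'       AS query_id,
    '${query_time}'     AS query_time,
    '${query_end_time}' AS query_end_time
FROM FLOWFILE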
02-24-2017
03:29 PM
Can you please give us some context? Where are you getting this file from: SFTP? Whether you explicitly do this or not, a FlowFile received in NiFi will always be saved to disk. If this can be done easily with ExecuteProcess, it is a good option and it really will not impact your flow's performance; NiFi is very efficient at file IO.
02-22-2017
05:18 PM
1 Kudo
Maybe a two-step approach should be used: in the first ListFile, just look for *.xml files, and then, once you have executed the script for the XML, trigger another listing for *.dat files.
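In ListFile terms that would be two processors differing only in their File Filter property (a Java regex, so the literal dot is escaped):

ListFile (step 1) - File Filter: .*\.xml
ListFile (step 2) - File Filter: .*\.dat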
02-13-2017
06:23 PM
@Michal R Do you have an attribute myfile.cat=somevalue in the FlowFile (where nifi.prefix=myfile and suffix=cat) whose value you want to retrieve and assign to nifi.filename? I am not sure EL can be nested the way you did it, but instead of what you have, you could try: nifi.filename = ${${nifi.prefix}.${suffix}}
02-07-2017
06:02 AM
@Mark Wallace df.withColumn("id", when(df.col("deviceFlag").equalTo(1), concat(df.col("device"), lit("#"), df.col("domain"))).otherwise(df.col("device")));
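For context, a self-contained version of that line (assumes a Dataset<Row> named df with device, domain, and a numeric deviceFlag column; when/concat/lit/col come from Spark's SQL functions class):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

// Build "id" as device#domain when deviceFlag == 1, else just device
Dataset<Row> withId = df.withColumn("id",
        when(col("deviceFlag").equalTo(1),
                concat(col("device"), lit("#"), col("domain")))
        .otherwise(col("device")));
withId.show();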
02-06-2017
04:27 PM
Try with a "1": deviceFlag may be a string, so it needs to be compared with a string, i.e. df.col("deviceFlag").equalTo("1").