Member since 06-20-2016

488 Posts · 433 Kudos Received · 118 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3600 | 08-25-2017 03:09 PM |
| | 2500 | 08-22-2017 06:52 PM |
| | 4187 | 08-09-2017 01:10 PM |
| | 8968 | 08-04-2017 02:34 PM |
| | 8946 | 08-01-2017 11:35 AM |
12-20-2017 07:34 PM · 6 Kudos

Introduction

Use of data in Excel spreadsheets is commonplace in the enterprise and often results in widespread yet highly manual and loosely defined dataflows. The manual and loose nature of Excel data flows is at odds with the automated way we expect data to move through the enterprise. As a result, these Excel data flows lack the accuracy, speed and transparency of automated flows. Shouldn't Excel data flows be automated too?

In this article I show how to use NiFi to automate the reading, writing and validation of Excel contents (using the full capability set humans use) and how to fit this into a larger workflow. I focus here on building an Excel client to operate on Excel data during your workflow. Designing your own Excel workflows depends on how you do things manually today and on your ability to innovate new ways to leverage Excel data in your larger data landscape and initiatives.

Technologies

The key technology for reading and writing to Excel is Apache's POI Java API. The key technology for building an automated workflow is NiFi. A NiFi ExecuteScript processor calling a Groovy script that implements POI bridges the two: it serves as an Excel client at any step where the automated workflow needs to interact with Excel.

Note that there is an out-of-the-box NiFi processor called ConvertExcelToCSVProcessor that uses POI to convert the binary Excel format into a CSV file for each sheet in the workbook. That is all it does, however. Using the full POI API in a Groovy script allows you to do in Groovy/Java pretty much any read, write or processing operation you can do manually. This includes the following and more (a short Groovy sketch of a few of these operations appears below):

- Select and output data to CSV or Excel format
- Insert or change the value of one or more cells
- Insert or read a comment on a cell
- Format a cell or range of cells (borders, colors, font, etc.), including conditional formatting
- Insert a hyperlink into a cell
- Freeze panes or merge cells
- Create pre-populated dropdown lists for data inputs
- Set the zoom level
- Set a cell formula or evaluate an existing formula

See this quick overview of POI capabilities and implementations; more references are provided at the end of this article.
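As a quick illustration, here is a minimal Groovy sketch of a few of the capabilities listed above: writing a cell value, attaching a comment, and setting a formula. The file names and cell positions are hypothetical, and it assumes the poi and poi-ooxml jars (plus their dependencies) are on the classpath.

import org.apache.poi.ss.usermodel.WorkbookFactory

// open an existing workbook (xlsx or xls); the path is a placeholder
def wb = WorkbookFactory.create(new File('/tmp/report.xlsx'))
def sheet = wb.getSheetAt(0)

// write a value into cell C1, creating the row/cell if absent
def row = sheet.getRow(0) ?: sheet.createRow(0)
def cell = row.getCell(2) ?: row.createCell(2)
cell.setCellValue('corrected value')

// attach a comment to the cell (e.g. to flag a validation error)
def helper = wb.getCreationHelper()
def drawing = sheet.createDrawingPatriarch()
def anchor = helper.createClientAnchor()
anchor.setCol1(cell.getColumnIndex())
anchor.setCol2(cell.getColumnIndex() + 2)
anchor.setRow1(row.getRowNum())
anchor.setRow2(row.getRowNum() + 2)
def comment = drawing.createCellComment(anchor)
comment.setString(helper.createRichTextString('Value out of expected range'))
cell.setCellComment(comment)

// set a formula in D1 and evaluate it
def formulaCell = row.getCell(3) ?: row.createCell(3)
formulaCell.setCellFormula('SUM(A2:A10)')
helper.createFormulaEvaluator().evaluateFormulaCell(formulaCell)

// save the modified workbook; the output path is a placeholder
new File('/tmp/report-modified.xlsx').withOutputStream { out -> wb.write(out) }
wb.close()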
As mentioned, the main design feature is to use a NiFi ExecuteScript processor that runs a Groovy script, which in turn uses POI to work with the contents of the Excel spreadsheet. From ExecuteScript, you can output the altered Excel file as Excel itself (e.g. to email back to its author as an attachment, with comments added where validation errors are detected) or output the data as CSV for downstream routing, processing, conversion and storage.

A note on Groovy and Java: Groovy is a scripting abstraction of Java that lets you write Java code in a lightweight, Groovyesque scripting style, much like writing Python. In Groovy scripts, Java can be used natively alongside the Groovy code, and that is how we import and use the POI libraries.

That takes care of the Excel client part. The workflow orchestration part is classic NiFi processing (get file, route on content or failure/success, put to file or Hive, etc.), which numerous HCC articles explore and demonstrate (see references below).

Example Flow

Below is a simple example of an automated Excel workflow. Your own automated Excel flows will vary and will likely be more elaborate. The main focus of the example flow here is processing the spreadsheet, i.e. building an automated NiFi Excel client as a step in the larger workflow.

In this flow I:

1. get an Excel workbook file from a local file system
2. use POI to parse one sheet from the file and then read the cell values of all rows
3. write out the row values as a CSV flow file
4. split the CSV into separate rows
5. validate the contents of each row to make sure the fields are in the proper format and ranges (a sketch of such a check follows this list)
6. send all failed rows to a file in the local file system
7. send all successful rows to a Hive table
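To make step 5 concrete, below is a minimal sketch of what a per-row check might look like inside an ExecuteScript processor once each CSV row arrives as its own flow file. The column layout (tstamp, rowNum, id, amount, date) and the validation rules are assumptions for illustration only; session, REL_SUCCESS and REL_FAILURE are the standard variables ExecuteScript binds.

import org.apache.commons.io.IOUtils
import java.nio.charset.StandardCharsets
import org.apache.nifi.processor.io.InputStreamCallback

def flowFile = session.get()
if (!flowFile) return

// read the single CSV row carried by this flow file
def line = ''
session.read(flowFile, { inputStream ->
    line = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
} as InputStreamCallback)

// assumed layout (mirrors the script below): tstamp, rowNum, id, amount, date
def fields = line.trim().split(',')
def valid = fields.length == 5 &&
            fields[2].isInteger() &&                  // id must be an integer
            fields[3].isBigDecimal() &&               // amount must be numeric...
            new BigDecimal(fields[3]) >= 0 &&         // ...and non-negative
            fields[4] ==~ /\d{4}-\d{2}-\d{2}/         // date must look like yyyy-MM-dd

// route valid rows to success, invalid rows to failure
session.transfer(flowFile, valid ? REL_SUCCESS : REL_FAILURE)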
The NiFi Excel Client: ExecuteScript + Groovy + POI

Step (2) operates on the Excel data. To do this, the ExecuteScript processor is configured to access a Groovy script and its library dependencies, including the POI libraries.

This is everything you need to know in order to work with Excel in NiFi:

- The ExecuteScript processor receives a flow file in Excel format and calls a Groovy script to process it.
- The ExecuteScript processor is configured to point to the Groovy script and to the local file path holding all of the script's dependencies (including the POI jars); a sample configuration is sketched below.
- The Groovy script reads and manipulates the Excel workbook. At this point, the challenge is mostly learning and using the POI API. Note that the script uses native Java in places... Groovy does not constrain you to Groovy syntax.
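As a rough sketch, the relevant ExecuteScript properties might look like the following. The paths are placeholders; Module Directory should point to a directory containing the POI jars and their dependencies.

Script Engine:    Groovy
Script File:      /opt/nifi/scripts/excel_to_csv.groovy
Module Directory: /opt/nifi/lib/poi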
The full script is shown below. It is quite simple but will get you started; the full power is in the POI capabilities, which you should learn to leverage. Note that you could output the flow file as Excel rather than converting it to CSV.
import java.nio.charset.StandardCharsets
import java.text.SimpleDateFormat
import org.apache.nifi.processor.io.StreamCallback
import org.apache.poi.ss.usermodel.Cell
import org.apache.poi.ss.usermodel.Row
import org.apache.poi.ss.usermodel.Sheet
import org.apache.poi.ss.usermodel.Workbook
import org.apache.poi.ss.usermodel.WorkbookFactory

def flowFile = session.get()
if (!flowFile) return
def date = new Date()

flowFile = session.write(flowFile, { inputStream, outputStream ->
    try {
        // open the incoming Excel workbook and take its first sheet
        Workbook wb = WorkbookFactory.create(inputStream)
        Sheet mySheet = wb.getSheetAt(0)
        def record = ''

        // processing time, inserted as first column
        def tstamp = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(date)

        Iterator<Row> rowIter = mySheet.rowIterator()
        def rowNum = 0
        while (rowIter.hasNext()) {
            rowNum++
            Row nextRow = rowIter.next()
            Iterator<Cell> cellIterator = nextRow.cellIterator()

            // concatenate all cell values of the row into one CSV record
            while (cellIterator.hasNext()) {
                Cell cell = cellIterator.next()
                record = record + cell.toString() + ','
            }

            // skip the header row (rowNum == 1)
            if (rowNum > 1) {
                // insert tstamp and row num, drop last comma and add end line.
                // Note: tstamp + row num form a composite key
                record = tstamp + ',' + rowNum + ',' + record[0..-2] + '\n'
                outputStream.write(record.getBytes(StandardCharsets.UTF_8))
            }
            record = ''
        }
    }
    catch (e) {
        log.error("Error during processing of spreadsheet name = xx, sheet = xx", e)
        //session.transfer(inputStream, REL_FAILURE)
    }
} as StreamCallback)

// rename the flow file: original name (without extension) plus a timestamp, as .csv
def filename = flowFile.getAttribute('filename').split('\\.')[0] + '_' +
        new SimpleDateFormat("yyyyMMdd-HHmmss").format(date) + '.csv'
flowFile = session.putAttribute(flowFile, 'filename', filename)
session.transfer(flowFile, REL_SUCCESS)
A Reference Workflow

Below is a possible automated Excel workflow. The goal of this flow is to receive Excel data from multiple users into a Hive table for analytics (reporting, BI, dashboarding, advanced analytics, data science, etc.). Errors are loaded to their own separate Hive table so analytics can be performed on them, with the goal of improving user data accuracy.

One important feature is that the Excel attachment filename maps to the Hive table the data will be inserted into (and to the Groovy script that will validate it). In this way the same NiFi flow can be used for different flows of spreadsheet data to Hive tables; one possible implementation of this mapping is sketched below.
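Purely as an illustration of that mapping (the attribute names, script directory and naming convention are assumptions, not part of the original flow), a small ExecuteScript step could derive the target Hive table and the validation script from the attachment's filename attribute:

def flowFile = session.get()
if (!flowFile) return

// e.g. filename "sales_forecast.xlsx" -> workflow name "sales_forecast"
def workflowName = flowFile.getAttribute('filename').split('\\.')[0]

// hypothetical attribute names and script directory
flowFile = session.putAttribute(flowFile, 'hive.table', workflowName)
flowFile = session.putAttribute(flowFile, 'validation.script',
        "/opt/nifi/scripts/validate_${workflowName}.groovy")

// a downstream step (e.g. the validation ExecuteScript itself) can fail the
// flow if no script with this name exists
session.transfer(flowFile, REL_SUCCESS)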
Key details are explained below.

Happy Path

| Step | Details |
|---|---|
| Get email | ConsumePOP3 retrieves the email. |
| Extract email headers | ExtractEmailHeaders writes the "from" header as an attribute. |
| Extract attachment | ExtractEmailAttachments extracts the Excel attachment as a flow file and writes the filename as an attribute. |
| Validate contents | ExecuteScript. The filename attribute maps to the Groovy validation script in the configuration (fails if the script is not found). The script validates the contents: for each cell value that is invalid, it writes a comment with an error description or code and outputs the Excel file as a failure. For success, it outputs a CSV file. |
| Insert to Hive | PutHDFS puts the CSV to the file path of a Hive external table. Note there are many ways to put CSV data into Hive tables. Optimized Hive tables use ORC files; to convert CSV to ORC and then insert into Hive, see this HCC article. |
Error Paths

| Step | Details |
|---|---|
| Send error email with attachment | PutEmail returns an email to the user. The subject of the email is taken from an attribute written by the Validate Contents script. Failure at this point is either "Bad workflow name", because the attached filename did not map to a workflow name, or "Bad data", because one or more cells had invalid data. If the user did not include an attachment, the flow fails at ExtractEmailAttachments and an UpdateAttribute processor (not shown in the diagram) writes "No attachment" as the subject. The "Bad workflow name" case attaches the original Excel workbook. The "Bad data" case attaches the Excel modified with inserted error comments. The "No attachment" case does not return an attachment. |
| Convert error comments | ExecuteScript converts the error comments into cell coordinates (Excel row, column) and the error description into a CSV, with each record holding: timestamp, emailFrom, workflowName, errorRow, errorCol, errorCode (or description). A sketch of this step follows the table. |
| Insert to Hive | Same as above. |
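For the "Convert error comments" step, a minimal POI sketch along these lines could walk the commented cells and emit one CSV record per error. The emailFrom attribute name, the workflow-name convention and the assumption that error codes contain no commas are illustrative only.

import java.text.SimpleDateFormat
import org.apache.nifi.processor.io.StreamCallback
import org.apache.poi.ss.usermodel.WorkbookFactory

def flowFile = session.get()
if (!flowFile) return

def tstamp = new SimpleDateFormat('yyyy-MM-dd HH:mm:ss.SSS').format(new Date())
def emailFrom = flowFile.getAttribute('email.headers.from') ?: 'unknown'
def workflowName = flowFile.getAttribute('filename').split('\\.')[0]

flowFile = session.write(flowFile, { inputStream, outputStream ->
    def wb = WorkbookFactory.create(inputStream)
    def sheet = wb.getSheetAt(0)
    // visit every defined cell and emit a CSV record for each error comment found
    sheet.each { row ->
        row.each { cell ->
            def comment = cell.getCellComment()
            if (comment != null) {
                // error code/description was written into the comment by the validation step
                def errorCode = comment.getString().getString()
                def record = [tstamp, emailFrom, workflowName,
                              cell.getRowIndex(), cell.getColumnIndex(), errorCode].join(',') + '\n'
                outputStream.write(record.getBytes('UTF-8'))
            }
        }
    }
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)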
  
How can you leverage NiFi for Excel data?

The above is only one automated Excel workflow. Use your imagination to explore the diverse and powerful ways that:

- NiFi can move and manipulate data throughout your data landscape
- Groovy with POI can work with Excel the way humans do
- you can combine the two to leverage Excel data in automated workflows running through your system

Think of all of the manual work performed on Excel data at your workplace. Ask yourself: how can we leverage NiFi and POI for Excel data? Then get your design cap on and start building NiFi flows to automate Excel data movement and processing like you do with the rest of your data.

References

NiFi
- Apache NiFi Overview
- Getting Started with Apache NiFi
- Hortonworks Apache NiFi Overview
- NiFi Intro HCC

Groovy
- The Apache Groovy programming language - documentation

POI
- Apache POI - Excel
- Apache POI - Excel Quick Guide

Template for first flow
- spreadsheet-demo-2017-02-21-v1.xml
						
					
09-05-2017 07:31 PM

@Saikrishna Tarapareddy Keep in mind that "flow" does not have to mean all components on the canvas. For example, a flow could be all components in a process group. In that case, you would deploy only the process group in question to deploy your flow. So the unit of deployment, what we are calling a flow, is up to you. It could be one independent flow, or many, or all of them. For better visibility to the community, I would pose your NiFi git integration question as a new question. Please note that there is a lot of activity in the NiFi project around SDLC. Future plans are to have the repository embedded in the NiFi platform itself, so you will not have to import and export templates but can instead manage them as code within NiFi.
						
					
08-30-2017 01:17 PM

As mentioned by @Laiba Khan, data locality refers to moving compute to the data, which is typically faster than moving data to the compute. In Hadoop, data is divided into blocks and distributed across multiple servers (nodes). Additionally, it is replicated (typically 3 copies total) across these nodes. Thus, subsets of a dataset are distributed across nodes. When a MapReduce or Tez job is started, a container with code is distributed across the cluster nodes. These containers operate on the data in parallel and usually grab data blocks that are stored on the same node, thus achieving parallel processing with data locality. This results in fast overall execution against the full data set distributed across multiple nodes. This is key to operating on large volumes of data: parallel processing is one component and processing data stored locally is another. Processing data that has to move across the network (no data locality) is slower. Note that in cloud computing it is often advantageous NOT to have data locality. Local disks in the cloud are ephemeral: if the (virtual) server is destroyed, all data sitting on it is destroyed. Thus, putting data on local disks means you lose it when you spin down a cluster. One of the advantages of cloud is paying for servers only when you use them. Thus it is common to have scenarios where you spin up a cluster, do some processing and then spin it down (e.g. running a report or training a model in data science). In this scenario you would want your data stored on non-local storage like AWS S3 object storage, which is very inexpensive. This data persists separately from your cluster, so only your compute is ephemeral. When you spin up a cluster, it reads from the permanent non-local storage and perhaps writes to it. You lose data locality but you gain the ability to pay for your cluster only when you use it. Compute on non-local data in this scenario is slower than local but not extremely so, especially when you scale out your cluster (more nodes) to increase the parallel processing aspect.
						
					
08-25-2017 03:09 PM · 2 Kudos

DR is important for most PROD clusters in enterprises. Here is a good intro:

- https://community.hortonworks.com/articles/43525/disaster-recovery-and-backup-best-practices-in-a-t.html
- https://community.hortonworks.com/articles/43575/disaster-recovery-and-backup-best-practices-in-a-t-1.html

Note that there is very active work by Hortonworks on a solution focused directly on DR ... stay tuned.
						
					
08-25-2017 01:34 AM

Please note that HDP 2.4.x and later do NOT support Windows.
						
					
08-24-2017 01:21 PM

This article shows how to make customized logs in NiFi (from source logs): https://community.hortonworks.com/articles/65027/nifi-easy-custom-logging-of-diverse-sources-in-mer.html  Use nifi-app as your source log (it logs application operations) and not nifi-user (which logs user activity).
						
					
08-24-2017 12:08 PM

The bug is fixed in the 3.0.1.1 patch and the issue is resolved. Release notes: http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_release-notes/content/ch_hdf_relnotes.html#repo-location
						
					
08-24-2017 01:56 AM

See these for configuring ACID transactions in Hive:

- https://hortonworks.com/tutorial/using-hive-acid-transactions-to-insert-update-and-delete-data/#enable-acid
- https://community.hortonworks.com/articles/97113/hive-acid-merge-by-example.html

See this for implementing ORC format on your Hive files:

- https://orc.apache.org/docs/hive-ddl.html
						
					
08-23-2017 05:27 PM · 1 Kudo

@Sina Talebian A little disappointing when the resolution is a reinstall, but it works now 🙂  VMs can be fickle.
						
					