
Morphline syntax with HBase Lily NRT (PDF extraction)

Hi all,

I want to index binary files that are stored in HBase with Lily NRT.

My problem is really one of understanding how morphlines work.

The following morphline.conf extracts the file (PDF, ...) from an HBase cell and parses it with Tika, and Lily loads the output into the Solr index. That works fine. The question is: how can I extract an additional field and load it into the same Solr document? When I try to extract more HBase fields, the Tika parser parses the other field...

I guess it shouldn't be hard, but I can't find the right solution...

SOLR_LOCATOR: {
  # Name of solr collection
  collection : lily-test

  # ZooKeeper ensemble
  zkHost : "127.0.0.1:2181/solr"

  # Max number of documents to pass per RPC from morphline to Solr Server
  # batchSize : 10000
}
morphlines : [
  {
    id : lily-test
    importCommands : ["org.kitesdk.**", "com.ngdata.**", "org.apache.solr.**"]

    commands : [
      { logInfo { format : "HBase binary extraction" } }
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:file"
              outputField : "_attachment_body"
              type : "byte[]"
              source : value
            }
            # {
            #   inputColumn : "data:keywords"
            #   outputField : "text"
            #   type : string
            #   #source : qualifier
            #   source : value
            # }
          ]
        }
      }
      { logInfo { format : "mimetype detection" } }
      {
        detectMimeType {
          includeDefaultMimeTypes : true
          includeMetaData : true
        }
      }
      { logInfo { format : "tika parsing" } }
      {
        solrCell {
          solrLocator : ${SOLR_LOCATOR}
          captureAttr : true
          lowernames : true
          capture : [title, author, content, content_type]
          parsers : [
            { parser : org.apache.tika.parser.image.ImageParser }
            { parser : org.apache.tika.parser.image.PSDParser }
            { parser : org.apache.tika.parser.image.TiffParser }
            { parser : org.apache.tika.parser.microsoft.OfficeParser }
            { parser : org.apache.tika.parser.microsoft.TNEFParser }
            { parser : org.apache.tika.parser.microsoft.ooxml.OOXMLParser }
            { parser : org.apache.tika.parser.odf.OpenDocumentParser }
            { parser : org.apache.tika.parser.pdf.PDFParser }
            { parser : org.apache.tika.parser.asm.ClassParser }
            { parser : org.gagravarr.tika.FlacParser }
            { parser : org.apache.tika.parser.audio.AudioParser }
            { parser : org.apache.tika.parser.audio.MidiParser }
            { parser : org.apache.tika.parser.crypto.Pkcs7Parser }
            { parser : org.apache.tika.parser.dwg.DWGParser }
            { parser : org.apache.tika.parser.epub.EpubParser }
            { parser : org.apache.tika.parser.executable.ExecutableParser }
            { parser : org.apache.tika.parser.feed.FeedParser }
            { parser : org.apache.tika.parser.font.AdobeFontMetricParser }
            { parser : org.apache.tika.parser.font.TrueTypeParser }
            { parser : org.apache.tika.parser.xml.XMLParser }
            { parser : org.apache.tika.parser.html.HtmlParser }
            { parser : org.apache.tika.parser.iptc.IptcAnpaParser }
            { parser : org.apache.tika.parser.iwork.IWorkPackageParser }
            { parser : org.apache.tika.parser.jpeg.JpegParser }
            { parser : org.apache.tika.parser.mail.RFC822Parser }
            { parser : org.apache.tika.parser.mbox.MboxParser,
              additionalSupportedMimeTypes : [message/x-emlx] }
            { parser : org.apache.tika.parser.mp3.Mp3Parser }
            { parser : org.apache.tika.parser.mp4.MP4Parser }
            { parser : org.apache.tika.parser.hdf.HDFParser }
            { parser : org.apache.tika.parser.netcdf.NetCDFParser }
            { parser : org.apache.tika.parser.pkg.CompressorParser }
            { parser : org.apache.tika.parser.pkg.PackageParser }
            { parser : org.apache.tika.parser.rtf.RTFParser }
            { parser : org.apache.tika.parser.txt.TXTParser }
            { parser : org.apache.tika.parser.video.FLVParser }
            { parser : org.apache.tika.parser.xml.DcXMLParser }
            { parser : org.apache.tika.parser.xml.FictionBookParser }
            { parser : org.apache.tika.parser.chm.ChmParser }
          ]
        }
      }
      { logInfo { format : "add timestamp" } }
      { addCurrentTime {} }
      { logInfo { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
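
For reference, the attempt with more than one HBase field looks roughly like the sketch below: the commented-out mapping from the configuration above, enabled next to the file mapping. This is only an illustration under that assumption. The column data:keywords and the output field text are taken from the commented-out lines, and the rest of the morphline stays unchanged.

```
extractHBaseCells {
  mappings : [
    {
      # binary file cell; _attachment_body is the field the Tika step reads
      inputColumn : "data:file"
      outputField : "_attachment_body"
      type : "byte[]"
      source : value
    }
    {
      # additional plain-text cell that should end up in the same Solr document
      # (assumption: this is the commented-out mapping, simply enabled)
      inputColumn : "data:keywords"
      outputField : "text"
      type : string
      source : value
    }
  ]
}
```

It is with a variant like this that the problem described above shows up.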

Thank you in advance!
