About bhagan

bhagan · ‎05-08-2017

One of the search use cases that I’ve been introduced to requires the ability to index text inside images, such as scanned text in PNG files. I set out to figure out how to do this with Solr. I came across a couple of pretty good blog posts, but as usual, you have to put together what you learn from multiple sources before you can get things to work correctly (or at least that’s what usually happens for me). So I thought I would put together the steps I took to get it to work. I used HDP Sandbox 2.3.

Step-by-step guide

Install dependencies - this will provide you support for processing PNGs, JPEGs, and TIFFs

    yum install autoconf automake libtool
    yum install libpng-devel
    yum install libjpeg-devel
    yum install libtiff-devel
    yum install zlib-devel

Download Leptonica, an image processing library

    wget http://www.leptonica.org/source/leptonica-1.69.tar.gz

Download Tesseract, an Optical Character Recognition engine

    wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz

Ensure proper variables and pathing are set – This is necessary so that when building Leptonica, the build can find the dependencies that you installed earlier. If this pathing is not correct, you will get "Unsupported image type" errors when running the tesseract command line client. Also, when installing Tesseract, you will place the language data in the TESSDATA_PREFIX dir.
    [root@sandbox tesseract-ocr]# cat ~/.profile
    export TESSDATA_PREFIX='/usr/local/share/'
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/lib64

Build Leptonica

    tar xvf leptonica-1.69.tar.gz
    cd leptonica-1.69
    ./configure
    make
    sudo make install

Build Tesseract

    tar xvf tesseract-ocr-3.02.02.tar.gz
    cd tesseract-ocr
    ./autogen.sh
    ./configure
    make
    sudo make install
    sudo ldconfig

Download Tesseract language(s) and place them in the TESSDATA_PREFIX dir, defined above

    wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
    tar xzf tesseract-ocr-3.02.eng.tar.gz
    cp tesseract-ocr/tessdata/* /usr/local/share/tessdata

Test Tesseract – Use the image in this blog post. You’ll notice that this is where I started. The ‘hard’ part of this was getting the builds correct for Leptonica. And the problem there was ensuring that I had the correct dependencies installed and that they were available on the path defined above. If this doesn’t work, there’s no sense moving on to Solr.

http://blog.thedigitalgroup.com/vijaym/2015/07/17/using-solr-and-tikaocr-to-search-text-inside-an-image/

    [root@sandbox tesseract-ocr]# /usr/local/bin/tesseract ~/OM_1.jpg ~/OM_out
    Tesseract Open Source OCR Engine v3.02.02 with Leptonica
    [root@sandbox tesseract-ocr]# cat ~/OM_out.txt
    ‘ '"I“ " "' ./lrast. Shortly before the classes started I was visiting a.certain public school, a school set in a typically Englishcountryside, which on the June clay of my visit was wonder-fully beauliful. The Head Master—-no less typical than hisschool and the country-side—pointed out the charms ofboth, and his pride came out in the ?nal remark which he madebeforehe left me. He explained that he had a class to takein'I'heocritus. Then (with a. buoyant gesture); “ Can you, conceive anything more delightful than a class in Theocritus,on such a day and in such a place?"

If you have text in your out file, then you’ve done it correctly!
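The "text in your out file" check can also be scripted before moving on to Solr. The following is a minimal Python sketch, not part of Tesseract: the helper name ocr_output_has_text and the three-word threshold are illustrative assumptions, and the path is the one used in the example above.

```python
import re
from pathlib import Path


def ocr_output_has_text(path, min_words=3):
    """Return True if the OCR output file contains at least min_words
    alphabetic words (hypothetical threshold; tune it for your images)."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    # Count runs of two or more letters so stray OCR punctuation is ignored.
    words = re.findall(r"[A-Za-z]{2,}", text)
    return len(words) >= min_words


if __name__ == "__main__":
    out_file = Path.home() / "OM_out.txt"  # output path from the example above
    if out_file.exists():
        print("text found" if ocr_output_has_text(out_file) else "no text found")
    else:
        print("no output file at", out_file)
```

A file that contains only stray quote marks and punctuation fails the check, which is roughly what a broken Leptonica build tends to produce when it produces anything at all.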
Start Solr Sample – This sample contains the proper ExtractingRequestHandler for processing with Tika

https://wiki.apache.org/solr/ExtractingRequestHandler

    cd /opt/lucidworks-hdpsearch/solr/bin/
    ./solr -e dih

Use Solr Admin to upload the image – Go back to the blog post or to the ExtractingRequestHandler page for the proper update/extract command syntax.

From Solr Admin, select the tika core.
Click Documents.
In the Request-Handler (qt) field, enter /update/extract
In the Document Type drop down, select File Upload.
Choose the png file.
In the Extracting Req. Handler Params box, type the following:

    literal.id=d1&uprefix=attr_&fmap.content=attr_content&commit=true

Understanding all the parameters is another process, but the literal.id is the unique id for the document. For more information on this command, start by reviewing https://wiki.apache.org/solr/ExtractingRequestHandler and then the Solr documentation.

Run a query

From Solr Admin, select the tika core.
Click Query.
In the q field, type attr_content:explained
Execute the query.

    http://sandbox.hortonworks.com:8983/solr/tika/select?q=attr_content%3Aexplained&wt=json&indent=true

Try it again – Use another png or supported file type. Be sure to use the same Request Handler Params, except provide a new unique literal.id.

Note that attr_content is a dynamic field, and it cannot be highlighted. If you figure out how to add an indexed and stored field to hold the image text, let me know 🙂
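Since every new document needs the same handler parameters with a fresh literal.id, the two URLs from this walkthrough can be generated rather than retyped. This is a small Python sketch under stated assumptions: the host and core name are the ones from the post, and the function names extract_url and query_url are made up for illustration.

```python
from urllib.parse import urlencode

# Host and core as used in the post; adjust for your own environment.
SOLR_BASE = "http://sandbox.hortonworks.com:8983/solr/tika"


def extract_url(doc_id, base=SOLR_BASE):
    """Build the /update/extract URL with the handler params from the post."""
    params = {
        "literal.id": doc_id,            # unique id for the document
        "uprefix": "attr_",              # prefix for fields not in the schema
        "fmap.content": "attr_content",  # map extracted text to attr_content
        "commit": "true",
    }
    return base + "/update/extract?" + urlencode(params)


def query_url(term, base=SOLR_BASE):
    """Build the select URL that searches the extracted image text."""
    params = {"q": "attr_content:" + term, "wt": "json", "indent": "true"}
    return base + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(extract_url("d2"))
    print(query_url("explained"))
```

The extract URL can then be used outside the admin UI, for example with curl in the form shown on the ExtractingRequestHandler wiki page (the URL in quotes followed by -F "myfile=@OM_1.jpg").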

bhagan · ‎10-04-2016

I faced the same issue. I used sqoop to import a table, then the search function just hung. I reimported the vm, and now I can't access the Atlas dashboard. I get a 503 error.

bhagan · ‎08-23-2016

@Sunile Manjee Yes, I did flatten the json. Here is what I used (all one line):

    {"enumTypes":[],"structTypes":[],"traitTypes": [{"superTypes":[],"hierarchicalMetaTypeName":"org.apache.atlas.typesystem.types.TraitType","typeName":"EXPIRES_ON","attributeDefinitions":[{"name":"expiry_date","dataTypeName":"string","multiplicity":"required","isComposite":false,"isUnique":false,"isIndexable":true,"reverseAttributeName": null}]}],"classTypes":[]}

But in my case, the problem was that I had left out an attribute.
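A left-out attribute like this is easy to catch locally before posting the payload. Here is a rough Python sketch; the set of expected keys is simply copied from the working payload above and is an assumption about what a complete definition looks like, not a statement of what Atlas itself enforces. The function name is hypothetical.

```python
import json

# Keys that appear on the attribute definition in the working payload above.
# Assumption: treating all of them as required is a conservative local check.
EXPECTED_ATTRIBUTE_KEYS = {
    "name", "dataTypeName", "multiplicity", "isComposite",
    "isUnique", "isIndexable", "reverseAttributeName",
}


def missing_attribute_keys(payload_text):
    """Map 'traitName.attributeName' to the sorted list of keys it lacks."""
    payload = json.loads(payload_text)
    problems = {}
    for trait in payload.get("traitTypes", []):
        for attr in trait.get("attributeDefinitions", []):
            missing = sorted(EXPECTED_ATTRIBUTE_KEYS - set(attr))
            if missing:
                label = "{}.{}".format(trait.get("typeName", "?"), attr.get("name", "?"))
                problems[label] = missing
    return problems


if __name__ == "__main__":
    # Same trait as above, with dataTypeName deliberately removed.
    broken = ('{"enumTypes":[],"structTypes":[],"traitTypes":[{"superTypes":[],'
              '"typeName":"EXPIRES_ON","attributeDefinitions":[{"name":"expiry_date",'
              '"multiplicity":"required","isComposite":false,"isUnique":false,'
              '"isIndexable":true,"reverseAttributeName":null}]}],"classTypes":[]}')
    print(missing_attribute_keys(broken))
```

An empty result means every attribute definition carries the same keys as the payload that worked.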

bhagan · ‎07-29-2016

I was reviewing some posts related to Pig, and found the following question interesting:

https://community.hortonworks.com/questions/47720/apache-pig-guarantee-that-all-the-value-in-a-colum.html#answer-47767

I wanted to share an alternative solution using Pentaho Data Integration (PDI), an open source ETL tool that provides visual MapReduce capabilities. PDI is YARN ready, so when you configure PDI to use your HDP cluster (or sandbox) and run the attached job, it will run as a YARN application.

The following image is your Mapper.

Above, you see the main transformation. It reads input, which you configure in the Pentaho MapReduce Job (seen below). The transformation follows a pattern, which is to immediately split the delimited file into individual fields. Next, I use a Java Expression to determine if a field is numeric. If not, then we set the value of the field to the String "null". Next, to prepare for MapReduce output, we concatenate the fields back together as a single value and pass the key / value to the MapReduce Output.

Once you have the main MapReduce transformation created, you wrap that into a PDI MapReduce Job. If you're familiar with MapReduce, you will recognize the configuration options below, which you would otherwise set in your code.

Next, configure your Mapper.

The Job Succeeds! And the file is in HDFS.
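For readers without PDI at hand, the per-record logic of the mapper (split, test each field, replace non-numerics, re-join) can be sketched in a few lines of Python. This is only an illustration of the transformation's logic, not the PDI job itself: the comma delimiter and the definition of "numeric" are assumptions, since the post does not show the Java expression or the input format.

```python
import re

# Assumption: a field counts as numeric if it is an optionally signed
# integer or decimal; the original Java expression is not shown in the post.
NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?$")


def clean_record(line, delimiter=","):
    """Split a delimited record, replace non-numeric fields with the
    string 'null', and join the fields back into a single value."""
    fields = line.rstrip("\n").split(delimiter)
    cleaned = [f if NUMERIC.match(f.strip()) else "null" for f in fields]
    return delimiter.join(cleaned)


if __name__ == "__main__":
    for record in ["1,2.5,abc", "x,,-7"]:
        print(clean_record(record))
```

Empty fields are treated as non-numeric here; whether that matches the original job depends on how its Java expression handled empty strings.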

bhagan · ‎07-26-2016

It is often the case that we need to install Hortonworks in environments with strict requirements. One such requirement may be that all HTTP traffic must go through a dedicated proxy server. When installing Hortonworks HDP using Ambari, you can find instructions for configuring Ambari to use the proxy on the docs.hortonworks.com website. For example, here is the page for configuring Ambari 2.2:

http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari_reference_guide/content/ch_setting_up_an_internet_proxy_server_for_ambari.html

Notice that the instructions mention that you must also configure yum to use the proxy. It’s important to note that the above instructions for yum will configure all repositories to use the proxy, and you may not want this behavior. So while it is great to set the proxy at yum’s global level, you should review any existing repository configurations to determine whether any of them should bypass the proxy. For those repositories, add the following option to their configuration:

    proxy=_none_

Additionally, while preparing for an HDP installation, you will also use the tools wget and curl. I suggest that you confirm that these tools are also set up to use the proxy. If not, it’s as easy as setting the proxy option in their configuration files.

Wget has a global file, /usr/local/etc/wgetrc (distribution packages typically use /etc/wgetrc instead). Wget options:

    use_proxy = on
    http_proxy = http://proxyhost:port

Curl does not have a global file, so you can create .curlrc in your home directory:

    proxy = [protocol://][user:password@]proxyhost[:port]

Once you have Ambari, yum, wget, and curl configured to use your proxy, you’ll be ready to start the installation.
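Reviewing the existing repository configurations can be partly automated. The sketch below, in Python, reads the text of a yum .repo file and lists the repositories that have no proxy option of their own and would therefore inherit the global proxy. The function name and the sample repository ids and URLs are hypothetical.

```python
import configparser


def repos_using_global_proxy(repo_file_text):
    """Return the repository ids that do not set their own proxy option
    and will therefore use the proxy from yum's global configuration."""
    # interpolation=None keeps '%' characters in URLs from being interpreted.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(repo_file_text)
    return [name for name in parser.sections()
            if not parser.has_option(name, "proxy")]


if __name__ == "__main__":
    # Hypothetical repo file: one internal mirror that bypasses the proxy,
    # one external repository that relies on the global setting.
    sample = """\
[internal-mirror]
name=Internal mirror
baseurl=http://mirror.example.local/os/
proxy=_none_

[external-repo]
name=External repository
baseurl=http://repo.example.com/os/
"""
    print(repos_using_global_proxy(sample))
```

Running it over each file in /etc/yum.repos.d gives a quick list of repositories to look at; deciding which of them should get proxy=_none_ still needs a human.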

bhagan · ‎06-15-2016

Atlas Quickstart creates a number of Tags. You may also have created some tags with the REST API. You may want to list the definition of a single Tag or Trait, or you may want a list of all Tags/Traits in Atlas.

The following command will list all Traits or Tags:

    curl -iv -H "Content-Type: application/json" -X GET "http://sandbox.hortonworks.com:21000/api/atlas/types?type=TRAIT"

The following response shows that I have seven Traits/Tags defined:

    {"results":["Dimension","ETL","Fact","JdbcAccess","Metric","PII","EXPIRES_ON"],"count":7,"requestId":"qtp1770708318-84 - 6efad306-cb19-4d12-8fd4-31f664e771eb"}

The following command returns the definition of a Tag/Trait named EXPIRES_ON:

    curl -iv -H "Content-Type: application/json" -X GET http://sandbox.hortonworks.com:21000/api/atlas/types/EXPIRES_ON

Following is the response:

    {"typeName":"EXPIRES_ON","definition":"{\n \"enumTypes\":[\n \n ],\n \"structTypes\":[\n \n ],\n \"traitTypes\":[\n {\n \"superTypes\":[\n \n ],\n \"hierarchicalMetaTypeName\":\"org.apache.atlas.typesystem.types.TraitType\",\n \"typeName\":\"EXPIRES_ON\",\n \"attributeDefinitions\":[\n {\n \"name\":\"expiry_date\",\n \"dataTypeName\":\"string\",\n \"multiplicity\":\"required\",\n \"isComposite\":false,\n \"isUnique\":false,\n \"isIndexable\":true,\n \"reverseAttributeName\":null\n }\n ]\n }\n ],\n \"classTypes\":[\n \n ]\n}","requestId":"qtp1770708318-97 - cffcd8b0-5ebe-4673-87b2-79fac9583557"}

Notice all of the newlines (\n) that are part of the response: the definition is returned as a JSON document embedded in a string. This is a known issue, and you can follow the progress in this JIRA: https://issues.apache.org/jira/browse/ATLAS-208
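Until that issue is resolved, the embedded definition can be decoded on the client side by parsing the response twice. A minimal Python sketch follows; the function name is made up, and the sample is a trimmed stand-in built to have the same shape as the response above rather than a captured response.

```python
import json


def unwrap_definition(response_text):
    """Parse an Atlas types response and decode its 'definition' field,
    which arrives as a JSON document embedded in a string."""
    response = json.loads(response_text)
    return json.loads(response["definition"])


if __name__ == "__main__":
    # Stand-in with the same shape as the response above, trimmed to a
    # few fields; json.dumps reproduces the escaped newlines.
    inner = {
        "traitTypes": [
            {
                "typeName": "EXPIRES_ON",
                "attributeDefinitions": [
                    {"name": "expiry_date", "dataTypeName": "string"}
                ],
            }
        ]
    }
    sample = json.dumps({"typeName": "EXPIRES_ON",
                         "definition": json.dumps(inner, indent=2)})
    definition = unwrap_definition(sample)
    print(json.dumps(definition, indent=2))
```

The first json.loads turns the \n and \" escapes back into real characters, so the second one sees an ordinary JSON document.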

bhagan · ‎06-14-2016

I figured this out. I had left out dataTypeName as part of the attributeDefinitions.

bhagan · ‎06-14-2016

Hello, can you tell me which version of HDP and Atlas you tested this with? I tried today with HDP 2.4, which comes with Atlas 0.5.0.2.4, and I'm getting an "Unable to deserialize json" error. I'm using the following curl command to test:

    curl -iv -d @./atlas_payload.json -H "Content-Type: application/json" -X POST http://sandbox.hortonworks.com:21000/api/atlas/types

Thanks!

bhagan · ‎05-17-2016

How to Index PDF File with Flume and MorphlineSolrSink

The flow is as follows:

    Spooling Directory Source > File Channel > MorphlineSolrSink

The reason I wanted to complete this exercise was to provide a less complex solution; that is, fewer moving parts, less configuration, and no coding compared to Kafka/Storm or Spark. Also, the example is easy to set up and demonstrate quickly. Flume, compared to Kafka/Storm, is limited by its declarative nature, but that is what makes it easy to use. However, the morphline does provide a java command (with some potential performance side effects), so you can get pretty explicit. I’ve read that Flume can handle on the order of 50,000 events per second on a single server, so while the pipe may not be as fat as a Kafka/Storm pipe, it may be well suited for many use cases.

Step-by-step guide

1. Take care of dependencies. I am running HDP 2.2.4 Sandbox and the Solr that came with it. To get started, you will need to add a lot of dependencies to your /usr/hdp/current/flume-server/lib/. You can get all of the dependencies from the /opt/solr/solr/contrib/ and /opt/solr/solr/dist directory structures.
    commons-fileupload-1.2.1.jar
    config-1.0.2.jar
    fontbox-1.8.4.jar
    httpmime-4.3.1.jar
    kite-morphlines-avro-0.12.1.jar
    kite-morphlines-core-0.12.1.jar
    kite-morphlines-json-0.12.1.jar
    kite-morphlines-tika-core-0.12.1.jar
    kite-morphlines-tika-decompress-0.12.1.jar
    kite-morphlines-twitter-0.12.1.jar
    lucene-analyzers-common-4.10.4.jar
    lucene-analyzers-kuromoji-4.10.4.jar
    lucene-analyzers-phonetic-4.10.4.jar
    lucene-core-4.10.4.jar
    lucene-queries-4.10.4.jar
    lucene-spatial-4.10.4.jar
    metrics-core-3.0.1.jar
    metrics-healthchecks-3.0.1.jar
    noggit-0.5.jar
    org.restlet-2.1.1.jar
    org.restlet.ext.servlet-2.1.1.jar
    pdfbox-1.8.4.jar
    solr-analysis-extras-4.10.4.jar
    solr-cell-4.10.4.jar
    solr-clustering-4.10.4.jar
    solr-core-4.10.4.jar
    solr-dataimporthandler-4.10.4.jar
    solr-dataimporthandler-extras-4.10.4.jar
    solr-langid-4.10.4.jar
    solr-map-reduce-4.10.4.jar
    solr-morphlines-cell-4.10.4.jar
    solr-morphlines-core-4.10.4.jar
    solr-solrj-4.10.4.jar
    solr-test-framework-4.10.4.jar
    solr-uima-4.10.4.jar
    solr-velocity-4.10.4.jar
    spatial4j-0.4.1.jar
    tika-core-1.5.jar
    tika-parsers-1.5.jar
    tika-xmp-1.5.jar

2. Configure Solr. Next there are some important Solr configurations:

solr.xml – The solr.xml included with collection1 was unmodified.
schema.xml – The schema.xml that is included with collection1 is all you need. It includes the fields that SolrCell will return when processing the PDF file. You need to make sure that you capture the fields you want with the solrCell command in the morphline.conf file.
solrconfig.xml – The solrconfig.xml that is included with collection1 is all you need. It includes the ExtractingRequestHandler that you need to process the PDF file.

3. Flume Configuration

    #agent config
    agent1.sources = spooling_dir_src
    agent1.sinks = solr_sink
    agent1.channels = fileChannel

    # Use a file channel
    agent1.channels.fileChannel.type = file
    #agent1.channels.fileChannel.capacity = 10000
    #agent1.channels.fileChannel.transactionCapacity = 10000

    # Configure source
    agent1.sources.spooling_dir_src.channels = fileChannel
    agent1.sources.spooling_dir_src.type = spooldir
    agent1.sources.spooling_dir_src.spoolDir = /home/flume/dropzone
    agent1.sources.spooling_dir_src.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

    #Configure Solr Sink
    agent1.sinks.solr_sink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
    agent1.sinks.solr_sink.morphlineFile = /home/flume/morphline.conf
    agent1.sinks.solr_sink.batchSize = 1000
    agent1.sinks.solr_sink.batchDurationMillis = 2500
    agent1.sinks.solr_sink.channel = fileChannel

4. Morphline Configuration File

    solrLocator: {
      collection : collection1
      #zkHost : "127.0.0.1:9983"
      zkHost : "127.0.0.1:2181"
    }

    morphlines : [
      {
        id : morphline1
        importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
        commands : [
          { detectMimeType { includeDefaultMimeTypes : true } }
          {
            solrCell {
              solrLocator : ${solrLocator}
              captureAttr : true
              lowernames : true
              capture : [title, author, content, content_type]
              parsers : [
                { parser : org.apache.tika.parser.pdf.PDFParser }
              ]
            }
          }
          { generateUUID { field : id } }
          { sanitizeUnknownSolrFields { solrLocator : ${solrLocator} } }
          { loadSolr: { solrLocator : ${solrLocator} } }
        ]
      }
    ]

5. Start Solr. I used the following command so I could watch the logging. Note I am using the embedded Zookeeper that starts with this command:

    ./solr start -f

6. Start Flume. I used the following command:

    /usr/hdp/current/flume-server/bin/flume-ng agent --name agent1 --conf /etc/flume/conf/agent1 --conf-file /home/flume/flumeSolrSink.conf -Dflume.root.logger=DEBUG,console

7. Drop a PDF file into /home/flume/dropzone.
If you're watching the log, you'll see when the process is completed.

8. In Solr Admin, queries to run:

    text:* (or any text in the file)
    title:* (or the title)
    content_type:* (or pdf)
    author:* (or the author)

Use the content field for highlighting, not for searching.
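One practical note on step 7: the Flume spooling directory source expects files to be complete and unchanged once they appear in the spool directory, so a large PDF is better staged elsewhere and then moved in with a single rename. The following Python sketch shows one way to do that; the function name and the staging directory are hypothetical, and it assumes the staging and spool directories are on the same filesystem.

```python
import os
import shutil


def drop_into_spool(src_path, staging_dir, spool_dir):
    """Copy src_path into staging_dir, then move it into spool_dir with a
    single rename so Flume never sees a partially written file.
    Assumes staging_dir and spool_dir are on the same filesystem."""
    os.makedirs(staging_dir, exist_ok=True)
    name = os.path.basename(src_path)
    staged = os.path.join(staging_dir, name)
    shutil.copyfile(src_path, staged)      # slow part happens outside the spool dir
    final = os.path.join(spool_dir, name)
    os.replace(staged, final)              # atomic on POSIX within one filesystem
    return final


if __name__ == "__main__":
    # Paths are illustrative; /home/flume/dropzone is the spoolDir from step 3.
    source = "/tmp/example.pdf"
    if os.path.exists(source) and os.path.isdir("/home/flume/dropzone"):
        print(drop_into_spool(source, "/home/flume/staging", "/home/flume/dropzone"))
```

File names should also be unique across drops, since the source tracks completed files by name.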

Last Visited	‎01-10-2022 11:19 AM

Member Since	‎09-29-2015 03:09 PM
Posts	142
Kudos received	45

Cloudera Community

How to Search for Text in an Image

Re: How to get Atlas up and running in HDP 2.5 San...

Re: Create Trait Types in Atlas

Finding Non-Numerics in a File - Pig Alternative

Preparing to Install HDP behind a Proxy

List Atlas Tags and Traits

Re: Create Trait Types in Atlas

Re: Create Trait Types in Atlas

How to Index PDF File with Flume and MorphlineSolr...