About bhagan

bhagan · ‎06-06-2018

Vinit, Take a look at this example that I put up on github: https://github.com/bchagan/spark-sql-concepts/blob/master/src/main/java/com/hagan/brian/spark/NestedStructureProcessor.java

bhagan · ‎06-08-2017

In Nifi, you can use the PutHiveStreaming processor, and it is designed to commit transactions in batch, which is configurable. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hive-nar/1.2.0/org.apache.nifi.processors.hive.PutHiveStreaming/index.html I think risky is not the correct term for using an ACID table, but careful may be better; that is, with careful design and configuration, you can avoid locking issues. Be sure to review the Hive documentation on this: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

bhagan · ‎06-07-2017

One way to do this is to first use the EvaluateJsonPath processor and break out the json into individual attributes. Notice that you can add ctime by clicking the + symbol, adding a name and a value. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.processors.standard.EvaluateJsonPath/index.html See image, evaluatejsonpath, below. Then later, you can use the AttributesToJson processor to rebuild the complete JSON event. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.processors.standard.AttributesToJSON/index.html See image, attributestojson, below.The output at this point would look like this { "twitterMessage" : "20x faster VM access w/ Dell EMC and Intel. Learn more https://t.co/1hD0MYfBbh https://t.co/0GhsMQve8b", "twitterUser" : "GCSIT Solutions", "ctime" : "Wed Jun 07 18:10:35 +0000 2017" }

bhagan · ‎05-30-2017

I suspect that your id column is not specified as the primary key on the table. Try making id the primary key, and see if you get different results.

bhagan · ‎05-26-2017

I think if one of the columns in the dataframe is the key of the HBase table, the lookup will be very efficient; that is, if there is a bottleneck, I don't believe the bottleneck will be the lookup.

bhagan · ‎05-26-2017

By default, ACLs are disabled, so you'll have to add the property to HDFS service. In Ambari, select the HDFS service. Select the Configs tab. Find or scroll to "Custom hdfs-site" Click Add Property. Enter the value: dfs.namenode.acls.enabled=true Save your changes and add a comment.

bhagan · ‎05-15-2017

My understanding is that once Kafka is configured for Kerberos, Kafka requires a ticket for both Producers and Consumers.

bhagan · ‎05-08-2017

One of the search use cases that I’ve been introduced to would require the ability to index text such as scanned text in png files. I set out to figure out how to do this with SOLR. I came across a couple pretty good blog posts, but as usual, you have to put together what you learn from multiple sources before you can get things to work correctly (or at least that’s what usually happens for me). So I thought I would put together the steps I took to get it to work. I used HDP Sandbox 2.3. Step-by-step guideInstall dependencies - this will provide you support for processing pngs, jpegs, and tiffs yum install autoconf automake libtool yum install libpng-devel yum install libjpeg-devel yum install libtiff-devel yum install zlib-devel Download Leptonica, an image processing library wget http://www.leptonica.org/source/leptonica-1.69.tar.gz Download Tesseract, an Optical Character Recognition engine wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz Ensure proper variables and pathing are set – This is necessary so that when building leptonica, the build can find the dependencies that you installed earlier. If this pathing is not correct, you will get Unsupported image type errors when running tesseract command line client. Also, when installing tesseract, you will place language data at TESSDATA_PREFIX dir. [root@sandbox tesseract-ocr]# cat ~/.profileexport TESSDATA_PREFIX='/usr/local/share/'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/lib64 Build leptonica tar xvf leptonica-1.69.tar.gz cd leptonica-1.69./configure make sudo make install Build Tesseract tar xvf tesseract-ocr-3.02.02.tar.gz cd tesseract-ocr ./autogen.sh ./configure make sudo make installsudo ldconfig Download tesseract language(s) and place them in TESSDATA_PREFIX dir, defined above wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz tar xzf tesseract-ocr-3.02.eng.tar.gz cp tesseract-ocr/tessdata/* /usr/local/share/tessdata Test Tesseract – Use the image in this blog post. You’ll notice that this is where I started. The ‘hard’ part of this was getting the builds correct for leptonica. And the problem there was ensuring that I had the correct dependencies installed and that they were available on the path defined above. If this doesn’t work, there’s no sense moving on to SOLR. http://blog.thedigitalgroup.com/vijaym/2015/07/17/using-solr-and-tikaocr-to-search-text-inside-an-image/ [root@sandbox tesseract-ocr]# /usr/local/bin/tesseract ~/OM_1.jpg ~/OM_out Tesseract Open Source OCR Engine v3.02.02 with Leptonica [root@sandbox tesseract-ocr]# cat ~/OM_out.txt ‘ '"I“ " "' ./lrast. Shortly before the classes started I was visiting a.certain public school, a school set in a typically Englishcountryside, which on the June clay of my visit was wonder-fully beauliful. The Head Master—-no less typical than hisschool and the country-side—pointed out the charms ofboth, and his pride came out in the ?nal remark which he madebeforehe left me. He explained that he had a class to takein'I'heocritus. Then (with a. buoyant gesture); “ Can you, conceive anything more delightful than a class in Theocritus,on such a day and in such a place?" If you have text in your out file, then you’ve done it correctly! Start Solr Sample – This sample contains the Proper Extracting Request Handler for processing with tika https://wiki.apache.org/solr/ExtractingRequestHandler cd /opt/lucidworks-hdpsearch/solr/bin/ ./solr -e dih Use SOLR Admin to upload the image Go back to the blog post or to the RequestHandler page for the proper update/extract command syntax. From SOLR admin, select the tika core. Click Documents In the Request-Handler (qt) field, enter /update/extract In the Document Type drop down, select File Upload Choose the png file In the Extracting Req. Handler Params box, type the following: literal.id=d1&uprefix=attr_&fmap.content=attr_content&commit=true Understanding all the parameters is another process, but the literal.id is the unique id for the document. For more information on this command, start by reviewing https://wiki.apache.org/solr/ExtractingRequestHandler and then the SOLR documentation. Run a query From SOLR admin, select tika core. Click Query. In the q field, type attr_content:explained Execute the query. http://sandbox.hortonworks.com:8983/solr/tika/select?q=attr_content%3Aexplained&wt=json∈dent=true Try it again Use another png or supported file type. Be sure to use the same Request Handler Params, except provide a new unique literal.id Note, that the attr_content is a dynamic field, and it cannot be highlighted. If you figure out how to add an indexed and stored field to hold the image text, let me know 🙂

bhagan · ‎05-01-2017

@Gu Gur And just to add a little detail to Matt's comment, from the documentation, Regular Expressions are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed.

bhagan · ‎04-28-2017

The following page has examples: https://github.com/abajwa-hw/ambari-flink-service/blob/master/package/scripts/flink.py Ali has many examples of services in his repositories. The setup page for this service shows how to get the scripts into the cluster. https://github.com/abajwa-hw/ambari-flink-service

Online	Offline
Last Visited	‎01-10-2022 11:19 AM

Member Since	‎09-29-2015 03:09 PM
Last Visited	‎01-10-2022 11:19 AM
Posts	142
Kudos received	45

Cloudera Community

Re: HIVE insert/update/delete

Re: updating and inserting new data to mysql using...

Re: requirement for ACLs

Re: Ambari - adding custom service

Re: nifi dataflow - get result, parse it, and save...

Re: Iterating through nested fields in spark DF

Re: HIVE insert/update/delete

Re: Nifi - How to add key:value to json

Re: updating and inserting new data to mysql using...

Re: How to perform lookup operation in spark dataf...

Re: requirement for ACLs

Re: Can Kafka handle the mixture of authentication...

How to Search for Text in an Image

Re: nifi dataflow - get result, parse it, and save...

Re: Ambari - adding custom service