Member since
09-29-2015
142
Posts
45
Kudos Received
15
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1001 | 06-08-2017 05:28 PM
 | 4167 | 05-30-2017 02:07 PM
 | 722 | 05-26-2017 07:48 PM
 | 2398 | 04-28-2017 02:48 PM
 | 1320 | 04-28-2017 02:41 PM
06-06-2018
02:37 PM
It depends on what you're trying to do, but perhaps the first thing you want to do is tokenize each line by the space delimiter:
-- sample2.txt
-- 03:00:00,685 INFO [tr.com.anadolubank.gm.server.ANDDefaultServiceExecuter] (http-/0.0.0.0:8080-1) [31e432d4-6a89-4828-9c24-0f1d596eed23][10.40.26.49][WEB_AUTHENTICATE] started
-- 03:00:00,703 INFO [tr.com.anadolubank.gm.server.ANDDefaultServiceExecuter] (http-/0.0.0.0:8080-1) [31e432d4-6a89-4828-9c24-0f1d596eed23][10.40.26.49][WEB_AUTHENTICATE] executed in 18 ms
-- 03:00:00,898 INFO [tr.com.anadolubank.gm.server.ANDDefaultServiceExecuter] (http-/0.0.0.0:8080-1) [88898a09-0664-4a77-bc53-3d428712e4ef][10.40.26.49][sessionKill] started
-- 03:00:00,947 INFO [tr.com.anadolubank.gm.server.ANDDefaultServiceExecuter] (http-/0.0.0.0:8080-1) [88898a09-0664-4a77-bc53-3d428712e4ef][10.40.26.49][sessionKill] executed in 49 ms
A = LOAD '/user/admin/sample2.txt' AS (line:chararray);
X = FOREACH A GENERATE TOKENIZE(line, ' ');
DUMP X;
-- results
-- ({(03:00:00,685),(INFO),([tr.com.anadolubank.gm.server.ANDDefaultServiceExecuter]),((http-/0.0.0.0:8080-1)),([31e432d4-6a89-4828-9c24-0f1d596eed23][10.40.26.49][WEB_AUTHENTICATE]),(started)})
-- ({(03:00:00,703),(INFO),([tr.com.anadolubank.gm.server.ANDDefaultServiceExecuter]),((http-/0.0.0.0:8080-1)),([31e432d4-6a89-4828-9c24-0f1d596eed23][10.40.26.49][WEB_AUTHENTICATE]),(executed),(in),(18),(ms)})
-- ({(03:00:00,898),(INFO),([tr.com.anadolubank.gm.server.ANDDefaultServiceExecuter]),((http-/0.0.0.0:8080-1)),([88898a09-0664-4a77-bc53-3d428712e4ef][10.40.26.49][sessionKill]),(started)})
-- ({(03:00:00,947),(INFO),([tr.com.anadolubank.gm.server.ANDDefaultServiceExecuter]),((http-/0.0.0.0:8080-1)),([88898a09-0664-4a77-bc53-3d428712e4ef][10.40.26.49][sessionKill]),(executed),(in),(49),(ms)})
... View more
06-06-2018
02:28 PM
-- sample.txt
-- 03:00:00,685 INFO [aa.com.aaaa.gm.server.ANDDefaultServiceExecuter] (http-/0.0.0.0:8080-1) [31e432d4-6a89-4828-9c24-0f1d596eed23][10.40.26.49][WEB_AUTHENTICATE] started
A = LOAD '/user/admin/sample.txt' AS (line:chararray);
X = FOREACH A GENERATE TOKENIZE(line, ' ');
DUMP X;
-- results
({(03:00:00,685),(INFO),([aa.com.aaaa.gm.server.ANDDefaultServiceExecuter]),((http-/0.0.0.0:8080-1)),([31e432d4-6a89-4828-9c24-0f1d596eed23][10.40.26.49][WEB_AUTHENTICATE]),(started)})
... View more
06-06-2018
01:34 PM
Vinit, take a look at this example that I put up on GitHub: https://github.com/bchagan/spark-sql-concepts/blob/master/src/main/java/com/hagan/brian/spark/NestedStructureProcessor.java
... View more
06-06-2018
01:13 PM
Hi Sami, have you come up with anything? After a table has been defined, I don't see a way to add a column qualifier without adding a value at the same time. Is it acceptable to create a table with 40 column qualifiers and add the values that are in the string? If you don't have 40 tokens, then populate the remaining cells with a constant, like -1. I've been able to accomplish this. So given the sequence "1,2,3,4", the resulting row looks like this:
haganbrian column=f1:c1, timestamp=1528213851119, value=\x00\x00\x00\x01
haganbrian column=f1:c2, timestamp=1528213851119, value=\x00\x00\x00\x02
haganbrian column=f1:c3, timestamp=1528213851119, value=\x00\x00\x00\x03
haganbrian column=f1:c4, timestamp=1528213851119, value=\x00\x00\x00\x04
haganbrian column=f1:c5, timestamp=1528213851119, value=\xFF\xFF\xFF\xFF
haganbrian column=f1:c6, timestamp=1528213851119, value=\xFF\xFF\xFF\xFF
...
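Here is a rough sketch of that approach with the HBase Java client (a minimal sketch, not the exact code I ran; the table name is a placeholder, while the f1 family, c1..c40 qualifiers, and -1 filler follow the output above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FixedWidthRowWriter {

    private static final int NUM_QUALIFIERS = 40;  // fixed c1..c40 layout
    private static final int FILLER = -1;          // constant for missing tokens

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) { // placeholder table name

            String sequence = "1,2,3,4";
            String[] tokens = sequence.split(",");

            Put put = new Put(Bytes.toBytes("haganbrian"));  // row key
            for (int i = 0; i < NUM_QUALIFIERS; i++) {
                // Use the parsed token if present, otherwise pad with the filler value.
                int value = (i < tokens.length) ? Integer.parseInt(tokens[i].trim()) : FILLER;
                put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("c" + (i + 1)), Bytes.toBytes(value));
            }
            table.put(put);
        }
    }
}

Bytes.toBytes(int) produces the 4-byte values shown above, so the -1 filler appears as \xFF\xFF\xFF\xFF.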
... View more
05-17-2018
06:55 PM
Is it possible that a large portion of your data has the same or a similar key, such as a timestamp, which would cause hotspotting? Because you imported the table, all records will have similar timestamps. Take a look at the records to see.
... View more
11-15-2017
06:08 PM
Take a look at your Ambari Dashboard and let us know what the HDFS Disk Usage indicates.
... View more
08-03-2017
06:49 PM
Try creating your table with bucketing and sorting as described here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
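For reference, a hedged sketch of what such a DDL might look like, submitted here through the HiveServer2 JDBC driver (the table name, columns, bucket count, and JDBC URL are all placeholders; you can run the same statement from Beeline or the Hive CLI instead):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateBucketedTable {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL; adjust host, port, and database for your cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Bucketed and sorted table, per the LanguageManual DDL page linked above.
            stmt.execute(
                "CREATE TABLE page_views (userid BIGINT, page STRING, view_time TIMESTAMP) "
              + "CLUSTERED BY (userid) SORTED BY (view_time) INTO 32 BUCKETS "
              + "STORED AS ORC");
        }
    }
}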
... View more
06-08-2017
05:28 PM
In NiFi, you can use the PutHiveStreaming processor, which is designed to commit transactions in configurable batches: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hive-nar/1.2.0/org.apache.nifi.processors.hive.PutHiveStreaming/index.html I think "risky" is not the correct term for using an ACID table; "careful" may be better. That is, with careful design and configuration, you can avoid locking issues. Be sure to review the Hive documentation on this: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
... View more
06-08-2017
03:51 PM
I suspect that you did not create the Hive table with the correct SerDe. Are you using the Avro SerDe described here? https://cwiki.apache.org/confluence/display/Hive/AvroSerDe
... View more
06-08-2017
03:34 PM
1 Kudo
Your question leads me to believe that you know how to write Avro messages into a Kafka topic. Once your Avro message is in a Kafka topic, you will need to write a consumer that can retrieve the Avro message from the topic. The nature of the message doesn't really matter in terms of retrieving it. The Spark Streaming documentation provides an example of how to set up a consumer to retrieve events from Kafka topics. You could expand the example to process the Avro message. https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
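As a rough illustration (my own sketch based on the 0-10 integration guide, not code from your application; the broker list, topic name, and the Avro decoding step are placeholders), a Java consumer might look like this:

import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class AvroKafkaStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("AvroKafkaStream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:6667");               // placeholder brokers
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class); // raw Avro bytes
        kafkaParams.put("group.id", "avro-consumer-group");
        kafkaParams.put("auto.offset.reset", "latest");

        Collection<String> topics = Collections.singletonList("avro-topic"); // placeholder topic

        JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(topics, kafkaParams));

        // Each record value is the raw Avro payload; decode it with your schema,
        // e.g. using Avro's GenericDatumReader or SpecificDatumReader.
        stream.foreachRDD(rdd -> rdd.foreach(record -> {
            byte[] avroBytes = record.value();
            // deserialize avroBytes here
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}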
... View more
06-07-2017
06:14 PM
One way to do this is to first use the EvaluateJsonPath processor to break out the JSON into individual attributes. Notice that you can add ctime by clicking the + symbol and adding a name and a value.
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.processors.standard.EvaluateJsonPath/index.html See image, evaluatejsonpath, below.
Then later, you can use the AttributesToJSON processor to rebuild the complete JSON event.
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.processors.standard.AttributesToJSON/index.html
See image, attributestojson, below. The output at this point would look like this:
{
"twitterMessage" : "20x faster VM access w/ Dell EMC and Intel. Learn more https://t.co/1hD0MYfBbh https://t.co/0GhsMQve8b",
"twitterUser" : "GCSIT Solutions",
"ctime" : "Wed Jun 07 18:10:35 +0000 2017"
}
... View more
05-30-2017
02:07 PM
1 Kudo
I suspect that your id column is not specified as the primary key on the table. Try making id the primary key, and see if you get different results.
... View more
05-30-2017
01:57 PM
Yes, you would use the lastmodified mode for the Sqoop incremental import, as explained here: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_incremental_imports
... View more
05-30-2017
01:53 PM
Not from within Ambari, but I always like to have the Pig documentation page open while I'm working with Pig. http://pig.apache.org/docs/r0.16.0/basic.html
... View more
05-30-2017
01:47 PM
Sounds like you want to use the GetHTTP processor. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.processors.standard.GetHTTP/index.html There is an example on the NiFi wiki, https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates, and I'm sure there are a few in the Articles and Repos sections on HCC. Have a great day!
... View more
05-26-2017
08:13 PM
I think if one of the columns in the dataframe is the key of the HBase table, the lookup will be very efficient; that is, if there is a bottleneck, I don't believe the bottleneck will be the lookup.
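For example, a per-partition lookup, so the HBase connection is created once per partition and reused for every row in it, might look roughly like this Java sketch (the table name and key column are hypothetical):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class HBaseLookup {
    public static void lookup(Dataset<Row> df) {
        df.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            // One connection and table handle per partition, reused for every row.
            try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = connection.getTable(TableName.valueOf("lookup_table"))) { // placeholder
                while (rows.hasNext()) {
                    Row row = rows.next();
                    String key = row.getAs("hbase_key");              // placeholder key column
                    Result result = table.get(new Get(Bytes.toBytes(key)));
                    // use result.getValue(family, qualifier) as needed
                }
            }
        });
    }
}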
... View more
05-26-2017
07:57 PM
Sean, I would suggest using the HA methods prescribed by the RDBMS vendor. If it is set up according to the vendor's method for a DB that Hortonworks supports, then I see no reason why we wouldn't support HA for MySQL, Postgres, or Oracle with their respective services.
... View more
05-26-2017
07:48 PM
By default, ACLs are disabled, so you'll have to add the property to the HDFS service. In Ambari, select the HDFS service. Select the Configs tab. Find or scroll to "Custom hdfs-site" and click Add Property. Enter the value: dfs.namenode.acls.enabled=true. Save your changes and add a comment.
... View more
05-15-2017
06:29 PM
1 Kudo
My understanding is that once Kafka is configured for Kerberos, Kafka requires a ticket for both Producers and Consumers.
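For reference, a minimal sketch of the client-side configuration a Kerberized producer might use (the broker and service name are assumptions; the ticket or keytab itself comes from the JAAS configuration referenced by java.security.auth.login.config, which is outside this snippet):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KerberizedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:6667");          // assumed broker
        props.put("security.protocol", "SASL_PLAINTEXT");        // or SASL_SSL
        props.put("sasl.kerberos.service.name", "kafka");        // typical Kafka service principal name
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The client authenticates with the ticket/keytab referenced by the JAAS config,
        // e.g. -Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"));
        }
    }
}

Consumers need the same security.protocol and sasl.kerberos.service.name settings, plus their own valid ticket.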
... View more
05-08-2017
06:32 PM
2 Kudos
One of the search use cases that I've been introduced to would require the ability to index text such as scanned text in png files. I set out to figure out how to do this with SOLR. I came across a couple pretty good blog posts, but as usual, you have to put together what you learn from multiple sources before you can get things to work correctly (or at least that's what usually happens for me). So I thought I would put together the steps I took to get it to work. I used HDP Sandbox 2.3.

Step-by-step guide

1. Install dependencies - this provides support for processing pngs, jpegs, and tiffs.
yum install autoconf automake libtool
yum install libpng-devel
yum install libjpeg-devel
yum install libtiff-devel
yum install zlib-devel

2. Download Leptonica, an image processing library.
wget http://www.leptonica.org/source/leptonica-1.69.tar.gz

3. Download Tesseract, an Optical Character Recognition engine.
wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz

4. Ensure proper variables and pathing are set. This is necessary so that when building Leptonica, the build can find the dependencies that you installed earlier. If this pathing is not correct, you will get "Unsupported image type" errors when running the tesseract command-line client. Also, when installing Tesseract, you will place language data in the TESSDATA_PREFIX dir.
[root@sandbox tesseract-ocr]# cat ~/.profile
export TESSDATA_PREFIX='/usr/local/share/'
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/lib64

5. Build Leptonica.
tar xvf leptonica-1.69.tar.gz
cd leptonica-1.69
./configure
make
sudo make install

6. Build Tesseract.
tar xvf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr
./autogen.sh
./configure
make
sudo make install
sudo ldconfig

7. Download the Tesseract language file(s) and place them in the TESSDATA_PREFIX dir defined above.
wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
tar xzf tesseract-ocr-3.02.eng.tar.gz
cp tesseract-ocr/tessdata/* /usr/local/share/tessdata

8. Test Tesseract. Use the image in this blog post; you'll notice that this is where I started. The 'hard' part of this was getting the builds correct for Leptonica, and the problem there was ensuring that I had the correct dependencies installed and that they were available on the path defined above. If this doesn't work, there's no sense moving on to SOLR.
http://blog.thedigitalgroup.com/vijaym/2015/07/17/using-solr-and-tikaocr-to-search-text-inside-an-image/
[root@sandbox tesseract-ocr]# /usr/local/bin/tesseract ~/OM_1.jpg ~/OM_out
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[root@sandbox tesseract-ocr]# cat ~/OM_out.txt
‘ '"I“ " "' ./lrast. Shortly before the classes started I was visiting a.certain public school, a school set in a typically Englishcountryside, which on the June clay of my visit was wonder-fully beauliful. The Head Master—-no less typical than hisschool and the country-side—pointed out the charms ofboth, and his pride came out in the ?nal remark which he madebeforehe left me. He explained that he had a class to takein'I'heocritus. Then (with a. buoyant gesture); “ Can you, conceive anything more delightful than a class in Theocritus,on such a day and in such a place?"
If you have text in your out file, then you've done it correctly!

9. Start the SOLR sample. This sample contains the proper Extracting Request Handler for processing with Tika.
https://wiki.apache.org/solr/ExtractingRequestHandler
cd /opt/lucidworks-hdpsearch/solr/bin/
./solr -e dih

10. Use SOLR Admin to upload the image. Go back to the blog post or to the RequestHandler page for the proper update/extract command syntax. From SOLR Admin, select the tika core and click Documents. In the Request-Handler (qt) field, enter /update/extract. In the Document Type drop-down, select File Upload and choose the png file. In the Extracting Req. Handler Params box, type the following: literal.id=d1&uprefix=attr_&fmap.content=attr_content&commit=true. Understanding all the parameters is another process, but literal.id is the unique id for the document. For more information on this command, start by reviewing https://wiki.apache.org/solr/ExtractingRequestHandler and then the SOLR documentation.

11. Run a query. From SOLR Admin, select the tika core and click Query. In the q field, type attr_content:explained and execute the query. http://sandbox.hortonworks.com:8983/solr/tika/select?q=attr_content%3Aexplained&wt=json&indent=true

12. Try it again. Use another png or supported file type. Be sure to use the same Request Handler Params, except provide a new unique literal.id.

Note that attr_content is a dynamic field, and it cannot be highlighted. If you figure out how to add an indexed and stored field to hold the image text, let me know 🙂
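If you would rather run the query from code than from the admin UI, a small SolrJ sketch (my own addition; it assumes a recent SolrJ client on the classpath and reuses the tika core URL from the steps above) looks roughly like this:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class TikaCoreQuery {
    public static void main(String[] args) throws Exception {
        // Same host and core used in the admin UI example above.
        String url = "http://sandbox.hortonworks.com:8983/solr/tika";
        try (HttpSolrClient solr = new HttpSolrClient.Builder(url).build()) {
            QueryResponse response = solr.query(new SolrQuery("attr_content:explained"));
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}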
... View more
Tags:
- Data Processing
- How-To/Tutorial
- image
- solr
05-01-2017
03:16 PM
@Gu Gur And just to add a little detail to Matt's comment, from the documentation, Regular Expressions are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed.
... View more
04-28-2017
03:00 PM
Is there more output to this? Perhaps a line with timestamp and ERROR? If so, can you include it here?
... View more
04-28-2017
02:48 PM
1 Kudo
The following page has examples: https://github.com/abajwa-hw/ambari-flink-service/blob/master/package/scripts/flink.py Ali has many examples of services in his repositories. The setup page for this service shows how to get the scripts into the cluster. https://github.com/abajwa-hw/ambari-flink-service
... View more
04-28-2017
02:41 PM
Take a look at the ExtractText processor and see if it meets your needs: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExtractText/index.html
... View more
04-28-2017
02:35 PM
Could you provide your entire replacement value and the event payload? It looks to me like your event does not include one of the attributes/parameters in capture group 5.
... View more
02-27-2017
08:59 PM
The description of NiFi expression language functions ends with the statement, "After evaluating expression language functions, all attributes are stored as type String." https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#functions How do you plan to use the attribute downstream? If you're just looking for a way to have your output become an attribute, the following page has a nice example: http://funnifi.blogspot.com/2016/02/executescript-processor-hello-world.html
... View more
02-27-2017
07:41 PM
Kafka is used in a real-time streaming scenario where you need to read, write, store, and process data streams. To use Kafka in this scenario, you would have to build a Kafka Producer that has knowledge of, or has access to, the data in Oracle that needs to be replicated. A Producer would be a Java class that publishes an event to a Kafka Topic. You would also need to build a Kafka Consumer in Java that subscribes to and reads the event from the Kafka Topic and writes it to Teradata. For complex logic, you could use the Kafka Streams API.
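To make that concrete, here is a bare-bones sketch of the two pieces (the brokers, topic name, and the Oracle read and Teradata write are placeholders; in practice the producer would be long-lived and the consumer would write via a JDBC batch):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OracleToTeradataPipeline {

    // Producer side: publish each change captured from Oracle to a Kafka topic.
    // (A real producer would be long-lived, not created per message.)
    static void publish(String key, String rowAsJson) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:6667");  // placeholder brokers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("oracle-changes", key, rowAsJson)); // placeholder topic
        }
    }

    // Consumer side: subscribe to the topic and write each event to Teradata.
    static void consume() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:6667");  // placeholder brokers
        props.put("group.id", "teradata-writer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("oracle-changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // insert record.value() into Teradata, e.g. via a JDBC batch
                }
            }
        }
    }
}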
... View more
01-27-2017
03:12 PM
Arun, yes, timestamp is supported as an Avro logical type. Here is a link to the doc: https://avro.apache.org/docs/1.8.1/spec.html#Logical+Types
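As a quick illustration (a minimal sketch of my own, with a hypothetical record and field name), a timestamp is expressed as a long annotated with the timestamp-millis logical type, which the standard Avro Java library parses directly:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampSchemaExample {
    public static void main(String[] args) {
        // "timestamp-millis" is a logical type that annotates the underlying long type.
        String schemaJson =
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"event_time\",\"type\":{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}}"
          + "]}";
        Schema schema = new Schema.Parser().parse(schemaJson);
        // Prints "timestamp-millis"
        System.out.println(LogicalTypes.fromSchema(schema.getField("event_time").schema()).getName());
    }
}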
... View more
01-06-2017
08:41 PM
You can read messages to the console from a particular offset using the Simple Consumer CLI: https://cwiki.apache.org/confluence/display/KAFKA/System+Tools Search for Simple Consumer.
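If you would rather do this from code than from the CLI, the Java consumer API can also start reading at an explicit offset with assign and seek; a rough sketch (broker, topic, partition, and offset are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReadFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:6667");  // placeholder broker
        props.put("group.id", "offset-reader");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("my-topic", 0); // placeholder topic/partition
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 1000L);                              // placeholder starting offset
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}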
... View more
12-20-2016
04:44 PM
1 Kudo
It wasn't clear to me from this thread: have you shut down both the Ambari server and the ambari-agent? If not, I would perform the following: stop the Ambari server, stop the agent, start the Ambari server, then start the agent with the command ambari-agent start --verbose. Then perhaps include the ambari-agent log as an attachment.
... View more