Member since
03-16-2016
707
Posts
1753
Kudos Received
203
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 5126 | 09-21-2018 09:54 PM |
 | 6493 | 03-31-2018 03:59 AM |
 | 1968 | 03-31-2018 03:55 AM |
 | 2176 | 03-31-2018 03:31 AM |
 | 4821 | 03-27-2018 03:46 PM |
08-17-2016
01:33 PM
@nizar saddiki In databases other than Hive, ST_Transform converts two-dimensional ST_Geometry data into the spatial reference specified by the spatial reference ID (SRID). The SRID parameter is not supported in Hive, so you need to pre-process the data in another system before uploading it to Hive. Usually, that leads to denormalization: you would add a new column for each SRID. However, if there are too many of them, it is probably better to write your own ST_Transform service or function. I wish I could give better news. Check this article: https://community.hortonworks.com/articles/44319/geo-spatial-queries-with-hive-using-esri-geometry.html. Also: https://community.hortonworks.com/articles/44319/geo-spatial-queries-with-hive-using-esri-geometry.html It will show you how to add the jar and create the function, as well as how to use it. The second article includes some limitations.
08-16-2016
04:00 AM
1 Kudo
@mqureshi I assume that you are asking about Java regex. There are various flavors depending on the language, e.g. Java, C#, VB, etc.

str.replace(/[@]/g,"")
str.replace(/[$]/g,"")

or str.replace(/[$@]/g,"") if you want to handle both in one pass. I assume that you want all of those characters replaced at once, so you could use str.replaceAll. Keep in mind that $ is also a special character in regex: outside a character class it matches the end of a line. If you need to handle other scenarios where there could be some ambiguity between $ as a character and the end of the line, use an escape character to indicate that you really mean $. A good testing tool for your patterns: http://regexr.com/
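As a minimal sketch of the same idea in Python rather than Java (the function names and sample strings are only illustrative), the one-pass removal and the escaped form could look like this. Inside a character class the dollar sign is literal, while outside one it has to be escaped.

```python
import re

# Inside [...] the dollar sign is a literal character, so no escape is needed.
STRIP_BOTH = re.compile(r"[$@]")


def strip_symbols(text: str) -> str:
    """Remove every '$' and '@' from text in a single pass."""
    return STRIP_BOTH.sub("", text)


def strip_dollar(text: str) -> str:
    """Remove only '$'. Outside a character class '$' is the end-of-line
    anchor, so it is escaped with a backslash to match the literal character."""
    return re.sub(r"\$", "", text)


if __name__ == "__main__":
    print(strip_symbols("user@example.com owes $5"))
```

Compiling the pattern once is just a convenience when the same replacement is applied to many strings.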
08-15-2016
09:10 PM
@mqureshi Technically, you are correct. Since I did not know the reason for storing the jar file in HDFS, I wanted to provide an "either way" solution. For example, in large development teams, I have used a trick for cases where the jar files needed to be stored in HDFS so that they could be easily shared between multiple clients, including Hive as UDFs. It was something like the reference you provided. You are correct that the jar will still be picked up from the local file system; however, HDFS provides a centralized location which can be used as a target for build artifacts and shared across multiple developers in a team:

hadoop fs -copyToLocal hdfs:///home/usr/jar/myjar.jar /tmp/myjar.jar && hadoop jar /tmp/myjar.jar com.test.TestMain
08-15-2016
08:47 PM
3 Kudos
@Ram D Nothing automated; however, you can configure Dynamic Resource Allocation manually, as a one-time activity: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_spark-guide/content/config-dra-manual.html Some more here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_spark-guide/content/ch_tuning-spark.html
08-15-2016
01:37 AM
It can be in HDFS or on the local file system. You just need to reference it properly and have the proper privileges.
08-15-2016
01:35 AM
1 Kudo
@Tech Guy The message is misleading. You need to provide a path for your input and output files and make sure that your user has privileges on those folders. You should also try to place the jar locally as @mqureshi suggested, which is probably the easiest; otherwise you would need to specify a full HDFS path to the jar file. For example, you could run the command as follows:

hadoop jar /home/joe/wordcount.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output

If the response is helpful, vote/accept answer.
08-15-2016
01:23 AM
3 Kudos
@bob bza Let's start with: https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md

Get a list of services:
GET /clusters/:name/services

For any of the services listed, including ResourceManager, view service information:
GET /clusters/:clusterName/services/:serviceName

Of course, prepend the GET statements with:
curl -u admin:admin -H "X-Requested-By: ambari" -X

Another useful reference: https://cwiki.apache.org/confluence/display/AMBARI/Using+APIs+to+delete+a+service+or+all+host+components+on+a+host

Please vote/accept answer, if helpful.
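For scripting the same calls, here is a rough Python sketch using only the standard library. It assumes the usual /api/v1 prefix in front of the resource paths; the host name, port, cluster name, service name and the ambari_get helper are all placeholders of mine, not values from your environment. The request is only built here, not sent.

```python
import base64
import urllib.request


def ambari_get(base_url: str, path: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for an Ambari REST resource.

    base_url is something like http://ambari.example.com:8080 (placeholder);
    path is the resource path, e.g. clusters/mycluster/services.
    """
    # Assumption: resources live under the /api/v1 prefix.
    url = f"{base_url.rstrip('/')}/api/v1/{path.lstrip('/')}"
    # Same credentials that curl would send with -u user:password.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        method="GET",
        headers={
            "Authorization": f"Basic {token}",
            "X-Requested-By": "ambari",
        },
    )


if __name__ == "__main__":
    # Hypothetical host, cluster and service names, for illustration only.
    req = ambari_get("http://ambari.example.com:8080",
                     "clusters/mycluster/services/YARN", "admin", "admin")
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would actually send it; the response body is JSON.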
08-12-2016
11:03 PM
3 Kudos
Introduction

This article is not meant to show how to install NiFi or create a “Hello World” NiFi data flow, but how to resolve a data filtering problem with NiFi using two approaches: a filter list kept as a file on the disk, which could be static or dynamic, and a list stored in a distributed cache populated from the same file. The amount of data used was minimal and simplistic, so no performance difference can be perceived; however, at scale, where memory is available, a caching implementation should perform better.

This article assumes some familiarity with NiFi: knowing what a processor or a queue is, how to set the basic configuration for a processor or a queue, how to visualize the data at various steps throughout the flow, and how to start and stop processors. Since you are somewhat familiar with NiFi, you probably know how to install it and start it; however, I will provide a quick refresher below.

Pre-requisites

For this demo, I used the latest version of NiFi available at the time of working on this demo, 0.6.1. This version was not part of HDP 2.4.2, which was available at the time of this demo and also has 0.5.1. HDP 2.5 was just launched last month at the Hadoop Summit in Santa Clara. On OSX, brew install nifi will only install NiFi 0.5.1, which does not have some of the features needed for the demo, e.g. PutDistributedMapCache or FetchDistributedMapCache. Instead, use the following steps.

A reference about downloading and installing on Linux and Mac is here. You can download NiFi 0.6.1 from any of the sites listed there, for example: https://www.apache.org/dyn/closer.lua?path=/nifi/0.6.1/nifi-0.6.1-bin.tar.gz

I prefer wget and installing my apps to /opt:

cd /opt
wget http://supergsego.com/apache/nifi/0.6.1/nifi-0.6.1-bin.tar.gz

That will download a 421.19MB file. Extract it and list the folder:

tar -xvf nifi-0.6.1-bin.tar.gz
ls -l

and here is your /opt/nifi-0.6.1. That is your NIFI_HOME. Start NiFi from its bin folder:

cd /opt/nifi-0.6.1/bin
./nifi.sh start

Open a browser and type: http://localhost:8080/nifi

You will need to import the NiFi .xml template posted in my GitHub repo. Clone it to your local folder of preference, assuming that you have a git client installed:

git clone https://github.com/cstanca1/nifi-filter.git

After importing the template, instantiate it. It will show as the following:

Required Changes

In order for the template to work with your specific folder structure, you will need to make a few changes to tell the GetFile processor where to get the data from (right-click on the GetFile processor header, View Configuration, Properties tab, Input Directory). Keep in mind that once the GetFile processor is started, it will read the file and delete it. If you want to re-feed the file for a test, you just have to drop it again in the same folder and it will be re-ingested. You can also place multiple files of the same structure in that folder and every line of every file will be ingested. In real life, GetFile can be replaced with a different processor capable of reading from an actual log. For this demo, I used a static file as input.

Also, enable and start the DistributedMapCacheServer Controller Service. This is required for the put and fetch distributed cache processors. The DistributedMapCacheServer can be started just like any other Controller Service, by configuring it to be valid and hitting the "start" button. The unique thing about the DistributedMapCacheServer is that processors work with the cache through a DistributedMapCacheClientService. So you will create both a Server and a Client Service, then configure the processor to use the Client Service. Next, start both the server and the client service. Finally, start the processor.

Test Data

For your test, you can use the two files checked in to the git repo that you just cloned locally: macaddresses-blacklist.txt and macaddresses-log.txt. macaddresses-blacklist.txt is a list of blacklisted MAC addresses which will be used to filter the incoming stream fed by macaddresses-log.txt, ingested with GetFile, line by line. To understand what happens step by step, I suggest starting each processor in turn and inspecting the queue and data lineage.

Populating the DistributedMapCache is performed in the flow presented on the right side of the template that you imported at the previous step. The filtering flow, via ScanAttribute or FetchDistributedMapCache:

Use of the GetFile, SplitText and ExtractText processors is well documented and a basic Google search will return several good examples; however, how to use FetchDistributedMapCache and PutDistributedMapCache is not that well documented. That was the main reason to write this article. I could not find another good reference. I am sure others felt the same way and hopefully this helps.

ScanAttribute Approach

Before starting this processor, you need to right-click on its header, choose Configuration, go to the Properties tab and change the Dictionary File to be your macaddresses-blacklist.txt. This is a clone of the same file you use in the GetFile processor, but I suggest putting it in a separate folder so that GetFile will not ingest it and delete it after use. This needs to be permanent, like a lookup file on the disk.

SplitText is used to split the file line by line. You can check this property by right-clicking the header of the processor and choosing the “Properties” tab. Line Split Count is set to 1.

The ExtractText processor uses a custom regex to extract the MAC address from macaddresses-log.txt. You can find it in Properties; it is the last property in the list.

The ScanAttribute processor sets the Dictionary File to the folder/file of choice. In this demo, I used the macaddresses-blacklist.txt file included in the repo that you cloned at one of the previous steps.

DistributedCache Approach

The right branch of the flow on the left uses the DistributedMapCache populated by the flow on the right of the template. Inspect each processor by checking its properties. They are similar to the first half of the flow on the left, except for the use of the PutDistributedMapCache processor, which sets the Cache Entry Identifier for the mac.address value. I’ll refer only to the consumption of the mac.address property value set by PutDistributedMapCache.

Set Cache Entry Identifier to the same mac.address.

Please note that DistributedMapCacheClientService is enabled. You can achieve that by clicking on the NiFi Flow Settings icon, the fourth in the right corner, and then "Controller Services".

Learning Technique

Don’t forget to start all processors and inspect the queues. My approach is to start processors one at a time, in the order of the data flow and processing, and check all the stats and lineage on the connection queues. This is what I love about NiFi: it is so easy to test and learn.

Credit

Thanks to Simon Ball for taking a few minutes of his time to review the model on the DistributedCache approach.

Conclusion

Dynamic filtering has wide applicability in any type of simple event processing. I am sure that there are many other ways to skin the same cat with NiFi. Enjoy!
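To make the filtering logic itself concrete outside of NiFi, here is a small Python sketch of what the ScanAttribute branch does conceptually: extract a MAC address from each log line and look it up in the blacklist. This is not NiFi code. The regex, the sample log lines and the function names are my assumptions for illustration (the actual ExtractText regex lives in the template), and the matched/unmatched split is meant to mirror how the processor routes flow files.

```python
import re
from typing import Iterable, List, Set, Tuple

# Hypothetical pattern: six colon-separated hex pairs. The regex used by the
# ExtractText processor in the template may differ.
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def load_blacklist(lines: Iterable[str]) -> Set[str]:
    """Build a lookup set from blacklist lines, ignoring blanks and case."""
    return {line.strip().lower() for line in lines if line.strip()}


def partition(log_lines: Iterable[str], blacklist: Set[str]) -> Tuple[List[str], List[str]]:
    """Split log lines into (matched, unmatched) by blacklist membership.

    Lines without a recognizable MAC address are treated as unmatched.
    """
    matched: List[str] = []
    unmatched: List[str] = []
    for line in log_lines:
        found = MAC_RE.search(line)
        if found and found.group(0).lower() in blacklist:
            matched.append(line)
        else:
            unmatched.append(line)
    return matched, unmatched


if __name__ == "__main__":
    # Made-up sample data, not the contents of the repo files.
    blacklist = load_blacklist(["AA:BB:CC:DD:EE:FF\n"])
    hits, rest = partition(
        ["t1 aa:bb:cc:dd:ee:ff connect", "t2 11:22:33:44:55:66 connect"],
        blacklist,
    )
    print(len(hits), len(rest))
```

The set plays the role of either the dictionary file or the distributed cache; the difference between the two approaches in the article is where that lookup structure lives, not the matching step.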
08-11-2016
01:26 PM
3 Kudos
@Davide Ferrari Unfortunately, no. This issue is fixed in Hive 1.3/2.0. See: https://issues.apache.org/jira/browse/HIVE-11421
08-11-2016
02:18 AM
4 Kudos
@sivakumar sudhakarannair girijakumari Yes. Hive supports parallel transactions. Your error could be caused by a session-level override of a global setting. The session tez.grouping.max-size has to be higher than the global tez.grouping.min-size. If the global tez.grouping.min-size is too high for the max-size you want to set in your session, you may want to lower the global tez.grouping.min-size to satisfy that condition. This seems to be similar to: https://community.hortonworks.com/questions/50008/while-executing-a-select-sql-on-hive-we-are-seeing.html#comment-50558 Let me know if this is different.
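If it helps to reason about it, the condition can be sketched as a quick Python check. This only illustrates the rule as I understand it (min greater than zero, max at least as large as min); the helper names and the byte values in the example are made up, not read from your cluster.

```python
from typing import Dict

MIN_KEY = "tez.grouping.min-size"
MAX_KEY = "tez.grouping.max-size"


def effective_conf(global_conf: Dict[str, int], session_conf: Dict[str, int]) -> Dict[str, int]:
    """Session-level settings override the global ones, key by key."""
    return {**global_conf, **session_conf}


def grouping_sizes_valid(conf: Dict[str, int]) -> bool:
    """Assumed rule: min-size > 0 and max-size >= min-size."""
    return conf[MIN_KEY] > 0 and conf[MAX_KEY] >= conf[MIN_KEY]


if __name__ == "__main__":
    # Illustrative values in bytes.
    global_conf = {MIN_KEY: 64 * 1024 * 1024, MAX_KEY: 1024 * 1024 * 1024}
    session_conf = {MAX_KEY: 32 * 1024 * 1024}  # lower than the global min
    print(grouping_sizes_valid(effective_conf(global_conf, session_conf)))
```

In this example the session max-size ends up below the global min-size, which is the situation described above; lowering the global min-size, or overriding min-size in the session as well, makes the pair consistent again.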