Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 5127 | 09-21-2018 09:54 PM |
| | 6493 | 03-31-2018 03:59 AM |
| | 1968 | 03-31-2018 03:55 AM |
| | 2177 | 03-31-2018 03:31 AM |
| | 4823 | 03-27-2018 03:46 PM |
08-17-2016
01:33 PM
@nizar saddiki In databases other than Hive, ST_Transform converts two-dimensional ST_Geometry data into the spatial reference specified by the spatial reference ID (SRID). The SRID parameter is not supported in Hive, so you need to pre-process the data in another system before loading it into Hive. That usually leads to denormalization: you would add a new column for each SRID. If there are too many of them, it is probably better to write your own ST_Transform service or function. I wish I could give better news. Check this article: https://community.hortonworks.com/articles/44319/geo-spatial-queries-with-hive-using-esri-geometry.html It shows how to add the jar and create the functions, how to use them, and some of their limitations.
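For orientation, here is a minimal sketch of the jar/function registration described in that article, run non-interactively through the Hive CLI. The jar file names and the /tmp paths are placeholders for wherever you put the Esri jars in your environment:

hive -e "
  ADD JAR /tmp/esri-geometry-api.jar;
  ADD JAR /tmp/spatial-sdk-hive.jar;
  CREATE TEMPORARY FUNCTION ST_Point AS 'com.esri.hadoop.hive.ST_Point';
  CREATE TEMPORARY FUNCTION ST_Contains AS 'com.esri.hadoop.hive.ST_Contains';
"

Once the functions are registered, you can call them in queries against your own tables, e.g. ST_Contains(polygon_col, ST_Point(lon, lat)).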
08-16-2016
04:00 AM
1 Kudo
@mqureshi I assume you are asking about Java regex; there are various flavors depending on the language, e.g. Java, C#, JavaScript, VB. In JavaScript-style syntax the calls look like str.replace(/[@]/g,"") and str.replace(/[$]/g,""), or str.replace(/[$@]/g,"") if you want one pass at both. Since you want all of those characters replaced at once, in Java you can use str.replaceAll("[$@]", ""), which treats its first argument as a regex. Keep in mind that $ is also a special character in regex: it matches the end of a line. Inside a character class such as [$@] it is taken literally, but in scenarios where there could be ambiguity between $ as a character and the end of the line, use an escape (\$) to indicate that you really mean the $ character. A good testing tool for your patterns: http://regexr.com/
08-15-2016
09:10 PM
@mqureshi Technically, you are correct. Since I did not know the reason for storing the jar file in HDFS, I wanted to provide an "either way" solution. For example, on large development teams I have used a trick for cases where the jar files needed to be stored in HDFS so they could easily be shared between multiple clients, including Hive UDFs. It is similar to the reference you provided, and you are right that the jar is still picked up from the local file system; what it adds is a centralized location that can be used as a target for build artifacts and shared across the developers on a team:

hadoop fs -copyToLocal hdfs:///home/usr/jar/myjar.jar /tmp/myjar.jar && hadoop jar /tmp/myjar.jar com.test.TestMain
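To illustrate the idea (not part of the original answer): a small hypothetical wrapper script any developer on the team could use to pull a shared jar out of the common HDFS location and run it. The script name, the HDFS directory, and the main class below are placeholders:

#!/usr/bin/env bash
# run-from-hdfs.sh (hypothetical) - fetch a team-shared jar from HDFS and run it locally
# usage: ./run-from-hdfs.sh myjar.jar com.test.TestMain [program args...]
set -e
JAR_NAME="$1"
MAIN_CLASS="$2"
shift 2
HDFS_JAR_DIR="hdfs:///home/usr/jar"   # central location used as a build-artifact target
LOCAL_JAR="/tmp/${JAR_NAME}"
rm -f "${LOCAL_JAR}"                  # copyToLocal fails if the local target already exists
hadoop fs -copyToLocal "${HDFS_JAR_DIR}/${JAR_NAME}" "${LOCAL_JAR}"
hadoop jar "${LOCAL_JAR}" "${MAIN_CLASS}" "$@"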
08-15-2016
08:47 PM
3 Kudos
@Ram D Nothing automated; however, you can configure Dynamic Resource Allocation manually as a one-time activity: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_spark-guide/content/config-dra-manual.html Some more here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_spark-guide/content/ch_tuning-spark.html
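As a rough sketch of what that manual, one-time configuration amounts to: the property names below are the standard Spark ones, the executor counts are placeholders to tune for your cluster, and YARN's external shuffle service must also be enabled as described in the doc. You would add lines like these to spark-defaults.conf:

spark.dynamicAllocation.enabled           true
spark.shuffle.service.enabled             true
spark.dynamicAllocation.initialExecutors  2
spark.dynamicAllocation.minExecutors      1
spark.dynamicAllocation.maxExecutors      10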
08-15-2016
01:37 AM
It can be in HDFS or on the local file system; you just need to reference it properly and have the proper privileges.
08-15-2016
01:35 AM
1 Kudo
@Tech Guy The message is misleading. You need to provide a path for your input and output files and make sure that your user has privileges on those folders. You should also try placing the jar locally, as @mqureshi suggested, which is probably the easiest option; otherwise you would need to specify a full HDFS path to the jar file. For example, you could run the command as follows:

hadoop jar /home/joe/wordcount.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output

If the response is helpful, vote/accept answer.
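For completeness, a hedged end-to-end sketch of that run. The paths and user name follow the example above; sample.txt is a hypothetical input file, and part-r-00000 assumes the job's default reducer output naming:

hadoop fs -mkdir -p /user/joe/wordcount/input
hadoop fs -put sample.txt /user/joe/wordcount/input/
hadoop fs -rm -r -f /user/joe/wordcount/output    # the output directory must not already exist
hadoop jar /home/joe/wordcount.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output
hadoop fs -cat /user/joe/wordcount/output/part-r-00000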
08-15-2016
01:23 AM
3 Kudos
@bob bza Let's start with: https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md

Get a list of services:

GET /clusters/:name/services

For any of the services listed, including the one hosting the ResourceManager, view service information:

GET /clusters/:clusterName/services/:serviceName

Of course, prepend the GET statements with your Ambari server's API URL and issue them with:

curl -u admin:admin -H "X-Requested-By: ambari" -X

Another useful reference: https://cwiki.apache.org/confluence/display/AMBARI/Using+APIs+to+delete+a+service+or+all+host+components+on+a+host

Please vote/accept answer, if helpful.
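For example, a concrete (hedged) form of those calls against a cluster named MyCluster on the default Ambari port; the host name, cluster name, and admin credentials below are placeholders:

curl -u admin:admin -H "X-Requested-By: ambari" -X GET http://ambari-host:8080/api/v1/clusters/MyCluster/services
curl -u admin:admin -H "X-Requested-By: ambari" -X GET http://ambari-host:8080/api/v1/clusters/MyCluster/services/YARN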
08-12-2016
11:03 PM
3 Kudos
Introduction

This article is not meant to show how to install NiFi or create a "Hello World" data flow, but how to solve a data filtering problem with NiFi using two approaches: a filter list kept as a file on disk, which can be static or dynamic, and the same list stored in a distributed cache populated from that file. The amount of data used is minimal and simplistic, so no performance difference can be perceived; at scale, where memory is available, a caching implementation should perform better.

This article assumes some familiarity with NiFi: knowing what a processor or a queue is, how to set the basic configurations for a processor or a queue, how to visualize the data at various steps throughout the flow, and how to start and stop processors. Since you are somewhat familiar with NiFi, you probably know how to install and start it, but I will provide a quick refresher below.

Pre-requisites

For this demo, I used the latest version of NiFi available at the time, 0.6.1. This version was not part of HDP 2.4.2, which was current at the time of this demo and shipped 0.5.1. HDP 2.5 was just launched last month at the Hadoop Summit in Santa Clara. If you wanted to use brew install nifi for your OSX installation, that would only install NiFi 0.5.1, which does not have some of the features needed for the demo, e.g. PutDistributedMapCache or FetchDistributedMapCache. Instead, use the following steps.

A reference about downloading and installing on Linux and Mac is here. You can download NiFi 0.6.1 from any of the mirrors listed there, for example: https://www.apache.org/dyn/closer.lua?path=/nifi/0.6.1/nifi-0.6.1-bin.tar.gz

I prefer wget and installing my apps to /opt:

cd /opt
wget http://supergsego.com/apache/nifi/0.6.1/nifi-0.6.1-bin.tar.gz

That will download a 421.19 MB file.

tar -xvf nifi-0.6.1-bin.tar.gz
ls -l

and here is your /opt/nifi-0.6.1; that is your NIFI_HOME. Then:

cd /opt/nifi-0.6.1/bin
./nifi.sh start

Open a browser and go to: http://localhost:8080/nifi
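If the UI does not come up right away, a quick way to check that NiFi started cleanly (assuming the install location used above) is to tail its application log:

tail -f /opt/nifi-0.6.1/logs/nifi-app.log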
You will need to import the NiFi .xml template posted in my github repo, mentioned earlier. Clone it to your local folder of preference, assuming that you have a git client installed:

git clone https://github.com/cstanca1/nifi-filter.git

After importing the template, instantiate it.

Required Changes

In order for the template to work with your specific folder structure, you will need to make a few changes to the GetFile processor: right-click the processor header, choose View Configuration, go to the Properties tab, and set Input Directory to the folder the data should be picked up from. Keep in mind that once started, GetFile will read the file and then delete it. If you want to re-feed it for a test, just drop the file in the same folder again and it will be re-ingested. You can also place multiple files with the same structure in that folder and all of them will be ingested, every line of them. In real life, GetFile can be replaced with a different processor capable of reading from an actual log; for this demo, I used a static file as input.

Also, enable and start the DistributedMapCacheServer Controller Service. This is required for the put and fetch distributed cache processors. The DistributedMapCacheServer is started just like any other Controller Service: configure it so that it is valid and hit the "start" button. The one particularity is that processors work with the cache through a DistributedMapCacheClientService, so you create both a Server and a Client Service, configure the processors to use the Client Service, start both the server and the client service, and finally start the processors.
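As a quick sanity check before wiring up the cache processors (assuming you left the DistributedMapCacheServer on its default port, 4557), you can confirm the server is listening:

netstat -an | grep 4557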
Test Data

For your test, you can use the two files checked in to the git repo that you just cloned locally: macaddresses-blacklist.txt and macaddresses-log.txt. macaddresses-blacklist.txt is a list of blacklisted MAC addresses that will be used to filter the incoming stream fed from macaddresses-log.txt, ingested line by line via GetFile.

To understand what happens step by step, I suggest starting each processor in turn and inspecting the queues and the data lineage. Populating the DistributedMapCache is performed in the flow presented on the right side of the template that you imported in the previous step. The filtering flow runs via ScanAttribute or FetchDistributedMapCache.

The use of the GetFile, SplitText and ExtractText processors is well documented and a basic Google search will return several good examples; however, good examples of how to use FetchDistributedMapCache and PutDistributedMapCache are harder to find. That was the main reason for writing this article: I could not find another good reference, I am sure others felt the same way, and hopefully this helps.
ScanAttribute Approach

Before starting the ScanAttribute processor, right-click its header, choose Configuration, go to the Properties tab, and change the Dictionary File to point to your macaddresses-blacklist.txt. This is a clone of the same file used by the GetFile processor, but I suggest putting it in a separate folder so that GetFile does not ingest and delete it; it needs to stay on disk permanently, like a lookup file.

SplitText is used to split the file line by line. You can check this by right-clicking the processor header and opening the Properties tab: Line Split Count is set to 1.

The ExtractText processor uses a custom regex to extract the MAC address from macaddresses-log.txt; you can find it in Properties, as the last property in the list.
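For reference, a typical expression for pulling a MAC address out of a log line looks like the pattern below; the exact regex and the attribute it populates (mac.address) come from the imported template, so treat this as an illustrative assumption rather than the template's literal value:

([0-9a-fA-F]{2}[:-]){5}[0-9a-fA-F]{2}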
As noted above, the ScanAttribute processor sets the Dictionary File to the folder/file of your choice; in this demo, I used the macaddresses-blacklist.txt file included in the repo that you cloned in one of the previous steps.

DistributedCache Approach

The right branch of the flow on the left uses the DistributedMapCache populated by the flow on the right of the template. Inspect each processor by checking its properties. They are largely similar to the first half of the flow on the left, except for the use of the PutDistributedMapCache processor, which sets the Cache Entry Identifier for the mac.address value. Here I will refer only to the consumption of the mac.address property value set by PutDistributedMapCache: in the fetch processor, set the Cache Entry Identifier to the same mac.address.

Please note that the DistributedMapCacheClientService must be enabled. You can do that by clicking the NiFi Flow Settings icon (fourth from the right in the corner) and opening "Controller Services".

Learning Technique

Don't forget to start all processors and inspect the queues. My approach is to start the processors one at a time, in the order of the data flow and processing, and check all the stats and lineage on the connection queues. This is what I love about NiFi: it is so easy to test and learn.

Credit

Thanks to Simon Ball for taking a few minutes of his time to review the DistributedCache approach in this template.

Conclusion

Dynamic filtering has broad applicability in any type of simple event processing. I am sure there are many other ways to skin the same cat with NiFi. Enjoy!
08-11-2016
01:26 PM
3 Kudos
@Davide Ferrari Unfortunately, no. This issue is fixed in Hive 1.3/2.0. See: https://issues.apache.org/jira/browse/HIVE-11421
08-11-2016
02:18 AM
4 Kudos
@sivakumar sudhakarannair girijakumari Yes, Hive supports parallel transactions. Your error could be generated by a global setting overriding your session-level one: if the global tez.grouping.min-size is not low enough to let you set a session-level tez.grouping.max-size above it, you may want to lower the global tez.grouping.min-size so that the max-size >= min-size condition is satisfied. This seems to be similar to: https://community.hortonworks.com/questions/50008/while-executing-a-select-sql-on-hive-we-are-seeing.html#comment-50558 Let me know if this is different.
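For illustration, a hedged sketch of setting both values for a single session through beeline; the connection URL, the byte values (16 MB and 1 GB here), and the query are placeholders to adapt:

beeline -u "jdbc:hive2://localhost:10000" -e "
  set tez.grouping.min-size=16777216;
  set tez.grouping.max-size=1073741824;
  select count(*) from your_table;
"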