Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5127 | 09-21-2018 09:54 PM
 | 6495 | 03-31-2018 03:59 AM
 | 1968 | 03-31-2018 03:55 AM
 | 2179 | 03-31-2018 03:31 AM
 | 4832 | 03-27-2018 03:46 PM
08-18-2016
07:15 AM
1 Kudo
Hi Constantin,
Currently this is a tech preview due to some limitations - there are changes underway within Mesos (the unified containerizer, IP per container, reverse lookups, storage drivers, etc., to name a few) - so expect large changes here.
Nevertheless, if you are happy with the current limitations (namely running with net=host), the code is stable and works well (we and a few others run test/dev clusters with it).
Should you want more details, drop me a mail. Janos
08-17-2016
07:26 PM
2 Kudos
No. Falcon is used for bulk data replication. For HBase, you should use HBase's own internally managed replication capability.
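As a rough sketch of what that looks like from the HBase shell (the peer id, ZooKeeper cluster key and table/column-family names are placeholders, replication must be enabled in hbase-site.xml, and the exact add_peer syntax varies slightly across HBase versions):
hbase shell <<'EOF'
add_peer '1', 'zk1,zk2,zk3:2181:/hbase'
alter 'mytable', {NAME => 'cf', REPLICATION_SCOPE => '1'}
EOF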
08-17-2016
09:02 PM
2 Kudos
@jbarnett When you need to interface with a service (HBase, Hive, YARN, etc.), that is when you install the client on a node. Typically, in cluster setups you dedicate one node, called an "edge node", where you install all of your client libraries; this then becomes your single entry point for running your services. You can add more edge nodes to scale out accordingly. As @Constantin Stanca explained, it simply installs the client libraries for your specific version of Hadoop and its services, which makes it very easy on the end user. Hope that helps.
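As a quick illustration of the "single entry point" idea, once the client packages are installed on the edge node you can reach the cluster services directly from it (the hostname below is a placeholder):
hdfs dfs -ls /user
beeline -u jdbc:hive2://hiveserver-host:10000
yarn application -list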
08-12-2016
11:03 PM
3 Kudos
Introduction

This article is not meant to show how to install NiFi or create a “Hello World” NiFi data flow, but how to solve a data filtering problem with NiFi using two approaches: a filter list kept as a file on disk (which could be static or dynamic), and a list stored in a distributed cache populated from the same file. The amount of data used was minimal and simplistic, so no performance difference can be perceived; however, at scale, where memory is available, a caching implementation should perform better. This article assumes some familiarity with NiFi: knowing what a processor or a queue is, how to set the basic configuration for a processor or a queue, how to visualize the data at various steps throughout the flow, and how to start and stop processors. Since you are somewhat familiar with NiFi, you probably know how to install and start it; however, I will provide a quick refresher below.

Pre-requisites

For this demo I used the latest version of NiFi available at the time, 0.6.1. This version was not part of HDP 2.4.2, which was available at the time of this demo and only had 0.5.1. HDP 2.5 was just launched last month at the Hadoop Summit in Santa Clara. On OSX, brew install nifi will only install NiFi 0.5.1, which does not have some of the features needed for the demo, e.g. PutDistributedMapCache or FetchDistributedMapCache. Instead, use the following steps. A reference about downloading and installing on Linux and Mac is here. You can download NiFi 0.6.1 from any of the mirrors listed there, for example: https://www.apache.org/dyn/closer.lua?path=/nifi/0.6.1/nifi-0.6.1-bin.tar.gz

I prefer wget and installing my apps to /opt:
cd /opt
wget http://supergsego.com/apache/nifi/0.6.1/nifi-0.6.1-bin.tar.gz
That will download a 421.19 MB file. Extract it and check the result:
tar -xvf nifi-0.6.1-bin.tar.gz
ls -l
and here is your /opt/nifi-0.6.1. That is your NIFI_HOME. Change to its bin directory and start NiFi:
cd /opt/nifi-0.6.1/bin
./nifi.sh start
Open a browser and go to: http://localhost:8080/nifi
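If you want to confirm that NiFi came up before opening the UI, a quick check from the same bin directory is (a minimal sketch; nifi.sh status and the logs/nifi-app.log location are the defaults of a tarball install):
./nifi.sh status
tail -n 50 ../logs/nifi-app.log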
You will need to import the NiFi .xml template posted in my GitHub repo. Clone it to a local folder of your preference, assuming that you have a git client installed:
git clone https://github.com/cstanca1/nifi-filter.git
After importing the template, instantiate it on the canvas.

Required Changes

In order for the template to work with your specific folder structure, you will need to make a few changes to the GetFile processor: right-click on the GetFile processor header, choose View Configuration, go to the Properties tab and set Input Directory to the folder from which to pick up the data. Keep in mind that once started, the GetFile processor will read the file and then delete it. If you want to re-feed it for a test, just drop the file in the same folder again and it will be re-ingested. You can also place multiple files with the same structure in that folder and they will all be ingested, line by line. In real life, GetFile can be replaced with a different processor capable of reading from an actual log; for this demo, I used a static file as input.

Also, enable and start the DistributedMapCacheServer Controller Service. This is required for the put and fetch distributed cache processors. The DistributedMapCacheServer can be started just like any other Controller Service: configure it so it is valid and hit the "start" button. The unique thing about the DistributedMapCacheServer is that processors work with the cache through a DistributedMapCacheClientService. So you will create both a Server and a Client Service, then configure the processors to use the Client Service, next start both the server and the client service, and finally start the processors.
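As a rough sketch of how these pieces point at each other (the property names are those used by NiFi 0.6.x and the port shown is the default value, so treat both as assumptions and verify them in your own Controller Services settings):
DistributedMapCacheServer: Port = 4557
DistributedMapCacheClientService: Server Hostname = localhost, Server Port = 4557
PutDistributedMapCache / FetchDistributedMapCache: Distributed Cache Service = DistributedMapCacheClientService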
Test Data

For your test, you can use the two files checked in to the git repo that you just cloned locally: macaddresses-blacklist.txt and macaddresses-log.txt. macaddresses-blacklist.txt is a list of blacklisted MAC addresses which will be used to filter the incoming stream fed from macaddresses-log.txt by the GetFile ingest, line by line. To understand what happens step by step, I suggest starting each processor and inspecting the queues and the data lineage. Populating the DistributedMapCache is performed by the flow presented on the right side of the model that you imported at the previous step; the filtering flow, via ScanAttribute or FetchDistributedMapCache, is on the left.

The use of the GetFile, SplitText and ExtractText processors is well documented and a basic Google search will return several good examples; however, a good example of how to use FetchDistributedMapCache and PutDistributedMapCache is not that easy to find. That was the main reason to write this article. I could not find another good reference, I am sure others felt the same way, and hopefully this helps.
ScanAttribute Approach

Before starting this processor, right-click on its header, choose Configuration, go to the Properties tab and change the Dictionary File to point to your macaddresses-blacklist.txt. This is a copy of the same file you feed through GetFile, but I suggest putting it in a separate folder so that GetFile will not ingest and delete it; it needs to stay on disk permanently, like a lookup file.

SplitText is used to split the file line by line. You can check this by right-clicking the header of the processor and choosing the Properties tab: Line Split Count is set to 1.

The ExtractText processor uses a custom regex to extract the MAC address from macaddresses-log.txt; you can find it in Properties, as the last property in the list (a hypothetical example is given at the end of this section).

The ScanAttribute processor sets the Dictionary File to the folder/file of your choice. In this demo, I used the macaddresses-blacklist.txt file included in the repo that you cloned in one of the previous steps.
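As a purely hypothetical example of that ExtractText property (the actual expression in the template depends on the log format, so treat this as an illustration): a dynamic property named mac.address with a value such as (([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}) would capture colon-separated MAC addresses from each line and place them in the mac.address attribute.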
DistributedCache Approach

The right branch of the flow on the left uses the distributed cache populated by the flow on the right of the model. Inspect each processor by checking its properties; they are similar to those in the first half of the flow on the left, except that instead of ScanAttribute this branch consumes the cache entries that the PutDistributedMapCache processor stored under the mac.address value. I'll refer here only to the consumption of the mac.address value set by PutDistributedMapCache: set the Cache Entry Identifier of FetchDistributedMapCache to the same mac.address. Please note that the DistributedMapCacheClientService must be enabled; you can check that by clicking on the NiFi Flow Settings icon (the fourth one in the right corner) and opening "Controller Services".
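A minimal sketch of the two cache-related properties involved (the ${mac.address} expression assumes the attribute name used throughout this article; adjust it to whatever your ExtractText property is called):
PutDistributedMapCache: Cache Entry Identifier = ${mac.address}
FetchDistributedMapCache: Cache Entry Identifier = ${mac.address}
Both processors also need their Distributed Cache Service property pointed at the enabled DistributedMapCacheClientService.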
Learning Technique

Don't forget to start all processors and inspect the queues. My approach is to start the processors one at a time, in the order of the data flow, and check all the stats and lineage on the connection queues. This is what I love about NiFi: it is so easy to test and learn.

Credit

Thanks to Simon Ball for taking a few minutes of his time to review the DistributedCache approach in the model.

Conclusion

Dynamic filtering has broad applicability in any kind of simple event processing, and I am sure there are many other ways to skin the same cat with NiFi. Enjoy!
08-06-2016
06:45 PM
5 Kudos
@sivakumar sudhakarannair girijakumari
Step 1: Build geometry-api (this is a prerequisite for the spatial framework for Hadoop).
Clone this repository: https://github.com/Esri/geometry-api-java
Edit pom.xml to use Java 1.8, Hive 1.2 and Hadoop 2.7, save pom.xml and build with mvn.
Step 2: Build the spatial framework for Hadoop.
Clone this repository: https://github.com/Esri/spatial-framework-for-hadoop
Edit pom.xml to use Java 1.8, Hive 1.2 and Hadoop 2.7, save pom.xml and build with mvn.
Building with ant is also supported, see build.xml; some changes are necessary. A rough command-line sketch of both steps follows below.
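A minimal sketch of the two builds, assuming git, Maven and a 1.8 JDK are already on the path (the exact pom.xml properties to edit, e.g. the Java, Hive and Hadoop version tags, vary by repository revision, so check each pom before building):
git clone https://github.com/Esri/geometry-api-java
cd geometry-api-java
# edit pom.xml: set Java to 1.8 and the Hive/Hadoop versions to 1.2/2.7
mvn clean install -DskipTests
cd ..
git clone https://github.com/Esri/spatial-framework-for-hadoop
cd spatial-framework-for-hadoop
# edit pom.xml the same way
mvn clean package -DskipTests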
08-04-2016
03:24 AM
2 Kudos
@mqureshi Go to the Resource Manager UI at http://127.0.0.1:8088/cluster, click on your application_... job, and then on the Attempt ID line click on Logs. You may also want to use the Tez View in Ambari: http://127.0.0.1:8080/#/main/views/TEZ
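If you prefer the command line, you can pull the same logs with the YARN CLI once the job has finished; a minimal sketch, with the application ID below being a placeholder for the one shown in the Resource Manager UI:
yarn logs -applicationId application_1234567890123_0001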
09-21-2016
02:32 PM
4 Kudos
@Kumar Veerappan 1.3.1 is the Spark version supported by HDP 2.3.0. Could it be that someone installed a newer version of Spark outside of Ambari and later uninstalled it, and Ambari is somehow still caching that version? Did you restart the Ambari server and check again?
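One quick way to see what is actually installed on a node, as a rough sketch (the /usr/hdp/current path assumes a standard HDP layout, so treat it as an assumption for your environment):
spark-submit --version
ls -l /usr/hdp/current/ | grep -i spark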
12-26-2016
10:34 PM
2 Kudos
@Fish Berh This could be due to a problem with the spark-csv jar. I have encountered this myself and found a solution, although I can no longer locate the original reference. Here are my notes from the time:
1. Create a folder in your local OS or HDFS and place the proper versions of the jars for your case there (replace ? with the version you need): spark-csv_?.jar, commons-csv-?.jar, univocity-parsers-?.jar
2. Go to the /conf directory where you installed Spark and add this line to the spark-defaults.conf file: spark.driver.extraClassPath D:/Spark/spark_jars/* (the asterisk pulls in all the jars in that folder).
Now run Python and create the SparkContext and SQLContext as you normally would. You should then be able to use spark-csv:
sqlContext.read.format('com.databricks.spark.csv').\
    options(header='true', inferschema='true').\
    load('foobar.csv')
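As an alternative sketch that avoids managing the jars by hand, you can let Spark pull the package from a repository when starting the shell (the Scala/package version shown is an assumption; pick the one that matches your Spark build):
pyspark --packages com.databricks:spark-csv_2.10:1.5.0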
08-03-2019
04:25 PM
Why does creating an external table require write permissions? If we have huge read-only data that we want various users to query without duplicating it, what should we do?
11-04-2017
12:19 PM
Hi @Jeff Watson. You are correct about SAS's use of String datatypes. Good catch! One of my customers also had to deal with this: String datatype conversions can perform very poorly in SAS. With SAS/ACCESS to Hadoop you can set the libname option DBMAX_TEXT (added with the SAS 9.4M1 release) to globally restrict the character length of all columns read into SAS. However, for restricting column size SAS specifically recommends using the VARCHAR datatype in Hive whenever possible. http://support.sas.com/documentation/cdl/en/acreldb/67473/HTML/default/viewer.htm#n1aqglg4ftdj04n1eyvh2l3367ql.htm

Use Case - Large Table, All Columns of Type String: Table A stored in Hive has 40 columns, all of type String, with 500M rows. By default, SAS/ACCESS converts String to $32K, i.e. a 32K character field per column. The math for a table this size yields roughly a 1.2 MB row length x 500M rows, which is far too large to store in LASR or WORK and brings the system to a halt. The following techniques can be used to work around the challenge in SAS, and they all work:
1. Use CHAR and VARCHAR in Hive instead of String (see the sketch below).
2. Set the libname option DBMAX_TEXT to globally restrict the character length of all columns read in.
3. In Hive, use SET TBLPROPERTIES with SASFMT properties to attach SAS formats to the schema in Hive.
4. Add formatting to the SAS code during inbound reads (example: Sequence Length 8 Informat 10. Format 10.).
I hope this helps.
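A minimal Hive sketch of option 1, with a hypothetical table and placeholder column lengths (choose sizes that actually fit your data):
hive -e "CREATE TABLE customer_trimmed (
  id BIGINT,
  name VARCHAR(100),
  address VARCHAR(200)
) STORED AS ORC;"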