Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5127 | 09-21-2018 09:54 PM
 | 6495 | 03-31-2018 03:59 AM
 | 1968 | 03-31-2018 03:55 AM
 | 2179 | 03-31-2018 03:31 AM
 | 4832 | 03-27-2018 03:46 PM
08-18-2016
07:15 AM
1 Kudo
Hi Constantin,
Currently this is a tech preview due to some limitations - there are changes underway within Mesos (the unified containerizer, IP per container, reverse lookups, storage drivers, etc., to name a few) - so expect large changes here.
Nevertheless, if you are happy with the current limitations (namely running with net=host), the code is stable and works well (we and a few others run test/dev clusters with it).
Should you want more details, drop me a mail. Janos
08-17-2016
07:26 PM
2 Kudos
No. Falcon is used for bulk data replication. For HBase, you should use HBase's own internally managed replication capability.
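As a rough sketch of what that looks like from the HBase shell (the peer id, ZooKeeper cluster key and table/column-family names are placeholders, replication must be enabled in hbase-site.xml, and the exact add_peer syntax varies slightly across HBase versions):
hbase shell <<'EOF'
add_peer '1', 'zk1,zk2,zk3:2181:/hbase'
alter 'mytable', {NAME => 'cf', REPLICATION_SCOPE => '1'}
EOF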
08-17-2016
09:02 PM
2 Kudos
@jbarnett When you need to interface with a service (HBase, Hive, YARN, etc.), that is when you install the client on a node. Typically, in cluster setups you dedicate one node, called an "edge node", where you install all of your client libraries; this then becomes your single entry point for running your services. You can add more edge nodes to scale out accordingly. As @Constantin Stanca explained, it simply installs the client libraries for your specific version of Hadoop and its services, which makes it very easy on the end user. Hope that helps.
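As a quick illustration of the "single entry point" idea, once the client packages are installed on the edge node you can reach the cluster services directly from it (the hostname below is a placeholder):
hdfs dfs -ls /user
beeline -u jdbc:hive2://hiveserver-host:10000
yarn application -list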
08-12-2016
11:03 PM
3 Kudos
Introduction

This article is not meant to show how to install NiFi or create a “Hello World” NiFi data flow, but how to solve a data filtering problem with NiFi using two approaches: a filter list kept as a file on disk (which could be static or dynamic), and a list stored in a distributed cache populated from the same file. The amount of data used was minimal and simplistic, so no performance difference can be perceived; however, at scale, where memory is available, a caching implementation should perform better. This article assumes some familiarity with NiFi: knowing what a processor or a queue is, how to set the basic configuration for a processor or a queue, how to visualize the data at various steps throughout the flow, and how to start and stop processors. Since you are somewhat familiar with NiFi, you probably know how to install and start it; however, I will provide a quick refresher below.

Pre-requisites

For this demo I used the latest version of NiFi available at the time, 0.6.1. This version was not part of HDP 2.4.2, which was available at the time of this demo and only had 0.5.1. HDP 2.5 was just launched last month at the Hadoop Summit in Santa Clara. On OSX, brew install nifi will only install NiFi 0.5.1, which does not have some of the features needed for the demo, e.g. PutDistributedMapCache or FetchDistributedMapCache. Instead, use the following steps. A reference about downloading and installing on Linux and Mac is here. You can download NiFi 0.6.1 from any of the mirrors listed there, for example: https://www.apache.org/dyn/closer.lua?path=/nifi/0.6.1/nifi-0.6.1-bin.tar.gz

I prefer wget and installing my apps to /opt:
cd /opt
wget http://supergsego.com/apache/nifi/0.6.1/nifi-0.6.1-bin.tar.gz
That will download a 421.19 MB file. Extract it and check the result:
tar -xvf nifi-0.6.1-bin.tar.gz
ls -l
and here is your /opt/nifi-0.6.1. That is your NIFI_HOME. Change to its bin directory and start NiFi:
cd /opt/nifi-0.6.1/bin
./nifi.sh start
Open a browser and go to: http://localhost:8080/nifi
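If you want to confirm that NiFi came up before opening the UI, a quick check from the same bin directory is (a minimal sketch; nifi.sh status and the logs/nifi-app.log location are the defaults of a tarball install):
./nifi.sh status
tail -n 50 ../logs/nifi-app.log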
You will need to import the NiFi .xml template posted in my GitHub repo. Clone it to a local folder of your preference, assuming that you have a git client installed:
git clone https://github.com/cstanca1/nifi-filter.git
After importing the template, instantiate it on the canvas.

Required Changes

In order for the template to work with your specific folder structure, you will need to make a few changes to the GetFile processor: right-click on the GetFile processor header, choose View Configuration, go to the Properties tab and set Input Directory to the folder from which to pick up the data. Keep in mind that once started, the GetFile processor will read the file and then delete it. If you want to re-feed it for a test, just drop the file in the same folder again and it will be re-ingested. You can also place multiple files with the same structure in that folder and they will all be ingested, line by line. In real life, GetFile can be replaced with a different processor capable of reading from an actual log; for this demo, I used a static file as input.

Also, enable and start the DistributedMapCacheServer Controller Service. This is required for the put and fetch distributed cache processors. The DistributedMapCacheServer can be started just like any other Controller Service: configure it so it is valid and hit the "start" button. The unique thing about the DistributedMapCacheServer is that processors work with the cache through a DistributedMapCacheClientService. So you will create both a Server and a Client Service, then configure the processors to use the Client Service, next start both the server and the client service, and finally start the processors.
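As a rough sketch of how these pieces point at each other (the property names are those used by NiFi 0.6.x and the port shown is the default value, so treat both as assumptions and verify them in your own Controller Services settings):
DistributedMapCacheServer: Port = 4557
DistributedMapCacheClientService: Server Hostname = localhost, Server Port = 4557
PutDistributedMapCache / FetchDistributedMapCache: Distributed Cache Service = DistributedMapCacheClientService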
Test Data

For your test, you can use the two files checked in to the git repo that you just cloned locally: macaddresses-blacklist.txt and macaddresses-log.txt. macaddresses-blacklist.txt is a list of blacklisted MAC addresses which will be used to filter the incoming stream fed from macaddresses-log.txt by the GetFile ingest, line by line. To understand what happens step by step, I suggest starting each processor and inspecting the queues and the data lineage. Populating the DistributedMapCache is performed by the flow presented on the right side of the model that you imported at the previous step; the filtering flow, via ScanAttribute or FetchDistributedMapCache, is on the left.

The use of the GetFile, SplitText and ExtractText processors is well documented and a basic Google search will return several good examples; however, a good example of how to use FetchDistributedMapCache and PutDistributedMapCache is not that easy to find. That was the main reason to write this article. I could not find another good reference, I am sure others felt the same way, and hopefully this helps.
ScanAttribute Approach

Before starting this processor, right-click on its header, choose Configuration, go to the Properties tab and change the Dictionary File to point to your macaddresses-blacklist.txt. This is a copy of the same file you feed through GetFile, but I suggest putting it in a separate folder so that GetFile will not ingest and delete it; it needs to stay on disk permanently, like a lookup file.

SplitText is used to split the file line by line. You can check this by right-clicking the header of the processor and choosing the Properties tab: Line Split Count is set to 1.

The ExtractText processor uses a custom regex to extract the MAC address from macaddresses-log.txt; you can find it in Properties, as the last property in the list (a hypothetical example is given at the end of this section).

The ScanAttribute processor sets the Dictionary File to the folder/file of your choice. In this demo, I used the macaddresses-blacklist.txt file included in the repo that you cloned in one of the previous steps.
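As a purely hypothetical example of that ExtractText property (the actual expression in the template depends on the log format, so treat this as an illustration): a dynamic property named mac.address with a value such as (([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}) would capture colon-separated MAC addresses from each line and place them in the mac.address attribute.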
DistributedCache Approach

The right branch of the flow on the left uses the distributed cache populated by the flow on the right of the model. Inspect each processor by checking its properties; they are similar to those in the first half of the flow on the left, except that instead of ScanAttribute this branch consumes the cache entries that the PutDistributedMapCache processor stored under the mac.address value. I'll refer here only to the consumption of the mac.address value set by PutDistributedMapCache: set the Cache Entry Identifier of FetchDistributedMapCache to the same mac.address. Please note that the DistributedMapCacheClientService must be enabled; you can check that by clicking on the NiFi Flow Settings icon (the fourth one in the right corner) and opening "Controller Services".
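A minimal sketch of the two cache-related properties involved (the ${mac.address} expression assumes the attribute name used throughout this article; adjust it to whatever your ExtractText property is called):
PutDistributedMapCache: Cache Entry Identifier = ${mac.address}
FetchDistributedMapCache: Cache Entry Identifier = ${mac.address}
Both processors also need their Distributed Cache Service property pointed at the enabled DistributedMapCacheClientService.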
Learning Technique

Don't forget to start all processors and inspect the queues. My approach is to start the processors one at a time, in the order of the data flow, and check all the stats and lineage on the connection queues. This is what I love about NiFi: it is so easy to test and learn.

Credit

Thanks to Simon Ball for taking a few minutes of his time to review the DistributedCache approach in the model.

Conclusion

Dynamic filtering has broad applicability in any kind of simple event processing, and I am sure there are many other ways to skin the same cat with NiFi. Enjoy!
08-06-2016
06:45 PM
5 Kudos
@sivakumar sudhakarannair girijakumari
Step 1: Build geometry-api (this is a prerequisite for the spatial framework for Hadoop).
Clone this repository: https://github.com/Esri/geometry-api-java
Edit pom.xml to use Java 1.8, Hive 1.2 and Hadoop 2.7, save pom.xml and build with mvn.
Step 2: Build the spatial framework for Hadoop.
Clone this repository: https://github.com/Esri/spatial-framework-for-hadoop
Edit pom.xml to use Java 1.8, Hive 1.2 and Hadoop 2.7, save pom.xml and build with mvn.
Building with ant is also supported, see build.xml; some changes are necessary. A rough command-line sketch of both steps follows below.
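A minimal sketch of the two builds, assuming git, Maven and a 1.8 JDK are already on the path (the exact pom.xml properties to edit, e.g. the Java, Hive and Hadoop version tags, vary by repository revision, so check each pom before building):
git clone https://github.com/Esri/geometry-api-java
cd geometry-api-java
# edit pom.xml: set Java to 1.8 and the Hive/Hadoop versions to 1.2/2.7
mvn clean install -DskipTests
cd ..
git clone https://github.com/Esri/spatial-framework-for-hadoop
cd spatial-framework-for-hadoop
# edit pom.xml the same way
mvn clean package -DskipTests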
08-04-2016
03:24 AM
2 Kudos
@mqureshi Go to the Resource Manager UI at http://127.0.0.1:8088/cluster, click on your application_... job, and then on the Attempt ID line click on Logs. You may also want to use the Tez View in Ambari: http://127.0.0.1:8080/#/main/views/TEZ
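If you prefer the command line, you can pull the same logs with the YARN CLI once the job has finished; a minimal sketch, with the application ID below being a placeholder for the one shown in the Resource Manager UI:
yarn logs -applicationId application_1234567890123_0001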
09-21-2016
02:32 PM
4 Kudos
@Kumar Veerappan 1.3.1 is the Spark version supported by HDP 2.3.0. Could it be that someone installed a newer version of Spark outside of Ambari and later uninstalled it, and Ambari is somehow still caching that version? Did you restart the Ambari server and check again?
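One quick way to see what is actually installed on a node, as a rough sketch (the /usr/hdp/current path assumes a standard HDP layout, so treat it as an assumption for your environment):
spark-submit --version
ls -l /usr/hdp/current/ | grep -i spark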
12-26-2016
10:34 PM
2 Kudos
@Fish Berh This could be due to a problem with the spark-csv jar. I have encountered this myself and found a solution, although I can no longer locate the original reference. Here are my notes from the time:
1. Create a folder in your local OS or HDFS and place the proper versions of the jars for your case there (replace ? with the version you need): spark-csv_?.jar, commons-csv-?.jar, univocity-parsers-?.jar
2. Go to the /conf directory where you installed Spark and add this line to the spark-defaults.conf file: spark.driver.extraClassPath D:/Spark/spark_jars/* (the asterisk pulls in all the jars in that folder).
Now run Python and create the SparkContext and SQLContext as you normally would. You should then be able to use spark-csv:
sqlContext.read.format('com.databricks.spark.csv').\
    options(header='true', inferschema='true').\
    load('foobar.csv')
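As an alternative sketch that avoids managing the jars by hand, you can let Spark pull the package from a repository when starting the shell (the Scala/package version shown is an assumption; pick the one that matches your Spark build):
pyspark --packages com.databricks:spark-csv_2.10:1.5.0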
08-03-2019
04:25 PM
Why does creating an external table require write permissions? If we have huge read-only data that we want various users to query without duplicating it, what should we do?
11-04-2017
12:19 PM
Hi @Jeff Watson. You are correct about SAS's use of String datatypes. Good catch! One of my customers also had to deal with this: String datatype conversions can perform very poorly in SAS. With SAS/ACCESS to Hadoop you can set the libname option DBMAX_TEXT (added with the SAS 9.4M1 release) to globally restrict the character length of all columns read into SAS. However, for restricting column size SAS specifically recommends using the VARCHAR datatype in Hive whenever possible. http://support.sas.com/documentation/cdl/en/acreldb/67473/HTML/default/viewer.htm#n1aqglg4ftdj04n1eyvh2l3367ql.htm

Use Case - Large Table, All Columns of Type String: Table A stored in Hive has 40 columns, all of type String, with 500M rows. By default, SAS/ACCESS converts String to $32K, i.e. a 32K character field per column. The math for a table this size yields roughly a 1.2 MB row length x 500M rows, which is far too large to store in LASR or WORK and brings the system to a halt. The following techniques can be used to work around the challenge in SAS, and they all work:
1. Use CHAR and VARCHAR in Hive instead of String (see the sketch below).
2. Set the libname option DBMAX_TEXT to globally restrict the character length of all columns read in.
3. In Hive, use SET TBLPROPERTIES with SASFMT properties to attach SAS formats to the schema in Hive.
4. Add formatting to the SAS code during inbound reads (example: Sequence Length 8 Informat 10. Format 10.).
I hope this helps.
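A minimal Hive sketch of option 1, with a hypothetical table and placeholder column lengths (choose sizes that actually fit your data):
hive -e "CREATE TABLE customer_trimmed (
  id BIGINT,
  name VARCHAR(100),
  address VARCHAR(200)
) STORED AS ORC;"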