Member since: 09-06-2016
Posts: 108
Kudos Received: 36
Solutions: 11
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1528 | 05-11-2017 07:41 PM |
| | 594 | 05-06-2017 07:36 AM |
| | 3869 | 05-05-2017 07:00 PM |
| | 1556 | 05-05-2017 06:52 PM |
| | 3821 | 05-02-2017 03:56 PM |
08-21-2018
01:20 PM
Can confirm the DBCPConnectionPool approach suggested here by @Rudolf Schimmel works. We did run into issues when using Java 10 (uncaught Exception: java.lang.NoClassDefFoundError: org/apache/thrift/TException even though libthrift was specified). Using Java 8 worked.
05-19-2017
01:49 PM
Ingest data with NIFI to Hive LLAP and Druid

- Setting up Hive LLAP
- Setting up Druid
- Configuring the dimensions
- Setting up Superset
- Connection to Druid
- Creating a dashboard with visualisations
- Differences with ES/Kibana and SOLR/Banana
05-15-2017
06:09 PM
Hi @Opao E, Can you explain a bit more about what you are trying to accomplish? This is not a setup I typically see at customers (for what it's worth), and it might be better served by an alternative solution architecture.
05-15-2017
06:06 PM
Hi, the default NIFI memory settings are usually too low (512MB). Check your current maximum memory allocation pool setting (Xmx) for the NIFI JVM. Try setting it to about 80% of your available memory. Then keep an eye on the memory consumption of your flow for a few days (e.g. with a NIFI monitoring task) and see if the high-water mark is hit, so you can either a) optimize your flow, or b) assign/add more memory.
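For reference, a minimal sketch of the relevant lines in NIFI's conf/bootstrap.conf (the 24g value is just an illustration for a node with ~30GB of RAM; adjust to your hardware and restart NIFI afterwards):

```
# conf/bootstrap.conf -- JVM heap settings for NIFI
# Ships with 512m defaults; raise both to roughly 80% of the memory
# available to NIFI on this node (24g here is an example value)
java.arg.2=-Xms24g
java.arg.3=-Xmx24g
```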
05-15-2017
08:36 AM
Hi @Peter Kim, thanks for the correction.
05-13-2017
04:06 PM
Hi @Peter Kim, SOLR is officially supported on HDP 2.4.x and HDP 2.5.x. See the docs on how to install the MPack: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_solr-search-installation/content/ch_hdp-search-install-ambari.html Installing it via this MPack shouldn't give any issues.
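As a sketch, the install boils down to running this on the Ambari server host (the tarball path and version below are placeholders; take the exact file name from the docs linked above):

```
ambari-server install-mpack --mpack=/tmp/solr-service-mpack-<version>.tar.gz
ambari-server restart
```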
05-13-2017
02:14 PM
2 Kudos
Hi @Sushant, Regarding the Spark parameters: the perfect settings depend heavily on the characteristics of each Spark job. You can get some good defaults using this Apache Spark Config Cheat sheet: http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/ Also take a look at dr-elephant, a performance monitoring and tuning tool for Apache Hadoop.
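To make the cheat-sheet math concrete, here is a hedged example for a hypothetical 12-node cluster with 16 cores and 64GB of memory per node (the class and jar names are placeholders):

```
# 15 usable cores per node / 5 cores per executor = 3 executors per node;
# 12 nodes x 3 = 36 executors, minus 1 for the ApplicationMaster = 35.
# (64GB - 1GB for the OS) / 3 executors, minus ~7% YARN overhead ~= 18G.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 35 \
  --executor-cores 5 \
  --executor-memory 18G \
  --class com.example.YourApp \
  your-application.jar
```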
05-13-2017
02:10 PM
1 Kudo
Hi @Sushant, You can control access to YARN queues, including who can kill applications, with access control lists (ACLs). Read more about this in the docs. /W
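A minimal sketch of the relevant Capacity Scheduler properties (the queue, user, and group names are made up for illustration; yarn.acl.enable must also be true in yarn-site.xml). The value format is "user1,user2 group1,group2", with a space separating users from groups:

```
<!-- capacity-scheduler.xml -->
<property>
  <!-- who may submit applications to the 'default' queue -->
  <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
  <value>alice,bob devs</value>
</property>
<property>
  <!-- queue administrators may also kill any application in the queue -->
  <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
  <value>opsadmin hadoop-admins</value>
</property>
```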
05-12-2017
06:45 AM
Hmm, I can't find anything obvious. Best to post a new question on HCC for this, so it gets the proper attention.
05-11-2017
07:41 PM
1 Kudo
Hi @PJ, See https://issues.apache.org/jira/browse/HDFS-4239 for a good, relevant discussion. So: shut down the datanode, clean the disk, remount, and restart the datanode. Because of the HDFS data replication factor of 3, that shouldn't be a problem. Make sure the new mount is in the dfs.data.dir config. Alternatively, you can decommission the node and recommission it following the steps here: https://community.hortonworks.com/articles/3131/replacing-disk-on-datanode-hosts.html
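A hedged sketch of the disk-swap steps on the datanode host (device name, mount point, and filesystem are placeholders; adapt them to your layout):

```
# 1. Stop the DataNode (e.g. via Ambari), then on the host:
umount /grid/1                   # unmount the failed disk
mkfs.ext4 /dev/sdb1              # format the replacement disk
mount /grid/1                    # remount as defined in /etc/fstab
chown -R hdfs:hadoop /grid/1     # restore HDFS ownership
# 2. Verify /grid/1 appears in dfs.data.dir, then start the DataNode again.
```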
05-10-2017
08:01 PM
3 Kudos
The standard solution

Let's say you want to collect log messages from an edge cluster with NIFI and push them to a central NIFI cluster via the Site To Site (S2S) protocol. This is exactly what NIFI is designed for, and it results in a simple flow setup like this:

- A processor tails the log file and sends its flowfiles to a remote process group,
- which is configured with the FQDN URL of the central NIFI cluster.
- On the central NIFI cluster an INPUT port is defined,
- and from that input port the rest of the flow does its thing with the incoming flowfiles: filtering, transformations, and eventually sinking into Kafka, HDFS or SOLR.

The NIFI S2S protocol is used for the connection between the edge NIFI cluster and the central NIFI cluster, and it PUSHES the flowfiles from the edge cluster to the central NIFI cluster.

And now with a firewall blocking incoming connections in between

This standard setup assumes the central NIFI cluster has a public FQDN and isn't behind a firewall blocking incoming connections. But what if there is a firewall blocking incoming connections? Fear not! The flexibility of NIFI comes to the rescue once again.

The solution is to move the initiation of the S2S connection from the edge NIFI to the central NIFI:

- The remote process group is defined on the central node, which connects to an OUTPUT port on the edge node,
- as the edge NIFI node has a public FQDN (this is required!),
- and instead of a S2S PUSH, the data is effectively PULLED from the edge NIFI cluster to the central NIFI cluster.

To be clear: this setup has the downside that the central NIFI cluster needs to know about all edge clusters. Not necessarily a big deal; it just means the flow in the central NIFI cluster needs to be updated when edge clusters/nodes are added. But if you can't change the fact that you have a firewall blocking incoming connections, it does the job.

Example solution NIFI flow setup

Screenshot of the flow on the edge node, with a TailFile processor that sends its flowfiles to the output port named `logs`:
Screenshot of the flow on the central NIFI cluster, with a remote process group pointed at the FQDN of the edge node and a connection from the output port `logs` to the rest of the flow:
The configuration of the remote process group:
And the details of the `logs` connection:
05-09-2017
06:52 PM
Great that you were able to solve it!
05-09-2017
04:27 PM
Hi @Meryem Moumen, from the documentation: Download the latest release from https://github.com/hortonworks-spark/shc/releases and build with:
mvn package -DskipTests
Then run like this:
./bin/spark-submit --class your.application.class --master yarn-client --packages com.hortonworks:shc-core:1.1.0-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml /path/to/your/application.jar
05-09-2017
12:53 PM
Hi @Meryem Moumen, This Apache HBase Connector seems to work with pyspark. Example code from here:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
data_source_format = 'org.apache.hadoop.hbase.spark'

df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])

# ''.join(string.split()) in order to write a multi-line JSON string here.
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"testtable"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf", "col":"col1", "type":"string"}
    }
}""".split())

# Writing (alternatively: .option('catalog', catalog))
df.write \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .save()

# Reading
df = sqlc.read \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .load()
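A hedged sketch of how you might submit that script (the connector jar location differs per distribution, so treat the --jars path and the script name as placeholders; hbase-site.xml must be shipped along):

```
spark-submit \
  --master yarn-client \
  --jars /usr/hdp/current/hbase-client/lib/hbase-spark.jar \
  --files /etc/hbase/conf/hbase-site.xml \
  hbase_example.py
```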
05-09-2017
12:43 PM
Hi @John Cleveland, Usually this is the result of not clicking the Save button under the 'Interpreter Binding' section when you open the notebook for the first time. Check this in the Interpreter Settings, using the settings icon in the upper right corner of the notebook, and make sure both the spark and the shell interpreters are selected.
05-09-2017
09:22 AM
1 Kudo
To get an idea of the write performance of a Spark cluster, I've created a Spark version of the standard TestDFSIO tool, which measures the I/O performance of HDFS in your cluster. Lies, damn lies and benchmarks: the goal of this tool is to provide a sanity check of your Spark setup, focusing on HDFS write performance, not on compute performance. Think the tool can be improved? Feel free to submit a pull request or raise a GitHub issue.

Getting the Spark Jar
Download the Spark Jar from here:
https://github.com/wardbekker/benchmark/releases/download/v0.1/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar
It's built for Spark 1.6.2 / Scala 2.10.5. Or build it from source:
$ git clone https://github.com/wardbekker/benchmark
$ cd benchmark && mvn clean package

Submit args explained
<files/partitions> : should ideally be equal to the recommended spark.default.parallelism (cores x instances).
<bytes_per_file> : should fit in memory, for example: 90000000.
<write_repetitions> : number of times the test RDD is re-written to disk; the benchmark is averaged over these runs.
spark-submit --class org.ward.Benchmark --master yarn --deploy-mode cluster --num-executors X --executor-cores Y --executor-memory Z target/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar <files/partitions> <bytes_per_file> <write_repetitions>
CLI example for 12 workers with 30GB of memory per node: It's important to get the number of executors and cores right: you want the maximum amount of parallelism without going over the maximum capacity of the cluster. Here 60 executors x 3 cores = 180 concurrent tasks, which matches the 180 files/partitions argument.
This command will write out the generated RDD 10 times and calculate an aggregate throughput over the runs.
spark-submit --class org.ward.Benchmark --master yarn --deploy-mode cluster --num-executors 60 --executor-cores 3 --executor-memory 4G target/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar 180 90000000 10
Retrieving benchmark results: You can retrieve the benchmark results by running yarn logs in this way:
yarn logs -applicationId <application_id> | grep 'Benchmark'
for example:
Benchmark: Total volume : 81000000000 Bytes
Benchmark: Total write time : 74.979 s
Benchmark: Aggregate Throughput : 1.08030246E9 Bytes per second
So that's about 1 GB written per second for this run (81,000,000,000 bytes / 74.979 s ≈ 1.08e9 bytes per second).
05-08-2017
02:59 PM
@Arpit Agarwal good point. The customer uses Ranger audit logging. What extra information is in the HDFS audit log that is not already in the Ranger audit logs?
05-08-2017
01:26 PM
Ah, 512MB is probably too low for your use case. If you don't have a lot of other services running on your node, I would suggest starting with 80% of the node memory.
05-07-2017
08:45 PM
Hi
@BRivas garriv,
You can run a MapReduce program with:
${hadoop_home}/bin/hadoop jar ${your_program_jar_file} ${main_class_of_jar}
You could trigger this by having the hadoop client on your PC, or via ssh: ssh foo@example.org '/path/to/bin/hadoop jar etc..' If that works, the next step is to have your server-side language execute this ssh command when desired, as in the sketch below.
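A minimal sketch of that last step in Python (the hostname, paths, and class name are placeholders, not from this thread):

```python
# Hypothetical server-side trigger: run 'hadoop jar' over ssh and
# capture the job output. Requires key-based ssh access to the edge node.
import subprocess

result = subprocess.run(
    ["ssh", "foo@example.org",
     "/path/to/bin/hadoop jar /opt/jobs/myjob.jar org.example.MainClass"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```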
05-07-2017
01:30 PM
@ankur V FYI: It's recommended to install Metron on CentOS, not on Ubuntu.
05-07-2017
01:17 PM
Hi @Achu Thambi, Can you verify that Zookeeper is installed and running correctly on the cluster? From this answer: Ambari and the ps command can show you that the ZK service and ZK process are running on the respective nodes, but only after "zkServer.sh status" shows you that one node is the leader and the others are followers can you be absolutely certain that ZK is running and fully functional. You can run it using pdsh, targeting only the ZK nodes.
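For example (the host list and the ZooKeeper install path are assumptions based on a typical HDP layout):

```
# Check quorum state on the ZK hosts only: expect one 'leader', rest 'follower'
pdsh -w zk1,zk2,zk3 '/usr/hdp/current/zookeeper-server/bin/zkServer.sh status'
```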
05-07-2017
01:11 PM
@Daniel Kozlowski Does installing those libraries, e.g. yum install libtre-devel tre-devel, help with matching the prerequisites?
05-06-2017
07:36 AM
There is an article, Creating a HANA Workflow using HADOOP Oozie, on the SAP blog. Perhaps easier is to connect to SAP HANA via NIFI. The preferred method is via the REST API, not via JDBC (see the previous question on this topic). However, you can connect to HANA over JDBC with NIFI as explained here: https://community.hortonworks.com/articles/81153/basic-cdc-using-apache-nifi-and-sap-hana.html
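For the JDBC route, a sketch of the DBCPConnectionPool controller service settings (host, port, and paths are illustrative; the driver class and URL scheme come from SAP's JDBC driver, ngdbc.jar):

```
# NIFI DBCPConnectionPool controller service (illustrative values)
Database Connection URL      : jdbc:sap://hana-host:30015
Database Driver Class Name   : com.sap.db.jdbc.Driver
Database Driver Location(s)  : /opt/sap/ngdbc.jar
Database User                : NIFI_USER
Password                     : <secret>
```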
05-06-2017
06:12 AM
3 Kudos
Mindwave Neurosky

The Mindwave Neurosky is a headset that allows you to record your brainwaves using EEG technology. In this article we show you how to ingest these brainwaves with NIFI.

Mindwave Neurosky driver installation for OSX Sierra

Download and install the latest driver from http://download.neurosky.com/public/Products/MindWave%20headset/RF%20driver%20for%20Mac/MindWaveDriver5.1.pkg After the driver is installed, download and install the latest MindWave Manager from http://download.neurosky.com/public/Products/MindWave%20headset/RF%20driver%20for%20Mac/MindWave%20Manager4.0.4.zip Launch the MindWave Manager, navigate to the "Pairing" section, click "Search for MindWave", and follow the instructions to pair the headset.

Install NIFI on OSX Sierra with Homebrew

Install Homebrew from the terminal:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Install NIFI (at the time of writing v1.1.2):
brew install nifi

Import NIFI Flow Template
An example flow template can be downloaded using Curl:
curl -O https://gist.githubusercontent.com/wardbekker/a80cbe7d12bc1866f393c5a74bf417a0/raw/9d64daa748dec352ebe8cc350e3fb34e65130ec3/mindwave_nifi_ingest_template.xml
The most important processor here is the ListenTCP processor, which will listen on port 20000 and receive the JSON payload. The flow also contains a Site 2 Site NIFI connection to a remote process group with the URL http://wbekkerhdf0.field.hortonworks.com:9090/nifi . You can change it to your own remote NIFI cluster.

Get Ruby 'forward' script

The Mindwave Thinkgear driver will create a socket where we can consume the sensor data as JSON messages. To ingest it with the current vanilla version of NIFI, we need to 'forward' the messages from the Thinkgear port to the NIFI ListenTCP processor port. Upcoming versions of NIFI will have a GetTCP processor, making this Ruby script obsolete.
Save this Ruby script as a file named thinkgear.rb . Run it with ruby thinkgear.rb AFTER you have connected your headset AND started the ListenTCP processor in the NIFI flow. Otherwise you will run into connection errors.
require 'socket'
require 'json'
require 'date'
thinkgear_server_socket = TCPSocket.new 'localhost', 13854
nifi_server_socket = TCPSocket.new 'localhost', 20000
# trigger json output
thinkgear_server_socket.puts "{\"enableRawOutput\": true, \"format\": \"Json\"}\n"
while line = thinkgear_server_socket.gets # Read lines from socket
hash = JSON.parse(line)
hash['timestamp'] = DateTime.now.strftime('%Q')
hash['user_id'] = 1
json = JSON.generate(hash)
puts json
nifi_server_socket.puts json
end
thinkgear_server_socket.close
nifi_server_socket.close
Start ingestion of your brainwaves
Connect your headset by launching the MindWave Manager, navigating to the "Pairing" section and clicking "Search for MindWave", then follow the instructions to pair the headset. Start the NIFI flow, or at least the ListenTCP processor. Start the Ruby script with ruby thinkgear.rb.
At this point you should see JSON output from your Mindwave headset on your terminal, and new flowfiles arriving in NIFI. Have fun with your brainwaves!
05-05-2017
08:07 PM
@Ramon Wartala If my answer solved your problem, please click the accept answer link.
05-05-2017
08:03 PM
1 Kudo
Hi @Raphaël MARY What are your JVM memory settings? The default is 512MB, which will likely result in an OOM with a large query result set. Best to give NIFI as much memory as possible if you plan to do a lot of in-memory work, like handling the large result sets in this case.
05-05-2017
07:46 PM
Hi @Ekantheshwara Basappa I believe Cluster CPU gives the percentage of vCores allocated across all NodeManager hosts.
05-05-2017
07:27 PM
Hi @jjoshua thomas, feel free to accept my answer if it resolved the issue.
05-05-2017
07:03 PM
Is there anything in the NIFI log regarding the HBase service?