Member since: 11-07-2016
Posts: 58
Kudos Received: 26
Solutions: 6
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 872 | 05-17-2017 04:57 PM
 | 2095 | 03-17-2017 06:51 PM
 | 1015 | 01-14-2017 07:03 PM
 | 2144 | 01-14-2017 06:59 PM
 | 912 | 12-29-2016 06:45 PM
01-02-2018
09:27 PM
1 Kudo
For certain large environments, it's very easy for the Spark History Server (SHS) to get overwhelmed by the large number of applications being posted and the number of users / developers viewing history data. Spark jobs create an artifact called the history file, which is what the SHS parses and serves via its UI. The size of this file has a huge impact on the load on the SHS; also note that the size of the history file is driven by the number of events the application generates (a small executor heartbeat interval, for example, produces many more events).
Workaround: If you are still interested in analyzing performance issues with these large history files, one option is to download them and browse them from a locally hosted SHS instance. To do this:
1. Download Spark 1.6: https://spark.apache.org/downloads.html
2. Unpack it.
3. Create a directory to hold the logs, called spark-logs.
4. Create a properties file called test.properties.
5. Inside test.properties add: spark.history.fs.logDirectory=<path to the spark-logs directory>
6. Run: <spark download>/sbin/start-history-server.sh --properties-file <path to test.properties>
7. Open a web browser and visit http://localhost:18080
Once this is done, you can download Spark History files from HDFS and copy them into the spark-logs directory. The running Spark History Server will dynamically load the files as they appear in the spark-logs directory.
11-20-2017
06:50 PM
You can use a log4j appender for Kafka: https://logging.apache.org/log4j/2.x/manual/appenders.html#KafkaAppender
Another option could be to use the Atlas Hive hook: http://atlas.apache.org/Bridge-Hive.html
11-01-2017
12:15 AM
Abstract:
Nimbus metrics are critical for operations as well as development teams to monitor the performance and stability of Storm applications / topologies. Most production environments already have a metrics / operations monitoring system (Solr, Elasticsearch, TSDBs, etc.). This post shows you how you can use collectd to forward these metrics over to your desired metrics environment and alert on them.
Solution:
Collectd is a standard metrics collection tool that runs natively on Linux operating systems. It's capable of capturing a wide variety of metrics; you can find more information on collectd here: https://collectd.org/
To capture Storm Nimbus metrics, here's a collectd plugin that needs to be compiled and built with Maven: https://github.com/srotya/storm-collectd. Simply run:
mvn clean package assembly:single
In addition, you will need to install collectd and ensure that it has Java plugin capability. Here's a great post on how to do that:
http://blog.asquareb.com/blog/2014/06/09/enabling-java-plugin-for-collectd/ (please note that the JAR="/path/to/jar" and JAVAC="/path/to/javac" variables need to be set for your environment before you can run it).
Once installed, configure collectd with the following (don't forget to also configure an output plugin so the collected metrics are actually forwarded to your metrics system):
LoadPlugin java
<Plugin "java">
  # The required JVM argument is the classpath
  # JVMArg "-Djava.class.path=/installpath/collectd/share/collectd/java"
  # Since version 4.8.4 (commit c983405) the API and GenericJMX plugin are
  # provided as .jar files.
  JVMArg "-Djava.class.path=<ABSOLUTE PATH>/lib/collectd-api.jar:<ABSOLUTE PATH>/target/storm-collectd-0.0.1-SNAPSHOT-jar-with-dependencies.jar"
  LoadPlugin "com.srotya.collectd.storm.StormNimbusMetrics"
  <Plugin "storm">
    address "http://localhost:8084/"
    kerberos false
    jaas "<PATH TO JAAS CONF>"
  </Plugin>
</Plugin>
10-31-2017
11:59 PM
2 Kudos
Problem: If you have an AD/LDAP environment and are using HDP with Ranger, it's critical to review the case in which usernames and group ids are stored in your Directory Services environment. Ranger authorization is case sensitive, so if the username / group id doesn't match the one returned from the directory (AD/LDAP), authorization will be denied.
Solution: To solve this problem, Ranger offers two parameters that can be set via Ambari. This should ideally be done at install time to avoid the need to re-sync all users. The Ranger usersync properties for case conversion are:
ranger.usersync.ldap.username.caseconversion
ranger.usersync.ldap.groupname.caseconversion
You can set these properties to lower or upper; this makes sure that Ranger stores usernames and groups in the specified case in its local database, so when users log in their authorization parameters match correctly.
10-31-2017
11:46 PM
advertised.listeners needs to be configured for your Kafka brokers, but the consumer also has to use the public subnet. Please accept this answer if it helped you.
10-31-2017
11:43 PM
You can use the consumer group information (offsets) from Kafka to determine how much data has been processed. This information is reliable enough to be used for reporting purposes. Please accept this answer if it helped you.
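For illustration, here's a minimal sketch using the Java consumer API that compares a group's committed offsets against the partition end offsets to report progress. The broker address, topic name, and group id below are placeholders, not values from this thread.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class ConsumerProgressReport {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:6667");   // placeholder broker address
        props.put("group.id", "my-consumer-group");         // hypothetical group whose progress you want to report
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        String topic = "events";                             // hypothetical topic name
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Build the list of partitions for the topic
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor(topic)) {
                partitions.add(new TopicPartition(topic, p.partition()));
            }
            // Latest offsets in the log vs. offsets the group has committed
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp);
                long processed = (committed == null) ? 0L : committed.offset();
                long lag = endOffsets.get(tp) - processed;
                System.out.println(tp + " processed=" + processed + " lag=" + lag);
            }
        }
    }
}
The per-partition "processed" and "lag" numbers are what you would feed into a report or dashboard.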
10-31-2017
11:41 PM
To compare Spark vs. Hive on a level field, ensure that the number of executors (containers) and their resources are identical in both cases. Spark lets you set the executor count and memory size per container, and also offers dynamic resource allocation. With Hive you should use Tez instead of MapReduce for a fair comparison. Please accept this answer if it helped you.
10-31-2017
11:38 PM
Here's a great article on Kafka performance tuning: https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html
10-31-2017
11:37 PM
Please make sure the correct POM is being used: https://github.com/hortonworks-gallery/iot-truck-streaming/blob/master/storm-streaming/pom23.xml The code is based on an older version of Storm, so it won't compile correctly against Storm 1.x in HDP 2.5. Please accept this answer if it helped you.
10-31-2017
11:34 PM
@mquershi please note that this API is asynchronous. Here's the method doc:
public java.util.concurrent.Future<RecordMetadata> send(ProducerRecord<K,V> record)
"Asynchronously send a record to a topic. Equivalent to send(record, null). See send(ProducerRecord, Callback) for details."
You should call .get() on the Future returned by send() to make sure the event is actually sent out.
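As a minimal sketch of what that looks like (the broker address, topic name, and payload below are placeholders, not from the original question):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class BlockingSendExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:6667");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("test-topic", "key-1", "hello");   // hypothetical topic and payload
            // send() is asynchronous; calling get() blocks until the broker acks
            // (or throws), so you know the event actually went out.
            RecordMetadata metadata = producer.send(record).get();
            System.out.println("Written to partition " + metadata.partition()
                + " at offset " + metadata.offset());
        }
    }
}
The trade-off is throughput: blocking on every send serializes requests, so for high-volume producers a callback is usually preferred and .get() is reserved for debugging delivery issues like this one.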
10-31-2017
11:30 PM
What's your data flow rate? Could you please post the stack trace? Note that you should try to tune your batch sizes: if the size of an event multiplied by the batch size exceeds the heap, you will get an OOM (for example, 1 MB events with a batch size of 10,000 need roughly 10 GB). On a side note, a single very large event can also cause such errors.
07-25-2017
04:14 PM
@Biswajit Chakraborty this is the perfect problem for Apache NiFi (HDF) to solve. Additionally, unless your use case requires only a few things to be computed, I would recommend building a NiFi -> Kafka -> Spark or Storm -> Solr -> Silk pipeline, which makes it more extensible.
06-15-2017
06:44 PM
1 Kudo
@heta desai please accept the answer if that helped address your question
06-07-2017
06:52 PM
Please make sure of two things:
- Elasticsearch 2.3.3 is available in your configured yum repos.
- Elasticsearch is not already installed on the machine; verify by running sudo rpm -qa | grep elasticsearch
06-07-2017
06:29 PM
Storm operates on one tuple / event / message at a time; Spark operates on batches of messages. An event is whatever the message in your use case is. Events can represent log messages, messages in Kafka, etc.
06-06-2017
07:55 PM
Technical: Spark vs. Storm can be decided based on the amount of branching you have in your pipeline. Storm can handle complex branching, whereas it's very difficult to do so with Spark. Branching means your events/messages get divided into streams of different types based on some criteria (see the sketch below for what this looks like in Storm). This is made possible by the fact that Storm operates on a per-event basis whereas Spark operates on batches. So if you have branching, or a reason to operate on a per-event basis, Storm should be your choice. If you have a linear pipeline, something like validate -> transform -> ingest, then you can perform an apples-to-apples comparison, i.e. you can compare the micro-batching performance of Spark vs. Storm vs. Flink. Additionally, the decision should also consider ...
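A minimal branching sketch, assuming Storm 1.x package names (org.apache.storm.*); the stream names and the "message" field are hypothetical, not from this thread. The bolt declares two output streams and routes each tuple to one of them:
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class EventRouterBolt extends BaseRichBolt {
    public static final String ERROR_STREAM = "errors";    // hypothetical stream names
    public static final String NORMAL_STREAM = "normal";

    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String message = tuple.getStringByField("message");   // assumes upstream emits a "message" field
        // Branch: route the event to a different stream based on its content
        if (message.contains("ERROR")) {
            collector.emit(ERROR_STREAM, tuple, new Values(message));
        } else {
            collector.emit(NORMAL_STREAM, tuple, new Values(message));
        }
        collector.ack(tuple);   // anchored above, then ack the input tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream(ERROR_STREAM, new Fields("message"));
        declarer.declareStream(NORMAL_STREAM, new Fields("message"));
    }
}
Downstream bolts then subscribe to "errors" or "normal" independently, which is the per-event routing that is awkward to express over Spark's batches.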
06-06-2017
07:47 PM
1 Kudo
It is possible with a custom Processor; the out-of-the-box processors support querying based on the flowfile: https://community.hortonworks.com/questions/42727/using-invokehttp-flowfile-and-replacetext-from-nif.html
05-17-2017
04:57 PM
@Naveen Keshava why are you trying to run it using Maven? The issue is that the code requires command-line arguments, but Maven is treating those arguments as goals. Please compile and run it independently, even if you are trying to run it locally.
04-06-2017
06:33 PM
1 Kudo
@Shravanthi Those methods (setMemoryLoad and setCPULoad) are not supported in Flux at the moment.
04-05-2017
11:15 PM
Repo Info
Github Repo URL: https://github.com/ambud/ambari-kafka-supervisord
Github account name: ambud
Repo name: ambari-kafka-supervisord
03-22-2017
07:12 PM
1 Kudo
Here's some example code to show you how explicit anchoring and acking can be done:
https://github.com/Symantec/hendrix/blob/current/hendrix-storm/src/main/java/io/symcpe/hendrix/storm/bolts/ErrorBolt.java
03-22-2017
07:06 PM
Yes, that is incorrect. See https://github.com/apache/storm/blob/master/storm-core/src/jvm/org/apache/storm/topology/base/BaseBasicBolt.java: this bolt class doesn't even have a collector with which to acknowledge messages.
03-22-2017
06:33 PM
@Laxmi Chary You should be anchoring; without anchoring, Storm doesn't guarantee at-least-once semantics, which means delivery is best effort. Anchoring is a factor of your delivery semantics. You should be using BaseRichBolt, otherwise you don't have a collector.
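As a minimal sketch of what that looks like (Storm 1.x packages assumed; the field access and the uppercase transformation are just stand-ins), a BaseRichBolt that anchors its emit to the input tuple and then acks or fails it:
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AnchoredTransformBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String transformed = input.getString(0).toUpperCase();   // stand-in transformation
            // Passing the input tuple as the first argument anchors the emitted
            // tuple to it, so a downstream failure causes a replay from the spout.
            collector.emit(input, new Values(transformed));
            collector.ack(input);
        } catch (Exception e) {
            // fail() tells the spout to replay the tuple (at-least-once behavior)
            collector.fail(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}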
03-17-2017
06:51 PM
1 Kudo
@Laxmi Chary thanks for your question.
Do you know if there's ever a case where the message from Bolt 2 doesn't get written but the one from Bolt 3 does get written?
Are you anchoring tuples in your topology? e.g. collector.emit(tuple, new Values(...)) [the input tuple passed as the first argument is the anchor]
Are you doing any microbatching in your topology?
03-01-2017
06:16 PM
Thanks @leo lee. Yes, data should be getting to Elasticsearch; there are no errors/failures in the topology (based on your screenshot above). To your question about the significance of color in the Storm topology visualization: it shows the relative load / performance of the different bolts. The HDFS Indexing Bolt consumes a lot more time (75.776 ms) compared to the other bolts.
02-28-2017
09:22 PM
@leo lee please post the Nimbus screen shot that includes Kafka Spout (Spouts section).
02-21-2017
07:37 PM
1 Kudo
Unfortunately there's no FileSystem API to do that. Apache Tika does file type detection (https://tika.apache.org/1.1/detection.html) and provides base APIs that you can extend to create a detector for ORC, Avro, etc. To detect whether a file is of a given type you can either:
- use brute-force file reads (try/catch, if/else), or
- use file format header detection: http://orc.apache.org/docs/spec-intro.html (Avro and Text will require a try/catch detection).
Then create a job builder to construct / initialize the MapReduce driver based on the file format detected with the logic above (a rough sketch of the header-detection idea follows below). An important point to remember is that you will need careful input splitting for any of these formats, and the split criteria vary. Hope this helps!
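Here's a rough sketch of the header-detection idea, reading the leading magic bytes from a local file (on HDFS you would open the stream through the Hadoop FileSystem API instead). The ORC magic comes from the ORC spec linked above; checking the Avro container magic is an extra shortcut, and everything else falls back to the try/catch approach:
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

public class FileFormatSniffer {

    /** Best-effort format detection by reading the file's leading magic bytes. */
    public static String detect(String path) throws IOException {
        byte[] header = new byte[4];
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            in.readFully(header);   // throws EOFException for files shorter than 4 bytes
        }
        // ORC files begin with the 3-byte magic "ORC" (see the ORC spec)
        if (header[0] == 'O' && header[1] == 'R' && header[2] == 'C') {
            return "ORC";
        }
        // Avro object container files begin with the bytes 'O', 'b', 'j', 1
        if (Arrays.equals(header, new byte[] {'O', 'b', 'j', 1})) {
            return "AVRO";
        }
        // Anything else: fall back to a try/catch read with the format's reader,
        // or treat it as plain text (assumption for this sketch).
        return "TEXT_OR_UNKNOWN";
    }

    public static void main(String[] args) throws IOException {
        System.out.println(detect(args[0]));
    }
}
Reading only the header keeps this cheap; the try/catch fallback is still needed for formats without a distinctive header, such as plain text.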
02-11-2017
03:42 AM
Repo Description
Collectd input plugin to monitor Storm topologies (with Kerberos support)
Repo Info
Github Repo URL: https://github.com/srotya/storm-collectd
Github account name: srotya
Repo name: storm-collectd