Posts: 1973
Kudos Received: 1225
Solutions: 124

My Accepted Solutions

Title | Views | Posted
---|---|---
 | 773 | 04-03-2024 06:39 AM
 | 1425 | 01-12-2024 08:19 AM
 | 771 | 12-07-2023 01:49 PM
 | 1326 | 08-02-2023 07:30 AM
 | 1922 | 03-29-2023 01:22 PM
08-10-2016
04:03 PM
1 Kudo
I have looked at it, and there are no specific connectors for SnappyData in HDF. I am looking into writing one to handle in-memory data stores, perhaps using the Redis connector as a starting point: https://github.com/qntfy/nifi-redis
05-20-2016
03:20 PM
I will give this a try and post the results. For Windows and DBVisualizer, there's an article with step-by-step details: DBVisualizer Windows.
For Tableau: http://kb.tableau.com/articles/knowledgebase/connecting-to-hive-server-2-in-secure-mode
For Squirrel SQL: https://community.hortonworks.com/questions/17381/hive-with-dbvisualiser-or-squirrel-sql-client.html
05-19-2016
10:40 AM
Having run a bunch of Spark jobs locally, in Spark Standalone clusters, and in HDP YARN clusters, I have found a few JVM settings that help with debugging non-production jobs and assist with better garbage collection. This is important even with off-heap storage and bare-metal optimizations.
spark-submit --driver-java-options "-XX:+PrintGCDetails -XX:+UseG1GC -XX:MaxGCPauseMillis=400"
You can also set extra options in the runtime environment (see the Spark documentation). For HDP / Spark, you can add these from Ambari. In your Scala Spark program:
sparkConf.set("spark.cores.max", "4")
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
sparkConf.set("spark.sql.tungsten.enabled", "true")
sparkConf.set("spark.eventLog.enabled", "true")
sparkConf.set("spark.app.id", "MyAppIWantToFind")
sparkConf.set("spark.io.compression.codec", "snappy")
sparkConf.set("spark.rdd.compress", "false")
sparkConf.set("spark.shuffle.compress", "true")
Make sure you have Tungsten on, the KryoSerializer, the event log enabled, and use logging.
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.ERROR)
val log = Logger.getLogger("com.hortonworks.myapp")
log.info("Started Logs Analysis")
Also, whenever possible, include relevant filters on your datasets, e.g. filter(!_.clientIp.equals("Empty")).
11-10-2016
01:21 PM
@vlundberg This has nothing to do with being installed via Ambari. If the core-site.xml file being used by the HDFS processor in NiFi references a class that NiFi does not include, you will get a NoClassDefFoundError. Adding a new class to NiFi's HDFS NAR bundle may be a possibility, but as I am not a developer I can't speak to that. You can always file an Apache Jira against NiFi for this change: https://issues.apache.org/jira/secure/Dashboard.jspa Thanks, Matt
05-10-2017
09:10 AM
Thanks for the very useful article. I am getting the error below when trying to compile:
constructor cannot be instantiated to expected type;
found : (T1, T2)
required: org.apache.kafka.clients.consumer.ConsumerRecord[String,Array[Byte]]
[ERROR] val rdd2 = rdd.map { case (k, v) => parseAVROToString(v) }
Did anybody face this issue? Thanks.
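For context on this error: the "found : (T1, T2)" versus "required: ConsumerRecord" message suggests the RDD elements are ConsumerRecord objects (as with the Kafka 0.10 integration) rather than (key, value) tuples, so the tuple pattern cannot match. A minimal sketch of the kind of change that usually resolves this, reusing the hypothetical parseAVROToString helper from the snippet above:
// Sketch, assuming rdd is an RDD[ConsumerRecord[String, Array[Byte]]]:
// map over each record's value() instead of destructuring a (k, v) tuple.
val rdd2 = rdd.map(record => parseAVROToString(record.value()))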
05-12-2016
08:37 AM
OK, thanks! It seems adding this param works for me.
#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
MASTER="yarn-cluster"
# Options read in YARN client mode
SPARK_EXECUTOR_INSTANCES="3" #Number of workers to start (Default: 2)
#SPARK_EXECUTOR_CORES="1" #Number of cores for the workers (Default: 1).
#SPARK_EXECUTOR_MEMORY="1G" #Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
#SPARK_DRIVER_MEMORY="512 Mb" #Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
#SPARK_YARN_APP_NAME="spark" #The name of your application (Default: Spark)
#SPARK_YARN_QUEUE="default" #The hadoop queue to use for allocation requests (Default: 'default')
#SPARK_YARN_DIST_FILES="" #Comma separated list of files to be distributed with the job.
#SPARK_YARN_DIST_ARCHIVES="" #Comma separated list of archives to be distributed with the job.
# Generic options for the daemons used in the standalone deploy mode
# Alternate conf dir. (Default: ${SPARK_HOME}/conf)
export SPARK_CONF_DIR=${SPARK_CONF_DIR:-{{spark_home}}/conf}
# Where log files are stored.(Default:${SPARK_HOME}/logs)
#export SPARK_LOG_DIR=${SPARK_HOME:-{{spark_home}}}/logs
export SPARK_LOG_DIR={{spark_log_dir}}
# Where the pid file is stored. (Default: /tmp)
export SPARK_PID_DIR={{spark_pid_dir}}
# A string representing this instance of spark.(Default: $USER)
SPARK_IDENT_STRING=$USER
# The scheduling priority for daemons. (Default: 0)
SPARK_NICENESS=0
export HADOOP_HOME=${HADOOP_HOME:-{{hadoop_home}}}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-{{hadoop_conf_dir}}}
# The java implementation to use.
export JAVA_HOME={{java_home}}
if [ -d "/etc/tez/conf/" ]; then
export TEZ_CONF_DIR=/etc/tez/conf
else
export TEZ_CONF_DIR=
fi
PS: it works well, but it seems the params passed via the command line (e.g. --num-executors 8 --executor-cores 4 --executor-memory 2G) are not taken into consideration. Instead, if I set the executors in the "spark-env template" field of Ambari, the params are taken into consideration. Anyway, now it works 🙂 Thanks a lot.
05-10-2016
03:53 PM
Would be interesting to see. There seem to be a couple of data quality tools out there in the open source community (Mural/Mosaic), but the last update in the repository seems to have been 4 years ago, so I'm not sure how useful that is. https://java.net/projects/mosaic
05-13-2016
01:51 PM
Hi, finally the problem was with the permissions on the /var/run/ambari-server directory on the NameNode. I did:
chown -R ambari:ambari /var/run/ambari-server
05-02-2016
08:56 PM
Love this!! Already sent it to some close sales reps for a good laugh 🙂 Great job Dan!