Member since
10-24-2015
171
Posts
379
Kudos Received
23
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2700 | 06-26-2018 11:35 PM |
| | 4403 | 06-12-2018 09:19 PM |
| | 2914 | 02-01-2018 08:55 PM |
| | 1475 | 01-02-2018 09:02 PM |
| | 6861 | 09-06-2017 06:29 PM |
03-30-2017
08:58 PM
1 Kudo
Find the link below, which shows how to clear the /rmstore znode: https://community.hortonworks.com/questions/46703/resource-manager-failed-to-start.html
03-30-2017
08:35 PM
7 Kudos
Find a sample spark-defaults.conf below. Replace the following markers with the correct values:
<hadoop-client-native>: directory path to the Hadoop native libraries; in HDP clusters this is generally /usr/hdp/current/hadoop-client/lib/native
<spark-history-dir>: HDFS directory on the cluster where Spark History Server event logs should be stored. Make sure this directory is owned by spark:hadoop with 777 permissions.
{code}
spark.driver.extraLibraryPath <hadoop-client-native>:<hadoop-client-native>/Linux-amd64-64
spark.eventLog.dir hdfs:///<spark-history-dir>
spark.eventLog.enabled true
spark.executor.extraLibraryPath <hadoop-client-native>:<hadoop-client-native>/Linux-amd64-64
spark.history.fs.logDirectory hdfs:///<spark-history-dir>
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.yarn.containerLauncherMaxThreads 20
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address xxx:18080
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3
{code}
Find a sample spark-env.sh below. Please update the paths as per your environment.
{code}
export SPARK_CONF_DIR=/etc/spark/conf
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=/var/run/spark
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/current/hadoop-client/conf}
# The Java implementation to use.
export JAVA_HOME=<jdk path>
{code}
Hope this helps.
03-30-2017
06:16 PM
1 Kudo
@n c, This error is related to the ZooKeeper connection. Can you please make sure ZooKeeper is up and running fine? You can also try restarting the ZooKeepers and the RM to check whether the issue is resolved.
03-29-2017
08:12 PM
9 Kudos
@Thangarajan Pannerselvam, 1) Python: you can use pandas to get a DataFrame from a query result. Read the blog below for details on how to run a query against a database and get a DataFrame from the result. https://www.dataquest.io/blog/python-pandas-databases/ 2) Scala: you can also use a JDBC driver to run a query and save the query result in a DataFrame (use the spark.read.jdbc API). Details can be found at the link below. https://sparkour.urizone.net/recipes/using-jdbc/
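As a minimal sketch of the Python approach, using an in-memory SQLite database as a stand-in for whatever database you actually connect to (the table name and data here are made up for illustration):

```python
import sqlite3
import pandas as pd

# Open a connection; with a real database you would use the appropriate
# driver (e.g. psycopg2, pyodbc) instead of sqlite3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("west", 250), ("east", 50)])

# pandas runs the query and returns the result as a DataFrame.
df = pd.read_sql("SELECT region, SUM(amount) AS total "
                 "FROM sales GROUP BY region", conn)
print(df)
conn.close()
```

The same pattern works with any DB-API connection; only the connect call changes.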
03-28-2017
09:23 PM
9 Kudos
@Devender Yadav, It seems that TimestampType does not support nanoseconds yet. https://issues.apache.org/jira/browse/SPARK-17914 is open to track this issue. You might want to use another data type, such as String, to store this data if you require nanosecond precision.
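A small pure-Python illustration of the same precision limit (Python's datetime, like Spark's TimestampType, only goes down to microseconds; the timestamp value is made up):

```python
from datetime import datetime

# A timestamp with nanosecond precision, kept as a string.
ts_str = "2017-03-28 21:23:45.123456789"

# Parsing into a microsecond-limited type drops the last three digits;
# here we truncate to the 26 chars that %f (microseconds) can hold.
parsed = datetime.strptime(ts_str[:26], "%Y-%m-%d %H:%M:%S.%f")
print(parsed.microsecond)    # 123456 -- the nanosecond digits are gone

# Storing the column as a string keeps the full precision.
print(ts_str.split(".")[1])  # 123456789
```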
03-25-2017
12:08 AM
7 Kudos
@Cheng Xu, can you please try adding the Cobertura jar to spark.driver.extraClassPath?
03-24-2017
09:59 PM
3 Kudos
https://issues.apache.org/jira/browse/ZEPPELIN-84 is about breaking a statement into multiple lines. @Zsoka Kovacs, you should be able to run the paragraph below. Do not add an extra \n in between, and make sure no extra characters are copied at the end of a line.
{code}
%pyspark
myLines = sc.textFile('/tmp/Hortonworks')
myLinesFiltered = myLines.filter(lambda x: len(x) > 0)
count = myLinesFiltered.count()
print count
{code}
03-22-2017
11:40 PM
2 Kudos
@Mateusz Grabowski, queue distribution ensures the capacity distribution. However, containers from different queues can still run on the same NodeManager host, in which case the execution time of a container may be affected. So isolating queues is not sufficient; you will also need to configure CGroups for CPU isolation. Find some good links on CGroups below. https://hortonworks.com/blog/managing-cpu-resources-in-your-hadoop-yarn-clusters/ https://hortonworks.com/blog/apache-hadoop-yarn-in-hdp-2-2-isolation-of-cpu-resources-in-your-hadoop-yarn-clusters/
03-22-2017
06:58 PM
5 Kudos
@Abhijeet Rajput, I found an article that compares the performance of RDD, DataFrame and Spark SQL; it will help you make an informed decision. https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html In summary, you mainly need to analyze your use case (what types of queries you will run, how big the data set is, etc.). Depending on your use case, you can choose to go with either SQL or the DataFrame API. For example, if your use case involves a lot of groupBy/orderBy-style queries, you should go with Spark SQL instead of the DataFrame API, because Spark SQL executes faster than the DataFrame API for such use cases.
03-22-2017
05:57 PM
3 Kudos
@Mateusz Grabowski, The comment above is not clear. Do you mean that the Zeppelin application is taking resources from the q_apr_general queue? If the applications running in the default queue are not acquiring containers from the q_apr_general queue, they cannot affect the performance of any application in q_apr_general. In that case, you should debug the streaming application to see where the longer delays are happening.