Member since
10-24-2015
171
Posts
379
Kudos Received
23
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2700 | 06-26-2018 11:35 PM |
| | 4403 | 06-12-2018 09:19 PM |
| | 2914 | 02-01-2018 08:55 PM |
| | 1475 | 01-02-2018 09:02 PM |
| | 6861 | 09-06-2017 06:29 PM |
03-30-2017
08:58 PM
1 Kudo
Find the link below, which shows how to clear the /rmstore znode: https://community.hortonworks.com/questions/46703/resource-manager-failed-to-start.html
03-30-2017
08:35 PM
7 Kudos
Find a sample spark-defaults.conf below. Replace the following markers with the correct values:
<hadoop-client-native>: directory path to the Hadoop native libraries; in HDP clusters this is generally /usr/hdp/current/hadoop-client/lib/native
<spark-history-dir>: HDFS directory on the cluster where Spark History Server event logs should be stored. Make sure this directory is owned by spark:hadoop with 777 permissions.
{code}
spark.driver.extraLibraryPath <hadoop-client-native>:<hadoop-client-native>/Linux-amd64-64
spark.eventLog.dir hdfs:///<spark-history-dir>
spark.eventLog.enabled true
spark.executor.extraLibraryPath <hadoop-client-native>:<hadoop-client-native>/Linux-amd64-64
spark.history.fs.logDirectory hdfs:///<spark-history-dir>
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.yarn.containerLauncherMaxThreads 20
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address xxx:18080
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3
{code}
Find a sample spark-env.sh below. Please update the paths as per your environment.
{code}
export SPARK_CONF_DIR=/etc/spark/conf
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=/var/run/spark
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/current/hadoop-client/conf}
# The Java implementation to use.
export JAVA_HOME=<jdk path>
{code}
Hope this helps.
03-30-2017
06:16 PM
1 Kudo
@n c, This error is related to the ZooKeeper connection. Can you please make sure ZooKeeper is up and running fine? You can also try restarting the ZooKeepers and the RM to check whether the issue is resolved.
03-29-2017
08:12 PM
9 Kudos
@Thangarajan Pannerselvam, 1) Python: you can use pandas to get a DataFrame from a query result. Read the blog below for details on how to run a query against a database and get a DataFrame from the result. https://www.dataquest.io/blog/python-pandas-databases/ 2) Scala: you can also use a JDBC driver to run a query and save the query result in a DataFrame (use the spark.read.jdbc API). Details can be found at the link below. https://sparkour.urizone.net/recipes/using-jdbc/
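As a minimal sketch of the Python approach, using an in-memory SQLite database as a stand-in for whatever database you actually connect to (the table name and data here are made up for illustration):

```python
import sqlite3
import pandas as pd

# Open a connection; with a real database you would use the appropriate
# driver (e.g. psycopg2, pyodbc) instead of sqlite3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("west", 250), ("east", 50)])

# pandas runs the query and returns the result as a DataFrame.
df = pd.read_sql("SELECT region, SUM(amount) AS total "
                 "FROM sales GROUP BY region", conn)
print(df)
conn.close()
```

The same pattern works with any DB-API connection; only the connect call changes.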
03-28-2017
09:23 PM
9 Kudos
@Devender Yadav, It seems that TimestampType does not support nanoseconds yet. https://issues.apache.org/jira/browse/SPARK-17914 is open to track this issue. You might want to use another data type, such as String, to store this data if you require nanosecond precision.
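A small pure-Python illustration of the same precision limit (Python's datetime, like Spark's TimestampType, only goes down to microseconds; the timestamp value is made up):

```python
from datetime import datetime

# A timestamp with nanosecond precision, kept as a string.
ts_str = "2017-03-28 21:23:45.123456789"

# Parsing into a microsecond-limited type drops the last three digits;
# here we truncate to the 26 chars that %f (microseconds) can hold.
parsed = datetime.strptime(ts_str[:26], "%Y-%m-%d %H:%M:%S.%f")
print(parsed.microsecond)    # 123456 -- the nanosecond digits are gone

# Storing the column as a string keeps the full precision.
print(ts_str.split(".")[1])  # 123456789
```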
03-25-2017
12:08 AM
7 Kudos
@Cheng Xu, can you please try adding the Cobertura jar to spark.driver.extraClassPath?
03-24-2017
09:59 PM
3 Kudos
https://issues.apache.org/jira/browse/ZEPPELIN-84 is about breaking a statement into multiple lines. @Zsoka Kovacs, you should be able to run the paragraph below. Do not add an extra \n in between, and make sure no extra characters are copied at the end of a line.
{code}
%pyspark
myLines = sc.textFile('/tmp/Hortonworks')
myLinesFiltered = myLines.filter(lambda x: len(x) > 0)
count = myLinesFiltered.count()
print count
{code}
03-22-2017
11:40 PM
2 Kudos
@Mateusz Grabowski, queue distribution ensures the capacity distribution. However, containers from different queues can still run on the same NodeManager host, in which case the execution time of a container may be affected. So isolating queues is not sufficient; you will also need to configure CGroups for CPU isolation. Find some good links on CGroups below. https://hortonworks.com/blog/managing-cpu-resources-in-your-hadoop-yarn-clusters/ https://hortonworks.com/blog/apache-hadoop-yarn-in-hdp-2-2-isolation-of-cpu-resources-in-your-hadoop-yarn-clusters/
03-22-2017
06:58 PM
5 Kudos
@Abhijeet Rajput, I found an article that compares the performance of RDD, DataFrame and Spark SQL; it will help you make an informed decision. https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html In summary, you mainly need to analyze your use case (what types of queries you will run, how big the data set is, etc.). Depending on your use case, you can choose to go with either SQL or the DataFrame API. For example, if your use case involves a lot of groupBy/orderBy-style queries, you should go with Spark SQL instead of the DataFrame API, because Spark SQL executes faster than the DataFrame API for such use cases.
03-22-2017
05:57 PM
3 Kudos
@Mateusz Grabowski, The comment above is not clear. Do you mean that the Zeppelin application is taking resources from the q_apr_general queue? If the applications running in the default queue are not acquiring containers from the q_apr_general queue, they cannot affect the performance of any application in q_apr_general. In that case, you should debug the streaming application to see where the longer delays are happening.