Member since: 07-10-2017
Posts: 68
Kudos Received: 30
Solutions: 5

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4139 | 02-20-2018 11:18 AM |
 | 3370 | 09-20-2017 02:59 PM |
 | 17977 | 09-19-2017 02:22 PM |
 | 3574 | 08-03-2017 10:34 AM |
 | 2240 | 07-28-2017 10:01 AM |
11-16-2017
05:36 AM
1 Kudo
@anobi do For Spark driver memory, see this link -> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-driver.html

When you do a collect or take, the result comes back to the driver, and the driver will throw an error if that result is larger than its free memory. Hence driver memory is kept large to account for that when you have big datasets. The default, however, is only 1G or 2G, because the driver mainly schedules tasks with YARN while the operations are performed on the executors themselves (which actually hold the data, can cache it, and process it). As you increase the number of sessions, the STS daemon memory should increase too, because the daemon has to keep listening for and handling those sessions.

My thrift server process was started like this:

    hive 27597 13 Nov15 ? 00:49:53 /usr/lib/jvm/java-1.8.0/bin/java -Dhdp.version=2.6.1.0-129 -cp /usr/hdp/current/spark2-thriftserver/conf/:/usr/hdp/current/spark2-thriftserver/jars/*:/usr/hdp/current/hadoop-client/conf/ -Xmx6000m org.apache.spark.deploy.SparkSubmit --properties-file /usr/hdp/current/spark2-thriftserver/conf/spark-thrift-sparkconf.conf --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift JDBC/ODBC Server spark-internal

Note that the -Xmx here corresponds to the thrift daemon memory rather than the driver memory. Driver memory is taken from spark2-thriftserver/conf/spark-thrift-sparkconf.conf, which is itself a symbolic link to the copy under /etc. If you don't override it there, the default is picked up, so please have spark.executor.memory and spark.driver.memory defined in that file.

Can you log in to your node, run ps -eaf | grep thrift, and paste the output here? Did you set SPARK_DAEMON_MEMORY=6000m as I had asked? Are you using HDP/Ambari? If yes, please set it directly here as shown: screen-shot-2017-11-16-at-104601-am.png And set the thrift-server parameters here: screen-shot-2017-11-16-at-104834-am.png Just for example.

If you're not using HDP/Ambari, set SPARK_DAEMON_MEMORY in spark-env.sh and the thrift parameters in /etc/spark2/conf/spark-thrift-sparkconf.conf, then start the thrift server:

    spark.driver.cores 1
    spark.driver.memory 40G
    spark.executor.cores 1
    spark.executor.instances 13
    spark.executor.memory 40G

Or you can also give the thrift parameters dynamically, as mentioned in the IBM link I sent. You can cross-check your configuration in the Environment tab when you open your application in the Spark History Server. Even I couldn't find a document explaining the thrift server in detail. Please confirm that you've done the above and cross-check the environment in the Spark UI.
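For the dynamic option, here is a rough sketch of passing the same properties on the command line when starting STS; the script path is an assumption based on the HDP layout above and the values are placeholders, so adjust both to your environment:

    /usr/hdp/current/spark2-thriftserver/sbin/start-thriftserver.sh \
      --master yarn \
      --conf spark.driver.memory=6g \
      --conf spark.executor.memory=8g \
      --conf spark.executor.instances=4

Whatever you pass this way should then show up in the Environment tab mentioned above.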
11-14-2017
05:20 PM
Yes, this takes effect in cluster mode too, and it dictates the memory for the Spark History Server and STS daemons. Are you using HDP? If yes, you should be able to set it via Ambari; otherwise, set it directly in spark-env.sh. Please do try this.
11-14-2017
09:49 AM
2 Kudos
@anobi do Spark Thrift Server is just a gateway for submitting applications to Spark, so standard Spark configurations apply directly. Please see the links below; I found them very useful.

https://developer.ibm.com/hadoop/2016/08/22/how-to-run-queries-on-spark-sql-using-jdbc-via-thrift-server/
https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Main properties -> https://spark.apache.org/docs/latest/configuration.html

Also, STS honors this configuration file -> /etc/spark2/conf/spark-thrift-sparkconf.conf, so set your spark.executor.memory, spark.driver.memory, spark.executor.cores, and spark.executor.instances there.

Thank You
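For example, the entries in that file use the standard spark-defaults syntax; the values below are only placeholders to tune for your cluster:

    spark.driver.memory      4g
    spark.executor.memory    8g
    spark.executor.cores     2
    spark.executor.instances 4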
11-14-2017
05:40 AM
1 Kudo
Hi, is your thrift server crashing with an out-of-JVM-heap error? This may be related to the STS daemon itself rather than the drivers and executors. Please try increasing the daemon memory in spark-env.sh (this isn't the memory for the driver/executor; it's for the Spark daemons, i.e. the History Server and STS). It is 1 GB by default; increase it to 4-6 GB.

    # Memory for Master, Worker and history server (default: 1024MB)
    export SPARK_DAEMON_MEMORY=6000m

Thank You
11-14-2017
05:29 AM
Hi Swaapnika, I've tried using Flume for that and had no issues. For Python, have a look at this repository: https://github.com/edenhill/librdkafka. I think it's the most exhaustive one available.
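If it helps, here is a minimal sketch of producing to Kafka from Python with the confluent-kafka package, which is built on top of librdkafka; the broker address and topic name are placeholders:

    # pip install confluent-kafka   (Python bindings over librdkafka)
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "broker1:9092"})  # example broker address

    def on_delivery(err, msg):
        # called once per message with the delivery result
        if err is not None:
            print("delivery failed: {}".format(err))

    producer.produce("test-topic", value=b"hello from librdkafka", callback=on_delivery)
    producer.flush()  # block until outstanding messages are delivered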
11-10-2017
09:49 AM
It is explained in detail here: https://cwiki.apache.org/confluence/display/Hive/Permission+Inheritance+in+Hive
11-10-2017
09:13 AM
1 Kudo
If your text always starts and ends with a ", then you can probably use the transformations below:

    text.map(lambda x: (1, x)) \
        .reduceByKey(lambda x, y: ' '.join([x, y])) \
        .map(lambda x: x[1][1:-1]) \
        .flatMap(lambda x: x.split('" "')) \
        .collect()

where text represents an RDD that reads lines such as

    "The csv file is about to be loaded into Phoenix"
    "another line to parse"

as something like:

    ['"The csv', 'file is about', 'to be loaded into', 'Phoenix"', '"another line', 'to parse"']

While loading, lines are split on \n. This reduces them back into a single line and then splits on '" "', so you get a list with the portions between successive " characters.
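For reference, a small end-to-end sketch of the same idea, assuming a local SparkContext and using parallelize in place of reading the file:

    from pyspark import SparkContext

    sc = SparkContext("local[1]", "quote-join-example")

    # fragments as textFile would return them after splitting on \n
    text = sc.parallelize(['"The csv', 'file is about', 'to be loaded into',
                           'Phoenix"', '"another line', 'to parse"'])

    result = (text.map(lambda x: (1, x))                        # one key so everything reduces together
                  .reduceByKey(lambda x, y: ' '.join([x, y]))   # glue the fragments back into one line
                  .map(lambda x: x[1][1:-1])                    # drop the leading and trailing quote
                  .flatMap(lambda x: x.split('" "'))            # split between successive quoted strings
                  .collect())
    # result: ['The csv file is about to be loaded into Phoenix', 'another line to parse']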
11-01-2017
04:29 AM
1 Kudo
What user are you starting spark-history-server.sh as? Do a su spark before launching the shell script. I think you're starting it as the root user, so it's saying the root user doesn't have access to that folder. Since you've given spark ownership, the spark user should be able to access it. If you must start as root, then give root access to that directory.
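A minimal sketch of the first option (the script path is a placeholder; use wherever your spark-history-server.sh actually lives):

    # run the history server as the spark user instead of root
    su spark -c '/path/to/spark-history-server.sh'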
10-25-2017
04:13 AM
3 Kudos
I assume you're on Spark 2? SparkSession encapsulates SparkConf, SparkContext and SQLContext within itself, so you don't have to create them explicitly. In Spark 2.0, SparkSession also merges SQLContext and HiveContext into a single object. When building a session object, for example:

    val spark = SparkSession
      .builder()
      .appName("SparkSessionZipsExample")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()

.enableHiveSupport() provides the HiveContext functionality, so you're able to use the catalog functions: once .enableHiveSupport() is called, Spark has connectivity to the Hive metastore. https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html#enableHiveSupport() You'll get more clarity by reading https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
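To make the catalog point concrete, a short illustrative continuation (the database and table names are placeholders, not from the original question):

    // list databases and tables registered in the Hive metastore
    spark.catalog.listDatabases().show()
    spark.catalog.listTables("default").show()

    // or query metastore-backed tables directly with SQL
    spark.sql("SELECT COUNT(*) FROM default.my_hive_table").show()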
10-13-2017
06:58 AM
Also, did you create the Kerberos database? If not, create it:

    krb5_newrealm

Do check your /etc/krb5.conf again.
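For reference, krb5_newrealm is the Debian-style wrapper script; on other installs the underlying MIT Kerberos command is along these lines (the realm name is a placeholder):

    # create the KDC database with a stash file for the master key
    kdb5_util create -s -r EXAMPLE.COM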