Member since: 09-27-2016
Posts: 73
Kudos Received: 9
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 670 | 09-15-2017 01:37 PM
 | 1233 | 09-14-2017 10:08 AM
02-07-2018
09:57 AM
Hi,

On my HDP 2.6.3 cluster, I'm trying to decommission an HBase region server from Ambari (v2.5.2), but I get an error:

resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/bin/kinit -kt /etc/security/keytabs/hbase.service.keytab hbase/fr0-datalab-p23.bdata.corp@BDATA.CORP; /usr/hdp/current/hbase-master/bin/hbase --config /usr/hdp/current/hbase-master/conf -Djava.security.auth.login.config=/usr/hdp/current/hbase-master/conf/hbase_master_jaas.conf org.jruby.Main /usr/hdp/current/hbase-master/bin/draining_servers.rb add fr0-datalab-p53.bdata.corp' returned 1. Error: Could not find or load main class org.jruby.Main

After a quick analysis, I realized that the jruby jar file has been moved to the $HBASE_HOME/lib/ruby/ folder (it was located in the $HBASE_HOME/lib/ folder in the HDP 2.5.3 release). While trying to figure out how to fix this issue, I understood that the hbase script invokes the hbase.distro script, which is supposed to add the jruby jar to the classpath "only when required", meaning only when it receives "shell" or "org.jruby.Main" as the $1 argument (after the --config one). When debugging its execution, I could see that it treats "-Djava.security.auth.login.config=/usr/hdp/current/hbase-master/conf/hbase_master_jaas.conf" as $1... and therefore does not add the jruby jar to the classpath.

If I remove the -Djava... argument and manually execute "/usr/hdp/current/hbase-master/bin/hbase --config /usr/hdp/current/hbase-master/conf org.jruby.Main /usr/hdp/current/hbase-master/bin/draining_servers.rb add fr0-datalab-p39.bdata.corp", it seems to work properly (at least I cannot see any error in the terminal).

My question is pretty simple: what is the best way to fix this problem?

1. Change the way Ambari builds the command line so that the -Djava option is removed (but I'm not sure this won't break something else),
2. Update the hbase script to systematically add the jruby jar file (from its new location) to HBASE_CLASSPATH,
3. Update hbase.distro so that it checks $2 instead of $1 when deciding whether jruby is required.

...

Thanks for your advice,
Sebastien
Labels:
- Apache Ambari
- Apache HBase
01-31-2018
09:58 AM
Thanks for this article. Everything works fine, except that my Thrift server fails to behave properly after the hbase user's Kerberos ticket expires (10h in my case). Is there a way to automatically refresh/renew the ticket so that my Thrift server runs indefinitely? Thanks
01-31-2018
09:56 AM
I finally figured out what happened: if the hbase user has a valid Kerberos ticket when it starts the Thrift server, everything works fine during the ticket lifetime (10h in my case), but the Thrift server fails to behave properly after this ticket has expired... Is there a way to configure something on the Thrift server side to automatically refresh/renew the Kerberos ticket? At the moment, I must stop/restart the Thrift server every 10h... Thanks for your advice. Sebastien
01-25-2018
04:01 PM
Hi,
I use a kerberized HDP 2.6.3 cluster comprising 3 HBase master nodes (named p21, p22 and p23). On one of these master nodes (p23), I started a Thrift server following the explanations from: https://community.hortonworks.com/articles/87655/start-and-test-hbase-thrift-server-in-a-kerberised.html

Everything went fine, and when I launch the DemoClient (as the "hbase" user) from one of the 3 master nodes ("hbase org.apache.hadoop.hbase.thrift.DemoClient p23 9090 true"), it executes successfully (even when launched from the 2 master nodes that don't host the Thrift server).

However, when I try to execute some client code from an edge node (and as another, non-hbase user) to access my Thrift server, it fails to execute properly. I can see many instances of the following error in the Thrift server logs:

2018-01-25 14:12:40,051 INFO [thrift-worker-0] client.RpcRetryingCaller: Call exception, tries=10, retries=35, started=48576 ms ago, cancelled=false, msg=com.google.protobuf.ServiceException: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to p21/10.XXX.XXX.XXX:16000 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to p21/10.XXX.XXX.XXX:16000 is closing. Call id=10, waitTime=7

...and after a while, my client fails and returns the same kind of error. It's worth saying that I can successfully execute "hbase shell" queries from this edge node, so I guess my hbase-site.xml file is correct. Any idea why this error happens only for Thrift-based queries?

Thanks for your help,
Sebastien
Labels:
- Apache HBase
11-06-2017
03:19 PM
And also, should I keep using the "--files" option with hbase-site.xml on the command line, or not?
11-06-2017
08:11 AM
Thanks for your help! Just an additional question: did you have to manually copy hbase-site.xml into the $SPARK_HOME/conf folder on ALL nodes of the cluster?
10-23-2017
02:51 PM
I am in exactly the same situation (no way to reach the internet from our cluster...). Did you find any other option to make it run?
10-23-2017
02:44 PM
Hi, I'm trying to execute PySpark code with SHC (the Spark HBase connector) to read data from HBase on a secured (Kerberos) cluster. Here is a simple example I can provide to illustrate:

# readExample.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

catalog = ''.join("""{
    "table":{"namespace":"default", "name":"firsttable"},
    "rowkey":"key",
    "columns":{
        "firstcol":{"cf":"rowkey", "col":"key", "type":"string"},
        "secondcol":{"cf":"d", "col":"colname", "type":"string"}
    }
}""".split())

df = sqlc.read \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .load()

df.select("secondcol").show()
In order to execute this properly, I successfully ran the following command line:

spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml --keytab=/path/to/my/keytab/myuser.keytab --principal=myuser@DOMAIN.CORP readExample.py

which I guess is strictly equivalent to:

spark-submit --master local[*] --deploy-mode client --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml --keytab=/path/to/my/keytab/myuser.keytab --principal=myuser@DOMAIN.CORP readExample1.py

This is fine, but now I would like to execute the same thing on the cluster. I tried the following options, but both failed:

1 - client mode

spark-submit --master yarn --deploy-mode client --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml --keytab=/path/to/my/keytab/myuser.keytab --principal=myuser@DOMAIN.CORP readExample1.py

Driver execution hangs waiting for executors, which fail to connect to HBase. Logs from an executor:

17/10/23 16:28:10 ERROR AbstractRpcClient: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:642)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:166)
2 - cluster mode

spark-submit --master yarn --deploy-mode cluster --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml --keytab=/path/to/my/keytab/myuser.keytab --principal=myuser@DOMAIN.CORP readExample1.py

Here, even the driver (which runs on the cluster) fails to connect to HBase. Logs from the driver:

17/10/23 14:02:18 ERROR AbstractRpcClient: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:642)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:166)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:769)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:766)
My questions are rather simple:
- Is it possible for my driver and executors to successfully connect to HBase?
- What should I do, in addition to passing them my Kerberos keytab/principal, to make this work?

Thanks for your help
Labels:
- Apache HBase
- Apache Spark
- Apache YARN
10-19-2017
03:23 PM
Hi, I'm trying to execute Python code with SHC (the Spark HBase connector) to connect to HBase from a Python Spark-based script. Here is a simple example I can provide to illustrate:

# readExample.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

catalog = ''.join("""{
    "table":{"namespace":"default", "name":"firsttable"},
    "rowkey":"key",
    "columns":{
        "firstcol":{"cf":"rowkey", "col":"key", "type":"string"},
        "secondcol":{"cf":"d", "col":"colname", "type":"string"}
    }
}""".split())

df = sqlc.read \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .load()

df.select("secondcol").show()
In order to execute this properly, I successfully ran the following command line:

spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml readExample.py

Great 🙂 Now, I would like to run this exact same example from my Jupyter notebook. After a while, I finally figured out how to pass the required "package" to Spark by adding the following cell at the beginning of my notebook:

import os
import findspark

os.environ["SPARK_HOME"] = '/usr/hdp/current/spark-client'
findspark.init('/usr/hdp/current/spark-client')
os.environ['PYSPARK_SUBMIT_ARGS'] = ("--repositories http://repo.hortonworks.com/content/groups/public/ "
                                     "--packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 "
                                     " pyspark-shell")

...But when I ran all the cells from my notebook, I got the following exception:

Py4JJavaError: An error occurred while calling o50.showString.
: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:312)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
at org.apache.hadoop.hbase.client.MetaScanner.listTableRegionLocations(MetaScanner.java:343)
at org.apache.hadoop.hbase.client.HRegionLocator.listRegionLocations(HRegionLocator.java:142)
at org.apache.hadoop.hbase.client.HRegionLocator.getStartEndKeys(HRegionLocator.java:118)
at org.apache.spark.sql.execution.datasources.hbase.RegionResource$$anonfun$1.apply(HBaseResources.scala:109)
at org.apache.spark.sql.execution.datasources.hbase.RegionResource$$anonfun$1.apply(HBaseResources.scala:108)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.releaseOnException(HBaseResources.scala:77)
at org.apache.spark.sql.execution.datasources.hbase.RegionResource.releaseOnException(HBaseResources.scala:88)
at org.apache.spark.sql.execution.datasources.hbase.RegionResource.<init>(HBaseResources.scala:108)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD.getPartitions(HBaseTableScan.scala:61)
From what I understood, this exception probably comes up because the HBase client component cannot use the right hbase-site.xml (which defines the ZooKeeper quorum, etc.). I tried to add "--files /etc/hbase/conf/hbase-site.xml" to the content of the PYSPARK_SUBMIT_ARGS environment variable, but this did not change anything... Any idea how to pass the hbase-site.xml properly? Thanks for your help
Labels:
- Apache HBase
- Apache Spark
09-15-2017
01:37 PM
In fact, I realized that I had to set these properties in the "Custom spark-defaults" section in Ambari. This way, they are written to the spark-defaults.conf configuration file and everything works fine.
09-14-2017
10:18 AM
Hi, I'm trying to set a default value for the "spark.driver.extraJavaOptions" configuration property from Ambari, in order to avoid all my users having to define it in their command-line arguments. I tried to define this property in the "Custom spark-javaopts-properties" section of the Ambari UI, but it didn't work (the property does not seem to be used anywhere) and, even worse, I am not able to find out where this property ends up. I thought it would be written to a Spark configuration file (spark-defaults.conf or another one), but I couldn't find the property anywhere... Does anyone know whether I picked the right place to define the property, and where it goes in the configuration files? Thanks for your help
Labels:
- Apache Ambari
- Apache Spark
09-14-2017
10:08 AM
I figured out what was wrong: in fact, my class has to extend Configured and implement Tool in order to parse the configuration properties from the command line. It works fine now! I even figured out that I could set the property in Ambari: the label "MR Map Java Heap Size" actually maps to the "mapreduce.map.java.opts" property, which is pretty confusing...
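For reference, here is a minimal sketch of what this looks like; the job wiring below is an assumption (the mapper/reducer setup is omitted), but it shows how ToolRunner consumes the -D options and leaves only the input/output paths in args:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestHttpsMR extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains the -D properties parsed by ToolRunner,
        // e.g. -Dmapreduce.map.java.opts="-Djavax.net.ssl.trustStore=..."
        Job job = Job.getInstance(getConf(), "wordcount-with-https-tracking");
        job.setJarByClass(TestHttpsMR.class);
        // job.setMapperClass(...); job.setReducerClass(...);  // omitted in this sketch

        // After ToolRunner has consumed the generic options, args only holds
        // the application arguments: the input and output directories.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new TestHttpsMR(), args));
    }
}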
09-13-2017
02:01 PM
Hi, I'm currently struggling with MapReduce configuration. I'm trying to implement the common "wordcount" example, but I modified the implementation so that the mappers call an HTTPS web service to track overall progress (just for the sake of demonstration).
I have to provide the mappers' JVM with a custom truststore that contains the certificate of the CA that issued the web server's certificate, and I tried to use the following syntax:

hadoop jar mycustommr.jar TestHttpsMR -Dmapreduce.map.java.opts="-Djavax.net.ssl.trustStore=/my/custom/path/cacerts -Djavax.net.ssl.trustStorePassword=mypassword" wordcount_in wordcount_out

But I systematically hit the following error: "Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory wordcount_in already exists", which indicates that the arguments are not properly parsed: it seems that -Dmapreduce.map.java.opts="-Djavax.net.ssl.trustStore=/my/custom/path/cacerts -Djavax.net.ssl.trustStorePassword=mypassword" is interpreted as an application argument (the first one) instead of being passed to the mappers' JVM.

What's wrong with this syntax? How could I override the mapreduce.map.java.opts property without disturbing the application parameters? Thanks for your help
Labels:
- Apache Hadoop
09-12-2017
02:39 PM
If I understand properly, this configuration is used by Spark to secure data exchanges between the nodes, but my use case is slightly different: my executor runs custom Java code that performs a call to an HTTPS server, and in that context the SSL handshake relies on the default truststore of the JVM instead of the one I configured with my own CA certificate... Maybe that's not possible, and the only way to achieve this is to use the properties I mentioned previously... Thanks for your help
09-11-2017
03:06 PM
1 Kudo
Hi,
I'm trying to run a Spark job in which all executors have to call a secured (HTTPS) web service on a dedicated server. During the SSL handshake, this server returns a certificate that has been signed by a private (company-specific) CA. The certificate of this CA has been added to a custom truststore (cacert) that I would like to point to in the Spark configuration, so that the executors can validate the server's certificate without any extra configuration. I know that I can pass the following option to my spark-submit command line:

--conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=<MyCaCert> -Djavax.net.ssl.trustStorePassword=<MyPassword>"

...but I would like to avoid asking all our users to do this (because they are not supposed to know where this truststore is located, nor its password). I tried to use the "ssl.client.truststore.location" property as described in https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_security/content/ch_wire-webhdfs-mr-yarn.html but it didn't change anything; apparently Spark does not use this configuration. Do you guys know how the default truststore used by Spark executors is configured? Any help will be highly appreciated 🙂 Thanks
Labels:
- Apache Spark
08-09-2017
07:35 AM
1 Kudo
End of the story: in fact, the problem was related to https://issues.apache.org/jira/browse/HADOOP-10786. I moved to hadoop-common 2.6.1 and used the AuthUtil class (http://hbase.apache.org/1.2/devapidocs/org/apache/hadoop/hbase/AuthUtil.html), and everything started to work fine 🙂 Thanks for your help
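For reference, here is a minimal sketch of how AuthUtil can be wired up; the keytab path, principal and configuration keys below are assumptions to adapt to your own environment, not the exact code I used:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.AuthUtil;
import org.apache.hadoop.hbase.ChoreService;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ScheduledChore;

public class HBaseKerberosLogin {

    public static void startCredentialRenewal() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Placeholder keytab/principal: adapt these to your environment.
        conf.set("hbase.client.keytab.file", "/local/home/myuser/myuser.keytab");
        conf.set("hbase.client.kerberos.principal", "myuser@mydomain.com");

        // AuthUtil builds a chore that logs in from the keytab and keeps
        // re-logging in before the ticket expires.
        ScheduledChore authChore = AuthUtil.getAuthChore(conf);
        if (authChore != null) {
            ChoreService choreService = new ChoreService("hbase-auth-renewal");
            choreService.scheduleChore(authChore);
        }
    }
}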
08-07-2017
06:19 AM
Hi Josh, thanks for your help. Unfortunately, I'm still stuck with this issue, which seems related to HBase only, not a pure Kerberos/Hadoop problem, if I understand properly. I gave a try to a "non-HBase" web service that simply displays the content of an HDFS folder, with exactly the same idea (log in to the cluster at application startup + background thread that periodically renews), and it works like a charm: I invoke the WS, which properly displays the files in the HDFS folder, then I can wait for several days without any other activity on the web application and call it again successfully. Perfect.

Then, back to my HBase example: my web service logs in at startup, creates an HBase connection and displays the name of one table. But if I wait longer than the ticket lifetime, when I invoke the web service again I face the previously mentioned warnings. According to your answer, I guess I can ignore the first ones, but the last one is probably the reason why my web service ends with a socket timeout error:

17/08/01 16:02:01 WARN ipc.AbstractRpcClient: Couldn't setup connection for myuser@mydomain.com to hbase/myserver.mydomain.com@mydomain.com ...

As you were wondering what would occur next, I waited for a couple of minutes (>10) and got the same warning sequence again and again during this period, leading to a socket timeout error on the client side (which is not acceptable...). Finally, I took a look at your last suggestion, but when I try to proceed with 'kinit -R', I get the following:

kinit: KDC can't fulfill requested option while renewing credentials

And my ticket expiration time is not updated by this command... Could it be the root cause of my problem? Thanks again
08-03-2017
07:11 AM
Hi,
I'm trying to set up web services that interact with my Hadoop/HBase kerberized cluster.
My application is deployed in a Tomcat server, and I would like to avoid recreating a new HBase connection each and every time I have to access HBase.
Similarly, I want my application to be self-sufficient, i.e. I don't want to have to run 'kinit' commands before starting up my Tomcat server.
Thus, I would like to implement a utility class in charge of managing the login operation on the cluster and the connection to HBase, but I'm struggling with "ticket expiration" issues. The first time my GetHbaseConnection() method is invoked, it properly connects to the cluster using the provided keytab and principal (via the UserGroupInformation.loginUserFromKeytab(user, keyTabPath) method) and creates a brand new HBase connection (ConnectionFactory.createConnection(conf)) => perfect. By default, the obtained ticket has a 10h lifetime (default value from the /etc/krb5.conf file), so everything seems to work fine during the first 10-hour period. Unfortunately, after this ticket has expired, my code fails with the following exception:

17/08/01 07:40:52 http-nio-8443-exec-4 WARN AbstractRpcClient:699 - Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/08/01 07:40:52 http-nio-8443-exec-4 ERROR AbstractRpcClient:709 - SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'. javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

=> So I had to set up a dedicated thread that invokes the UserGroupInformation.checkTGTAndReloginFromKeytab() method on a regular basis in order to refresh the ticket.
Anyway, after a long period of inactivity (typically a whole night), when I try to invoke my web service, I can see the following warnings in my Tomcat logs:

17/08/03 08:25:28 hconnection-0x51b0ea6-shared--pool1-t51 WARN UserGroupInformation:1113 - Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
17/08/03 08:25:29 hconnection-0x51b0ea6-shared--pool1-t51 WARN UserGroupInformation:1113 - Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
17/08/03 08:25:30 hconnection-0x51b0ea6-shared--pool1-t51 WARN UserGroupInformation:1113 - Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
17/08/03 08:25:31 hconnection-0x51b0ea6-shared--pool1-t51 WARN UserGroupInformation:1113 - Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
17/08/03 08:25:35 hconnection-0x51b0ea6-shared--pool1-t51 WARN AbstractRpcClient:695 - Couldn't setup connection for myuser@mydomain.com to hbase/myserver.mydomain.com@mydomain.com
...And my call to the web service finally fails with a SocketTimeoutException... To reproduce the issue quickly, I wrote a simple Java application (outside of Tomcat) and removed the code that logs the user in to the cluster, delegating this part to an external/manual kinit operation:

1. Proceed with a 'kinit' operation outside of my Java application. This way I am able to get a "short-life" (1 minute) ticket using a custom krb5.conf file:

env KRB5_CONFIG=/local/home/myuser/mykrb5.conf kinit -kt /local/home/myuser/myuser.keytab myuser@mydomain.com

2. Then I execute my standalone Java application, which displays the name of one table in HBase on a regular basis (every 10 seconds). Note that I create a new HBase connection for every iteration; I don't try to reuse the connection at the moment:

public static void main(String[] args) throws IOException, InterruptedException {
    System.setProperty("sun.security.krb5.debug", "true");
    Configuration configuration = HBaseConfiguration.create();
    while (true) {
        // A fresh connection is created on every iteration (and never closed here).
        Connection conn = ConnectionFactory.createConnection(configuration);
        Admin admin = conn.getAdmin();
        TableName[] tableNames = admin.listTableNames();
        System.out.println(tableNames[0].getNameWithNamespaceInclAsString());
        Thread.currentThread().sleep(10000);
    }
}
It works perfectly for 1 minute, but then I face endless warnings and my code does not execute properly:

17/08/01 16:01:55 WARN security.UserGroupInformation: Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
17/08/01 16:01:57 WARN security.UserGroupInformation: Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
17/08/01 16:01:59 WARN security.UserGroupInformation: Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
17/08/01 16:02:00 WARN security.UserGroupInformation: Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
17/08/01 16:02:01 WARN ipc.AbstractRpcClient: Couldn't setup connection for myuser@mydomain.com to hbase/myserver.mydomain.com@mydomain.com ...
I don't understand how Kerberos ticket expiration and the HBase connection work together; could anyone help on this topic? In other words, I would like my application to log in to the cluster when it starts up and create an HBase connection that I can keep "forever". Is that possible? What did I miss? Thanks for your help
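For completeness, here is a minimal sketch of the approach I describe above (keytab login at startup plus a background relogin task); the keytab path, principal and renewal period are placeholder assumptions:

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.security.UserGroupInformation;

public class HBaseConnectionHolder {

    public static Connection createLongLivedConnection() throws IOException {
        Configuration conf = HBaseConfiguration.create();

        // Initial login from the keytab (placeholder principal/path).
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab("myuser@mydomain.com",
                "/local/home/myuser/myuser.keytab");

        // Background task that re-logs in from the keytab before the TGT expires.
        // The 5-minute period is an arbitrary illustration value.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }, 5, 5, TimeUnit.MINUTES);

        // The connection is created once and can be reused by callers.
        return ConnectionFactory.createConnection(conf);
    }
}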
Labels:
- Apache HBase
07-06-2017
03:01 PM
Hi, I just started to configure the Knox gateway in order to enforce the HTTPS protocol and require authentication to access our cluster UIs and services.
I gave a try to the YARN UI, following the https://knox.apache.org/books/knox-0-9-0/user-guide.html#Yarn+UI documentation, but faced several problems related to wrong URLs.

First, the "main" UI works as expected using the following URL:
https://knoxserver:8443/gateway/{cluster}/yarn => it properly displays the list of all the YARN applications, and if I click on one application in this screen, it displays the details about this application (the application type is "MAPREDUCE").

Next, if I click on the "Logs" link for the first attempt of the application, my browser complains about the URL, because this URL is something like "http://https//knoxserver:8443/gateway/pam/yarn/nodemanager/node/containerlogs/container_e70_1499097971447_0001_01_000001/...".

Similarly, if I click on the "History" link (next to the "Tracking URL:" label), my browser displays the details about the M/R job, but all the links in this page are wrong because they are missing "gateway/{cluster}/yarn" in their URL. As an example, here is the URL behind the "1" link for successful Map operations: https://knoxserver:8443/jobhistory/attempts/job_1499097971447_0001/m/SUCCESSFUL

It seems to be a problem with the way these pages are built? Is there a way to fix or work around this problem with "rewrite" rules in the Knox service configuration? I don't think so, but as I'm a beginner with Knox, I'd rather ask you guys.
Thanks for your help
Labels:
- Apache Knox
- Apache YARN
07-06-2017
06:25 AM
Hi Josh, you are right. I tried using another principal for my client, to match the realm of the principal used by PQS, and it works fine now... Thanks a lot for your help
07-05-2017
10:02 AM
I'm trying to use Phoenix Query Server on my kerberized cluster. I tried to connect to it with the provided thin client tool, without any success:

/sqlline-thin.py http://myserver.fqdn:8765

This results in the following error:
...
17/07/05 11:49:51 DEBUG auth.HttpAuthenticator: Authentication succeeded
17/07/05 11:49:51 DEBUG conn.DefaultManagedHttpClientConnection: http-outgoing-0: Close connection
17/07/05 11:49:51 DEBUG execchain.MainClientExec: Connection discarded
17/07/05 11:49:51 DEBUG conn.PoolingHttpClientConnectionManager: Connection released: [id: 0][route: {}->http://fr0-datalab-p31.bdata.corp:8765][total kept alive: 0; route allocated: 0 of 25; total allocated: 0 of 100]
java.lang.RuntimeException: Failed to execute HTTP Request, got HTTP/403
at org.apache.calcite.avatica.remote.AvaticaCommonsHttpClientSpnegoImpl.send(AvaticaCommonsHttpClientSpnegoImpl.java:148)
at org.apache.calcite.avatica.remote.RemoteProtobufService._apply(RemoteProtobufService.java:44)
at org.apache.calcite.avatica.remote.ProtobufService.apply(ProtobufService.java:81)
at org.apache.calcite.avatica.remote.Driver.connect(Driver.java:175)
at sqlline.DatabaseConnection.connect(DatabaseConnection.java:157)
at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:203)
However, it ends up at the CLI prompt: 0: jdbc:phoenix:thin:url=http://myserver> ...but without any valid connection: as soon as I try to run a "select" statement, I get a "No current connection" message. To solve this, I tried to execute:

!connect myserver.fqdn myuser mypassword

but faced: No known driver to handle "myserver.fqdn"

It's worth saying that:
- I performed a successful "kinit" beforehand.
- PQS runs fine on myserver.fqdn and listens on port 8765 (the default one).

Any idea how I could investigate this issue further? Thanks for your help
Labels:
- Apache HBase
- Apache Phoenix
07-04-2017
01:59 PM
Hi, I'm facing a strange issue with my Knox configuration but cannot figure out what's wrong:
We have 2 instances of knox on our cluster (let's say on server1 and server2), and we have configured them for webhdfs HA. Extract of our topology file:

<topology>
  <gateway>
    ...
    <provider>
      <role>ha</role>
      <name>HaProvider</name>
      <enabled>true</enabled>
      <param>
        <name>WEBHDFS</name>
        <value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
      </param>
      ...
    </provider>
  </gateway>
  ...
  <service>
    <role>WEBHDFS</role>
    <url>http://namenode1:50070/webhdfs</url>
    <url>http://namenode2:50070/webhdfs</url>
  </service>
</topology>
The strange thing is that if we perform a WebHDFS operation through Knox on server1 (with curl):

curl -s -i -k -H "Authorization: Basic dGEtMXQ3Ny1iZGF0YS1zY2g6QW50b2luZTg3MSE=" -X GET 'https://server1:8443/gateway/pam/webhdfs/v1//user/myuser/myFile.txt?op=OPEN'

=> we get a redirect to an https URL on server1.

But if we send the same request to the server2 gateway:

curl -s -i -k -H "Authorization: Basic dGEtMXQ3Ny1iZGF0YS1zY2g6QW50b2luZTg3MSE=" -X GET 'https://server2:8443/gateway/pam/webhdfs/v1//user/myuser/myFile.txt?op=OPEN'

=> we get a redirect to an http URL on one datanode (port 1022).

I cannot find any difference between the server1 and server2 Knox configurations, so where should I look to understand how Knox redirects incoming requests to the WebHDFS service? Any help will be greatly appreciated!
Labels:
- Apache Hadoop
- Apache Knox
06-30-2017
08:23 AM
@Jay SenSharma Hi Jay, I ran your sample requests successfully, but I'm wondering whether it is possible to get all the values of the dummy metric with the REST API. In your example, you post 2 values for your dummy metric, but the GET request only returns the latest value. The documentation states that the Ambari collector only returns the latest value if startTime and endTime are not specified in the request, but even by adding those 2 parameters, I couldn't get the 2 original values... Any idea? Thanks again for your help
06-30-2017
07:20 AM
Well, nobody knows, apparently... I will give another component a try, because the Ambari Metrics documentation is definitely not at the right maturity level.
06-27-2017
02:40 PM
In other words, I would like to figure out whether all values are stored somewhere and AMS aggregates them when a client requests some metrics, or whether there is an aggregation step when AMS receives new values for a metric. Is it possible to retrieve several values for a metric in a single REST GET call? I tried to run exactly the same requests as described in https://cwiki.apache.org/confluence/display/AMBARI/Metrics+Collector+API+Specification (2 values separated by a 1s interval), but I never got 2 values back when sending my GET request... Did someone run this example successfully?
06-27-2017
01:58 PM
Hi, I'm trying to understand the Ambari Metrics API, but cannot understand what data is stored/returned by the server in the following scenario:
First, I post new metric data with 2 values, at t0 and t0 + 2s:

curl -H "Content-Type: application/json" -X POST -d '{"metrics":
[{"metricname": "MyMetric", "appid": "amssmoketestfake", "hostname":
"sandbox.hortonworks.com", "timestamp": 1498569374758, "starttime":
1498569374758, "metrics": {"1498569374758": 0.963781711428,
"1498569376758": 1432075898000}}]}'
"http://127.0.0.1:6188/ws/v1/timeline/metrics"
If I try to request this metric with a GET request, here is what AMS returns to me:

curl -H "Content-Type: application/json" -X GET
"http://127.0.0.1:6188/ws/v1/timeline/metrics?metricNames=MyMetric&appId=amssmoketestfake&hostname=sandbox.hortonworks.com"&startTime=1498569373758&endTime=1498569377758 Result :
{"metrics":[{"timestamp":1498569214725,"metadata":{},"metricname":"MyMetric","appid":"amssmoketestfake","hostname":"sandbox.hortonworks.com","starttime":1498569374758,"metrics":{"1498569376758":1.432075898E12}}]}
Why does AMS not return the 2 values for MyMetric when I explicitly define the timeframe in the GET request? I would expect to get all the values for the timestamps between the boundaries... Thanks for your help...
Labels:
- Apache Ambari
06-21-2017
12:01 PM
@Aravindan Vijayan Thanks a lot for these helpful pointers. Actually, I was not aware of this time boundary, and as soon as I changed the timestamps in my JSON data, it started to work fine!
06-20-2017
09:48 AM
I finally figured out how to connect to the Ambari Metrics tables with Phoenix: by default, sqlline.py points to the "main" HBase configuration, not the AMS embedded instance. By defining the HBASE_CONF_DIR env variable, I got it working:

export HBASE_CONF_DIR=/etc/ambari-metrics-collector/conf
/usr/hdp/current/phoenix-client/bin/sqlline.py fr0-datalab-p09.bdata.corp:61181:/ams-hbase-secure

I guess there is something similar when trying to connect to ZooKeeper, to point to the embedded instance instead of the "main" ZooKeeper of the cluster, but I couldn't solve this at the moment...
06-20-2017
07:47 AM
Hi, I've been struggling with Ambari Metrics for a couple of days and cannot figure out how to investigate further.

Basically, I have a secured (kerberized) HDP 2.5 cluster, and I would like to post custom metrics into Ambari Metrics. It's worth saying that the timeline.metrics.service.operation.mode property has the "embedded" value, which means (if I understood properly) that AMS has an embedded HBase instance with its own ZooKeeper. Let's name the server running the Ambari Metrics collector server1.mydomain.com.

I gave a try to the following request:

curl -H "Content-Type: application/json" -X POST -d '{"metrics": [{"metricname": "AMBARI_METRICS.SmokeTest.FakeMetric", "appid": "amssmoketestfake", "hostname": "server1.mydomain.com", "timestamp": 1432075898000, "starttime": 1432075898000, "metrics": {"1432075898000": 0.963781711428, "1432075899000": 1432075898000}}]}' "http://server1.mydomain.com:6188/ws/v1/timeline/metrics"

=> Returned an HTTP 200 code, with the following JSON data: {"errors":[]}

Then, I tried to retrieve this dummy metric with the following request:

curl -H "Content-Type: application/json" -X GET "http://server1.mydomain.com:6188/ws/v1/timeline/metrics?metricNames=AMBARI_METRICS.SmokeTest.FakeMetric&appId=amssmoketestfake&hostname=server1.mydomain.com"

=> Returned an HTTP 200 code, with the following JSON data: {"metrics":[]}

While trying to figure out why my metrics don't come up in this GET request, I'm facing security concerns.

First, I want to connect to Phoenix/HBase to check whether the metrics were properly stored. I checked the following properties in the /etc/ambari-metrics-collector/conf/hbase-site.xml file:
- hbase.zookeeper.property.clientPort: 61181
- zookeeper.znode.parent: /ams-hbase-secure

So I gave a try to the following command:

/usr/hdp/current/phoenix-client/bin/sqlline.py server1.mydomain.com:61181:/ams-hbase-secure

I receive the following warning every 15 seconds, and the connection never succeeds:

17/06/20 08:46:04 WARN ipc.AbstractRpcClient: Couldn't setup connection for myuser@mydomain.com to hbase/server1.mydomain.com@mydomain.com

=> Should I use a particular user to execute this command (ams, hbase, ...)? Is it even possible to connect to the embedded Phoenix instance like this?

I also tried to connect to the embedded HBase's ZooKeeper instance with the following command:

zookeeper-client -server server1.mydomain.com:61181

But I couldn't connect and received the following errors:

2017-06-20 09:30:24,306 - ERROR [main-SendThread(server1.mydomain.com:61181):ZooKeeperSaslClient@388] - An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2017-06-20 09:30:24,306 - ERROR [main-SendThread(server1.mydomain.com:61181):ClientCnxn$SendThread@1059] - SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.

What's wrong with this?
I performed a kinit operation beforehand, but it seems that my ticket is not granted sufficient permissions... Should I try to connect as a specific user in order to read the ZooKeeper content? Thanks for your help
Labels:
- Apache Ambari
- Apache HBase
06-19-2017
12:05 PM
@Jay SenSharma I just gave your suggestion a try, but I still cannot get the same result as you:

1 - POST the dummy metric:

curl -H "Content-Type: application/json" -X POST -d '{"metrics": [{"metricname": "AMBARI_METRICS.SmokeTest.FakeMetric", "appid": "amssmoketestfake", "hostname": "sandbox.hortonworks.com", "timestamp": 1432075898000, "starttime": 1432075898000, "metrics": {"1432075898000": 0.963781711428, "1432075899000": 1432075898000}}]}' "http://sandbox.hortonworks.com:6188/ws/v1/timeline/metrics"

Output: {"errors":[]}

2 - GET the dummy metric:

curl -H "Content-Type: application/json" -X GET "http://sandbox.hortonworks.com:6188/ws/v1/timeline/metrics?metricNames=AMBARI_METRICS.SmokeTest.FakeMetric&appId=amssmoketestfake&hostname=sandbox.hortonworks.com"

Output: {"metrics":[]}

I cannot understand why the posted metrics don't show up here...