Created 02-22-2017 07:41 PM
If I am using self-signed certificates, where do I put the my.truststore file I created? Into $JAVA_HOME/jre/lib/security/my.truststore or $JAVA_HOME/jre/lib/security/jssecacerts?
It is not clear to me how it is going to be used and where I would need to specify it when configuring Levels 1-3 of TLS. It is also not clear whether both a keystore and a truststore are needed, or whether keystores are enough.
Is there a good introduction to the concepts of certificates, truststores, and keystores?
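From what I have read so far (please correct me if this is wrong), the JVM automatically prefers $JAVA_HOME/jre/lib/security/jssecacerts over cacerts when that file exists, so one option seems to be importing the self-signed CA there instead of shipping a separately named file:
# rootCA.pem is a placeholder for my self-signed CA certificate
keytool -importcert -alias my-root-ca -file rootCA.pem \
  -keystore $JAVA_HOME/jre/lib/security/jssecacerts -storepass changeit
# alternatively, keep my.truststore where it is and point each JVM at it:
#   -Djavax.net.ssl.trustStore=/path/to/my.truststore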
Created 03-02-2017 12:20 PM
I have tested various Hadoop functionality from the command line. Most things work, except pig and pyspark when run on a local machine:
1) hdfs commands seem to be working
2) mapreduce is working
3) pig works when submitting a job to the cluster, but runs out of memory on a tiny job in local mode (a heap workaround is sketched at the end of this post) with
pig -x local
2017-03-02 13:48:36,124 [Thread-20] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1306469224_0004
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:987)
at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:402)
at org.apache.hadoop.mapred.MapTask.access$100(MapTask.java:81)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2017-03-02 13:48:42,128 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to
stop immediately on failure.
2017-03-02 13:48:42,128 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local1306469224_0004 has failed! Stop running all dependent jobs
4) I can submit a Spark job to the cluster with
spark-submit --master=yarn --num-executors=3 lettercount.py
but pyspark without arguments crashes:
[ivy2@md01 lab7_Spark]$ pyspark
Python 2.7.5 (default, Nov 20 2015, 02:00:19)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Error: Cluster deploy mode is not applicable to Spark shells.
Run with --help for usage help or --verbose for debug output
Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/python/pyspark/shell.py", line 43, in <module>
sc = SparkContext(pyFiles=add_files)
File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/python/pyspark/context.py", line 112, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/python/pyspark/context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
On the other hand, it works with --master=yarn:
[ivy2@md01 lab7_Spark]$ pyspark --master=yarn
Python 2.7.5 (default, Nov 20 2015, 02:00:19)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.5 (default, Nov 20 2015 02:00:19)
SparkContext available as sc, HiveContext available as sqlContext.
5) Hive works
6) HBase works
Why do pyspark and pig misbehave on a local node but work fine when running on the cluster?
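In case the pig failure is just the local JVM heap, I plan to try something along these lines (the values are guesses; the OOM comes from MapOutputBuffer, which allocates a sort buffer of mapreduce.task.io.sort.mb megabytes inside the same LocalJobRunner JVM, and PIG_HEAPSIZE is the knob the pig wrapper script reads):
export PIG_HEAPSIZE=2048                          # heap in MB for the local pig JVM
pig -Dmapreduce.task.io.sort.mb=64 -x local wordcount.pig   # wordcount.pig is a placeholder script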
Created 03-02-2017 02:17 PM
I think I misinterpreted it: pyspark and spark-submit crash on the cluster, not locally.
As far as I understand, --master=yarn --deploy-mode=client runs the driver locally, --master=yarn --deploy-mode=cluster runs it on the cluster, and pyspark by default is probably trying to run in cluster mode.
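To confirm where that cluster default comes from, I will check the gateway configuration (the path below is the CDH parcel default on my hosts) and force client mode explicitly:
grep -i deploy /etc/spark/conf/spark-defaults.conf
pyspark --master yarn --deploy-mode client   # shells have to run the driver in client mode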
Created 03-02-2017 02:23 PM
Also, pig actually seems to work with both -x local and -x mapreduce. I think I just messed up the directories the first time. But spark is definitely a problem.
Created 03-02-2017 02:40 PM
As a backup solution, how do I disable TLS? Set use_tls=0 in
/etc/cloudera-scm-agent/config.ini, undo all the TLS/SSL settings on the two web
pages, then restart the server, the agents, and the Cloudera Management Service?
I need to have the cluster in production within a few days.
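Concretely, I assume the rollback on each host looks roughly like this (the paths and service names are the Cloudera Manager 5.x defaults):
# on every host running an agent:
sudo sed -i 's/^use_tls=1/use_tls=0/' /etc/cloudera-scm-agent/config.ini
sudo service cloudera-scm-agent restart
# on the Cloudera Manager host only:
sudo service cloudera-scm-server restart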
Created 03-02-2017 08:37 PM
I undid all the TLS enabling and still had the same problem.
Eventually it occurred to me that some processes might be stuck,
so I physically rebooted all the Hadoop machines, and that resolved the problem.
After that I was able to re-enable all the TLS steps.
However, I still have a problem with pyspark and spark-submit: they run with
--master=yarn --deploy-mode=client
but fail with
--master=yarn --deploy-mode=cluster
==============
[ivy2@md01 ~]$ pyspark --master=yarn --deploy-mode=cluster
Python 2.7.5 (default, Nov 20 2015, 02:00:19)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Error: Cluster deploy mode is not applicable to Spark shells.
Run with --help for usage help or --verbose for debug output
Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/python/pyspark/shell.py", line 43, in <module>
sc = SparkContext(pyFiles=add_files)
File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/python/pyspark/context.py", line 112, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/python/pyspark/context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
>>>
==========
Could TLS interfere with Spark?
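Rereading the first line of that output, though: "Error: Cluster deploy mode is not applicable to Spark shells", so pyspark refusing cluster mode may be expected behavior, and the real cluster-mode test is probably the batch path:
spark-submit --master yarn --deploy-mode cluster lettercount.py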
Created 03-02-2017 10:21 PM
It looks like YARN also needs to be told about TLS? If TLS is fully enabled, would things still work without configuring it for YARN?
Created 03-02-2017 10:23 PM
And Oozie?
Created 03-02-2017 10:26 PM
And a lot of other components have TLS settings in their Security section ...
Are those mandatory, or only needed for Kerberos?