11-07-2017
11:50 PM
That's helpful and it was exactly what I was missing. Thank you so much, I'm marking your last answer as the solution. Best regards.
11-07-2017
11:44 PM
Okay, that is news to me. Then, since I want to use Spark 2, is it the same for spark-submit? Do I just submit my application, having installed Spark 2 instead of Spark? Or does this command also change for Spark 2? Thank you so much.
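For reference, on CDH the Spark 2 parcel ships its own client commands alongside the Spark 1.6 ones, so the invocation does change; a minimal sketch, assuming the Spark 2 gateway is deployed on the node (app.py is a hypothetical script name):

# Spark 2 uses the "2"-suffixed entry points instead of pyspark/spark-submit
pyspark2
spark2-submit --master yarn --deploy-mode client app.py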
11-07-2017
11:29 PM
Hi Harsh, thank you for your reply. The node where I'm executing pyspark doesn't have a Spark 1.6 Gateway role; should it have one? It has the Spark 2 Gateway role, and the JobHistoryServer, NodeManager and ResourceManager roles for YARN.
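A quick way to confirm the Spark 2 gateway is usable from that node is to check that its client configuration and entry points were deployed; a minimal sanity check, assuming default CDH paths:

# the Spark 2 gateway deploys a separate client config directory
ls /etc/spark2/conf
# and the "2"-suffixed launchers should resolve on the PATH
which pyspark2 spark2-submit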
11-07-2017
01:52 AM
Hi all,
I had Spark 1.6 working with YARN in my cluster. I wanted to use Spark 2 in my cluster because of DataFrames, and I followed the instructions in this link to install it: https://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html
Once I had finally installed Spark 2, when I try to start pyspark from the console it gives me the following stack trace:
/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/bin$ pyspark
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:123)
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:123)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:123)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:109)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/shell.py", line 43, in <module>
sc = SparkContext(pyFiles=add_files)
File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/context.py", line 112, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
>>>
Can anyone help me with this? Maybe I missed something in the install process?
Thank you so much in advance.
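For anyone hitting the same trace: note that the pyspark launched above is the Spark 1.6 one inside the CDH parcel, which fails here because it can't find the Hadoop classes on its classpath; the Spark 2 parcel installs separately with its own launcher. A minimal sketch of what to check, assuming default parcel locations:

# Spark 2 is a separate parcel, not part of the CDH one
ls /opt/cloudera/parcels/SPARK2
# the Spark 2 shell has its own command name
pyspark2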
Labels: Apache Spark, Apache YARN, Cloudera Manager
09-18-2017
02:52 AM
I thought I had tried this before, but it seems I didn't do it the right way. Now I tried it once again and it worked. Thank you so much, and sorry for the silly question.
09-11-2017
04:29 AM
Hello all, I'm trying to do a bulk load from a CSV file into a table in Impala. The table has the same fields as my CSV file, and I'm using the following command to load it:

LOAD DATA INPATH '/user/myuser/data/file.csv' INTO TABLE my_database.my_table;

The path is an HDFS path and my file uses \t as the separator. When I execute the statement, everything seems to be okay. After that I query for count(*) and I get exactly the same number of rows as there were lines in my file, but when I do a SELECT, all rows and fields are NULL. I read in the Cloudera documentation that "If a text file has fewer fields than the columns in the corresponding Impala table, all the corresponding columns are set to NULL when the data in that file is read by an Impala query." But since I have the same number of columns, I don't know what the problem is here. Does anybody have any ideas or possible solutions? Thank you so much in advance.
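A common cause of this exact symptom (correct row count but all columns NULL) is a delimiter mismatch: Impala text tables default to the \001 (Ctrl-A) field delimiter, so a tab-separated file parses as one unmatchable field per line. A minimal sketch of declaring the delimiter explicitly at table creation, with hypothetical columns (id, name) standing in for the real schema:

# recreate the target table with an explicit tab delimiter (columns are placeholders)
impala-shell -q "CREATE TABLE my_database.my_table (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE"

After that, the same LOAD DATA INPATH statement should populate the columns instead of NULLs.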
Labels: Apache Impala
09-01-2017
02:26 AM
Finally I tested your solution and it worked for me! I'm going to mark your answer as the solution. Thank you so much 😄 Jose.
08-30-2017
01:47 AM
Thank you so much as well @csguna, that looks interesting and may be a possible solution. When my current execution finishes, I will try this. I will come back and comment on my results!
08-30-2017
12:33 AM
Thanks for your reply @vanhalen, do you know where I could find more information about this? If needed, I can provide more details. For the moment, what I have tried (based on what I saw in another similar topic) is increasing Impala's heap size, and I'm running the ETL again. If this doesn't work I will try to increase Pentaho's memory settings, but I think the problem is in Impala rather than in Pentaho. Thank you so much once again.
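On the Pentaho side, the client JVM heap can be raised through an environment variable that the stock PDI launch scripts read; a minimal sketch, assuming the default spoon.sh/kitchen.sh launchers and that 4 GB is an acceptable ceiling:

# raise the PDI JVM heap before launching the job
export PENTAHO_DI_JAVA_OPTIONS="-Xms1g -Xmx4g"
./kitchen.sh -file=my_etl_job.kjb

Here my_etl_job.kjb is a hypothetical job file; the same variable applies when launching spoon.sh.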
08-29-2017
11:22 PM
Hello!! I'm doing an ETL process using Pentaho DI and loading data into my Cloudera Impala cluster. The point is that when I do the load with a large amount of data (a little bit more than 34K rows) I get a GC error. Previously I tried the load with fake data (10K rows) and it worked fine without any problem. The error I'm getting in Pentaho DI is the following:

2017/08/30 07:03:28 - Load to Impala Person.0 - at org.pentaho.di.trans.steps.tableoutput.TableOutput.writeToTable(TableOutput.java:385)
2017/08/30 07:03:28 - Load to Impala Person.0 - at org.pentaho.di.trans.steps.tableoutput.TableOutput.processRow(TableOutput.java:125)
2017/08/30 07:03:28 - Load to Impala Person.0 - at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
2017/08/30 07:03:28 - Load to Impala Person.0 - at java.lang.Thread.run(Thread.java:748)
2017/08/30 07:03:28 - Load to Impala Person.0 - Caused by: org.pentaho.di.core.exception.KettleDatabaseException:
2017/08/30 07:03:28 - Load to Impala Person.0 - Error inserting/updating row
2017/08/30 07:03:28 - Load to Impala Person.0 - OutOfMemoryError: GC overhead limit exceeded

Meanwhile, if I look at the impalad logs, I have an impalad.ERROR and an impalad.WARNING. The contents of both are the following:

impalad.WARNING:
W0829 18:06:21.070214 22994 DFSOutputStream.java:954] Caught exception
Java exception follows:
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:952)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:690)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:879)
E0830 07:03:29.317476 19425 client-request-state.cc:940] ERROR Finalizing DML: OutOfMemoryError: GC overhead limit exceeded

impalad.ERROR:
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0829 13:49:00.567322 3084 logging.cc:124] stderr will be logged to this file.
E0830 07:03:29.317476 19425 client-request-state.cc:940] ERROR Finalizing DML: OutOfMemoryError: GC overhead limit exceeded

So, does anybody have any idea what's happening, or has anyone seen a similar error? Help would be much appreciated. Thank you so much in advance. Jose.