11-07-2017
11:50 PM
That's helpful and it was exactly what I was missing. Thank you so much, I'm marking your last answer as the solution. Best regards.
11-07-2017
11:44 PM
Okay, that is news to me. Then, since I want to use Spark 2, is it the same for spark-submit? Do I just submit my application, having installed Spark 2 instead of Spark? Or does this command also change for Spark 2? Thank you so much.
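For reference, on CDH the Spark 2 parcel ships its own client commands alongside the Spark 1.6 ones, so the invocation does change; a minimal sketch, assuming the Spark 2 gateway is deployed on the node (app.py is a hypothetical script name):

# Spark 2 uses the "2"-suffixed entry points instead of pyspark/spark-submit
pyspark2
spark2-submit --master yarn --deploy-mode client app.py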
11-07-2017
11:29 PM
Hi Harsh, thank you for your reply. The node where I'm executing pyspark doesn't have a Spark 1.6 Gateway role; should it have one? It has the Spark 2 Gateway role, and the JobHistoryServer, NodeManager and ResourceManager roles for YARN.
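A quick way to confirm the Spark 2 gateway is usable from that node is to check that its client configuration and entry points were deployed; a minimal sanity check, assuming default CDH paths:

# the Spark 2 gateway deploys a separate client config directory
ls /etc/spark2/conf
# and the "2"-suffixed launchers should resolve on the PATH
which pyspark2 spark2-submit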
11-07-2017
01:52 AM
Hi all,
I had Spark 1.6 working with YARN in my cluster. I wanted to use Spark 2 in my cluster because of DataFrames, and I followed the instructions in this link to install it: https://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html
Once I had finally installed Spark 2, when I try to start pyspark from the console it gives me the following stack trace:
/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/bin$ pyspark
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:123)
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:123)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:123)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:109)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/shell.py", line 43, in <module>
sc = SparkContext(pyFiles=add_files)
File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/context.py", line 112, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
>>>
Can anyone help me with this? Maybe I missed something in the install process?
Thank you so much in advance.
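For anyone hitting the same trace: note that the pyspark launched above is the Spark 1.6 one inside the CDH parcel, which fails here because it can't find the Hadoop classes on its classpath; the Spark 2 parcel installs separately with its own launcher. A minimal sketch of what to check, assuming default parcel locations:

# Spark 2 is a separate parcel, not part of the CDH one
ls /opt/cloudera/parcels/SPARK2
# the Spark 2 shell has its own command name
pyspark2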
Labels: Apache Spark, Apache YARN, Cloudera Manager
09-18-2017
02:52 AM
I thought I had tried this before, but it seems I didn't do it the right way. Now I tried it once again and it worked. Thank you so much, and sorry for the silly question.
09-11-2017
04:29 AM
Hello all, I'm trying to do a bulk load from a CSV file into a table in Impala. The table has the same fields as my CSV file, and I'm using the following command to load it:

LOAD DATA INPATH '/user/myuser/data/file.csv' INTO TABLE my_database.my_table;

The path is an HDFS path and my file uses \t as the separator. When I execute the statement, everything seems to be okay. After that I query for count(*) and I get exactly the same number of rows as there were lines in my file, but when I do a SELECT, all rows and fields are NULL. I read in the Cloudera documentation that "If a text file has fewer fields than the columns in the corresponding Impala table, all the corresponding columns are set to NULL when the data in that file is read by an Impala query." But since I have the same number of columns, I don't know what the problem is here. Does anybody have any ideas or possible solutions? Thank you so much in advance.
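A common cause of this exact symptom (correct row count but all columns NULL) is a delimiter mismatch: Impala text tables default to the \001 (Ctrl-A) field delimiter, so a tab-separated file parses as one unmatchable field per line. A minimal sketch of declaring the delimiter explicitly at table creation, with hypothetical columns (id, name) standing in for the real schema:

# recreate the target table with an explicit tab delimiter (columns are placeholders)
impala-shell -q "CREATE TABLE my_database.my_table (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE"

After that, the same LOAD DATA INPATH statement should populate the columns instead of NULLs.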
Labels: Apache Impala
09-01-2017
02:26 AM
Finally I tested your solution and it worked for me! I'm going to mark your answer as the solution. Thank you so much 😄 Jose.
08-30-2017
01:47 AM
Thank you so much as well @csguna, that looks interesting and may be a possible solution. When my current execution finishes, I will try this. I will come back and comment on my results!
08-30-2017
12:33 AM
Thanks for your reply @vanhalen, do you know where I could find more information about this? If needed, I can provide more details. For the moment, what I have tried (based on what I saw in another similar topic) is increasing Impala's heap size, and I'm running the ETL again. If this doesn't work I will try to increase Pentaho's memory settings, but I think the problem is in Impala rather than in Pentaho. Thank you so much once again.
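On the Pentaho side, the client JVM heap can be raised through an environment variable that the stock PDI launch scripts read; a minimal sketch, assuming the default spoon.sh/kitchen.sh launchers and that 4 GB is an acceptable ceiling:

# raise the PDI JVM heap before launching the job
export PENTAHO_DI_JAVA_OPTIONS="-Xms1g -Xmx4g"
./kitchen.sh -file=my_etl_job.kjb

Here my_etl_job.kjb is a hypothetical job file; the same variable applies when launching spoon.sh.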
08-29-2017
11:22 PM
Hello!! I'm doing an ETL process using Pentaho DI and loading data into my Cloudera Impala cluster. The point is that when I do the load with a large amount of data (a little bit more than 34K rows) I get a GC error. Previously I tried the load with fake data (10K rows) and it worked fine without any problem. The error I'm getting in Pentaho DI is the following:

2017/08/30 07:03:28 - Load to Impala Person.0 - at org.pentaho.di.trans.steps.tableoutput.TableOutput.writeToTable(TableOutput.java:385)
2017/08/30 07:03:28 - Load to Impala Person.0 - at org.pentaho.di.trans.steps.tableoutput.TableOutput.processRow(TableOutput.java:125)
2017/08/30 07:03:28 - Load to Impala Person.0 - at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
2017/08/30 07:03:28 - Load to Impala Person.0 - at java.lang.Thread.run(Thread.java:748)
2017/08/30 07:03:28 - Load to Impala Person.0 - Caused by: org.pentaho.di.core.exception.KettleDatabaseException:
2017/08/30 07:03:28 - Load to Impala Person.0 - Error inserting/updating row
2017/08/30 07:03:28 - Load to Impala Person.0 - OutOfMemoryError: GC overhead limit exceeded

Meanwhile, if I look at the impalad logs, I have an impalad.ERROR and an impalad.WARNING. The contents of both are the following:

impalad.WARNING:
W0829 18:06:21.070214 22994 DFSOutputStream.java:954] Caught exception
Java exception follows:
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:952)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:690)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:879)
E0830 07:03:29.317476 19425 client-request-state.cc:940] ERROR Finalizing DML: OutOfMemoryError: GC overhead limit exceeded

impalad.ERROR:
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0829 13:49:00.567322 3084 logging.cc:124] stderr will be logged to this file.
E0830 07:03:29.317476 19425 client-request-state.cc:940] ERROR Finalizing DML: OutOfMemoryError: GC overhead limit exceeded

So, does anybody have any idea what's happening, or has anyone seen a similar error? Help would be much appreciated. Thank you so much in advance. Jose.