Created on 08-29-2017 11:22 PM - edited 09-16-2022 05:10 AM
Hello!!
I'm doing an ETL process using Pentaho DI and loading data into my Cloudera Impala cluster. The problem is that when I do the load with a larger amount of data (a little over 34K rows) I get a GC error. Previously I tried the load with fake data (10K rows) and it worked fine without any problem. The error I'm getting in Pentaho DI is the following:
2017/08/30 07:03:28 - Load to Impala Person.0 - at org.pentaho.di.trans.steps.tableoutput.TableOutput.writeToTable(TableOutput.java:385)
2017/08/30 07:03:28 - Load to Impala Person.0 - at org.pentaho.di.trans.steps.tableoutput.TableOutput.processRow(TableOutput.java:125)
2017/08/30 07:03:28 - Load to Impala Person.0 - at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
2017/08/30 07:03:28 - Load to Impala Person.0 - at java.lang.Thread.run(Thread.java:748)
2017/08/30 07:03:28 - Load to Impala Person.0 - Caused by: org.pentaho.di.core.exception.KettleDatabaseException:
2017/08/30 07:03:28 - Load to Impala Person.0 - Error inserting/updating row
2017/08/30 07:03:28 - Load to Impala Person.0 - OutOfMemoryError: GC overhead limit exceeded
Meanwhile, if I look at the impalad logs, I have an impalad.ERROR and an impalad.WARNING. The content of both is the following:
impalad.WARNING
W0829 18:06:21.070214 22994 DFSOutputStream.java:954] Caught exception
Java exception follows:
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1252)
    at java.lang.Thread.join(Thread.java:1326)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:952)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:690)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:879)
E0830 07:03:29.317476 19425 client-request-state.cc:940] ERROR Finalizing DML: OutOfMemoryError: GC overhead limit exceeded
impalad.ERROR
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0829 13:49:00.567322 3084 logging.cc:124] stderr will be logged to this file.
E0830 07:03:29.317476 19425 client-request-state.cc:940] ERROR Finalizing DML: OutOfMemoryError: GC overhead limit exceeded
So, does anybody have any idea of what's happening, or has anybody seen a similar error? Help would be much appreciated.
Thank you so much in advance.
Jose.
Created 08-30-2017 12:28 AM
Hard to tell based on the information you provided, but see if you can increase Pentaho's memory settings (edit spoon.bat).
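The exact line and defaults vary by PDI version, so treat this as a rough sketch rather than the real contents of your file, but the memory setting you would edit in spoon.bat usually looks something like this:

REM In spoon.bat: this line only applies when PENTAHO_DI_JAVA_OPTIONS is not already set.
REM Raising -Xmx gives the Table Output step more heap to buffer rows (values below are examples only).
if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms1024m" "-Xmx4096m"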
If that doesn't work, check Impala's catalogd memory setting.
Hope this helps.
Created 08-30-2017 12:33 AM
Thanks for your reply @vanhalen, do you know where I could find more information about this? If you need it, I can provide more information.
For the moment, what I have tried (based on what I saw in another similar topic) is increasing Impala's heap size, and I'm trying to execute the ETL again. If this doesn't work I will try to increase Pentaho's memory settings, but I think the problem is in Impala rather than in Pentaho.
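For anyone checking the same thing, here is a rough way to confirm on an Impala node which heap the embedded JVM inside impalad actually got (the variable names and paths here are assumptions and may differ per CDH version):

# Find the impalad process and look for JVM heap options in its environment.
pid=$(pgrep -f impalad | head -n 1)
tr '\0' '\n' < /proc/"$pid"/environ | grep -iE 'JAVA_TOOL_OPTIONS|JAVA_OPTS|HEAP'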
Thank you so much once again.
Created 08-30-2017 01:39 AM
I'm not an expert in this area, but I know that you can change them in the startup script.
Could you refer to the link below:
PENTAHO_DI_JAVA_OPTIONS
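In case it helps, a minimal sketch of that approach (the heap values are just examples, adjust them to your machine):

# Linux / macOS: export the variable before launching Spoon so the PDI JVM
# starts with a larger heap than the spoon.sh default.
export PENTAHO_DI_JAVA_OPTIONS="-Xms1024m -Xmx4096m"
./spoon.sh

REM Windows: spoon.bat reads the same variable.
set PENTAHO_DI_JAVA_OPTIONS=-Xms1024m -Xmx4096m
spoon.bat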
Created 08-30-2017 01:47 AM
Thank you so much as well @csguna, that looks interesting and may be a possible solution. When my current execution finishes, I will try this.
I will come back and report my results!!
Created 08-30-2017 01:54 AM
@josholsan Sure thing 🙂
Created 09-01-2017 02:26 AM
I finally tested your solution and it worked for me!
I'm going to mark your answer as solution.
Thank you so much 😄
Jose.