Member since: 10-21-2018
Posts: 19
Kudos Received: 0
Solutions: 0
10-29-2020
11:30 PM
Hi, I am trying to write a DataFrame (the column names are very long, ~100 characters) to a Hive table using the statement below. I am using PySpark.

df.write.mode("overwrite").saveAsTable(TableName)

I am able to write the data to the Hive table when I pass the configuration explicitly while submitting the Spark job, like this:

spark-submit --master yarn --deploy-mode client --conf "spark.executor.memoryOverhead=10g" --executor-memory 4g --num-executors 4 --driver-memory 10g script.py

But my requirement is to keep all the configuration inside the Python script, as below, or to set it permanently in the config files of the installation directory. When I set the configuration inside the script, I get a Java heap space error. Please find the code snippet and error trace below.

spark = SparkSession\
    .builder\
    .master("yarn")\
    .config("spark.submit.deployMode", "client")\
    .config("spark.executor.instances", "4")\
    .config("spark.executor.memory", "5g")\
    .config("spark.driver.memory", "10g")\
    .config("spark.executor.memoryOverhead", "10g")\
    .appName("Application Name")\
    .enableHiveSupport().getOrCreate()

Traceback (most recent call last):
  File "/home/dev/pipeline/script.py", line 117, in <module>
    main()
  File "/home/dev/pipeline/script.py", line 57, in main
    flat_json_df.write.mode("overwrite").saveAsTable(TableName)
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 744, in saveAsTable
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o2390.saveAsTable.
: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.StringBuilder.toString(StringBuilder.java:407)
    at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:404)
    at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
    at com.fasterxml.jackson.databind.ObjectWriter.writeValueAsString(ObjectWriter.java:999)
    at org.json4s.jackson.JsonMethods$class.pretty(JsonMethods.scala:38)
    at org.json4s.jackson.JsonMethods$.pretty(JsonMethods.scala:50)
    at org.apache.spark.sql.catalyst.trees.TreeNode.prettyJson(TreeNode.scala:600)
    at com.microsoft.peregrine.spark.listeners.PlanLogListener.logPlans(PlanLogListener.java:68)
    at com.microsoft.peregrine.spark.listeners.PlanLogListener.onSuccess(PlanLogListener.java:57)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:124)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:145)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:143)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
    at org.apache.spark.sql.util.ExecutionListenerManager.org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling(QueryExecutionListener.scala:143)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply$mcV$sp(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager.readLock(QueryExecutionListener.scala:156)
    at org.apache.spark.sql.util.ExecutionListenerManager.onSuccess(QueryExecutionListener.scala:122)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:658)
    at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:458)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:437)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)

Just wondering if anyone has faced this same issue. If so, can you please help with how to set these configs inside the script, or do I need to update them directly in the config files of the installation directory?
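For what it's worth, the trace suggests the OOM happens on the driver, while a query-execution listener pretty-prints the query plan to JSON, and in yarn-client mode the driver JVM is already running by the time SparkSession.builder executes, so .config("spark.driver.memory", "10g") set inside the script comes too late to affect the driver heap. A minimal sketch of one way to keep everything in the script, assuming it is launched with a plain "python script.py" (spark-submit would override this), is to set PYSPARK_SUBMIT_ARGS before the session is created:

import os

# Must be set before the first SparkSession/SparkContext is created,
# because it controls how the driver JVM itself is launched.
# The trailing "pyspark-shell" token is required.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master yarn --deploy-mode client "
    "--driver-memory 10g --executor-memory 4g --num-executors 4 "
    "--conf spark.executor.memoryOverhead=10g "
    "pyspark-shell"
)

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Application Name")
         .enableHiveSupport()
         .getOrCreate())

Alternatively, setting spark.driver.memory in spark-defaults.conf under the installation's conf directory applies it without touching the script.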
Labels:
- Apache Hive
- Apache Spark
01-13-2019
11:59 PM
Thanks, Naresh, for your response. There is one more scenario, which I received today along with the above requirements. If an individual's name has only three words (I mean without a title), then the fullname should be split like this:

current state:
fullname
jkl! mno pqr

expected:
Last   Title   First   Middle
jkl            mno     pqr
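For reference, a hedged sketch of just this extra case, building on the query sketched under the original question further down this page. It assumes the no-title pattern is exactly "LAST! FIRST MIDDLE" (the exclamation mark directly after the first word), and the source table name names_src is hypothetical; the embedded string is plain HiveQL and can be run in Hive as-is.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

three_word_query = r"""
SELECT
  regexp_extract(fullname, '^(\\S+)!', 1)            AS last_name,
  ''                                                 AS title,
  regexp_extract(fullname, '^\\S+! (\\S+)', 1)       AS first_name,
  regexp_extract(fullname, '^\\S+! \\S+ (\\S+)$', 1) AS middle_name
FROM names_src
WHERE fullname RLIKE '^\\S+! \\S+ \\S+$'  -- only the no-title shape
"""
spark.sql(three_word_query).show()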
01-11-2019
02:36 PM
I have a table with a column named "fullname" which I would like to split into four columns (LAST_NAME, TITLE, FIRST_NAME, MIDDLE_NAME) while loading into another table.

If it is a person's name, the convention is LAST_NAME TITLE! FIRST_NAME MIDDLE_NAME. For example, I have "abc xxx! def ghi" in my table; this should be split and loaded into the four columns. The name is split into LAST, FIRST, and MID, and the word preceding the exclamation mark is the TITLE (like "xxx" here).

If it is an organisation name (names of organisations end with an exclamation mark), I should move the entire string to FIRST_NAME. For example, "abc systems!" should be loaded into FIRST_NAME.

current state:
fullname
abc xxx! def ghi
abc systems!

expected result:
Last   Title   First         Middle
abc    xxx     def           ghi
               abc systems

Can someone help with how to write a query for the above requirement? Thanks in advance!
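For reference, a minimal sketch of one way this could be expressed, under these assumptions: a fullname ending in "!" is always an organisation, a person row always has exactly the shape LAST TITLE! FIRST MIDDLE, and the source table name names_src is hypothetical. The embedded query is plain HiveQL, so it can also run in Hive directly.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

split_query = r"""
SELECT
  CASE WHEN fullname RLIKE '!$' THEN ''
       ELSE regexp_extract(fullname, '^(\\S+) \\S+!', 1) END   AS last_name,
  CASE WHEN fullname RLIKE '!$' THEN ''
       ELSE regexp_extract(fullname, '^\\S+ (\\S+)!', 1) END   AS title,
  CASE WHEN fullname RLIKE '!$' THEN regexp_replace(fullname, '!$', '')
       ELSE regexp_extract(fullname, '! (\\S+)', 1) END        AS first_name,
  CASE WHEN fullname RLIKE '!$' THEN ''
       ELSE regexp_extract(fullname, '! \\S+ (\\S+)$', 1) END  AS middle_name
FROM names_src
"""
spark.sql(split_query).show()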
Labels:
- Apache Hadoop
- Apache Hive
11-05-2018
05:41 PM
@Harsh J is there a way to avoid giving the column names manually? I have 150 columns per table and more than 200 tables, which is a huge number to type by hand.
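For reference, a minimal sketch of one way to avoid typing the names, assuming each flat file's first line is its quoted, comma-separated header and the values are wrapped in "|" and separated by "," (as in the sample further down this page); the table list and paths are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

tables = ["emp", "dept"]  # hypothetical; ~200 entries in practice

for t in tables:
    lines = spark.sparkContext.textFile("hdfs:///tmp/input/%s.txt" % t)
    header = lines.first()
    # Derive the column names from the header line instead of typing them.
    columns = [c.strip().strip('"') for c in header.split(",")]
    rows = (lines.filter(lambda l: l != header)
                 .map(lambda l: l.strip().strip("|").split("|,|")))
    (spark.createDataFrame(rows, columns)
          .write.format("com.databricks.spark.avro")
          .save("hdfs:///tmp/output/%s_avro" % t))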
11-04-2018
07:54 PM
@Harsh thanks for providing the solution. I am able to generate Avro files, but for a 10 KB flat file I am getting two Avro files (part1 and part2). All my flat files are more than 50 MB, so I will end up with a large number of .avro files, which is difficult to maintain. Is there a way to generate one Avro file even if the flat file is large?
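The part-file count follows the number of partitions in the DataFrame being written, so one option is to collapse to a single partition before writing. A minimal sketch with a stand-in DataFrame and a hypothetical output path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the DataFrame built from the flat file elsewhere in this thread.
df = spark.createDataFrame([("ABC", "123456")], ["EMP-CO", "EMP-ID"])

# coalesce(1) collapses the data into one partition, so the writer emits a
# single part file. Reasonable for ~50 MB inputs, but it pushes the whole
# write through one task, so it is a trade-off rather than a general fix.
(df.coalesce(1)
   .write.format("com.databricks.spark.avro")
   .save("hdfs:///tmp/output/emp_avro"))  # hypothetical path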
10-24-2018
08:26 PM
Yes, Spark is configured to use HDFS as the default FS. Could you please let me know where exactly I should keep the input file? I have tried many ways but did not succeed. I am new to Hadoop.
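One common trip-up worth spelling out: with HDFS as the default filesystem, a bare path like /tmp/input.txt refers to HDFS's /tmp, not the local directory under root, so the file has to be uploaded to HDFS first (e.g. with hdfs dfs -put). A minimal sketch making the two namespaces explicit (file name hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A bare "/tmp/input.txt" resolves against fs.defaultFS, i.e. HDFS here.
rdd_hdfs = spark.sparkContext.textFile("hdfs:///tmp/input.txt")

# An explicit scheme is needed when the file lives on the local disk.
rdd_local = spark.sparkContext.textFile("file:///tmp/input.txt")

print(rdd_hdfs.count(), rdd_local.count())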
10-24-2018
07:42 PM
Hi Harsha, can you please tell me the /tmp location (are you referring to the tmp folder under root, or a different one)? I have given it in the same way, but I am getting the error below.

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
10-22-2018
08:24 PM
I am getting the error below after trying:

scala> df.write.format("com.databricks.spark.avro").save("C:/Users/madan/Downloads/Avro/out/")
java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/FileFormat
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at scala.util.Try.orElse(Try.scala:82)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:219)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
    at $iwC$$iwC$$iwC.<init>(<console>:44)
    at $iwC$$iwC.<init>(<console>:46)
    at $iwC.<init>(<console>:48)
    at <init>(<console>:50)
    at .<init>(<console>:54)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.FileFormat
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 68 more
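A note on this trace: the ResolvedDataSource frames belong to Spark 1.x, while the missing class org.apache.spark.sql.execution.datasources.FileFormat is what newer spark-avro builds compile against, so the NoClassDefFoundError most likely means the spark-avro jar was built for a different Spark version than the shell running it. A hedged sketch of the usual remedy, not a verified fix; the coordinates are the published com.databricks ones and should be double-checked against the cluster's Spark/Scala version:

# Start the shell with a spark-avro artifact matching the Spark version, e.g.:
#   Spark 1.x / Scala 2.10:  pyspark --packages com.databricks:spark-avro_2.10:2.0.1
#   Spark 2.x / Scala 2.11:  pyspark --packages com.databricks:spark-avro_2.11:4.0.0
# With a matching package on the classpath, the same write should go through.
# (df here stands for the DataFrame from the failing scala> line above.)
df.write.format("com.databricks.spark.avro").save("C:/Users/madan/Downloads/Avro/out/")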
10-22-2018
07:48 PM
Thanks for the reply. Can you please provide the steps in detail?
10-21-2018
07:48 PM
I have a flat file with column names separated by commas (,) and column values wrapped in pipes (|) and separated by commas (,). Can someone help with how I can convert this file to Avro format?

For example:
"EMP-CO","EMP-ID"   - column names
|ABC|,|123456|      - values

Thanks in advance.
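For reference, a minimal end-to-end sketch of one way to do this in PySpark, under these assumptions: the first line of the file is the quoted header, every value row has the |...|,|...| shape shown above, the spark-avro package is on the classpath, and all paths are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///tmp/input/emp.txt")  # hypothetical path
header = lines.first()

# "EMP-CO","EMP-ID"  ->  ["EMP-CO", "EMP-ID"]
columns = [c.strip().strip('"') for c in header.split(",")]

# |ABC|,|123456|  ->  ["ABC", "123456"]
rows = (lines.filter(lambda l: l != header)
             .map(lambda l: l.strip().strip("|").split("|,|")))

df = spark.createDataFrame(rows, columns)
df.write.format("com.databricks.spark.avro").save("hdfs:///tmp/output/emp_avro")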
Labels:
- Apache Hive
- HDFS