Member since: 10-21-2018
Posts: 19
Kudos Received: 0
Solutions: 0
10-29-2020
11:30 PM
Hi, I am trying to write a DataFrame (the column names are very long, ~100 characters) to a Hive table using the statement below. I am using PySpark.

df.write.mode("overwrite").saveAsTable(TableName)

I am able to write the data to the Hive table when I pass the configuration explicitly while submitting the Spark job, as below:

spark-submit --master yarn --deploy-mode client --conf "spark.executor.memoryOverhead=10g" --executor-memory 4g --num-executors 4 --driver-memory 10g script.py

However, my requirement is to keep all the configuration inside the Python script, as below, or to set it permanently in the config files of the installation directory. When I set the configuration inside the script I get a Java heap space error. Please find the code snippet and error trace below.

spark = SparkSession\
    .builder\
    .master("yarn")\
    .config("spark.submit.deployMode", "client")\
    .config("spark.executor.instances", "4")\
    .config("spark.executor.memory", "5g")\
    .config("spark.driver.memory", "10g")\
    .config("spark.executor.memoryOverhead", "10g")\
    .appName("Application Name")\
    .enableHiveSupport().getOrCreate()

Traceback (most recent call last):
  File "/home/dev/pipeline/script.py", line 117, in <module>
    main()
  File "/home/dev/pipeline/script.py", line 57, in main
    flat_json_df.write.mode("overwrite").saveAsTable(TableName)
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 744, in saveAsTable
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o2390.saveAsTable.
: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.StringBuilder.toString(StringBuilder.java:407)
    at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:404)
    at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
    at com.fasterxml.jackson.databind.ObjectWriter.writeValueAsString(ObjectWriter.java:999)
    at org.json4s.jackson.JsonMethods$class.pretty(JsonMethods.scala:38)
    at org.json4s.jackson.JsonMethods$.pretty(JsonMethods.scala:50)
    at org.apache.spark.sql.catalyst.trees.TreeNode.prettyJson(TreeNode.scala:600)
    at com.microsoft.peregrine.spark.listeners.PlanLogListener.logPlans(PlanLogListener.java:68)
    at com.microsoft.peregrine.spark.listeners.PlanLogListener.onSuccess(PlanLogListener.java:57)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:124)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:145)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:143)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
    at org.apache.spark.sql.util.ExecutionListenerManager.org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling(QueryExecutionListener.scala:143)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply$mcV$sp(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager.readLock(QueryExecutionListener.scala:156)
    at org.apache.spark.sql.util.ExecutionListenerManager.onSuccess(QueryExecutionListener.scala:122)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:658)
    at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:458)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:437)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)

Just wondering if anyone has faced this same issue. If so, can you please help with how to set these configs inside the script, or do I need to update them directly in the config files of the installation directory?
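A hedged side note and sketch (assuming the script is started with plain python rather than spark-submit, and Spark 2.x-era PySpark behaviour): in yarn-client mode the driver JVM is already running by the time the SparkSession.builder configs are applied, so a value like spark.driver.memory set through the builder may not actually enlarge the driver heap. Settings that must reach the JVM before it starts are usually placed in spark-defaults.conf or, for PySpark specifically, passed through the PYSPARK_SUBMIT_ARGS environment variable before the session is created:

    import os
    from pyspark.sql import SparkSession

    # Rough sketch: these values mirror the working spark-submit command above.
    # The variable is only consulted when PySpark itself launches the JVM
    # (i.e. the script is run with plain `python script.py`), and it must end
    # with "pyspark-shell".
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--master yarn --deploy-mode client "
        "--driver-memory 10g --executor-memory 4g --num-executors 4 "
        "--conf spark.executor.memoryOverhead=10g "
        "pyspark-shell"
    )

    spark = (
        SparkSession.builder
        .appName("Application Name")
        .enableHiveSupport()
        .getOrCreate()
    )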
Labels:
- Apache Hive
- Apache Spark
01-13-2019
11:59 PM
Thanks, Naresh, for your response. There is one more scenario, which I received today along with the above requirements: if the individual name has only three words (i.e. without a title), then the fullname should be split as below.

current state:
fullname
jkl! mno pqr

expected:
Last    Title    First    Middle
jkl              mno      pqr
01-11-2019
02:36 PM
I have a table with a column named "fullname" which I would like to split into four columns (LAST_NAME, TITLE, FIRST_NAME, MIDDLE_NAME) while loading into another table. If it is a person's name, the convention is LAST_NAME TITLE! FIRST_NAME MIDDLE_NAME.

For example, I have "abc xxx! def ghi" in my table. This should be split and loaded into 4 different columns: the name is split into LAST, FIRST, and MIDDLE, and the word preceding the exclamation mark is the TITLE (xxx here). If it is an organisation name (names of organisations end with an exclamation mark), I should move the entire string to FIRST_NAME.

For example, "abc systems!" should be loaded into FIRST_NAME.

current state:
fullname
abc xxx! def ghi
abc systems!

expected result:
Last    Title    First          Middle
abc     xxx      def            ghi
                 abc systems

Can someone help with how to write a query for the above requirement? Thanks in advance!
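A tentative sketch, not from the thread, of one way to express this split; it uses the PySpark DataFrame API rather than a HiveQL query, the table names are hypothetical, and the regular expression assumes exactly the four-word person convention described above (the three-word variant from the reply above would need an additional pattern):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    src = spark.table("names_source")  # hypothetical table holding the fullname column

    # Organisation names: the only "!" is at the very end of the string.
    is_org = F.col("fullname").rlike("^[^!]+!$")

    # Person names: LAST TITLE! FIRST MIDDLE.
    person = r"^(\S+)\s+(\S+)!\s+(\S+)\s+(\S+)$"

    result = src.select(
        F.when(is_org, F.lit(None)).otherwise(F.regexp_extract("fullname", person, 1)).alias("LAST_NAME"),
        F.when(is_org, F.lit(None)).otherwise(F.regexp_extract("fullname", person, 2)).alias("TITLE"),
        F.when(is_org, F.regexp_replace("fullname", "!$", ""))
         .otherwise(F.regexp_extract("fullname", person, 3)).alias("FIRST_NAME"),
        F.when(is_org, F.lit(None)).otherwise(F.regexp_extract("fullname", person, 4)).alias("MIDDLE_NAME"),
    )

    result.write.mode("overwrite").saveAsTable("names_target")  # hypothetical target table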
Labels:
- Apache Hadoop
- Apache Hive
11-29-2018
02:19 PM
I have a file with the data below in 4 columns; of these, the 1st column (TEXT1) is the only column that has data, which is separated by commas (,), and the remaining columns have no values/data.

"TEXT1","TEXT2","FILLER1","FILLER2"
|abcdef,ghijkl,1234,5678|,||,||,||

Now, when I try to upload the table using the Ambari Hive View, I see the following in the preview table: the data of column 1 is split and moved into the other three columns, because I am using comma (,) as the field delimiter.

TEXT1      TEXT2      FILLER1    FILLER2
abcdef     ghijkl     1234       5678

Instead, I would like to upload the table like this:

TEXT1                      TEXT2    FILLER1    FILLER2
abcdef,ghijkl,1234,5678

Can someone suggest how I can achieve this? I have attached a screenshot of the settings (field delimiter, escape character, quote character, "Is first row header").
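Not an answer from the thread, just an illustrative sketch of the underlying idea: because the values are wrapped in pipe characters, treating "|" as the quote character keeps the commas inside the first field together. The same idea expressed as Hive DDL over the raw file, run here through spark.sql so the example stays in Python; the table name and location are hypothetical, and header skipping via skip.header.line.count can depend on the engine version:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # OpenCSVSerde splits on "," but keeps anything wrapped in the quoteChar ("|")
    # as a single field, so |abcdef,ghijkl,1234,5678| stays in TEXT1.
    spark.sql("""
        CREATE EXTERNAL TABLE pipe_quoted_upload (
            TEXT1 STRING, TEXT2 STRING, FILLER1 STRING, FILLER2 STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
        WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '|')
        STORED AS TEXTFILE
        LOCATION '/user/dev/pipe_quoted_upload'
        TBLPROPERTIES ('skip.header.line.count' = '1')
    """)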
Labels:
- Apache Hive
11-28-2018
08:50 PM
I have a file with the data below in 4 columns; of these, the 1st column (TEXT1) is the only column that has data, which is separated by commas (,), and the remaining columns have no values/data.

"TEXT1","TEXT2","FILLER1","FILLER2"
|abcdef,ghijkl,1234,5678|,||,||,||

Now, when I try to upload the table using the Ambari Hive View, I see the following in the preview table: the data of column 1 is split and moved into the other three columns, because I am using comma (,) as the field delimiter.

TEXT1      TEXT2      FILLER1    FILLER2
abcdef     ghijkl     1234       5678

Instead, I would like to upload the table like this:

TEXT1                      TEXT2    FILLER1    FILLER2
abcdef,ghijkl,1234,5678

Can someone suggest how I can achieve this? I have attached a screenshot of the settings (field delimiter, escape character, quote character, "Is first row header").
Labels:
- Apache Hive
- HDFS
11-05-2018
05:41 PM
@Harsh J is there a way to avoid giving the column names manually... because I have 150 columns per table and more than 200 tables, which is a huge number.
11-04-2018
07:54 PM
@Harsh thanks for providing the solution. I am able to generate Avro files, but for a 10 KB flat file I am getting two Avro files (part 1 and part 2). All of my flat files are larger than 50 MB, so in that case I will get a large number of .avro files, which is difficult to maintain. Is there a way to generate a single Avro file even if the flat file is large?
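In case the Avro files are being produced from Spark (the tool used in the earlier step is not shown here), a rough sketch of the usual way to force a single output file is to collapse the data to one partition before writing; note that this removes write parallelism, and the "com.databricks.spark.avro" format name assumes the Databricks spark-avro package used elsewhere in this thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    df = spark.table("some_source_table")  # hypothetical source of the data being exported

    # coalesce(1) funnels the whole write through a single task, so one part file is produced.
    df.coalesce(1).write.format("com.databricks.spark.avro").save("/tmp/single_avro_output")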
11-04-2018
06:46 PM
I tried using the command below, but I am getting an error:

hive> LOAD DATA LOCAL INPATH 'hdfs location of .avro' OVERWRITE INTO TABLE hive_from_avro_schema;
FAILED: SemanticException [Error 10028]: Line 1:23 Path is not legal ''hdfs location of .avro'': Source file system should be "file" if "local" is specified
hive>
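A hedged note and sketch, not from the thread: the SemanticException above is raised because LOCAL restricts the source path to the local filesystem, so for data that already sits in HDFS the statement is normally issued without the LOCAL keyword. Shown here through spark.sql to keep the example in Python; the HDFS path is a placeholder, and the equivalent statement can be run directly in the Hive shell:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Without LOCAL, INPATH is interpreted against the default (HDFS) filesystem.
    spark.sql("""
        LOAD DATA INPATH '/user/dev/avro/part-00000.avro'
        OVERWRITE INTO TABLE hive_from_avro_schema
    """)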
11-04-2018
06:31 PM
I have created a Hive external table using a .avsc file. Now I would like to load the data from an Avro file into the table. Can someone suggest how to load the data?

CREATE EXTERNAL TABLE data_hive
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs location to save the table'
TBLPROPERTIES ('avro.schema.url'='<avro schema location>');
Labels:
- Apache Hive
11-01-2018
09:50 PM
I got the solution... we need to provide the fsUri option after "hdfs" (in the stream definition) to access the HDFS location.
10-31-2018
03:56 PM
I am trying to stream my data from a file to HDFS by using Spring XD:

stream create --name fieltest1 --definition "file --dir=/home/hdfs/spring-xd-1.3.2.RELEASE/tmp/xd/input/filetest --outputType=text/plain | hdfs" --deploy

but I am getting the error below. Can someone suggest how I can achieve this?

org.springframework.messaging.MessageHandlingException: failed to write Message payload to HDFS; nested exception is java.io.IOException: failure to login
    at org.springframework.xd.integration.hadoop.outbound.HdfsStoreMessageHandler.handleMessageInternal(HdfsStoreMessageHandler.java:129)
Labels:
- HDFS
10-24-2018
08:26 PM
Yes, Spark is configured to use HDFS as the default FS. Could you please let me know where exactly I should keep the input file? I tried many ways but did not succeed. I am new to Hadoop.
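An illustrative sketch, not from the thread, of how path resolution works once HDFS is the default FS (the file name and location below are hypothetical, and the file is assumed to have been uploaded first, e.g. with hdfs dfs -put input.txt /tmp/input.txt): an unqualified path refers to HDFS, not the local disk, and the scheme can be made explicit to avoid the ambiguity.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df_hdfs = spark.read.text("/tmp/input.txt")           # resolves to hdfs:///tmp/input.txt
    df_hdfs2 = spark.read.text("hdfs:///tmp/input.txt")   # same location, explicit scheme
    df_local = spark.read.text("file:///tmp/input.txt")   # the local /tmp directory instead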
10-24-2018
07:42 PM
Hi Harsha, can you please tell me which /tmp location you mean (are you referring to the tmp folder under root or a different one)? I have given it in the same way, but I am getting the error below.

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
10-22-2018
08:24 PM
I am getting the error below after trying:

scala> df.write.format("com.databricks.spark.avro").save("C:/Users/madan/Downloads/Avro/out/")
java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/FileFormat
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at scala.util.Try.orElse(Try.scala:82)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:219)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
    at $iwC$$iwC$$iwC.<init>(<console>:44)
    at $iwC$$iwC.<init>(<console>:46)
    at $iwC.<init>(<console>:48)
    at <init>(<console>:50)
    at .<init>(<console>:54)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.FileFormat
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 68 more

scala>
10-22-2018
07:48 PM
Thanks for the reply. Can you please provide the steps in detail?
10-22-2018
06:56 PM
I have uploaded a flat file as a table using Hive (Ambari) and selected Avro in the "Stored as" field. Will doing this generate .avsc or .avro format files? If so, what is their location?
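A small sketch, not from the thread, of one way to check what the Ambari-created table actually looks like (the table name is hypothetical): DESCRIBE FORMATTED reports the table's Location, SerDe library and InputFormat, which shows whether the underlying files are stored as Avro and where they live.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Look for the "Location", "SerDe Library" and "InputFormat" rows in the output.
    spark.sql("DESCRIBE FORMATTED my_uploaded_table").show(100, truncate=False)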
Labels:
- Apache Hive
- Apache Spark
- HDFS
10-21-2018
07:48 PM
I have a flat file with column names separated by commas (,) and column values wrapped in pipes (|) and separated by commas (,). Can someone help with how I can convert this file to Avro format?

For example:
"EMP-CO","EMP-ID" - column names
|ABC|,|123456| - values

Thanks in advance.
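A hedged sketch, not from the thread, of one way to do the conversion in PySpark: read the file as CSV with comma as the separator and the pipe character as the quote character, then write the result out as Avro. The file paths are illustrative, the reader options are Spark 2.x style, and the "com.databricks.spark.avro" format assumes the Databricks spark-avro package that another post in this thread also uses:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # quote="|" strips the pipes around the values and keeps any commas inside them intact.
    df = (spark.read
          .option("header", "true")
          .option("sep", ",")
          .option("quote", "|")
          .csv("/tmp/input_flat_file.txt"))

    # The header row is double-quoted while the data rows are pipe-quoted, so the
    # literal double quotes are stripped from the column names here.
    for old in df.columns:
        df = df.withColumnRenamed(old, old.strip('"'))

    df.write.format("com.databricks.spark.avro").save("/tmp/output_avro")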
Labels:
- Apache Hive
- HDFS