Member since: 04-30-2017
Posts: 17
Kudos Received: 0
Solutions: 0
03-25-2019
12:07 PM
Hello, I have a scenario: while processing a delimited file, a newline (\n) character is present as part of a field value, which splits a single row into 2 rows. Is there any way to handle this in Spark? Note: field values are NOT quoted/enclosed.
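One possible workaround, as a sketch rather than a definitive fix: since the fields are unquoted, the CSV reader itself cannot tell a real row boundary from an embedded newline, but if every complete record has a fixed number of fields, the raw text lines can be stitched back together by counting delimiters before parsing. The path, the expectedCols value and the choice to rejoin the broken pieces with a space below are all assumptions for illustration.

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stitch-broken-rows").getOrCreate()

val expectedCols = 5   // assumed number of fields per record
val delim = '|'

// Glue a line onto the previous one whenever the previous one is still short of
// the expected number of delimiters. coalesce(1) keeps the stitching simple;
// large inputs would need a per-partition or custom-InputFormat variant.
val stitched = spark.sparkContext
  .textFile("/data/input.txt")
  .coalesce(1)
  .mapPartitions { lines =>
    val out = scala.collection.mutable.ArrayBuffer.empty[String]
    lines.foreach { line =>
      if (out.nonEmpty && out.last.count(_ == delim) < expectedCols - 1)
        out(out.length - 1) = out.last + " " + line   // continuation of a broken record
      else
        out += line                                   // a new record
    }
    out.iterator
  }

// Spark 2.2+ can parse an already-loaded Dataset[String] as CSV.
val df = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv(spark.createDataset(stitched)(Encoders.STRING))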
01-07-2019
10:06 AM
Thanks dbompart, and sorry for the late reply. I have tried different options (number of cores, number of executors, executor memory, overhead memory), but I still hit the same issue. When I try to repartition before the action, it takes more time and the shuffle read/write grows to around 50 GB (actual data size 8.9 GB). Will keep trying... --- Thanks
12-28-2018
10:05 AM
Hi, I have all my scripts (Hive, Pig, spark-shell, etc.) stored in an S3 bucket. Is there any way to trigger them from an Oozie workflow without copying them to HDFS? --- Thanks, Mani
12-27-2018
09:35 AM
Hi dbompart, thank you for your help, but that option didn't solve it. When I ran the job with executor-memory=10g, it failed with the same error, only the reported size changed (11.8 GB of 10 GB physical memory used):
spark-submit --master yarn --deploy-mode client --driver-memory 5g --executor-memory 10g --class myclass myjar.jar param1 param1 param3 param4 param5
So I tried with 15 GB of executor memory:
spark-submit --master yarn --deploy-mode client --driver-memory 5g --executor-memory 15g --class myclass myjar.jar param1 param1 param3 param4 param5
But the tasks take a long time (the count took 1.2 hours, whereas with less than 10 GB of executor memory it took 11 minutes). Because of this, the task failed with the error below:
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Slave lost
12-25-2018
02:16 AM
Hi,
spark-submit --master yarn --deploy-mode client --driver-memory 5g --executor-memory 6g --conf "spark.yarn.executor.memoryOverhead=10g" --class myclass myjar.jar param1 param1 param3 param4 param5
12-24-2018
06:08 AM
Hi dbompart, thanks for your suggestion. I tried the Spark job with spark.yarn.executor.memoryOverhead=10g, but it still fails with the same issue:
ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.4 GB of 6.6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
I also tried the workaround and increased the partition count from 66 (which it was before repartitioning the DF) to 200 using repartition. It still doesn't work, and it takes more time than the 66-partition run because of the shuffle. Could you please help here?
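For reference, a minimal sketch of setting the same knobs programmatically rather than on the spark-submit line; the values are illustrative only, not a recommendation for this job (in Spark 2.x a bare number for spark.yarn.executor.memoryOverhead is interpreted in MiB):

import org.apache.spark.sql.SparkSession

// Illustrative values, assumed for the sketch; tune against the actual container sizes.
val spark = SparkSession.builder()
  .appName("overhead-tuning")
  .config("spark.yarn.executor.memoryOverhead", "2048")   // off-heap headroom per executor, in MiB
  .config("spark.sql.shuffle.partitions", "200")          // partitions used by Spark SQL shuffles
  .getOrCreate()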
12-23-2018
04:52 PM
Hi, yes, count is the first action. But the next step is a union of both hivedf and filedf followed by a count, and that step takes about the same time (25 s + 12 min, roughly 12.25 minutes). Could you please give some suggestions? Thanks
12-22-2018
10:03 AM
Hi, I am using a SparkSession to read both the file and the Hive data. Let's say my SparkSession variable is "spark", with Hive support enabled. Reading the Hive table and the compressed files:
val hivedf = spark.table("table name")
val filedf = spark.read.format("csv").option("header", "true").option("delimiter", "|").load(filepath)
hivedf.count()
filedf.count()
12-21-2018
06:43 AM
Hi all, I am getting the count of 2 dataframes in Spark 2.2 using a SparkSession. The 1st dataframe reads data from a Hive table whose size is 5.2 GB; its count returns in 25 s. The 2nd dataframe reads data from a set of compressed files whose total size is 3.8 GB; its count takes 12-13 minutes. I even tried repartitioning the 2nd dataframe, but that reduces performance since it shuffles. Can anyone please explain the difference? Am I missing anything here? --- Thanks
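One thing worth checking, as a guess rather than a diagnosis: if the compressed files use a non-splittable codec such as gzip, each file becomes a single partition and only a handful of tasks do all the decompression, which alone can explain the gap. The dataframe names below are the ones from the snippet above.

// Compare how many tasks each count actually gets to use.
println(s"hive table partitions: ${hivedf.rdd.getNumPartitions}")
println(s"file-based partitions: ${filedf.rdd.getNumPartitions}")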
12-20-2018
11:16 AM
Hi, I am running my Spark job on an EMR cluster with executor memory 6g, driver memory 5g, and memoryOverhead 1g. My task fails with the error below while writing into HDFS using a SparkSession; I am storing the file in ORC format with snappy compression. Error:
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 11.0 failed 4 times, most recent failure: Lost task 42.3 in stage 11.0 (TID 3170, "server_IP", executor 23): ExecutorLostFailure (executor 23 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.2 GB of 6.6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Could you please give some suggestions? Thanks,
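For context, a minimal sketch of the write described above (ORC with snappy); the dataframe name and the output path are placeholders, not taken from the job:

// Write the dataframe as snappy-compressed ORC.
df.write
  .format("orc")
  .option("compression", "snappy")
  .mode("overwrite")
  .save("hdfs:///user/output/table_orc")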
07-20-2018
03:28 PM
Hi all, I have 2 RDDs or dataframes, each with 10 columns. Out of the 10 columns, 2 are common, and I want to fetch those 2 common columns, but I don't know the common column names. Can someone please help with this? -- Thanks
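One possible approach, sketched with hypothetical names df1 and df2 for the two dataframes: intersect the column-name arrays at runtime and select those columns from each side.

import org.apache.spark.sql.functions.col

// Column names present in both dataframes, discovered at runtime.
val commonCols = df1.columns.intersect(df2.columns)

// Project each dataframe down to just the shared columns.
val df1Common = df1.select(commonCols.map(col): _*)
val df2Common = df2.select(commonCols.map(col): _*)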
07-19-2018
05:54 PM
Hi albani, thanks for sharing the links; I had found those threads earlier. They are static (defined for 3 elements), whereas I need something more dynamic. E.g., today I may receive 3 elements, tomorrow maybe 10; the code should handle that dynamically.
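A sketch of one dynamic approach, assuming column B in the question below holds an array and df is the input dataframe: first find the longest array, then generate that many columns.

import org.apache.spark.sql.functions.{col, max, size}

// The longest array in column B decides how many output columns to create.
val maxLen = df.agg(max(size(col("B")))).head.getInt(0)

// Add col1..colN dynamically; rows with shorter arrays get null in the extra columns.
val widened = (0 until maxLen).foldLeft(df) { (acc, i) =>
  acc.withColumn(s"col${i + 1}", col("B")(i))
}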
07-19-2018
05:26 PM
Hi all, can someone please tell me how to split an array into separate columns in a Spark dataframe. Example:
Df:
A|B
-------
1|(a,b,c,d)
2|(e,f)
Output:
A|col1|col2|col3|col4
------------
1|a|b|c|d
2|e|f| |
Thanks, Manivel K
07-18-2018
05:05 AM
Hi all, I am new to Spark, so please correct me if I'm wrong. In the above example, "nums" is an Array variable and "rdd" is an RDD. Since Spark does lazy evaluation, it just builds the lineage and only reads from the source when an action is performed. So we are making changes to the Array variable, not to the RDD; when we try to reassign the RDD itself, it throws an error:
scala> rdd=sc.parallelize(Array(1,2))
<console>:28: error: reassignment to val
rdd=sc.parallelize(Array(1,2))
^
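A small illustration of the same point, as a hypothetical snippet: a name bound with val cannot be rebound, but a var can be, while the RDDs themselves stay immutable either way.

// Rebinding works only because rdd is declared as a var here.
var rdd = sc.parallelize(Array(1, 2, 3))
rdd = sc.parallelize(Array(1, 2))   // compiles: the reference changes, each RDD is still immutable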
07-18-2018
04:48 AM
Hi, thank you, it helps a lot... --- Thanks, Mani
07-14-2018
06:33 PM
I have more than 2k tables in an RDBMS. Is it possible to import all of them with a single sqoop import-all-tables command? Also, will there be any performance issue? Thanks, Mani
09-07-2017
05:11 PM
I couldn't find the practice exam AMI in AWS under the Asia Pacific (Mumbai) region. When I created an instance under the Singapore region, the AMI had Hadoop exam questions; there were no Spark-related questions. Is there a separate AMI for Spark? Please advise. Thanks