Member since: 03-26-2017
Posts: 61
Kudos Received: 1
Solutions: 3

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1954 | 08-27-2018 03:19 PM
 | 17550 | 08-27-2018 03:18 PM
 | 6652 | 04-02-2018 01:54 PM
05-20-2019
01:00 PM
Hi All, I'm trying to read a config file with spark.read.textFile; the file basically contains my list of tables. My task is to iterate through that table list and convert each table from Avro to ORC format. Please find below the code snippet that performs this logic.

val tableList = spark.read.textFile("tables.txt")
tableList.collect().foreach(tblName => {
  val df = spark.read.format("avro").load(inputPath + "/" + tblName)
  df.write.format("orc").mode("overwrite").save(outputPath + "/" + tblName)
})

Please find my configuration below: DriverMemory: 4GB, ExecutorMemory: 10GB, NoOfExecutors: 5, Input DataSize: 45GB.

My questions: will this execute on the executors or on the driver? Will it throw an Out of Memory error? Please comment with your suggestions. Regards, MJ
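As an illustration, here is a minimal annotated sketch of the same loop; the input and output paths are assumptions, and the comments reflect standard Spark behavior: the foreach body itself runs on the driver, while each load/save triggers distributed jobs whose tasks run on the executors.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AvroToOrc").getOrCreate()
val inputPath = "/data/avro"     // placeholder; not given in the question
val outputPath = "/data/orc"     // placeholder; not given in the question

// Driver side: "tables.txt" is small, so collecting it only keeps table names in driver memory.
val tableList = spark.read.textFile("tables.txt")
tableList.collect().foreach { tblName =>
  // This loop body runs on the driver, but each load/save below launches a
  // distributed job whose tasks (reading Avro, writing ORC) run on the executors.
  val df = spark.read.format("avro").load(inputPath + "/" + tblName)
  df.write.format("orc").mode("overwrite").save(outputPath + "/" + tblName)
}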
Labels:
- Apache Spark
11-27-2018
08:10 AM
Hi @Felix Albani, thanks for your response. I referred to this site, and I think my issue is related to that story. Cheers, MJ
11-26-2018
10:20 AM
Hi All, I'm trying to write 430,000 records into an Avro file. I'm using the following Avro writer in my dependencies:

<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-avro</artifactId>
<version>1.9.0</version>
</dependency>

The file is written successfully, but when I try to read the data from Spark using any Avro-supporting library (Databricks Avro, Spark Avro, or Apache Avro), I get the error below. (One important point: up to 200,000 records I can still read my data without the error.)

18/11/26 15:48:23 ERROR TaskSetManager: Task 3 in stage 0.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
at org.apache.spark.sql.avro.AvroFileFormat$anonfun$buildReader$1$anon$1.hasNext(AvroFileFormat.scala:202)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$11$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:297)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:198)
... 18 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1887)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2831)
at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2830)
at org.apache.spark.sql.Dataset$anonfun$53.apply(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2830)
... 49 elided
Caused by: org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
at org.apache.spark.sql.avro.AvroFileFormat$anonfun$buildReader$1$anon$1.hasNext(AvroFileFormat.scala:202)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$11$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:297)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:198)
... 18 more

Even from Hive I can run a SELECT * statement, but I cannot perform a COUNT(*) operation. Please help me out with this issue. Cheers, MJ (+91) - 9688 514 443
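For comparison only (not presented as the fix): a minimal sketch of writing records as a standard Avro container file with org.apache.avro's DataFileWriter. The one-field schema and file name are assumptions for illustration; files written this way carry the sync markers that the Avro reader in the stack trace above is checking.

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

// Assumed one-field schema, purely for illustration.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Rec","fields":[{"name":"id","type":"int"}]}""")

val datumWriter = new GenericDatumWriter[GenericRecord](schema)
val fileWriter = new DataFileWriter[GenericRecord](datumWriter)
fileWriter.create(schema, new File("records.avro"))   // writes header + sync markers

(1 to 430000).foreach { i =>
  val rec = new GenericData.Record(schema)
  rec.put("id", i)
  fileWriter.append(rec)
}
fileWriter.close()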
Labels:
- Apache Spark
09-17-2018
05:49 AM
Hi All,
We know there are formulas available to determine a Spark job's executor memory, number of executors, and executor cores based on your cluster's available resources. Is there any formula available to calculate the same based on the data size as well?
- Case 1: what is the configuration if data size < 5 GB?
- Case 2: what is the configuration if 5 GB < data size < 10 GB?
- Case 3: what is the configuration if 10 GB < data size < 15 GB?
- Case 4: what is the configuration if 15 GB < data size < 25 GB?
- Case 5: what is the configuration if data size > 25 GB?

Cheers, MJ
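There is no official Spark formula tying data size to these settings; as an illustration only, here is one commonly quoted rule of thumb (every number below is an assumption), which reasons from partition count rather than raw data size.

// Rough rule of thumb, not an official formula: ~128 MB per partition,
// a handful of cores per executor, and enough tasks to keep them busy.
val dataSizeGB = 45.0                      // input data size from the earlier post
val partitionSizeMB = 128.0                // typical HDFS block / split size
val numPartitions = math.ceil(dataSizeGB * 1024 / partitionSizeMB).toInt   // ~360
val coresPerExecutor = 5                   // often suggested for good HDFS throughput
val numExecutors = 5                       // from the configuration in the earlier post
val parallelTasks = numExecutors * coresPerExecutor
println(s"$numPartitions partitions, $parallelTasks tasks running in parallel")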
Labels:
- Apache Spark
09-17-2018
05:46 AM
Hi @vgarg, thanks for your reply. I don't think a CASE statement would help me, because I need to make this change globally across all my systems. Cheers, MJ
09-04-2018
06:25 AM
Hi All, I have a Hive table in ORC format, and some records are stored as NULL in the table. Is there any configuration in Hive to return NULL as an empty string while querying the data? Cheers, MJ
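A hedged sketch of a query-time workaround (not a Hive server-side setting): rewrite NULLs to empty strings with COALESCE. The column and table names are placeholders, and the query is issued through Spark's Hive support only to stay consistent with the rest of this thread; the same SELECT works directly in Hive.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// COALESCE returns the first non-NULL argument, so NULLs come back as ''.
val df = spark.sql(
  "SELECT COALESCE(some_string_col, '') AS some_string_col FROM my_orc_table")
df.show()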
Labels:
- Apache Hive
08-27-2018
03:19 PM
Hi, the issue got resolved. I was trying to perform a groupBy operation inside a column literal; groupBy itself produces new columns, so instead of writing a query like the one I asked about above, we have to change the query as follows:

inputDf = inputDf.groupBy(col("tableName"), col("runDate"))
  .agg(sum(when(col("MinBusinessDate") < col("runDate") && col("MaxBusinessDate") > col("runDate"),
    when(col("business_date") > col("runDate"), col("rowCount")))).alias("NewColumnName"))
08-27-2018
03:18 PM
Hi, the issue got resolved. I was trying to perform a groupBy operation inside a column literal; groupBy itself produces new columns, so instead of writing a query like the one I asked about above, we have to change the query as follows:

inputDf = inputDf.groupBy(col("tableName"), col("runDate"))
  .agg(sum(when(col("MinBusinessDate") < col("runDate") && col("MaxBusinessDate") > col("runDate"),
    when(col("business_date") > col("runDate"), col("rowCount")))).alias("NewColumnName"))
08-26-2018
02:55 PM
Hi All, I'm trying to add a column to a dataframe based on multiple check conditions. One of the operations we are doing requires taking a sum of rows, but I'm getting the error below:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [StorageDayCountBeore: double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
at org.apache.spark.sql.functions$.typedLit(functions.scala:112)
at org.apache.spark.sql.functions$.lit(functions.scala:95)
at MYDev.ReconTest$.main(ReconTest.scala:35)
at MYDev.ReconTest.main(ReconTest.scala)
and the query I'm using is:

var df = inputDf
df = df.persist()
inputDf = inputDf.withColumn("newColumn",
  when(df("MinBusinessDate") < "2018-08-8" && df("MaxBusinessDate") > "2018-08-08",
    lit(df.groupBy(df("tableName"), df("runDate"))
      .agg(sum(when(df("business_date") > "2018-08-08", df("rowCount")))
        .alias("finalSRCcount"))
      .drop("tableName", "runDate"))))

Cheers, MJ
Labels:
- Apache Spark
08-26-2018
02:53 PM
Hi All, I'm trying to add a column to a dataframe based on multiple check conditions. One of the operations we are doing requires taking a sum of rows, but I'm getting the error below:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [Column_12: double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
at org.apache.spark.sql.functions$.typedLit(functions.scala:112)
at org.apache.spark.sql.functions$.lit(functions.scala:95)
at MYDev.ReconTest$.main(ReconTest.scala:35)
at MYDev.ReconTest.main(ReconTest.scala)

and the query I'm using is:

var df = inputDf
df = df.persist()
inputDf = inputDf.withColumn("newColumn",
  when(df("MinBusinessDate") < "2018-08-8" && df("MaxBusinessDate") > "2018-08-08",
    lit(df.groupBy(df("tableName"), df("runDate"))
      .agg(sum(when(df("business_date") > "2018-08-08", df("rowCount")))
        .alias("finalSRCcount"))
      .drop("tableName", "runDate"))))

Cheers, MJ
Labels:
- Apache Spark
08-25-2018
06:40 AM
Hi @Felix Albani, yes, your suggestion works fine. I think I have to extend my object with App to make it work. Thanks.
08-24-2018
05:51 PM
Hi @Felix Albani, the same issue still exists. Please find my build.sbt and sample code below.

import sbt._
import sbt.Keys._
name := "BackupSnippets"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "2.2.1"
val hadoopVersion = "2.7.1"
val poiVersion = "3.9"
val avroVersion = "1.7.6"
val hortonworksVersion = "2.2.0.2.6.3.79-2"
conflictManager := ConflictManager.latestRevision
/*libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-hive" % sparkVersion,
"org.apache.poi" % "poi-ooxml" % poiVersion,
"org.apache.poi" % "poi" % poiVersion,
"org.apache.avro" % "avro" % avroVersion
)*/
/*resolvers ++= Seq(
"Typesafe repository" at "https://repo.typesafe.com/typesafe/releases/",
"Typesafe Ivyrepository" at "https://repo.typesafe.com/typesafe/ivy-releases/",
"Maven Central" at "https://repo1.maven.org/maven2/",
"Sonatype snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
Resolver.sonatypeRepo("releases")
) */
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % hortonworksVersion,
"org.apache.spark" %% "spark-sql" % hortonworksVersion,
"org.apache.spark" %% "spark-hive" % hortonworksVersion)
resolvers ++= Seq("Hortonworks Releases" at "http://repo.hortonworks.com/content/repositories/releases/",
"Jetty Releases" at "http://repo.hortonworks.com/content/repositories/jetty-hadoop/") ************************************************************************************ package BigData101.ORC
import ScalaUtils.SchemaUtils
import org.apache.spark.sql.{Encoders, Row, SparkSession}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
object ORCTesting {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder()
.master("local[*]")
.appName("ORC Testing")
.enableHiveSupport()
.getOrCreate()
case class Airlines(Airline_id: Integer, Name: String, Alias: String, IATA: String, ICAO: String, Callsign: String,
Country: String, Active: String)
//val AirlineSchema = ScalaReflection.schemaFor[Airlines].dataType.asInstanceOf[StructType]
Encoders.product[Airlines].schema
sys.exit(1)
}
}
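For reference, a minimal sketch of one possible variation (an assumption about the cause, not a confirmed fix): with the case class moved out of main to the top level, ScalaReflection.schemaFor and Encoders.product can find the implicit TypeTag they need. The object name here is hypothetical.

package BigData101.ORC

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Top-level case class (not defined inside main), so a TypeTag is available.
case class Airlines(Airline_id: Integer, Name: String, Alias: String, IATA: String,
                    ICAO: String, Callsign: String, Country: String, Active: String)

object ORCTestingTopLevel {
  def main(args: Array[String]): Unit = {
    // No SparkSession is needed just to derive the schema.
    val airlineSchema: StructType =
      ScalaReflection.schemaFor[Airlines].dataType.asInstanceOf[StructType]
    println(airlineSchema.treeString)
  }
}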
08-24-2018
05:41 PM
Hi @Felix Albani, just for clarification, will this work only with the Hortonworks dependencies? Please find my build.sbt dependencies below and let me know whether I need to add anything.

val sparkVersion = "2.2.1"
val hadoopVersion = "2.7.1"
val poiVersion = "3.9"
val avroVersion = "1.7.6"

libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-hive" % sparkVersion,
"org.apache.poi" % "poi-ooxml" % poiVersion,
"org.apache.poi" % "poi" % poiVersion,
"org.apache.avro" % "avro" % avroVersion
)
08-24-2018
01:32 PM
Hi @Felix Albani, thanks for your response. Actually, I've tried all the possible options; please find the attached image for reference. Is there any other way I can solve my issue? Cheers, MJ
08-24-2018
08:28 AM
Hi @Dongjoon Hyun, how do I add this dependency in build.sbt? I'm using Spark 2.2.1, which shows the following error: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.orc.DefaultSource.
08-24-2018
01:13 AM
Hi Aditya, thanks for your reply. Are there any other posts or blogs where I can find out how to implement schema evolution in ORC?
08-24-2018
01:09 AM
Hi All, I'm trying to convert a case class to a StructType schema in Spark, and I'm getting the error attached in the image. Please find my case class and conversion technique below.

case class Airlines(Airline_id: Integer, Name: String, Alias: String, IATA: String, ICAO: String, Callsign: String, Country: String, Active: String)

val AirlineSchema = ScalaReflection.schemaFor[Airlines].dataType.asInstanceOf[StructType]

Reference URL: https://stackoverflow.com/questions/36746055/generate-a-spark-structtype-schema-from-a-case-class

Cheers, MJ
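As an alternative sketch (the case class mirrors the one above and is assumed to be defined at the top level, outside any method), Encoders.product derives the same StructType without calling ScalaReflection directly.

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Must live at the top level so an implicit TypeTag can be found.
case class Airlines(Airline_id: Integer, Name: String, Alias: String, IATA: String,
                    ICAO: String, Callsign: String, Country: String, Active: String)

val AirlineSchema: StructType = Encoders.product[Airlines].schema
println(AirlineSchema.treeString)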
Labels:
- Apache Spark
- Schema Registry
08-23-2018
07:47 AM
Hi all, as we all know, we can control schema evolution in the Avro format with both forward and backward schema compatibility. Is there any option to do the same in the ORC file format too? Let me know the possibilities so I can explore this further. Cheers, MJ
Labels:
- Schema Registry
07-09-2018
01:23 PM
Hi All, I'm getting the following error while a Spark job is running. I'm trying to migrate data from Hadoop cluster A to Hadoop cluster B, around 10,000 partitions per job; the problem is that the job throws this error after it crosses the halfway point. Please find the error below and let me know how to resolve it.

18/07/09 20:49:54 WARN TaskSetManager: Lost task 1.0 in stage 882.0 (TID 1902, DNSName, executor 1): java.net.UnknownHostException: HostName:50075: HostName
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:595)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:552)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$ReadRunner.connect(WebHdfsFileSystem.java:1710)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:620)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:472)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:502)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:498)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$ReadRunner.read(WebHdfsFileSystem.java:1659)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$WebHdfsInputStream.read(WebHdfsFileSystem.java:1520)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$anon$1.liftedTree1$1(HadoopRDD.scala:251)
at org.apache.spark.rdd.HadoopRDD$anon$1.<init>(HadoopRDD.scala:250)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Thanks and regards, MJ
Labels:
- Apache Hadoop
- Apache Spark
- HDFS
- Security
07-09-2018
08:06 AM
Hi @Sandeep Nemuri, I think I found the answer on the Stack Overflow site. Thanks for your info.
07-09-2018
08:05 AM
Thanks for your info, @Felix Albani.
07-05-2018
01:11 PM
How do I configure it? I have the code part.
07-05-2018
11:41 AM
Hi All, I have a scenario as follows: I need to split and save my dataframe into multiple partitions based on date, and when writing I need to take a record count. Is there any option to take a record count while writing a DataFrame? Please comment.
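A hedged sketch of one way to do this (the date column name, paths, and format are placeholders): compute per-date counts with groupBy before the write, then write with partitionBy so each date lands in its own directory.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("WriteByDate").getOrCreate()
val df = spark.read.parquet("/data/input")          // placeholder input path/format

// Record count per date, computed up front (one extra pass over the data).
val counts = df.groupBy("business_date").count()
counts.show()

// One output directory per distinct business_date value.
df.write
  .mode("overwrite")
  .partitionBy("business_date")
  .parquet("/data/output/by_date")                  // placeholder output path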
07-05-2018
11:28 AM
Hi, I want to read files from a remote Hadoop cluster (A) with HA enabled and load them into cluster (B). Please let me know if there are any options.
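A hedged sketch of one option (all nameservice, namenode, host, and path names below are placeholders): add cluster A's HA nameservice to the job's Hadoop configuration so that hdfs://clusterA/... paths resolve from the job running against cluster B.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CrossClusterCopy").getOrCreate()
val hc = spark.sparkContext.hadoopConfiguration

// Make the remote HA nameservice "clusterA" known alongside the local one.
hc.set("dfs.nameservices", "clusterA,clusterB")
hc.set("dfs.ha.namenodes.clusterA", "nn1,nn2")
hc.set("dfs.namenode.rpc-address.clusterA.nn1", "nn1.a.example.com:8020")
hc.set("dfs.namenode.rpc-address.clusterA.nn2", "nn2.a.example.com:8020")
hc.set("dfs.client.failover.proxy.provider.clusterA",
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

// Read from remote cluster A, write into cluster B.
val df = spark.read.text("hdfs://clusterA/path/on/cluster/a")
df.write.text("hdfs://clusterB/path/on/cluster/b")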
Labels:
- Apache Hadoop
- Apache Spark
06-13-2018
05:21 PM
It's already installed, and the issue is resolved now. Thanks for your response.
06-13-2018
05:19 PM
Thanks for your help, @Felix Albani. It allowed me to run without any platform modification.
06-11-2018
01:01 PM
Hi All, I'm getting the following error when I submit a Spark job to read a sequence file.

18/06/07 19:35:25 ERROR Executor: Exception in task 8.0 in stage 16.0 (TID 611) java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:193)
at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:178)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1985)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:251)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:250)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

My environment details: 1) Spark 2.2.1 2) Scala 2.11.8
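A hedged diagnostic sketch (not a fix): check from the same JVM whether the Hadoop native library was loaded and, if so, whether it was built with Snappy support, which is exactly what the SnappyCodec check in the trace above is testing.

import org.apache.hadoop.util.NativeCodeLoader

// If the native library is missing, buildSupportsSnappy() itself can fail, so guard it.
println(s"native hadoop loaded: ${NativeCodeLoader.isNativeCodeLoaded}")
if (NativeCodeLoader.isNativeCodeLoaded) {
  println(s"built with snappy:    ${NativeCodeLoader.buildSupportsSnappy()}")
}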
Labels:
- Apache Hadoop
- Apache Spark
04-24-2018
02:02 PM
Hi All, could someone help me figure out how to start fetching data from the YARN REST API using Java? Please share some sample links with me. Thanks and regards, MJ
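The question asks about Java, but to stay with the Scala used elsewhere in this thread, here is a minimal sketch of the same call (the host and port are placeholders and no authentication is assumed): an HTTP GET against the ResourceManager's /ws/v1/cluster/apps endpoint returns a JSON list of applications.

import scala.io.Source

val rmUrl = "http://resourcemanager.example.com:8088/ws/v1/cluster/apps"   // placeholder host:port
val json = Source.fromURL(rmUrl).mkString   // raw JSON describing the cluster's applications
println(json.take(500))                     // print a preview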
Labels:
- Apache YARN
04-23-2018
08:57 AM
Thanks, @Pierre Villard.