Member since: 10-06-2016
Posts: 32
Kudos Received: 2
Solutions: 4
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3345 | 12-04-2016 06:35 AM |
| | 13470 | 11-18-2016 06:21 AM |
| | 496 | 10-31-2016 06:48 AM |
| | 2042 | 10-07-2016 11:33 AM |
07-12-2017
01:30 PM
Hi, for my requirement I am reading a table from Hive (size: around 1 TB) and have to perform many aggregation operations, mostly avg and sum.
I tried the following code, but it runs for a long time. Is there a way to optimize it, or a more efficient way of handling multiple aggregation operations?
finalDF.groupBy($"Dseq", $"FmNum", $"yrs",$"mnt",$"FromDnsty").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"),avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),
avg($"neg"),avg($"Rd"),avg("savg"),avg("slavg"),avg($"dex"),avg("cur"),avg($"Nexp"), avg($"NExpp"),avg($"Psat"),
avg($"Pexps"),avg($"Pxn"),avg($"Pn"),avg($"AP3"),avg($"APd"),avg($"RInd"),avg($"CP"),avg($"CScr"),
avg($"Fspct7p1"), avg($"Fspts7p1"),avg($"TlpScore"),avg($"Ordrs"),avg($"Drs"),
avg("Lns"),avg("Judg"),avg("ds"),
avg("ob"),sum("Ss"),sum("dol"),sum("liens"),sum("pct"),
sum("jud"),sum("sljd"),sum("pNB"),avg("pctt"),sum($"Dolneg"),sum("Ls"),sum("sl"),sum($"PA"),sum($"DS"),
sum($"DA"),sum("dcur"),sum($"sat"),sum($"Pes"),sum($"Pn"),sum($"Pn"),sum($"Dlo"),sum($"Dol"),sum("pdol"),sum("pct"),sum("judg"))
Note: I am using Spark with Scala.
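For readability, the same aggregation can be built from column-name lists instead of spelling out every call. A minimal sketch, assuming shortened placeholder lists (substitute the full avg/sum column sets used above):

```scala
import org.apache.spark.sql.functions.{avg, col, count, sum}

// Columns to average and to sum (shortened placeholder lists; substitute the real ones).
val avgCols = Seq("Emp", "Ntw", "Age", "DAll", "PAll", "DSum", "dol")
val sumCols = Seq("Ss", "dol", "liens", "pct", "jud")

// Build every aggregate expression once instead of writing each call out by hand.
val aggExprs = count(col("Dseq")) +:
  (avgCols.map(c => avg(col(c))) ++ sumCols.map(c => sum(col(c))))

val aggregated = finalDF
  .groupBy(col("Dseq"), col("FmNum"), col("yrs"), col("mnt"), col("FromDnsty"))
  .agg(aggExprs.head, aggExprs.tail: _*)
```

Either way Spark evaluates all of these in a single groupBy pass, so most of the time on a ~1 TB input usually goes into the shuffle; selecting only the columns that are actually needed before the groupBy, caching the input if it is reused, and tuning spark.sql.shuffle.partitions typically help more than how the agg calls are written.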
Labels:
- Apache Spark
03-01-2017
05:48 PM
Do you want to group by the timestamp, or do you want to pick the latest timestamp?
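If the intent is the latter, picking the latest row per key is commonly done with a window function. A minimal sketch, where df, "key", and "ts" are hypothetical names:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Keep only the most recent row per key (df and the column names are placeholders).
val w = Window.partitionBy(col("key")).orderBy(col("ts").desc)
val latest = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
```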
12-04-2016
06:35 AM
The error seems to be a data-type mismatch between the dataset and the case class. Check each column's data type first: use the CSV API to read the CSV file and print the schema, e.g.:
val ebaydf = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load(path)
ebaydf.printSchema()
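Once the printed schema is known, the case class fields have to line up with it. A minimal sketch, under the assumption of a file with id, price, and title columns (hypothetical names and types):

```scala
// Hypothetical case class; field names and types must match the printed schema.
case class Listing(id: Int, price: Double, title: String)

import sqlContext.implicits._

val listings = ebaydf
  .select($"id".cast("int").as("id"), $"price".cast("double").as("price"), $"title")
  .as[Listing]  // analysis fails here if the column types still disagree
```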
11-27-2016
01:39 PM
Hi, it's not working for me; I am unable to cast the column's data type. Here is my code; I am getting only null values. Thanks!
val df4 = businessDF.withColumn("Estimated Gross Loss1", $"Estimated Gross Loss".cast("Int")).select($"Estimated Gross Loss1" as "Estimated Gross Loss")
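A cast to a numeric type yields null for any value that does not parse as a number (currency symbols, commas, blanks), so it is often worth stripping that formatting first. A sketch, assuming the column is a string such as "$1,234":

```scala
import org.apache.spark.sql.functions.regexp_replace
import sqlContext.implicits._

// Strip characters that are not digits, a sign, or a decimal point before casting;
// the exact pattern depends on how the source column is formatted.
val lossAsInt = businessDF.withColumn(
  "Estimated Gross Loss",
  regexp_replace($"Estimated Gross Loss", "[^0-9.-]", "").cast("int")
)
```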
11-18-2016
06:21 AM
Pointing the HBase conf directory to the Hadoop classpath resolved the above problem. Thanks!
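For reference, the equivalent fix can also be made in code by loading hbase-site.xml into the client configuration explicitly. A small sketch, where /etc/hbase/conf is an assumption about a typical HDP layout:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration

// Load the cluster's HBase settings explicitly so the client sees the
// ZooKeeper quorum and security options (adjust the path for your install).
val hbaseConf = HBaseConfiguration.create()
hbaseConf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
```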
11-16-2016
10:45 AM
I have already reviewed the link above, but it is not useful for my case. Thanks!
11-16-2016
10:01 AM
Yes, we are using Kerberos, and I have already set up Kerberos authentication in my code:
UserGroupInformation.setConfiguration(conf)
val userGroupInformation = UserGroupInformation.loginUserFromKeytabAndReturnUGI("hadoop1@ZONE1.SCB.NET", "C:\\Users\\1554160\\Downloads\\hadoop1.keytab")
UserGroupInformation.setLoginUser(userGroupInformation)
11-16-2016
09:48 AM
Hi All, I am hitting the following error while trying to connect to HBase through Spark (using newAPIHadoopRDD) on HDP 2.4.2. I have already tried increasing the RPC timeout in hbase-site.xml, but it does not help. Any idea how to fix this?
Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Wed Nov 16 14:59:36 IST 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=71216: row 'scores,,00000000000000' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hklvadcnc06.hk.standardchartered.com,16020,1478491683763, seqNum=0
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:195)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
at org.apache.hadoop.hbase.client.MetaScanner.allTableRegions(MetaScanner.java:324)
at org.apache.hadoop.hbase.client.HRegionLocator.getAllRegionLocations(HRegionLocator.java:88)
at org.apache.hadoop.hbase.util.RegionSizeCalculator.init(RegionSizeCalculator.java:94)
at org.apache.hadoop.hbase.util.RegionSizeCalculator.<init>(RegionSizeCalculator.java:81)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:256)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:237)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:120)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at scb.Hbasetest$.main(Hbasetest.scala:85)
at scb.Hbasetest.main(Hbasetest.scala)
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=71216: row 'scores,,00000000000000' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hklvadcnc06.hk.standardchartered.com,16020,1478491683763, seqNum=0
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to hklvadcnc06.hk.standardchartered.com/10.20.235.13:16020 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to hklvadcnc06.hk.standardchartered.com/10.20.235.13:16020 is closing. Call id=9, waitTime=171
at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1281)
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1252)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32651)
at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:372)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:199)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:346)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:320)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
... 4 more
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to hklvadcnc06.hk.standardchartered.com/10.20.235.13:16020 is closing. Call id=9, waitTime=171
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.cleanupCalls(RpcClientImpl.java:1078)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.close(RpcClientImpl.java:879)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.run(RpcClientImpl.java:604)
16/11/16 14:59:36 INFO SparkContext: Invoking stop() from shutdown hook
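For context, a minimal sketch of the kind of newAPIHadoopRDD scan described above, with the Kerberos login folded in; the table name "scores" is taken from the error text, while the keytab path and configuration location are placeholders:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

object HbaseScanSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HbaseScanSketch"))

    val hbaseConf = HBaseConfiguration.create()
    // On a secure cluster the client must see hbase-site.xml (security settings,
    // ZooKeeper quorum); a missing or stale copy can surface as
    // ConnectionClosingException / SocketTimeoutException against hbase:meta.
    hbaseConf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "scores")

    // Kerberos login; the keytab path is a placeholder.
    UserGroupInformation.setConfiguration(hbaseConf)
    UserGroupInformation.loginUserFromKeytab("hadoop1@ZONE1.SCB.NET", "/path/to/hadoop1.keytab")

    val rdd = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"rows: ${rdd.count()}")
    sc.stop()
  }
}
```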
Labels:
- Apache HBase
- Apache Spark
10-31-2016
06:48 AM
The issue was due to the Java version. Using the latest Alluxio version fixed the problem. If you are using Alluxio 1.2, you should use Java 7.
10-31-2016
06:26 AM
Yes, I am able to get the tables from Teradata. It's working fine. Thanks 🙂
10-30-2016
05:22 PM
1 Kudo
I am trying to set up Alluxio on my local machine, following the Alluxio doc http://www.alluxio.org/docs/master/en/Running-Alluxio-Locally.html. I am able to see the service, but I get an error when checking localhost:19999:
HTTP ERROR 500 Problem accessing /home. Reason: Server Error
Caused by: org.apache.jasper.JasperException: PWC6033: Error in Javac compilation for JSP PWC6199: Generated servlet error:
The type java.io.ObjectInputStream cannot be resolved. It is indirectly referenced from required .class files
Please help!
10-29-2016
10:41 AM
Exception in thread "streaming-job-executor-0" 16/10/29 13:23:58 INFO JobScheduler: Stopped JobScheduler
java.lang.Error: java.lang.InterruptedException
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:612)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:920)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at com.jcalc.feed.MarkovPredictor$$anonfun$main$2.apply(MarkovPredictor.scala:124)
at com.jcalc.feed.MarkovPredictor$$anonfun$main$2.apply(MarkovPredictor.scala:122)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
... 2 more
Labels:
- Apache Spark
10-28-2016
02:10 AM
Hi Joe, we want to use Spark SQL instead of Sqoop. I tried the Teradata JDBC driver, but I am unable to download the dependencies. Thanks!
<dependency>
  <groupId>com.teradata.jdbc</groupId>
  <artifactId>terajdbc4</artifactId>
  <version>15.10.00.22</version>
</dependency>
<dependency>
  <groupId>com.teradata.jdbc</groupId>
  <artifactId>tdgssconfig</artifactId>
  <version>15.00.00.22</version>
</dependency>
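The Teradata driver jars are generally not available from public Maven repositories, so they usually have to be downloaded from Teradata and put on the classpath manually (for example via --jars on spark-submit). Once they are available, a sketch of reading a Teradata table through Spark's JDBC data source, with host, database, table, and credentials as placeholders:

```scala
// Spark 1.6-style JDBC read; replace the placeholders with real values.
val teradataDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:teradata://td-host/DATABASE=mydb")
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("dbtable", "mydb.mytable")
  .option("user", "username")
  .option("password", "password")
  .load()

teradataDF.printSchema()
```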
10-27-2016
05:28 PM
Labels:
- Apache Spark
10-25-2016
03:09 PM
Adding the HBase client JAR fixed the issue. Thanks, Timothy!
10-24-2016
10:47 AM
I have followed the above steps and am using the JAR zhzhan/shc:0.0.11-1.6.1-s_2.10. On executing the code, I get the following exception:
Exception in thread "main" java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:218)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at org.apache.spark.sql.execution.datasources.hbase.RegionResource.init(HBaseResources.scala:93)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.liftedTree1$1(HBaseResources.scala:57)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.acquire(HBaseResources.scala:54)
at org.apache.spark.sql.execution.datasources.hbase.RegionResource.acquire(HBaseResources.scala:88)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.releaseOnException(HBaseResources.scala:74)
at org.apache.spark.sql.execution.datasources.hbase.RegionResource.releaseOnException(HBaseResources.scala:88)
at org.apache.spark.sql.execution.datasources.hbase.RegionResource.<init>(HBaseResources.scala:108)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD.getPartitions(HBaseTableScan.scala:60)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$anonfun$org$apache$spark$sql$DataFrame$execute$1$1.apply(DataFrame.scala:1538)
at org.apache.spark.sql.DataFrame$anonfun$org$apache$spark$sql$DataFrame$execute$1$1.apply(DataFrame.scala:1538)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2125)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$execute$1(DataFrame.scala:1537)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$collect(DataFrame.scala:1544)
at org.apache.spark.sql.DataFrame$anonfun$head$1.apply(DataFrame.scala:1414)
at org.apache.spark.sql.DataFrame$anonfun$head$1.apply(DataFrame.scala:1413)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2138)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1413)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1495)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:171)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:394)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:355)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:363)
at com.sparhbaseintg.trnsfm.HBasesrc$.main(Hbasesrc.scala:83)
at com.sparhbaseintg.trnsfm.HBasesrc.main(Hbasesrc.scala)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
... 44 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.RpcRetryingCallerFactory.instantiate(Lorg/apache/hadoop/conf/Configuration;Lorg/apache/hadoop/hbase/client/ServerStatisticTracker;)Lorg/apache/hadoop/hbase/client/RpcRetryingCallerFactory;
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.createAsyncProcess(ConnectionManager.java:2242)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:690)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:630)
I can see the org.apache.hadoop.hbase.client.RpcRetryingCallerFactory.instantiate method in the hbase-client JAR, so I am not sure why it is not being resolved. Please help. Thanks!
10-07-2016
11:33 AM
1 Kudo
As per my understanding, sqlContext automatically infers the schema, so it has to read through the data to determine the data types. That might be the reason for the slowness. Thanks!
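If schema inference is the bottleneck, supplying the schema explicitly avoids that extra pass over the data. A minimal sketch with hypothetical column names:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Declare the schema up front so Spark does not have to scan the file to infer it.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(schema)
  .load(path)  // path is a placeholder for the input location
```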