11-17-2017
09:54 PM
Hi, I am importing some data from Oracle using ojdbc6.jar, which I add through the hiveContext. I then run a simple GROUP BY with COUNT aggregation. But converting the aggregated Spark DataFrame to an R data frame takes forever, roughly 20 to 30 minutes, even though the result has only 4 to 5 rows.
# Connection to SparkR from RStudio
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark-client")}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sc <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc)
hiveContext <- sparkRHive.init(sc)
#Adding the OJDBC6 Jar
df_1 <- sql(hiveContext,"add jar /usr/hdp/2.6.1.0-129/spark/lib/ojdbc6.jar")
# Connection to Oracle, loading the result into a Spark DataFrame
df <- loadDF(hiveContext,source="jdbc",url="jdbc:oracle:thin:ks**d/password@135**********",dbtable="(SELECT EXTERNAL_ORDER_NUM,cast(partition_date as Date)as PARTITION_DATE,CHANNEL,ENTERPRISE_TYPE,LOSG_STATUS,LOSG_SUBSTATUS,PARTNER_NAME,SERVICE,PAYMENT_ARRANGEMENT,LNP,ORDER_STATUS,FULFILLMENT_METHOD,CONTRACT_LENGTH,PRODUCT_CATEGORY,BYOD_FLAG from my_table where partition_date >'10-NOV-17')",driver="oracle.jdbc.driver.OracleDriver")
# Registering the DataFrame as a temp table
registerTempTable(df, "df11")
#Performing simple GROUP BY AND COUNT
new_df1 <- sql(hiveContext, "SELECT count(EXTERNAL_ORDER_NUM) as COUNT,PARTITION_DATE FROM df11 GROUP BY PARTITION_DATE ORDER BY PARTITION_DATE")
final_frame <- as.data.frame(new_df1) #This final step takes like 30 mins to execute
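One thing I am unsure about: "add jar" registers the jar with Hive, but the JDBC data source may need the driver on the driver/executor classpath when the context starts. A sketch of passing the jar at init time instead, reusing the same path as above (the extraClassPath setting is my assumption, not something from the original setup):

```r
# Sketch: pass the Oracle driver jar when creating the context,
# so it is on the classpath from the start instead of via "add jar".
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark-client")
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sc <- sparkR.init(master = "local[*]",
                  sparkEnvir = list(spark.driver.memory = "2g",
                                    spark.driver.extraClassPath = "/usr/hdp/2.6.1.0-129/spark/lib/ojdbc6.jar"),
                  sparkJars = "/usr/hdp/2.6.1.0-129/spark/lib/ojdbc6.jar")
hiveContext <- sparkRHive.init(sc)
```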
I am using HDP 2.6.1 on VirtualBox with CentOS 7 and Spark 1.6.3. I think I am going wrong either in how I add the ojdbc6.jar to all the nodes from RStudio => SparkR, or in the way I connect to SparkR. Here is the log when I run final_frame <- as.data.frame(new_df1):
> final_frame <- as.data.frame(new_df1)
17/11/17 21:47:35 INFO SparkContext: Starting job: dfToCols at NativeMethodAccessorImpl.java:-2
17/11/17 21:47:35 INFO DAGScheduler: Registering RDD 3 (dfToCols at NativeMethodAccessorImpl.java:-2)
17/11/17 21:47:35 INFO DAGScheduler: Got job 0 (dfToCols at NativeMethodAccessorImpl.java:-2) with 200 output partitions
17/11/17 21:47:35 INFO DAGScheduler: Final stage: ResultStage 1 (dfToCols at NativeMethodAccessorImpl.java:-2)
17/11/17 21:47:35 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
17/11/17 21:47:35 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
17/11/17 21:47:35 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at dfToCols at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/11/17 21:47:35 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 12.3 KB, free 1247.2 MB)
17/11/17 21:47:35 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.7 KB, free 1247.2 MB)
17/11/17 21:47:35 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:38269 (size: 5.7 KB, free: 1247.2 MB)
17/11/17 21:47:35 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008
17/11/17 21:47:35 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at dfToCols at NativeMethodAccessorImpl.java:-2)
17/11/17 21:47:35 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/11/17 21:47:35 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 1959 bytes)
17/11/17 21:47:35 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/11/17 21:47:35 INFO Executor: Fetching http://localhost:43477/jars/ojdbc6.jar with timestamp 1510955214958
17/11/17 21:47:35 INFO Utils: Fetching http://localhost:43477/jars/ojdbc6.jar to /tmp/spark-e6f37cae-3e7a-4eee-8c63-491b96002ccf/userFiles-63f43854-534b-494a-83a4-c1c0e4ec8113/fetchFileTemp5998330918630913912.tmp
17/11/17 21:47:35 INFO Executor: Adding file:/tmp/spark-e6f37cae-3e7a-4eee-8c63-491b96002ccf/userFiles-63f43854-534b-494a-83a4-c1c0e4ec8113/ojdbc6.jar to class loader
17/11/17 21:47:41 INFO GenerateMutableProjection: Code generated in 89.087265 ms
17/11/17 21:47:41 INFO GenerateUnsafeProjection: Code generated in 8.82499 ms
17/11/17 21:47:41 INFO GenerateMutableProjection: Code generated in 7.437145 ms
17/11/17 21:47:41 INFO GenerateUnsafeRowJoiner: Code generated in 5.154208 ms
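The "200 output partitions" line in the log above comes from the default spark.sql.shuffle.partitions=200, so this tiny aggregation is split into 200 shuffle tasks inside a single local JVM. A sketch of lowering it before running the GROUP BY (the value 4 is only an illustration, not a recommendation from anything above):

```r
# Sketch: reduce shuffle partitions for this small aggregation;
# 200 tasks for a 4-5 row result adds scheduling overhead in local mode.
invisible(sql(hiveContext, "SET spark.sql.shuffle.partitions=4"))
new_df1 <- sql(hiveContext, "SELECT count(EXTERNAL_ORDER_NUM) as COUNT, PARTITION_DATE
                             FROM df11 GROUP BY PARTITION_DATE ORDER BY PARTITION_DATE")
final_frame <- as.data.frame(new_df1)
```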
I might have made some obvious mistake, as I am new to Hadoop. Please help.
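For what it's worth, a JDBC-backed DataFrame is re-read each time an action runs, so Oracle may be queried again on every collect. Caching the loaded DataFrame before aggregating might avoid that; a sketch assuming the same df and hiveContext names as above:

```r
# Sketch: materialize the Oracle read once so later actions reuse
# the cached data instead of re-running the JDBC query against Oracle.
cache(df)
invisible(count(df))  # an action, to force materialization of the cache
registerTempTable(df, "df11")
```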
Labels:
- Apache Spark