Member since: 09-28-2017
Posts: 5
Kudos Received: 0
Solutions: 0
11-17-2017 09:54 PM
Hi, I am importing some data from Oracle using ojdbc6.jar, which I add through the hiveContext. I then run a simple GROUP BY and COUNT aggregation. But when I convert the aggregated Spark DataFrame to an R data frame, the conversion takes forever, approximately 20 to 30 minutes, even though the result has only 4 to 5 rows.
# Connection to sparkR from Rstudio
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark-client")}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sc <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc)
hiveContext <- sparkRHive.init(sc)
#Adding the OJDBC6 Jar
df_1 <- sql(hiveContext,"add jar /usr/hdp/2.6.1.0-129/spark/lib/ojdbc6.jar")
# Connection to Oracle, storing the result in a DataFrame
df <- loadDF(hiveContext,source="jdbc",url="jdbc:oracle:thin:ks**d/password@135**********",dbtable="(SELECT EXTERNAL_ORDER_NUM,cast(partition_date as Date)as PARTITION_DATE,CHANNEL,ENTERPRISE_TYPE,LOSG_STATUS,LOSG_SUBSTATUS,PARTNER_NAME,SERVICE,PAYMENT_ARRANGEMENT,LNP,ORDER_STATUS,FULFILLMENT_METHOD,CONTRACT_LENGTH,PRODUCT_CATEGORY,BYOD_FLAG from my_table where partition_date >'10-NOV-17')",driver="oracle.jdbc.driver.OracleDriver")
# Registering the DataFrame as a temp table
registerTempTable(df, "df11")
#Performing simple GROUP BY AND COUNT
new_df1 <- sql(hiveContext, "SELECT count(EXTERNAL_ORDER_NUM) as COUNT,PARTITION_DATE FROM df11 GROUP BY PARTITION_DATE ORDER BY PARTITION_DATE")
final_frame <- as.data.frame(new_df1) #This final step takes like 30 mins to execute
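On the jar question: in Spark 1.6, a JDBC driver added with `add jar` through the Hive layer is not always visible to the driver and executor class loaders. A minimal sketch of an alternative, assuming the same jar path as above, is to put the driver on the classpath at init time via `sparkEnvir` (the same mechanism already used for `spark.driver.memory`):

```r
# Sketch: put the Oracle JDBC driver on the driver/executor classpath
# when the context is created, instead of "add jar" afterwards.
jar_path <- "/usr/hdp/2.6.1.0-129/spark/lib/ojdbc6.jar"
sc <- sparkR.init(
  master = "local[*]",
  sparkEnvir = list(
    spark.driver.memory          = "2g",
    spark.driver.extraClassPath   = jar_path,
    spark.executor.extraClassPath = jar_path
  )
)
```

`sparkR.init` also accepts a `sparkJars` argument, which ships the jar to the workers; either route avoids relying on the Hive-level `add jar`.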
I am using HDP 2.6.1 on VirtualBox, CentOS 7, Spark version 1.6.3. I think I am going wrong either in how I add the ojdbc6.jar file to all the nodes from RStudio ==> sparkR, or in the way I am connecting to sparkR. Here is the log when I run final_frame <- as.data.frame(new_df1):
> final_frame <- as.data.frame(new_df1)
17/11/17 21:47:35 INFO SparkContext: Starting job: dfToCols at NativeMethodAccessorImpl.java:-2
17/11/17 21:47:35 INFO DAGScheduler: Registering RDD 3 (dfToCols at NativeMethodAccessorImpl.java:-2)
17/11/17 21:47:35 INFO DAGScheduler: Got job 0 (dfToCols at NativeMethodAccessorImpl.java:-2) with 200 output partitions
17/11/17 21:47:35 INFO DAGScheduler: Final stage: ResultStage 1 (dfToCols at NativeMethodAccessorImpl.java:-2)
17/11/17 21:47:35 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
17/11/17 21:47:35 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
17/11/17 21:47:35 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at dfToCols at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/11/17 21:47:35 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 12.3 KB, free 1247.2 MB)
17/11/17 21:47:35 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.7 KB, free 1247.2 MB)
17/11/17 21:47:35 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:38269 (size: 5.7 KB, free: 1247.2 MB)
17/11/17 21:47:35 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008
17/11/17 21:47:35 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at dfToCols at NativeMethodAccessorImpl.java:-2)
17/11/17 21:47:35 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/11/17 21:47:35 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 1959 bytes)
17/11/17 21:47:35 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/11/17 21:47:35 INFO Executor: Fetching http://localhost:43477/jars/ojdbc6.jar with timestamp 1510955214958
17/11/17 21:47:35 INFO Utils: Fetching http://localhost:43477/jars/ojdbc6.jar to /tmp/spark-e6f37cae-3e7a-4eee-8c63-491b96002ccf/userFiles-63f43854-534b-494a-83a4-c1c0e4ec8113/fetchFileTemp5998330918630913912.tmp
17/11/17 21:47:35 INFO Executor: Adding file:/tmp/spark-e6f37cae-3e7a-4eee-8c63-491b96002ccf/userFiles-63f43854-534b-494a-83a4-c1c0e4ec8113/ojdbc6.jar to class loader
17/11/17 21:47:41 INFO GenerateMutableProjection: Code generated in 89.087265 ms
17/11/17 21:47:41 INFO GenerateUnsafeProjection: Code generated in 8.82499 ms
17/11/17 21:47:41 INFO GenerateMutableProjection: Code generated in 7.437145 ms
17/11/17 21:47:41 INFO GenerateUnsafeRowJoiner: Code generated in 5.154208 ms
I might have made some obvious mistake, as I am new to Hadoop. Please help.
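Two clues in the log above are worth noting: the job is created only at the `dfToCols` call (Spark is lazy, so the full Oracle read over JDBC happens at `as.data.frame` time, not at `loadDF`), and it reports "200 output partitions", which is the Spark 1.6 default for `spark.sql.shuffle.partitions` and means the tiny aggregate fans out into 200 tasks. A sketch of one mitigation, assuming the same `hiveContext` and temp table as in the question, is to lower the shuffle partition count before the aggregation:

```r
# Sketch: shrink the post-shuffle partition count so a 4-5 row
# aggregate is not scheduled as 200 tiny tasks (the 1.6 default).
invisible(sql(hiveContext, "SET spark.sql.shuffle.partitions=4"))
new_df1 <- sql(hiveContext, "SELECT count(EXTERNAL_ORDER_NUM) as COUNT, PARTITION_DATE
                             FROM df11 GROUP BY PARTITION_DATE ORDER BY PARTITION_DATE")
final_frame <- as.data.frame(new_df1)
```

If the JDBC read itself dominates the 30 minutes, partitioning the read (the `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` options of the JDBC source) or caching the DataFrame before repeated collects would be the next things to try.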
Labels:
- Apache Spark
10-09-2017 07:56 PM
Awesome!!!! It worked. Thanks @jnarayanan, great help. @Timothy Spann, thanks for trying to help; I appreciate it. I was using 8 GB RAM; the other details are posted in the question.
10-08-2017 09:11 PM
I am following the tutorial at https://hortonworks.com/tutorial/predicting-airline-delays-using-sparkr/#step-2--setup-sparkr-on-rstudio which requires installing RStudio. I can download the file:
wget https://download2.rstudio.org/rstudio-server-rhel-1.0.153-x86_64.rpm
but I am getting an error while installing:
sudo yum install --nogpgcheck rstudio-server-rhel-1.0.153-x86_64.rpm
I have included an image of the error. Please guide me; I am new to the HDP sandbox and to Hadoop. I am using HDP 2.6.1 on VirtualBox with CentOS 7.
Labels:
- Hortonworks Data Platform (HDP)
09-28-2017 08:45 AM
Hello, I am unable to install RStudio on the HDP 2.6 sandbox. Please guide me on where I should go and check for details. I have tried this link, but did not succeed: https://community.hortonworks.com/content/kbentry/69424/setting-up-rstudio-on-hortonworks-docker-sandbox-2.html