Member since: 09-29-2015
Posts: 155
Kudos Received: 205
Solutions: 18

My Accepted Solutions
Views | Posted
---|---
8365 | 02-17-2017 12:38 PM
1338 | 11-15-2016 03:56 PM
1878 | 11-11-2016 05:27 PM
15359 | 11-11-2016 12:16 AM
3078 | 11-10-2016 06:15 PM
05-03-2016
01:47 AM
@Vadim So are we saying it is not possible to use Zeppelin's charting capabilities when using hiveContext and temp tables?
05-02-2016
08:11 PM
2 Kudos
@Artem Ervits You don't need to think of it as passing context between Spark, Phoenix, and Hive. You load the data as a DataFrame/Dataset into a local variable from your data source, and you do this for every data source. Example:

val mysqlTableDF = hiveContext.read.format("jdbc")....load()  // load a MySQL table
val csvDF = hiveContext.read.format("com.databricks.spark.csv")....load()  // load a CSV file

You then work with those DataFrames and do joins, filters, etc. For example:

val joined_df = hiveTablesDF.join(factsales, "key")

For context sharing, Sunile is right; Vadim created an article on HCC that gives more details. The short version, if you want to share the context:

1. Log into Ambari as admin.
2. Click on the Spark service in the left-hand pane.
3. Click on Configs.
4. Click on "Custom spark-defaults".
5. Add a custom property: key=spark.sql.hive.thriftServer.singleSession, value=true

Note this is only required in Spark 1.6; in 1.5 you had automatic context sharing.
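For illustration, here is a minimal Scala sketch of that pattern; the JDBC URL, table names, CSV path, and join column are placeholders I made up, not values from an actual cluster, and the CSV read assumes the spark-csv package is on the classpath:

// Hedged sketch: load two sources as DataFrames and join them in one SQLContext.
val mysqlTableDF = hiveContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/salesdb")  // placeholder JDBC URL
  .option("dbtable", "factsales")                         // placeholder table name
  .option("user", "etl_user")
  .option("password", "xxxxxxx")
  .load()                                                 // load a MySQL table as a DataFrame

val csvDF = hiveContext.read
  .format("com.databricks.spark.csv")                     // requires the spark-csv package
  .option("header", "true")
  .load("/tmp/dim_store.csv")                             // placeholder CSV path

// Work with the two DataFrames directly: join, then expose the result as a temp
// table so it can be queried from SQL.
val joined_df = csvDF.join(mysqlTableDF, "key")           // "key" is a placeholder join column
joined_df.registerTempTable("joined_sales")

A temp table registered this way stays local to the defining session; per the note above, sharing it with the Thrift Server in Spark 1.6 is what the singleSession property is for.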
05-02-2016
07:28 PM
2 Kudos
I am running into an issue on the HDP 2.4 Sandbox (Spark 1.6, Zeppelin notebook) where the "temp" registered tables are not found when trying to use %hive. I am loading the table:

val hiveTablesDF = hiveContext.read....
hiveTablesDF.registerTempTable("DimStoreDF")
hiveTablesDF.show()
Output:

hiveTablesDF: org.apache.spark.sql.DataFrame = [storekey: int, geographykey: int, storemanager: int, storetype: string, storename: string, storedescription: string, status: string, opendate: timestamp, closedate: timestamp, entitykey: int, zipcode: string, zipcodeextension: string, storephone: string, storefax: string, closereason: string, employeecount: int, sellingareasize: double, lastremodeldate: timestamp, etlloadid: int, somedate1: timestamp, somedate2: timestamp, loaddate: timestamp, updatedate: timestamp]

I can also see the table "dimstoredf" using hiveContext.tableNames:
hiveContext.tableNames
res27: Array[String] = Array(dimproductdf, dimstoredf, ae2, dimcustomer, dimcustomertemp, dimproduct, dimproducttemp, factonlinesales, factonlinesalestemp, factsales, factsalestemp, health_table, mysql_federated_sample, sample_07, sample_08)
It is also available in beeline and is registered as temporary:

[root@sandbox ~]# beeline -u "jdbc:hive2://localhost:10002/default" -n admin
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://localhost:10002/default
Connected to: Spark SQL (version 1.6.0)
Driver: Hive JDBC (version 1.2.1000.2.4.0.0-169)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.4.0.0-169 by Apache Hive
0: jdbc:hive2://localhost:10002/default> show tables;
+-------------------------+--------------+--+
| tableName | isTemporary |
+-------------------------+--------------+--+
| dimproductdf | true |
| dimstoredf | true |
| ae2 | false |
| dimcustomer | false |
| dimcustomertemp | false |
| dimproduct | false |
| dimproducttemp | false |
| factonlinesales | false |
| factonlinesalestemp | false |
| factsales | false |
| factsalestemp | false |
| health_table | false |
| mysql_federated_sample | false |
| sample_07 | false |
| sample_08 | false |
+-------------------------+--------------+--+
However, when I try to run:

%hive
select * from DimStoreDF

I get this error:

Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'DimStoreDF'

If I run the SQL statement from hiveContext.sql, all works well:

hiveContext.sql("select * from dimstoredf").show()
+--------+------------+------------+---------+--------------------+--------------------+------+--------------------+--------------------+---------+-------+----------------+------------+------------+-----------+-------------+---------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+
|storekey|geographykey|storemanager|storetype| storename| storedescription|status| opendate| closedate|entitykey|zipcode|zipcodeextension| storephone| storefax|closereason|employeecount|sellingareasize| lastremodeldate|etlloadid| somedate1| somedate2| loaddate| updatedate|
+--------+------------+------------+---------+--------------------+--------------------+------+--------------------+--------------------+---------+-------+----------------+------------+------------+-----------+-------------+---------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+
(data rows omitted)
I'd appreciate your input and suggestions.
Labels:
- Apache Spark
- Apache Zeppelin
04-09-2016
06:17 PM
Switch to sandbox for the host.
04-01-2016
10:15 PM
Awesome write-up, @Ancil McBarnett!
04-01-2016
06:08 PM
1 Kudo
@eorgadn You should wrap the geoDistance functions as Hive UDFs; it will be a lot friendlier for most people who will want to use them in Hive.
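As a rough illustration only (the class name, haversine math, jar path, and function name below are my own placeholders, not the original geoDistance code), a Hive UDF wrapper in Scala could look something like this:

import org.apache.hadoop.hive.ql.exec.UDF

// Hypothetical sketch of a geo-distance Hive UDF; the haversine formula and
// class name are illustrative, not the poster's actual implementation.
class GeoDistanceUDF extends UDF {
  // Great-circle distance in kilometers between two lat/lon points.
  def evaluate(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val earthRadiusKm = 6371.0
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
      math.pow(math.sin(dLon / 2), 2)
    earthRadiusKm * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
  }
}

Once packaged into a jar, it would be registered in Hive with something like ADD JAR /tmp/geo-udfs.jar; followed by CREATE TEMPORARY FUNCTION geo_distance AS 'GeoDistanceUDF'; (jar path and function name are placeholders).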
04-01-2016
05:12 PM
11 Kudos
The HDP Sandbox comes pre-installed with SparkR and R.
First, let's set up RStudio Server on the HDP Sandbox.
To download and install RStudio Server, open a terminal window and execute the following commands to get the 64-bit version:
wget https://download2.rstudio.org/rstudio-server-rhel-0.99.893-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-rhel-0.99.893-x86_64.rpm
sudo rstudio-server verify-installation
sudo rstudio-server stop
sudo rstudio-server start
Now set up a local user on the HDP Sandbox to access RStudio:
useradd alex
passwd xxxx
Next, launch a web browser and point it to: http://sandbox.hortonworks.com:8787/
Log in using the local account created above.
Next, let's initialize SparkR in "yarn-client" mode:
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client/")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),.libPaths()))
library(SparkR)
sc <- SparkR::sparkR.init(master = "yarn-client")
Run the code in RStudio; you should get output like the following:
>
>
> Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client/")
>
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),.libPaths()))
>
> library(SparkR)
Attaching package: ‘SparkR’
The following objects are masked from ‘package:stats’:
cov, filter, lag, na.omit, predict, sd, var
The following objects are masked from ‘package:base’:
colnames, colnames<-, intersect, rank, rbind, sample, subset, summary, table, transform
> sc <- SparkR::sparkR.init(master = "yarn-client")
Launching java with spark-submit command /usr/hdp/current/spark-client//bin/spark-submit sparkr-shell /tmp/RtmpVvKWS8/backend_port38582cab538c
16/04/01 16:28:00 INFO SparkContext: Running Spark version 1.6.0
16/04/01 16:28:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/01 16:28:01 INFO SecurityManager: Changing view acls to: azeltov
16/04/01 16:28:01 INFO SecurityManager: Changing modify acls to: azeltov
16/04/01 16:28:01 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(azeltov); users with modify permissions: Set(azeltov)
16/04/01 16:28:02 INFO Utils: Successfully started service 'sparkDriver' on port 51539.
16/04/01 16:28:02 INFO Slf4jLogger: Slf4jLogger started
16/04/01 16:28:02 INFO Remoting: Starting remoting
16/04/01 16:28:03 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@000.000.000.000:54056]
16/04/01 16:28:03 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54056.
16/04/01 16:28:03 INFO SparkEnv: Registering MapOutputTracker
16/04/01 16:28:03 INFO SparkEnv: Registering BlockManagerMaster
16/04/01 16:28:03 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-533d709d-f345-4d26-8046-0ae68a009d13
16/04/01 16:28:03 INFO MemoryStore: MemoryStore started with capacity 511.5 MB
16/04/01 16:28:03 INFO SparkEnv: Registering OutputCommitCoordinator
16/04/01 16:28:03 INFO Server: jetty-8.y.z-SNAPSHOT
16/04/01 16:28:03 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/04/01 16:28:03 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/04/01 16:28:03 INFO SparkUI: Started SparkUI at http://000.000.000.000:4040
spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
16/04/01 16:28:04 INFO TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
16/04/01 16:28:04 INFO RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/1000.000.000.000:8050
16/04/01 16:28:05 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/04/01 16:28:05 INFO Client: Requesting a new application from cluster with 1 NodeManagers
16/04/01 16:28:05 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (5120 MB per container)
16/04/01 16:28:05 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/04/01 16:28:05 INFO Client: Setting up container launch context for our AM
16/04/01 16:28:05 INFO Client: Setting up the launch environment for our AM container
16/04/01 16:28:05 INFO Client: Using the spark assembly jar on HDFS because you are using HDP, defaultSparkAssembly:hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/04/01 16:28:05 INFO Client: Preparing resources for our AM container
16/04/01 16:28:05 INFO Client: Using the spark assembly jar on HDFS because you are using HDP, defaultSparkAssembly:hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/04/01 16:28:05 INFO Client: Source and destination file systems are the same. Not copying hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/04/01 16:28:06 INFO Client: Uploading resource file:/tmp/spark-974da35e-9ded-485a-9a13-0fd9a020dbf0/__spark_conf__5714702140855539619.zip -> hdfs://sandbox.hortonworks.com:8020/user/azeltov/.sparkStaging/application_1459455786854_0007/__spark_conf__5714702140855539619.zip
16/04/01 16:28:06 INFO SecurityManager: Changing view acls to: azeltov
16/04/01 16:28:06 INFO SecurityManager: Changing modify acls to: azeltov
16/04/01 16:28:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(azeltov); users with modify permissions: Set(azeltov)
16/04/01 16:28:06 INFO Client: Submitting application 7 to ResourceManager
16/04/01 16:28:06 INFO YarnClientImpl: Submitted application application_1459455786854_0007
16/04/01 16:28:06 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1459455786854_0007 and attemptId None
16/04/01 16:28:07 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:07 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1459528086330
final status: UNDEFINED
tracking URL: http://sandbox.hortonworks.com:8088/proxy/application_1459455786854_0007/
user: azeltov
16/04/01 16:28:08 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:09 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:10 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:11 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:12 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
16/04/01 16:28:12 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> sandbox.hortonworks.com, PROXY_URI_BASES -> http://sandbox.hortonworks.com:8088/proxy/application_1459455786854_0007), /proxy/application_1459455786854_0007
16/04/01 16:28:12 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
16/04/01 16:28:12 INFO Client: Application report for application_1459455786854_0007 (state: RUNNING)
16/04/01 16:28:12 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host:000.000.000.000
ApplicationMaster RPC port: 0
queue: default
start time: 1459528086330
final status: UNDEFINED
tracking URL: http://sandbox.hortonworks.com:8088/proxy/application_1459455786854_0007/
user: azeltov
16/04/01 16:28:12 INFO YarnClientSchedulerBackend: Application application_1459455786854_0007 has started running.
16/04/01 16:28:12 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35844.
16/04/01 16:28:12 INFO NettyBlockTransferService: Server created on 35844
16/04/01 16:28:12 INFO BlockManagerMaster: Trying to register BlockManager
16/04/01 16:28:12 INFO BlockManagerMasterEndpoint: Registering block manager 000.000.000.000:35844 with 511.5 MB RAM, BlockManagerId(driver,000.000.000.000, 35844)
16/04/01 16:28:12 INFO BlockManagerMaster: Registered BlockManager
16/04/01 16:28:12 INFO EventLoggingListener: Logging events to hdfs:///spark-history/application_1459455786854_0007
16/04/01 16:28:18 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (sandbox.hortonworks.com:43282) with ID 1
16/04/01 16:28:18 INFO BlockManagerMasterEndpoint: Registering block manager sandbox.hortonworks.com:58360 with 511.5 MB RAM, BlockManagerId(1, sandbox.hortonworks.com, 58360)
16/04/01 16:28:33 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
Validate that the Spark UI has started by looking for messages like these in your output (as in the log above):
16/04/01 16:28:03 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/04/01 16:28:03 INFO SparkUI: Started SparkUI at http://000.000.000.000:4040
Next, let's try running some SparkR code:
sqlContext <-sparkRSQL.init(sc)
path <-file.path("file:///usr/hdp/2.4.0.0-169/spark/examples/src/main/resources/people.json")
peopleDF <-jsonFile(sqlContext, path)
printSchema(peopleDF)
You should get this output:
> printSchema(peopleDF)
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
Now let's try some basic DataFrame analysis:
# Register this DataFrame as a table.
registerTempTable(peopleDF, "people")
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagersLocalDF <- collect(teenagers)
print(teenagersLocalDF)
04-01-2016
02:21 PM
11 Kudos
First, let's create a sample file in S3. In the AWS Console, go to S3, create a bucket "S3Demo", and pick your region. Upload the file manually using the upload button (example file name used later in Scala: S3HDPTEST.csv).

In the HDP 2.4.0 Sandbox:

1. Download the AWS SDK for Java: https://aws.amazon.com/sdk-for-java/
2. Upload it to the hadoop directory. You should see aws-java-sdk-1.10.65.jar in /usr/hdp/2.4.0.0-169/hadoop/:

[root@sandbox bin]# ll /usr/hdp/2.4.0.0-169/hadoop/
total 242692
-rw-r--r-- 1 root root 32380018 2016-03-31 22:02 aws-java-sdk-1.10.65.jar
drwxr-xr-x 2 root root 4096 2016-02-29 18:05 bin
drwxr-xr-x 2 root root 12288 2016-02-29 17:49 client
lrwxrwxrwx 1 root root 25 2016-03-31 21:08 conf -> /etc/hadoop/2.4.0.0-169/0
drwxr-xr-x 2 root root 4096 2016-02-29 17:46 etc
-rw-r--r-- 1 root root 17366 2016-02-10 06:44 hadoop-annotations-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 40 2016-02-29 17:46 hadoop-annotations.jar -> hadoop-annotations-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 71534 2016-02-10 06:44 hadoop-auth-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 33 2016-02-29 17:46 hadoop-auth.jar -> hadoop-auth-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 103049 2016-02-10 06:44 hadoop-aws-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 32 2016-02-29 17:46 hadoop-aws.jar -> hadoop-aws-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 138488 2016-02-10 06:44 hadoop-azure-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 34 2016-02-29 17:46 hadoop-azure.jar -> hadoop-azure-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 3469432 2016-02-10 06:44 hadoop-common-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 1903274 2016-02-10 06:44 hadoop-common-2.7.1.2.4.0.0-169-tests.jar
lrwxrwxrwx 1 root root 35 2016-02-29 17:46 hadoop-common.jar -> hadoop-common-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 41 2016-02-29 17:46 hadoop-common-tests.jar -> hadoop-common-2.7.1.2.4.0.0-169-tests.jar
-rw-r--r-- 1 root root 159484 2016-02-10 06:44 hadoop-nfs-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 32 2016-02-29 17:46 hadoop-nfs.jar -> hadoop-nfs-2.7.1.2.4.0.0-169.jar
drwxr-xr-x 5 root root 4096 2016-03-31 20:27 lib
drwxr-xr-x 2 root root 4096 2016-02-29 17:46 libexec
drwxr-xr-x 3 root root 4096 2016-02-29 17:46 man
-rw-r--r-- 1 root root 210216729 2016-02-10 06:44 mapreduce.tar.gz
drwxr-xr-x 2 root root 4096 2016-02-29 17:46 sbin

Change directory to spark/bin:

[root@sandbox bin]# cd /usr/hdp/2.4.0.0-169/spark/bin

Start the Spark Scala shell with the right AWS jar dependencies:

./spark-shell --master yarn-client --jars /usr/hdp/2.4.0.0-169/hadoop/hadoop-aws-2.7.1.2.4.0.0-169.jar,/usr/hdp/2.4.0.0-169/hadoop/hadoop-auth.jar,/usr/hdp/2.4.0.0-169/hadoop/aws-java-sdk-1.10.65.jar --driver-memory 512m --executor-memory 512m

Now for some Scala code to configure the AWS secret keys in hadoopConf:

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "xxxxxxx")
hadoopConf.set("fs.s3.awsSecretAccessKey", "xxxxxxx")
And now read the file from the S3 bucket and count its lines:

val myLines = sc.textFile("s3n://s3hdptest/S3HDPTEST.csv")
val count = myLines.count()
println(count)
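As a possible next step (a sketch on my part, not part of the original walkthrough: it assumes the spark-csv package is on the classpath, e.g. added via --packages com.databricks:spark-csv_2.10:1.4.0, and that the S3 access configuration above is already in place; the temp table name is a placeholder), the same S3 file can be loaded straight into a DataFrame:

// Hedged sketch: read the same S3 object as a DataFrame instead of raw lines.
val s3DF = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")         // adjust if S3HDPTEST.csv has no header row
  .option("inferSchema", "true")
  .load("s3n://s3hdptest/S3HDPTEST.csv")

s3DF.printSchema()
s3DF.registerTempTable("s3_hdp_test")   // placeholder temp table name
sqlContext.sql("select count(*) from s3_hdp_test").show()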
04-01-2016
01:30 AM
1 Kudo
Dope... Thanks @Simon Elliston Ball, that worked!

./spark-shell --master yarn-client --jars /usr/hdp/2.4.0.0-169/hadoop/hadoop-aws-2.7.1.2.4.0.0-169.jar,/usr/hdp/2.4.0.0-169/hadoop/hadoop-auth.jar,/usr/hdp/2.4.0.0-169/hadoop/aws-java-sdk-1.10.65.jar --driver-memory 512m --executor-memory 512m
03-31-2016
10:17 PM
[root@sandbox bin]# ll /usr/hdp/2.4.0.0-169/hadoop/
total 242692
-rw-r--r-- 1 root root 32380018 2016-03-31 22:02 aws-java-sdk-1.10.65.jar
drwxr-xr-x 2 root root 4096 2016-02-29 18:05 bin
drwxr-xr-x 2 root root 12288 2016-02-29 17:49 client
lrwxrwxrwx 1 root root 25 2016-03-31 21:08 conf -> /etc/hadoop/2.4.0.0-169/0
drwxr-xr-x 2 root root 4096 2016-02-29 17:46 etc
-rw-r--r-- 1 root root 17366 2016-02-10 06:44 hadoop-annotations-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 40 2016-02-29 17:46 hadoop-annotations.jar -> hadoop-annotations-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 71534 2016-02-10 06:44 hadoop-auth-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 33 2016-02-29 17:46 hadoop-auth.jar -> hadoop-auth-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 103049 2016-02-10 06:44 hadoop-aws-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 32 2016-02-29 17:46 hadoop-aws.jar -> hadoop-aws-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 138488 2016-02-10 06:44 hadoop-azure-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 34 2016-02-29 17:46 hadoop-azure.jar -> hadoop-azure-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 3469432 2016-02-10 06:44 hadoop-common-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 1903274 2016-02-10 06:44 hadoop-common-2.7.1.2.4.0.0-169-tests.jar
lrwxrwxrwx 1 root root 35 2016-02-29 17:46 hadoop-common.jar -> hadoop-common-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 41 2016-02-29 17:46 hadoop-common-tests.jar -> hadoop-common-2.7.1.2.4.0.0-169-tests.jar
-rw-r--r-- 1 root root 159484 2016-02-10 06:44 hadoop-nfs-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 32 2016-02-29 17:46 hadoop-nfs.jar -> hadoop-nfs-2.7.1.2.4.0.0-169.jar
drwxr-xr-x 5 root root 4096 2016-03-31 20:27 lib
drwxr-xr-x 2 root root 4096 2016-02-29 17:46 libexec
drwxr-xr-x 3 root root 4096 2016-02-29 17:46 man
-rw-r--r-- 1 root root 210216729 2016-02-10 06:44 mapreduce.tar.gz
drwxr-xr-x 2 root root 4096 2016-02-29 17:46 sbin