Member since: 10-07-2015
Posts: 107
Kudos Received: 73
Solutions: 23
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2544 | 02-23-2017 04:57 PM
 | 1994 | 12-08-2016 09:55 AM
 | 8910 | 11-24-2016 07:24 PM
 | 3970 | 11-24-2016 02:17 PM
 | 9347 | 11-24-2016 09:50 AM
11-24-2016
02:17 PM
1 Kudo
You might need to restart the Spark Interpreter (or restart the Zeppelin notebook in Ambari), so that the Python Remote Interpreters know about the freshly installed pandas and can import it. If you are running on a cluster, Zeppelin will run in yarn client mode and the Python Remote Interpreters are started on nodes other than the Zeppelin node. In this case, install pandas on all machines of your cluster and restart Zeppelin.
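To see which nodes are still missing pandas, a small probe can be run from a pyspark paragraph. This is only a sketch: the helper names `probe_module` and `probe_on_executors` are made up for illustration, and it assumes `sc` is the Spark Context that Zeppelin provides.

```python
import importlib
import socket


def probe_module(module_name):
    """Return (hostname, module_name, importable) for the current Python process."""
    try:
        importlib.import_module(module_name)
        importable = True
    except ImportError:
        importable = False
    return (socket.gethostname(), module_name, importable)


def probe_on_executors(sc, module_name, num_tasks=20):
    """Run the probe in num_tasks small tasks and collect the distinct results.

    sc is assumed to be an existing SparkContext. Spark decides where the
    tasks run, so with few tasks not every node is guaranteed to be covered.
    """
    return (sc.parallelize(range(num_tasks), num_tasks)
              .map(lambda _: probe_module(module_name))
              .distinct()
              .collect())
```

Calling `probe_on_executors(sc, "pandas")` in the notebook should list one tuple per host that ran a task; any host reporting `False` still needs pandas installed.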
11-24-2016
02:05 PM
Good to hear. By the way, it is good practice to accept the answer so that it is marked as resolved in the overview.
11-24-2016
09:50 AM
3 Kudos
By default, Atlas uses Basic Authentication. So use your Atlas user and password, e.g.
curl -s -u admin:admin http://atlas-server:21000/api/atlas/types
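The same call can be made from Python with the standard library only. This is a sketch of how Basic Authentication is put into the request header; the helper names `atlas_request` and `fetch` are illustrative, and the host and the admin:admin credentials are just the placeholders from the curl example.

```python
import base64
import urllib.request


def atlas_request(url, user, password):
    """Build a request carrying an HTTP Basic Authentication header."""
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token)
    return request


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder URL and credentials taken from the curl example.
request = atlas_request("http://atlas-server:21000/api/atlas/types", "admin", "admin")
```

`fetch(request)` would then return the body that curl prints, provided the Atlas server is reachable from where the script runs.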
11-23-2016
07:41 AM
I am not aware of Hive Version 1.5.0 (do you mean Hive View?). Anyhow, it works on Hive 1.2.1 (as of HDP 2.5):
$ beeline -u "jdbc:hive2://192.168.124.145:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n hive
Connecting to jdbc:hive2://192.168.124.145:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Connected to: Apache Hive (version 1.2.1000.2.5.0.0-1245)
Driver: Hive JDBC (version 1.2.1000.2.5.0.0-1245)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.5.0.0-1245 by Apache Hive
0: jdbc:hive2://192.168.124.145:2181/> select dept_name, md5(dept_name) from departments limit 1;
+-------------------+-----------------------------------+--+
| dept_name | _c1 |
+-------------------+-----------------------------------+--+
| Customer Service | d5552e0564007d93ff5937a9cb3bc491 |
+-------------------+-----------------------------------+--+
1 row selected (0.337 seconds)
and on Hive 2.1 (TP in HDP 2.5):
$ beeline -u "jdbc:hive2://192.168.124.145:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2" -n hive
Connecting to jdbc:hive2://192.168.124.145:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2
Connected to: Apache Hive (version 2.1.0.2.5.0.0-1245)
Driver: Hive JDBC (version 1.2.1000.2.5.0.0-1245)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.5.0.0-1245 by Apache Hive
0: jdbc:hive2://192.168.124.145:2181/> select dept_name, md5(dept_name) from departments limit 1;
+-------------------+-----------------------------------+--+
| dept_name | c1 |
+-------------------+-----------------------------------+--+
| Customer Service | d5552e0564007d93ff5937a9cb3bc491 |
+-------------------+-----------------------------------+--+
1 row selected (6.083 seconds)
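To cross-check a digest outside of Hive, the hash can be computed locally. This is a minimal sketch that assumes Hive's md5() hashes the UTF-8 bytes of the string and returns the lowercase hex digest; `md5_hex` is a name chosen for illustration.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of the UTF-8 encoded text as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Same input as in the beeline queries; compare the printed value with the result column.
print(md5_hex("Customer Service"))
```

If the assumption holds, the printed value should be the same as in both beeline result sets.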
11-22-2016
05:45 PM
1 Kudo
You should use HiveContext, see https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#hive-tables
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc1)
and then you can test access to your table. Also try
sqlContext.sql("show databases").show()
sqlContext.sql("show tables").show()
to see what you can access.
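A quick way to test access to a single table is to read one row and report the error if that fails. This is a sketch: `can_read` is a hypothetical helper, and `sql_context` is assumed to be the HiveContext created above.

```python
def can_read(sql_context, table_name):
    """Try to read one row from table_name.

    Returns (True, None) on success and (False, error message) otherwise.
    sql_context is assumed to be a HiveContext (or anything with a
    compatible sql() method).
    """
    try:
        sql_context.sql("select * from {} limit 1".format(table_name)).collect()
        return True, None
    except Exception as error:  # Spark raises different exception types here
        return False, str(error)
```

For example, `can_read(sqlContext, "departments")` returns the error text when the table is missing or not readable, which is usually enough to tell a permission problem from a wrong database.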
11-22-2016
08:59 AM
The driver will create the Spark Context. The driver can be the spark-shell, Zeppelin, or a standalone Spark application (see http://spark.apache.org/docs/1.6.2/quick-start.html#self-contained-applications to learn how to create a Spark Context in an application).

To distribute the execution you need to choose "yarn client" or "yarn cluster" mode (not local, which is the default), e.g.
spark-shell --master yarn --deploy-mode client --num-executors 3
This will create a driver with a Spark Context that controls 3 executors (they could be on 1, 2 or 3 machines, check "ps ax | grep Coarse").

When you now call sc.textFile(...), Spark will create an RDD by loading partitions of the file on each executor (hence afterwards it is already partitioned). Every further command will then run distributed.

So it is not up to you to bring the Spark Context to the executors; the Spark Context is used by the driver to distribute the load across all started executors. That's why I linked the Spark docs above. You need to first understand the Spark cluster modes. If you only start "spark-shell", it will not be distributed but run in "local" mode. Only in "yarn client" and "yarn cluster" mode will it be distributed.

Once a distributed Spark Context has been created in the driver, there is no need to care about it any more. Just use the context and Spark will distribute your load.

Readings:
Overview: http://spark.apache.org/docs/latest/cluster-overview.html
Spark Standalone: http://spark.apache.org/docs/latest/spark-standalone.html
Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
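For a standalone PySpark application the same idea could look roughly like this. It is a sketch under assumptions: the function names and the app name are made up, and the "yarn-client" master string is the Spark 1.6 spelling (newer versions use master "yarn" plus a deploy mode).

```python
def create_yarn_client_context(app_name="distributed-demo", num_executors=3):
    """Create the Spark Context in the driver, asking YARN for num_executors executors."""
    # Imported here so the module can be loaded on machines without pyspark.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")  # Spark 1.6 style master string
            .setAppName(app_name)
            .set("spark.executor.instances", str(num_executors)))
    return SparkContext(conf=conf)


def partition_summary(sc, path, min_partitions=3):
    """Load a text file as an RDD and return (number of partitions, number of lines).

    The driver only uses sc; the partitions are read and counted on the executors.
    """
    rdd = sc.textFile(path, min_partitions)
    return rdd.getNumPartitions(), rdd.count()
```

In the driver you would call `sc = create_yarn_client_context()` once and then pass `sc` to functions like `partition_summary`; nothing has to be shipped to the executors by hand.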
11-21-2016
02:43 PM
The Spark Context is the cluster coordinator; for details see http://spark.apache.org/docs/latest/cluster-overview.html
11-21-2016
01:31 PM
1 Kudo
1. Both "Hives" will run simultaneously with different JDBC URLs (using ZooKeeper for discovery)
jdbc:hive2://node1:2181,node2:2181,node3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
jdbc:hive2://node1:2181,node2:2181,node3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2
and Hive Server ports (10000, 10500).
2. Technically, "Hive" and "Hive Interactive" (2.1) run well on one cluster (different ports and folders). Be aware that Hive 2.1 is Tech Preview and that of course the machines need to be powerful enough (e.g. memory).
3. You can turn it off where you turned it on (under Hive Config, Interactive Query).
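Since the two URLs differ only in the ZooKeeper namespace, they can be generated from the quorum hosts. This is a small sketch; `hive_zk_jdbc_url` is a hypothetical helper and node1 to node3 are the placeholder hosts from above.

```python
def hive_zk_jdbc_url(zk_hosts, namespace, zk_port=2181):
    """Build a HiveServer2 JDBC URL that uses ZooKeeper service discovery."""
    quorum = ",".join("{}:{}".format(host, zk_port) for host in zk_hosts)
    return ("jdbc:hive2://{}/;serviceDiscoveryMode=zooKeeper;"
            "zooKeeperNamespace={}".format(quorum, namespace))


hosts = ["node1", "node2", "node3"]
hive_url = hive_zk_jdbc_url(hosts, "hiveserver2")                    # Hive 1.2.1
interactive_url = hive_zk_jdbc_url(hosts, "hiveserver2-hive2")       # Hive 2.1 (Interactive)
```

The resulting strings can be passed to beeline -u or any JDBC client.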
11-21-2016
11:59 AM
... and be careful: since ZIP is not splittable, every ZIP file will be read by exactly one mapper (low parallelism).
11-21-2016
11:58 AM
1 Kudo
ZIP files are not splittable and not a default Hadoop input format. You need an appropriate input format, see http://cutler.io/2012/07/hadoop-processing-zip-files-in-mapreduce/. I used it to load ZIP files with Spark (https://github.com/bernhard-42/spark-unzip).
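If a custom input format is not an option, a simpler alternative in PySpark is to read each archive as one binary blob and unpack it in Python. Note that this is a different technique from the input format linked above, not what the linked repository does; `unzip_entries` is a name chosen for illustration, and each archive has to fit into the memory of one executor.

```python
import io
import zipfile


def unzip_entries(path_and_bytes):
    """Yield (archive path + "!" + entry name, entry bytes) for every file in one ZIP archive.

    The input is a (path, content) pair, which is the record shape that
    sc.binaryFiles(...) produces.
    """
    path, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            yield (path + "!" + name, archive.read(name))
```

With an existing Spark Context this could be used as `sc.binaryFiles("/path/to/zips").flatMap(unzip_entries)`. Each archive is still handled by exactly one task, so the parallelism is limited by the number of ZIP files.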