Member since: 10-07-2015
Posts: 107
Kudos Received: 73
Solutions: 23
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 2520 | 02-23-2017 04:57 PM |
| 1972 | 12-08-2016 09:55 AM |
| 8856 | 11-24-2016 07:24 PM |
| 3957 | 11-24-2016 02:17 PM |
| 9315 | 11-24-2016 09:50 AM |
11-24-2016
02:17 PM
1 Kudo
You might need to restart the Spark Interpreter (or restart the Zeppelin notebook in Ambari) so that the Python Remote Interpreters know about the freshly installed pandas and can import it. If you are running on a cluster, Zeppelin will run in yarn-client mode and the Python Remote Interpreters are started on other nodes than the Zeppelin node. In this case install pandas on all machines of your cluster and restart Zeppelin.
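A quick way to verify this after the restart (a minimal sketch, assuming a %pyspark paragraph in Zeppelin) is simply:

# run in a %pyspark paragraph after restarting the interpreter;
# if the import succeeds, the Python Remote Interpreter sees the new pandas installation
import pandas as pd
print(pd.__version__)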
11-24-2016
02:05 PM
Good to hear. By the way, it is good practice to accept the answer so that it is marked as resolved in the overview.
11-24-2016
09:50 AM
3 Kudos
By default, Atlas uses Basic Authentication, so use your Atlas user and password, e.g.:

curl -s -u admin:admin http://atlas-server:21000/api/atlas/types
11-23-2016
07:41 AM
I am not aware of Hive version 1.5.0 (do you mean Hive View?). Anyhow, it works on Hive 1.2.1 (as of HDP 2.5):

$ beeline -u "jdbc:hive2://192.168.124.145:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n hive
Connecting to jdbc:hive2://192.168.124.145:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Connected to: Apache Hive (version 1.2.1000.2.5.0.0-1245)
Driver: Hive JDBC (version 1.2.1000.2.5.0.0-1245)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.5.0.0-1245 by Apache Hive
0: jdbc:hive2://192.168.124.145:2181/> select dept_name, md5(dept_name) from departments limit 1;
+-------------------+-----------------------------------+--+
| dept_name | _c1 |
+-------------------+-----------------------------------+--+
| Customer Service | d5552e0564007d93ff5937a9cb3bc491 |
+-------------------+-----------------------------------+--+
1 row selected (0.337 seconds)

and on Hive 2.1 (Tech Preview in HDP 2.5):

$ beeline -u "jdbc:hive2://192.168.124.145:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2" -n hive
Connecting to jdbc:hive2://192.168.124.145:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2
Connected to: Apache Hive (version 2.1.0.2.5.0.0-1245)
Driver: Hive JDBC (version 1.2.1000.2.5.0.0-1245)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.5.0.0-1245 by Apache Hive
0: jdbc:hive2://192.168.124.145:2181/> select dept_name, md5(dept_name) from departments limit 1;
+-------------------+-----------------------------------+--+
| dept_name | _c1 |
+-------------------+-----------------------------------+--+
| Customer Service | d5552e0564007d93ff5937a9cb3bc491 |
+-------------------+-----------------------------------+--+
1 row selected (6.083 seconds)
11-22-2016
05:45 PM
1 Kudo
You should use HiveContext, see https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#hive-tables

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

and then you can test access to your table. Also try

sqlContext.sql("show databases").show()
sqlContext.sql("show tables").show()

to see what you can access.
11-22-2016
08:59 AM
The driver will create the Spark Context. The driver can be the spark-shell, Zeppelin, or a standalone Spark application (see http://spark.apache.org/docs/1.6.2/quick-start.html#self-contained-applications to learn how to create a Spark Context in an application).

To distribute the execution you need to choose "yarn client" or "yarn cluster" mode (not local, which is the default):

spark-shell --master yarn --deploy-mode client --num-executors 3

This creates a driver with a Spark Context that controls 3 executors (which could be on 1, 2 or 3 machines; check with "ps ax | grep Coarse"). When you now call sc.textFile(...), Spark creates an RDD by loading partitions of the file on each executor (hence it is already partitioned afterwards). Every further command will then run distributed.

So it is not up to you to bring the Spark Context to the executors; rather, the Spark Context is used by the driver to distribute the load across all started executors. That's why I linked the Spark docs above: you first need to understand the Spark cluster modes. If you only start "spark-shell", it will not be distributed but run in local ("standalone") mode. Only in "yarn client" and "yarn cluster" mode will it be distributed. Once a distributed Spark Context has been created in the driver, there is no need to care about it any more: just use the context and Spark will distribute your load.

Readings:
Overview: http://spark.apache.org/docs/latest/cluster-overview.html
Spark Standalone: http://spark.apache.org/docs/latest/spark-standalone.html
Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
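For illustration (a minimal pyspark sketch, not from the original answer; the HDFS path is hypothetical), you can see that the RDD created in the driver is already partitioned across the executors:

# assumes pyspark was started with: --master yarn --deploy-mode client --num-executors 3
rdd = sc.textFile("/tmp/input.txt")            # hypothetical HDFS path; loaded as partitions on the executors
print(rdd.getNumPartitions())                  # the partitions are spread over the executors
print(rdd.map(lambda line: len(line)).sum())   # this action runs distributed, coordinated by the driver's Spark Context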
11-21-2016
02:43 PM
The Spark Context is the cluster coordinator; for details see http://spark.apache.org/docs/latest/cluster-overview.html
11-21-2016
01:31 PM
1 Kudo
1. Both "Hives" will run simultaneously with different jdbc URLs (using zookeeper for discovery) jdbc:hive2://node1:2181,node2:2181,node3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
jdbc:hive2://node1:2181,node2:2181,node3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2 and Hive Server ports (10000, 10500) 2. Technically "Hive" and "Hive Interactive" (2.1) run well on one cluster (different ports and folders). Be aware that Hive 2.1 is Tech Preview and that of course the machines need to be powerful enough (e.g. memory) 3. You can turn it off where you turned it on (under Hive Config, Interactive Query)
11-21-2016
11:59 AM
... and be careful: since ZIP is not splittable, every ZIP file will be read by exactly one mapper (low parallelism).
11-21-2016
11:58 AM
1 Kudo
ZIP files are not splittable and not a default Hadoop input format. You need an appropriate input format, see http://cutler.io/2012/07/hadoop-processing-zip-files-in-mapreduce/ . I used it to load ZIP files with Spark (https://github.com/bernhard-42/spark-unzip).
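For illustration only (a minimal pyspark sketch, not the spark-unzip implementation; the path and the assumption of UTF-8 text entries are made up), ZIP archives can also be unpacked via sc.binaryFiles:

import io
import zipfile

# each ZIP file becomes one (path, bytes) element of the RDD,
# so parallelism is limited to the number of ZIP files (they are not splittable)
def unzip_entries(path_and_bytes):
    path, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            yield (name, zf.read(name).decode("utf-8"))  # assumes UTF-8 text entries

lines = sc.binaryFiles("/tmp/archives/*.zip").flatMap(unzip_entries)
print(lines.count())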