New Contributor
Accepted Solution

How do you connect to Kudu via PySpark

I'm trying to create a DataFrame like so:

 

kuduOptions = {"kudu.master": "my.master.server", "kudu.table": "myTable"}

df = sqlContext.read.options(kuduOptions).kudu

The above is a "port" of Scala code; the Scala sample defined kuduOptions as a Map.

 

I get an error stating "options expecting 1 parameter but was given 2".

 

How do you connect to Kudu via PySpark SQL Context? 

Cloudera Employee

Re: How do you connect to Kudu via PySpark

@rams the error is expected, as the PySpark syntax differs from Scala's.
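To see why the positional-dict call fails: PySpark's `DataFrameReader.options` is declared roughly as `options(self, **options)`, i.e. it takes keyword arguments, not a single dict. Here is a minimal sketch using a plain stand-in function (hypothetical, not the real PySpark class) to illustrate the unpacking fix:

```python
# Stand-in for DataFrameReader.options, which accepts keyword arguments only.
def options(**opts):
    return opts

kuduOptions = {"kudu.master": "my.master.server", "kudu.table": "myTable"}

# options(kuduOptions)         # TypeError: takes 0 positional arguments but 1 was given
opts = options(**kuduOptions)  # unpack the dict into keyword arguments instead
print(opts["kudu.table"])      # myTable
```

With the real reader the equivalent fix would be `sqlContext.read.format('org.apache.kudu.spark.kudu').options(**kuduOptions).load()`; note that the `.kudu` shorthand is a Scala implicit and is not available in Python.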

 

For reference, here are the steps you'd need to query a Kudu table in pyspark2.

 

Create a Kudu table using impala-shell:

# impala-shell 

CREATE TABLE test_kudu (id BIGINT PRIMARY KEY, s STRING) 
PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU; 
insert into test_kudu values (100, 'abc'); 
insert into test_kudu values (101, 'def'); 
insert into test_kudu values (102, 'ghi'); 

 

Launch pyspark2 with the Kudu Spark artifacts and query the table:

# pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.4.0

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera3-SNAPSHOT
      /_/

Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)
SparkSession available as 'spark'.


>>> kuduDF = spark.read.format('org.apache.kudu.spark.kudu') \
...     .option('kudu.master', "nightly512-1.xxx.xxx.com:7051") \
...     .option('kudu.table', "impala::default.test_kudu") \
...     .load()

 

>>> kuduDF.show(3)

+---+---+
| id| s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+
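One detail worth noting in the transcript above: tables created through impala-shell are exposed to Kudu under the name impala::&lt;database&gt;.&lt;table&gt;, which is why kudu.table is set to "impala::default.test_kudu" rather than just "test_kudu". A small sketch (the helper name is hypothetical) of assembling the options dict accordingly:

```python
# Hypothetical helper: build the Kudu-side name of an Impala-created table.
def impala_kudu_table_name(database, table):
    return "impala::{}.{}".format(database, table)

kuduOptions = {
    "kudu.master": "nightly512-1.xxx.xxx.com:7051",  # host:port of the Kudu master
    "kudu.table": impala_kudu_table_name("default", "test_kudu"),
}

# kuduDF = spark.read.format('org.apache.kudu.spark.kudu').options(**kuduOptions).load()
print(kuduOptions["kudu.table"])  # impala::default.test_kudu
```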

 

For the record, the same thing can be achieved using the following commands in spark2-shell:

 

# spark2-shell --packages org.apache.kudu:kudu-spark2_2.11:1.4.0

Spark context available as 'sc' (master = yarn, app id = application_1525159578660_0011).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera3-SNAPSHOT
      /_/
 

scala> import org.apache.kudu.spark.kudu._
import org.apache.kudu.spark.kudu._

 

scala> val df = spark.sqlContext.read.options(Map("kudu.master" -> "nightly512-1.xx.xxx.com:7051","kudu.table" -> "impala::default.test_kudu")).kudu

 

scala> df.show(3)

+---+---+
| id| s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+

 
