
How do you connect to Kudu via PySpark



I'm trying to create a DataFrame like so:

 

kuduOptions = {"kudu.master":"my.master.server", "kudu.table":"myTable"}

df = sqlContext.read.options(kuduOptions).kudu

 

The above code is a port of Scala code; the Scala sample defined kuduOptions as a Map.

 

I get an error stating "options expecting 1 parameter but was given 2".

 

How do you connect to Kudu via PySpark SQL Context? 
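The error comes from how the two APIs differ: PySpark's DataFrameReader.options() accepts keyword arguments, not a single positional dict the way Scala's options(Map) does, so the dict has to be unpacked with **. A minimal stand-in function (not the Spark API itself) showing the difference:

```python
def options(**opts):
    # Stand-in for PySpark's DataFrameReader.options(), which accepts
    # keyword arguments only -- not a single positional dict.
    return opts

kuduOptions = {"kudu.master": "my.master.server", "kudu.table": "myTable"}

try:
    options(kuduOptions)  # positional dict -> TypeError, like the reported error
except TypeError as e:
    print("positional dict fails:", e)

result = options(**kuduOptions)  # unpacking the dict into keyword arguments works
print(result)
```

With a real SparkSession this would presumably become spark.read.format('org.apache.kudu.spark.kudu').options(**kuduOptions).load(); the .kudu shortcut comes from Scala implicits and is not available in Python.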

1 ACCEPTED SOLUTION


Re: How do you connect to Kudu via PySpark


@rams the error is expected, as the PySpark syntax differs from Scala's.

 

For reference, here are the steps you'd need to query a Kudu table in pyspark2.

 

Create a Kudu table using impala-shell:

# impala-shell 

CREATE TABLE test_kudu (id BIGINT PRIMARY KEY, s STRING) 
PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU; 
insert into test_kudu values (100, 'abc'); 
insert into test_kudu values (101, 'def'); 
insert into test_kudu values (102, 'ghi'); 

 

Launch pyspark2 with the Kudu artifacts and query the table:

# pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.4.0

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera3-SNAPSHOT
      /_/

Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)
SparkSession available as 'spark'.


>>> kuduDF = spark.read.format('org.apache.kudu.spark.kudu').option('kudu.master',"nightly512-1.xxx.xxx.com:7051").option('kudu.table',"impala::default.test_kudu").load()

 

>>> kuduDF.show(3)

+---+---+
| id| s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+
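The chained .option() calls above build the same configuration as a single .options(**dict) call, so the dict from the original question can be reused directly. A toy builder (hypothetical, only mimicking DataFrameReader's chaining pattern) to illustrate the equivalence:

```python
class ToyReader:
    # Hypothetical stand-in for DataFrameReader, used only to show that
    # chained .option() calls and one .options(**dict) call are equivalent.
    def __init__(self):
        self._opts = {}

    def option(self, key, value):
        self._opts[key] = value
        return self  # return self so calls can be chained

    def options(self, **opts):
        self._opts.update(opts)
        return self

chained = (ToyReader()
           .option('kudu.master', 'nightly512-1.xxx.xxx.com:7051')
           .option('kudu.table', 'impala::default.test_kudu'))

unpacked = ToyReader().options(**{'kudu.master': 'nightly512-1.xxx.xxx.com:7051',
                                  'kudu.table': 'impala::default.test_kudu'})

assert chained._opts == unpacked._opts
```

So the pyspark2 line above could equally be written with a single .options(**kuduOptions) call before .load().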

 

For the record, the same thing can be achieved with the following commands in spark2-shell:

 

# spark2-shell --packages org.apache.kudu:kudu-spark2_2.11:1.4.0

Spark context available as 'sc' (master = yarn, app id = application_1525159578660_0011).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera3-SNAPSHOT
      /_/
 

scala> import org.apache.kudu.spark.kudu._
import org.apache.kudu.spark.kudu._

 

scala> val df = spark.sqlContext.read.options(Map("kudu.master" -> "nightly512-1.xx.xxx.com:7051","kudu.table" -> "impala::default.test_kudu")).kudu

 

scala> df.show(3)

+---+---+
| id| s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+

 
