How do you connect to Kudu via PySpark

Explorer

I'm trying to create a DataFrame like so:

 

kuduOptions = {"kudu.master":"my.master.server", "kudu.table":"myTable"}

df = sqlContext.read.options(kuduOptions).kudu

 

The above code is a port of a Scala sample, where kuduOptions was defined as a Map.

I get an error stating "options expecting 1 parameter but was given 2".

 

How do you connect to Kudu via PySpark SQL Context? 

1 ACCEPTED SOLUTION

Master Collaborator

@rams the error is correct, as the PySpark syntax differs from Scala's: PySpark's DataFrameReader.options() accepts only keyword arguments (its signature is options(**options)), so passing a dict as a positional argument fails, and the .kudu shortcut provided by the Scala implicits does not exist in Python.

 

For reference, here are the steps you'd need to query a Kudu table in pyspark2:

 

Create a Kudu table using impala-shell:

# impala-shell 

CREATE TABLE test_kudu (id BIGINT PRIMARY KEY, s STRING) 
PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU; 
insert into test_kudu values (100, 'abc'); 
insert into test_kudu values (101, 'def'); 
insert into test_kudu values (102, 'ghi'); 

 

Launch pyspark2 with the kudu-spark2 artifact and query the Kudu table. Tables created through Impala are exposed under the impala:: prefix, hence the table name impala::default.test_kudu below.

# pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.4.0

(Spark welcome banner) version 2.1.0.cloudera3-SNAPSHOT

Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)
SparkSession available as 'spark'.


>>> kuduDF = spark.read.format('org.apache.kudu.spark.kudu').option('kudu.master',"nightly512-1.xxx.xxx.com:7051").option('kudu.table',"impala::default.test_kudu").load()

 

>>> kuduDF.show(3)

+---+---+
| id| s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+
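
 

To port the dict-based syntax from the original question directly, unpack the dict into options() with **. This is a minimal sketch using the same placeholder master address and table name as above; the temp-view query at the end is only an optional illustration, not from the original post:

>>> kuduOptions = {"kudu.master": "nightly512-1.xxx.xxx.com:7051", "kudu.table": "impala::default.test_kudu"}
>>> kuduDF = spark.read.format('org.apache.kudu.spark.kudu').options(**kuduOptions).load()
>>> kuduDF.createOrReplaceTempView("test_kudu")
>>> spark.sql("SELECT * FROM test_kudu WHERE id >= 101").show(3)

The same works with sqlContext.read in place of spark.read, since both return a DataFrameReader.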

 

For the record, the same thing can be achieved using the following commands in spark2-shell:

 

# spark2-shell --packages org.apache.kudu:kudu-spark2_2.11:1.4.0

Spark context available as 'sc' (master = yarn, app id = application_1525159578660_0011).
Spark session available as 'spark'.
(Spark welcome banner) version 2.1.0.cloudera3-SNAPSHOT

 

scala> import org.apache.kudu.spark.kudu._
import org.apache.kudu.spark.kudu._

 

scala> val df = spark.sqlContext.read.options(Map("kudu.master" -> "nightly512-1.xx.xxx.com:7051","kudu.table" -> "impala::default.test_kudu")).kudu

 

scala> df.show(3)

+---+---+
| id| s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+

 

