<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How do you connect to Kudu via PySpark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-do-you-connect-to-Kudu-via-PySpark/m-p/66854#M77765</link>
    <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/26757"&gt;@rams&lt;/a&gt;&amp;nbsp;the error is expected, as the PySpark syntax differs from that of Scala.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For reference, here are the steps needed to query a Kudu table in pyspark2.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Create a Kudu table using impala-shell&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;# impala-shell&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;CREATE TABLE test_kudu (id BIGINT PRIMARY KEY, s STRING)&amp;nbsp;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;&amp;nbsp;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;insert into test_kudu values (100, 'abc');&amp;nbsp;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;insert into test_kudu values (101, 'def');&amp;nbsp;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;insert into test_kudu values (102, 'ghi');&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Launch pyspark2 with the Kudu artifacts and query the table&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;# pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.4.0&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;[Spark ASCII banner] version 2.1.0.cloudera3-SNAPSHOT&lt;/P&gt;&lt;P&gt;Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)&lt;BR /&gt;SparkSession available as 'spark'.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt; kuduDF = spark.read.format('org.apache.kudu.spark.kudu').option('kudu.master',"nightly512-1.xxx.xxx.com:7051").option('kudu.table',"impala::default.test_kudu").load()&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;gt;&amp;gt;&amp;gt; kuduDF.show(3)&lt;/P&gt;&lt;P&gt;+---+---+&lt;BR /&gt;| id| s|&lt;BR /&gt;+---+---+&lt;BR /&gt;|100|abc|&lt;BR /&gt;|101|def|&lt;BR /&gt;|102|ghi|&lt;BR /&gt;+---+---+&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;For the record, the same thing can be achieved with the following commands in spark2-shell&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;# spark2-shell --packages org.apache.kudu:kudu-spark2_2.11:1.4.0&lt;/P&gt;&lt;P&gt;Spark context available as 'sc' (master = yarn, app id = application_1525159578660_0011).&lt;BR /&gt;Spark session available as 'spark'.&lt;BR /&gt;Welcome to&lt;BR /&gt;[Spark ASCII banner] version 2.1.0.cloudera3-SNAPSHOT&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;scala&amp;gt; import org.apache.kudu.spark.kudu._&lt;BR /&gt;import org.apache.kudu.spark.kudu._&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;scala&amp;gt; val df = spark.sqlContext.read.options(Map("kudu.master" -&amp;gt; "nightly512-1.xx.xxx.com:7051","kudu.table" -&amp;gt; "impala::default.test_kudu")).kudu&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;scala&amp;gt; df.show(3)&lt;/P&gt;&lt;P&gt;+---+---+&lt;BR /&gt;| id| s|&lt;BR /&gt;+---+---+&lt;BR /&gt;|100|abc|&lt;BR /&gt;|101|def|&lt;BR /&gt;|102|ghi|&lt;BR /&gt;+---+---+&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 01 May 2018 12:19:15 GMT</pubDate>
    <dc:creator>AutoIN</dc:creator>
    <dc:date>2018-05-01T12:19:15Z</dc:date>
    <item>
      <title>How do you connect to Kudu via PySpark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-do-you-connect-to-Kudu-via-PySpark/m-p/66765#M77764</link>
      <description>&lt;P&gt;I'm trying to create a DataFrame like so:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;kuduOptions = {"kudu.master":"my.master.server", "kudu.table":"myTable"}&lt;/P&gt;&lt;P&gt;df = sqlContext.read.options(kuduOptions).kudu&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The above code is a "port" of Scala code; the Scala sample had kuduOptions defined as a Map.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I get an error stating "options expecting 1 parameter but was given 2".&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How do you connect to Kudu via a PySpark SQL context?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 13:09:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-do-you-connect-to-Kudu-via-PySpark/m-p/66765#M77764</guid>
      <dc:creator>rams</dc:creator>
      <dc:date>2022-09-16T13:09:03Z</dc:date>
    </item>
    <item>
      <title>Re: How do you connect to Kudu via PySpark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-do-you-connect-to-Kudu-via-PySpark/m-p/66854#M77765</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/26757"&gt;@rams&lt;/a&gt;&amp;nbsp;the error is expected, as the PySpark syntax differs from that of Scala.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For reference, here are the steps needed to query a Kudu table in pyspark2.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Create a Kudu table using impala-shell&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;# impala-shell&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;CREATE TABLE test_kudu (id BIGINT PRIMARY KEY, s STRING)&amp;nbsp;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;&amp;nbsp;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;insert into test_kudu values (100, 'abc');&amp;nbsp;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;insert into test_kudu values (101, 'def');&amp;nbsp;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;insert into test_kudu values (102, 'ghi');&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Launch pyspark2 with the Kudu artifacts and query the table&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;# pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.4.0&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;[Spark ASCII banner] version 2.1.0.cloudera3-SNAPSHOT&lt;/P&gt;&lt;P&gt;Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)&lt;BR /&gt;SparkSession available as 'spark'.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt; kuduDF = spark.read.format('org.apache.kudu.spark.kudu').option('kudu.master',"nightly512-1.xxx.xxx.com:7051").option('kudu.table',"impala::default.test_kudu").load()&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;gt;&amp;gt;&amp;gt; kuduDF.show(3)&lt;/P&gt;&lt;P&gt;+---+---+&lt;BR /&gt;| id| s|&lt;BR /&gt;+---+---+&lt;BR /&gt;|100|abc|&lt;BR /&gt;|101|def|&lt;BR /&gt;|102|ghi|&lt;BR /&gt;+---+---+&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;For the record, the same thing can be achieved with the following commands in spark2-shell&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;# spark2-shell --packages org.apache.kudu:kudu-spark2_2.11:1.4.0&lt;/P&gt;&lt;P&gt;Spark context available as 'sc' (master = yarn, app id = application_1525159578660_0011).&lt;BR /&gt;Spark session available as 'spark'.&lt;BR /&gt;Welcome to&lt;BR /&gt;[Spark ASCII banner] version 2.1.0.cloudera3-SNAPSHOT&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;scala&amp;gt; import org.apache.kudu.spark.kudu._&lt;BR /&gt;import org.apache.kudu.spark.kudu._&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;scala&amp;gt; val df = spark.sqlContext.read.options(Map("kudu.master" -&amp;gt; "nightly512-1.xx.xxx.com:7051","kudu.table" -&amp;gt; "impala::default.test_kudu")).kudu&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;scala&amp;gt; df.show(3)&lt;/P&gt;&lt;P&gt;+---+---+&lt;BR /&gt;| id| s|&lt;BR /&gt;+---+---+&lt;BR /&gt;|100|abc|&lt;BR /&gt;|101|def|&lt;BR /&gt;|102|ghi|&lt;BR /&gt;+---+---+&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 01 May 2018 12:19:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-do-you-connect-to-Kudu-via-PySpark/m-p/66854#M77765</guid>
      <dc:creator>AutoIN</dc:creator>
      <dc:date>2018-05-01T12:19:15Z</dc:date>
    </item>
  </channel>
</rss>