Created on 04-26-2018 12:49 PM - edited 09-16-2022 06:09 AM
I'm trying to create a DataFrame like so:
kuduOptions = {"kudu.master":"my.master.server", "kudu.table":"myTable"}
df = sqlContext.read.options(kuduOptions).kudu
The above code is a "port" of Scala code; the Scala sample had kuduOptions defined as a Map.
I get an error stating "options expecting 1 parameter but was given 2".
How do you connect to Kudu via a PySpark SQLContext?
Created 05-01-2018 05:19 AM
@rams the error is expected, as the PySpark syntax differs from the Scala syntax.
For reference, here are the steps you'd need to query a Kudu table in pyspark2:
Create a Kudu table using impala-shell:
# impala-shell
CREATE TABLE test_kudu (id BIGINT PRIMARY KEY, s STRING)
PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;
INSERT INTO test_kudu VALUES (100, 'abc');
INSERT INTO test_kudu VALUES (101, 'def');
INSERT INTO test_kudu VALUES (102, 'ghi');
Launch pyspark2 with the kudu-spark2 artifact and query the Kudu table. (Tables created through Impala are exposed to Spark under the name impala::<database>.<table>, hence the kudu.table value below.)
# pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera3-SNAPSHOT
      /_/
Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)
SparkSession available as 'spark'.
>>> kuduDF = spark.read.format('org.apache.kudu.spark.kudu').option('kudu.master',"nightly512-1.xxx.xxx.com:7051").option('kudu.table',"impala::default.test_kudu").load()
>>> kuduDF.show(3)
+---+---+
| id| s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+
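To fix the snippet from the original question: in PySpark, DataFrameReader.options() takes keyword arguments rather than a single map, so the Scala Map becomes a Python dict unpacked with **, and the Scala-only .kudu shorthand becomes .format(...).load(). A minimal sketch (the master address and table name are placeholders to adapt):
>>> # dict stands in for the Scala Map; unpack it with ** into options()
>>> kuduOptions = {"kudu.master": "my.master.server:7051", "kudu.table": "impala::default.test_kudu"}
>>> df = spark.read.format('org.apache.kudu.spark.kudu').options(**kuduOptions).load()
>>> df.show(3)
The same call chain works on sqlContext.read in Spark 2.x, since SQLContext there is a thin wrapper over the SparkSession.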
For the record, the same thing can be achieved with the following commands in spark2-shell:
# spark2-shell --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
Spark context available as 'sc' (master = yarn, app id = application_1525159578660_0011).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera3-SNAPSHOT
      /_/
scala> import org.apache.kudu.spark.kudu._
import org.apache.kudu.spark.kudu._
scala> val df = spark.sqlContext.read.options(Map("kudu.master" -> "nightly512-1.xx.xxx.com:7051","kudu.table" -> "impala::default.test_kudu")).kudu
scala> df.show(3)
+---+---+
| id| s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+
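As a closing note on the "SQL Context" part of the question: once the DataFrame is loaded you can query it through Spark SQL by registering a temporary view. A minimal PySpark sketch, assuming the kuduDF from the pyspark2 session above:
>>> # register the Kudu-backed DataFrame so Spark SQL can see it
>>> kuduDF.createOrReplaceTempView("test_kudu")
>>> spark.sql("SELECT * FROM test_kudu WHERE id >= 101").show()
createOrReplaceTempView is the Spark 2.x replacement for the older SQLContext-era registerTempTable.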