
Spark to Phoenix

Guru

Trying to connect Spark with Phoenix using JDBC. I appended the location of phoenix-client.jar to SPARK_CLASSPATH in spark-env.sh.

When I launch Spark shell, I get the following errors:

<console>:10: error: not found: value sqlContext
       import sqlContext.implicits._
              ^
<console>:10: error: not found: value sqlContext
       import sqlContext.sql
1 ACCEPTED SOLUTION

@Vedant Jain

The example below works with Sandbox 2.3.2.

PS: Note that I haven't changed the classpath; I only used the --jars option.

from shell:

spark-shell --master yarn-client --jars /usr/hdp/current/phoenix-client/phoenix-client.jar

inside spark-shell:

//option 1, read table
val jdbcDF = sqlContext.read.format("jdbc").options( 
  Map(
  "driver" -> "org.apache.phoenix.jdbc.PhoenixDriver",
  "url" -> "jdbc:phoenix:sandbox.hortonworks.com:2181:/hbase-unsecure",
  "dbtable" -> "TABLE1")).load()
  
jdbcDF.show


//option 2, read custom query
import java.sql.{Connection, DriverManager, DatabaseMetaData, ResultSet}
import org.apache.spark.rdd.JdbcRDD

def getConn(driverClass: => String, connStr: => String, user: => String, pass: => String): Connection = {
  var conn:Connection = null
  try{
    Class.forName(driverClass)
     conn = DriverManager.getConnection(connStr, user, pass)
  }catch{ case e: Exception => e.printStackTrace }
  conn
}

val myRDD = new JdbcRDD( sc, () => getConn("org.apache.phoenix.jdbc.PhoenixDriver", "jdbc:phoenix:localhost:2181:/hbase-unsecure", "", "") ,
"select sum(10) from TABLE1 where ? <= id and id <= ?",
1, 10, 2)
myRDD.take(10)

val myRDD = new JdbcRDD( sc, () => getConn("org.apache.phoenix.jdbc.PhoenixDriver", "jdbc:phoenix:localhost:2181:/hbase-unsecure", "", "") ,
"select col1 from TABLE1 where ? <= id and id <= ?",
1, 10, 2)

myRDD.take(10)

Also note that the Phoenix team recommends using the Phoenix Spark plugin instead of JDBC directly: http://phoenix.apache.org/phoenix_spark.html

Here is an example using the Phoenix Spark plugin:

from shell:

spark-shell --master yarn-client --jars /usr/hdp/current/phoenix-client/phoenix-client.jar,/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.4.0.2.3.2.0-2950.jar --conf "spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar"

inside spark-shell:

import org.apache.phoenix.spark._

val df = sqlContext.load(
  "org.apache.phoenix.spark",
  Map("table" -> "TABLE1", "zkUrl" -> "localhost:2181:/hbase-unsecure")
)
df.show

And here is a sample project that can be built and run with spark-submit:

https://github.com/gbraccialli/SparkUtils

git clone https://github.com/gbraccialli/SparkUtils
cd SparkUtils/
mvn clean package
spark-submit --class com.github.gbraccialli.spark.PhoenixSparkSample target/SparkUtils-1.0.0-SNAPSHOT.jar

Also check @Randy Gelhausen's project, which uses Phoenix Spark to automatically load data from Hive to Phoenix:

https://github.com/randerzander/HiveToPhoenix (I copied my pom.xml from Randy's project)


12 Replies

Master Collaborator

Is this the full stack trace, or is there more?

Explorer

Can you please share the code and full error log?

Guru

See the attached stack trace (spark-phoenix-stack-trace.txt). I ran:

spark-submit --master yarn

Master Collaborator

This looks like a classpath issue. I suspect your phoenix-client.jar is packaged with classes compiled against a different Jackson jar than what it finds on the classpath.
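
A quick way to confirm this kind of conflict (a rough sketch; the Jackson class below is only an example to probe) is to ask the spark-shell REPL which jar the class is actually loaded from:

// Rough diagnostic sketch: print which jar a Jackson class resolves from.
// ObjectMapper is just an example class; if it resolves to the phoenix-client jar
// rather than Spark's own Jackson, the two versions are clashing.
import com.fasterxml.jackson.databind.ObjectMapper
println(classOf[ObjectMapper].getProtectionDomain.getCodeSource.getLocation)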

Guru

It works fine ... if I don't modify the Spark classpath

Explorer

Looks like some conflict between the Spark and Phoenix jars. No?

Googling the details in the stack trace, it looks related to Jackson. I'm not familiar with Phoenix - does it use its own version?

Hi @Vedant Jain,

How did you fix this issue? I have HDP 2.4.0 installed and I used the following command to invoke the spark-shell:

spark-shell --master yarn-client --jars /usr/hdp/current/phoenix-client/phoenix-client.jar,/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.4.0.2.4.0.0-169.jar

I didn't change anything in the default configurations, just installed Spark and tried it, and got the exception below:

Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row

Any thoughts would be great. Also, I tried what @Guilherme Braccialli said; it didn't work either. Is it a bug in HDP 2.4.0?


New Contributor

The solution doesn't work for HDP 2.4.0. Running it, I get this error when calling df.show:

16/05/12 00:53:21 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 8, usfit-hdpdev-n02.global.internal): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row

If I use pyspark it also dies with a different stack trace.

I attempted to follow the directions at http://phoenix.apache.org/phoenix_spark.html and updated the Spark config to include the Phoenix client jar in both ‘spark.executor.extraClassPath’ and ‘spark.driver.extraClassPath’. spark-shell then dies with a pretty long stack trace when attempting to set up the SQLContext. Same thing for pyspark.

It appears there are some incompatibilities between the Phoenix jars and the Spark jars as built in HDP. Does anyone have a workaround?

thanks!

Super Collaborator

@Chris Tarnas

You are right, this is broken on HDP 2.4. But it still works as long as you don't go for the DataFrame option; that one fails. It needs Phoenix 4.7.0.0 to work, which will be available in HDP 2.5.

The Phoenix Spark connector for getting HBase data into an RDD (not a DataFrame!) works for me.

An update to @Guilherme Braccialli's excellent guidance, to make the Phoenix Spark RDD option work on the 2.4 Sandbox:

spark-shell --master yarn-client --jars /usr/hdp/2.4.0.0-169/phoenix/phoenix-4.4.0.2.4.0.0-169-client.jar,/usr/hdp/2.4.0.0-169/phoenix/lib/phoenix-spark-4.4.0.2.4.0.0-169.jar --conf "spark.executor.extraClassPath=/usr/hdp/2.4.0.0-169/phoenix/lib/phoenix-spark-4.4.0.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/phoenix/phoenix-4.4.0.2.4.0.0-169-client.jar"
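
Inside that spark-shell session, a minimal sketch of the RDD route with the phoenix-spark connector (the table name, column list, and zkUrl are placeholders matching the Sandbox examples above; adjust them for your cluster):

import org.apache.phoenix.spark._

// Load a Phoenix table into an RDD of column-name -> value maps (no DataFrame involved).
// "TABLE1", the column list, and the zkUrl are placeholders.
val rdd = sc.phoenixTableAsRDD(
  "TABLE1",
  Seq("ID", "COL1"),
  zkUrl = Some("localhost:2181:/hbase-unsecure")
)

rdd.take(10).foreach(println)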

Rising Star

Got this working in Zeppelin. Similar answer to @Guilherme Braccialli's, but translated for Zeppelin users.

1. Run this block first

%dep 
z.reset 
z.load("/usr/hdp/current/phoenix-client/phoenix-4.4.0.2.3.2.0-2950-client.jar") 
z.load("/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.4.0.2.3.2.0-2950.jar")

2. Then run this in a separate snippet:

import org.apache.phoenix.spark._
val df = sqlc.load(
  "org.apache.phoenix.spark",
  Map("table" -> "TRANSACTIONHISTORY", 
      "zkUrl" -> "sandbox.hortonworks.com:2181:/hbase-unsecure")
)
df.show

New Contributor

Tried the above lines in the HDP 2.4 sandbox in Zeppelin; it's not working.

(Error screenshot attached: error.png)
