
AttributeError in Spark


Hi,

The code below is not working in Spark 2.3, but it works in 1.7.

Can someone modify the code for Spark 2.3?

import os

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setAppName("data_import")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true"))

sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

df = sqlctx.load(
    source="jdbc",
    url="jdbc:sqlserver://10.24.40.29;database=CORE;username=user1;password=Passw0rd",
    dbtable="test")

## this is how to write to an ORC file
df.write.format("orc").save("/tmp/orc_query_output")

## this is how to write to a hive table
df.write.mode('overwrite').format('orc').saveAsTable("test")

Error: AttributeError: 'HiveContext' object has no attribute 'load'

1 ACCEPTED SOLUTION


@Debananda Sahoo

In Spark 2 you should use a SparkSession instead of a SparkContext/HiveContext. To read a JDBC data source, use the following code:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("data_import") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .enableHiveSupport() \
    .getOrCreate()

jdbcDF2 = spark.read \
    .jdbc("jdbc:sqlserver://10.24.40.29;database=CORE;username=user1;password=Passw0rd", "test")
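For reference, the same read can also be written with the option-style DataFrameReader API, which keeps the credentials out of the URL. This is only a sketch: it assumes a live SparkSession named `spark` and the SQL Server JDBC driver on the classpath, so the reader call itself is left commented.

```python
# Connection pieces pulled out of the URL (values taken from the question).
url = "jdbc:sqlserver://10.24.40.29;database=CORE"
props = {"user": "user1", "password": "Passw0rd"}

# Equivalent option-style read; uncomment with a live SparkSession `spark`:
# jdbcDF2 = (spark.read.format("jdbc")
#            .option("url", url)
#            .option("dbtable", "test")
#            .option("user", props["user"])
#            .option("password", props["password"])
#            .load())
```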

More information and examples on this link:

https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#jdbc-to-other-databases

Please let me know if that works for you.

HTH

*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.


4 REPLIES



Thanks, Felix, for your quick response. It worked. Thanks a lot.


@Felix Albani There is still an issue. The tables exist in Hive, but I am not able to access them. It shows the error below when I run a select * from the table.

java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1531811351810_0064_1_00, diagnostics=[Task failed, taskId=task_1531811351810_0064_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://sandbox-hdp.hortonworks.com:8020/apps/hive/warehouse/t_currency/part-00000-2feb31ba-70a4-40a0-a64f-e976b8dd587a-c000.snappy.parquet
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://sandbox-hdp.hortonworks.com:8020/apps/hive/warehouse/t_currency/part-00000-2feb31ba-70a4-40a0-a64f-e976b8dd587a-c000.snappy.parquet
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
	at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:674)
	at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:633)
	at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
	at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:405)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:124)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)


@Deb This looks to be related to Parquet encoding being handled differently in Spark than in Hive. Have you tried reading a different, non-Parquet table?

Try adding the following configuration for the Parquet table:

.config("spark.sql.parquet.writeLegacyFormat","true")
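As a sketch of where that property fits, it can be added to the SparkSession builder from the accepted answer. This assumes PySpark is installed and is not verified against this cluster:

```python
from pyspark.sql import SparkSession

# Sketch: set spark.sql.parquet.writeLegacyFormat on the session builder so
# Spark writes Parquet in the older format that Hive's reader expects.
spark = SparkSession \
    .builder \
    .appName("data_import") \
    .config("spark.sql.parquet.writeLegacyFormat", "true") \
    .enableHiveSupport() \
    .getOrCreate()
```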

If that does not work, please open a new thread for this issue and we can follow up there.

Thanks!