
AttributeError in Spark


Hi,

The code below is not working in Spark 2.3, but it works in 1.7.

Can someone modify the code for Spark 2.3?

import os

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setAppName("data_import")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true"))

sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

df = sqlctx.load(
    source="jdbc",
    url="jdbc:sqlserver://10.24.40.29;database=CORE;username=user1;password=Passw0rd",
    dbtable="test")

## this is how to write to an ORC file
df.write.format("orc").save("/tmp/orc_query_output")

## this is how to write to a hive table
df.write.mode('overwrite').format('orc').saveAsTable("test")

Error: AttributeError: 'HiveContext' object has no attribute 'load'

1 ACCEPTED SOLUTION


@Debananda Sahoo

In Spark 2 you should use a SparkSession instead of a SparkContext/HiveContext. To read a JDBC data source, use the following code:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("data_import") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .enableHiveSupport() \
    .getOrCreate()

jdbcDF2 = spark.read \
    .jdbc("jdbc:sqlserver://10.24.40.29;database=CORE;username=user1;password=Passw0rd", "test")
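For reference, the same read can also be written with the option-style DataFrameReader API, which keeps the credentials out of the URL. This is only a sketch: it assumes a live SparkSession named `spark` and the SQL Server JDBC driver on the classpath, so the reader call itself is left commented.

```python
# Connection pieces pulled out of the URL (values taken from the question).
url = "jdbc:sqlserver://10.24.40.29;database=CORE"
props = {"user": "user1", "password": "Passw0rd"}

# Equivalent option-style read; uncomment with a live SparkSession `spark`:
# jdbcDF2 = (spark.read.format("jdbc")
#            .option("url", url)
#            .option("dbtable", "test")
#            .option("user", props["user"])
#            .option("password", props["password"])
#            .load())
```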

More information and examples on this link:

https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#jdbc-to-other-databases

Please let me know if that works for you.

HTH

*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.


4 REPLIES



Thanks, Felix, for your quick response. It worked. Thanks a lot.


@Felix Albani There is still an issue. The tables exist in Hive, but I am not able to access them. It shows the error below when I run a select * from the table.

java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1531811351810_0064_1_00, diagnostics=[Task failed, taskId=task_1531811351810_0064_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://sandbox-hdp.hortonworks.com:8020/apps/hive/warehouse/t_currency/part-00000-2feb31ba-70a4-40a0-a64f-e976b8dd587a-c000.snappy.parquet
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://sandbox-hdp.hortonworks.com:8020/apps/hive/warehouse/t_currency/part-00000-2feb31ba-70a4-40a0-a64f-e976b8dd587a-c000.snappy.parquet
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
	at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:674)
	at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:633)
	at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
	at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:405)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:124)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)


@Deb This looks to be related to Parquet encoding being handled differently in Spark than in Hive. Have you tried reading a different, non-Parquet table?

Try adding the following configuration for the Parquet table:

.config("spark.sql.parquet.writeLegacyFormat","true")
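As a sketch of where that property fits, it can be added to the SparkSession builder from the accepted answer. This assumes PySpark is installed and is not verified against this cluster:

```python
from pyspark.sql import SparkSession

# Sketch: set spark.sql.parquet.writeLegacyFormat on the session builder so
# Spark writes Parquet in the older format that Hive's reader expects.
spark = SparkSession \
    .builder \
    .appName("data_import") \
    .config("spark.sql.parquet.writeLegacyFormat", "true") \
    .enableHiveSupport() \
    .getOrCreate()
```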

If that does not work, please open a new thread for this issue and we can follow up there.

Thanks!