
Spark SQL join fails parse on temporary variable, works on real (orc) table



Attachment: hdpjoinscalatempvariable.zip

Hello, can anyone see what I am doing incorrectly? Here's the SQL against the real table:

scala> val sq110m_Q601 = sqlContext.sql("SELECT COUNT(*) FROM gendb2bg T1, gendb2bg T2 WHERE T1.K100K = 49 AND T1.K250K = T2.k500k")

16/11/29 17:48:38 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM gendb2bg T1, gendb2bg T2 WHERE T1.K100K = 49 AND T1.K250K = T2.k500k

16/11/29 17:48:38 INFO ParseDriver: Parse Completed

16/11/29 17:48:38 INFO OrcRelation: Listing hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/gendb2bg on driver

16/11/29 17:48:38 INFO OrcRelation: Listing hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/gendb2bg on driver

sq110m_Q601: org.apache.spark.sql.DataFrame = [_c0: bigint]

scala>

Against the temp variable, I've included a prior SQL statement to show that the variable is assigned properly; notice the spurious column names:

16/11/29 17:49:28 INFO ParseDriver: Parsing command: SELECT K10, K25, COUNT(*) FROM db_temp1 GROUP BY K10, K25

16/11/29 17:49:28 INFO ParseDriver: Parse Completed

sq110m_Q503: org.apache.spark.sql.DataFrame = [K10: string, K25: string, _c2: bigint]

scala> val sq110m_Q601 = sqlContext.sql("SELECT COUNT(*) FROM db_temp1 T1, db_temp1 T2 WHERE T1.K100K = 49 AND T1.K250K = T2.k500k")

16/11/29 17:49:29 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM db_temp1 T1, db_temp1 T2 WHERE T1.K100K = 49 AND T1.K250K = T2.k500k

16/11/29 17:49:29 INFO ParseDriver: Parse Completed

org.apache.spark.sql.AnalysisException: cannot resolve 'T2.k500k' given input columns: [k10, k5m, _col13, _col1, _col7, _col14, s4, _col8, _col9, k1m, k10m, k100m, k2m, k25, _col23, kseq, sortchar, _col2, k500m, sortbin, s6, _col24, _col20, _col12, k100, k4, k1b, _col0, k5, _col21, sortpack, _col28, k100k, k1k, s7, _col17, s5, _col11, k500k, _col4, _col10, _col18, s8, _col15, _col16, _col25, k10k, _col6, _col26, kpart, _col19, k40k, k250k, _col3, _col22, _col5, _col27, k2]; line 1 pos 81

at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
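
For context, db_temp1 is a temp table registered over the same ORC data; the sketch below shows roughly how it is set up (the variable names and load call here are illustrative, the actual commands are in the attached zip):

// Rough sketch of the temp-table setup (illustrative only; see the attached zip for the real script).
// The path is the ORC table location shown in the OrcRelation log lines above.
val dfTemp = sqlContext.read.format("orc").load("hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/gendb2bg")

// Register the DataFrame so it can be queried by name in Spark SQL
dfTemp.registerTempTable("db_temp1")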

Thanks for taking the time to consider this issue.

Here's the DDL from a similar setup on Google Cloud Dataproc, where the same join works fine:

CREATE TABLE SQ4 (
  KPART int,
  KSEQ int,
  K1B int,
  K500M int,
  K100M int,
  K10M int,
  K5M int,
  K2M int,
  K1M int,
  K500K int,
  K250K int,
  K100K int,
  K40K int,
  K10K int,
  K1K int,
  K100 int,
  K25 int,
  K10 int,
  K5 int,
  K4 int,
  K2 int,
  SORTBIN int,
  SORTPACK int,
  SORTCHAR string,
  S4 string,
  S5 string,
  S6 string,
  S7 string,
  S8 string)
row format delimited
fields terminated by ','
stored as orc
tblproperties ('orc.compress'='SNAPPY',
               'orc.block.size'='268435456');

Can anyone see what I am doing wrong?

Thank you.

2 Replies

Re: Spark SQL join fails parse on temporary variable, works on real (orc) table

Can you try running the job after setting the Spark configuration spark.sql.caseSensitive=false and see if that helps?
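
For example, in spark-shell the setting can be applied to the existing sqlContext and the join re-run (a minimal sketch, using the query from the question above):

// Make identifier resolution case-insensitive before re-running the join
sqlContext.setConf("spark.sql.caseSensitive", "false")

// Same join as in the question, now resolved case-insensitively
val sq110m_Q601 = sqlContext.sql("SELECT COUNT(*) FROM db_temp1 T1, db_temp1 T2 WHERE T1.K100K = 49 AND T1.K250K = T2.k500k")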

Re: Spark SQL join fails parse on temporary variable, works on real (orc) table


Hi @David Young,

If this still exists, could you try setting `spark.sql.hive.convertMetastoreOrc=false`?
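
That setting makes Spark fall back to the Hive SerDe path for metastore ORC tables instead of its built-in ORC reader. A quick way to try it in spark-shell (a sketch, assuming the same sqlContext as above; it can also be passed at launch with --conf):

// Disable conversion of metastore ORC tables to Spark's native ORC reader
sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")

// After re-registering db_temp1, re-run the failing join to see whether the columns now resolve
val check = sqlContext.sql("SELECT COUNT(*) FROM db_temp1 T1, db_temp1 T2 WHERE T1.K100K = 49 AND T1.K250K = T2.k500k")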