
IllegalArgumentException when selecting with a WHERE clause on a Hive external table pointing to Parquet data

Explorer

The workflow is as follows:

hive> create external table if not exists table1 (C1 string, C2 string, C3 int)
      stored as parquet
      location 'hdfs:/project/table1.parquet';

The Parquet data were created by Spark with:

df.write.mode("overwrite").parquet('hdfs:/project/table1.parquet')

hive> select * from table1 where C1='toto';

OK

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Column [c1] was not found in schema!

Time taken: 0.254 seconds

1 ACCEPTED SOLUTION

Super Guru

Hi @Hocine Bouzelat Can you please check whether you defined the same column names when creating the Parquet file schema through Spark?


7 REPLIES


Super Guru

Can you please post the output of the command below?

parquet-tools schema <parquet file path>

If parquet-tools is not set up, please follow the link below.

https://github.com/Parquet/parquet-mr/tree/master/parquet-tools

Super Guru

This seems to be a known issue; please set the parameter below in the Hive shell before running the select query.

set hive.optimize.ppd = false;

https://issues.apache.org/jira/browse/HIVE-11401
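If the workaround helps, it can also be made persistent in hive-site.xml instead of being set per session (a sketch; note that disabling predicate pushdown globally can hurt query performance, so per-session may be the safer choice):

```xml
<property>
  <name>hive.optimize.ppd</name>
  <value>false</value>
  <description>Work around HIVE-11401 (case-sensitive Parquet column lookup).</description>
</property>
```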

Super Guru

Hi @Hocine Bouzelat please accept this answer if the suggested workaround worked for you.

Explorer

Yes, the Spark schema and the Hive schema are the same.

Also,

hive> describe table1;

gives the three columns.

Explorer

Hi Jitendra,

Here is the complete process, spelled out in full:

PySpark program

>>>data = [('C1_1',None,'C3_1'), ('C1_2','C2_2',None),('C1_3',None,None),(None,None,'C3_4'),('C1_5','C2_5','C3_5')] 
>>>df = sqlContext.createDataFrame(data, ['C1', 'C2','C3']) 
>>>df.printSchema() 

root
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)
 |-- C3: string (nullable = true)

>>>df.show()

+----+----+----+
|  C1|  C2|  C3|
+----+----+----+
|C1_1|null|C3_1|
|C1_2|C2_2|null|
|C1_3|null|null|
|null|null|C3_4|
|C1_5|C2_5|C3_5|
+----+----+----+

>>>df.write.mode("overwrite").parquet('hdfs:/tmp/table1.parquet') 

HDFS commands

$ hdfs dfs -ls /tmp/table1.parquet 

Found 11 items

/tmp/table1.parquet/_SUCCESS

/tmp/table1.parquet/_common_metadata

/tmp/table1.parquet/_metadata

/tmp/table1.parquet/part-r-00000-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00001-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00002-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00003-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00004-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00005-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00006-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00007-3b1ecab2-1942-47eb-b393-c6988876d4

Hive commands

hive> create external table table1 (C1 string, C2 string, C3 string)
      stored as parquet
      location 'hdfs:/tmp/table1.parquet';

OK

Time taken: 6.469 seconds

hive>describe table1; 

OK

c1 string

c2 string

c3 string

Time taken: 1.455 seconds, Fetched: 3 row(s)

hive> select * from table1; 

OK

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

C1_1 NULL C3_1

C1_2 C2_2 NULL

C1_3 NULL NULL

NULL NULL C3_4

C1_5 C2_5 C3_5

Time taken: 0.355 seconds, Fetched: 5 row(s)

hive> select * from table1 where C1='toto'; 

OK

Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Column [c1] was not found in schema!

Time taken: 0.277 seconds

hive> select * from table1 where c1='toto'; 

OK

Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Column [c1] was not found in schema!

Time taken: 0.096 seconds

Explorer

As a workaround, the following setting solves the issue:

hive> set hive.optimize.ppd = false;

Thanks @Jitendra Yada