
IllegalArgumentException when selecting with a WHERE clause on a Hive external table pointing to Parquet data

Explorer

The workflow is as follows:

Hive>create external table if not exists table1 ( C1 string, C2 string, C3 int)

stored as parquet

location 'hdfs:/project/table1.parquet';

The Parquet data was created by Spark as follows:

df.write.mode("overwrite").parquet('hdfs:/project/table1.parquet')

Hive>select * from table1 where C1='toto';

OK

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Column [c1] was not found in schema!

Time taken: 0.254 seconds

1 ACCEPTED SOLUTION

Hi @Hocine Bouzelat, can you please check whether you defined the same column names when creating the Parquet file schema through Spark?
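One quick way to check this from the PySpark shell (a minimal sketch, path taken from the question) is to read the files back and print the schema that Spark actually wrote; the Parquet footer keeps the original case of the column names, while Hive stores its column names in lower case:

>>> df_check = sqlContext.read.parquet('hdfs:/project/table1.parquet')
>>> df_check.printSchema()   # column names as stored by Spark, e.g. C1, C2, C3
>>> df_check.columns         # compare against the lower-case c1, c2, c3 reported by describe table1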


7 REPLIES


Can you please post the output of the command below?

parquet-tools schema <parquet file path>

If parquet-tools is not set up, please follow the link below:

https://github.com/Parquet/parquet-mr/tree/master/parquet-tools
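If installing parquet-tools is not an option, a rough alternative (assuming pyarrow is installed and one part file has been copied locally, e.g. with hdfs dfs -get) is to read the footer schema in Python:

>>> import pyarrow.parquet as pq
>>> pq.ParquetFile('part-r-00000.gz.parquet').schema   # hypothetical local copy of one part file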

This seems to be a known issue: with predicate pushdown enabled, the Hive Parquet reader appears to look up the lowercased column name ([c1]) in a Parquet schema whose columns keep their original case. Please set the parameter below in the Hive shell before running the select query.

set hive.optimize.ppd = false;

https://issues.apache.org/jira/browse/HIVE-11401
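As a quick cross-check that the data and the filter themselves are fine (a sketch from the PySpark shell, path assumed from the question), the same predicate can be run through Spark, where column-name matching is case-insensitive by default and the query completes instead of failing:

>>> sqlContext.read.parquet('hdfs:/project/table1.parquet').filter("C1 = 'toto'").count()   # returns a row count rather than raising the exception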

Hi @Hocine Bouzelat, please accept this answer if the suggested workaround worked for you.

Explorer

Yes, the Spark schema and the Hive schema are the same.

Also,

hive> describe table1;

gives the three columns.

Explorer

Hi Jitendra,

Here is the complete process, made fully explicit.

PySpark program:

>>>data = [('C1_1',None,'C3_1'), ('C1_2','C2_2',None),('C1_3',None,None),(None,None,'C3_4'),('C1_5','C2_5','C3_5')] 
>>>df = sqlContext.createDataFrame(data, ['C1', 'C2','C3']) 
>>>df.printSchema() 

root

|-- C1: string (nullable = true)

|-- C2: string (nullable = true)

|-- C3: string (nullable = true)

>>>df.show()

+----+----+----+
|  C1|  C2|  C3|
+----+----+----+
|C1_1|null|C3_1|
|C1_2|C2_2|null|
|C1_3|null|null|
|null|null|C3_4|
|C1_5|C2_5|C3_5|
+----+----+----+

>>>df.write.mode("overwrite").parquet('hdfs:/tmp/table1.parquet') 

HDFS commands:

$ hdfs dfs -ls /tmp/table1.parquet 

Found 11 items

/tmp/table1.parquet/_SUCCESS

/tmp/table1.parquet/_common_metadata

/tmp/table1.parquet/_metadata

/tmp/table1.parquet/part-r-00000-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00001-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00002-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00003-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00004-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00005-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00006-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet

/tmp/table1.parquet/part-r-00007-3b1ecab2-1942-47eb-b393-c6988876d4

Hive commands:

hive>create external table table1 ( C1 string, C2 string, C3 string)
   stored as parquet 
   location 'hdfs:/tmp/table1.parquet'; 

OK

Time taken: 6.469 seconds

hive>describe table1; 

OK

c1 string

c2 string

c3 string

Time taken: 1.455 seconds, Fetched: 3 row(s)

hive> select * from table1; 

OK

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

C1_1 NULL C3_1

C1_2 C2_2 NULL

C1_3 NULL NULL

NULL NULL C3_4

C1_5 C2_5 C3_5

Time taken: 0.355 seconds, Fetched: 5 row(s)

hive> select * from table1 where C1='toto'; 

OK

Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Column [c1] was not found in schema!

Time taken: 0.277 seconds

hive> select * from table1 where c1='toto'; 

OK

Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Column [c1] was not found in schema!

Time taken: 0.096 seconds

Explorer

As a workaround, the following setting solves the issue:

hive> set hive.optimize.ppd = false;

Thanks @Jitendra Yada
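If disabling predicate pushdown permanently is not desirable, another option in line with the accepted answer (an untested sketch) is to write the DataFrame with lower-case column names, so that they match the lower-case names Hive uses:

>>> from pyspark.sql.functions import col
>>> df_lower = df.select([col(c).alias(c.lower()) for c in df.columns])   # rename C1, C2, C3 to c1, c2, c3
>>> df_lower.write.mode("overwrite").parquet('hdfs:/tmp/table1.parquet')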