Created 07-28-2016 10:00 AM
The workflow is as follows:
Hive>create external table if not exists table1 ( C1 string, C2 string, C3 int)
stored as parquet
location 'hdfs:/project/table1.parquet';
The Parquet data is created by Spark as:
df.write.mode("overwrite").parquet('hdfs:/project/table1.parquet')
Hive>select * from table1 where C1='toto';
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Column [c1] was not found in schema!
Time taken: 0.254 seconds
Created 07-28-2016 10:05 AM
Hi @Hocine Bouzelat, can you please check whether you defined the same column names when creating the Parquet file schema through Spark?
Created 07-28-2016 10:19 AM
Can you please post the output of the command below?
parquet-tools schema <parquet file path>
If parquet-tools is not set up, please follow the link below:
https://github.com/Parquet/parquet-mr/tree/master/parquet-tools
Created 07-28-2016 11:31 AM
This seems to be a known issue; please set the parameter below in the Hive shell before running the select query.
set hive.optimize.ppd = false;
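To illustrate why disabling predicate pushdown helps: Hive stores column names lowercased in its metastore, while the Spark-written Parquet footer keeps the original case, and the pushed-down filter looks the column up case-sensitively. The sketch below is plain Python (not Hive internals), just to show the mismatch:

```python
# Plain-Python illustration (not Hive/Parquet internals) of the name
# mismatch behind "Column [c1] was not found in schema!".
parquet_schema = ['C1', 'C2', 'C3']   # case as written by Spark
hive_column = 'c1'                    # case as stored by the Hive metastore

# Case-sensitive lookup, as pushdown does: fails.
case_sensitive_hit = hive_column in parquet_schema
# Case-insensitive lookup, as Hive's own row filtering effectively does:
case_insensitive_hit = hive_column in [c.lower() for c in parquet_schema]
print(case_sensitive_hit, case_insensitive_hit)  # False True
```

With `hive.optimize.ppd=false`, Hive reads all rows and filters them itself instead of handing the (lowercased) column name to the Parquet reader.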
Created 07-28-2016 12:49 PM
Hi @Hocine Bouzelat, please accept this answer if the suggested workaround worked for you.
Created 07-28-2016 10:11 AM
Yes, the Spark schema and the Hive schema are the same.
Also,
hive> describe table1;
gives the three columns.
Created 07-28-2016 11:21 AM
Hi Jitendra,
the complete process follows, step by step:
PySpark program
>>>data = [('C1_1',None,'C3_1'), ('C1_2','C2_2',None),('C1_3',None,None),(None,None,'C3_4'),('C1_5','C2_5','C3_5')]
>>>df = sqlContext.createDataFrame(data, ['C1', 'C2','C3'])
>>>df.printSchema()
root
|-- C1: string (nullable = true)
|-- C2: string (nullable = true)
|-- C3: string (nullable = true)
>>>df.show()
+------+------+------+
|   C1 |   C2 |   C3 |
+------+------+------+
| C1_1 | null | C3_1 |
| C1_2 | C2_2 | null |
| C1_3 | null | null |
| null | null | C3_4 |
| C1_5 | C2_5 | C3_5 |
+------+------+------+
>>>df.write.mode("overwrite").parquet('hdfs:/tmp/table1.parquet')
HDFS commands
$ hdfs dfs -ls /tmp/table1.parquet
Found 11 items
/tmp/table1.parquet/_SUCCESS
/tmp/table1.parquet/_common_metadata
/tmp/table1.parquet/_metadata
/tmp/table1.parquet/part-r-00000-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet
/tmp/table1.parquet/part-r-00001-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet
/tmp/table1.parquet/part-r-00002-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet
/tmp/table1.parquet/part-r-00003-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet
/tmp/table1.parquet/part-r-00004-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet
/tmp/table1.parquet/part-r-00005-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet
/tmp/table1.parquet/part-r-00006-3b1ecab2-1942-47eb-b393-c6988876d4a1.gz.parquet
/tmp/table1.parquet/part-r-00007-3b1ecab2-1942-47eb-b393-c6988876d4
Hive commands
hive>create external table table1 ( C1 string, C2 string, C3 string) stored as parquet location 'hdfs:/tmp/table1.parquet';
OK
Time taken: 6.469 seconds
hive>describe table1;
OK
c1 string
c2 string
c3 string
Time taken: 1.455 seconds, Fetched: 3 row(s)
hive> select * from table1;
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
C1_1 NULL C3_1
C1_2 C2_2 NULL
C1_3 NULL NULL
NULL NULL C3_4
C1_5 C2_5 C3_5
Time taken: 0.355 seconds, Fetched: 5 row(s)
hive> select * from table1 where C1='toto';
OK
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Column [c1] was not found in schema!
Time taken: 0.277 seconds
hive> select * from table1 where c1='toto';
OK
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Column [c1] was not found in schema!
Time taken: 0.096 seconds
Created 07-28-2016 02:58 PM
As a workaround, the following instruction solves the issue:
hive> set hive.optimize.ppd = false;
Thanks @Jitendra Yada
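An alternative to disabling predicate pushdown session-wide is to write the Parquet files with lowercase column names in the first place, so they match the names Hive stores. A minimal sketch (the `lowercase_columns` helper is illustrative, not a Spark API):

```python
# Sketch: rename columns to lowercase before writing, so the Parquet
# schema matches Hive's lowercased column names and predicate pushdown
# can stay enabled.  lowercase_columns is a hypothetical helper.
def lowercase_columns(columns):
    """Return the column-name list lowercased."""
    return [c.lower() for c in columns]

# With a Spark DataFrame df this would be used as:
#   df = df.toDF(*lowercase_columns(df.columns))
#   df.write.mode("overwrite").parquet('hdfs:/tmp/table1.parquet')
print(lowercase_columns(['C1', 'C2', 'C3']))  # ['c1', 'c2', 'c3']
```

This keeps `hive.optimize.ppd` at its default, at the cost of re-writing the data.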