Created 06-12-2020 12:39 AM
Hi All,
I need your expertise in understanding and solving the issue below.
1. Data is available in Parquet (partitioned), an external table has been created in Hive, and MSCK REPAIR has been run.
2. Querying the table from Hive and from Spark returns results for all columns.
---- The issue starts from here ----
3. New columns are added to the Parquet files, the table is altered to add the new columns (same schema, correct datatypes), and MSCK REPAIR is run again; a rough sketch of the HiveQL follows this list.
4. Hive reads values for the new columns, but Spark 2.3 and 2.2 return NULL values.
------ We also tried another way:
5. Drop and recreate the table with all columns (new and old), manually add the latest partition (which has data for all of the newly added columns), and then run MSCK REPAIR for the other partitions. In some cases this works, but in other cases we hit the same issue of NULL values from Spark.
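To make the steps concrete, the HiveQL is roughly the sketch below (table, column and path names are made up for illustration):

-- Step 1: external partitioned Parquet table, then pick up the existing partitions.
CREATE EXTERNAL TABLE IF NOT EXISTS sales (
  id     BIGINT,
  amount DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/data/sales';

MSCK REPAIR TABLE sales;

-- Step 3: new columns now exist in the Parquet files, so extend the table
-- definition and pick up any new partitions.
ALTER TABLE sales ADD COLUMNS (testData STRING);
MSCK REPAIR TABLE sales;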
We have seen that this is not an issue in Spark 2.4. Can you please explain why it behaves like this, and whether there is any way forward?
Created 06-12-2020 05:28 AM
Can I check the logic here?
Are you using HiveQL to create the table and to add the new columns?
Is it the Spark read that is then giving inconsistent results?
There are known issues in how Spark 2.2 handles Hive schemas. For example, see https://issues.apache.org/jira/browse/SPARK-21841
If you have the example Spark code, that may help. It looks like being explicit in how Spark reads the Hive table may help in this case.
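As a starting point, something along these lines in spark-sql might be worth trying; these are only suggestions, and the table/column names are placeholders:

-- Resolve columns via the Hive SerDe instead of Spark's built-in Parquet reader.
SET spark.sql.hive.convertMetastoreParquet=false;
-- Drop any cached/previously inferred schema for the table before re-reading it.
REFRESH TABLE sales;
SELECT testData FROM sales LIMIT 10;
-- If case sensitivity is the problem, spark.sql.hive.caseSensitiveInferenceMode
-- (e.g. INFER_AND_SAVE) is another setting that may be relevant.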
Created on 06-13-2020 07:11 AM - edited 06-13-2020 07:42 AM
Hi Royles,
I am using HiveQL for creating the table and for altering it to add new columns.
All operations such as MSCK REPAIR TABLE and adding partitions are done from HiveQL only; we only read the table from Spark SQL.
After reading your reply, I tried doing everything from Spark SQL: creating the external table, running MSCK REPAIR, and altering the table to add new columns. I got the results below:
1. No results from Spark when reading the table.
2. No results from the Hive shell when reading the table.
3. Looking at the TBLPROPERTIES, the Parquet schema does not match, so there are no results from HiveQL or from Spark.
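For reference, this is roughly how the comparison can be done (table name is just an example); Spark may record its own copy of the schema in the spark.sql.sources.schema.* table properties, which is what I compare against the Hive columns and the Parquet field names:

SHOW TBLPROPERTIES sales;     -- look for spark.sql.sources.schema.* entries, if present
DESCRIBE FORMATTED sales;     -- compare against the Hive column definitions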
The only solution I am following so far (for adding new columns to external tables) is the following; a rough HiveQL sketch follows the list:
1. Drop and recreate the table using HiveQL from the Hive shell with all columns (old + new).
2. Manually add the latest partition, which has data for all of the new columns added since the table was first created.
3. Query the table from Spark, then check the TBLPROPERTIES and confirm the Parquet schema is reflected and mapped to the Hive columns.
4. If the schemas do not match, e.g. testData in Parquet shows up as testdata in the Hive TBLPROPERTIES, then we get NULL values from Spark.
5. If both schemas match, then we can see results from Spark.
6. Then run MSCK REPAIR, which gives me results in both Spark 2.2 and 2.3.
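A minimal sketch of that workaround, again with hypothetical names:

-- Step 1: recreate the table with old + new columns.
DROP TABLE IF EXISTS sales;
CREATE EXTERNAL TABLE sales (
  id       BIGINT,
  amount   DOUBLE,
  testData STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/data/sales';

-- Step 2: add the latest partition (the one that has the new columns) by hand.
ALTER TABLE sales ADD PARTITION (dt='2020-06-11')
  LOCATION '/data/sales/dt=2020-06-11';

-- Steps 3-5: query from Spark and verify that the schemas line up.
-- Step 6: then pick up the remaining partitions.
MSCK REPAIR TABLE sales;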
But I feel there must be some other way of adding new columns instead of dropping and recreating the table.