Hive with spark table schema changes sensitivity

New Contributor

Hi all,

We are on Cloudera 7.4.4, and our solution works with Hive over Spark.

As I load text files into Hive, I may encounter schema changes in three ways:

1. The source data has added a column - this causes data loss on insert until the column is added to the Hive table.

2. The source data has omitted a column - the insert fails because that column was not dropped in Hive.

3. A data type has widened to a different type - if Hive is not updated to the new type (for example, int to bigint), the result will be NULL.

On top of that, Spark's inferSchema may flip numeric fields to string and vice versa.

 

Is there a reliable way to make a non-external (managed) Hive table adapt to these changes?

I did manage to write a program that fills omitted columns in the DataFrame with NULLs, automatically adds new columns, and widens the data types, but is there a built-in method?
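For context, this is a simplified sketch of what my workaround does today (PySpark; the table name, target schema and widening rules here are only illustrative, not our real ones):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Target schema as currently defined in Hive (illustrative)
target_schema = {"id": "bigint", "name": "string", "amount": "decimal(18,2)"}

def align_to_hive(df, target_schema):
    """Fill omitted columns with NULLs and cast every column to the Hive type."""
    for col_name, col_type in target_schema.items():
        if col_name not in df.columns:
            # Source omitted the column -> add it as NULL so the insert does not fail
            df = df.withColumn(col_name, lit(None).cast(col_type))
        else:
            # Source type may have drifted (e.g. int -> bigint) -> cast to the Hive type
            df = df.withColumn(col_name, col(col_name).cast(col_type))
    # Columns that appeared at the source but are not yet in Hive are dropped here;
    # picking them up requires an ALTER TABLE ... ADD COLUMNS on the Hive side first.
    return df.select(list(target_schema.keys()))

raw = spark.read.option("header", "true").option("inferSchema", "true").csv("/data/incoming/")
align_to_hive(raw, target_schema).write.mode("append").insertInto("my_db.my_table")
```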

For changes from alphanumeric to numeric and vice versa, I don't have a solution.

 

Or, would you suggest making the Hive table an external table over HBase/MongoDB/Cassandra (or anything better)? Would a "refresh" of the structure then be as quick as a simple structure update, or would it lock my table until the data is rebalanced?

 

The attachment shows the initial schema and the changes I need to accommodate.

 

Thanks in advance.

 

3 Replies

Rising Star

Hi @hades_63146 ,

 

If you are creating a managed table in Hive via Spark, you need to use the Hive Warehouse Connector (HWC).
https://docs.cloudera.com/cdp-private-cloud-base/7.1.3/integrating-hive-and-bi/topics/hive_hivewareh...
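A minimal write through HWC looks roughly like this (a sketch, assuming an existing SparkSession and DataFrame, the pyspark_llap package shipped with CDP on the path, and an illustrative table name; the exact data source name can vary with the HWC version):

```python
from pyspark_llap import HiveWarehouseSession

# Build an HWC session on top of the existing SparkSession
hive = HiveWarehouseSession.session(spark).build()

# Write a DataFrame into a managed (ACID) Hive table through HWC
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
    .option("table", "my_db.my_table") \
    .save()
```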

 

If you are already using HWC, and it's failing, please share the code here and we can try to check what is missing.

 

Good Luck.

New Contributor

Hi, 

My question is not about the connector. My question is how I can dynamically work with a Spark DataFrame that has to handle multiple different schemas.

Please look at the attachment provided.

Nevertheless, let me add some context: on Spark 3 we have the allowMissingColumns parameter for the unionByName command; what is the equivalent on Spark 2?
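For reference, on Spark 2 (2.3+, where unionByName exists) I currently emulate it by hand, roughly like this sketch (the column handling is illustrative):

```python
from pyspark.sql.functions import lit

def union_by_name_allow_missing(df1, df2):
    """Spark 2.x stand-in for unionByName(..., allowMissingColumns=True):
    add each side's missing columns as NULLs before the union."""
    for c in set(df2.columns) - set(df1.columns):
        df1 = df1.withColumn(c, lit(None).cast(df2.schema[c].dataType))
    for c in set(df1.columns) - set(df2.columns):
        df2 = df2.withColumn(c, lit(None).cast(df1.schema[c].dataType))
    return df1.unionByName(df2)
```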

Expert Contributor

If my understanding is correct, the schema is altered for different input files, which implies that the data itself lacks a structured schema.

Given the frequent changes in the schema, it is advisable to store the data in a column-oriented system such as HBase.

The same HBase data can be accessed through Spark using the HBase-Spark connector.

Ref - https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/accessing-hbase/topics/hbase-example-using-hb...
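A write through the connector looks roughly like this sketch (the table name and column mapping are illustrative, the connector jars must be on the Spark classpath, and the HBase table/column family must already exist):

```python
# Write a DataFrame into HBase; each column maps to a cell in a column family,
# so new source columns only need a new qualifier in the mapping, not an ALTER TABLE.
df.write.format("org.apache.hadoop.hbase.spark") \
    .option("hbase.columns.mapping",
            "id STRING :key, name STRING cf:name, amount STRING cf:amount") \
    .option("hbase.table", "my_table") \
    .option("hbase.spark.use.hbasecontext", False) \
    .save()
```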