Hive with spark table schema changes sensitivity

New Contributor

Hi all,

We are on Cloudera 7.4.4, and our solution works with Hive over Spark.

As I load text files into Hive, I may encounter schema changes in three ways:

1. The source data has added a column - this causes data loss on insert until the column is added to the Hive table.

2. The source data has omitted a column - the insert fails because that column was not dropped in Hive.

3. A data type has widened to a different type - if Hive is not updated to the new type (for example, int to bigint), the result will be NULL.

On top of that, Spark's inferSchema may flip numeric fields to string and vice versa.

 

Is there a reliable way to make a non-external (managed) Hive table adapt to these changes?

I did manage to write a program that fills omitted columns in the DataFrame with NULLs, automatically adds new columns, and widens the data types, but is there a built-in method?
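For context, this is a simplified sketch of what my workaround does today (PySpark; the table name, target schema and widening rules here are only illustrative, not our real ones):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Target schema as currently defined in Hive (illustrative)
target_schema = {"id": "bigint", "name": "string", "amount": "decimal(18,2)"}

def align_to_hive(df, target_schema):
    """Fill omitted columns with NULLs and cast every column to the Hive type."""
    for col_name, col_type in target_schema.items():
        if col_name not in df.columns:
            # Source omitted the column -> add it as NULL so the insert does not fail
            df = df.withColumn(col_name, lit(None).cast(col_type))
        else:
            # Source type may have drifted (e.g. int -> bigint) -> cast to the Hive type
            df = df.withColumn(col_name, col(col_name).cast(col_type))
    # Columns that appeared at the source but are not yet in Hive are dropped here;
    # picking them up requires an ALTER TABLE ... ADD COLUMNS on the Hive side first.
    return df.select(list(target_schema.keys()))

raw = spark.read.option("header", "true").option("inferSchema", "true").csv("/data/incoming/")
align_to_hive(raw, target_schema).write.mode("append").insertInto("my_db.my_table")
```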

For changes from alphanumeric to numeric and vice versa, I don't have a solution.

 

Or, would you suggest making the Hive table an external table over HBase/MongoDB/Cassandra (or anything better)? Would a "refresh" of the structure then be as quick as a simple structure update, or would it lock my table until the data is rebalanced?

 

The attachment shows the initial schema and the changes I need to accommodate.

 

Thanks in advance.

 

3 Replies

Rising Star

Hi @hades_63146 ,

 

If you are creating a managed table in Hive via Spark, you need to use the Hive Warehouse Connector (HWC).
https://docs.cloudera.com/cdp-private-cloud-base/7.1.3/integrating-hive-and-bi/topics/hive_hivewareh...
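A minimal write through HWC looks roughly like this (a sketch, assuming an existing SparkSession and DataFrame, the pyspark_llap package shipped with CDP on the path, and an illustrative table name; the exact data source name can vary with the HWC version):

```python
from pyspark_llap import HiveWarehouseSession

# Build an HWC session on top of the existing SparkSession
hive = HiveWarehouseSession.session(spark).build()

# Write a DataFrame into a managed (ACID) Hive table through HWC
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
    .option("table", "my_db.my_table") \
    .save()
```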

 

If you are already using HWC, and it's failing, please share the code here and we can try to check what is missing.

 

Good Luck.

New Contributor

Hi, 

My question is not about the connector. My question is how I can dynamically work with a Spark DataFrame that has to handle multiple different schemas.

Please look at the attachment provided.

Nevertheless, let me add some context: on Spark 3 we have the allowMissingColumns parameter for the unionByName command; what is the equivalent on Spark 2?
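For reference, on Spark 2 (2.3+, where unionByName exists) I currently emulate it by hand, roughly like this sketch (the column handling is illustrative):

```python
from pyspark.sql.functions import lit

def union_by_name_allow_missing(df1, df2):
    """Spark 2.x stand-in for unionByName(..., allowMissingColumns=True):
    add each side's missing columns as NULLs before the union."""
    for c in set(df2.columns) - set(df1.columns):
        df1 = df1.withColumn(c, lit(None).cast(df2.schema[c].dataType))
    for c in set(df1.columns) - set(df2.columns):
        df2 = df2.withColumn(c, lit(None).cast(df1.schema[c].dataType))
    return df1.unionByName(df2)
```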

Expert Contributor

If my understanding is correct, the schema is altered for different input files, which implies that the data itself lacks a structured schema.

Given the frequent changes in the schema, it is advisable to store the data in a column-oriented system such as HBase.

The same HBase data can be accessed through Spark using the HBase-Spark connector.

Ref - https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/accessing-hbase/topics/hbase-example-using-hb...
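A write through the connector looks roughly like this sketch (the table name and column mapping are illustrative, the connector jars must be on the Spark classpath, and the HBase table/column family must already exist):

```python
# Write a DataFrame into HBase; each column maps to a cell in a column family,
# so new source columns only need a new qualifier in the mapping, not an ALTER TABLE.
df.write.format("org.apache.hadoop.hbase.spark") \
    .option("hbase.columns.mapping",
            "id STRING :key, name STRING cf:name, amount STRING cf:amount") \
    .option("hbase.table", "my_table") \
    .option("hbase.spark.use.hbasecontext", False) \
    .save()
```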