Created 02-01-2023 08:48 AM
Hi all,
I work with Cloudera 7.4.4; our solution uses Hive over Spark.
As I load text files into Hive, I may encounter schema changes in three ways:
1. The source data has an added column - this causes data loss on insertion until the column is added in Hive.
2. The source data has omitted a column - the insert fails since that column was not dropped in Hive.
3. A data type has escalated to a different type - if Hive is not updated to the new type, for example int to bigint, the result will be null.
In addition, Spark's inferSchema may change numeric fields to alphanumeric and vice versa.
Is there a reliable way to make a non-external (managed) Hive table comply with these changes?
I did manage to create a program that fills omitted columns in the DataFrame, auto-adds new columns, and escalates data types, but is there a built-in method? For changing alphanumeric to numeric and vice versa I don't have a solution.
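To make this concrete, here is a simplified sketch of the kind of reconciliation logic I mean (the table name and helper are illustrative, not my exact code):

```python
from pyspark.sql import functions as F

def reconcile_with_hive(df, spark, table="db.my_table"):
    """Align an incoming DataFrame with the current Hive table schema:
    fill columns the source omitted with typed nulls, and report new
    columns / type escalations that need an ALTER TABLE first."""
    hive_fields = spark.table(table).schema.fields

    # 1. Source omitted a column -> add it back as a typed null.
    for f in hive_fields:
        if f.name not in df.columns:
            df = df.withColumn(f.name, F.lit(None).cast(f.dataType))

    # 2. Source added a column -> needs ALTER TABLE ... ADD COLUMNS.
    hive_names = [f.name for f in hive_fields]
    new_cols = [c for c in df.columns if c not in hive_names]

    # 3. Type escalation (e.g. int -> bigint) -> needs ALTER TABLE
    #    ... CHANGE COLUMN, otherwise the insert yields nulls.
    src_types = dict(df.dtypes)
    escalated = [(f.name, f.dataType.simpleString(), src_types[f.name])
                 for f in hive_fields
                 if src_types[f.name] != f.dataType.simpleString()]

    return df, new_cols, escalated
```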
Or, would you suggest making Hive an external table over HBase/Mongo/Cassandra (or anything better)? And would a "refresh" of the structure be a quick metadata update, or would it lock my table until the data is rebalanced?
The attachment shows my initial schema and the updates that need to be applied.
Thanks in advance.
Created on 02-01-2023 12:54 PM - edited 02-01-2023 12:59 PM
Hi @hades_63146 ,
If you are creating a managed table in Hive via Spark, you need to use the HiveWarehouseConnector (HWC).
https://docs.cloudera.com/cdp-private-cloud-base/7.1.3/integrating-hive-and-bi/topics/hive_hivewareh...
If you are already using HWC, and it's failing, please share the code here and we can try to check what is missing.
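For reference, writing to a managed table with HWC typically looks like the following minimal sketch (it assumes spark and df already exist and that the HWC jar and configs are set up as described in the doc above; the table name is a placeholder):

```python
from pyspark_llap import HiveWarehouseSession

# Build an HWC session on top of the existing SparkSession.
hive = HiveWarehouseSession.session(spark).build()

# Write the DataFrame into a managed (ACID) Hive table.
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
    .mode("append") \
    .option("table", "db.managed_table") \
    .save()
```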
Good Luck.
Created 02-05-2023 10:00 PM
Hi,
My question is not about the connector; it is how I can dynamically work with a Spark DataFrame that has to handle multiple different schemas.
Please look at the attachment given.
Nevertheless, let me add one more insight: on Spark 3 we have the allowMissingColumns parameter for the unionByName command; what equivalent do we have on Spark 2?
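To illustrate what I mean (DataFrame names are placeholders):

```python
from pyspark.sql import functions as F

# Spark 3.1+: missing columns on either side are filled with nulls.
merged = df_old.unionByName(df_new, allowMissingColumns=True)

# Spark 2.x equivalent: add typed null columns by hand, then unionByName
# (available since Spark 2.3, but without allowMissingColumns).
def union_allow_missing(df1, df2):
    for name, dtype in df2.dtypes:
        if name not in df1.columns:
            df1 = df1.withColumn(name, F.lit(None).cast(dtype))
    for name, dtype in df1.dtypes:
        if name not in df2.columns:
            df2 = df2.withColumn(name, F.lit(None).cast(dtype))
    return df1.unionByName(df2)
```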
Created 07-14-2023 02:39 AM
If my understanding is correct, the schema is altered for different input files, which implies that the data itself lacks a structured schema.
Given the frequent changes in the schema, it is advisable to store the data in a column-oriented system such as HBase.
The same HBase data can be accessed through Spark using the HBase-Spark Connector.
Ref - https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/accessing-hbase/topics/hbase-example-using-hb...
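A minimal read sketch with the connector (the table name and column mapping are placeholders; see the link above for the full setup):

```python
# Read an HBase table as a DataFrame via the HBase-Spark connector
# (assumes the connector jars are on the Spark classpath).
df = (spark.read.format("org.apache.hadoop.hbase.spark")
      .option("hbase.columns.mapping",
              "id STRING :key, name STRING cf:name, amount BIGINT cf:amount")
      .option("hbase.table", "my_table")
      .option("hbase.spark.use.hbasecontext", False)
      .load())
df.show()
```

Since HBase stores cells by column family/qualifier rather than a fixed table schema, adding or dropping a source column only changes the mapping you read with, not the stored data.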