Created 02-01-2023 08:48 AM
Hi all,
I work with Cloudera 7.4.4; our solution uses Hive over Spark.
As I load text files into Hive, I may encounter schema changes in three ways:
1. The source data has an added column - this causes data loss on insertion until the column is added in Hive.
2. The source data has omitted a column - the insert fails since that column was not dropped in Hive.
3. A data type has escalated to a different type - if Hive is not updated to the new type, for example int to bigint, the result will be null.
In addition, Spark's inferSchema may change numeric fields to alphanumeric and vice versa.
Is there a reliable way to make a non-external (managed) Hive table comply with these changes?
I did manage to create a program that fills omitted columns in the DataFrame, auto-adds new columns, and escalates data types, but is there a built-in method? For changing alphanumeric to numeric and vice versa I don't have a solution.
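To make this concrete, here is a simplified sketch of the kind of reconciliation logic I mean (the table name and helper are illustrative, not my exact code):

```python
from pyspark.sql import functions as F

def reconcile_with_hive(df, spark, table="db.my_table"):
    """Align an incoming DataFrame with the current Hive table schema:
    fill columns the source omitted with typed nulls, and report new
    columns / type escalations that need an ALTER TABLE first."""
    hive_fields = spark.table(table).schema.fields

    # 1. Source omitted a column -> add it back as a typed null.
    for f in hive_fields:
        if f.name not in df.columns:
            df = df.withColumn(f.name, F.lit(None).cast(f.dataType))

    # 2. Source added a column -> needs ALTER TABLE ... ADD COLUMNS.
    hive_names = [f.name for f in hive_fields]
    new_cols = [c for c in df.columns if c not in hive_names]

    # 3. Type escalation (e.g. int -> bigint) -> needs ALTER TABLE
    #    ... CHANGE COLUMN, otherwise the insert yields nulls.
    src_types = dict(df.dtypes)
    escalated = [(f.name, f.dataType.simpleString(), src_types[f.name])
                 for f in hive_fields
                 if src_types[f.name] != f.dataType.simpleString()]

    return df, new_cols, escalated
```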
Or, would you suggest making Hive an external table over HBase/Mongo/Cassandra (or anything better)? And would a "refresh" of the structure be a quick metadata update, or would it lock my table until the data is rebalanced?
The attachment shows my initial schema and the updates that need to be applied.
Thanks in advance.
Created on 02-01-2023 12:54 PM - edited 02-01-2023 12:59 PM
Hi @hades_63146 ,
If you are creating a managed table in Hive via Spark, you need to use the HiveWarehouseConnector (HWC).
https://docs.cloudera.com/cdp-private-cloud-base/7.1.3/integrating-hive-and-bi/topics/hive_hivewareh...
If you are already using HWC, and it's failing, please share the code here and we can try to check what is missing.
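For reference, writing to a managed table with HWC typically looks like the following minimal sketch (it assumes spark and df already exist and that the HWC jar and configs are set up as described in the doc above; the table name is a placeholder):

```python
from pyspark_llap import HiveWarehouseSession

# Build an HWC session on top of the existing SparkSession.
hive = HiveWarehouseSession.session(spark).build()

# Write the DataFrame into a managed (ACID) Hive table.
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
    .mode("append") \
    .option("table", "db.managed_table") \
    .save()
```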
Good Luck.
Created 02-05-2023 10:00 PM
Hi,
My question is not about the connector; it is how I can dynamically work with a Spark DataFrame that has to handle multiple different schemas.
Please look at the attachment given.
Nevertheless, let me add one more insight: on Spark 3 we have the allowMissingColumns parameter for the unionByName command; what equivalent do we have on Spark 2?
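To illustrate what I mean (DataFrame names are placeholders):

```python
from pyspark.sql import functions as F

# Spark 3.1+: missing columns on either side are filled with nulls.
merged = df_old.unionByName(df_new, allowMissingColumns=True)

# Spark 2.x equivalent: add typed null columns by hand, then unionByName
# (available since Spark 2.3, but without allowMissingColumns).
def union_allow_missing(df1, df2):
    for name, dtype in df2.dtypes:
        if name not in df1.columns:
            df1 = df1.withColumn(name, F.lit(None).cast(dtype))
    for name, dtype in df1.dtypes:
        if name not in df2.columns:
            df2 = df2.withColumn(name, F.lit(None).cast(dtype))
    return df1.unionByName(df2)
```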
Created 07-14-2023 02:39 AM
If my understanding is correct, the schema is altered for different input files, which implies that the data itself lacks a structured schema.
Given the frequent changes in the schema, it is advisable to store the data in a column-oriented system such as HBase.
The same HBase data can be accessed through Spark using the HBase-Spark Connector.
Ref - https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/accessing-hbase/topics/hbase-example-using-hb...
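A minimal read sketch with the connector (the table name and column mapping are placeholders; see the link above for the full setup):

```python
# Read an HBase table as a DataFrame via the HBase-Spark connector
# (assumes the connector jars are on the Spark classpath).
df = (spark.read.format("org.apache.hadoop.hbase.spark")
      .option("hbase.columns.mapping",
              "id STRING :key, name STRING cf:name, amount BIGINT cf:amount")
      .option("hbase.table", "my_table")
      .option("hbase.spark.use.hbasecontext", False)
      .load())
df.show()
```

Since HBase stores cells by column family/qualifier rather than a fixed table schema, adding or dropping a source column only changes the mapping you read with, not the stored data.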