
Best approach to ingest CSV with changing schema?

Expert Contributor

Hello,

I'm still new to Hadoop and I'm struggling to define the best approach for the following two similar challenges:

My first attempt was to ingest the files straight into Hive. The InferAvroSchema processor in NiFi is limited, as it does not always recognize the right data types, which generates a fair number of errors when the files are ingested.

Switching to specifying the schema manually brings the following problems:

- The CSV files receive schema updates over the year. I have versioning documentation listing the schema changes, but the dates in the versioning document do not match the dates the changes actually take effect.

- The hourly CSV files have a schema that depends on business activity (a set of columns is mandatory; a large set is optional and only appears when the underlying options have been used). The schema differs from hour to hour, and I can't predict which one to expect.

My feeling is that I have to move to a NoSQL type of database / storage, but I'm not exactly sure how best to tackle this.

Has anyone faced a similar problem?

Thanks

Christophe

1 ACCEPTED SOLUTION

Binu Mathew

If your schema is changing from hour to hour, then maybe try this:

1. Use Spark with the CSV reader from Databricks to process the data. The CSV reader can automatically infer the schema.

2. Write the DataFrame to HBase. With HBase you don't need a schema defined up front, and each row can have a varying number of columns. When you are ready to analyze the data in HBase, you can use Apache Phoenix to create a schema atop the HBase table.

3. You could even check the number of columns in the DataFrame and then route to a Hive table based on the count. For example, if I count 7 fields, route to table A; if 10 fields, table B. Hive tables have a fixed number of columns, whereas HBase tables do not. (Rough sketches of all three steps follow this list.)
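To make that concrete, here is a rough Scala sketch of steps 1 and 3, assuming Spark 2.x where the Databricks CSV reader is built in as the "csv" source (on Spark 1.x you would pull in the com.databricks.spark.csv package instead). The path, column counts, and table names are placeholders:

import org.apache.spark.sql.SparkSession

// Hive support is needed for saveAsTable below.
val spark = SparkSession.builder()
  .appName("hourly-csv-ingest")
  .enableHiveSupport()
  .getOrCreate()

// Step 1: read one hourly file and let Spark infer the column types.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/landing/hourly/2016-07-01-09.csv")

// Step 3: route to a Hive table based on the number of inferred columns.
df.columns.length match {
  case 7  => df.write.mode("append").saveAsTable("ingest.table_a")
  case 10 => df.write.mode("append").saveAsTable("ingest.table_b")
  case n  => println(s"Unexpected column count ($n); quarantine the file")
}

And here is one way to do the HBase write in step 2, using the TableOutputFormat that ships with HBase. The table name ("events"), column family ("d"), and the choice of the first CSV column as the row key are all assumptions to adapt:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

// Picks up hbase-site.xml from the classpath for the cluster connection.
val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "events")
val job = Job.getInstance(conf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

val names = df.columns
df.rdd.map { row =>
  val put = new Put(Bytes.toBytes(row.get(0).toString))
  // Write only the columns present in this file and non-null in this
  // row, so files with different schemas can land in the same table.
  names.zipWithIndex.drop(1).foreach { case (name, i) =>
    if (!row.isNullAt(i))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(name),
                    Bytes.toBytes(row.get(i).toString))
  }
  (new ImmutableBytesWritable, put)
}.saveAsNewAPIHadoopDataset(job.getConfiguration)

// Once the data is in HBase, a Phoenix view can expose the columns you
// query most, for example:
//   CREATE VIEW "events" (pk VARCHAR PRIMARY KEY, "d"."amount" VARCHAR);

Because each Put only carries the fields that actually appeared in that hour's file, the mandatory/optional column problem goes away on the storage side; Phoenix then gives you SQL over whichever subset you care about.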


2 REPLIES


Expert Contributor

Hi @Binu Mathew

Thanks for your answer. I'll dive into this approach & post further if/when required.

Thanks!

Christophe