
csv file to orc table with transformation and validation process


Hello,

I want to load data from CSV files from different datasources in HDFS into an ORC table, including some data validations (business rules, ...) and transformations. My general process is to load all CSV files with the same structure from the same datasource into one external table, then apply the validation rules and the data transformations, and then load the data into the ORC table.

My question is: what is the best way to automate the process (loading, validation and transformation) so that scheduling and monitoring are easy, and so that all the CRUD problems are handled?

PS: I started working with Java Spark and Oozie for scheduling.
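
For illustration, here is a rough sketch of the kind of job I started with (the paths, the header option, and the stand-in validation rule are placeholders, not my real rules):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToOrcJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-to-orc")
                .getOrCreate();

        // Load all CSV files with the same structure from one datasource
        // directory (path and options are placeholders).
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .csv("hdfs:///data/source_a/incoming/*.csv");

        // The validation rules and transformations would go here; dropping
        // rows with nulls is only a stand-in.
        Dataset<Row> clean = raw.na().drop();

        // Write the result as ORC; the output directory is then mapped to
        // the target Hive table.
        clean.write()
                .mode("append")
                .orc("hdfs:///warehouse/target_table_orc");

        spark.stop();
    }
}
```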

thx

4 REPLIES

Re: csv file to orc table with transformation and validation process

Hi @Réda

You can use NiFi to implement and industrialize these ingestion tasks. It has out-of-the-box connectors to read data from HDFS (ListHDFS and FetchHDFS), to transform data (for example, converting CSV to ORC), and to write it to HDFS or Hive.

You can do all of this through the UI, so it offers you a better time to market. You also get built-in security, HA, and lineage features.

Look at the available processors to get an idea of what you can do with NiFi; there are many, many more listed in the documentation: https://nifi.apache.org/docs.html

Re: csv file to orc table with transformation and validation process


Hi @Abdelkrim Hadjidj,

Thank you for the answer.

We have certain constraints related to the environment and to the team that manages the data lake. Apache NiFi is unfortunately not supported; the proposed solutions revolve around Java/Spark and Oozie.

thx again,

Réda

Re: csv file to orc table with transformation and validation process



1. If the validation and transformation are complex, write Spark jobs scheduled with Oozie, save the output data to HDFS as ORC, and map it to a Hive table.

2. If the validation and transformation are simple, you can use Pig, save the output to HDFS as ORC, and map a Hive table to it. Note that Pig supports only MapReduce and Tez as execution engines.

3. Since it is CSV, the assumption is that the data is neither huge nor streaming. You can distcp the data to HDFS, or use NiFi to ingest it onto HDFS, and run the Spark job over it.

4. If the ingestion needs to be done periodically, maintain a YYYY-MM-DD partition on the output directory in HDFS and map the partition to Hive (see the sketch below).
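
As an illustration of points 1 and 4 combined, a minimal Java Spark sketch that writes day-partitioned ORC output (the paths and the load_date column name are assumptions):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.current_date;

public class DailyCsvToOrc {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-csv-to-orc")
                .getOrCreate();

        // Read the day's CSV drop (path is a placeholder).
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("hdfs:///data/source_a/incoming/");

        // Tag every row with the load date, so the writer can partition by day.
        Dataset<Row> withDate = df.withColumn("load_date", current_date());

        // partitionBy creates load_date=YYYY-MM-DD subdirectories under the
        // output path, which line up with Hive partitions on the same column.
        withDate.write()
                .mode("append")
                .partitionBy("load_date")
                .orc("hdfs:///warehouse/target_table_orc");

        spark.stop();
    }
}
```

Each run appends a load_date=YYYY-MM-DD directory; the new partition still has to be registered in Hive, for example with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION.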


Re: csv file to orc table with transformation and validation process


Hello @kgautam,

Thx for your answer.

The validation and transformation consist of file name validation, column validation, datatype checks, and adding some additional columns (update date, expiration, ...). If all the checks are OK, the file is loaded into the ORC table.

There are also some rules and transformations that depend on whether the record is a creation, an update, or a delete (the expiration and data validation process).

Also, I need to use parameters (see the sketch after this list):

  • to enable or disable one or many checks,
  • to determine how many errors are allowed before stopping the process ...
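
For example, a minimal sketch of how such parameterized checks could look in Java Spark; the flags, the threshold, and the columns id and amount are all hypothetical:

```java
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.not;

public class ValidationStep {
    // Hypothetical parameters; in practice they could come from a properties
    // file or be passed in by the Oozie workflow.
    static final boolean CHECK_KEY_NOT_NULL = true;
    static final boolean CHECK_AMOUNT_POSITIVE = true;
    static final long MAX_ALLOWED_ERRORS = 100;

    static Dataset<Row> validate(Dataset<Row> input) {
        // Build one predicate from only the enabled checks.
        Column valid = lit(true);
        if (CHECK_KEY_NOT_NULL) {
            valid = valid.and(col("id").isNotNull());
        }
        if (CHECK_AMOUNT_POSITIVE) {
            valid = valid.and(col("amount").gt(0));
        }

        // Stop the whole load if too many rows fail the checks.
        long errorCount = input.filter(not(valid)).count();
        if (errorCount > MAX_ALLOWED_ERRORS) {
            throw new IllegalStateException(
                    "Aborting load: " + errorCount + " invalid rows (max "
                    + MAX_ALLOWED_ERRORS + ")");
        }

        // Keep only the valid rows for the ORC load.
        return input.filter(valid);
    }
}
```

The returned Dataset then feeds the ORC write; rejected rows could also be written to a quarantine directory instead of being dropped.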

The ingestion normally happens at the end of the day.

Thx
