I want to load data from CSV files, coming from different data sources in HDFS, into an ORC table, with some data validation (business rules, etc.) and transformation along the way. My current process is to load all CSV files that share the same structure and come from the same data source into a single external table, apply the validation rules and the transformations, and then load the data into the ORC table.
My question is: what is the best way to automate this process (loading, validation and transformation) so that scheduling and monitoring are easy, and so that the create, update and delete (CRUD) cases are handled as well?
PS: I have started working with Java Spark, with Oozie for scheduling.
You can use NiFi to implement and industrialize these ingestion tasks. It has out-of-the-box processors to read data from HDFS (ListHDFS and FetchHDFS), to transform data (for example converting CSV to ORC) and to write it to HDFS or Hive.
You can do all of this from the UI, which shortens your time to market. You also get built-in security, HA and lineage features.
Look at these processors to get an idea of what you can do with NiFi:
and many, many others: https://nifi.apache.org/docs.html
Thank you for the answer.
We have certain constraints related to the environment and to the team that manages the data lake: Apache NiFi is unfortunately not supported. The proposed solutions revolve around Java/Spark and Oozie.
Thanks again,
1. If the validation and transformation are complex, write Spark jobs scheduled with Oozie, save the output data to HDFS as ORC and map a Hive table to it.
2. If the validation and transformation are simple, you can use Pig, save the output to HDFS as ORC and map a Hive table to it. Pig supports only MapReduce and Tez.
3. Since it is CSV, I assume the data is neither huge nor streaming. Use distcp to copy the data to HDFS; otherwise you can use NiFi to ingest the data into HDFS, then run the Spark job over it.
4. If the ingestion needs to run periodically, partition the output directory on HDFS by date (YYYY-MM-DD) and map each partition to Hive.
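To make point 4 concrete, here is a minimal plain-Java sketch of how the daily partition directory and the matching Hive statement could be built. The class name, the table name, the base path and the partition column (load_date) are all hypothetical, so adapt them to your own layout. The statement would then be run through whatever Hive client you already use.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Builds the daily partition directory and the Hive DDL that maps it (illustrative names only). */
final class DailyPartition {
    // ISO_LOCAL_DATE renders a date as yyyy-MM-dd, e.g. 2018-03-15.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ISO_LOCAL_DATE;

    /** Returns the output directory for one day, e.g. <baseDir>/load_date=2018-03-15. */
    static String partitionDir(String baseDir, String column, LocalDate day) {
        // Drop a trailing slash so the result never contains "//".
        String base = baseDir.endsWith("/")
                ? baseDir.substring(0, baseDir.length() - 1)
                : baseDir;
        return base + "/" + column + "=" + DAY.format(day);
    }

    /** Returns the HiveQL statement that registers that directory as a partition. */
    static String addPartitionDdl(String table, String column, String baseDir, LocalDate day) {
        return "ALTER TABLE " + table
                + " ADD IF NOT EXISTS PARTITION (" + column + "='" + DAY.format(day) + "')"
                + " LOCATION '" + partitionDir(baseDir, column, day) + "'";
    }
}
```

The Spark job would write its ORC output under partitionDir(...) for the processing date, and the scheduled workflow would run the statement from addPartitionDdl(...) afterwards. IF NOT EXISTS keeps the step safe to re-run for the same day.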
Hello @kgautam,
Thanks for your answer.
The validation and transformation consist of file name validation, column validation, datatype checks, and adding some additional columns (update date, expiration, etc.). If all the checks are OK, the file is loaded into the ORC table.
There are also some rules and transformations that depend on whether the record is a creation, an update or a delete (expiration and data validation process).
Also, I need to use parameters.
The ingestion normally runs at the end of the day.
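For what it is worth, the checks listed above (file name, column count, datatypes, extra audit columns) can be kept in a small parameterized class that a Spark job calls per file and per line. This is only a rough sketch in plain Java under my own assumptions: the class name, the Type enum, the file name pattern and the delimiter are hypothetical, and the naive split does not handle quoted fields.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Parameterized CSV checks: file name, column count and per-column datatype. */
final class CsvValidator {
    enum Type { STRING, INT, DATE }

    private final Pattern fileNamePattern;
    private final String delimiter;
    private final Type[] schema;

    // All rules are passed in as parameters, so one class can serve several data sources.
    CsvValidator(String fileNameRegex, String delimiter, Type... schema) {
        this.fileNamePattern = Pattern.compile(fileNameRegex);
        this.delimiter = delimiter;
        this.schema = schema.clone();
    }

    boolean isValidFileName(String name) {
        return fileNamePattern.matcher(name).matches();
    }

    /** Returns the list of problems found in one line; an empty list means the line passed. */
    List<String> validateLine(String line) {
        List<String> errors = new ArrayList<>();
        // Limit -1 keeps trailing empty columns. Quoted delimiters are NOT handled here.
        String[] cols = line.split(Pattern.quote(delimiter), -1);
        if (cols.length != schema.length) {
            errors.add("expected " + schema.length + " columns, got " + cols.length);
            return errors;
        }
        for (int i = 0; i < cols.length; i++) {
            String value = cols[i].trim();
            try {
                switch (schema[i]) {
                    case INT:  Integer.parseInt(value); break;
                    case DATE: LocalDate.parse(value);  break; // expects yyyy-MM-dd
                    default:   break;                           // STRING: anything goes
                }
            } catch (NumberFormatException | DateTimeParseException e) {
                errors.add("column " + i + ": '" + value + "' is not a valid " + schema[i]);
            }
        }
        return errors;
    }

    /** Appends the audit columns (update date, expiration date) to an already validated line. */
    String withAuditColumns(String line, LocalDate updateDate, LocalDate expirationDate) {
        return line + delimiter + updateDate + delimiter + expirationDate;
    }
}
```

A file would be rejected as a whole when isValidFileName fails or when any line returns a non-empty error list. Otherwise each line gets its audit columns and goes on to the ORC load. The create/update/delete rules are not covered here, since they depend on business logic that the thread does not spell out.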