
Selecting appropriate tool for data ingestion and processing


New Contributor

Hi,

We are currently designing a platform to enable data ingestion (from sources such as RDBMS, streams, files, etc.) into Hadoop and then perform data processing. This is intended to be a self-service platform for our business users.

We want job administration and scheduling capabilities. Moreover, we want data pipelines to be created automatically based on user input. We are evaluating options such as Apache NiFi, Apache Oozie, and Apache Airflow. NiFi is impressive and extensible, but it lacks the kind of central job monitoring that Oozie and Airflow provide. On the other hand, we found Oozie less convenient for streaming jobs. Our initial exploration gives us the impression that Airflow lacks a native REST API for job handling. We were also interested in exploring Apache Falcon, but it is deprecated as of HDP 2.6. I'm not sure how much active development and support there is around these different technologies.
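
For context, here is roughly what one of our pipelines would look like if we went with Airflow. This is only a minimal sketch; the DAG id, schedule, and commands below are illustrative placeholders, not our actual jobs:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Hypothetical ingest-then-process pipeline; all names and commands
    # here are placeholders for illustration only.
    default_args = {
        'owner': 'data-platform',
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    dag = DAG(
        'rdbms_to_hadoop',           # placeholder DAG id
        default_args=default_args,
        start_date=datetime(2018, 1, 1),
        schedule_interval='@daily',  # Airflow handles the scheduling
    )

    ingest = BashOperator(
        task_id='ingest',
        bash_command='sqoop import --connect jdbc:mysql://dbhost/sales '
                     '--table orders --target-dir /data/raw/orders',
        dag=dag,
    )

    process = BashOperator(
        task_id='process',
        bash_command='spark-submit /jobs/clean_orders.py',
        dag=dag,
    )

    ingest >> process  # process runs only after ingest succeeds

The scheduling, retries, and dependency handling are what attract us here; the missing piece for us is creating and managing such pipelines programmatically through a REST API.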

It would be helpful if anyone could share their experience with a similar problem.

Please let me know if you need more details about our project.

Thanks,

Amit

2 REPLIES

Re: Selecting appropriate tool for data ingestion and processing

New Contributor

Take a look at Kylo (https://kylo.io). They have a sandbox of Kylo on HDP: https://kylo.io/quickstart.html

It was designed to solve the business user use case and the central job monitoring use case.

The sandbox makes it pretty easy to try out.

Re: Selecting appropriate tool for data ingestion and processing

@Amit Ranjan

Falcon has been deprecated and replaced by more comprehensive services included in Data Plane Services: https://docs.hortonworks.com/HDPDocuments/DPS1/DPS-1.1.0/index.html

I agree with your assessment of those three tools. However, I would like to point out that NiFi provides reporting tasks, and I have seen enterprises enable those reporting tasks and build custom dashboards (e.g., in Grafana).

https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.1.1/bk_user-guide/content/Reporting_Tasks.html

https://pierrevillard.com/2017/05/16/monitoring-nifi-ambari-grafana/
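
If you would rather pull metrics yourself instead of (or in addition to) a reporting task, NiFi's REST API exposes aggregate flow status that you can feed into a dashboard. A minimal sketch, assuming an unsecured NiFi at http://nifi-host:8080 (the host, port, and the exact fields read are assumptions on my part):

    import requests

    NIFI = 'http://nifi-host:8080'  # hypothetical host; adjust for your cluster

    # Aggregate controller status: active threads, queued flowfiles/bytes, etc.
    status = requests.get(NIFI + '/nifi-api/flow/status').json()['controllerStatus']

    print('active threads :', status['activeThreadCount'])
    print('queued         :', status['queued'])  # e.g. "13 / 4.5 MB"

A small poller like this, scraped by Grafana or pushed to a time-series store, is essentially what those custom dashboards are built on.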

Keep in mind that NiFi can execute Spark jobs interactively via Livy, and that it can start flows on a schedule or on an event. Each flow can be considered a job and can be monitored via a reporting task, so if you build a dashboard monitoring all the flows, you get that operational monitoring per "job". Additionally, remember that with NiFi you get lineage and data governance integration with Atlas, not to mention integrated security via Ranger.
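
To give you an idea of what the Livy route looks like under the hood, submitting a Spark batch is a single REST call that NiFi (or anything else) can make. A minimal sketch, with the Livy host, jar path, and class name as placeholders:

    import requests

    LIVY = 'http://livy-host:8998'  # hypothetical Livy server

    # Submit a Spark batch; 'file' and 'className' are placeholders for your job.
    batch = requests.post(
        LIVY + '/batches',
        json={'file': 'hdfs:///jobs/etl.jar', 'className': 'com.example.Etl'},
    ).json()

    # Poll the batch state (starting -> running -> success/dead).
    state = requests.get(LIVY + '/batches/%d/state' % batch['id']).json()
    print(state['state'])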

Data Plane Services, specifically "Data Steward Studio", will provide that enterprise-level data governance, combining information from multiple clusters. See: https://docs.hortonworks.com/HDPDocuments/DSS1/DSS-1.0.0/getting-started/content/dss_data_steward_st...
