Complex, high fidelity data transformation

New Contributor

Hi, just starting off with NiFi. I am impressed with the orchestration capabilities and am now considering how to best handle the "T" in ETL. Ultimately I need to support hundreds of different formats of structured data, typically transforming into one or two target formats. Often these formats have hundreds of fields and significant munging is required to reach the target format. External enrichment might be required (NER, geocoding, you name it.)

My thought right now is that I'll need a NiFi processor that handles translations using some external capability. Either my team builds this definition-driven translation engine and makes it available to NiFi via a processor, or we integrate an existing tool. Has anyone here successfully integrated some other ETL tool? Open source and even commercial tools seem to have extraction and loading covered, but I've not yet run into a great tool that specializes in letting a data guru create deep schema transformation definitions that can be executed at run time. The idea is roughly like XSLT in spirit, but much more extensive in transformation capabilities. I would love to hear your recommendations if you have any.
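To make "definition-driven" concrete, here is a minimal sketch of the idea in Python, not a real engine or a NiFi API. The mapping format, the `TRANSFORMS` registry, and the names `translate`, `get_path`, and `EXAMPLE_MAPPING` are all hypothetical and invented for illustration. Each target field is declared as a source path plus an ordered list of named transforms, so a new format only needs a new definition rather than new code.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical definition format: target field -> (dotted source path, transform names)
Mapping = Dict[str, Tuple[str, List[str]]]

# Hypothetical registry of named transforms that a definition may reference.
TRANSFORMS: Dict[str, Callable[[Any], Any]] = {
    "strip": lambda v: v.strip(),
    "upper": lambda v: v.upper(),
    "to_int": int,
}


def get_path(record: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'cust.name' through nested dicts."""
    current: Any = record
    for key in path.split("."):
        current = current[key]
    return current


def translate(record: Dict[str, Any], mapping: Mapping) -> Dict[str, Any]:
    """Build a target record by applying each field's transforms in order."""
    out: Dict[str, Any] = {}
    for target, (source_path, steps) in mapping.items():
        value = get_path(record, source_path)
        for step in steps:
            value = TRANSFORMS[step](value)
        out[target] = value
    return out


# Illustrative definition and input record (made-up data).
EXAMPLE_MAPPING: Mapping = {
    "customer_name": ("cust.name", ["strip", "upper"]),
    "age_years": ("cust.age", ["to_int"]),
}
example_source = {"cust": {"name": "  ada lovelace ", "age": "36"}}
```

Calling `translate(example_source, EXAMPLE_MAPPING)` would produce a flat record with `customer_name` and `age_years`. In practice the definitions would probably live in external files (JSON, YAML, or similar), and enrichment steps such as geocoding would be registered as additional named transforms.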

Re: Complex, high fidelity data transformation

Expert Contributor

NiFi is intended primarily for flow management, as our docs state. It can do some light transformation work, but it is not designed for heavy data manipulation. For that level of work, you could take an ELT approach and transform the data after landing it in your HDP cluster. In that setup, NiFi would handle the flow management needed to get the data into HDFS on HDP; you could then use any number of approaches (Spark, Hive, Pig, etc.) to transform the data into your intended structure.
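The ELT pattern described above can be sketched in a few lines. This is only an illustration of the ordering (land raw data first, transform it afterwards with SQL): it uses Python's built-in `sqlite3` as a stand-in for Hive or Spark SQL on the cluster, and the table names, columns, and sample rows are made up.

```python
import sqlite3


def land_raw(conn, rows):
    """'Load' step: store records as-is, everything as text, no cleanup."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_events (name, amount) VALUES (?, ?)", rows)


def transform(conn):
    """'Transform' step: derive the target table from the landed data with SQL."""
    conn.execute("DROP TABLE IF EXISTS clean_events")
    conn.execute(
        """
        CREATE TABLE clean_events AS
        SELECT UPPER(TRIM(name)) AS name,
               CAST(amount AS REAL) AS amount
        FROM raw_events
        WHERE TRIM(amount) <> ''
        """
    )
    return conn.execute(
        "SELECT name, amount FROM clean_events ORDER BY name"
    ).fetchall()


def run_example():
    # In-memory database stands in for the cluster; NiFi's role would be
    # delivering the raw rows, which is simulated here by land_raw().
    conn = sqlite3.connect(":memory:")
    land_raw(conn, [(" alice ", "10.5"), ("bob", ""), ("carol", "3")])
    return transform(conn)
```

The point of the split is that the raw table stays untouched, so transformation logic can be changed and re-run without re-ingesting anything.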

Hortonworks also has many partners that specialize in ETL on Hadoop. Syncsort is one of them; we have a tutorial on how you can set that up with their product. https://hortonworks.com/hadoop-tutorial/deploying-hadoop-etl-in-the-hortonworks-sandbox/