
How to create an ETL job in Spark


Contributor

Hi,

I wonder how I can use Spark for ETL purposes, since I am new to Hadoop. My confusion is: how do I create an ETL job in Spark, and are there any GUI features available in Spark for creating ETL jobs? Also, if you can recommend any videos or URLs that show how to create an ETL job in Spark, that would be very helpful. Please advise.

Looking forward to hearing from you.

Thank you,

Ujjwal Rana


Re: How to create an ETL job in Spark

Champion
Not the traditional ETL process. Spark needs to have access to the data you are processing on all of the nodes; in CDH this is accomplished using HDFS, so the data must be loaded there first. Spark can shine in transforming the data, but depending on the process it may not be a huge improvement over, say, MapReduce.

Most Spark work I have seen to date involves code jobs in Scala, Python, or Java. There has been a shift in the landscape as Spark has matured, to the point that most tools in the ETL spectrum now support Spark as an execution engine. Hive on Spark also allows more traditional ETL to happen in Spark: it only requires SQL knowledge to work with Hive, and the execution layer is handed over to Spark without you needing to know anything about Spark itself.

I don't have any fancy links to share right now. When looking at tools out there, old and new, search through the docs to see if they support Spark as an execution engine. Find the Hive on Spark wiki or CDH documentation to read up more on that.
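To make the "code job" idea concrete, here is a minimal PySpark sketch of a batch ETL job; the HDFS paths, table name, and columns are hypothetical, and the SQL-style transform echoes the Hive-on-Spark point above.

# Minimal PySpark ETL sketch: extract raw CSV from HDFS, transform with SQL,
# load the result back to HDFS as Parquet. All paths and columns are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("simple-etl-sketch").getOrCreate()

# Extract: read raw files that were previously loaded into HDFS
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("hdfs:///data/raw/sales/"))

# Transform: plain SQL, similar in spirit to the Hive-on-Spark approach
raw.createOrReplaceTempView("sales_raw")
daily = spark.sql("""
    SELECT order_date,
           SUM(amount) AS total_amount,
           COUNT(DISTINCT customer_id) AS distinct_customers
    FROM sales_raw
    WHERE amount IS NOT NULL
    GROUP BY order_date
""")

# Load: write the aggregated result back to HDFS
daily.write.mode("overwrite").parquet("hdfs:///data/curated/daily_sales/")

spark.stop()

A job like this can then be submitted with spark-submit on the cluster once the input data is in HDFS.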

Re: How to create an ETL job in Spark

Champion

@UjjwalRana

 

You can use "Apache NiFi" to get ETL like feature: https://nifi.apache.org/

 

Refer to the link below to learn about stream processing with NiFi and Spark:

https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
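
As a very rough sketch of how the two can be combined (this is not taken from that post, which uses the NiFi-to-Spark receiver; here NiFi is assumed to publish records to Kafka, for example with its PublishKafka processor), a Spark Structured Streaming job could consume and land the data like this. The broker address, topic, and paths are hypothetical.

# Sketch: Spark Structured Streaming consuming a Kafka topic that NiFi feeds.
# Requires the spark-sql-kafka package on the classpath; broker, topic, and
# paths below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("nifi-kafka-spark-sketch").getOrCreate()

# Read the stream; each Kafka record value is treated as a plain string here
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "nifi_events")
          .load()
          .select(col("value").cast("string").alias("raw_event")))

# Continuously write the records to HDFS as Parquet, with checkpointing
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/nifi_events/")
         .option("checkpointLocation", "hdfs:///tmp/checkpoints/nifi_events/")
         .start())

query.awaitTermination()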

 

Also, search YouTube for "apache nifi tutorial" for more details.

 

Hope this helps!

 

Thanks

Kumar

 

Re: How to create an ETL job in Spark

Contributor

Hi Kumar

Thanks for the response. Is Apache NiFi the only tool available for processing ETL jobs, or are there others as well? What is the most popular tool for ETL in Spark, or is it Apache NiFi only?

 

Thank You

 

Ujjwal

Re: How to create an ETL job in Spark

Champion

@UjjwalRana

 

In the Hadoop world, most tools have an alternative, so we cannot say any single tool is the only option, but to my knowledge NiFi is one of the best. For different situations we use different tools, such as NiFi, Oozie, Falcon, etc.

 

Note: When you refer to blog posts, please make sure they are recent so you get a better idea of the current situation.

 

Thanks

Kumar

Re: How to create an ETL job in Spark

Contributor

Hi Kumar

I am new to the Hadoop world. Right now I am working with Pentaho ETL, but Apache NiFi is totally new to me. Is there any online tutorial that explains how to create an ETL job, such as a merge or fact-table load, in Apache NiFi? I just want to see a demo. Any recommendations?

 

It would be a great help if you could guide me on the above requirement.

 

Looking forward to hearing from you.

 

Thank You

 

Ujjwal Rana

Re: How to create an ETL job in Spark

Champion
@UjjwalRana Honestly, based on your original inquiry, Pentaho and tools like it are better suited unless you are looking to add skills. Others include Datameer, RapidMiner, and a third that I can't recall at the moment; let's call it the Department of Energy.

http://www.pentaho.com/blog/2014/06/30/spark-on-fire-integrating-pentaho-and-spark

Here is where you want to start for NiFi though.

https://nifi.apache.org/


Re: How to create an ETL job in Spark

Contributor

Hi mbigelow

I am not looking for a third-party ETL tool like Pentaho. My question is whether I can design the ETL job in Apache NiFi or not. For example, the URL below contains a screenshot of an ETL job designed in Pentaho; I want to design the same kind of job in Apache NiFi as well. Is that possible?

 

If it is possible, I wonder if you can recommend any URL where I can learn how to design an ETL job in Apache NiFi, similar to the Pentaho one. Please advise.

 

https://drive.google.com/file/d/0B-wEtRLWeFvMMGt1LWJUbURsTDA/view

 

Looking forward to hearing from you.

 

Thank You

 

Ujjwal Rana

Re: How to create an ETL job in Spark

Expert Contributor

Just a clarification: Apache NiFi is not included as part of the Hadoop platform and would run in a separate cluster as a third-party tool. It also does not create Spark ETL jobs and is an alternative to Spark. There is some functionality to bring data from NiFi into a Spark job, but you are writing the Spark code yourself. Apache NiFi is used to ingest external, streaming data into Hadoop. StreamSets is another popular tool, similar to NiFi, that is used to ingest streaming data.
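
To illustrate that last point with a hypothetical sketch: if NiFi lands raw JSON files into an HDFS directory (for example via its PutHDFS processor), the Spark job that processes them is still ordinary hand-written Spark code. The paths and fields here are made up.

# Sketch: processing files that NiFi has already landed in HDFS.
# NiFi handles the ingestion; the transformation below is hand-written Spark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("process-nifi-landed-data").getOrCreate()

# NiFi ingests and lands the raw JSON; Spark picks it up from HDFS
events = spark.read.json("hdfs:///landing/nifi/events/")

# Hand-written transformation, not generated by NiFi
cleaned = (events
           .filter(col("event_type").isNotNull())
           .withColumn("event_date", to_date(col("event_ts"))))

(cleaned.write
 .mode("append")
 .partitionBy("event_date")
 .parquet("hdfs:///warehouse/events/"))

spark.stop()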

 

As mbigelow mentioned, there are other tools that will actually run Spark jobs and have a GUI interface, as opposed to running separately like NiFi: Pentaho, Informatica, and Talend, to name a few, but there are many others.

 

There are lots of choices available to you, and Cloudera certifies partners. If possible, speak with a local Cloudera representative to help identify partner solutions or other ETL/Data Integration solutions.