
I was reviewing some posts related to Pig and found the following question interesting:

https://community.hortonworks.com/questions/47720/apache-pig-guarantee-that-all-the-value-in-a-colum...

I wanted to share an alternative solution using Pentaho Data Integration (PDI), an open-source ETL tool that provides visual MapReduce capabilities. PDI is YARN-ready, so when you configure PDI to use your HDP cluster (or sandbox) and run the attached job, it will run as a YARN application.

The following image is your Mapper.

[Image: mapreduce-main.jpg — the main MapReduce transformation]

Above, you see the main transformation. It reads the input you configure in the Pentaho MapReduce job (shown below). The transformation follows a common pattern: first, split the delimited file into individual fields. Next, a Java Expression step determines whether each field is numeric; if not, the field's value is set to the string "null". Finally, to prepare for MapReduce output, the fields are concatenated back into a single value and the key/value pair is passed to the MapReduce Output step. A plain-Java sketch of this logic follows.
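For readers more comfortable with code than with PDI's visual steps, here is a minimal hand-written Hadoop mapper implementing the same logic. The class name, the comma delimiter, and the choice of the byte offset as the output key are my own assumptions for illustration; the PDI transformation above is the actual implementation.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NumericCleanseMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the delimited line into individual fields
        // (comma delimiter assumed; -1 keeps trailing empty fields).
        String[] fields = value.toString().split(",", -1);

        // Replace any non-numeric field with the string "null".
        for (int i = 0; i < fields.length; i++) {
            if (!isNumeric(fields[i])) {
                fields[i] = "null";
            }
        }

        // Concatenate the fields back into a single delimited value
        // and emit the key/value pair.
        outKey.set(key.toString());
        outValue.set(String.join(",", fields));
        context.write(outKey, outValue);
    }

    // Mirrors the Java Expression step: true only if the field
    // parses as a number.
    private static boolean isNumeric(String s) {
        try {
            Double.parseDouble(s.trim());
            return true;
        } catch (NumberFormatException | NullPointerException e) {
            return false;
        }
    }
}
```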

Once you have the main MapReduce transformation created, you wrap it in a PDI MapReduce job. If you're familiar with MapReduce, you will recognize the configuration options below, which you would otherwise set in code.

[Image: mapreduce-job-setup.jpg — Pentaho MapReduce job setup]
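For comparison, here is roughly what the job setup dialog above corresponds to in a hand-written Hadoop driver. The class names, paths, and the map-only (zero reducers) configuration are assumptions for illustration, not a transcription of the PDI settings.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class NumericCleanseDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "numeric-cleanse");
        job.setJarByClass(NumericCleanseDriver.class);

        job.setMapperClass(NumericCleanseMapper.class);
        job.setNumReduceTasks(0); // map-only: mapper output goes straight to HDFS

        // Input/output formats and types, as captured by the PDI job dialog.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Hypothetical HDFS paths passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```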

Next, configure your Mapper.

[Image: mapreduce-mapper.jpg — Mapper configuration]

The job succeeds!

[Image: yarn-app.jpg — the job running as a YARN application]

And the file is in HDFS.

[Image: hdfs.jpg — the output file in HDFS]
