
I was reviewing some posts related to Pig, and found the following question interesting:

https://community.hortonworks.com/questions/47720/apache-pig-guarantee-that-all-the-value-in-a-colum...

I wanted to share an alternative solution using Pentaho Data Integration (PDI), an open source ETL tool that provides visual MapReduce capabilities. PDI is YARN ready, so when you configure PDI to use your HDP cluster (or sandbox) and run the attached job, it will run as a YARN application.

The following image shows the Mapper.

6168-mapreduce-main.jpg

Above, you see the main transformation. It reads its input, which you configure in the Pentaho MapReduce job (seen below). The transformation follows a common pattern: it immediately splits the delimited input into individual fields. Next, I use a Java Expression to determine whether a field is numeric. If it is not, we set the value of the field to the string "null". Finally, to prepare for MapReduce output, we concatenate the fields back together as a single value and pass the key/value pair to the MapReduce Output.
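For readers who prefer to see the logic as code, here is a minimal plain-Java sketch of what the transformation does to each line: split, check, replace, rejoin. It is not the PDI transformation itself or the expression used in the attached job. The class name `NumericFieldCleaner`, the method names, the comma delimiter, and the regex definition of "numeric" are all assumptions made for illustration, and the sketch checks every field, whereas the transformation may only check selected ones.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDI or the attached job.
class NumericFieldCleaner {

    // Assumption: "numeric" means an optionally signed integer or decimal number.
    static boolean isNumeric(String field) {
        return field != null && field.matches("[-+]?\\d+(\\.\\d+)?");
    }

    // Split the line into fields, replace each non-numeric field with the
    // string "null", and concatenate the fields back into a single value.
    static String cleanLine(String line, String delimiter) {
        // Pattern.quote treats delimiters such as "|" literally;
        // the -1 limit keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(delimiter), -1);
        StringJoiner joined = new StringJoiner(delimiter);
        for (String field : fields) {
            joined.add(isNumeric(field) ? field : "null");
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // prints 1,null,3.5,null,-7
        System.out.println(cleanLine("1,abc,3.5,,-7", ","));
    }
}
```

In a map-only job, the returned string would play the role of the value handed to the MapReduce Output step; how the key is chosen depends on the job configuration and is not modeled here.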

Once you have created the main MapReduce transformation, you wrap it in a PDI MapReduce job. If you're familiar with MapReduce, you will recognize the configuration options below, which you would otherwise set in your code.

6169-mapreduce-job-setup.jpg

Next, configure your Mapper.

6170-mapreduce-mapper.jpg

The job succeeds!

6181-yarn-app.jpg

And the file is in HDFS.

6182-hdfs.jpg

Last update: 08-17-2019 11:07 AM