I was reviewing some posts related to Pig and found the following question interesting:

https://community.hortonworks.com/questions/47720/apache-pig-guarantee-that-all-the-value-in-a-colum...

I wanted to share an alternative solution using Pentaho Data Integration (PDI), an open source ETL tool that provides visual MapReduce capabilities. PDI is YARN-ready, so when you configure PDI to use your HDP cluster (or sandbox) and run the attached job, it runs as a YARN application.

The following image shows the mapper transformation.

6168-mapreduce-main.jpg

Above, you see the main transformation. It reads the input you configure in the Pentaho MapReduce job (shown below). The transformation follows a common pattern: it immediately splits each line of the delimited file into individual fields. Next, a Java Expression step determines whether each field is numeric; if not, it sets the field's value to the string "null". Finally, to prepare for MapReduce output, the fields are concatenated back into a single value, and the key/value pair is passed to the MapReduce Output step.
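To make the logic concrete, here is a minimal sketch of what an equivalent hand-written Hadoop mapper might look like. PDI builds this step graphically rather than generating a class like this, and the class name, delimiter, and numeric check below are my own assumptions for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper mirroring the PDI transformation above:
// split the delimited line, null out non-numeric fields, re-join, emit.
public class NumericCleanseMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final String DELIMITER = ",";  // assumed delimiter

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The -1 limit keeps trailing empty fields, as the PDI splitter does.
        String[] fields = value.toString().split(DELIMITER, -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                out.append(DELIMITER);
            }
            // Same check the Java Expression step performs: keep the field
            // if it parses as a number, otherwise write the string "null".
            out.append(isNumeric(fields[i]) ? fields[i] : "null");
        }
        // Hand the key/value pair to the MapReduce output; reusing the line
        // offset as the key is purely for illustration.
        context.write(new Text(key.toString()), new Text(out.toString()));
    }

    private static boolean isNumeric(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```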

Once you have created the main MapReduce transformation, you wrap it in a PDI MapReduce job. If you're familiar with MapReduce, you will recognize the configuration options below; they are the same settings you would otherwise set in code.

6169-mapreduce-job-setup.jpg

Next, configure your Mapper.

6170-mapreduce-mapper.jpg
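Taken together, the settings collected in these two dialogs correspond to what a hand-coded driver would contain. The sketch below is a rough equivalent under stated assumptions: the class names, HDFS paths, and the map-only setup (zero reducers) are placeholders for illustration, not what PDI generates.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver showing the settings the PDI job entry captures
// visually: input/output paths, formats, key/value types, mapper class.
public class NumericCleanseDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "numeric-cleanse");
        job.setJarByClass(NumericCleanseDriver.class);

        job.setMapperClass(NumericCleanseMapper.class);
        job.setNumReduceTasks(0);  // map-only, matching the job above

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Paths are placeholders; PDI prompts for these in the dialog.
        FileInputFormat.addInputPath(job, new Path("/user/pdi/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/pdi/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```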

The Job Succeeds!

6181-yarn-app.jpg

And the file is in HDFS.

6182-hdfs.jpg
