I was reviewing some posts related to Pig and found the following question interesting:

https://community.hortonworks.com/questions/47720/apache-pig-guarantee-that-all-the-value-in-a-colum...

I wanted to share an alternative solution using Pentaho Data Integration (PDI), an open source ETL tool that provides visual MapReduce capabilities. PDI is YARN-ready, so when you configure PDI to use your HDP cluster (or sandbox) and run the attached job, it runs as a YARN application.

The following image shows the mapper transformation.

6168-mapreduce-main.jpg

Above, you see the main transformation. It reads the input you configure in the Pentaho MapReduce job (shown below). The transformation follows a common pattern: it immediately splits each line of the delimited file into individual fields. Next, a Java Expression step determines whether each field is numeric; if not, it sets the field's value to the string "null". Finally, to prepare for MapReduce output, the fields are concatenated back into a single value, and the key/value pair is passed to the MapReduce Output step.
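To make the logic concrete, here is a minimal sketch of what an equivalent hand-written Hadoop mapper might look like. PDI builds this step graphically rather than generating a class like this, and the class name, delimiter, and numeric check below are my own assumptions for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper mirroring the PDI transformation above:
// split the delimited line, null out non-numeric fields, re-join, emit.
public class NumericCleanseMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final String DELIMITER = ",";  // assumed delimiter

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The -1 limit keeps trailing empty fields, as the PDI splitter does.
        String[] fields = value.toString().split(DELIMITER, -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                out.append(DELIMITER);
            }
            // Same check the Java Expression step performs: keep the field
            // if it parses as a number, otherwise write the string "null".
            out.append(isNumeric(fields[i]) ? fields[i] : "null");
        }
        // Hand the key/value pair to the MapReduce output; reusing the line
        // offset as the key is purely for illustration.
        context.write(new Text(key.toString()), new Text(out.toString()));
    }

    private static boolean isNumeric(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```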

Once you have created the main MapReduce transformation, you wrap it in a PDI MapReduce job. If you're familiar with MapReduce, you will recognize the configuration options below; they are the same settings you would otherwise set in code.

6169-mapreduce-job-setup.jpg

Next, configure your Mapper.

6170-mapreduce-mapper.jpg
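Taken together, the settings collected in these two dialogs correspond to what a hand-coded driver would contain. The sketch below is a rough equivalent under stated assumptions: the class names, HDFS paths, and the map-only setup (zero reducers) are placeholders for illustration, not what PDI generates.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver showing the settings the PDI job entry captures
// visually: input/output paths, formats, key/value types, mapper class.
public class NumericCleanseDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "numeric-cleanse");
        job.setJarByClass(NumericCleanseDriver.class);

        job.setMapperClass(NumericCleanseMapper.class);
        job.setNumReduceTasks(0);  // map-only, matching the job above

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Paths are placeholders; PDI prompts for these in the dialog.
        FileInputFormat.addInputPath(job, new Path("/user/pdi/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/pdi/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```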

The Job Succeeds!

6181-yarn-app.jpg

And the file is in HDFS.

6182-hdfs.jpg
