I wanted to share an alternative solution using Pentaho Data
Integration (PDI), an open source ETL tool that provides visual MapReduce
capabilities. PDI is YARN-ready, so when you configure PDI to use your HDP
cluster (or sandbox) and run the attached job, it runs as a YARN
application.
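Pointing PDI at the cluster is usually just a matter of selecting the matching Hadoop shim in the big data plugin's properties file. Here's a rough sketch, assuming an HDP 2.x shim; the directory name "hdp22" is an example and should match whatever ships under hadoop-configurations/ in your PDI install:

```
# plugins/pentaho-big-data-plugin/plugin.properties
# Select the shim that matches your HDP version; "hdp22" is an assumed
# example directory name under hadoop-configurations/.
active.hadoop.configuration=hdp22
```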
The following image is your Mapper.
Above, you see the main transformation. It
reads input that you configure in the Pentaho MapReduce Job (seen below). The
transformation follows a common pattern: it immediately splits the delimited
file into individual fields. Next, I use a Java Expression to determine whether
each field is numeric. If not, we set the field's value to the string
"null". Finally, to prepare for MapReduce output, we concatenate the
fields back together into a single value and pass the key/value pair to the
MapReduce Output step.
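If it helps to see the same logic in code, here is a minimal hand-written Java mapper doing what the visual transformation does. This is a sketch, not what PDI actually generates: the class name, the comma delimiter, and the numeric regex are all assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative equivalent of the PDI mapper transformation:
// split the record, replace non-numeric fields with "null", re-join, emit.
public class NullScrubMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the delimited record into individual fields
        // (comma is an assumed delimiter).
        String[] fields = value.toString().split(",", -1);

        // Mirror the Java Expression step: if a field is not numeric,
        // set its value to the string "null".
        for (int i = 0; i < fields.length; i++) {
            if (!fields[i].matches("-?\\d+(\\.\\d+)?")) {
                fields[i] = "null";
            }
        }

        // Concatenate the fields back into a single value and pass
        // the key/value pair on, as the MapReduce Output step does.
        context.write(key, new Text(String.join(",", fields)));
    }
}
```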
Once you have created the main MapReduce
transformation, you wrap it in a PDI MapReduce Job. If you're
familiar with MapReduce, you will recognize the configuration options below;
they are the same options you would otherwise set in code.
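For comparison, this is roughly the driver code those job-entry settings replace. Again a hedged sketch: the paths, job name, and the NullScrubMapper class from the earlier example are placeholders, and this mapper-only cleanup needs no reducer.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Illustrative driver showing the settings the PDI MapReduce Job entry
// exposes visually: mapper class, formats, and input/output paths.
public class NullScrubDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "null-scrub");
        job.setJarByClass(NullScrubDriver.class);

        // Mapper only; no reducer is needed for this cleanup pass.
        job.setMapperClass(NullScrubMapper.class);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Placeholder HDFS paths; set these to your own locations.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```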