
Cleaning Data and Staging Tables

New Contributor

Hello,

I have encountered situations in the past where the data that I want to upload requires a fair amount of cleaning prior to use. Sometimes it needs to be de-duplicated, have all NULL values changed to 0, or be adjusted to another format (e.g., removing headers and rows that contain table formatting). For instance, sometimes I have been provided a file that looks like this:

+-------------------------+-------------------------+-------------------------+-------------------------+
|         HEADING         |         HEADING         |         HEADING         |         HEADING         |
+-------------------------+-------------------------+-------------------------+-------------------------+
|    DATA (w/ spaces)     |          DATA           |          NULL           |          DATA           |

Obviously, if this were saved in .csv format, I would want to remove the whitespace and "table-like" formatting so that it is easier to load into applications such as Pig and Hive.

I have heard of people using a staging table and Python/R scripts to help with this process. However, I was wondering what tools/methods are available in the Hadoop environment (perhaps in Hive or Pig) to adjust the data during the upload process. So far I have tried some workaround scripts in Pig, but I think there must be a better solution. For example:

raw = LOAD '/path/to/data' USING PigStorage('|'); -- only pipe characters seemed to delimit the file; "import" is a reserved word in Pig, so the alias is renamed
dropfirstcol = FOREACH raw GENERATE $1..; -- drop the empty first column created by the leading pipe
removeformatrows = FILTER dropfirstcol BY $0 IS NOT NULL AND $0 != ''; -- remove the "+-----+" border rows, which contain no pipes
removewhitespace = FOREACH removeformatrows GENERATE TRIM($0), TRIM($1), TRIM($2), TRIM($3); -- TRIM strips the padding from each of the four columns without eating the spaces inside values like "DATA (w/ spaces)"
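
I have also wondered whether the staging-table approach people mention could be done entirely in Hive, along these lines. This is only a rough sketch: the table and column names (raw_stage, cleaned, c1-c4) are placeholders, it assumes '/path/to/data' is a directory holding the file, and it arbitrarily treats the fourth column as the numeric one that has the NULLs.

-- stage the raw file as-is; the leading pipe on data rows creates an empty first field (pad)
CREATE EXTERNAL TABLE raw_stage (pad STRING, c1 STRING, c2 STRING, c3 STRING, c4 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/path/to/data';

CREATE TABLE cleaned (c1 STRING, c2 STRING, c3 STRING, c4 INT);

INSERT OVERWRITE TABLE cleaned
SELECT DISTINCT                             -- de-duplicates the rows
       trim(c1), trim(c2), trim(c3),
       COALESCE(CAST(trim(c4) AS INT), 0)   -- the literal string 'NULL' fails the cast, so COALESCE turns it into 0
FROM raw_stage
WHERE pad = ''                              -- "+-----+" border rows contain no pipes, so the whole line lands in pad
  AND trim(c1) <> 'HEADING';                -- drop the header row

Would something like that be considered better practice than the Pig workaround, or is it just a matter of preference?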

If more clarification is needed, please let me know. I have reviewed this but am still having issues.

Thanks!

2 REPLIES

You're right that there are plenty of times when you need to do some cleansing/transforming/enhancing of your data, and you're also right that you have multiple tools and approaches for this. I talked about this (at a high level) in my recent www.devnexus.com preso, which you can find at https://www.slideshare.net/lestermartin/transformation-processing-smackdown-spark-vs-hive-vs-pig. The good (and bad) news is that you get to make some choices here, which I believe are usually decided based on your, and your team's, experience and preferences as much as anything.

If you have some specific scenarios you want help with, it might be best to open a separate HCC question for each of them; you'll probably get more targeted responses that way, since this question is rather high-level and the answers could quickly become subjective, again depending on individuals' experiences and preferences. Good luck and happy Hadooping!