
Data cleaning before storing in HDFS

Expert Contributor

I am using Sqoop to import data into HDFS and Hive, and have the job scheduled with Oozie.

How do I introduce a data cleaning layer in the system before storing this data in Hive or HDFS? What tools in the Hadoop ecosystem are available or suitable for this purpose?

1 ACCEPTED SOLUTION


@Simran Kaur for this kind of use case, one often designs and leverages a tiered data architecture within HDFS. Namely, the rawest data from the source system is landed in HDFS in a Land tier, with little or no transformation. At the other end of the spectrum, a Presentation tier would often contain objects that have gone through a data transformation pipeline and are exposed to applications (such as BI tools).

In Land, one would often serialize data as text (for simplicity of ingest and since these objects are not read directly very often), whereas in Presentation, the data would be stored in a Hive table, often serialized as ORC to drive read performance.

As for data cleaning and ETL, these transformations would often be implemented in Pig, Hive, or Spark (noting that Pig and Hive generate MapReduce code under the covers). There are also many commercial Data Quality/ETL solutions on the market.
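As a rough sketch of that tiering (the database, table, and column names and the HDFS locations below are hypothetical, not from this thread; the HiveQL is submitted through PySpark with Hive support enabled, but the same statements could be run directly in the Hive CLI or beeline):

```python
# Sketch only: the land/presentation databases, table/column names, and HDFS paths are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tiered-hdfs-layout")
         .enableHiveSupport()   # assumes a configured Hive metastore
         .getOrCreate())

# Land tier: an external, text-serialized table over the files Sqoop dropped into HDFS.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS land.orders_raw (
        order_id STRING, customer STRING, amount STRING, order_ts STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///data/land/orders/'
""")

# Presentation tier: a managed, ORC-serialized table that BI tools actually query.
spark.sql("""
    CREATE TABLE IF NOT EXISTS presentation.orders (
        order_id BIGINT, customer STRING, amount DOUBLE, order_ts TIMESTAMP
    )
    STORED AS ORC
""")
```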

Please let us know if you have further specific questions.


8 REPLIES


Expert Contributor
@slachterman

Hi, thank you for your response. How exactly do I use Hive for data cleaning purposes? Could you please give an example? I have not really touched Spark so far, so do you think Spark is the way to go just for introducing the data cleaning layer in the system? I have heard all the good things about Spark, but would like to know whether this would be a suitable use case for it.

Master Guru

NiFi would be great for this: have it do some basic cleaning on ingest into HDFS, and do further cleaning with a Spark or Pig job.
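A minimal sketch of the Spark half of that split, assuming NiFi has already landed raw CSV files in HDFS (the paths and column names here are hypothetical):

```python
# Sketch only: paths and column names are assumptions, not from the thread.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("post-ingest-cleaning").getOrCreate()

raw = spark.read.csv("hdfs:///data/land/events/", header=True, inferSchema=True)

cleaned = (raw
           .dropDuplicates()                                         # drop exact duplicate rows
           .filter(F.col("event_id").isNotNull())                    # require the business key
           .withColumn("event_type", F.lower(F.trim("event_type")))  # standardize a code column
           .na.fill({"country": "unknown"}))                         # backfill a missing categorical

cleaned.write.mode("overwrite").orc("hdfs:///data/presentation/events/")
```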


@Simran Kaur an example of using Hive for data cleansing is in this article (see section 3.5 in particular).

Regarding Spark, it is widely used for extract, transform, and load (ETL) logic and is usually well-suited for those kinds of use cases. Both MapReduce and Spark are very general computation paradigms. It would be helpful to know what data cleaning transformations you have in mind.
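As a small illustration of the kind of Hive cleansing logic such an article covers, continuing the hypothetical Land/Presentation tables sketched above (the statement is submitted through spark.sql here, but it is plain HiveQL and runs in Hive directly):

```python
# Sketch only: table and column names are assumptions carried over from the earlier sketch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Cleansing on the way from Land to Presentation: trim strings, turn empty strings and
# unparseable values into NULL, drop rows without a key, and keep one row per order_id.
spark.sql("""
    INSERT OVERWRITE TABLE presentation.orders
    SELECT order_id, customer, amount, order_ts
    FROM (
        SELECT
            CAST(order_id AS BIGINT) AS order_id,
            CASE WHEN TRIM(customer) = '' THEN NULL ELSE TRIM(customer) END AS customer,
            CAST(amount AS DOUBLE) AS amount,        -- non-numeric strings become NULL
            CAST(order_ts AS TIMESTAMP) AS order_ts,
            ROW_NUMBER() OVER (PARTITION BY order_id
                               ORDER BY CAST(order_ts AS TIMESTAMP) DESC) AS rn
        FROM land.orders_raw
        WHERE order_id IS NOT NULL
    ) deduped
    WHERE rn = 1
""")
```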

New Contributor

Hi everyone! I'm very happy to announce that there is now a data cleansing framework that connects directly to Apache Spark and uses Spark to do the data cleaning. It is called Optimus; it is fairly new but fully functional, and it is compatible with Spark 2.2.0. It will work with Hive and HDFS for your purposes, and much more!

It is registered as a **PyPI** package and also as a Spark Package.

Please check it out here:

https://github.com/ironmussa/Optimus

Here is a short description of the framework:

Optimus is the missing library for cleaning and pre-processing data in a distributed fashion. It uses all the power of Apache Spark to do so. It implements several handy tools for data wrangling and munging that will make a data scientist's life much easier. The first obvious advantage over any other data cleaning library is that it will work on your laptop or on your big cluster, and second, it is amazingly easy to install, use, and understand.

PyPI: https://pypi.org/project/optimuspyspark/

Spark Package: https://spark-packages.org/package/ironmussa/Optimus

Master Guru

Looks cool. A full tutorial and example as an HCC article would be great.

Contributor

Hi @slachterman, I am planning to use just NiFi to pass CSV data through data cleansing. The cleansing involves filling in missing timestamps, filling in missing rows of data, correcting corrupt timestamps, etc. Will NiFi alone be enough to fill in missing data as well?


Yes, use UpdateAttribute and the expression language to add missing values as appropriate.
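And if some of the heavier work (repairing corrupt timestamps or filling gaps across rows) ends up in a downstream Spark job, as suggested earlier in the thread, a minimal PySpark sketch might look like this (the file path, column names, and timestamp format are all assumptions):

```python
# Sketch only: the path, columns (sensor_id, ts, value), and timestamp format are assumptions.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("timestamp-repair-and-fill").getOrCreate()

raw = spark.read.csv("hdfs:///data/land/readings.csv", header=True, inferSchema=True)

# Corrupt timestamps fail to parse and become NULL, making them easy to drop or flag.
parsed = raw.withColumn("ts", F.to_timestamp("ts", "yyyy-MM-dd HH:mm:ss"))

# Forward-fill missing values with the last known good reading per sensor.
w = Window.partitionBy("sensor_id").orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
filled = (parsed
          .filter(F.col("ts").isNotNull())
          .withColumn("value", F.last("value", ignorenulls=True).over(w)))

filled.write.mode("overwrite").orc("hdfs:///data/presentation/readings/")
```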