
Data cleaning before storing in HDFS

Expert Contributor

I am using Sqoop to import data into HDFS and Hive, and the job is scheduled with Oozie.

How do I introduce a data cleaning layer into the system before this data is stored in Hive or HDFS? What tools in the Hadoop ecosystem are available and suitable for the purpose?

8 REPLIES


Expert Contributor
@slachterman

Hi, thank you for your response. How exactly do I use Hive for data cleaning? Could you please give an example? I have not really touched Spark so far, so do you think Spark is the way to go just for introducing the data cleaning layer into the system? I have heard all the good things about Spark, but I would like to know whether this would be a suitable use case for it.

Master Guru

NiFi would be great for this: have it do some basic cleaning on ingest into HDFS, then do further cleaning with a Spark or Pig job.
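For the "further cleaning with a Spark job" part, a minimal PySpark sketch could look like the following; the paths, column names, and cleaning rules here are hypothetical placeholders, not anything from this thread:

```python
# clean_ingest.py -- minimal PySpark cleaning sketch (paths/columns are hypothetical)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hdfs-cleaning").getOrCreate()

# Read the raw files that NiFi (or Sqoop) landed in HDFS.
raw = spark.read.option("header", "true").csv("hdfs:///data/raw/orders/")

cleaned = (
    raw
    # Drop rows whose key fields are missing.
    .dropna(subset=["order_id", "order_ts"])
    # Normalize whitespace in string columns.
    .withColumn("customer_name", F.trim(F.col("customer_name")))
    # Parse the timestamp; rows that fail to parse become NULL and are filtered out.
    .withColumn("order_ts", F.to_timestamp("order_ts", "yyyy-MM-dd HH:mm:ss"))
    .filter(F.col("order_ts").isNotNull())
    # De-duplicate on the business key.
    .dropDuplicates(["order_id"])
)

# Write the cleaned layer back to HDFS (or into a Hive table).
cleaned.write.mode("overwrite").parquet("hdfs:///data/clean/orders/")
```

An Oozie Spark action can then run a step like this between the Sqoop import and the final Hive load.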


@Simran Kaur, an example of using Hive for data cleansing is in this article (see section 3.5 in particular).

Regarding Spark, it is widely used for extract, transform, and load (ETL) logic and is usually well suited to those kinds of use cases. Both MapReduce and Spark are very general computation paradigms. It would help to know what data cleaning transformations you have in mind.
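To make the Hive approach concrete, here is a hedged sketch run from PySpark so it can sit in the same Oozie workflow; the staging and target table names and columns are made up for illustration:

```python
from pyspark.sql import SparkSession

# Hive support lets Spark read and write the same metastore tables Sqoop loads.
spark = SparkSession.builder.appName("hive-cleansing").enableHiveSupport().getOrCreate()

# Hypothetical tables: Sqoop imports into staging.orders_raw; cleaned rows go to mart.orders.
spark.sql("""
    INSERT OVERWRITE TABLE mart.orders
    SELECT
        order_id,
        TRIM(customer_name)                            AS customer_name,
        CAST(order_ts AS TIMESTAMP)                    AS order_ts,  -- invalid values become NULL
        CASE WHEN amount < 0 THEN NULL ELSE amount END AS amount     -- map sentinel values to NULL
    FROM staging.orders_raw
    WHERE order_id IS NOT NULL
""")
```

The same `INSERT ... SELECT` can equally be run as a plain Hive script from an Oozie Hive action; Spark is not required for this style of cleansing.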

New Contributor

Hi everyone! I'm very happy to announce that there is now a data cleansing framework that connects directly to Apache Spark and uses Spark itself to do the data cleaning. It is called Optimus; it is fairly new but fully functional, and it is compatible with Spark 2.2.0. It will work with Hive and HDFS for your purposes, and much more!

It is registered as a **PyPI** package and also as a Spark Package.

Please check it out here:

https://github.com/ironmussa/Optimus

Here is a short description of the framework:

Optimus is the missing library for cleaning and pre-processing data in a distributed fashion. It uses all the power of Apache Spark to do so. It implements several handy tools for data wrangling and munging that will make a data scientist's life much easier. The first obvious advantage over any other data cleaning library is that it will work on your laptop or on your big cluster, and the second is that it is amazingly easy to install, use, and understand.

PyPI: https://pypi.org/project/optimuspyspark/

Spark Package: https://spark-packages.org/package/ironmussa/Optimus
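If anyone wants a quick feel for it, the sketch below follows the early (Spark 2.2-era) Optimus README from memory, so treat the class and method names as assumptions and check them against the repository before use:

```python
# Hedged sketch of Optimus 1.x usage; names follow the early README and may have changed.
import optimus as op

# Utilities wraps SparkSession creation and data loading (per the 1.x docs).
tools = op.Utilities()
df = tools.read_csv("hdfs:///data/raw/orders.csv")  # hypothetical path

# DataFrameTransformer bundles the column-cleaning operations.
transformer = op.DataFrameTransformer(df)
transformer.trim_col("*")                      # strip surrounding whitespace in all columns
transformer.remove_special_chars(columns="*")  # drop punctuation/special characters
transformer.clear_accents(columns="*")         # normalize accented characters

# The attribute name below is an assumption: the 1.x API exposed the underlying
# (now cleaned) Spark DataFrame, which can be written to Hive/HDFS as usual.
transformer.df.write.mode("overwrite").saveAsTable("mart.orders_clean")
```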

Master Guru

Looks cool. A full tutorial and example as an HCC article would be great.

Contributor

Hi @slachterman, I am planning to use just NiFi to pass CSV data through for data cleansing. The cleansing involves filling in missing timestamps and missing rows of data, correcting corrupt timestamps, etc. Will NiFi alone be enough to fill in the missing data as well?


Yes, use UpdateAttribute and the expression language to add missing values as appropriate.
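As a concrete illustration (the attribute name is hypothetical), an UpdateAttribute property named `timestamp` could use the expression language's `isEmpty`/`ifElse` and `now`/`format` functions to substitute the current time when the attribute is blank:

```
${timestamp:isEmpty():ifElse(${now():format('yyyy-MM-dd HH:mm:ss')}, ${timestamp})}
```

Keep in mind that UpdateAttribute works on flow file attributes; if the gaps are inside the flow file content itself (e.g., missing fields in CSV records), a record-oriented processor such as UpdateRecord is the natural place to apply the same idea per record.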