Created 06-27-2016 10:58 AM
I am using Sqoop to import data into HDFS and Hive and have my job scheduled using Oozie.
How do I introduce a data cleaning layer in the system before storing this data in Hive or HDFS? What tools in the Hadoop ecosystem are available/suitable for this purpose?
Created 06-27-2016 01:56 PM
@Simran Kaur for this kind of use case, one often designs and leverages a tiered data architecture within HDFS. Namely, the rawest data from the source system would be landed in HDFS in a Land tier, with little to no transformation. On the other side of the spectrum, a Presentation tier would often contain objects that have gone through a data transformation pipeline and are exposed to applications (such as BI tools).
In Land, one would often serialize data as text (for simplicity of ingest and since these objects are not read directly very often), whereas in Presentation, the data would be stored in a Hive table, often serialized as ORC to drive read performance.
As far as data cleaning and ETL, these transformations would often be implemented within Pig, Hive, or Spark (noting that Pig and Hive would be generating MapReduce code under the covers). There are many commercial solutions in the market for Data Quality/ETL as well.
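As a concrete illustration, here is a minimal PySpark sketch of a Land-to-Presentation cleaning step; the HDFS path, column names, and Hive table name are assumptions for illustration only, not a prescribed design:

```python
# Minimal sketch of a Land -> Presentation cleaning step in PySpark (Spark 2.x).
# The path, columns, and table name below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("land-to-presentation-cleaning")
         .enableHiveSupport()
         .getOrCreate())

# Read the raw, text-serialized data from the Land tier.
raw = (spark.read
       .option("header", "true")
       .csv("/data/land/orders/"))

# Basic cleansing: trim strings, normalize types, drop rows missing the key.
clean = (raw
         .withColumn("customer_id", F.trim(F.col("customer_id")))
         .withColumn("amount", F.col("amount").cast("double"))
         .filter(F.col("order_id").isNotNull())
         .dropDuplicates(["order_id"]))

# Write into the Presentation tier as an ORC-backed Hive table.
clean.write.format("orc").mode("overwrite").saveAsTable("presentation.orders")
```

Either a Spark job like this or an equivalent Hive/Pig script can be added as another action in your existing Oozie workflow, right after the Sqoop import.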
Please let us know if you have further specific questions.
Created 06-28-2016 06:13 AM
Hi, thank you for your response. How exactly do I use Hive for data cleaning purposes? Could you please give an example? I have not really touched Spark so far, so do you think Spark is the way to go just for introducing the data cleaning layer in the system? I have heard all the good stuff about Spark but would like to know whether this would be a suitable use case for it.
Created 06-27-2016 02:47 PM
NiFi would be great for this: have it do some basic cleaning on ingest into HDFS, and do further cleaning with a Spark or Pig job.
Created 06-28-2016 01:39 PM
@Simran Kaur an example of using Hive for data cleansing is in this article (see section 3.5 in particular).
Regarding Spark, it is used widely for extract, transform, and load (ETL) logic and is usually well suited to these kinds of use cases. Both MapReduce and Spark are very general computation paradigms. It would be helpful to know what data cleaning transformations you have in mind; a sketch of a few common ones is below.
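For instance, a few common cleansing transformations might look like the following in PySpark; the column names, formats, and rules are hypothetical stand-ins for whatever cleaning logic your data actually needs:

```python
# Hedged sketch of typical cleansing transformations in PySpark (Spark 2.2+).
# Column names, formats, and rules below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleansing-examples").getOrCreate()

df = spark.read.option("header", "true").csv("/data/land/events/")

cleaned = (df
    # Standardize free-text fields.
    .withColumn("country", F.upper(F.trim(F.col("country"))))
    # Parse string timestamps; values that fail to parse become null.
    .withColumn("event_ts", F.to_timestamp("event_ts", "yyyy-MM-dd HH:mm:ss"))
    # Cast numeric fields and default missing values.
    .withColumn("quantity", F.col("quantity").cast("int"))
    .fillna({"quantity": 0})
    # Drop records whose timestamp is missing or corrupt.
    .filter(F.col("event_ts").isNotNull())
    # Remove exact duplicate rows.
    .dropDuplicates())

cleaned.write.mode("overwrite").format("orc").save("/data/presentation/events/")
```

In HiveQL, these would correspond to built-in functions such as trim(), regexp_replace(), and CASE expressions in a SELECT that populates a cleansed table.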
Created 09-08-2017 08:12 PM
Hi everyone! I'm very happy to announce that there is now a data cleansing framework that connects directly to Apache Spark and uses it to do the data cleaning. It is called Optimus, and it is fairly new but fully functional. It is also compatible with Spark 2.2.0. It will work with Hive and HDFS for your purposes, and much more!
It is registered as a **PyPI** package and also as a Spark Package.
Please check it out here:
https://github.com/ironmussa/Optimus
Here is a short description of the framework:
Optimus is the missing library for cleaning and pre-processing data in a distributed fashion. It uses all the power of Apache Spark to do so. It implements several handy tools for data wrangling and munging that will make a data scientist's life much easier. The first obvious advantage over any other data cleaning library is that it will work on your laptop or your big cluster, and the second is that it is amazingly easy to install, use, and understand.
PyPI: https://pypi.org/project/optimuspyspark/
Spark Package: https://spark-packages.org/package/ironmussa/Optimus
Created 09-08-2017 09:15 PM
Looks cool. A full tutorial and example as an HCC article would be great.
Created 01-10-2018 06:46 AM
Hi @slachterman, I am planning to use just NiFi to pass CSV data through for data cleansing. The cleansing involves filling in missing timestamps, filling in missing rows of data, correcting corrupt timestamps, etc. Will NiFi alone be enough to fill in missing data as well?
Created 01-11-2018 05:54 AM
Yes, use UpdateAttribute and the NiFi Expression Language to add missing values as appropriate.