About Favio_vazquezp

Favio_vazquezp · ‎09-08-2017

Hi everyone! I'm very happy to announce a data cleansing framework that connects directly to Apache Spark and uses Spark itself to do the cleaning. It is called Optimus; it is fairly new but fully functional. It is compatible with Spark 2.2.0 and works with Hive and HDFS, among other things. It is registered as a **PyPI** package and also as a Spark Package. Please check it out here: https://github.com/ironmussa/Optimus

Here is a short description of the framework: Optimus is the missing library for cleaning and pre-processing data in a distributed fashion. It uses all the power of Apache Spark to do so. It implements several handy tools for data wrangling and munging that will make a data scientist's life much easier. The first obvious advantage over any other data cleaning library is that it will work on your laptop or your big cluster, and the second is that it is amazingly easy to install, use and understand.

PyPI: https://pypi.org/project/optimuspyspark/
Spark Package: https://spark-packages.org/package/ironmussa/Optimus

Last Visited	‎09-08-2017 08:12 PM

Member Since	‎09-08-2017 05:19 PM
Posts	1
Kudos received	1

Cloudera Community

Re: Data cleaning before storing in HDFS