09-08-2017
08:12 PM
Hi everyone! I'm very happy to announce a data cleansing framework that connects directly to Apache Spark and uses Spark itself to do the data cleaning. It is called Optimus. It is fairly new but fully functional, and it is compatible with Spark 2.2.0. It works with Hive and HDFS, and much more! It is registered as a PyPI package and also as a Spark Package.

Please check it out here: https://github.com/ironmussa/Optimus

Here is a short description of the framework:

Optimus is the missing library for cleaning and pre-processing data in a distributed fashion. It uses all the power of Apache Spark to do so. It implements several handy tools for data wrangling and munging that will make a data scientist's life much easier. The first obvious advantage over other data cleaning libraries is that it works on your laptop or on your big cluster. The second is that it is amazingly easy to install, use, and understand.

PyPI: https://pypi.org/project/optimuspyspark/
Spark Package: https://spark-packages.org/package/ironmussa/Optimus