Any Tool/python library available for data profiling?
Labels: Apache Hive
Created on 09-09-2016 08:21 AM - edited 09-16-2022 03:38 AM
Hi,
I want to profile data that is stored in SQL-type databases (like Hive or MySQL).
Does anyone know which tool is suitable for this?
If the tool comes from the Hortonworks community, that would be best for me.
I have already tried the pyxplorer Python library, but it didn't work for me because of an installation problem.
Please send me a link to a suitable tool. Everywhere I look I see Talend recommended for data profiling, but does Talend cover all the data-quality and data-profiling features I need (counting the number of tables, number of columns, distinct column values, min/max column values, etc.)?
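To make it concrete, these are roughly the checks I have in mind. The sketch below uses the pyhive library (just an assumption on my side; I have not verified it in my environment), and the host, database, table, and column names are placeholders:

```python
# Rough sketch of the profiling checks I mean, using the pyhive library.
# Host, database, table and column names below are placeholders.
from pyhive import hive

conn = hive.Connection(host="hive-host", port=10000, database="default")
cur = conn.cursor()

# Number of tables in the database.
cur.execute("SHOW TABLES")
print("tables: %d" % len(cur.fetchall()))

# Number of columns in one table.
cur.execute("DESCRIBE customers")
print("columns: %d" % len(cur.fetchall()))

# Distinct count and min/max for a single column.
cur.execute("SELECT COUNT(DISTINCT age), MIN(age), MAX(age) FROM customers")
row = cur.fetchone()
print("age -> distinct: %s, min: %s, max: %s" % (row[0], row[1], row[2]))

cur.close()
conn.close()
```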
Created 09-09-2016 01:14 PM
For profiling data off Hadoop, see https://community.hortonworks.com/questions/35396/data-quality-analysis.html
For profiling data on Hadoop, the best solution for you would be:
- Zeppelin as your client/UI
- Spark (inside Zeppelin) as your profiling toolset
Both Zeppelin and Spark are extremely powerful tools for interacting with data, and both are packaged in HDP. Zeppelin is a browser-based notebook UI (like IPython/Jupyter) that excels at interacting with and exploring data. Spark, of course, is an in-memory data-analysis engine and is lightning fast. Both are key pieces in the future of Big Data analysis.
BTW, you can use Python or Scala in Spark, including integrating external libraries.
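For example, a minimal profiling sketch in a Zeppelin %pyspark paragraph could look like the following. This is just an illustration, not an official example: the database my_db, the table my_db.customers, and the column names are placeholders, and it assumes the sqlContext variable that Zeppelin provides out of the box.

```python
# %pyspark -- minimal profiling sketch for a Zeppelin notebook paragraph.
# Database, table and column names are placeholders; sqlContext is the
# context Zeppelin creates for you.
tables = sqlContext.sql("SHOW TABLES IN my_db").collect()
print("number of tables: %d" % len(tables))

df = sqlContext.table("my_db.customers")
print("number of columns: %d" % len(df.columns))
print("number of rows: %d" % df.count())

# Per-column profile: distinct count, min and max.
for col in df.columns:
    stats = df.selectExpr(
        "count(distinct `%s`) as distinct_vals" % col,
        "min(`%s`) as min_val" % col,
        "max(`%s`) as max_val" % col,
    ).first()
    print("%s: distinct=%s min=%s max=%s"
          % (col, stats.distinct_vals, stats.min_val, stats.max_val))

# describe() gives count/mean/stddev/min/max for numeric columns in one shot.
df.describe().show()
```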
See the following links to get started:
http://hortonworks.com/apache/zeppelin/
http://www.social-3.com/solutions/personal_data_profiling.php
