Created 06-14-2016 03:54 PM
I am looking to use Hadoop as a new way to store data for my company. Before I can delve into the details of this project, there are several questions I need answered. For example, we currently have a Microsoft database; is it possible to implement a data lake and create a hybrid solution with Hadoop, or would it be easier to just start over with Hadoop? Second, what tools are available for security? How do we make sure data ingestion is secure and the data is high quality? Finally, what are some tools for data analysis and data mining using Hadoop?
Sorry if these questions are general; I am pretty new to this software. Any answers are greatly appreciated, and any links or literature on this topic would be helpful as well.
Created 06-14-2016 04:57 PM
@Alpha3645 There are a number of ways in which Hadoop can help your company, and a data lake is a great way to start using Hadoop. By "Microsoft database", do you mean Microsoft SQL Server? If so, one possibility is to use Sqoop to move databases and tables into your data lake and run it in conjunction with your SQL Server instance. You can then use Hive for SQL queries in the data lake; there is no need to replace everything with Hadoop at first. Regarding security, HDP ships with both Apache Knox (perimeter security and API access) and Apache Ranger (fine-grained user access), and in many cases these two will meet organizational security requirements with nothing else needed.
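To give a feel for what the Sqoop route looks like, here is a minimal sketch of a one-time import from SQL Server into Hive. The hostname, database, table, and credential paths are placeholders for your own environment, and the exact flags may vary by Sqoop version (this follows the Sqoop 1.4.x user guide linked below):

```shell
# Sketch: import a SQL Server table into HDFS and register it in Hive.
# All connection details below are hypothetical placeholders.
sqoop import \
  --connect "jdbc:sqlserver://sqlserver.example.com:1433;databaseName=SalesDB" \
  --username etl_user \
  --password-file /user/etl_user/.sqoop.password \
  --table customers \
  --hive-import \
  --hive-table sales.customers \
  --num-mappers 4
```

Using `--password-file` (an HDFS file readable only by the ETL user) rather than `--password` on the command line keeps credentials out of shell history and process listings. You can also schedule incremental imports (`--incremental append` or `--incremental lastmodified`) so the data lake stays in sync rather than doing a single bulk copy.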
Regarding data quality, a number of commercial tools are available, such as Talend, Informatica, and Trifacta. The Hadoop ecosystem also includes a number of tools for analysis and data manipulation, such as Hive, HBase, Pig, and Zeppelin. Here are a few links to get you started:
Sqoop - https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
Hadoop Security - http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_Security_Guide/content/ch_hdp-security-gu...
Talend - https://www.talend.com/resource/data-quality-tools.html
Trifacta - https://www.trifacta.com/
Informatica - https://www.informatica.com/products/data-quality.html
Apache Zeppelin - https://zeppelin.apache.org/
Apache Hive - https://hive.apache.org/
Apache HBase - https://hbase.apache.org/
Apache Pig - https://pig.apache.org/
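Once a table has landed in the data lake, querying it with Hive is ordinary SQL. A quick sketch using `beeline` (the standard HiveServer2 CLI); the JDBC URL and the `sales.customers` table are placeholder names, assuming something like the Sqoop import described above:

```shell
# Sketch: run an aggregate query against an imported Hive table via beeline.
# The HiveServer2 host/port and table name are hypothetical.
beeline -u "jdbc:hive2://hiveserver.example.com:10000/default" \
  -e "SELECT region, COUNT(*) AS customer_count
      FROM sales.customers
      GROUP BY region
      ORDER BY customer_count DESC
      LIMIT 10;"
```

The same query can be run interactively from a Zeppelin notebook with the Hive/JDBC interpreter, which is a friendlier starting point for exploratory analysis.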