Looking for an intro to Hadoop

New Contributor

I am looking to use Hadoop as a new way to store data for my company. Before I can delve into the details of this project, there are several questions I need answered. First, we currently have a Microsoft database. Is it possible to implement a data lake and create a hybrid solution with Hadoop, or would it be easier to just start over with Hadoop? Second, what tools are available for security? How do we make sure that data ingestion is secure and that the data is high quality? Finally, what are some tools for data analysis and data mining using Hadoop?

Sorry if these questions are general; I am pretty new to this software. Any answers are greatly appreciated, and any links or literature on this topic would be helpful as well.

1 ACCEPTED SOLUTION

Expert Contributor

@Alpha3645 Hadoop can help your company in a number of ways, and a data lake is a great way to start using it. By "Microsoft database", do you mean Microsoft SQL Server? If so, one possibility is to use Sqoop to copy databases and tables into your data lake and run the lake in conjunction with your SQL Server instance. You can then use Hive for SQL queries in the data lake. There is no need to replace everything with Hadoop at first.

Regarding security, HDP ships with both Apache Knox (perimeter security and API access) and Apache Ranger (fine-grained user access), and in many cases these two will meet organizational security requirements with nothing else needed.
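
To make the Sqoop step mentioned above a bit more concrete, here is a minimal sketch in Python that only assembles a `sqoop import` command for copying one SQL Server table into Hive. The host name, database, table, user, and password file path are all made-up placeholders, and the helper `build_sqoop_import` is hypothetical, not part of Sqoop. It also assumes the Microsoft SQL Server JDBC driver is already available to Sqoop on your cluster.

```python
import shlex


def build_sqoop_import(host, database, table, username, password_file,
                       hive_table, port=1433, num_mappers=4):
    """Return the argument list for a `sqoop import` from SQL Server into Hive.

    Hypothetical helper for illustration; every value passed in is a placeholder.
    """
    # JDBC URL format used by the Microsoft SQL Server driver.
    jdbc_url = f"jdbc:sqlserver://{host}:{port};databaseName={database}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        # Read the password from a protected file instead of the command line.
        "--password-file", password_file,
        "--table", table,
        # Load the imported data into a Hive table so it can be queried with SQL.
        "--hive-import",
        "--hive-table", hive_table,
        # Number of parallel map tasks used for the copy.
        "--num-mappers", str(num_mappers),
    ]


# Placeholder values (assumptions); replace with your own environment.
cmd = build_sqoop_import(
    host="sqlserver.example.com",
    database="Sales",
    table="Orders",
    username="sqoop_user",
    password_file="/user/sqoop/sqlserver.password",
    hive_table="orders",
)

# Print a shell-safe command line to review before running it on a cluster node.
print(shlex.join(cmd))
```

Once a table has been imported this way, it should be queryable from Hive with ordinary SQL, while SQL Server keeps serving your existing applications.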

Regarding data quality, there are a number of commercial tools available, such as Talend, Informatica, and Trifacta. The Hadoop ecosystem also includes a number of tools for analysis and data manipulation, such as Hive, HBase, Pig, and Zeppelin. Here are a few links to get you started:

Sqoop - https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html

Hadoop Security - http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_Security_Guide/content/ch_hdp-security-gu...

Talend - https://www.talend.com/resource/data-quality-tools.html

Trifacta - https://www.trifacta.com/

Informatica - https://www.informatica.com/products/data-quality.html

Apache Zeppelin - https://zeppelin.apache.org/

Apache Hive - https://hive.apache.org/

Apache HBase - https://hbase.apache.org/

Apache Pig - https://pig.apache.org/
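
On the data quality question, here is a rough sketch in plain Python of the kind of basic checks that data quality tools automate at a much larger scale: counting missing required values and duplicate keys. The function `profile_rows`, the column names, and the sample rows are all invented for illustration and do not reflect how any of the tools listed above work internally.

```python
def profile_rows(rows, required, key):
    """Count missing required values and duplicate keys in a list of dict rows."""
    missing = {col: 0 for col in required}
    seen = set()
    duplicates = 0
    for row in rows:
        # A value counts as missing if it is absent, None, or an empty string.
        for col in required:
            if row.get(col) in (None, ""):
                missing[col] += 1
        # A key seen earlier in the batch counts as a duplicate.
        k = row.get(key)
        if k in seen:
            duplicates += 1
        seen.add(k)
    return {"rows": len(rows), "missing": missing, "duplicate_keys": duplicates}


# Invented sample data: one empty customer, one missing amount, one repeated order_id.
sample = [
    {"order_id": 1, "customer": "Acme", "amount": 120.0},
    {"order_id": 2, "customer": "", "amount": 75.5},
    {"order_id": 2, "customer": "Globex", "amount": None},
]

report = profile_rows(sample, required=["customer", "amount"], key="order_id")
print(report)
```

A check like this could run on each batch before it is loaded, so that bad records are flagged early rather than discovered later in analysis.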
