Support Questions

What are the resources and technologies required to create Data Lake?

New Contributor

Dear All,

I am new to working with Big Data, although I have some Big Data and Hadoop knowledge. I will start on a Data Lake project in the coming week. We will be using one of the AWS services for cloud storage and Redshift as the database.

If anyone has already worked on such a project or has thorough knowledge of building a Data Lake, could you please point me to resources or links, or guide me step by step through building one?

Thanks in advance!!

Kind Regards, Manish

1 ACCEPTED SOLUTION

Super Collaborator

A Data Lake is not tied to any platform or technology, and Hadoop is not a requirement for one either.

IMO, a "data lake project" should not be the project description or the end goal. You can say you got your data from "source X" using "code Y", then transformed and analyzed it using "framework Z", but the range of tools on the market that fit those slots is so broad that the right choice really depends on what business use cases you are trying to solve.

For example, S3 is replaceable with HDFS, GCS, or Azure Storage. Redshift is replaceable with Postgres (and you really should use Athena anyway if the data you want to query is in S3; Athena in turn is replaceable by PrestoDB), and all of those can be compared to Google BigQuery.
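To make the "replaceable" point concrete, here is a minimal Python sketch of ingestion code written against a small storage interface rather than a specific service. Everything in it is an assumption for illustration: `ObjectStore`, `InMemoryStore`, `ingest`, and the `raw/<source>/` key layout are hypothetical names, not part of any AWS or Hadoop API. A real backend would wrap S3, HDFS, GCS, or Azure Storage behind the same two methods.

```python
from typing import Dict, List, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal storage interface, for illustration only."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend; a real one would wrap S3, HDFS, GCS, or Azure Storage."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def ingest(store: ObjectStore, source_name: str, records: List[str]) -> str:
    """Land raw records under a per-source prefix and return the object key."""
    # The key layout below is an assumed convention, not a standard.
    key = f"raw/{source_name}/part-0000.txt"
    store.put(key, "\n".join(records).encode("utf-8"))
    return key


store = InMemoryStore()
key = ingest(store, "source_x", ["a,1", "b,2"])
print(key, store.get(key).decode("utf-8").splitlines())
```

Because `ingest` only depends on `put` and `get`, swapping the storage layer means writing one new class, not rewriting the pipeline. The same idea applies to the query side (Redshift, Postgres, Athena, PrestoDB), although SQL dialect differences make that swap less clean in practice.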

My suggestion would be not to tie yourself to a particular toolset, but if you are on AWS, their own documentation pages are very extensive. Since you are not asking a Hortonworks-specific question, I'm not sure what information you are looking for from this site.

3 REPLIES

New Contributor

Thanks, Jordan, for your input. I just wanted to know whether there is any documentation, from Hortonworks or any other source, that I can go through to understand Data Lakes and implement one properly.
