
What are the resources and technologies required to create a Data Lake?

New Contributor

Dear All,

I am new to working with Big Data but have Big Data and Hadoop knowledge. I am going to work on a Data Lake project in the coming week. We will be using one of the AWS services for cloud storage and Redshift as the database.

I would like to know whether anyone has already worked on such a project or has thorough knowledge of building a Data Lake. Could you please point me to resources or links, or guide me step by step through building one?

Thanks in advance!!

Kind Regards, Manish

Accepted Solutions
Re: What are the resources and technologies required to create Data Lake?

Super Collaborator

A Data Lake is not tied to a platform or technology. Hadoop is not a requirement for a data lake either.

IMO, a "data lake project" should not be the project description or the end goal. You can say you got your data from "source X" using "code Y" and transformed and analyzed it with "framework Z", but the range of tools on the market that fit that description is so broad that the right choice really depends on the business use cases you are trying to solve.

For example, S3 is replaceable with HDFS or GCS or Azure Storage. Redshift is replaceable with Postgres (and you really should use Athena anyway if the data you want to query is in S3, where Athena is replaceable by PrestoDB), and those can be compared to Google BigQuery.
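
To make the "replaceable storage" point concrete, here is a minimal sketch of coding against a small storage interface instead of a specific service. Everything in it is an assumption for illustration: `ObjectStore`, `InMemoryStore`, `LocalDirStore`, and `ingest` are hypothetical names, not part of any AWS, Hadoop, or Hortonworks API. A real S3, HDFS, or GCS backend would wrap that service's own client library behind the same two methods.

```python
from pathlib import Path
from typing import Dict, List, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal storage interface; not from any real library."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend that keeps objects in a dict (handy for tests)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class LocalDirStore:
    """Stand-in backend that maps keys to files under a root directory."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def put(self, key: str, data: bytes) -> None:
        path = self._root / key
        # Create intermediate "folders" implied by the key, e.g. raw/2024/.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


def ingest(store: ObjectStore, key: str, rows: List[str]) -> int:
    """Write rows as newline-delimited text and return the byte count.

    The function only depends on the ObjectStore interface, so the
    backend can be swapped without touching the ingestion logic.
    """
    payload = "\n".join(rows).encode("utf-8")
    store.put(key, payload)
    return len(payload)
```

The same `ingest` call works with either backend, for example `ingest(InMemoryStore(), "raw/events.txt", ["a,1", "b,2"])`, which is the sense in which the storage layer is interchangeable.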

My suggestion would be not to tie yourself to a certain toolset, but if you are in AWS, their own documentation pages are very extensive. Since you are not asking a Hortonworks-specific question, I'm not sure what information you are looking for from this site.


Re: What are the resources and technologies required to create Data Lake?

New Contributor

Thanks, Jordan, for your input. I just wanted to know whether there is any documentation available from Hortonworks or any other source that I can go through to understand Data Lakes and implement one properly.

Re: What are the resources and technologies required to create Data Lake?

Super Collaborator