Created on 08-30-2018 10:32 AM - edited 09-16-2022 06:39 AM
Dear All,
I am new to working on Big Data projects, although I have some Big Data and Hadoop knowledge. I am going to work on a Data Lake project in the coming week. We will be using one of the AWS services for cloud storage and Redshift as the database.
If anyone has already worked on such a project or has thorough knowledge of building a Data Lake, could you please point me to resources or links, or guide me step by step through building one?
Thanks in advance!!
Kind Regards, Manish
Created 09-01-2018 01:18 AM
A Data Lake is not tied to a platform or technology. Hadoop is not a requirement for a data lake either.
IMO, a "data lake project" should not be a project description or the end goal; you can say you got your data from "source X", using "code Y", transformed and analyzed using "framework Z", but the combinations of tools out in the market that support such statements are so broad and vague that it really depends on what business use cases you are trying to solve.
For example, S3 is replaceable with HDFS or GCS or Azure Storage. Redshift is replaceable with Postgres (and you really should use Athena anyway if the data you want to query is in S3, where Athena is replaceable by PrestoDB), and those can be compared to Google BigQuery.
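To illustrate the point about interchangeable storage, here is a minimal sketch, not a recommended design. It assumes a hypothetical `ObjectStore` interface that the pipeline code is written against, so the backend can be swapped without touching the pipeline. `ObjectStore`, `InMemoryStore`, `LocalDirStore`, and `ingest_and_count` are names made up for this example; a real S3, HDFS, GCS, or Azure Storage backend would need its own client library behind the same two methods.

```python
import tempfile
from pathlib import Path
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface; any storage backend could sit behind it."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend that keeps objects in a dict (useful for tests)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class LocalDirStore:
    """Stand-in backend that maps keys to files under a root directory."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


def ingest_and_count(store: ObjectStore, key: str, raw_csv: str) -> int:
    """Land raw CSV text in the store, read it back, and count data rows."""
    store.put(key, raw_csv.encode("utf-8"))
    lines = store.get(key).decode("utf-8").splitlines()
    return max(len(lines) - 1, 0)  # exclude the header line


SAMPLE = "id,name\n1,a\n2,b\n"

# The same pipeline function runs unchanged against either backend.
memory_rows = ingest_and_count(InMemoryStore(), "raw/users.csv", SAMPLE)
with tempfile.TemporaryDirectory() as tmp:
    local_rows = ingest_and_count(LocalDirStore(tmp), "raw/users.csv", SAMPLE)
```

The pipeline function only knows about `put` and `get`, which is the sense in which the storage layer is replaceable; the choice of backend then comes down to the business use case rather than the code.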
My suggestion would be not to tie yourself to a certain toolset, but if you are in AWS, their own documentation pages are very extensive. Since you are not asking a Hortonworks-specific question, I'm not sure what information you are looking for from this site.
Created 09-04-2018 03:12 AM
Thanks, Jordan, for your input. I just wanted to know whether there is any documentation available from Hortonworks or any other source that I can go through to understand how to implement a Data Lake properly.
Created 09-04-2018 06:58 PM
@Manish Tiwari, perhaps you can look at https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.7.1/content/data-lake/index.html
Otherwise, you can search https://docs.hortonworks.com/ for the keywords you are looking for