08-29-2019 07:51 AM
Hi CK,

Here are some tips to help answer your question.

First, use cloud storage services, like S3 on AWS or ADLS on Azure, for keeping data like Navigator lineage information. Those services provide availability and reliability automatically. Hadoop and other services can be configured to use cloud storage in various ways instead of local block (hard drive) storage.

Sometimes it's not efficient or performant enough to use cloud storage exclusively. For example, a typical data analysis job may have several stages where data is read, processed, and then written back, and those round trips to storage services can be slower and cost more money than local drive access. So, think about adjusting how data is managed, so that intermediate data resides in local block storage but final results are sent to cloud storage for safekeeping (see the sketch at the end of this post).

Once all of the important data is safe in cloud storage, it becomes less important to keep cluster instances running. You can even destroy entire clusters, including master/manager nodes, knowing that the data is safe in cloud storage. At this point, you will want to use automation tools, like Altus Director or Cloudbreak, so that you can easily spin up new clusters that are configured to pull their initial data from cloud storage. Then, you only run clusters when you need them.

If that isn't feasible, you can still do something like what you suggest, with clusters that have some permanent nodes and some transient ones. If so, ensure that those transient nodes do not keep important state that isn't safe elsewhere. For example, YARN node managers are stateless, so scaling nodes that only house those ("worker" nodes) is an easy goal to achieve. By contrast, HDFS datanodes store file data, so those aren't as easy to scale down. You can, though, as long as they are decommissioned properly using Cloudera Manager or Ambari, which the cloud automation tools handle for you.
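Here is a minimal PySpark sketch of the intermediate-local / final-to-cloud pattern described above. The bucket name, paths, and column names are all hypothetical, and the exact staging layout would depend on your cluster; treat this as an illustration, not a prescribed recipe.

```python
# Hypothetical sketch: keep intermediate data on cluster-local HDFS,
# persist only final results to S3. Bucket, paths, and schema are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-intermediate-s3-final").getOrCreate()

# Initial input is read straight off cloud storage via the s3a connector.
events = spark.read.parquet("s3a://example-bucket/raw/events/")

# Stage 1: an expensive shuffle; write the intermediate result to local
# HDFS so later stages (and retries) avoid extra S3 round trips.
sessions = events.groupBy("user_id", "session_id").count()
sessions.write.mode("overwrite").parquet("hdfs:///tmp/intermediate/sessions/")

# Stage 2: read the intermediate data back from fast local block storage.
sessions = spark.read.parquet("hdfs:///tmp/intermediate/sessions/")
totals = sessions.groupBy("user_id").sum("count")

# Final results go to S3 for safekeeping; after this, the cluster (and its
# HDFS) holds nothing irreplaceable and can be torn down.
totals.write.mode("overwrite").parquet("s3a://example-bucket/results/totals/")
```

Keeping the shuffle-heavy intermediate result on HDFS avoids repeated round trips to the storage service, while the final s3a:// write is what makes the cluster itself disposable.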
07-09-2019 12:53 AM
Yes, that is correct, and the motivation and steps to use it are covered here too: https://www.cloudera.com/documentation/enterprise/6/latest/topics/cm_s3guard.html

Note: on your point of 'load data from S3 into HDFS', it is better stated as simply 'read data from S3', with HDFS used as transient storage where and when required. There does not need to be a 'download X GiB of data from S3 to HDFS first, only then begin jobs' step, as distributed jobs can read off of S3 via s3a:// URLs in the same way they read from HDFS via hdfs:// URLs.
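To illustrate, here is a minimal PySpark sketch with a made-up bucket, path, and column name; the only difference between the HDFS read and the S3 read is the URL scheme. (With S3Guard enabled per the linked doc, fs.s3a.metadatastore.impl points at the DynamoDB metadata store so that S3 directory listings stay consistent.)

```python
# Minimal sketch: distributed reads work the same way against HDFS and S3;
# only the URL scheme differs. Bucket, path, and column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-direct-read").getOrCreate()

# Reading from HDFS...
df_hdfs = spark.read.parquet("hdfs:///data/events/")

# ...and reading directly off S3, with no staging copy into HDFS first.
df_s3 = spark.read.parquet("s3a://example-bucket/data/events/")

# Downstream processing is identical either way.
df_s3.groupBy("event_type").count().show()
```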
07-01-2019 03:34 PM
Hi Bill,

Thanks a lot for the long explanation. So two options exist (please correct me if I am wrong):

1) Use persistent master nodes and expand with extra temporary workers for on-demand workloads.
2) Alternatively, have components (Hive, Navigator, etc.) save their metadata to S3 or RDS so that the whole cluster can be torn down and then created from scratch repeatedly. (That is the better option for cost savings.)

Many thanks,
Cengiz
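As a rough sketch of option 2, the Hive metastore can be pointed at an external RDS database through the standard javax.jdo connection properties, so that no cluster node holds irreplaceable metadata. The JDBC endpoint and credentials below are placeholders; in a managed deployment these settings would normally live in hive-site.xml (set via Cloudera Manager) rather than in job code.

```python
# Hypothetical sketch: run Spark against a Hive metastore whose backing
# database is an external RDS instance. Endpoint and credentials are fake.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("external-metastore")
    # Standard Hive metastore JDBC settings, passed through to Hive via the
    # spark.hadoop.* prefix:
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:mysql://example-rds.us-east-1.rds.amazonaws.com:3306/hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
            "com.mysql.jdbc.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "********")
    .enableHiveSupport()
    .getOrCreate()
)
```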
05-20-2019 04:18 AM
Thanks for this. I think we can summarize it as follows:

* If only external Hive tables are used to process S3 data, the technical issues regarding consistency and scalable metadata handling would be resolved (see the sketch below).
* If external and internal Hive tables are used in combination to process S3 data, the technical issues regarding consistency, scalable metadata handling, and data locality would be resolved.
* If Spark alone is used on top of S3, the technical issues regarding consistency and scalable metadata handling would be resolved through in-memory processing, as Spark keeps transient data in memory and only reads the initial data from S3 and writes back the result.
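For the external-table case in the first bullet, here is a minimal sketch; the table name, schema, and bucket are hypothetical, and it assumes a Spark build with Hive support and a configured s3a connector.

```python
# Hedged sketch: an external Hive table whose data lives entirely in S3.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("external-table-on-s3")
    .enableHiveSupport()
    .getOrCreate()
)

# Because the table is EXTERNAL, dropping it later removes only the
# metastore entry; the Parquet files in S3 are left untouched.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        user_id STRING,
        event_time TIMESTAMP,
        payload STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://example-bucket/warehouse/events/'
""")

spark.sql("SELECT COUNT(*) FROM events").show()
```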