Member since: 09-24-2015
Posts: 38
Kudos Received: 41
Solutions: 6

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2313 | 10-18-2017 01:27 PM |
| | 24316 | 04-22-2016 09:23 PM |
| | 824 | 12-22-2015 02:41 PM |
| | 1271 | 12-22-2015 12:54 PM |
| | 2115 | 12-08-2015 03:44 PM |
11-01-2017
06:16 PM
Yes, you can access data in OpenStack Swift storage much like you can with S3 or WASB. It is not a replacement for HDFS, though. See this link for the docs on Swift configuration: https://hadoop.apache.org/docs/stable/hadoop-openstack/index.html
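For illustration only, here is a minimal sketch of the kind of core-site.xml configuration the linked docs describe. The service name "myprovider", the Keystone endpoint, and the credentials are placeholders, and the exact property set may differ by Hadoop version, so treat the linked documentation as authoritative.
<!-- Placeholder Swift service definition; "myprovider" is just an example service name -->
<property>
  <name>fs.swift.impl</name>
  <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
</property>
<property>
  <name>fs.swift.service.myprovider.auth.url</name>
  <value>https://keystone.example.com:5000/v2.0/tokens</value>
</property>
<property>
  <name>fs.swift.service.myprovider.tenant</name>
  <value>demo-tenant</value>
</property>
<property>
  <name>fs.swift.service.myprovider.username</name>
  <value>demo-user</value>
</property>
<property>
  <name>fs.swift.service.myprovider.password</name>
  <value>demo-password</value>
</property>
Data in a container can then be addressed with a URI such as swift://mycontainer.myprovider/path/to/data, the same way s3a:// or wasb:// paths are used.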
10-18-2017
11:17 PM
1 Kudo
@Tomomichi Hirano I think the largest concern is not the heap size itself, but how the heap affects garbage collection. The larger the heap, the more objects the collector has to scan to decide whether they can be purged, and the longer a garbage collection takes. While the GC is taking place the NameNode is paused, and all DFS operations are paused until the GC completes. If the GC takes too long a failover can occur; it's not uncommon on some clusters to have to increase the failover timeout simply because garbage collection takes longer than the timeout.
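As an illustration of that last point (a sketch only, not a tuning recommendation): ha.health-monitor.rpc-timeout.ms in core-site.xml controls how long the ZKFC health monitor waits on the NameNode before treating it as failed, so raising it gives a long GC pause room to finish. The 90-second value below is just an example.
<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <!-- Example only: 90 seconds, up from the usual 45-second default, to ride out long GC pauses -->
  <value>90000</value>
</property>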
10-18-2017
02:31 PM
That is correct @Rajesh Reddy. Think of HDFS as a performance layer for workloads to do the in-between work, and S3 as the place for datasets to live long term. You can then reduce the storage on your datanodes because it is only used for intermediate processing.
10-18-2017
02:21 PM
No, to be clear, you cannot do what you're describing: the defaultFS cannot be replaced by the S3 connector in the way you're attempting. You still need a cluster with HDFS deployed, even if it is a much smaller volume of HDFS space, and your jobs can then target the larger datasets you have stored on S3. HDFS is still required for all the reasons above, and, as the documentation states, the connectors cannot be used as a replacement for HDFS. @Rajesh Reddy Even if no data is stored in HDFS and everything is stored in S3, you will still require a defaultFS for the API layer and for how processing engines work today. Drop-in replacements are block storage, not object storage like S3, and include products such as Isilon and Spectrum Scale.
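A minimal sketch of what that split looks like (the nameservice "mycluster" and bucket "my-bucket" are placeholder names): fs.defaultFS in core-site.xml keeps pointing at HDFS, and only the dataset paths a job reads or writes point at S3.
<property>
  <name>fs.defaultFS</name>
  <!-- Must remain HDFS: engines rely on it for staging, distributed cache, intermediate data, etc. -->
  <value>hdfs://mycluster</value>
</property>
<!-- Jobs then reference the large datasets directly by path, e.g. s3a://my-bucket/warehouse/events/ -->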
10-18-2017
02:03 PM
@Rajesh Reddy As I explained in my comment above, HDFS is used for intermediate storage of many datasets between stages, depending on the workload engines being used. Additionally, if these engines (e.g. MapReduce or Spark) make use of a distributed cache for jars, etc., those files will be pushed to HDFS, not S3. The link you provided talks about using S3 as a source and sink for a dataset; it does not describe replacing the defaultFS (HDFS) with S3. The first page of the guide you linked also states this: "These connectors are not a replacement for HDFS and cannot be used as a replacement for HDFS defaultFS." https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_cloud-data-access/content/intro.html
10-18-2017
01:27 PM
1 Kudo
@Rajesh Reddy At this time S3 cannot be used as an outright replacement for an HDFS deployment, so the data lifecycle you describe would have to be scripted. Today, jobs that have multiple MapReduce stages write data to HDFS between stages. S3 can be used as a source and sink for the input and final datasets, but it is not used for the intermediate data. Also, rack awareness is an HDFS block-based behaviour and does not apply to S3. If you want your dataset in S3 to also be located in another AWS region you could set up S3 Cross-Region Replication. https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/
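As a rough sketch of the source/sink pattern (the bucket name and keys are placeholders; in practice you would likely use IAM roles or a credential provider rather than plain keys), the s3a connector only needs credentials in core-site.xml, and jobs then point their input and final output at s3a:// paths while their intermediate data stays on HDFS:
<property>
  <name>fs.s3a.access.key</name>
  <value>PLACEHOLDER_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>PLACEHOLDER_SECRET_KEY</value>
</property>
<!-- e.g. read from s3a://my-bucket/input/ and write the final dataset to s3a://my-bucket/output/ -->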
10-18-2017
12:31 PM
1 Kudo
@Rajesh Reddy That is correct, replication factor is an HDFS-specific setting. When you load your data into S3 you are relying on S3's own durability mechanisms instead. In the documents linked below AWS claims "99.999999999% of durability for objects stored within a given region." S3 redundantly stores multiple copies across multiple facilities within the region for durability, and performs many of the same actions that HDFS does in terms of detecting corrupt replicas and replacing them. https://d0.awsstatic.com/whitepapers/protecting-s3-against-object-deletion.pdf http://docs.aws.amazon.com/AmazonS3/latest/dev/DataDurability.html
10-18-2017
12:19 PM
It appears you're timing out because the Atlas hook is not working properly; if you remove the Atlas hook this should run faster. @bkosaraju Look at this config, try removing the Atlas entry ("org.apache.atlas.hive.hook.HiveHook"), then restart all services and try again:
<property>
  <name>hive.exec.post.hooks</name>
  <value>org.apache.hadoop.hive.ql.hooks.ATSHook, org.apache.atlas.hive.hook.HiveHook</value>
</property>
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.6/bk_command-line-installation/content/configuring_atlas_hive_hook.html
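For clarity, assuming the ATSHook is the only other post-hook you have configured, the property would look like this after the Atlas hook is removed:
<property>
  <name>hive.exec.post.hooks</name>
  <value>org.apache.hadoop.hive.ql.hooks.ATSHook</value>
</property>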
04-22-2016
09:23 PM
14 Kudos
Single String vs Nested Partitions
When creating partitions by date it is almost always more effective to partition by a single string of 'YYYY-MM-DD' rather than use a multi-depth partition with the year, month, and day each as their own partition column. The single-string approach allows more SQL operators to be used, such as LIKE, IN, and BETWEEN, which cannot be used as easily with nested partitions.

Example

Dates to select: 2015-01-01, 2015-02-03, 2016-01-01

Tables:
Table A, partitioned by DateStamp STRING as YYYY-MM-DD
Table B, partitioned by Year INT, Month INT, Day INT

Queries on Table A

All dates:
SELECT * FROM TableA WHERE DateStamp IN ('2015-01-01', '2015-02-03', '2016-01-01')

Only 2015:
SELECT * FROM TableA WHERE DateStamp LIKE '2015-%'

Only February 2015:
SELECT * FROM TableA WHERE DateStamp LIKE '2015-02-%'

All dates where the day ends with a 5:
SELECT * FROM TableA WHERE DateStamp LIKE '%-%-%5'

All days between 2015-01-01 and 2015-03-01:
SELECT * FROM TableA WHERE DateStamp BETWEEN '2015-01-01' AND '2015-03-01'

Queries on Table B

All dates:
SELECT * FROM TableB WHERE (Year=2015 AND Month=01 AND Day=01) OR (Year=2015 AND Month=02 AND Day=03) OR (Year=2016 AND Month=01 AND Day=01)

Only 2015:
SELECT * FROM TableB WHERE Year=2015

Only February 2015:
SELECT * FROM TableB WHERE Year=2015 AND Month=02

All dates where the day ends with a 5:
SELECT * FROM TableB WHERE Day LIKE '%5'

All days between 2015-01-01 and 2015-03-01:
SELECT * FROM TableB WHERE Year=2015 AND ((Month=01 OR Month=02) OR (Month=03 AND Day=01))
03-22-2016
03:45 PM
Partitions are important just to reduce the size of the dataset you're joining against; if your table is multiple petabytes and you have to scan all of it each time you ingest data, that's not a very smart design. That said, columns do not have to be partitions! We have a column which is the hash, and we could have any number of extra columns here as well. We don't partition on the hash column at all; if your use case can handle scanning the entire dataset each time to remove dupes, then don't worry about partitions. One way or another you need a way to test for uniqueness of the record. A left outer join does this easily, with a test for NULL to find the records in the 'ingest' dataset that are not in the master dataset:
INSERT INTO TABLE Master_table
SELECT t.id, t.col1, t.col2, t.colN
FROM (
  SELECT s.id, s.col1, s.col2, s.colN
  FROM Staging_table s
  LEFT OUTER JOIN Master_table m ON (s.id = m.id)
  WHERE m.id IS NULL
) t;