Created 07-03-2019 02:37 AM
Hello,
I see that the path /atsv2 in HDFS has a large size, and it keeps growing. This path contains the embedded HBase data of the YARN Timeline Service v2 (ATSv2).
Can anyone explain what this path is for, and how to purge old data from it?
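For reference, the path's usage can be checked with a standard HDFS command (summary, human-readable sizes):
hdfs dfs -du -s -h /atsv2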
Thanks.
Created 07-04-2019 04:16 PM
Hi @son trinh !
By default, the ATSv2 tables keep data for 30 days, so you can shrink the ATSv2 footprint by reducing that retention.
You do this by lowering the TTL on the tables, for example, setting the expiration to 15 days (15 x 86,400 = 1,296,000 seconds).
Assuming you're running HBase in embedded mode for ATSv2 (remember that ATSv2 HBase can also run in service mode), run this as the yarn-ats user, with a Kerberos ticket if on a kerberized environment:
hbase --config /etc/hadoop/conf/embedded-yarn-ats-hbase shell
and inside the hbase shell, run these:
alter 'prod.timelineservice.application', {NAME => 'm', TTL => 1296000}
alter 'prod.timelineservice.subapplication', {NAME => 'm', TTL => 1296000}
alter 'prod.timelineservice.entity', {NAME => 'm', TTL => 1296000}
That should keep the ATSv2 db smaller.
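To verify that the new TTL took effect, you can describe one of the tables in the same shell and check that the 'm' column family shows TTL => 1296000:
describe 'prod.timelineservice.application'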
Regards
--
Tomas
Created 07-05-2019 06:19 AM
Hi Tomas,
Many thanks. I will try to follow your guide.
Created 07-05-2019 02:55 PM
Great, @son trinh, let me know how it goes.
Oh, and also, if you don't get back the amount of disk space you need, we can set TTLs on the other data as well.
My recommendation above covers only the metrics column family ('m') on those tables, which holds things like memory and CPU usage per container. That data is the least important and the only data that comes with an expiration period by default, so you don't lose job execution metadata (what was executed, where and when, exit status, etc.). But if required, and you are OK with not having that information after the retention period, we could also make the rest of the ATSv2 data expire with:
alter 'prod.timelineservice.application', {NAME => 'c', TTL => 1296000}
alter 'prod.timelineservice.application', {NAME => 'i', TTL => 1296000}
alter 'prod.timelineservice.app_flow', {NAME => 'm', TTL => 1296000}
alter 'prod.timelineservice.entity', {NAME => 'c', TTL => 1296000}
alter 'prod.timelineservice.entity', {NAME => 'i', TTL => 1296000}
alter 'prod.timelineservice.flowrun', {NAME => 'i', TTL => 1296000}
alter 'prod.timelineservice.flowactivity', {NAME => 'i', TTL => 1296000}
alter 'prod.timelineservice.subapplication', {NAME => 'c', TTL => 1296000}
alter 'prod.timelineservice.subapplication', {NAME => 'i', TTL => 1296000}
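One thing to keep in mind: HBase only physically removes expired cells when the underlying store files are compacted, so the footprint in /atsv2 shrinks after the next compaction rather than immediately. If you don't want to wait, you can force one per table from the same hbase shell, for example:
major_compact 'prod.timelineservice.application'
major_compact 'prod.timelineservice.entity'
major_compact 'prod.timelineservice.subapplication'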
Regards,
--
Tomas
Created 07-08-2019 01:29 AM
Hi Tomas,
I applied these changes and ran a compaction manually. The size of /atsv2 is smaller now.
Many thanks!