One thing that’s always been a pain on the BASS is getting the cluster clear for a maintenance window, which usually happens monthly. Our initial solution was to send out an email notifying users of the upcoming outage and then simply kill all of the jobs at the appointed time. This obviously has its downsides. The solution that is currently implemented uses a JSV (job submission verifier) to modify the hard runtime limit on each incoming job and notify the user of the change.
On the BASS, the default runtime for a job is 2 days (set in the cluster-wide sge_request file). Therefore, to be most effective, the maintenance must be scheduled at least two days in advance so that every incoming job has its runtime limited. A couple of users require longer runtime limits, so we have to make sure they’re covered too.
The JSV below calculates the time at which the incoming job would reach its hard runtime limit. If that time falls after the start of the configured maintenance window, the JSV lowers the hard runtime limit and notifies the user. This ensures that all jobs running on the cluster will end before the maintenance window begins and that users know about it.
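To make the arithmetic concrete, here is a minimal sketch of just the clamping step, separate from any Grid Engine plumbing. The helper name clamp_runtime and its arguments are hypothetical and exist only for illustration; the margin stands in for the few minutes left for jobs to die before the window opens.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the JSV: given the current time, the
# requested runtime, the maintenance start and a safety margin (all in
# seconds), return the runtime the job may keep, or undef if it is
# already too late to start anything.
sub clamp_runtime {
    my ($now, $requested, $maintenance_begin, $margin) = @_;
    my $deadline = $maintenance_begin - $margin;
    return if $deadline < $now;               # the job would be rejected
    my $remaining = $deadline - $now;
    return $requested > $remaining ? $remaining : $requested;
}

# Example: a 2-day request, maintenance starting in 1 day, 5-minute margin.
my $now     = 1_000_000;
my $allowed = clamp_runtime($now, 2 * 86400, $now + 86400, 300);
print "$allowed\n";                           # 86100 (1 day minus 5 minutes)
```

A job that would finish before the deadline keeps its requested runtime unchanged; only jobs that would overlap the window are shortened.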
Without further ado, here’s the JSV itself. It looks for a file named maintenance in the $SGE_ROOT/$SGE_CELL/common directory that contains the Unix timestamp at which the maintenance window begins. It’s executed client-side from the cluster-wide sge_request file in the same directory. The script requires the Date::Format and Time::Duration modules for the date math that it does. I’m not a Perl programmer, so bear with me.
#!/usr/bin/perl
use strict;
use warnings;
no warnings qw/uninitialized/;
use Env qw(SGE_ROOT SGE_CELL);
use lib "$SGE_ROOT/$SGE_CELL/common/perl/lib/perl5/site_perl/5.8.8/";
use Date::Format;
use Time::Duration;
use lib "$SGE_ROOT/util/resources/jsv";
use JSV qw( :DEFAULT jsv_sub_is_param jsv_sub_add_param jsv_sub_get_param jsv_send_env jsv_log_info jsv_is_param jsv_get_param );
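# Convert an h_rt value, either h:m:s or plain seconds, to seconds.
# Anything else (including a missing value) yields 0.
sub hms2s {
    my $input = shift;
    if( $input =~ m/(\d*):(\d*):(\d*)/ ) {
        my $h = $1 || 0;
        my $m = $2 || 0;
        my $s = $3 || 0;
        return $h*3600+$m*60+$s;
    } elsif( $input =~ m/(\d+)/) {
        return $1;
    } else {
        return 0;
    }
}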
jsv_on_start(sub {
    jsv_send_env();
});
jsv_on_verify(sub {
    my $data_file = "$SGE_ROOT/$SGE_CELL/common/maintenance";
    # A missing or empty maintenance file means no window is scheduled.
    my $success = open(my $dat, '<', $data_file);
    if( !$success ) {
        jsv_accept("No maintenance window scheduled.");
        return;
    }
    my $maintenance_begin = <$dat>;
    close($dat);
    chomp($maintenance_begin) if defined $maintenance_begin;
    if( !$maintenance_begin ) {
        jsv_accept("No maintenance window scheduled.");
        return;
    }
    # Allow a 5 minute window for jobs to die before the maintenance officially starts.
    my $delta = 300;
    my $now = time();
    if( $maintenance_begin-$delta < $now ) {
        jsv_log_info('*'x81);
        jsv_log_info('* Maintenance is currently in progress');
        jsv_log_info('* For more information, see http://blahblah');
        jsv_log_info('*'x81);
        jsv_reject();
        return;
    }
    my $requested_rt = hms2s(jsv_sub_get_param('l_hard', 'h_rt'));
    # Shorten h_rt if the job would still be running when the window opens.
    if( $now + $requested_rt > $maintenance_begin - $delta ) {
        my $time_to_run = $maintenance_begin - $delta - $now;
        jsv_sub_add_param('l_hard','h_rt',$time_to_run);
        jsv_log_info('*'x81);
        jsv_log_info('* A maintenance window is scheduled for '.time2str('%m/%d/%Y %H:%M:%S', $maintenance_begin, 'EST'));
        jsv_log_info('* Your job will be allowed to run for '.duration($time_to_run));
        jsv_log_info('* For more information, see http://blahblah');
        jsv_log_info('*'x81);
    }
    jsv_correct('Job is accepted');
    return;
});
jsv_main();
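
The JSV only reads the maintenance file, so something else has to create it. Below is a rough sketch of one way that could be done, assuming the window start is given as a local-time string; the helper names maintenance_timestamp and write_maintenance_file and the script name in the usage comment are made up for illustration and are not part of the setup described above. It uses only the core Time::Local module.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Hypothetical helper: turn "YYYY-MM-DD HH:MM" (local time) into a Unix timestamp.
sub maintenance_timestamp {
    my ($when) = @_;
    my ($y, $mo, $d, $h, $mi) = $when =~ /^(\d{4})-(\d\d)-(\d\d) (\d\d):(\d\d)$/
        or die "expected YYYY-MM-DD HH:MM, got '$when'\n";
    return timelocal(0, $mi, $h, $d, $mo - 1, $y);   # months are 0-based
}

# Hypothetical helper: write the timestamp as the only line of the file.
sub write_maintenance_file {
    my ($path, $timestamp) = @_;
    open(my $fh, '>', $path) or die "Cannot write $path: $!\n";
    print {$fh} "$timestamp\n";
    close($fh) or die "Cannot close $path: $!\n";
    return;
}

# Usage (assumed script name): perl set_maintenance.pl "2030-01-15 08:00"
if (@ARGV) {
    my $path = "$ENV{SGE_ROOT}/$ENV{SGE_CELL}/common/maintenance";
    write_maintenance_file($path, maintenance_timestamp($ARGV[0]));
}
```

As the JSV is written, submissions are rejected for as long as the file holds a timestamp in the past, so the file would need to be removed (or emptied) once the maintenance is over.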