Instructions to use Condor
To use Condor within AFS, you must grant the group 'condor' access to every directory where Condor needs to write. For example:
find ./your_working_directory -type d -exec fs sa -dir {} -acl condor rlidwk \;
Note: your working directory must be in a directory tree that group condor has access to.
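Before changing any ACLs, it can help to preview which directories the command above will touch. The sketch below prints the fs command for each directory instead of running it; preview_acl is a hypothetical helper name, not part of AFS or Condor.

```shell
# Dry run: print the fs command for every directory under $1
# instead of executing it. preview_acl is a hypothetical helper name.
preview_acl() {
    find "$1" -type d -exec echo fs sa -dir {} -acl condor rlidwk \;
}

# Only run the preview if the directory exists.
if [ -d ./your_working_directory ]; then
    preview_acl ./your_working_directory
fi
```

If the printed list looks right, run the original find command to apply the ACLs.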
Now, you must create a Condor job file. The provided example is 'ee.sub'.
You must include at least the first two lines:
The first line, 'universe = vanilla', tells Condor to use the 'vanilla' universe, which is suitable for all job types.
The second line, 'executable = ', gives the full path to your program.
The rest of the lines are as follows:
arguments = #these are passed on the command line to your program
input = #this file will become the stdin for your program
output = #this file will become stdout
error = #this file will become stderr
initialdir = #you can set up a separate initial working directory for each instance of your program; can be relative or absolute
Finally, the queue command tells Condor to take all the lines preceding it and
submit a job with these parameters. You can then change one or more of the
parameters and issue another queue command. This will be a different instance
of the program. You can have as many queue commands as you want, but you must be
careful to change working directories if your program writes to a file or to
stdout/stderr; otherwise the instances will overwrite each other's output.
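As a rough sketch of the layout described above, the following shell commands write a job file with two instances of the same program, each with its own working directory. The file name example.sub, the program path, the argument values, and the names run1, run2, out.txt, and err.txt are placeholders chosen for illustration, not values from the provided ee.sub.

```shell
# Write a sample job file. Everything after "executable" is illustrative:
# /path/to/my_program, the arguments, and run1/run2 are placeholders.
cat > example.sub <<'EOF'
universe   = vanilla
executable = /path/to/my_program

# First instance: runs in run1
arguments  = 10
output     = out.txt
error      = err.txt
initialdir = run1
queue

# Second instance: only the changed parameters are repeated
arguments  = 20
initialdir = run2
queue
EOF

# Create the working directories named in the job file.
mkdir -p run1 run2
```

With relative names for output and error, each instance is expected to write them under its own initialdir, so the two runs should not collide.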
To run:
Once you have created your job file (ee.sub, in my example), log in to
s1.mate.cs.pitt.edu and issue:
condor_submit ee.sub
It will then start queuing your jobs. You can check the status of the cluster
with this command:
/sbin/service condor status
and you can see what your jobs are doing with this command:
/cluster/condor/x86_64-linux-26/bin/condor_q
The above command will give you an ID number for each task; that ID will be in
the format xxx.yy, where xxx identifies the batch and yy the task within it.
You can kill a whole batch of tasks by issuing:
/cluster/condor/x86_64-linux-26/bin/condor_rm xxx
or you can kill a specific task by:
/cluster/condor/x86_64-linux-26/bin/condor_rm xxx.yy
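To make the ID format concrete, here is a small sketch that splits an ID into its two parts and prints the matching condor_rm commands without running them. The ID 123.4 is a made-up value, and the variable names are only for illustration.

```shell
# Hypothetical ID, in the xxx.yy format shown by condor_q.
job_id="123.4"

# Part before the dot: the whole batch (xxx).
batch="${job_id%%.*}"
# Part after the dot: one task within the batch (yy).
task="${job_id#*.}"

CONDOR_BIN=/cluster/condor/x86_64-linux-26/bin

# Print the commands instead of running them.
echo "kill whole batch $batch: $CONDOR_BIN/condor_rm $batch"
echo "kill only task $task: $CONDOR_BIN/condor_rm $job_id"
```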
To start/restart the Condor service:
This is needed in case Condor does not "see" all the machines in the mate
cluster. If you are prompted for a password while using the following
commands, use your AFS password.
To start the Condor service on that machine:
sudo /sbin/service condor start
To stop the Condor service on that machine:
sudo /sbin/service condor stop