Instructions to use Condor

 

To use Condor within AFS, you must give the group 'condor' access to every directory that Condor needs to write to. The following command does this for a directory and all of its subdirectories:


find ./your_working_directory -type d -exec fs sa -dir {} -acl condor rlidwk \;

Note: your working directory must be in a directory tree that group condor has access to.
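Before changing any ACLs, it can help to preview which directories the find command will touch. The sketch below is a dry run under stated assumptions: the directory name condor_demo, its two subdirectories, and the file acl_commands.txt are hypothetical placeholders, and the fs command is only printed, not executed.

```shell
# Dry run: print the fs command for every directory instead of executing it.
# WORKDIR and its subdirectories are hypothetical; use your own working directory.
WORKDIR=./condor_demo
mkdir -p "$WORKDIR/run1" "$WORKDIR/run2"

# Putting "echo" in front of "fs" makes find print each command rather than run it.
find "$WORKDIR" -type d -exec echo fs sa -dir {} -acl condor rlidwk \; > acl_commands.txt
cat acl_commands.txt

# Once the list looks right, rerun without "echo" to apply the ACLs:
# find "$WORKDIR" -type d -exec fs sa -dir {} -acl condor rlidwk \;
```

The preview lists the top-level directory as well as each subdirectory, since find -type d matches the starting directory too.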

Now, you must create a Condor job file. The provided example is 'ee.sub'.
You must include at least the first two lines:

The rest of the lines are as follows:

arguments = #these are passed on the command line to your program
input = #this file will become the stdin for your program
output = #this file will become stdout
error = #this file will become stderr
initialdir = #you can set up a separate initial working directory for each instance of your program; can be relative or absolute

Finally, the queue command tells Condor to take all the lines preceding it and submit a job with those parameters. You can then change one or more of the parameters and issue another queue command; this submits a different instance of the program. You can have as many queue commands as you want, but be careful to change the working directory (initialdir) for each instance if your program writes to a file or to stdout/stderr, so that the instances do not overwrite each other's output.
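As an illustration of the parameters and the queue command described above, the following sketch writes a small job file with two instances. It is not the provided ee.sub: the file name ee_example.sub, the program ./my_program, its --seed arguments, the file names, and the run1/run2 directories are all hypothetical, and the two opening lines (universe and executable) are an assumption about what a minimal job file starts with.

```shell
# Sketch: write a two-instance job file. All names below are hypothetical.
mkdir -p run1 run2    # one working directory per instance

cat > ee_example.sub <<'EOF'
# Assumed opening lines of a minimal job file
universe   = vanilla
executable = ./my_program

# First instance
arguments  = --seed 1
input      = in.txt
output     = out.txt
error      = err.txt
initialdir = run1
queue

# Second instance: only the changed parameters are restated
arguments  = --seed 2
initialdir = run2
queue
EOF

# Count the queue commands, i.e. the number of instances that would be submitted.
grep -c '^queue' ee_example.sub    # prints 2
```

Because each instance has its own initialdir, the relative names in.txt, out.txt, and err.txt are expected to refer to different files for each instance, which is what keeps the outputs separate.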

To run:

Once you have created your job file (ee.sub, in this example), log in to s1.mate.cs.pitt.edu and issue:

condor_submit ee.sub

It will then start queuing your jobs. You can check the status of the cluster with this command:

/sbin/service condor status

and you can see what your jobs are doing with this command:

/cluster/condor/x86_64-linux-26/bin/condor_q
 

The above command will give you an ID number for each task; that ID will be in the format xxx.yy, where xxx identifies the batch and yy the task within it. You can kill a whole batch of tasks by issuing:

/cluster/condor/x86_64-linux-26/bin/condor_rm xxx

or you can kill a specific task by:

/cluster/condor/x86_64-linux-26/bin/condor_rm xxx.yy
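To make the xxx.yy format concrete, here is a small sketch that splits an ID into its batch and task parts and prints the two corresponding condor_rm commands. The ID 123.4 is a made-up example value, the variable names are hypothetical, and the commands are only echoed, not run.

```shell
# Sketch: split a job ID of the form xxx.yy into its two parts.
# JOB_ID is a made-up example value, not output from a real queue.
JOB_ID="123.4"
BATCH="${JOB_ID%%.*}"   # text before the first dot: the batch (xxx)
TASK="${JOB_ID#*.}"     # text after the first dot: the task (yy)

CONDOR_BIN=/cluster/condor/x86_64-linux-26/bin

# Print, rather than run, the two removal commands described above.
echo "$CONDOR_BIN/condor_rm $BATCH"          # would remove every task in the batch
echo "$CONDOR_BIN/condor_rm $BATCH.$TASK"    # would remove only this one task
```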
 


To start/restart Condor service:

This is needed when Condor does not "see" all the machines in the mate cluster. To restart the service, stop it and then start it again. If you are prompted for a password while using the following commands, use your AFS password.

To start the Condor service on that machine:

sudo /sbin/service condor start

To stop the Condor service on that machine:

sudo /sbin/service condor stop