
Condor and Cloud Scheduler Installation and Configuration Guide

This is an installation guide for Condor 7.9 and Cloud Scheduler 1.4, tested on a Scientific Linux 6.4 machine on the OpenStack cloud platform.

Prepare a VM for Condor central manager

Launch a VM

  • Image: NeCTAR Scientific Linux 6.4 x86_64
  • Name: select one of your choice
  • Keypairs: select one of your choice
  • Flavour: m1.medium (8GB memory, 2 CPU cores, 60GB ephemeral disk)
  • Security group: alwaysopen, default

Set up Firewall

  • Run the following commands to set up the firewall for the Condor central manager:
    $ chkconfig --list | grep iptables
    $ chkconfig iptables on
    $ vi /etc/sysconfig/iptables
    # Firewall configuration written by system-config-securitylevel
    # Manual customization of this file is not recommended.
    *filter
    :INPUT ACCEPT [0:0]
    :FORWARD ACCEPT [0:0]
    :OUTPUT ACCEPT [0:0]
    :RH-Firewall-1-INPUT - [0:0]
    -A INPUT -j RH-Firewall-1-INPUT
    -A FORWARD -j RH-Firewall-1-INPUT
    -A RH-Firewall-1-INPUT -i lo -j ACCEPT
    -A RH-Firewall-1-INPUT -p icmp --icmp-type any -j ACCEPT
    -A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
    -A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 8080 -j ACCEPT
    -A RH-Firewall-1-INPUT -p udp -m state --state NEW -m udp --dport 8080 -j ACCEPT
    -A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 8081 -j ACCEPT
    -A RH-Firewall-1-INPUT -p udp -m state --state NEW -m udp --dport 8081 -j ACCEPT
    -A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 8111 -j ACCEPT
    -A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 8112 -j ACCEPT
    -A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 9618 -j ACCEPT
    -A RH-Firewall-1-INPUT -p udp -m state --state NEW -m udp --dport 9618 -j ACCEPT
    -A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 9614 -j ACCEPT
    -A RH-Firewall-1-INPUT -p udp -m state --state NEW -m udp --dport 9614 -j ACCEPT
    -A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 40000:50000 -j ACCEPT
    -A RH-Firewall-1-INPUT -p udp -m state --state NEW -m udp --dport 40000:50000 -j ACCEPT
    -A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
    COMMIT
    $ service iptables restart
    $ /etc/init.d/iptables status
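  • Optionally, confirm that the Condor-related rules made it into the active rule set. This is just a quick sanity check; adjust the port pattern as needed:
    $ iptables -L RH-Firewall-1-INPUT -n | grep -E '9618|8080|40000:50000'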

Fix Hostname

  • The hostname of a Condor server deployed on the NeCTAR cloud must be fixed manually because networking and metadata on the cloud are misconfigured.
  • Install nslookup, which is not included in the SL6.4 image by default (nslookup is part of the bind-utils package):
    $ yum -y install bind-utils
  • Run the following commands to fix hostname settings:
    $ EC2_METADATA=169.254.169.254
    $ IP_ADDRESS=`curl -m 10 -s http://$EC2_METADATA/latest/meta-data/local-ipv4`
    $ EXTHOSTNAME=`nslookup $IP_ADDRESS | grep 'name =' | awk '{print $4}'`
    $ EXTHOSTNAME=${EXTHOSTNAME%?}
    $ echo $IP_ADDRESS $EXTHOSTNAME >> /etc/hosts
    $ sed -i "s/^HOSTNAME=.*$/HOSTNAME=$EXTHOSTNAME/" /etc/sysconfig/network
    $ hostname $EXTHOSTNAME
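  • To confirm the fix (optional), check that the full hostname and its resolution now agree:
    $ hostname -f
    $ getent hosts `hostname -f`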

Condor

Enable the EPEL Repository

  • Install EPEL repo:
    $ rpm -Uvh http://mirror.aarnet.edu.au/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

Configure the Condor Repository

  • Set up Condor repo:
    $ wget http://research.cs.wisc.edu/htcondor/yum/repo.d/condor-development-rhel6.repo -O /etc/yum.repos.d/condor-development-rhel6.repo
    $ wget http://research.cs.wisc.edu/htcondor/yum/repo.d/condor-stable-rhel6.repo -O /etc/yum.repos.d/condor-stable-rhel6.repo

Install Condor

  • Run the following command to install the Condor server (this may take a while):
    $ yum -y install condor

Configure Condor

  • In /etc/condor/condor_config, modify the value of ALLOW_WRITE so that newly started Condor worker VMs can phone home and add themselves to the Condor machine pool:
    ALLOW_WRITE = *
  • In /etc/condor/condor_config.local, add the following lines:
    ## CLOUD SCHEDULER SETTINGS
    ENABLE_SOAP = TRUE
    ENABLE_WEB_SERVER = TRUE
    WEB_ROOT_DIR=$(RELEASE_DIR)/web
    ALLOW_SOAP=localhost, 127.0.0.1
    SCHEDD_ARGS = -p 8080
    CLASSAD_LIFETIME = 600
    UPDATE_COLLECTOR_WITH_TCP=True
    COLLECTOR_SOCKET_CACHE_SIZE=10000
    COLLECTOR.MAX_FILE_DESCRIPTORS = 10000
    LOWPORT = 40000
    HIGHPORT = 50000

    and make sure the following directives have correct settings:

    CONDOR_HOST=$(FULL_HOSTNAME)
    COLLECTOR_NAME = CoEPP Cloud Condor Pool at $(FULL_HOSTNAME)
    START = TRUE
    SUSPEND = FALSE
    PREEMPT = FALSE
    KILL = FALSE
    DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
    NEGOTIATOR_INTERVAL = 20
    TRUST_UID_DOMAIN = TRUE
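  • Optionally, verify that the settings are picked up by querying them back with condor_config_val:
    $ condor_config_val ALLOW_WRITE
    $ condor_config_val DAEMON_LIST
    $ condor_config_val CONDOR_HOST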

Cloud Scheduler

Install Prerequisites

  • The following package list is based on an earlier test on SL5, so not all of these may be required (installation may take a while):
    $ yum install gcc gdbm-devel readline-devel ncurses-devel zlib-devel bzip2-devel sqlite-devel db4-devel openssl-devel tk-devel bluez-libs-devel libxslt libxslt-devel libxml2-devel libxml2 python-devel

Install Python 2.7

We don't need Python 2.7 anymore, as the function that required it has been removed from CS.

  • Cloud Scheduler requires Python 2.7 to run properly. The original version of Python on SL6.4 is 2.6.6. Run the following commands to install version 2.7:
    $ VERSION=2.7.1
    $ mkdir /tmp/src
    $ cd /tmp/src/
    $ wget http://python.org/ftp/python/$VERSION/Python-$VERSION.tar.bz2
    $ tar xjf Python-$VERSION.tar.bz2
    $ cd Python-$VERSION
    $ ./configure
    $ make
    $ make altinstall
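  • Optionally, verify the new interpreter. With the default configure prefix, make altinstall places it in /usr/local/bin alongside the system Python:
    $ /usr/local/bin/python2.7 -V
    Python 2.7.1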

Install setuptools and pip

  • Run the following:
    $ cd /tmp/src
    $ wget http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg
    $ sh setuptools-0.6c11-py2.7.egg
    $ easy_install-2.7 pip
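  • Optionally, confirm that both tools were installed for Python 2.7:
    $ easy_install-2.7 --version
    $ pip-2.7 --version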

Install Cloud Scheduler

  • Install the 1.4 pre-release version:
    $ cd ~
    $ git clone https://github.com/hep-gc/cloud-scheduler.git
    $ cd cloud-scheduler
    $ git checkout 7a162aa59e1c7c6c7021a36a83ee2fe619e93c8b
    $ python2.7 setup.py install
    $ easy_install-2.7 .
    $ cp scripts/cloud_scheduler /etc/init.d/

    Or install the current full release, version 1.3.1:

    $ easy_install-2.7 pip
    $ pip-2.7 install cloud-scheduler

    NOTE: This installation guide uses version 1.4.

  • You can enable Cloud Scheduler to run at boot with:
    $ chkconfig --add cloud_scheduler
    $ chkconfig cloud_scheduler on
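  • Optionally, verify the installation. which cloud_scheduler should print the executable path (expected to be /usr/local/bin/cloud_scheduler, matching the init script settings below), and the Python package should import cleanly:
    $ which cloud_scheduler
    $ python2.7 -c "import cloudscheduler"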

Configure Cloud Scheduler

  • Modify the Cloud Scheduler init script so that it uses the correct executable path and Python 2.7:
    $ vi /etc/init.d/cloud_scheduler
    EXECUTABLEPATH=/usr/local/bin/cloud_scheduler
    PYTHON=/usr/local/bin/python2.7
  • In /etc/cloudscheduler/cloud_resources.conf, use the following settings as an example:
    # This is a sample cloud configuration file for Cloud Scheduler
    [NeCTAR]
    host: nova.rc.nectar.org.au
    port: 8773
    cloud_type: OpenStack
    regions: NeCTAR
    vm_slots: 10
    cpu_cores: 2
    storage: 600
    memory: 81920
    cpu_archs: x86, x86_64
    networks: public
    access_key_id: xxxx
    secret_access_key:xxxx
    key_name: nectarkey
    security_group: uvic
    hypervisor: kvm
    secure_connection:true
    enabled: true
    image_attach_device:vda
    scratch_attach_device:vdb

    NOTE: Remember to replace xxxx with the real keys.

  • In /etc/cloudscheduler/cloud_scheduler.conf, make sure it contains the following directives and values:
    condor_retrieval_method: local
    condor_webservice_url: http://localhost:8080
    condor_collector_url: http://localhost:9618
    condor_host_on_vm: your.condor.server.fqdn
    condor_context_file: /etc/condor/central_manager
    vm_lifetime: 10080
    persistence_file: /var/lib/cloudscheduler.persistence
    polling_error_threshold: 3
    condor_register_time_limit: 45
    job_distribution_type: split
    graceful_shutdown: true
    graceful_shutdown_method: off
    retire_before_lifetime: true
    retire_before_lifetime_factor: 1.5
    scheduler_interval: 10
    vm_poller_interval: 180
    job_poller_interval: 35
    machine_poller_interval: 15
    cleanup_interval: 35
    max_starting_vm: 20
    use_pyopenssl: True
    job_proxy_refresher_interval: 600
    job_proxy_renewal_threshold: 7200
    vm_proxy_refresher_interval: 600
    vm_proxy_renewal_threshold: 7200
    vm_idle_threshold: 1200
    proxy_cache_dir: /var/cache/cloudscheduler
    adjust_insufficient_resources: false
    clean_idle_shutdown: false
    log_location: /var/log/cloudscheduler.log

    NOTE: Remember to replace your.condor.server.fqdn with the full hostname of your Condor server (a sed one-liner that does this is sketched after this list). You can run the following command to get it:

    $ hostname -f
  • For troubleshooting purposes, it's recommended to set the logging level to VERBOSE. Modify it in /etc/cloudscheduler/cloud_scheduler.conf:
    log_level: VERBOSE

    NOTE: Remember to modify it back to INFO after finishing troubleshooting.
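
  • Optionally, you can substitute the hostname automatically instead of editing the condor_host_on_vm value by hand (a sketch; it assumes the placeholder your.condor.server.fqdn is still present in the file):
    $ sed -i "s/your.condor.server.fqdn/$(hostname -f)/" /etc/cloudscheduler/cloud_scheduler.conf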

Downgrade boto

  • There is an issue in boto versions above 2.6 that prevents Cloud Scheduler from creating VMs.
  • Downgrade boto from whatever version is on your host to 2.5.2, which works fine with CS versions 1.4 and 1.5:
    $ easy_install-2.7 -m boto==2.9.2
    $ easy_install-2.7 boto==2.5.2

    The first command removes boto 2.9.2 from the easy-install.pth file, so Python will use the newly installed 2.5.2 when boto is imported. If you prefer, you can also delete the entire /usr/local/lib/python2.7/site-packages/boto-2.9.2-py2.7.egg directory. A quick check of the active boto version is sketched after this list.

  • For more troubleshooting information, see the gaierror error section below.
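  • A quick way to confirm which boto version Python 2.7 now imports (boto exposes it as boto.__version__):
    $ python2.7 -c "import boto; print boto.__version__"
    2.5.2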

Condor Worker VM Image

Testing

Start the Condor service

  • Start the Condor service:
    $ service condor start
    Starting up Condor...done.
  • After a successful Condor central manager installation, the following processes should be running:
    $ ps -ef | grep condor
    condor    2088     1  0 13:07 ?        00:00:00 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
    root      2089  2088  0 13:07 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 10000000 -S 60 -C 498
    condor    2090  2088  0 13:07 ?        00:00:00 condor_collector -f
    condor    2091  2088  0 13:07 ?        00:00:00 condor_negotiator -f
    condor    2092  2088  0 13:07 ?        00:00:00 condor_schedd -f -p 8080
    root      2809 30592  0 13:24 pts/0    00:00:00 grep condor

Start the Cloud Scheduler service

  • Start Cloud Scheduler:
    $ service cloud_scheduler start
    Starting cloud_scheduler:                [  OK  ]
  • The following process should be running:
    $ ps -ef | grep cloud_scheduler
    root      2204     1  0 13:10 pts/0    00:00:02 /usr/local/bin/python2.7 /usr/local/bin/cloud_scheduler
    root      2805 30592  0 13:24 pts/0    00:00:00 grep cloud_scheduler

Prepare a test job

  • Condor does not allow us to submit jobs as root for security reasons; thus we need to add a new user:
    $ useradd -m -s /bin/bash csuser
  • Switch to the user you just created:
    $ su - csuser
  • Create a test job to submit to Condor queue:
    $ vi test.job
    # Regular Condor Attributes
    Universe   = vanilla
    Executable = test.sh
    Log        = test.log
    Output     = test.out
    Error      = test.error
    priority       = 1
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    # Cloud Scheduler Attributes
    Requirements = VMType =?= "cernvm-batch-nectar-2.5.1-3-1-x86_64-v4.img.gz"
    +VMName        = "CernVM"
    +VMAMI = "nova.rc.nectar.org.au:ami-000003f0"
    +VMInstanceType = "nova.rc.nectar.org.au:m1.medium"
    +VMCPUArch     = "x86_64"
    +VMCPUCores    = "2"
    +VMNetwork     = "public"
    +VMMem         = "8192"
    +VMJobPerCore  = "True"
    +TargetClouds  = "NeCTAR"
    +VMSecurityGroup = "uvic"
    Queue
  • Create a test script to run on the VM:
    $ vi test.sh
    #!/bin/bash
     
    hostname=`/bin/hostname -f`
    datetime=`date +"%b-%d-%y-%H-%M-%S"`
    filename='/tmp/hostname-'$datetime
    echo $hostname >> $filename
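  • Optionally, make the script executable. Condor sets the execute bit on transferred executables, so this is mainly useful for running the script locally first:
    $ chmod +x test.sh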

Submit the test job to Condor

  • Submit a test job:
    $ condor_submit test.job
    Submitting job(s).
    1 job(s) submitted to cluster 1.

Check jobs submitted to Condor

  • View the test job submitted to Condor queue:
    $ condor_q
    -- Submitter: vm-118-138-240-158.erc.monash.edu.au : <118.138.240.158:8080> : vm-118-138-240-158.erc.monash.edu.au
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
      1.0   csuser          5/10 13:33   0+00:05:56 R  1   0.0  test.sh
     
    1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

    NOTE: The submitted job can take a while to be scheduled and executed.

Check jobs scheduled by Cloud Scheduler

  • View the test job scheduled in the queues:
    $ cloud_status -q all
    Jobs in Scheduled Queue
    Global ID            User            VM Type         Job Status Status       Cloud
    u.au#15.0#1368156802 csuser          cernvm-batch-ne Running    Scheduled
     
    Jobs in New Queue
    Global ID            User            VM Type         Job Status Status       Cloud
     
    Jobs in High Priority Queue
    Global ID            User            VM Type         Job Status Status       Cloud

    Here you can see that the job has been scheduled by Cloud Scheduler and is in the running state.

Check VMs launched by Cloud Scheduler

  • View a VM started to serve the job:
    $ cloud_status -m
    ID     HOSTNAME                  VMTYPE               USER       STATUS       CLUSTER
    i-0000b2ee server-2b910c83-e710-4fd2-b28b-50279462c16b cernvm-batch-nectar-2.5.1-3-1-x86_64-v4.img.gz csuser     Running      NeCTAR

    Once the VM has been created successfully on the NeCTAR OpenStack cloud, you should be able to see it on the dashboard.

Check that started VMs joined the Condor machine pool

  • Check that the started VM has joined the Condor machine pool:
    $ condor_status
    Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
     
    slot1@vm-118-138-2 LINUX      X86_64 Claimed   Busy      0.000 3994  0+00:08:39
    slot2@vm-118-138-2 LINUX      X86_64 Unclaimed Idle      0.020 3994  0+00:01:28
                         Total Owner Claimed Unclaimed Matched Preempting Backfill
     
            X86_64/LINUX    10     0       5         5       0          0        0
     
                   Total    10     0       5         5       0          0        0

    Here you can see the 2-core VM in the Condor machine pool, with one job slot occupied.

Troubleshooting

gaierror error

  • You may find a gaierror of “Name or service not known” in /var/log/cloudscheduler.log:
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/site-packages/cloud_scheduler-1.4-py2.7.egg/cloudscheduler/ec2cluster.py", line 194, in vm_create
        image = connection.get_image(vm_ami)
      File "/usr/local/lib/python2.7/site-packages/boto-2.9.2-py2.7.egg/boto/ec2/connection.py", line 232, in get_image
        return self.get_all_images(image_ids=[image_id])[0]
      File "/usr/local/lib/python2.7/site-packages/boto-2.9.2-py2.7.egg/boto/ec2/connection.py", line 171, in get_all_images
        [('item', Image)], verb='POST')
      File "/usr/local/lib/python2.7/site-packages/boto-2.9.2-py2.7.egg/boto/connection.py", line 1035, in get_list
        response = self.make_request(action, params, path, verb)
      File "/usr/local/lib/python2.7/site-packages/boto-2.9.2-py2.7.egg/boto/connection.py", line 981, in make_request
        return self._mexe(http_request)
      File "/usr/local/lib/python2.7/site-packages/boto-2.9.2-py2.7.egg/boto/connection.py", line 901, in _mexe
        raise e
    gaierror: [Errno -2] Name or service not known
  • This is a bug in boto 2.9.2. You need to downgrade to a version that works with your Cloud Scheduler; to do so, see the Downgrade boto section above. To test whether your version of boto works, use the following code:
    import boto
    from boto.ec2.connection import EC2Connection
    from boto.ec2.regioninfo import *
    ec2_access_key = "xxxx"
    ec2_secret_key = "xxxx"
    region = RegionInfo(name="NeCTAR", endpoint="nova.rc.nectar.org.au")
    connection = boto.connect_ec2(aws_access_key_id=ec2_access_key,
                            aws_secret_access_key=ec2_secret_key,
                            is_secure=True,
                            region=region,
                            port=8773,
                            path="/services/Cloud")
    vm_ami="ami-000003f0"
    image = connection.get_image(vm_ami)

    NOTE: Replace xxxx with real keys, and change the image ID ami-000003f0 in vm_ami variable to the one you wish to test.
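
    To run it, save the snippet to a file (the name test_boto.py here is arbitrary) and execute it with the same Python 2.7 interpreter Cloud Scheduler uses; if it completes without raising gaierror, that boto version works against your endpoint:

    $ python2.7 test_boto.py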
