CoEPP RC
 

ATLAS Distributed Computing Operations Shifts

Mission

  • ADCoS team shifts help to ensure that ATLAS computing resources are working well and that data is delivered to physicists.
  • Shifters monitor data transfers between T1s and T2s, MC production, data reprocessing and group production.
  • It is a class 2 shift with 24/7 coverage in 3 time zones: EU (8-16), US (16-24) and AP (0-8), all in CERN time. Shifts can be taken from your home institute.
  • Information about ATLAS Distributed Computing Operations Shifts is available at https://twiki.cern.ch/twiki/bin/viewauth/Atlas/ADCoS
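
As a quick illustration of the coverage windows listed above, the following minimal Python sketch maps an hour in CERN time to the zone on shift. The function name and the treatment of the boundaries (8:00 belongs to EU, 16:00 to US) are assumptions made for this example, not something the ADCoS documentation specifies.

```python
# Shift windows in CERN local time, as listed above: (start hour, end hour, zone).
# Assumption: each window includes its start hour and excludes its end hour.
SHIFT_WINDOWS = [
    (0, 8, "AP"),
    (8, 16, "EU"),
    (16, 24, "US"),
]


def shift_zone(cern_hour):
    """Return the zone whose shift covers the given hour (0-23) in CERN time."""
    if not 0 <= cern_hour <= 23:
        raise ValueError("hour must be in the range 0-23")
    for start, end, zone in SHIFT_WINDOWS:
        if start <= cern_hour < end:
            return zone
    raise AssertionError("unreachable: the windows cover the whole day")


print(shift_zone(3))   # an early-morning hour in CERN time
print(shift_zone(17))  # a late-afternoon hour in CERN time
```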

Levels of Shifter experience

Trainee Shifts in 2-day blocks:
  • Mon-Tue
  • Wed-Thu
  • Fri-Sat
  • no Sunday shift
  • OTP: 529223
Senior Shifts in 2-day blocks:
  • Mon-Tue
  • Wed-Thu
  • Fri-Sat
  • no Sunday shift
  • OTP: 529222
Expert Shifts in 7-day blocks:
  • Wednesday → Tuesday

Shifters start as Trainees to gather experience
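
The block structure above can be sketched in a few lines of Python. This is only an illustration of the calendar arithmetic; the function names are invented for this example and have no counterpart in the shift booking system.

```python
from datetime import date, timedelta

# Two-day blocks keyed by Python weekday number (Monday = 0 ... Sunday = 6).
# Sunday (6) is deliberately absent: there is no Sunday shift.
TWO_DAY_BLOCKS = {
    0: "Mon-Tue", 1: "Mon-Tue",
    2: "Wed-Thu", 3: "Wed-Thu",
    4: "Fri-Sat", 5: "Fri-Sat",
}


def two_day_block(day):
    """Return the Trainee/Senior block containing `day`, or None on a Sunday."""
    return TWO_DAY_BLOCKS.get(day.weekday())


def expert_block(day):
    """Return (first day, last day) of the Wednesday-to-Tuesday block containing `day`."""
    days_since_wednesday = (day.weekday() - 2) % 7
    start = day - timedelta(days=days_since_wednesday)
    return start, start + timedelta(days=6)


print(two_day_block(date(2013, 10, 10)))
print(expert_block(date(2013, 10, 10)))
```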

Prerequisites

  1. Make sure you are registered with ATLAS.
  2. Have a valid grid certificate.
  3. Be a member of the ATLAS VO.

Steps to become a new ADCoS Trainee Shifter

  1. Send an email to the Shift Coordinator, Alexey Sedov (asedov@pic.es), or the Shift Captain, Hiroshi Sakamoto, with the following text:
    Hi Alexey, 
    
    I'd like to sign up to be a trainee ADCoS shifter for "Asia Pacific/EU/US".
    My CERN username is <username> and my DN is </C=XX/O=YYYYY/OU=ZZZZZ/CN=Your Name>.
    Can you add me to shift bookings and put in a GGUS ticket for TEAM membership for me?
    
    Regards,
    <Your Name>
  2. Register in eLog: https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/?cmd=New+user You will be asked to fill in the registration form there.
  3. Create a CERN Savannah account at https://savannah.cern.ch/account/register.php. You can also use your CERN account details.
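
If you prefer to generate the sign-up email from step 1 rather than edit it by hand, a minimal Python sketch might look like the following. The wording comes from the template above; the function name, the field names and the very loose DN check are assumptions made for this example.

```python
from string import Template

# Wording taken from the email template in step 1; the $fields are illustrative.
REQUEST_TEMPLATE = Template(
    "Hi Alexey,\n"
    "\n"
    "I'd like to sign up to be a trainee ADCoS shifter for \"$zone\".\n"
    "My CERN username is $username and my DN is $dn\n"
    "Can you add me to shift bookings and put in a GGUS ticket "
    "for TEAM membership for me?\n"
    "\n"
    "Regards,\n"
    "$name\n"
)

ZONES = ("Asia Pacific", "EU", "US")


def build_request(zone, username, dn, name):
    """Fill in the sign-up email; raise ValueError on obviously wrong input."""
    if zone not in ZONES:
        raise ValueError("zone must be one of: " + ", ".join(ZONES))
    # Loose sanity check only: a grid DN starts with '/' and carries a CN field.
    if not dn.startswith("/") or "/CN=" not in dn:
        raise ValueError("DN should look like /C=XX/O=YYYYY/OU=ZZZZZ/CN=Your Name")
    return REQUEST_TEMPLATE.substitute(zone=zone, username=username, dn=dn, name=name)


# Placeholder values, not a real account.
print(build_request("Asia Pacific", "jdoe", "/C=XX/O=YYYYY/OU=ZZZZZ/CN=Jane Doe", "Jane Doe"))
```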

Booking Shifts

  1. Sign in using your normal CERN username and password
  2. You should see two parts to your homepage: “Book My Shifts” and “Find My Shifts”
  3. Expand the “Book My Shifts” by pressing the + symbol, if this section is not already expanded
  4. Under “Book My Shifts” is a list for all shifts you are allowed to book. The number “Id” is the task id for the shift. Please click on the Task Id for the shift you wish to book.
  5. If your shift (or shadow shift) does not appear here you have either not yet been listed as a member or the booking period is not (yet) open. Ask the responsible person for that shift to add you as a member or to open the shift booking period.
  6. You should now be able to see a page titled “Shift Booking”, referencing this single shift. Next to the name of the shift is a + symbol. Click this + symbol in order to expand the section.
  7. You should now be able to see a calendar. Please note that you may have to edit the drop-down menus which specify the start month and how many months the calendar shows. You can only book shifts in the future, not in the past!
  8. Any shift that is “pink” is unallocated. Select any “pink” shifts that you wish to book. They should now turn green.
  9. To save these shifts: please click “Save Shift Booking”

Checklist

  1. Open a Jabber client and say hello in the Virtual Control Room to let the ADC/ADCoS community know who is on shift. Room: adcvcr, Server: conference.chat.uio.no
  2. Cloud/site problems with production jobs. Are there clouds with low efficiency (below 50%)? http://panda.cern.ch/server/pandamon/query?dash=prod
  3. Transfer problems. Are there clouds/sites with a high number of jobs in the transferring state? http://panda.cern.ch:25880/server/pandamon/query?job=*&type=production&cloudtype=not_null&computingsitetype=not_null&jobStatus=transferring&transferringnotupdated=36
  4. Jobs in the waiting state. Are there jobs in the waiting state? Check http://panda.cern.ch:25880/server/pandamon/query?job=*&type=production&jobStatus=waiting and follow https://twiki.cern.ch/twiki/bin/viewauth/Atlas/ADCoS#Waiting_Jobs_Procedure
  5. Data management: cloud/site statistics in the DDM dashboard. Are there any clouds or sites below 50% efficiency? http://dashb-atlas-data.cern.ch/dashboard/ddm2/#d.dst.cloud=CA&d.src.cloud=ND&samples=true and http://dashb-atlas-data.cern.ch/dashboard/request.py/site
  6. Central Services status: https://sls.cern.ch/sls/service.php?id=ADC_CS
  7. Database services status: atlas-service-dbmonitor
  8. Task monitoring: dashb-atlas-prodsys-prototype.cern.ch
  9. Check the ADC eLog.
  10. At the beginning of your shift, check the deletion error rate. https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Checking_the_deletion_error_rate
  11. Check recent changes in queue status. http://panda.cern.ch?overview=incidents
  12. Check the status of the TEAM tickets in GGUS and the hand-over. https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#TicketMan
  13. Check the status of the group production jobs and transfer requests. https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Group_production_jobs
  14. Once per shift, check the replication status of the last 3 DB releases. http://panda.cern.ch/server/pandamon/query?mode=listDBRelease
  15. Once per shift, check the Frontier status. https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Frontier
  16. Remember to report every action in eLog. (For a 'new' entry, click on an existing entry first. If you solve an issue, put [SOLVED] in the eLog subject.)
  17. Remember to check whether sites are in scheduled downtime before opening bugs. https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCoS#Downtimes
  18. The senior shifter is obliged to submit a daily shift summary report. Trainee shifters do not submit a daily report.
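
The two long PanDA monitor links for transferring and waiting jobs are ordinary query strings, so they can be rebuilt from their parameters rather than copied by hand. The sketch below only reproduces the links given in the checklist; the function names are invented, and the meaning and unit of `transferringnotupdated` are not stated in this page, so the value 36 is simply carried over.

```python
from urllib.parse import urlencode

# Base address as it appears in the checklist links.
PANDAMON_QUERY = "http://panda.cern.ch:25880/server/pandamon/query"


def transferring_jobs_url(not_updated=36):
    """Rebuild the checklist link for production jobs stuck in 'transferring'."""
    params = {
        "job": "*",
        "type": "production",
        "cloudtype": "not_null",
        "computingsitetype": "not_null",
        "jobStatus": "transferring",
        "transferringnotupdated": not_updated,
    }
    # safe="*" keeps the wildcard readable instead of encoding it as %2A.
    return PANDAMON_QUERY + "?" + urlencode(params, safe="*")


def waiting_jobs_url():
    """Rebuild the checklist link for production jobs in the 'waiting' state."""
    params = {"job": "*", "type": "production", "jobStatus": "waiting"}
    return PANDAMON_QUERY + "?" + urlencode(params, safe="*")


print(transferring_jobs_url())
print(waiting_jobs_url())
```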

Details of Checklist Item 1

  1. Download and install a Jabber Client such as pidgin from http://www.pidgin.im/
  2. Go to Accounts –> Manage Accounts –> Add. Fill in the details as shown below. Use your CERN username and password.
  3. Once the account is added and you are connected to the server, join the adcvcr room by going to Buddies –> Join a chat and filling in the details as shown below.

Details of Checklist Item 2

  1. Open the URL http://panda.cern.ch/server/pandamon/query?dash=prod in your browser. It gives a quick snapshot of the overall production status by cloud.
  2. Look at the cloud efficiency (right-hand side of the cloud table) and spot the most problematic clouds.
  3. First look for sites with a lot of failing jobs, as shown in figure 1.
  4. Then look for tasks with a lot of failing jobs by going to the URL http://panda.cern.ch/server/pandamon/query?dash=task&show=active (shown in figure 2).
  5. Compare the two to determine whether the problem is site related or task related.
  6. If one task is failing at several sites, the task is the likely cause: open a Savannah bug report.
  7. If all failing jobs are at one site, the site is the likely cause: open a GGUS team ticket to the site.
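
The site-versus-task decision in the last three steps can be written down as a small rule. The Python sketch below is a simplification for illustration: it assumes you have already collected one (site, task) pair per failed job by hand, and the function name and return values are invented for this example. Real cases are usually less clear-cut and still need a shifter's judgement.

```python
def diagnose(failures):
    """Classify failed jobs given as (site, task) pairs.

    Returns "site" when every failure is at a single site (open a GGUS team
    ticket), "task" when a single task fails at several sites (open a
    Savannah bug report), "unclear" for anything else, and "none" when the
    list is empty.
    """
    failures = list(failures)
    if not failures:
        return "none"
    sites = {site for site, _ in failures}
    tasks = {task for _, task in failures}
    if len(sites) == 1:
        return "site"
    if len(tasks) == 1:
        return "task"
    return "unclear"


# Made-up site and task names, for illustration only.
print(diagnose([("SITE_A", "task_1"), ("SITE_A", "task_2")]))
print(diagnose([("SITE_A", "task_1"), ("SITE_B", "task_1")]))
```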

Example of the DDM Shift

  1. Look for problematic cloud
  2. Check the problematic site
  3. Check the error pattern
  4. Check the Endpoints
  5. Check the downtime status
  6. Check the downtime status in GOCDB
  7. We have a site that fails FTS transfers, and there is no ongoing downtime for it. Now we have to check whether there is already a bug report on this in the ADCoS eLog.
  8. Check the eLog.
  9. A list of eLog entries fulfilling your search criteria appears.
  10. Go back to the DDM Dashboard.
  11. Get more details
  12. After clicking on Transfer Placement time you will get the SRC SURL, DEST SURL and attempt counter.
  13. We have collected the error details and the link to the dashboard page with the error details. Now it's time to check the GGUS tickets.
  14. Search the GGUS team ticket database.
  15. If there is no GGUS team ticket regarding this error, submit a new one.
  16. Click on Open Team Ticket.
  17. Fill in the details such as error example, link to the dashboard, SRC SURL, DEST SURL, number of attempts and click on submit.
  18. Make an entry in the ADCoS eLog.
  19. Fill in the details, reference the GGUS ticket and click on submit.
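
To avoid forgetting one of the details the ticket form asks for, it can help to collect them in one place before opening GGUS. The Python sketch below is one possible way to do that; the class, its field names and the layout of the text are assumptions for this example, and the values shown are placeholders rather than real transfer data.

```python
from dataclasses import dataclass


@dataclass
class TransferError:
    """Details gathered from the DDM dashboard for one failing transfer."""
    site: str
    error_example: str
    dashboard_link: str
    src_surl: str
    dest_surl: str
    attempts: int

    def ticket_body(self):
        """Return the collected details as plain text for the ticket form."""
        lines = [
            f"Site: {self.site}",
            f"Error example: {self.error_example}",
            f"Dashboard link: {self.dashboard_link}",
            f"SRC SURL: {self.src_surl}",
            f"DEST SURL: {self.dest_surl}",
            f"Number of attempts: {self.attempts}",
        ]
        return "\n".join(lines)


# Placeholder values only; copy the real ones from the dashboard.
example = TransferError(
    site="EXAMPLE-SITE",
    error_example="<error text copied from the dashboard>",
    dashboard_link="<link to the dashboard page with the error details>",
    src_surl="srm://source.example.org/path/to/file",
    dest_surl="srm://dest.example.org/path/to/file",
    attempts=3,
)
print(example.ticket_body())
```

The same text can be pasted into the eLog entry together with a reference to the GGUS ticket.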
tutorial/adcos.txt · Last modified: 2013/10/10 09:45 by lucien
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International