====== ATLAS Distributed Computing Operations Shifts ======
===== Mission =====
* ADCoS team shifts help ensure that ATLAS computing resources are working well and that data is delivered to physicists.
* Shifters monitor data transfers between T1s and T2s, MC production, data reprocessing, and group production.
* It is a class 2 shift with 24/7 coverage across three time zones, given in CERN time: EU (8-16), US (16-24), and AP (0-8). Shifts can be taken from your home institute.
* Information about ATLAS Distributed Computing Operations Shifts is available at https://twiki.cern.ch/twiki/bin/viewauth/Atlas/ADCoS
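The three shift windows listed above can be written as a small lookup. The Python sketch below is illustrative only: the function name ''shift_region'' is hypothetical, and it assumes each window includes its start hour and excludes its end hour, which the text does not state explicitly.

```python
def shift_region(cern_hour: int) -> str:
    """Return the region on shift for a given hour (0-23) in CERN time.

    Assumption: windows are start-inclusive and end-exclusive,
    so hour 8 belongs to EU and hour 16 belongs to US.
    """
    if not 0 <= cern_hour <= 23:
        raise ValueError("cern_hour must be between 0 and 23")
    if cern_hour < 8:
        return "AP"  # 0-8 CERN time
    if cern_hour < 16:
        return "EU"  # 8-16 CERN time
    return "US"      # 16-24 CERN time
```

For example, ''shift_region(9)'' falls in the EU window and ''shift_region(20)'' in the US window.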
===== Levels of Shifter experience =====
^ Trainee| **Shifts in 2-day blocks:** \\ Mon-Tue \\ Wed-Thu \\ Fri-Sat \\ no Sunday shift | OTP: 529223 |
^ Senior| **Shifts in 2-day blocks:** \\ Mon-Tue \\ Wed-Thu \\ Fri-Sat \\ no Sunday shift | OTP: 529222 |
^ Expert| **Shifts in 7-day blocks:** \\ Wednesday → Tuesday | |
Shifters start as Trainees to gather experience.
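As a rough illustration of the Trainee and Senior block layout in the table, the following Python sketch maps a calendar date to its 2-day block. The names ''BLOCKS'' and ''shift_block'' are hypothetical and are not part of the OTP tool.

```python
from datetime import date
from typing import Optional

# Python's date.weekday() returns 0 for Monday through 6 for Sunday.
BLOCKS = {
    0: "Mon-Tue", 1: "Mon-Tue",
    2: "Wed-Thu", 3: "Wed-Thu",
    4: "Fri-Sat", 5: "Fri-Sat",
}

def shift_block(day: date) -> Optional[str]:
    """Return the 2-day block containing `day`, or None on Sunday (no shift)."""
    return BLOCKS.get(day.weekday())
```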
===== Prerequisites =====
- Make sure you are registered with ATLAS
- Have a valid grid certificate
- Be a member of the ATLAS VO.
===== Steps to become a new ADCoS Trainee Shifter =====
- Send an email to Shift Coordinator Alexey Sedov (asedov@pic.es) or Shift Captain Hiroshi Sakamoto with the following text:
Hi Alexey,
I'd like to sign up to be a trainee ADCoS shifter for "Asia Pacific/EU/US".
My CERN username is and my DN is DN:
Can you add me to shift bookings and put in a GGUS ticket for TEAM membership for me?
Regards,
* //Note: Your DN can be found at the following link. Search for your name and tick the option to show "Member DN".//
* [[https://lcg-voms.cern.ch:8443/vo/atlas/vomrs?path=/RootNode/MemberAction/SetRgstrStatus&action=execute]]
- Register in eLog at https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/?cmd=New+user You will be required to fill in the following form: {{ :elog-reg.png?direct&300 | }}
- Create a CERN Savannah account at https://savannah.cern.ch/account/register.php. You can also use your CERN account details. {{ :savannah.png?direct&500 | }}
- Create a GGUS supporter account at https://gus.fzk.de/admin/get_account.php?accounttype=support
{{ :ggus-reg.png?direct&500 |}}
===== Booking Shifts =====
- Go to OTP Tool http://atlas-otp.cern.ch/
- Sign in using your normal CERN username and account
- You should see two parts to your homepage: "Book My Shifts" and "Find My Shifts"
- Expand the "Book My Shifts" by pressing the + symbol, if this section is not already expanded {{ :otp-welcome.png?direct&500 |}}
- Under "Book My Shifts" is a list of all shifts you are allowed to book. The "Id" number is the task ID for the shift. Click on the task ID of the shift you wish to book.
- If your shift (or shadow shift) does not appear here you have either not yet been listed as a member or the booking period is not (yet) open. Ask the responsible person for that shift to add you as a member or to open the shift booking period.
- You should now be able to see a page titled "Shift Booking", referencing this single shift. Next to the name of the shift is a + symbol. Click this + symbol in order to expand the section. {{ :otp-book1.png?direct&500 |}}
- You should now be able to see a calendar. Note that you may have to adjust the drop-down menus that specify the start month and how many months the calendar shows. You can only book shifts in the future, not in the past! {{ :otp-book2.png?direct&500 |}}
- Any shift shown in pink is unallocated. Select the pink shifts that you wish to book. They should now turn green.
- To save these shifts, click "Save Shift Booking".
===== Checklist =====
| Open your Jabber client and say hello in the Virtual Control Room to let the ADC/ADCoS community know who is on shift | Room:adcvcr Server: conference.chat.uio.no |
| Cloud/site problems with production jobs. Are there clouds with low efficiency (below 50%)? | [[http://panda.cern.ch/server/pandamon/query?dash=prod]]|
| Transferring problems. Are there clouds/sites with a high number of jobs in transferring state? | [[http://panda.cern.ch:25880/server/pandamon/query?job=*&type=production&cloudtype=not_null&computingsitetype=not_null&jobStatus=transferring&transferringnotupdated=36]] |
| Jobs in waiting state. Are there jobs in waiting state? |[[http://panda.cern.ch:25880/server/pandamon/query?job=*&type=production&jobStatus=waiting]], follow [[https://twiki.cern.ch/twiki/bin/viewauth/Atlas/ADCoS#Waiting_Jobs_Procedure]]|
| Data Management cloud/site statistics in the DDM dashboard. Are there any clouds or sites below 50% efficiency? | [[http://dashb-atlas-data.cern.ch/dashboard/ddm2/#d.dst.cloud=CA&d.src.cloud=ND&samples=true]] and [[http://dashb-atlas-data.cern.ch/dashboard/request.py/site]] |
| Central Services status | [[https://sls.cern.ch/sls/service.php?id=ADC_CS]] |
| Database services status | [[https://atlas-service-dbmonitor.web.cern.ch/atlas-service-dbmonitor/shifter/database_dashboard.php| atlas-service-dbmonitor]] |
| Task Monitoring | [[http://dashb-atlas-prodsys-prototype.cern.ch/templates/task-prod/#user=&refresh=0&table=Tasks&p=1&records=25&activemenu=0&from=&till=&timerange=lastDay&notmodsince=&created=&pattern=&taskstatus=&site=&cloud=&typeofproc=&workinggroup=&activity=&prodtaskid=&taskpriority=&ndj= | dashb-atlas-prodsys-prototype.cern.ch]] |
| Check ADC eLog | |
| At the beginning of your shift please check the deletion error rate. | [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Checking_the_deletion_error_rate]] |
| Check recent changes in queue status | [[http://panda.cern.ch?overview=incidents]] |
| Check the status of the TEAM tickets in GGUS and hand-over. | [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#TicketMan]] |
| Check status of the Group production jobs and transfer requests | [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Group_production_jobs]] |
| Once per shift please check status of replication of last 3 DB releases | [[http://panda.cern.ch/server/pandamon/query?mode=listDBRelease]] |
| Once per shift please check Frontier status | https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Frontier |
| Remember to report every action in eLog. (To create a new entry, click on an existing entry first. If you solve an issue, put [SOLVED] in the eLog subject.) | |
| Remember to check if sites are in Scheduled Downtime before opening bugs | [[https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCoS#Downtimes]] |
| The Senior shifter is obliged to submit a daily shift summary report. Trainee shifters do not submit a daily report. | |
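Several checklist items ask whether a cloud or site is below 50% efficiency. The Python sketch below shows one way to apply that threshold to job counts you have read off a dashboard. It is only a sketch under stated assumptions: efficiency is assumed here to be finished / (finished + failed), which may differ from how the dashboards compute it, and the names ''efficiency'' and ''low_efficiency'' are hypothetical.

```python
from typing import Dict, List, Tuple

THRESHOLD = 0.5  # the checklist flags efficiency below 50%

def efficiency(finished: int, failed: int) -> float:
    """Assumed definition: share of completed jobs that finished successfully."""
    total = finished + failed
    # With no completed jobs there is nothing to flag, so report 1.0.
    return finished / total if total else 1.0

def low_efficiency(counts: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return the names (sorted) whose efficiency is strictly below THRESHOLD.

    `counts` maps a cloud or site name to (finished, failed) job counts.
    """
    return sorted(
        name for name, (finished, failed) in counts.items()
        if efficiency(finished, failed) < THRESHOLD
    )
```

A cloud at exactly 50% is not flagged, since the checklist asks for values below 50%.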
==== Details of Checklist Item 1 ====
- Download and install a Jabber Client such as pidgin from http://www.pidgin.im/
- Go to Accounts --> Manage Account --> Add. Fill in the details as shown below. Use your CERN username and password. {{ :jabber1.png?direct&300 |}}
- Once the account is added and you are successfully connected to the server, join the adcvcr room by going to Buddies --> Join a chat and filling in the details as shown below. {{ :jabber2.png?direct&300 |}}
==== Details of Checklist Item 2 ====
- Open the URL http://panda.cern.ch/server/pandamon/query?dash=prod in your browser. You get a quick snapshot of the overall production status by cloud.
- Look at the cloud efficiency (right-hand side of the cloud table) and spot the most problematic clouds.
- First look for sites with a lot of failing jobs, as shown in figure 1.
- Then look for tasks with a lot of failing jobs by opening http://panda.cern.ch/server/pandamon/query?dash=task&show=active (shown in figure 2).
- Compare the two to determine whether the problem is site related or task related.
- If one task is failing at several sites, the task is the likely cause: open a Savannah bug report.
- If all failing jobs are at one site, the site is the likely cause: open a GGUS team ticket to the site.
{{ :panda1.png?direct&300 |}}
{{ :panda-tasks.png?direct&300 |}}
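The site-versus-task comparison described in the steps above can be sketched as a first-pass heuristic in Python. The function name ''classify_failures'' and the input format (one ''(task_id, site)'' pair per failed job) are assumptions made for illustration. The PanDA monitor does not provide this function, and a shifter should still inspect the error messages before filing anything.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def classify_failures(failed_jobs: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Classify each failing task as a likely "task" or "site" problem.

    failed_jobs: one (task_id, site) pair per failed job.
    A task failing at several sites points to the task (Savannah bug report);
    a task failing at a single site points to that site (GGUS team ticket).
    """
    sites_by_task = defaultdict(set)
    for task_id, site in failed_jobs:
        sites_by_task[task_id].add(site)

    return {
        task_id: "task" if len(sites) > 1 else "site"
        for task_id, sites in sites_by_task.items()
    }
```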
==== Example of the DDM Shift ====
- Open the DDM dashboard URL http://dashb-atlas-data.cern.ch/dashboard/request.py/site {{ :ddm1.png?direct&300 |}}
- Look for problematic cloud {{ :ddm2.png?direct&300 |}}
- Check the problematic site {{ :ddm3.png?direct&300 |}}
- Check the error pattern {{ :ddm4.png?direct&300 |}}
- Check the Endpoints {{ :ddm5.png?direct&300 |}}
- Check the downtime status {{ :ddm6.png?direct&300 |}}
- Check the downtime status in GOCDB {{ :ddm7.png?direct&300 |}}
- In this example, a site is failing FTS transfers and there is no ongoing downtime for it. Next, check whether a bug report on this already exists in the ADCoS eLog. {{ :ddm8.png?direct&300 |}}
- Search the eLog. {{ :ddm9.png?direct&300 |}}
- A list of eLog entries fulfilling your search criteria appears. {{ :ddm10.png?direct&300 |}}
- Go back to the DDM Dashboard. {{ :ddm11.png?direct&300 |}}
- Get more details {{ :ddm12.png?direct&300 |}}
- After clicking on Transfer Placement time, you will be able to see the SRC SURL, DEST SURL and attempt counter. {{ :ddm13.png?direct&300 |}}
- You have now collected the error details and the link to the dashboard page showing them. Now it's time to check GGUS tickets. {{ :ddm14.png?direct&300 |}}
- Search the GGUS team ticket database. {{ :ddm15.png?direct&300 |}}
- If there is no GGUS team ticket regarding this error, submit a new one. {{ :ddm16.png?direct&300 |}}
- Click on Open Team Ticket. {{ :ddm17.png?direct&300 |}}
- Fill in the details such as error example, link to the dashboard, SRC SURL, DEST SURL, number of attempts and click on submit. {{ :ddm18.png?direct&300 |}}
- Make an entry in the ADCoS eLog. {{ :ddm19.png?direct&300 |}}
- Fill in the details, reference the GGUS ticket and click on submit. {{ :ddm20.png?direct&300 |}}
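The walkthrough above amounts to a short decision sequence followed by collecting a fixed set of fields for the ticket. The Python sketch below summarises it. All names (''TransferFailure'', ''next_action'', ''ticket_body'') are hypothetical, nothing here talks to GGUS or eLog, and treating an existing eLog entry the same as an existing GGUS ticket is an assumption.

```python
from dataclasses import dataclass

@dataclass
class TransferFailure:
    """Details collected from the DDM dashboard for one failing transfer."""
    site: str
    error_example: str
    dashboard_link: str
    src_surl: str
    dest_surl: str
    attempts: int

def next_action(in_downtime: bool, known_in_elog: bool, ggus_ticket_exists: bool) -> str:
    """Return the next step, checking in the same order as the walkthrough."""
    if in_downtime:
        return "no ticket: site is in downtime"
    # Assumption: an existing eLog entry or GGUS ticket both mean "already reported".
    if known_in_elog or ggus_ticket_exists:
        return "already reported: do not open a duplicate ticket"
    return "open a new GGUS team ticket and reference it in eLog"

def ticket_body(failure: TransferFailure) -> str:
    """Format the fields the walkthrough asks for in a team ticket."""
    return "\n".join([
        f"Site: {failure.site}",
        f"Error example: {failure.error_example}",
        f"Dashboard link: {failure.dashboard_link}",
        f"SRC SURL: {failure.src_surl}",
        f"DEST SURL: {failure.dest_surl}",
        f"Number of attempts: {failure.attempts}",
    ])
```

Keeping the decision separate from the formatting makes it easy to paste the same text into both the GGUS ticket and the eLog entry.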