Triaging and managing Skampi bugs

This document defines a process for triaging and managing Skampi bugs, so that any SKA team member knows how to handle the funnel of incoming bugs and how those bugs are allocated, distributed and managed.

The standard process for changing software includes the following phases:

Problem/modification identification, classification, and prioritization
Analysis
Design
Implementation
Regression/system testing
Acceptance testing
Delivery

The above process is no different for triaging and managing a bug in Skampi. This document focuses on how to identify a problem or bug from incoming information and event notifications, and how to assign it to the right team(s).

Problem identification

The problem identification phase starts when there is an indication of a failure. The indication can be raised by a developer (in any shared Slack channel, such as team-system-support) or by an alert in the following Slack channels:

Any project member can join these channels to gain visibility of this information.

If the information comes from ci-alerts-mvp, the primary source of detailed information for analysis is the GitLab pipeline logs, available here.

Other sources of information are:

Kibana (requires VPN)
Node dashboard
GitLab runner dashboard
GitLab CI Pipeline dashboard
Docker monitoring dashboard
K8s cluster summary dashboard
Ceph Cluster dashboard
Elasticsearch dashboard

Allocating ownership to teams

The following are general rules for allocating ownership to teams:

The primary responsibility for a failed pipeline lies with the owner of the first commit to the branch since the last successful run of the pipeline. It is therefore the responsibility of the committer to follow up on the pipeline status after each git push.
For every failing test case, the creator(s) of the test must be involved in order to assign the bug to the appropriate team.
The System Team should be involved in the problem identification in order to understand whether the problem is infrastructure-related (related to a k8s cluster or any layer below it: docker, VM, virtualization, etc.).
For Prometheus alerts, the System Team must provide the analysis of the alert details in order to understand the cause, and give input into assigning it to the right team(s).
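
The first rule above can be sketched in a few lines of Python. This is only an illustration of the rule, not an existing Skampi or GitLab tool: the PipelineRun record and the primary_owner function are hypothetical names, and the sketch assumes the pipeline history of a branch is already available as a list ordered from oldest to newest run.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PipelineRun:
    """One pipeline run on a branch (hypothetical record, not a GitLab API type)."""
    commit_sha: str
    author: str
    succeeded: bool


def primary_owner(runs: Sequence[PipelineRun]) -> Optional[str]:
    """Return the author of the first commit since the last successful run.

    `runs` must be ordered oldest to newest. Returns None when there is
    nothing to triage (empty history, or the newest run succeeded).
    If no run in the history succeeded, the oldest run's author is returned.
    """
    if not runs or runs[-1].succeeded:
        return None
    # Walk backwards to find the index of the last successful run.
    last_ok = -1
    for i in range(len(runs) - 1, -1, -1):
        if runs[i].succeeded:
            last_ok = i
            break
    # The run right after it carries the first commit since that success.
    return runs[last_ok + 1].author


# Example history: the pipeline was green for a1 and has been red since b2.
history = [
    PipelineRun("a1", "alice", True),
    PipelineRun("b2", "bob", False),
    PipelineRun("c3", "carol", False),
]
print(primary_owner(history))  # bob
```

In this example the later commit by carol also ran on a red pipeline, but the primary responsibility stays with bob, whose commit was the first after the last successful run.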

Raising bugs

Bugs are raised following the SKA Bug management guidelines.