rfc9940v1.txt   rfc9940.txt 
Internet Engineering Task Force (IETF) N. Davis, Ed. Internet Engineering Task Force (IETF) N. Davis, Ed.
Request for Comments: 9940 Ciena Request for Comments: 9940 Ciena
Category: Informational A. Farrel, Ed. Category: Informational A. Farrel, Ed.
ISSN: 2070-1721 Old Dog Consulting ISSN: 2070-1721 Old Dog Consulting
T. Graf T. Graf
Swisscom Swisscom
Q. Wu Q. Wu
Huawei
C. Yu C. Yu
Huawei Technologies Huawei
February 2026 February 2026
Some Key Terms for Network Fault and Problem Management Some Key Terms for Network Fault and Problem Management
Abstract Abstract
This document sets out some terms that are fundamental to a common This document sets out some terms that are fundamental to a common
understanding of network fault and problem management within the understanding of network fault and problem management within the
IETF. IETF.
skipping to change at line 85 skipping to change at line 84
Successful operation of large networks depends on effective network Successful operation of large networks depends on effective network
management. This requires a virtuous circle of network control, management. This requires a virtuous circle of network control,
network observability, network analytics, network assurance, and back network observability, network analytics, network assurance, and back
to network control. Network fault and problem management [RFC6632] to network control. Network fault and problem management [RFC6632]
is an important aspect of network management and control solutions. is an important aspect of network management and control solutions.
It deals with the detection, reporting, inspection, isolation, It deals with the detection, reporting, inspection, isolation,
correlation, and management of events within the network. The correlation, and management of events within the network. The
intention of this document is to focus on those events that have a intention of this document is to focus on those events that have a
negative effect on the network's ability to forward traffic according negative effect on the network's ability to forward traffic according
to expected behavior and so deliver services, the ability to control to expected behaviors that may reduce the network's ability to
and operate the network, and other faults that reduce the quality or deliver services. Such events may also impact the ability to control
reliability of the delivered service. The concept of fault and and operate the network. The document also considers other faults
problem management extends to include actions taken to determine the that reduce the quality or reliability of the delivered service. The
causes of problems and to work toward recovery of expected network concept of fault and problem management extends to include actions
behavior. taken to determine the causes of problems and to work toward recovery
of expected network behavior.
A number of work efforts within the IETF seek to provide components A number of work efforts within the IETF seek to provide components
of a fault management system, such as YANG data models or management of a fault management system, such as YANG data models or management
protocols. It is important that a common terminology be used so that protocols. It is important that a common terminology be used so that
there is a clear understanding of how the elements of the management there is a clear understanding of how the elements of the management
and control solutions fit together and how faults and problems will and control solutions fit together and how faults and problems will
be handled. be handled.
This document sets out some terms that are fundamental to a common This document sets out some terms that are fundamental to a common
understanding of network fault and problem management. While understanding of network fault and problem management. While
skipping to change at line 178 skipping to change at line 178
process of collecting operational network data categorized process of collecting operational network data categorized
according to the network plane (e.g., Layer 3, Layer 2, and Layer according to the network plane (e.g., Layer 3, Layer 2, and Layer
1) from which it was derived. Data collected through the Network 1) from which it was derived. Data collected through the Network
Telemetry process does not contain any data related to service Telemetry process does not contain any data related to service
definitions (i.e., "intent" per Section 3.1 of [RFC9315]). definitions (i.e., "intent" per Section 3.1 of [RFC9315]).
Network Monitoring: This is the process of keeping a continuous Network Monitoring: This is the process of keeping a continuous
record of functions related to a network topology. It involves record of functions related to a network topology. It involves
tracking various aspects such as traffic patterns, device health, tracking various aspects such as traffic patterns, device health,
performance metrics, and overall network behavior. This approach performance metrics, and overall network behavior. This approach
differentiates network monitoring from resource or device differentiates Network Monitoring from resource or device
monitoring, which focuses on individual resources or components monitoring, which focuses on individual resources or components
(Section 3.2). (Section 3.2).
Network Analytics: This is the process of deriving analytical Network Analytics: This is the process of deriving analytical
insights from operational network data. A process could be insights from operational network data. A process could be
executed by a piece of software, a system, or a human that executed by a piece of software, a system, or a human that
analyzes operational data and outputs new analytical data related analyzes operational data and outputs new analytical data related
to the operational data -- for example, a symptom. to the operational data -- for example, a symptom.
Network Observability: This is the process of enabling network Network Observability: This is the process of enabling network
behavioral assessment through analysis of observed operational behavioral assessment through analysis of observed operational
network data (logs, alarms, traces, etc.) with the aim of network data (logs, alarms, traces, etc.) with the aim of
detecting symptoms of network behavior, and to identify anomalies detecting symptoms of network behavior, and identifying anomalies
and their causes. Network Observability begins with information and their causes. Network Observability begins with information
gathered using Network Monitoring tools and that may be further gathered using Network Monitoring tools. That information may be
enriched with other operational data. The expected outcome of the further enriched with other operational data. The expected
observability processes is identification and analysis of outcome of the observability processes is identification and
deviations in observed state versus the expected state of a analysis of deviations in observed state versus the expected state
network. of a network.
Thus, there is a cascaded sequence where the following relationships Thus, there is a cascaded sequence where the following relationships
apply: apply:
* Network Telemetry is the process of collecting operational data * Network Telemetry is the process of collecting operational data
from a network. from a network.
* Network Monitoring is the process of creating/keeping a record of * Network Monitoring is the process of creating/keeping a record of
data gathered in Network Telemetry. data gathered in Network Telemetry.
skipping to change at line 231 skipping to change at line 231
Resource: An element of a network system. Resource: An element of a network system.
* Resource is a recursive concept so that a Resource may be a * Resource is a recursive concept so that a Resource may be a
collection of other Resources (for example, a network node collection of other Resources (for example, a network node
comprises a collection of network interfaces). comprises a collection of network interfaces).
Characteristic: Observable or measurable aspect or behavior Characteristic: Observable or measurable aspect or behavior
associated with a Resource. associated with a Resource.
* A Characteristic may be considered to be built on facts (see * A Characteristic may be considered to be built on facts (see
'Value', below) and the contexts and descriptors that identify "Value", below) and the contexts and descriptors that identify
and give meaning to the facts. and give meaning to the facts.
* The term "Metric" [RFC9417] is another word for a measurable * The term "Metric" (see "metric" in [RFC9417]) is another word
Characteristic which may also be thought of as analogous to a for a measurable Characteristic, which may also be thought of
'variable'. as analogous to a "variable".
Value: A measure of a Characteristic associated with a Resource. It Value: A measure of a Characteristic associated with a Resource. It
may be in the form of a categorization (e.g., high or low), an may be in the form of a categorization (e.g., high or low), an
integer (e.g., a count or gauge), or a reading of a continuous integer (e.g., a count or gauge), or a reading of a continuous
variable (e.g., an analog measurement), etc. variable (e.g., an analog measurement), etc.
Change: In the context of Network Monitoring, the variation in the Change: In the context of Network Monitoring, the variation in the
Value of a Characteristic associated with a Resource. A Change Value of a Characteristic associated with a Resource. A Change
may arise over a period of time. may arise over a period of time.
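
The relationship between Resource, Characteristic, Value, and Change can be sketched as a small data model. This is an illustration only, not part of either compared version: the names Resource, Observation, walk, and detect_change, and the sample names "router1", "eth0", and "oper-status", are hypothetical and chosen for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional, Tuple, Union

# A Value may be a categorization (str), an integer, or a continuous reading.
ValueType = Union[str, int, float]


@dataclass
class Resource:
    """An element of a network system; may contain other Resources."""
    name: str
    children: list[Resource] = field(default_factory=list)

    def walk(self) -> Iterator[Resource]:
        """Yield this Resource, then every nested Resource, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass(frozen=True)
class Observation:
    """One Value of one Characteristic of one Resource at one time."""
    resource: str
    characteristic: str
    value: ValueType
    timestamp: float


def detect_change(
    previous: Observation, current: Observation
) -> Optional[Tuple[ValueType, ValueType]]:
    """Return (old, new) if the Value varied between observations, else None."""
    same_subject = (previous.resource, previous.characteristic) == (
        current.resource,
        current.characteristic,
    )
    if not same_subject:
        raise ValueError("observations describe different Characteristics")
    if previous.value == current.value:
        return None
    return (previous.value, current.value)
```

For example, a node built as Resource("router1", [Resource("eth0"), Resource("eth1")]) can be traversed with walk() to visit the node and both interfaces, which mirrors the recursive definition of Resource.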
skipping to change at line 284 skipping to change at line 284
may determine the State of the router, such as shortage of memory. may determine the State of the router, such as shortage of memory.
* While a State may be observed at a specific moment in time, it * While a State may be observed at a specific moment in time, it
is actually determined by summarizing measurement over time in is actually determined by summarizing measurement over time in
a process sometimes called State compression. a process sometimes called State compression.
* It may be helpful to qualify this as "Resource State" to make * It may be helpful to qualify this as "Resource State" to make
clear the distinction between this and other uses of "state" clear the distinction between this and other uses of "state"
such as "protocol state". such as "protocol state".
* This term may be contrasted with "Operational State" as used in * This term may be contrasted with "operational state" as used in
[RFC8342]. For example, the state of a link might be up/down/ [RFC8342]. For example, the state of a link might be up/down/
degraded, but the operational state of the link would include a degraded, but the operational state of the link would include a
collection of Values of Characteristics of the link. collection of Values of Characteristics of the link.
Detect (hence Detected, Detection): To notice the presence of Detect (hence Detected, Detection): To notice the presence of
something (State, Change, Event, activity, etc.) and hence also to something (State, Change, Event, activity, etc.)
notice a Change (from the perspective of an observer such as a
monitoring system). * Also to notice a Change (from the perspective of an observer
such as a monitoring system).
Relevance: Consideration of an Event, State, or Value (through the Relevance: Consideration of an Event, State, or Value (through the
application of policy, relative to a specific perspective, intent, application of policy, relative to a specific perspective or
and in relation to other Events, States, and Values) to determine intent, and in relation to other Events, States, and Values) to
whether it is of note to the system that controls or manages the determine whether it is of note to the system that controls or
network. Note, for example, that not all Changes are Relevant. manages the network. Note, for example, that not all Changes are
Relevant.
* This term may also be used as "Relevant Event", "Relevant * This term may also be used as "Relevant Event", "Relevant
State", or "Relevant Value". State", or "Relevant Value".
Occurrence: A Relevant Event or a particular Relevant Change. Occurrence: A Relevant Event or a particular Relevant Change.
* An Occurrence may be an aggregation or abstraction of multiple * An Occurrence may be an aggregation or abstraction of multiple
fine-grained Events or Changes. fine-grained Events or Changes.
* An Occurrence may occur at any macro or micro scale because * An Occurrence may occur at any macro or micro scale because
Resources are a recursive concept, and may be perceived, Resources are a recursive concept. An Occurrence may be
depending on the scope of observation (i.e., according to the perceived, depending on the scope of observation (i.e.,
level of Resource recursion that is examined). That is, according to the level of Resource recursion that is examined).
Occurrences, themselves, are a recursive concept. That is, Occurrences, themselves, are a recursive concept.
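
As a rough illustration of Relevance and Occurrence, the sketch below treats policy as a predicate applied to Events, so that only Relevant Events are kept as Occurrences. The Event fields, the Policy type, and the example policy are assumptions made for this sketch; real policy would also weigh perspective, intent, and other Events, States, and Values.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Event:
    """A Change in the Value of a Characteristic at a time (hypothetical fields)."""
    resource: str
    characteristic: str
    old_value: object
    new_value: object
    time: float


# A policy decides whether an Event is of note to the managing system.
Policy = Callable[[Event], bool]


def occurrences(events: Iterable[Event], policy: Policy) -> List[Event]:
    """Keep only the Events that the policy deems Relevant."""
    return [event for event in events if policy(event)]


def link_status_policy(event: Event) -> bool:
    """Example policy: only operational status changes are Relevant."""
    return event.characteristic == "oper-status"
```

Passing a different predicate changes which Events become Occurrences, which reflects the note that not all Changes are Relevant.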
Fault: An Occurrence (i.e., an Event or a Change) that is not Fault: An Occurrence (i.e., an Event or a Change) that is not
desired/required (as it may be indicative of a current or future desired/required (as it may be indicative of a current or future
undesired State). Thus, a Fault happens at a moment in time. A undesired State). Thus, a Fault happens at a moment in time. A
Fault can potentially be associated with a Cause. See [RFC8632] Fault can potentially be associated with a Cause. See [RFC8632]
for a more detailed discussion of network faults. for a more detailed discussion of network faults.
* Note that there is a distinction between a Fault and a Problem * Note that there is a distinction between a Fault and a Problem
that depends on context. For example, in a connectivity that depends on context. For example, in a connectivity
service where redundancy is present, a link down is a Problem, service where redundancy is present, a link down is a Problem,
skipping to change at line 461 skipping to change at line 463
Change at a time Change over time Change over time Change at a time Change over time Change over time
Figure 2: Characteristics and Changes Figure 2: Characteristics and Changes
Figure 3 shows the workflow progress for Events. As noted above, an Figure 3 shows the workflow progress for Events. As noted above, an
Event is a Change in the Value of a Characteristic at a time. The Event is a Change in the Value of a Characteristic at a time. The
Event may be evaluated (considering policy, relative to a specific Event may be evaluated (considering policy, relative to a specific
perspective, with a view to intent, and in relation to other Events, perspective, with a view to intent, and in relation to other Events,
States, and Values) to determine if it is an Occurrence and possibly States, and Values) to determine if it is an Occurrence and possibly
to indicate a Change of State. An Occurrence may be undesirable (a to indicate a Change of State. An Occurrence may be undesirable (a
Fault) and that can cause an Alert to be generated, may be evidence Fault), which might cause an Alert to be generated. Or, an
of a Problem and could directly indicate a Cause. In some cases, an Occurrence may be evidence of a Problem and could directly indicate a
Alert may give rise to an Alarm highlighting the potential or actual Cause. In some cases, an Alert may give rise to an Alarm
presence of a Problem. highlighting the potential or actual presence of a Problem.
Alert - - - > Alarm Alert - - - > Alarm
^ ^
| |
| -----> Cause | -----> Cause
| | | |
|----------> Problem |----------> Problem
| |
| |
Fault Fault
skipping to change at line 500 skipping to change at line 502
progress for States. As shown in Figure 2, Change noted at a progress for States. As shown in Figure 2, Change noted at a
particular time gives rise to State. The State may be deemed to have particular time gives rise to State. The State may be deemed to have
Relevance considering policy, relative to a specific perspective, Relevance considering policy, relative to a specific perspective,
with a view to intent, and in relation to other Events, States, and with a view to intent, and in relation to other Events, States, and
Values. A Relevant State may be deemed a Problem, or it may indicate Values. A Relevant State may be deemed a Problem, or it may indicate
a Problem or potential Problem. a Problem or potential Problem.
Problems may be considered based on Symptoms and may map directly or Problems may be considered based on Symptoms and may map directly or
indirectly to Causes. An Incident results from one or more Problems. indirectly to Causes. An Incident results from one or more Problems.
An Alarm may be raised as the result of a Problem, and the transition An Alarm may be raised as the result of a Problem, and the transition
to an Alarmed state may give rise to an Alert. to an alarmed State may give rise to an Alert.
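
The last sentence above can be sketched as a small state tracker: an Alert is produced only on the transition into the alarmed State, not while the Problem persists. The AlarmTracker class, its fields, and the Alert text are hypothetical; leaving the clearing transition silent is a choice made for this sketch, not something the text requires.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlarmTracker:
    """Tracks whether an Alarm is raised and records Alerts on transitions."""
    alarmed: bool = False
    alerts: List[str] = field(default_factory=list)

    def update(self, problem_present: bool, resource: str) -> None:
        """Apply the latest Problem evaluation for a Resource."""
        if problem_present and not self.alarmed:
            # Transition into the alarmed State gives rise to an Alert.
            self.alarmed = True
            self.alerts.append(f"alarm raised on {resource}")
        elif not problem_present and self.alarmed:
            # The Problem is gone; clear the Alarm without a new Alert.
            self.alarmed = False
```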
Alarm - - -> Alert Alarm - - -> Alert
^ ^
| ------> Incident | ------> Incident
| | | |
| | ---> Cause | | ---> Cause
| | | | | |
Problem---------> Symptom Problem---------> Symptom
^ ^
| |
skipping to change at line 560 skipping to change at line 562
Events and States (and the Alerts that they might give rise to) must Events and States (and the Alerts that they might give rise to) must
be treated with caution to dampen any "flapping" (so that consistent be treated with caution to dampen any "flapping" (so that consistent
States may be observed) and to avoid overwhelming management States may be observed) and to avoid overwhelming management
processes or systems. Analog Values may be read or notified from the processes or systems. Analog Values may be read or notified from the
Resource and could transition a threshold, be deemed Relevant Values, Resource and could transition a threshold, be deemed Relevant Values,
or be evaluated over time. Events may be counted, and the Count may or be evaluated over time. Events may be counted, and the Count may
cross a threshold or reach a Relevant Value. cross a threshold or reach a Relevant Value.
The Threshold Process may be implementation specific and subject to The Threshold Process may be implementation specific and subject to
policies. When a threshold is crossed and any other conditions are policies. When a threshold is crossed and any other conditions are
matched, an Event may be determined and may be treated like any other matched, an Event may be determined and treated like any other Event.
Event.
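
One possible Threshold Process is sketched below. It uses hysteresis (separate raise and clear levels) to dampen flapping; hysteresis is an assumption of this sketch, since the text leaves the Threshold Process implementation specific. The function name, parameter names, and Event labels are hypothetical.

```python
from typing import Iterable, List, Tuple


def threshold_events(
    samples: Iterable[Tuple[float, float]],
    raise_at: float,
    clear_at: float,
) -> List[Tuple[float, str]]:
    """Turn (time, value) samples into threshold Events with hysteresis.

    An Event is determined when the value reaches raise_at, and again
    when it later falls to clear_at or below. Values between the two
    levels produce no Event, which dampens flapping.
    """
    if clear_at > raise_at:
        raise ValueError("clear_at must not exceed raise_at")
    events: List[Tuple[float, str]] = []
    above = False
    for time, value in samples:
        if not above and value >= raise_at:
            above = True
            events.append((time, "threshold-crossed"))
        elif above and value <= clear_at:
            above = False
            events.append((time, "threshold-cleared"))
    return events
```

With raise_at=90 and clear_at=80, a reading that dips from 91 to 89 and returns to 92 produces no additional Event, because it never falls to the clear level.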
Occurrence Occurrence
^ ^
| |
|---------------------> State |---------------------> State
| |
| ------- Relevance | ------- Relevance
|------>| Count |-----------------------------> Value |------>| Count |-----------------------------> Value
| ------- | ^ | ------- | ^
| | | | | | | |
skipping to change at line 689 skipping to change at line 690
[RFC9417] Claise, B., Quilbeuf, J., Lopez, D., Voyer, D., and T. [RFC9417] Claise, B., Quilbeuf, J., Lopez, D., Voyer, D., and T.
Arumugam, "Service Assurance for Intent-Based Networking Arumugam, "Service Assurance for Intent-Based Networking
Architecture", RFC 9417, DOI 10.17487/RFC9417, July 2023, Architecture", RFC 9417, DOI 10.17487/RFC9417, July 2023,
<https://www.rfc-editor.org/info/rfc9417>. <https://www.rfc-editor.org/info/rfc9417>.
Acknowledgments Acknowledgments
The authors would like to thank Med Boucadair, Wanting Du, Joe The authors would like to thank Med Boucadair, Wanting Du, Joe
Clarke, Javier Antich, Benoit Claise, Christopher Janz, Sherif Clarke, Javier Antich, Benoit Claise, Christopher Janz, Sherif
Mostafa, Kristian Larsson, Dirk Hugo, Carsten Bormann, Hilarie Orman, Mostafa, Kristian Larsson, Dirk Von Hugo, Carsten Bormann, Hilarie
Stewart Bryant, Bo Wu, Paul Kyzivat, Jouni Korhonen, Reshad Rahman, Orman, Stewart Bryant, Bo Wu, Paul Kyzivat, Jouni Korhonen, Reshad
Rob Wilton, Mahesh Jethanandani, Tim Bray, Paul Aitken, and Deb Rahman, Rob Wilton, Mahesh Jethanandani, Tim Bray, Paul Aitken, and
Cooley for their helpful comments. Deb Cooley for their helpful comments.
Special thanks to the team that met at a side meeting at IETF 120 to Special thanks to the team that met at a side meeting at IETF 120 to
discuss some of the thorny issues: discuss some of the thorny issues:
* Benoit Claise * Benoit Claise
* Watson Ladd * Watson Ladd
* Brad Peters * Brad Peters
* Bo Wu * Bo Wu
* Georgios Karagiannis * Georgios Karagiannis
* Olga Havel * Olga Havel
skipping to change at line 740 skipping to change at line 741
Qin Wu Qin Wu
Huawei Huawei
101 Software Avenue, Yuhua District 101 Software Avenue, Yuhua District
Nanjing Nanjing
Jiangsu, 210012 Jiangsu, 210012
China China
Email: bill.wu@huawei.com Email: bill.wu@huawei.com
Chaode Yu Chaode Yu
Huawei Technologies Huawei
Email: yuchaode@huawei.com Email: yuchaode@huawei.com
 End of changes. 17 change blocks. 
43 lines changed or deleted 44 lines changed or added
