Diff: rfc9940v1.txt - rfc9940.txt

	rfc9940v1.txt		rfc9940.txt

	Internet Engineering Task Force (IETF) N. Davis, Ed.		Internet Engineering Task Force (IETF) N. Davis, Ed.
	Request for Comments: 9940 Ciena		Request for Comments: 9940 Ciena
	Category: Informational A. Farrel, Ed.		Category: Informational A. Farrel, Ed.
	ISSN: 2070-1721 Old Dog Consulting		ISSN: 2070-1721 Old Dog Consulting
	T. Graf		T. Graf
	Swisscom		Swisscom
	Q. Wu		Q. Wu

	Huawei
	C. Yu		C. Yu

	Huawei Technologies		Huawei
	February 2026		February 2026

	Some Key Terms for Network Fault and Problem Management		Some Key Terms for Network Fault and Problem Management

	Abstract		Abstract

	This document sets out some terms that are fundamental to a common		This document sets out some terms that are fundamental to a common
	understanding of network fault and problem management within the		understanding of network fault and problem management within the
	IETF.		IETF.


	skipping to change at line 85 ¶		skipping to change at line 84 ¶

	Successful operation of large networks depends on effective network		Successful operation of large networks depends on effective network
	management. This requires a virtuous circle of network control,		management. This requires a virtuous circle of network control,
	network observability, network analytics, network assurance, and back		network observability, network analytics, network assurance, and back
	to network control. Network fault and problem management [RFC6632]		to network control. Network fault and problem management [RFC6632]
	is an important aspect of network management and control solutions.		is an important aspect of network management and control solutions.
	It deals with the detection, reporting, inspection, isolation,		It deals with the detection, reporting, inspection, isolation,
	correlation, and management of events within the network. The		correlation, and management of events within the network. The
	intention of this document is to focus on those events that have a		intention of this document is to focus on those events that have a
	negative effect on the network's ability to forward traffic according		negative effect on the network's ability to forward traffic according

	to expected behavior and so deliver services, the ability to control		to expected behaviors that may reduce the network's ability to
	and operate the network, and other faults that reduce the quality or		deliver services. Such events may also impact the ability to control
	reliability of the delivered service. The concept of fault and		and operate the network. The document also considers other faults
	problem management extends to include actions taken to determine the		that reduce the quality or reliability of the delivered service. The
	causes of problems and to work toward recovery of expected network		concept of fault and problem management extends to include actions
	behavior.		taken to determine the causes of problems and to work toward recovery
			of expected network behavior.

	A number of work efforts within the IETF seek to provide components		A number of work efforts within the IETF seek to provide components
	of a fault management system, such as YANG data models or management		of a fault management system, such as YANG data models or management
	protocols. It is important that a common terminology be used so that		protocols. It is important that a common terminology be used so that
	there is a clear understanding of how the elements of the management		there is a clear understanding of how the elements of the management
	and control solutions fit together and how faults and problems will		and control solutions fit together and how faults and problems will
	be handled.		be handled.

	This document sets out some terms that are fundamental to a common		This document sets out some terms that are fundamental to a common
	understanding of network fault and problem management. While		understanding of network fault and problem management. While

	skipping to change at line 178 ¶		skipping to change at line 178 ¶
	process of collecting operational network data categorized		process of collecting operational network data categorized
	according to the network plane (e.g., Layer 3, Layer 2, and Layer		according to the network plane (e.g., Layer 3, Layer 2, and Layer
	1) from which it was derived. Data collected through the Network		1) from which it was derived. Data collected through the Network
	Telemetry process does not contain any data related to service		Telemetry process does not contain any data related to service
	definitions (i.e., "intent" per Section 3.1 of [RFC9315]).		definitions (i.e., "intent" per Section 3.1 of [RFC9315]).

	Network Monitoring: This is the process of keeping a continuous		Network Monitoring: This is the process of keeping a continuous
	record of functions related to a network topology. It involves		record of functions related to a network topology. It involves
	tracking various aspects such as traffic patterns, device health,		tracking various aspects such as traffic patterns, device health,
	performance metrics, and overall network behavior. This approach		performance metrics, and overall network behavior. This approach

	differentiates network monitoring from resource or device		differentiates Network Monitoring from resource or device
	monitoring, which focuses on individual resources or components		monitoring, which focuses on individual resources or components
	(Section 3.2).		(Section 3.2).

	Network Analytics: This is the process of deriving analytical		Network Analytics: This is the process of deriving analytical
	insights from operational network data. A process could be		insights from operational network data. A process could be
	executed by a piece of software, a system, or a human that		executed by a piece of software, a system, or a human that
	analyzes operational data and outputs new analytical data related		analyzes operational data and outputs new analytical data related
	to the operational data -- for example, a symptom.		to the operational data -- for example, a symptom.

	Network Observability: This is the process of enabling network		Network Observability: This is the process of enabling network
	behavioral assessment through analysis of observed operational		behavioral assessment through analysis of observed operational
	network data (logs, alarms, traces, etc.) with the aim of		network data (logs, alarms, traces, etc.) with the aim of

	detecting symptoms of network behavior, and to identify anomalies		detecting symptoms of network behavior, and identifying anomalies
	and their causes. Network Observability begins with information		and their causes. Network Observability begins with information

	gathered using Network Monitoring tools and that may be further		gathered using Network Monitoring tools. That information may be
	enriched with other operational data. The expected outcome of the		further enriched with other operational data. The expected
	observability processes is identification and analysis of		outcome of the observability processes is identification and
	deviations in observed state versus the expected state of a		analysis of deviations in observed state versus the expected state
	network.		of a network.

	Thus, there is a cascaded sequence where the following relationships		Thus, there is a cascaded sequence where the following relationships
	apply:		apply:

	* Network Telemetry is the process of collecting operational data		* Network Telemetry is the process of collecting operational data
	from a network.		from a network.

	* Network Monitoring is the process of creating/keeping a record of		* Network Monitoring is the process of creating/keeping a record of
	data gathered in Network Telemetry.		data gathered in Network Telemetry.


	skipping to change at line 231 ¶		skipping to change at line 231 ¶
	Resource: An element of a network system.		Resource: An element of a network system.

	* Resource is a recursive concept so that a Resource may be a		* Resource is a recursive concept so that a Resource may be a
	collection of other Resources (for example, a network node		collection of other Resources (for example, a network node
	comprises a collection of network interfaces).		comprises a collection of network interfaces).

	Characteristic: Observable or measurable aspect or behavior		Characteristic: Observable or measurable aspect or behavior
	associated with a Resource.		associated with a Resource.

	* A Characteristic may be considered to be built on facts (see		* A Characteristic may be considered to be built on facts (see

	'Value', below) and the contexts and descriptors that identify		"Value", below) and the contexts and descriptors that identify
	and give meaning to the facts.		and give meaning to the facts.


	* The term "Metric" [RFC9417] is another word for a measurable		* The term "Metric" (see "metric" in [RFC9417]) is another word
	Characteristic which may also be thought of as analogous to a		for a measurable Characteristic, which may also be thought of
	'variable'.		as analogous to a "variable".

	Value: A measure of a Characteristic associated with a Resource. It		Value: A measure of a Characteristic associated with a Resource. It
	may be in the form of a categorization (e.g., high or low), an		may be in the form of a categorization (e.g., high or low), an
	integer (e.g., a count or gauge), or a reading of a continuous		integer (e.g., a count or gauge), or a reading of a continuous
	variable (e.g., an analog measurement), etc.		variable (e.g., an analog measurement), etc.

	Change: In the context of Network Monitoring, the variation in the		Change: In the context of Network Monitoring, the variation in the
	Value of a Characteristic associated with a Resource. A Change		Value of a Characteristic associated with a Resource. A Change
	may arise over a period of time.		may arise over a period of time.


	skipping to change at line 284 ¶		skipping to change at line 284 ¶
	may determine the State of the router, such as shortage of memory.		may determine the State of the router, such as shortage of memory.

	* While a State may be observed at a specific moment in time, it		* While a State may be observed at a specific moment in time, it
	is actually determined by summarizing measurement over time in		is actually determined by summarizing measurement over time in
	a process sometimes called State compression.		a process sometimes called State compression.

	* It may be helpful to qualify this as "Resource State" to make		* It may be helpful to qualify this as "Resource State" to make
	clear the distinction between this and other uses of "state"		clear the distinction between this and other uses of "state"
	such as "protocol state".		such as "protocol state".


	* This term may be contrasted with "Operational State" as used in		* This term may be contrasted with "operational state" as used in
	[RFC8342]. For example, the state of a link might be up/down/		[RFC8342]. For example, the state of a link might be up/down/
	degraded, but the operational state of the link would include a		degraded, but the operational state of the link would include a
	collection of Values of Characteristics of the link.		collection of Values of Characteristics of the link.

	Detect (hence Detected, Detection): To notice the presence of		Detect (hence Detected, Detection): To notice the presence of

	something (State, Change, Event, activity, etc.) and hence also to		something (State, Change, Event, activity, etc.)
	notice a Change (from the perspective of an observer such as a
	monitoring system).		* Also to notice a Change (from the perspective of an observer
			such as a monitoring system).

	Relevance: Consideration of an Event, State, or Value (through the		Relevance: Consideration of an Event, State, or Value (through the

	application of policy, relative to a specific perspective, intent,		application of policy, relative to a specific perspective or
	and in relation to other Events, States, and Values) to determine		intent, and in relation to other Events, States, and Values) to
	whether it is of note to the system that controls or manages the		determine whether it is of note to the system that controls or
	network. Note, for example, that not all Changes are Relevant.		manages the network. Note, for example, that not all Changes are
			Relevant.

	* This term may also be used as "Relevant Event", "Relevant		* This term may also be used as "Relevant Event", "Relevant
	State", or "Relevant Value".		State", or "Relevant Value".

	Occurrence: A Relevant Event or a particular Relevant Change.		Occurrence: A Relevant Event or a particular Relevant Change.

	* An Occurrence may be an aggregation or abstraction of multiple		* An Occurrence may be an aggregation or abstraction of multiple
	fine-grained Events or Changes.		fine-grained Events or Changes.

	* An Occurrence may occur at any macro or micro scale because		* An Occurrence may occur at any macro or micro scale because

	Resources are a recursive concept, and may be perceived,		Resources are a recursive concept. An Occurrence may be
	depending on the scope of observation (i.e., according to the		perceived, depending on the scope of observation (i.e.,
	level of Resource recursion that is examined). That is,		according to the level of Resource recursion that is examined).
	Occurrences, themselves, are a recursive concept.		That is, Occurrences, themselves, are a recursive concept.

	Fault: An Occurrence (i.e., an Event or a Change) that is not		Fault: An Occurrence (i.e., an Event or a Change) that is not
	desired/required (as it may be indicative of a current or future		desired/required (as it may be indicative of a current or future
	undesired State). Thus, a Fault happens at a moment in time. A		undesired State). Thus, a Fault happens at a moment in time. A
	Fault can potentially be associated with a Cause. See [RFC8632]		Fault can potentially be associated with a Cause. See [RFC8632]
	for a more detailed discussion of network faults.		for a more detailed discussion of network faults.

	* Note that there is a distinction between a Fault and a Problem		* Note that there is a distinction between a Fault and a Problem
	that depends on context. For example, in a connectivity		that depends on context. For example, in a connectivity
	service where redundancy is present, a link down is a Problem,		service where redundancy is present, a link down is a Problem,

	skipping to change at line 461 ¶		skipping to change at line 463 ¶
	Change at a time Change over time Change over time		Change at a time Change over time Change over time

	Figure 2: Characteristics and Changes		Figure 2: Characteristics and Changes

	Figure 3 shows the workflow progress for Events. As noted above, an		Figure 3 shows the workflow progress for Events. As noted above, an
	Event is a Change in the Value of a Characteristic at a time. The		Event is a Change in the Value of a Characteristic at a time. The
	Event may be evaluated (considering policy, relative to a specific		Event may be evaluated (considering policy, relative to a specific
	perspective, with a view to intent, and in relation to other Events,		perspective, with a view to intent, and in relation to other Events,
	States, and Values) to determine if it is an Occurrence and possibly		States, and Values) to determine if it is an Occurrence and possibly
	to indicate a Change of State. An Occurrence may be undesirable (a		to indicate a Change of State. An Occurrence may be undesirable (a

	Fault) and that can cause an Alert to be generated, may be evidence		Fault), which might cause an Alert to be generated. Or, an
	of a Problem and could directly indicate a Cause. In some cases, an		Occurrence may be evidence of a Problem and could directly indicate a
	Alert may give rise to an Alarm highlighting the potential or actual		Cause. In some cases, an Alert may give rise to an Alarm
	presence of a Problem.		highlighting the potential or actual presence of a Problem.

	Alert - - - > Alarm		Alert - - - > Alarm
	^		^
	|		|
	| -----> Cause		| -----> Cause
	| |		| |
	|----------> Problem		|----------> Problem
	|		|
	|		|
	Fault		Fault

	skipping to change at line 500 ¶		skipping to change at line 502 ¶
	progress for States. As shown in Figure 2, Change noted at a		progress for States. As shown in Figure 2, Change noted at a
	particular time gives rise to State. The State may be deemed to have		particular time gives rise to State. The State may be deemed to have
	Relevance considering policy, relative to a specific perspective,		Relevance considering policy, relative to a specific perspective,
	with a view to intent, and in relation to other Events, States, and		with a view to intent, and in relation to other Events, States, and
	Values. A Relevant State may be deemed a Problem, or it may indicate		Values. A Relevant State may be deemed a Problem, or it may indicate
	a Problem or potential Problem.		a Problem or potential Problem.

	Problems may be considered based on Symptoms and may map directly or		Problems may be considered based on Symptoms and may map directly or
	indirectly to Causes. An Incident results from one or more Problems.		indirectly to Causes. An Incident results from one or more Problems.
	An Alarm may be raised as the result of a Problem, and the transition		An Alarm may be raised as the result of a Problem, and the transition

	to an Alarmed state may give rise to an Alert.		to an alarmed State may give rise to an Alert.

	Alarm - - -> Alert		Alarm - - -> Alert
	^		^
	| ------> Incident		| ------> Incident
	| |		| |
	| | ---> Cause		| | ---> Cause
	| | |		| | |
	Problem---------> Symptom		Problem---------> Symptom
	^		^
	|		|

	skipping to change at line 560 ¶		skipping to change at line 562 ¶
	Events and States (and the Alerts that they might give rise to) must		Events and States (and the Alerts that they might give rise to) must
	be treated with caution to dampen any "flapping" (so that consistent		be treated with caution to dampen any "flapping" (so that consistent
	States may be observed) and to avoid overwhelming management		States may be observed) and to avoid overwhelming management
	processes or systems. Analog Values may be read or notified from the		processes or systems. Analog Values may be read or notified from the
	Resource and could transition a threshold, be deemed Relevant Values,		Resource and could transition a threshold, be deemed Relevant Values,
	or be evaluated over time. Events may be counted, and the Count may		or be evaluated over time. Events may be counted, and the Count may
	cross a threshold or reach a Relevant Value.		cross a threshold or reach a Relevant Value.

	The Threshold Process may be implementation specific and subject to		The Threshold Process may be implementation specific and subject to
	policies. When a threshold is crossed and any other conditions are		policies. When a threshold is crossed and any other conditions are

	matched, an Event may be determined and may be treated like any other		matched, an Event may be determined and treated like any other Event.
	Event.

	Occurrence		Occurrence
	^		^
	|		|
	|---------------------> State		|---------------------> State
	|		|
	| ------- Relevance		| ------- Relevance
	|------>| Count |-----------------------------> Value		|------>| Count |-----------------------------> Value
	| ------- | ^		| ------- | ^
	| | | |		| | | |

	skipping to change at line 689 ¶		skipping to change at line 690 ¶

	[RFC9417] Claise, B., Quilbeuf, J., Lopez, D., Voyer, D., and T.		[RFC9417] Claise, B., Quilbeuf, J., Lopez, D., Voyer, D., and T.
	Arumugam, "Service Assurance for Intent-Based Networking		Arumugam, "Service Assurance for Intent-Based Networking
	Architecture", RFC 9417, DOI 10.17487/RFC9417, July 2023,		Architecture", RFC 9417, DOI 10.17487/RFC9417, July 2023,
	<https://www.rfc-editor.org/info/rfc9417>.		<https://www.rfc-editor.org/info/rfc9417>.

	Acknowledgments		Acknowledgments

	The authors would like to thank Med Boucadair, Wanting Du, Joe		The authors would like to thank Med Boucadair, Wanting Du, Joe
	Clarke, Javier Antich, Benoit Claise, Christopher Janz, Sherif		Clarke, Javier Antich, Benoit Claise, Christopher Janz, Sherif

	Mostafa, Kristian Larsson, Dirk Hugo, Carsten Bormann, Hilarie Orman,		Mostafa, Kristian Larsson, Dirk Von Hugo, Carsten Bormann, Hilarie
	Stewart Bryant, Bo Wu, Paul Kyzivat, Jouni Korhonen, Reshad Rahman,		Orman, Stewart Bryant, Bo Wu, Paul Kyzivat, Jouni Korhonen, Reshad
	Rob Wilton, Mahesh Jethanandani, Tim Bray, Paul Aitken, and Deb		Rahman, Rob Wilton, Mahesh Jethanandani, Tim Bray, Paul Aitken, and
	Cooley for their helpful comments.		Deb Cooley for their helpful comments.

	Special thanks to the team that met at a side meeting at IETF 120 to		Special thanks to the team that met at a side meeting at IETF 120 to
	discuss some of the thorny issues:		discuss some of the thorny issues:

	* Benoit Claise		* Benoit Claise
	* Watson Ladd		* Watson Ladd
	* Brad Peters		* Brad Peters
	* Bo Wu		* Bo Wu
	* Georgios Karagiannis		* Georgios Karagiannis
	* Olga Havel		* Olga Havel

	skipping to change at line 740 ¶		skipping to change at line 741 ¶

	Qin Wu		Qin Wu
	Huawei		Huawei
	101 Software Avenue, Yuhua District		101 Software Avenue, Yuhua District
	Nanjing		Nanjing
	Jiangsu, 210012		Jiangsu, 210012
	China		China
	Email: bill.wu@huawei.com		Email: bill.wu@huawei.com

	Chaode Yu		Chaode Yu

	Huawei Technologies		Huawei
	Email: yuchaode@huawei.com		Email: yuchaode@huawei.com

End of changes. 17 change blocks.
	43 lines changed or deleted		44 lines changed or added
This html diff was produced by rfcdiff 1.48.