Technical Articles series: Monitoring in Swarmchestrate - the story so far


Monitoring in the Swarmchestrate project is undertaken by the Event Management System (EMS). The EMS is a resilient event propagation and processing system that detects Service Level Agreement (SLA) or Service Level Objective (SLO) violations concerning the deployed application components. It collects metrics in real time from the applications deployed in the Swarmchestrate universe, as well as from the infrastructure on which those applications run. By processing this monitoring data, the EMS can issue alerts when violations of the defined resource requirements are detected, and adaptations can then be enacted to maintain the required Quality of Service (QoS) level.

Monitoring in Swarmchestrate (© Swarmchestrate consortium 2024-2026)

As mentioned in the brief introductory article on orchestration, when an application owner submits the resource descriptions and requirements for an application, the Swarmchestrate platform locates and assigns the most suitable resources for the microservices comprising that application, which in turn form a swarm. After the swarm’s formation, the main building blocks of the EMS are deployed on each swarm: an Event Processing Manager (EPM) and several Event Processing Agents (EPAs). EPAs accompany the application component instances, either on the same hosting resource or on a nearby one, depending on computational capacity. Together, the swarm entities form an Event Processing Network (EPN) capable of aggregating high-frequency real-time monitoring metrics.

The EPAs can monitor any metrics that the application owner specifies using the Topology and Orchestration Specification for Cloud Applications (TOSCA). These agents deploy monitoring probes, powered by Netdata, that collect raw data in real time. The collected raw data is propagated through the network of EPAs using a pub/sub protocol; the EPAs in turn add Complex Event Processing (CEP) capacity, and the outcomes are aggregated at the EPM level as composite metrics.
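To make the pub/sub propagation step more concrete, here is a minimal in-process sketch in Python. It is not the EMS implementation: the `Broker` class, the topic name `raw/cpu_usage`, the agent identifier `epa-1` and the event fields are all hypothetical, and a real deployment would rely on an actual messaging layer rather than function calls.

```python
from typing import Callable, Dict, List

# Hypothetical event shape: {"metric": str, "value": float, "source": str}
Event = dict


class Broker:
    """Toy in-process pub/sub broker, a stand-in for a real messaging layer."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = {}

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        # Register a handler that will be called for every event on this topic.
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Deliver the event to every handler registered for this topic;
        # topics without subscribers are silently ignored.
        for handler in self._subscribers.get(topic, []):
            handler(event)


broker = Broker()
received: List[Event] = []

# A downstream agent (or the manager) subscribes to a raw metric topic.
broker.subscribe("raw/cpu_usage", received.append)

# An agent publishes a reading collected by its monitoring probe.
broker.publish(
    "raw/cpu_usage",
    {"metric": "cpu_usage", "value": 72.5, "source": "epa-1"},
)
```

The point of the pattern is decoupling: the publishing agent does not need to know which agents, or which manager, consume its readings.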

The EPM is responsible for gathering raw and composite metrics from the agents of the swarm and processing them. As an outcome of this processing, the EPM can issue alerts that trigger application reconfigurations. The EPM of each swarm can also communicate with the EPMs of other swarms, enabling cross-swarm data exchange.
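As a rough illustration of the manager's role, the sketch below keeps the latest reading reported by each agent, derives a swarm-wide average from them, and records an alert when that average exceeds a limit. The class name, the `threshold` parameter, the agent identifiers and the alert text are assumptions made for this example only; the article does not describe the EPM's actual interfaces.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class EventProcessingManager:
    """Hypothetical manager sketch: tracks the latest reading per agent."""

    threshold: float  # assumed upper limit for the swarm-wide average
    latest: Dict[str, float] = field(default_factory=dict)
    alerts: List[str] = field(default_factory=list)

    def on_reading(self, agent_id: str, value: float) -> float:
        # Store the newest value for this agent, replacing any older one.
        self.latest[agent_id] = value
        # Composite metric: average of the latest readings across all agents.
        composite = mean(list(self.latest.values()))
        if composite > self.threshold:
            self.alerts.append(
                f"composite {composite:.1f} exceeds {self.threshold:.1f}"
            )
        return composite


epm = EventProcessingManager(threshold=80.0)
first = epm.on_reading("epa-1", 70.0)   # only one agent has reported so far
second = epm.on_reading("epa-2", 95.0)  # average of 70.0 and 95.0
```

In this run the first reading stays below the limit, while the second pushes the average to 82.5 and records one alert.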

The EMS workflow starts when the application owner submits the requirements for a specific application. These requirements, expressed in TOSCA, are the input to the EMS. The EMS translates them into a set of CEP rules that dictate the functionality of each monitoring agent. CEP allows events to be processed as data streams in order to identify meaningful events, or combinations of events, based on a predefined set of rules. The various EMS agents are then deployed accordingly and monitor the microservices, checking for potential violations. Monitored data falls into two kinds: raw metrics, which are collected directly from the monitoring probe of an agent (e.g., RAM used in a particular Virtual Machine), and composite metrics, which require processing (e.g., average CPU usage of the application). Once a violation is detected, an event is issued and propagated to trigger the reconfiguration.
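The idea of evaluating a rule over a stream of readings can be sketched as a sliding-window check. The example below is illustrative only: the `Rule` fields, the window size and the limit are hypothetical, and it does not reproduce the TOSCA input or the rule format that the EMS actually generates. Each incoming sample plays the role of a raw metric, the windowed average plays the role of a composite metric, and a violation event is emitted whenever the average exceeds the limit.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Iterator


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: the average of the last `window` samples must not exceed `limit`."""

    metric: str
    window: int
    limit: float


def detect_violations(rule: Rule, samples: Iterable[float]) -> Iterator[dict]:
    """Yield one violation event per full window whose average exceeds the limit."""
    recent: Deque[float] = deque(maxlen=rule.window)
    for index, value in enumerate(samples):
        recent.append(value)  # raw metric sample
        if len(recent) < rule.window:
            continue  # not enough samples yet to evaluate the rule
        average = sum(recent) / rule.window  # composite metric
        if average > rule.limit:
            yield {"metric": rule.metric, "at_sample": index, "average": average}


rule = Rule(metric="cpu_usage", window=3, limit=80.0)
events = list(detect_violations(rule, [60.0, 70.0, 90.0, 95.0, 85.0, 50.0]))
```

With these sample values, the rule is violated for the windows ending at samples 3 and 4 (averages 85.0 and 90.0); each yielded event is what would then be propagated to trigger a reconfiguration.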

Editor: Dimitris Kounoudis, Institute of Communication and Computer Systems (ICCS) of the School of Electrical and Computer Engineering (ECE) of the National Technical University of Athens (NTUA)
