International Journal For Multidisciplinary Research

E-ISSN: 2582-2160     Impact Factor: 9.24

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal

Reducing Leader Recovery Time in Distributed Architectures Using Zookeeper Atomic Broadcast

Author(s) Naveen Srikanth Pasupuleti
Country United States
Abstract Virtual Replication (VR) is a distributed system architecture that provides fault tolerance and high availability by maintaining copies of data across multiple nodes. If one or more nodes fail, the system continues to function by promoting one of the remaining replicas to leader. VR systems typically use leader-based replication: a single node, the leader, handles write operations and propagates the changes to the follower nodes. When the leader fails, the replicas must agree on which of them takes over, and this recovery process is often slow, particularly as the number of nodes increases. Although the consensus process preserves the consistency and availability of the system, it introduces delay. One primary cause is synchronization: the system must ensure that all replicas are up to date before electing a new leader, which is time-consuming in larger clusters. In addition, the number of messages exchanged between replicas grows with the number of nodes, and leader election itself takes multiple rounds of communication and coordination, so in systems with many nodes this coordination overhead can become a bottleneck.
Another contributing factor is the time required to validate the state of the cluster after a leader failure. A large-scale system often has many follower nodes, and ensuring that they all agree on the new leader can take considerable time. The problem is exacerbated under heavy load, when the election process becomes more resource-intensive. Optimizing leader failure recovery in VR systems is therefore essential for maintaining performance and minimizing downtime in large-scale deployments. This paper addresses the issue using ZooKeeper Atomic Broadcast (ZAB).
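
The election step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Vote`, `elect_leader`, and `messages_per_round` are hypothetical, and the model is simplified. It assumes that a leader is chosen only when a majority of replicas is reachable, that the winner is the replica with the most recent state, compared by epoch, then last transaction id, then server id (the ordering ZooKeeper's leader election is generally described as using), and that every replica messages every other replica once per round.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Vote:
    """State a replica advertises during an election (hypothetical model)."""
    epoch: int      # most recent epoch the replica has seen
    zxid: int       # id of the last transaction in the replica's log
    server_id: int  # unique replica id, used only as a tie-breaker


def elect_leader(votes: Iterable[Vote], cluster_size: int) -> Optional[Vote]:
    """Return the winning vote, or None if fewer than a majority responded."""
    reachable = list(votes)
    quorum = cluster_size // 2 + 1
    if len(reachable) < quorum:
        return None
    # Prefer the replica with the most up-to-date state.
    return max(reachable, key=lambda v: (v.epoch, v.zxid, v.server_id))


def messages_per_round(cluster_size: int) -> int:
    """Messages sent in one all-to-all round (assumed communication pattern)."""
    return cluster_size * (cluster_size - 1)
```

Under these assumptions, the per-round message count grows quadratically with cluster size, which is one way to read the abstract's claim that coordination overhead becomes a bottleneck in larger clusters.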
Published In Volume 5, Issue 4, July-August 2023
Published On 2023-08-05
DOI https://doi.org/10.36948/ijfmr.2023.v05i04.46802
