Partition-tolerant Distributed Publish/Subscribe Systems

University of Toronto, 2010.
Pages 1-10.

Abstract

In this paper, we develop {\em reliable} distributed publish/subscribe algorithms that can tolerate the concurrent failure of up to $\delta$ brokers or links. In our approach, $\delta$ is a configuration parameter that determines the level of fault tolerance of the system, and reliability refers to exactly-once and per-source in-order delivery of publications to clients with matching subscriptions. We propose protocols to address three problems in the presence of broker or link failures: {\em (i)} subscription propagation; {\em (ii)} event forwarding; and {\em (iii)} broker recovery. To precisely study the effect of multiple failures on the operation of the system, we introduce two types of network partitions, which we term {\em partition islands} and {\em partition barriers}. Our approach is able to transparently bypass partition islands while guaranteeing reliable publication delivery at all times. For barriers, however, we maintain reliability by excluding delivery of publications that may violate the reliability requirements. Finally, we study the effectiveness of our approach when the number of concurrent failures exceeds $\delta$. Via experimental evaluations, we demonstrate that a system configured with a modest value of $\delta=3$ is able to reliably deliver $97\%$ of publications in the presence of failures of up to $17\%$ of brokers.
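
The reliability definition above (exactly-once and per-source in-order delivery) can be illustrated with a minimal client-side sketch. This is not the paper's protocol: the class name `InOrderDeliverer`, the `receive` method, and the assumption that each source numbers its publications 1, 2, 3, ... are all hypothetical, introduced only to show what the two guarantees mean for a receiver.

```python
from collections import defaultdict


class InOrderDeliverer:
    """Hypothetical receiver-side filter; illustrative only."""

    def __init__(self):
        # Assumption: every source numbers its publications 1, 2, 3, ...
        self.next_seq = defaultdict(lambda: 1)
        # Publications that arrived ahead of a gap, keyed by source, then seq.
        self.pending = defaultdict(dict)

    def receive(self, source, seq, payload):
        """Return the payloads that become deliverable, in per-source order."""
        if seq < self.next_seq[source] or seq in self.pending[source]:
            return []  # duplicate: already delivered or already buffered
        self.pending[source][seq] = payload
        delivered = []
        # Drain the buffer while the next expected publication is present.
        while self.next_seq[source] in self.pending[source]:
            delivered.append(self.pending[source].pop(self.next_seq[source]))
            self.next_seq[source] += 1
        return delivered


d = InOrderDeliverer()
print(d.receive("p1", 2, "b"))  # held back until publication 1 arrives
print(d.receive("p1", 1, "a"))  # releases both, in order
print(d.receive("p1", 1, "a"))  # duplicate, dropped
```

Ordering is tracked separately for each source, so a gap in one publisher's stream does not delay publications from another.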
