SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol
Unknowns
- virtually synchronous multicast protocols (performance drops sharply, and the group tends to partition, beyond a few dozen members)
- stable rate of false positives: Having a stable rate of false positives is better than having a stable rate of false negatives
- infection style
- Contact member ?
- Static coordinator ?
- why toss coin ?
- anti entropy
- merkle trees ?
Problems with traditional protocols (heart-beating, etc)
- high network loads that grow quadratically with group size
- Compromised response times
- False positive frequency (a process is declared crashed even though it is alive, e.g. because of network issues)
- time to detect failures is high
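To make the quadratic growth concrete, here is a rough count of messages per period. It assumes all-to-all heartbeating (every member heartbeats every other member) and compares it with a worst-case count for a ping-based probe that uses k helpers. The function names and the default k = 3 are made up for illustration.

```python
def heartbeat_messages_per_period(n: int) -> int:
    """All-to-all heartbeating: each of n members sends to the other n - 1."""
    return n * (n - 1)


def probe_messages_per_period(n: int, k: int = 3) -> int:
    """Worst-case count when every member runs one probe per period.

    Direct probe: ping + ack (2 messages). If that fails, each of the k
    helpers costs ping-req + ping + ack + relayed ack (4 messages).
    """
    return n * (2 + 4 * k)


for n in (10, 100, 1000):
    print(n, heartbeat_messages_per_period(n), probe_messages_per_period(n))
```

The first count grows with n squared, while the second grows linearly in n, so the load per member stays constant.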
About SWIM
- 2 components: failure detection, and dissemination of membership updates
- Expected time to first detect a failure is not affected by group size
- 2 step process for detecting failure by suspecting processes first
- provides stable failure detection times
- low message load for group members
- efficient non multicast failure detection
- using the dissemination component only when a membership change occurs. How is this different from other protocols ?
- dissemination latency in the group grows slowly (logarithmically) with the number of members.
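A small simulation gives a feel for the logarithmic growth. This is a simplified push-gossip model, not SWIM's actual dissemination component: each round, every member that already has the update forwards it to one uniformly random member. The function name and the seed are arbitrary.

```python
import math
import random


def rounds_to_infect_all(n: int, rng: random.Random) -> int:
    """Count rounds until an update starting at member 0 reaches all n members.

    Each round, every member that has the update sends it to one member
    chosen uniformly at random (the target may already have it, or may even
    be the sender itself).
    """
    infected = {0}
    rounds = 0
    while len(infected) < n:
        targets = {rng.randrange(n) for _ in infected}
        infected |= targets
        rounds += 1
    return rounds


rng = random.Random(42)  # fixed seed so the run is repeatable
for n in (16, 256, 4096):
    print(f"n={n:5d} rounds={rounds_to_infect_all(n, rng):3d} log2(n)={math.log2(n):.0f}")
```

The infected set can at most double per round, so log2(n) is a lower bound. In this model the observed round count should stay within a small multiple of it.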
Failure detection
- Every protocol period, M picks a random member P, sends it a ping and waits for an ack.
- If no ack arrives from P, M picks k random processes and asks them to ping P on its behalf. If none of them receives an ack from P, P is declared failed in M’s local membership list and the change is handed to the dissemination component.
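The two steps above can be sketched as a single function. The network is replaced by a hypothetical `can_reach(src, dst)` callback that says whether a ping would be acked in time. All names here are illustrative rather than taken from any real implementation.

```python
import random
from typing import Callable, Sequence

# can_reach(src, dst) -> True if a ping from src to dst would be acked in time.
Reachability = Callable[[str, str], bool]


def probe(m: str, p: str, members: Sequence[str], k: int,
          can_reach: Reachability, rng: random.Random) -> bool:
    """Return True if M obtains an ack for P, directly or through a helper."""
    # Step 1: direct ping.
    if can_reach(m, p):
        return True
    # Step 2: ask up to k other members to ping P on M's behalf.
    candidates = [x for x in members if x not in (m, p)]
    helpers = rng.sample(candidates, min(k, len(candidates)))
    # A helper only counts if M can reach it and it can reach P.
    return any(can_reach(m, h) and can_reach(h, p) for h in helpers)


members = ["A", "B", "C", "D", "E"]
crashed = {"E"}
broken_links = {("A", "D")}  # A and D cannot talk to each other directly


def can_reach(src: str, dst: str) -> bool:
    if src in crashed or dst in crashed:
        return False
    return (src, dst) not in broken_links and (dst, src) not in broken_links


rng = random.Random(0)
print(probe("A", "D", members, 3, can_reach, rng))  # True: D is alive, reached via a helper
print(probe("A", "E", members, 3, can_reach, rng))  # False: E has crashed
```

The indirect step is what keeps a single bad link between M and P from producing a false positive.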
Some concepts
- Piggybacking
Piggybacking means reusing the failure detector's own messages to carry data. Membership updates are attached to the ping and ack messages (and the indirect ping requests) that are sent anyway, so dissemination needs no extra packets. Acks are still sent immediately; they simply carry any pending updates along.
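A minimal sketch of a piggyback buffer, under assumptions of my own: each outgoing message carries at most a few updates, updates sent fewer times go first, and an update is dropped after it has been sent a multiple of log(n) times. The class name, the cap of 3 updates per message and the factor of 3 are all illustrative.

```python
import math
from dataclasses import dataclass, field


@dataclass
class PiggybackBuffer:
    group_size: int
    max_per_message: int = 3      # assumed cap on updates per packet
    retransmit_factor: int = 3    # assumed multiplier on log(n)
    _pending: dict = field(default_factory=dict)  # update -> times sent

    def limit(self) -> int:
        """How many times one update is piggybacked before it is dropped."""
        return self.retransmit_factor * math.ceil(math.log2(self.group_size + 1))

    def add(self, update: str) -> None:
        self._pending[update] = 0

    def take(self) -> list:
        """Pick the least-sent updates to attach to the next outgoing message."""
        chosen = sorted(self._pending, key=self._pending.get)[: self.max_per_message]
        for u in chosen:
            self._pending[u] += 1
            if self._pending[u] >= self.limit():
                del self._pending[u]
        return chosen


buf = PiggybackBuffer(group_size=7)
buf.add("B suspect inc=1")
ping = {"type": "ping", "updates": buf.take()}  # the ping carries the update
print(ping)
```

The same `take()` call would be used when building an ack, so every message the failure detector sends doubles as a gossip carrier.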
- Incarnation
SWIM always assumes the worst. Example: if node B is suspected, a plain alive message from B is ignored, because the suspicion takes precedence. To let B refute it, each node carries an incarnation number, which works like a counter that only the node itself can increment. B increments its incarnation number from 1→2 and sends an alive message. Other nodes always accept a message whose incarnation is higher than the one stored in their membership list, i.e., 1.
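The precedence rules can be written down as a small comparison function. This is a sketch of my understanding of the rules (dead beats everything, suspect beats alive at the same incarnation, otherwise the higher incarnation wins). The `Update` type and its field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    node: str
    status: str        # "alive", "suspect" or "dead"
    incarnation: int


def overrides(new: Update, old: Update) -> bool:
    """Return True if `new` should replace `old` (both about the same node)."""
    if old.status == "dead":
        return False                      # a death declaration is final
    if new.status == "dead":
        return True                       # dead overrides alive and suspect
    if new.status == "alive":
        return new.incarnation > old.incarnation
    # new.status == "suspect"
    if old.status == "alive":
        return new.incarnation >= old.incarnation
    return new.incarnation > old.incarnation


stored = Update("B", "suspect", 1)
print(overrides(Update("B", "alive", 1), stored))  # False: same incarnation, suspicion wins
print(overrides(Update("B", "alive", 2), stored))  # True: B bumped its incarnation
```

Only B ever increments B's incarnation, so other members cannot accidentally clear a suspicion on B's behalf.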
- Generation + incarnation to tackle node restart
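One possible reading of this idea, stated as an assumption: a restarted node loses its in-memory incarnation counter, so a second number (the generation, bumped once per process start) is compared first. Python tuples compare field by field, which gives exactly that ordering. The `Version` type is hypothetical.

```python
from typing import NamedTuple


class Version(NamedTuple):
    generation: int   # assumed: increases on every process start
    incarnation: int  # increases while running, starts over after a restart


before_restart = Version(generation=1, incarnation=7)
after_restart = Version(generation=2, incarnation=0)

# The restarted node wins even though its incarnation counter went back to 0.
print(after_restart > before_restart)  # True
```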