An interesting conversation has been evolving on Twitter about the antifragility of server systems, DevOps, etc.
The link below is one thread, but there are a number of others, so please don't take it as the only one: https://twitter.com/nntaleb/status/368820666767650816
It was getting too long for Twitter, so I'm expanding on some thoughts here.
The desirable situation is similar to what happens in the human body, where weaker or bad cells die off, leaving stronger ones. This is an antifragile system: one where stressors make the system stronger.
The discussion is about analogies in the server and DevOps space.
Current server systems are robust but not antifragile. Robust systems resist external force up to a point, then collapse completely.
Antifragile systems adaptively get better with each external stressor.
The problem with applying this to server systems is twofold:
a) There is no underlying statistical model, or underlying set of properties admitting a statistical distribution, under which some part of the population may be considered "good" and the rest "bad". In fact, in a server farm (whether in the cloud or not), all servers are intentionally started up with identical properties.
There is therefore, IMHO, no way to assign a "goodness" measure to a server and have it "killed off" to improve the health of the larger population of servers.
b) Servers themselves do not adapt in the presence of external stress in a way that changes their responses in the future. That is not how they are currently designed; they are much more like bricks than human cells.
IMHO, there needs to be a layer on top of existing server architecture that uses machine learning to do two things. But before I go there, I want to say that the goal of this is not "finding bugs", as Netflix's Chaos Monkey does.
It is to adaptively change parameters of the system so that with each stressor it becomes more resilient to future stressors.
The problem, IMHO, is that we don't think of server populations as sets of properties that are random variables in the stochastic-process sense; we think of each server as an identical copy of the others, and a server is either up or down. In our mental model there is no smooth transition where a server gracefully degrades, learns from the stress, and degrades less the next time. We want this second model, where we think in terms of graceful degradation (a smooth metric) rather than up/down (a discrete metric).
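To make the smooth-versus-discrete distinction concrete, here is a minimal sketch of the two mental models. All function names, signals (error rate, latency), and weights are illustrative assumptions on my part, not taken from any real monitoring system:

```python
# Hypothetical sketch: a binary up/down check versus a continuous
# health score in [0, 1]. The inputs and weights are assumptions
# chosen only to illustrate the idea of graceful degradation.

def binary_health(error_rate: float) -> bool:
    # Discrete metric: the server is simply "up" or "down".
    return error_rate < 0.5

def smooth_health(error_rate: float, latency_ms: float,
                  latency_budget_ms: float = 200.0) -> float:
    # Smooth metric: the score degrades gradually as errors and
    # latency rise, instead of flipping from 1 to 0 at a threshold.
    error_score = max(0.0, 1.0 - error_rate)
    latency_score = max(0.0, 1.0 - latency_ms / latency_budget_ms)
    return 0.5 * error_score + 0.5 * latency_score
```

With the smooth metric, a server under moderate stress reports a score somewhere between fully healthy and dead, which is exactly the signal an adaptive loop would need.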
The question is then threefold:
a) What knowledge (attributes) do we want to extract from the current stressor, and from the current response, that might be useful in creating an adaptive loop?
b) What parameters in the server do we adjust after a stressor event, so that it degrades less in the future?
c) How do we assign degrees of goodness (a continuous metric), so that we can use certain servers preferentially over others?
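The three parts above can be sketched together as one toy loop. Everything here is a made-up illustration under simple assumptions: the "attribute" extracted is just overload relative to a queue limit, the "parameter" adapted is that queue limit, and "goodness" is an exponentially smoothed score used as a selection weight:

```python
import random

# Hypothetical sketch of the threefold loop: (a) extract an attribute
# from a stressor event, (b) adjust a server parameter so it degrades
# less next time, (c) maintain a continuous goodness score used for
# weighted selection. Names and update rules are illustrative only.

class AdaptiveServer:
    def __init__(self, name: str):
        self.name = name
        self.queue_limit = 100.0    # (b) the parameter we adapt
        self.goodness = 1.0         # (c) continuous goodness in [0, 1]

    def observe_stressor(self, request_rate: float) -> None:
        # (a) extract a simple attribute from the stressor/response pair:
        # how far the request rate exceeded what this server can queue.
        overload = max(0.0, request_rate - self.queue_limit)
        degradation = overload / (overload + self.queue_limit)
        # (c) update goodness smoothly rather than flipping up/down.
        self.goodness = 0.9 * self.goodness + 0.1 * (1.0 - degradation)
        # (b) adapt: grow capacity a little after each overload event,
        # so the same stressor degrades the server less next time.
        if overload > 0:
            self.queue_limit *= 1.1

def pick_server(servers, rng=random.Random(0)):
    # Use goodness as a weight, so healthier servers are chosen
    # preferentially instead of bad ones being killed outright.
    return rng.choices(servers, weights=[s.goodness for s in servers])[0]
```

The point of the sketch is not the particular update rule, which is arbitrary, but the shape of the loop: each stressor leaves the server measurably different, and the population is used according to a continuous goodness metric rather than a binary one.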
Nicholas, Adrian, et al - hope this makes sense. Hard to do this on Twitter.
I am not sure the "killing off bad cells" analogy holds in a cloud, though it may hold in a non-virtualized hardware server farm where some servers have worse hardware than others. The virtualization in a cloud, IMHO, spreads "badness" in ways that may make the above discussion inapplicable: "badness" cannot be localized to a particular server, even if that server is made to look more like a cell than a brick.