Profile

Sebastian “baboo” P.
Worked at T-Online
Attended Technische Universität Darmstadt
Lives in the middle of nowhere
1,478 followers | 1,352,432 views
About · Posts · Collections

Stream

 
#Verbalsadismus on the A4

What is another word for an elephant's penis?
Obviously: a Dicktiergerät (a pun on "Diktiergerät", German for dictation machine).
 
 
• Austria for English speakers #Herstareike
 
 
What the local press isn't writing

"Tsipras teilt gegen die Kanzlerin aus„ titeln gleichlautend Focus und Bild, "Tsipras pöbelt im TV gegen die Kanzlerin“ schreibt die "Münchner Abendzeitung. "Tsipras droht mit Referendum“ ist der Titel in der Zeit und „Tsipras kriegt kalte Füße“ ist die Interpretation des Handelsblatts. Englische Blätter verstehen Tsipras anders. Der Telegraph hat z.B. die headline: „Angela Merkel is still Greece’s best hope“ und Fortune fragt sich „Did Alexis Tsipras just banish the risk of Grexit?“.

So that readers can form their own opinion, and because a lot of interesting information was not conveyed at all, I would like to translate a few key statements from the original Greek at this point.

... 
read more:

http://norberthaering.de/de/27-german/news/350-brief-tsipras#weiterlesen
 
 
Distributed Discovery Systems: Zookeeper, etcd, Consul

When you are doing distributed anything, all components of the distributed system need to agree on which systems are part of the distributed service and which aren't. This is a hard problem to solve - if you try to do it yourself, you'll end up here: https://aphyr.com/tags/jepsen, because you will be doing it wrong. Instead, use a consensus system such as Zookeeper (ZK), etcd or Consul and be done with it properly.

If your cluster does not use such a system and is also not Jepsen tested, it is likely to be defective. I am looking at you, Openstack.

This kind of system is called a consensus system because it has a number of nodes that need to agree (find a consensus) on who the leader is, even while nodes or the connections between them fail. For that, these systems use a validated implementation of an algorithm such as Paxos (http://en.wikipedia.org/wiki/Paxos_%28computer_science%29) or Raft (http://en.wikipedia.org/wiki/Raft_(computer_science)).

Services and Operations offered

Once these systems have agreed among themselves on cluster membership and leadership for the consensus system itself, they provide a service to the rest of the cluster - typically a tree of nodes, where each node can have subnodes and attributes, much like an XML tree and a little less like a filesystem (in filesystem terms, each node in a consensus system is usually a directory and a file at the same time).

Typically, operations on nodes are atomic. That is, clients of the consensus system can try to create the same node simultaneously, but because of the ordering guarantees of the cluster consensus system only one client can succeed, and the cluster agrees clusterwide on which one - even in the face of adverse conditions such as ongoing node failures or network splits. That latter part is important: many systems call themselves cluster systems, but only work well as long as the cluster and the network operate in fair weather.

The Jepsen testing harness (https://aphyr.com/tags/jepsen) is a setup in which such distributed systems are subjected to a defined test load with a known expected end state. Jepsen runs the test load and at the same time randomly kills nodes or splits the cluster network. It records the state changes of the cluster and the end state on each node, and compares them to the expected result. If the results differ, the cluster is broken. Most are. In fact, only ZK survived on the first attempt, and at the moment only ZK, etcd and Consul are known to survive a Jepsen test.

Once you have a stable and verified cluster core, you can do useful things with it. For that, you need a set of operations to change state, and a mechanism to learn when the state changes. 

The cluster provides operations for clients that allow them to be notified of changes in a node or in a subtree starting at a single node - these are called watches. A watch is a substitute for polling. Instead of each client asking "Did the cluster master change yet? Did the cluster master change yet? Did the cluster master change yet?" in a tight loop, clients are notified once the cluster master changes.
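
As an illustration, here is what such a watch could look like through the Python kazoo client - an assumed client library (nothing above prescribes one), with invented host names and path:

  # Hypothetical sketch: kazoo client (assumed, not named above),
  # invented host names and path.
  from kazoo.client import KazooClient

  zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
  zk.start()

  # Register a watch instead of polling: kazoo re-arms the watch and
  # calls this function again on every change of /master.
  @zk.DataWatch("/master")
  def on_master_change(data, stat):
      if data is None:
          print("no master at the moment")
      else:
          print("current master:", data.decode())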

Assume a Zookeeper installation with three machines that are running Zookeeper instances in a single cluster. These nodes will agree among themselves on a master and a common shared state, which in the beginning is an empty tree of nodes, called Znodes.

The ZK API has only a few operations: "create /path data" to create a Znode, "delete /path" to delete it again, "exists /path" to check if a path exists, "setData /path newdata" to change the data of a Znode, "getData /path" to get that data and "getChildren /path" to get a list of child Znodes under /path.
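
For concreteness, a sketch of those six operations through the kazoo client (again an assumption; paths and data are invented):

  # The same six operations via kazoo (assumed library); /config is invented.
  from kazoo.client import KazooClient

  zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
  zk.start()

  zk.create("/config", b"v1")                # create /path data
  print(zk.exists("/config") is not None)    # exists /path
  zk.set("/config", b"v2")                   # setData /path newdata
  data, stat = zk.get("/config")             # getData /path
  print(data, stat.version)
  zk.create("/config/feature-x", b"on")
  print(zk.get_children("/config"))          # getChildren /path -> ['feature-x']
  zk.delete("/config/feature-x")             # delete /path
  zk.delete("/config")
  zk.stop()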

It is important to understand that the data in a node can only be read or written as a whole, atomically - it is never modified in place, only atomically replaced. The data is supposed to be small, maybe a KiB or four at most.

A central concept in Zookeeper (and Consul) is the session. You can connect to any Zookeeper in a cluster and will see the same cluster state - after all, that is what consensus systems are for. When you establish a connection, you also establish a session. Even if you lose the connection because of network problems, the session remains. If you manage to connect to any other ZK node in the cluster within the timeout limit, your session stays active. To terminate the session, you either request a session end explicitly, or you time out because your client is so isolated within a broken network that it cannot reach any ZK node that is still part of the surviving cluster.
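
Seen through kazoo (still an assumed client library), the session is tied to the client object, and a connection-state listener is how a client learns whether its session survived a network hiccup or was lost:

  # Session handling as seen through kazoo (assumed library): the session
  # lives as long as the client can reach some ZK node within the timeout.
  from kazoo.client import KazooClient, KazooState

  zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")

  def session_listener(state):
      if state == KazooState.SUSPENDED:
          print("connection lost - the session may still survive the timeout")
      elif state == KazooState.LOST:
          print("session expired - all ephemeral nodes of this session are gone")
      elif state == KazooState.CONNECTED:
          print("(re)connected, session intact")

  zk.add_listener(session_listener)
  zk.start()   # establishes the connection and with it the session
  # ... do work ...
  zk.stop()    # ends the session explicitly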

Znodes in a ZK hierarchy can be persistent - you create them and they stay around until you delete them. They can also be ephemeral - you create them and when your session ends they go away. So if a client registers itself with a cluster in an ephemeral node and puts connection information into the data of its Znode, we can be pretty sure that the client is alive and somehow reachable.

Znodes can also be sequential. That is, we provide a base name for a Znode such as "job-" and the cluster appends a unique, monotonically increasing number and tells us the resulting name ("job-17"). Because of the properties of the cluster consensus protocol, all nodes will agree on what name is owned by whom, and on a global order.

A final important concept is versions - each Znode has a version number associated with the data being stored, and each time that data is replaced (remember that it can't be changed in place), the version number is incremented. A couple of operations in the API, such as setData and delete, can be executed conditionally - the calls take a version number as a parameter, and the operation succeeds if and only if the version passed by the client still matches the current version on the server.
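
A sketch of such a conditional update with kazoo - BadVersionError is what the client sees when somebody else replaced the data between its read and its write (the /counter node is invented):

  # Conditional update with kazoo (assumed library); /counter is invented.
  from kazoo.client import KazooClient
  from kazoo.exceptions import BadVersionError

  zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
  zk.start()
  if zk.exists("/counter") is None:
      zk.create("/counter", b"0")

  data, stat = zk.get("/counter")
  try:
      # Succeeds only if nobody has replaced the data since our read.
      zk.set("/counter", str(int(data.decode()) + 1).encode(),
             version=stat.version)
  except BadVersionError:
      print("lost the race - re-read and retry")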

Usage

Obviously, we can use this to create an attendance list of worker nodes in a cluster. Each worker will connect to the cluster, create a session and create the persistent directory /workers, if it doesn't exist already.

It will then create an ephemeral node for itself, /workers/hostname, and leave connection information as the data in it. Yes, Openstack, those connection endpoints in keystone's MySQL are pretty useless compared to that.
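
A sketch of that self-registration with kazoo (the endpoint string is invented):

  # Worker self-registration with kazoo (assumed library); endpoint invented.
  import socket
  from kazoo.client import KazooClient

  zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
  zk.start()

  zk.ensure_path("/workers")   # persistent parent, created only if missing

  # Ephemeral: vanishes automatically when this worker's session ends.
  zk.create("/workers/" + socket.gethostname(),
            b"10.0.0.17:4711", ephemeral=True)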

If a client goes away due to node failure or network partition, the session will end, and the endpoint information stored in the ephemeral node will go away together with the ephemeral node for that host itself - until the network or the node recovers and registers again.

We can also create a /jobs directory, persistent of course, and create sequential /jobs/job- nodes in it. They will be numbered automatically, and we will have a global, agreed-on order of jobs in the cluster. These jobs now need to be assigned to available workers. For that we need a scheduler that makes these decisions, and among all eligible nodes only one node can become the scheduler. We are going to call that node the master.
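
Sketched with kazoo, the sequence flag on create is what makes ZK append the counter (the job payload is invented):

  # Submitting a job with kazoo (assumed library); the payload is invented.
  from kazoo.client import KazooClient

  zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
  zk.start()
  zk.ensure_path("/jobs")

  # sequence=True makes ZK append a zero-padded counter and return the
  # resulting name, e.g. /jobs/job-0000000017 - ordered across all clients.
  path = zk.create("/jobs/job-", b'{"cmd": "resize", "img": "cat.jpg"}',
                   sequence=True)
  print("submitted", path)
  print(sorted(zk.get_children("/jobs")))   # the agreed-on global job order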

A node that wants to be the master can try to create an ephemeral node /master and put its own connection information into that node.

Because the node is ephemeral, it will go away when the master's session goes away.

Because node creation is atomic and can succeed only if there is no node already in place, it will fail when there already is a master.

Because we can set a watch on /master, we will be notified if the master goes away, and we will then try to become the master ourselves. That may or may not succeed depending on the order of events in the surviving cluster, but we don't need to care, because ZK does. It will elect a ZK master internally, create a shared, agreed-on state of the ZK Znode tree, notify the surviving sessions of the master loss and then await attempts to create the master node. It will order these, create an agreed-on order of creation attempts, allow the first attempt to succeed and tell all the others of their failure. It will then deliver the data - the connection information for the newly elected master - to everybody on request.
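
Put together, a hedged sketch of that election loop with kazoo - kazoo also ships a ready-made election recipe, but the bare version shows the mechanism (the connection data is invented):

  # Master election with kazoo (assumed library); connection data invented.
  from kazoo.client import KazooClient
  from kazoo.exceptions import NodeExistsError

  zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
  zk.start()

  def try_to_become_master():
      try:
          # Atomic create: exactly one contender wins, clusterwide.
          zk.create("/master", b"10.0.0.17:5000", ephemeral=True)
          print("I am the master now")
      except NodeExistsError:
          print("somebody else is master, keeping the watch on /master")

  @zk.DataWatch("/master")
  def on_master_change(data, stat):
      if data is None:              # no master (yet, or anymore)
          try_to_become_master()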

The master can get a list of unassigned jobs in /jobs, and a list of idle workers in /workers, and assign jobs to workers. It will create a directory /schedules and keep scheduled jobs and their assigned workers in sequential ephemeral nodes in there.
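
And a minimal sketch of the master's scheduling pass - the round-robin assignment strategy is a simplification of my own, not something described above:

  # The master's scheduling pass, sketched with kazoo (assumed library);
  # the round-robin assignment is an invented simplification.
  from kazoo.client import KazooClient

  zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
  zk.start()
  for p in ("/jobs", "/workers", "/schedules"):
      zk.ensure_path(p)

  jobs = sorted(zk.get_children("/jobs"))
  workers = sorted(zk.get_children("/workers"))
  if not workers:
      raise SystemExit("no workers registered yet")

  for i, job in enumerate(jobs):
      # Sequential + ephemeral, as described above: assignments are globally
      # ordered and vanish if the master's session dies.
      zk.create("/schedules/assign-",
                ("%s -> %s" % (job, workers[i % len(workers)])).encode(),
                ephemeral=True, sequence=True)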

The cluster simply works, even if the network breaks around it, nodes join or leave, or other things happen.

Why

Clusters are dynamic environments. In a cluster of 100 hardware nodes you do not want to register membership to the cluster manually with calls to a Keystone or Nova API. Nodes will register and unregister themselves with the cluster as they go up and down.

In such a cluster, you will need roles such as API service, compute service, scheduler service and so on. For many roles you will need a specific number of instances, such as three API instances, exactly one scheduler instance and one compute service per hardware node.

You do not actually care where your scheduler or API services are running - if there are too few of them, nodes will notice that and simply spawn an instance. That may create too many, so some of them may decide to kill themselves. Because the cluster has an agreed-on order of events, operations are ordered and atomic, and you will not get flapping services.

Services may migrate, respawn or change locations, but the cluster manages that automatically through discovery instead of you entering keystone or nova commands to keep track of the cluster state after planned or unplanned topology changes. Their location may change, but as long as you have capacity there will be exactly the right number of instances of roles running in the cluster.

ZK can keep small files directly in the cluster, or it can store pointers to large files in a highly available store for large files (such as S3 https URLs).

If you think that this, together with automated respawning and discovery, makes a lot of the Puppet needed to set up the cluster redundant - congratulations, you have been following the discussion very closely.

Differences

ZK is the oldest and best tested system of the three. It has all important concepts and works very well. It has a few idiosyncrasies regarding order across session stop/start blocks that need careful coding.

etcd is part of the systemd/etcd/etcd-sa/fleetd/coreos/docker combo. It does not use persistent connections but HTTP/HTTPS, and hence has no concept of a session like ZK's. It also does not have watches. Instead, a TTL on values replaces ephemeral nodes, and version numbers used in polling replace watches - you tell etcd that you have seen the history of a subtree up to a certain point and get all the missing bits since then.
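
A sketch of both replacements against the etcd v2 HTTP API, the API generation this comparison refers to (endpoint, key names and TTL are invented):

  # Both replacements against the etcd v2 HTTP API (assumed endpoint and keys).
  import requests

  ETCD = "http://127.0.0.1:2379/v2/keys"

  # TTL instead of an ephemeral node: the key disappears unless the worker
  # refreshes it before the 30 seconds are up.
  requests.put(ETCD + "/workers/host1",
               data={"value": "10.0.0.17:4711", "ttl": 30})

  # Index-based polling instead of a watch: read once, remember the index...
  r = requests.get(ETCD + "/workers", params={"recursive": "true"})
  index = r.json()["node"]["modifiedIndex"]

  # ...then long-poll for the first change after that index.
  r = requests.get(ETCD + "/workers",
                   params={"wait": "true", "recursive": "true",
                           "waitIndex": index + 1})
  print(r.json())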

Consul is much like ZK, but uses Raft internally (like etcd does) instead of Paxos.

All of them are Jepsen-safe in their latest versions.

In the end it matters less which one you are using. You should be worried if your cluster doesn't use any of these. I am still looking at you, Openstack.

#zookeeper   #etcd   #consul   #openstack    #cluster   #consensus  
 
 
A different perspective for a change
A guest article on the current GDL strike by Harburg's DGB chairman Detlef Baade: Train drivers in Germany take home €1,750 net, which puts them at the bottom end of the pay scale for train drivers in Europe. So the strike for more pay...
 
#läuft.

Most expensive festival ever.
Britta Klein: Shame on you ;-) #angeber
 
 
Finally, like took them forever xD
Sebastian's Collections
People
Have them in circles
1,478 people
Lale “Yuncu” Yılmaz
Zeckenlily Blau
omar billy
Mark Baker (AudioByAlien)
milan markovski
Poolim Annpoo
Sandra “Mammawutz” D.
willames santos
Michael Kurz
Work
Employment
  • T-Online
    Network Specialist, 2002 - 2004
  • HRZ TU-Darmstadt
    IT specialist apprentice, systems integration (Azubi FI-SI), 1998 - 2001
  • Transcom
    IT, 2001 - 2002
Places
Map of the places this user has lived
Currently
the middle of nowhere
Story
Tagline
Musing about corporate policies, "agile" and all kinds of catcontent. Bring your own tin foil hat.
Introduction
"The trouble with quotes on the internet is that it’s difficult to discern whether or not they are genuine.” 
- Abraham Lincoln
Education
  • Technische Universität Darmstadt
    1998 - 2001