Teleport tool: Backwards-compatible resilience to network outages
- All of these errors will disappear if you use a newer version of the Go compiler. I thought the latest official release no longer required trailing return statements, but maybe I misremember. The Go tip definitely has no problems with this. I will make sure this works with the latest release of Go. Sorry for the inconvenience.
Aug 30, 2013
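Assuming the errors in question are complaints about a missing final return: since Go 1.1, the language spec's "terminating statement" rule lets a function end in an if/else whose branches both return (or in an infinite `for`, or a `panic`) with no trailing `return`, whereas earlier compilers rejected such code. A minimal illustration (`abs` is just an example function):

```go
package main

import "fmt"

// abs ends in an if/else where both branches return. Since Go 1.1 this is a
// "terminating statement", so no trailing return is needed; older compilers
// reported a missing return at the end of the function.
func abs(x int) int {
	if x < 0 {
		return -x
	} else {
		return x
	}
}

func main() {
	fmt.Println(abs(-3), abs(4)) // 3 4
}
```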
- That's an excellent question. When application software uses a timeout monitor on an I/O operation, the use of the timeout sometimes conflates two very different purposes:
(a) One purpose of a timeout is to detect whether something has gone wrong with the network or the remote server, and
(b) Another purpose of a timeout is to move on with app execution, because perhaps the application, for its own reasons, is not interested in responses that come too late.
In an ideal world, it is not the application's business to figure out whether the remote server or the network is dead, especially with ad hoc devices like timeouts. This is the job of the entity that executes the application, because that entity has a global view of the system and is better equipped to know when a network error should be announced and when it shouldn't, perhaps because the network will be coming back up soon, after the engineers are done replacing that old router. For instance, when an app experiences a 1-minute timeout, it does not know whether this is due to a dead remote process or to a 1-hour temporary network disruption caused by manual maintenance.
Since the app cannot distinguish which of the two is the source of the timeout, it cannot always take intelligent action. If it knew the network was down only temporarily, it could just wait until it came back up. If, on the other hand, it knew that the remote process had died, it might choose to die itself.
In good style, an app should use a timeout only for its own purposes. For example, it is OK to use a timeout because the user of the app does not want to wait past a deadline. This is an app-specific timeout. It is not OK for the app to use timeouts to guess at error conditions of the hosting system that it does not know or understand.
In other words, in an ideal world, the handling of physical (networking) errors and of application errors should be decoupled, and handling physical errors should not be in the hands of the app.
Unfortunately, for historical reasons, many apps today conflate the two uses of timeouts and therefore take the same action regardless of the underlying reason for the timeout.
The Teleport Tool partially fixes this problem. It makes network connections never break, as far as the app can see. If a legacy app does hit a timeout (one it installed itself) while running under the Teleport Tool, it will still take whatever action it has hardcoded. This action might be overkill, because legacy apps often treat a timeout as a major network failure, but there's little we can do about that. Hitting a timeout error, however, does not close the connection, so a well-implemented app can continue using the healthy Teleport connection.
Even if a legacy application does the usual thing and retries the connection, this is fine: the Teleport Tool multiplexes over its pool of TCP connections, so no actual over-the-network TCP handshake takes place.
But perhaps the shortest answer (last, of course) to your question:
In POSIX socket semantics, a timeout error is user-elected. It does not imply connection failure, and most software is aware of that and usually continues to use the connection.
Aug 30, 2013
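Go's `net.Conn` deadlines follow the same idea as the POSIX timeouts mentioned above: the caller elects the limit, and hitting it returns a timeout error without closing the connection. The sketch uses only a loopback connection; `readAfterTimeout` is a hypothetical helper name.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

// readAfterTimeout shows that a read deadline is a caller-chosen limit:
// hitting it returns a timeout error, but the TCP connection stays open and
// can be read again once the deadline is cleared.
func readAfterTimeout() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()

	// Server: stay silent for a while, then send one message.
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		time.Sleep(200 * time.Millisecond)
		c.Write([]byte("hello"))
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()

	buf := make([]byte, 5)

	// First read: the app elects a 50ms limit and times out.
	conn.SetReadDeadline(time.Now().Add(50 * time.Millisecond))
	if _, err := conn.Read(buf); !errors.Is(err, os.ErrDeadlineExceeded) {
		return "", fmt.Errorf("expected a timeout, got %v", err)
	}

	// Clear the deadline and keep using the same connection.
	conn.SetReadDeadline(time.Time{})
	if _, err := io.ReadFull(conn, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	msg, err := readAfterTimeout()
	fmt.Println(msg, err) // hello <nil>
}
```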
- The Teleport Tool decouples application errors from system errors in the following way:
With the Teleport Tool, only the system administrator can end a connection, by explicitly killing the Teleport Tool process. So network decisions are in the hands of the system admin.
Timeouts can still occur within the application. These are application errors and the app decides how to handle those.
This gives you a clean separation of error-handling responsibilities.
Most legacy apps actually play quite well with the Teleport Tool, because on a timeout they usually reconnect. This is fine and costs nothing with the Teleport Tool, because a reconnect does not actually trigger an over-the-network TCP handshake.
Aug 30, 2013
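A sketch of the legacy reconnect pattern, assuming the app is pointed at the local (client-side) Teleport endpoint, so each redial only handshakes over loopback. `callWithReconnect` and `echoServer` are hypothetical names; the echo server merely stands in for that local endpoint.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// callWithReconnect dials addr, sends one line and reads one line back.
// On any error it drops the connection and dials again, as many legacy
// clients do, up to attempts times (attempts should be at least 1).
func callWithReconnect(addr, req string, attempts int) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		conn, err := net.Dial("tcp", addr)
		if err != nil {
			lastErr = err
			continue
		}
		_, err = fmt.Fprintf(conn, "%s\n", req)
		if err == nil {
			var reply string
			reply, err = bufio.NewReader(conn).ReadString('\n')
			if err == nil {
				conn.Close()
				return strings.TrimSuffix(reply, "\n"), nil
			}
		}
		conn.Close() // give up on this connection and reconnect
		lastErr = err
	}
	return "", fmt.Errorf("after %d attempts: %w", attempts, lastErr)
}

// echoServer stands in for the local endpoint: it echoes one line per connection.
func echoServer() (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				line, err := bufio.NewReader(c).ReadString('\n')
				if err == nil {
					c.Write([]byte(line))
				}
			}(c)
		}
	}()
	return ln, nil
}

func main() {
	ln, err := echoServer()
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	reply, err := callWithReconnect(ln.Addr().String(), "ping", 3)
	fmt.Println(reply, err) // ping <nil>
}
```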
- In your example, the security timeout is an application-specific and application-elected timeout. It is fine if the application wants to close the connection at the timeout. Teleport will respect that and disconnect the remote connection between the server-side teleport tool and the user server (presumably some internal service).
However, the TCP connection between the client-side (HTTP-side) and server-side (service-side) Teleport tools will not be disconnected. This keeps a "warm" TCP connection available for future connect attempts from the HTTP server. Notice that this way multiple connection attempts from the HTTP server are all carried over the same warm TCP connection, and therefore they don't suffer the latency of an over-the-network TCP handshake.
This is a particularly useful application for PHP web apps, which cannot keep state between requests, so every HTTP request creates new network connections to internal services.
On the other hand, I disagree that "same goes for database connections". It depends on the application. For example, in stream processing you might have a Kafka queue streaming records into a database (perhaps through an intermediary service). The goal is to send the whole stream to the database eventually. Per-record transmit times are irrelevant.
In this case, the Kafka stream will usually be configured to be patient (no timeouts), as there is no use in timing out: the goal is to process all the data, not to process all the data by a deadline. This will generally be the case with any data-analysis pipeline (as opposed to a pipeline on the critical path of the user experience).
Sep 6, 2013
- I am not sure what you mean by "limits the usage of your tool severely". You can use the tool in absolutely EVERY setting.
In some cases the tool will bring a bigger benefit to your app (it will help with network glitches, and with efficiency thanks to connection caching). In other cases it will help less (just with efficiency). And in some rare cases, where your apps already do connection caching, it won't help, but it also won't hurt.
Note that, by design, the tool can be put between any client and server without affecting the correctness of their interaction.
Sep 9, 2013
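The reason such a tool can sit between any client and server is that, from the endpoints' point of view, it only forwards bytes. A bare-bones Go relay shows that property; it is a sketch, not the Teleport Tool itself, and it omits the connection pool and reconnection logic. `relay`, `serveRelay`, `echoOnce` and `roundTrip` are hypothetical names.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync"
)

// relay copies bytes in both directions between a and b without looking at
// them. When one side stops sending, the EOF is passed on with CloseWrite so
// the other direction can still finish (TCP half-close is preserved).
func relay(a, b *net.TCPConn) {
	var wg sync.WaitGroup
	pipe := func(dst, src *net.TCPConn) {
		defer wg.Done()
		io.Copy(dst, src)
		dst.CloseWrite()
	}
	wg.Add(2)
	go pipe(b, a)
	go pipe(a, b)
	wg.Wait()
	a.Close()
	b.Close()
}

// serveRelay accepts one client on ln, dials target and relays between them.
func serveRelay(ln net.Listener, target string) {
	c, err := ln.Accept()
	if err != nil {
		return
	}
	s, err := net.Dial("tcp", target)
	if err != nil {
		c.Close()
		return
	}
	relay(c.(*net.TCPConn), s.(*net.TCPConn))
}

// echoOnce is a stand-in server: it reads one client's whole request and sends it back.
func echoOnce(ln net.Listener) {
	c, err := ln.Accept()
	if err != nil {
		return
	}
	defer c.Close()
	data, err := io.ReadAll(c)
	if err != nil {
		return
	}
	c.Write(data)
}

// roundTrip sends msg to the echo server through the relay and returns the reply.
func roundTrip(msg string) (string, error) {
	echoLn, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer echoLn.Close()
	relayLn, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer relayLn.Close()

	go echoOnce(echoLn)
	go serveRelay(relayLn, echoLn.Addr().String())

	conn, err := net.Dial("tcp", relayLn.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	conn.(*net.TCPConn).CloseWrite() // tell the server the request is complete
	reply, err := io.ReadAll(conn)
	return string(reply), err
}

func main() {
	fmt.Println(roundTrip("hello")) // hello <nil>
}
```

Propagating half-close with `CloseWrite`, rather than closing both sockets as soon as one direction ends, is what keeps the relay from cutting off a server reply that is still in flight.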
- Regarding your second comment: if timeouts are ever used in an application to get back to a clean state, it means the application has bugs, and the tool will help you spot them. This is a good thing.
That said, I've never seen an application where a timeout is used to get back to a clean state. Perhaps you could give me an example.
Sep 9, 2013