Pipes - observation, comment, and question

I've had a chance to digest the Pipes package, and I have three basic thoughts I'd like to share.

The observation: a driving motivator for iteratees has always been making it easier to reason about resource usage. In support of this, a general property of enumeration-IO packages has been that scarce resources (Handles/sockets/file descriptors) will never be held for longer than necessary, and indeed it's basically been impossible to circumvent this (at least in "iteratee", "enumerator", and Oleg's code). However, by reifying enumerators (producers) to data, the pipes package no longer has this property. A Pipe may hold an open Handle until it's garbage collected, which may be much later than you would expect (or desire). I don't think it will be easy to regain this property without significantly altering some core design choices.

The comment: I believe that the decision to require that all data producers be explicitly drained is a mistake. This is mostly because at present, I don't see an easy way to short-circuit the necessary IO to consume e.g. a file if it's determined at an early stage that further processing is unnecessary. Also, out of all the programming mistakes I've made, and bugs I've introduced (both categories are unfortunately quite large), I don't believe I've ever made an error that would have been prevented by this requirement. However, this would be relatively simple to change and/or work around should the implementers desire to do so.

The question: for anyone who's used the pipes library for any significant work, how does performance compare to iteratee/enumerator/iterIO? An element-based implementation, such as pipes uses, is unquestionably more elegant than the block-based implementations of other libraries, but I've yet to see an element-based implementation that matches the performance of a block-based version. Changing this probably wouldn't be too difficult, but the result would definitely not look as pretty.

In conclusion, I'm very glad to see some new libraries expanding the design space, as most other packages hew fairly closely to Oleg's work. However, I do think it's fair to say that the "pipes" design makes some tradeoffs that may impact some use-cases heavily. I don't think the ultimate iteratee design, if indeed one exists, has yet been found.
 
I can't speak for pipes, but conduit also avoids chunking, and performance has been fine. But that's not really a fair statement, since the majority of the use cases for conduit (or iteratee/enumerator/pipes) involve data which has already been chunked, either via ByteString or Text.

Chunking was originally in conduit, but I removed it (+Greg Weber's idea) to solve the data loss issue. I consider it a very minor point in the API design, and quite orthogonal to other decisions (chunking could be added back without much difficulty).
 
+Michael Snoyman - thanks for your input. You actually touch on my only criticism of "enumerator": IMHO it gets the stream abstraction wrong by presenting chunks to users as essentially first-class data. Conduit looks more general in that it isn't necessary to work with chunks; however, it seems that you do so anyway. Essentially, what I would like to see is that you change to `sourceHandle :: ResourceIO m => Handle -> Source m Word8`, etc. At least that's my vision for iteratee; sometimes the performance lags behind chunk-aware code.

Getting chunking right without losing data is hard; data will need to be discarded at certain points. Properly defining those times is tricky.
 
I definitely understand the desire to have such an API, it just seems that in practice it complicates things too much. Here's a possible approach that I haven't really digested fully: the type of `sourceHandle` isn't really that important; what matters is the type of all the operations. I suppose the practical ramification is that all of the helper functions in enumerator and conduit need to be written three times: in the List, Binary and Text modules.

Perhaps we can unify those operations via an associated type, like `type family Content a`, with instances like `type instance Content ByteString = Word8` and `type instance Content Text = Char`. Then we could define `head :: Resource m => Sink a m (Content a)`. We'd have to be careful that we still implement things efficiently.

Is there another advantage to have sourceHandle work directly on Word8 that I haven't realized?
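For what it's worth, the associated-type idea can be sketched roughly like this (a toy sketch, not conduit's actual API; `Chunk`, `cUncons` and `headC` are invented names):

```haskell
{-# LANGUAGE TypeFamilies #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Word (Word8)

-- Relate each chunk type to its element type, so one definition of an
-- element-wise operation covers the List, Binary and Text cases.
class Chunk c where
  type Content c
  cUncons :: c -> Maybe (Content c, c)

instance Chunk B.ByteString where
  type Content B.ByteString = Word8
  cUncons = B.uncons

instance Chunk T.Text where
  type Content T.Text = Char
  cUncons = T.uncons

instance Chunk [a] where
  type Content [a] = a
  cUncons []       = Nothing
  cUncons (x : xs) = Just (x, xs)

-- Written once, usable at every chunk type:
headC :: Chunk c => c -> Maybe (Content c)
headC = fmap fst . cUncons
```

A real `head :: Resource m => Sink a m (Content a)` would sit on top of something like `cUncons`, with the usual care taken so that the chunk-wise fast path stays efficient.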
 
I don't understand how just the API presents a complication. The combination of my desired API and high performance does provide for a somewhat messy implementation unfortunately, but that's a price I'm willing to pay.

Unifying the operations via an AT is exactly what iteratee does, through the ListLike package (technically a fundep, but that doesn't matter). There's only one declaration for most functions, and the List, ByteString, and Text versions all come for free (vector-based versions too, which I use a lot). The only real performance pitfall is with enumeratees; element-wise enumeratees tend to be much less efficient than chunk-wise versions. It can be harder to write chunk-wise enumeratees, because you need to either manually handle leftover data or decide to drop it. But you're used to doing that from enumerator/conduit anyway.

The only other advantage I'm aware of (which shouldn't be underestimated) is having a single implementation point for a very large number of functions over a variety of stream types.
 
I've spent some time learning about pipes and trying to improve the library so that it could be used in serious code, so I'll try to answer your questions as best as I can. +Gabriel Gonzalez will certainly be able to provide further insight.

First of all, pipes don't have any exception or error handling built in. It is intended that you put all error handling in the base monad. So there is really no way for pipes to be exception-safe with regards to resource deallocation.

However, I don't see this as a limitation; rather, it is about separation of concerns. If you use, for example, ResourceT IO as your base monad, you can define perfectly safe pipes that read data from a file and close it as soon as the pipe terminates, or when an exception occurs. You should be able to use monadic regions for the same purpose, as well, although I haven't yet had a chance to try it.

Of course you can write a Pipe that holds an open handle until GC, but I'm pretty sure you can do that with enumerators and conduits as well. We just need to provide good basic pipes to deal with files and handles (http://github.com/pcapriotti/pipes-extra is an initial stab at that), and everything will be doable by combining them through composition and other combinators.

Which brings me to what I think is the most important aspect of pipes: composability. Iteratees and enumerators don't really compose well as far as I know, and conduits are much better but still somewhat clunky, in my opinion. Eliminating the distinction between sources, sinks and transformers makes composition more powerful and easier to reason about. A big gripe of mine with conduit is that it has several composition operators that all seem to have the same semantics but may actually behave differently, and you probably need to know the internals to understand exactly how. Pipes have one associative composition, and all pipes are monads.
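To make the "one type, one associative composition" point concrete, here is a deliberately stripped-down model of the idea. This is an illustration only, not the pipes package's actual implementation (the real library's constructors and operator names differ):

```haskell
import Control.Monad (ap, liftM)

-- A toy Pipe: one type covers producers, consumers and transformers.
data Pipe a b m r
  = Await (a -> Pipe a b m r)   -- ask upstream for an `a`
  | Yield b (Pipe a b m r)      -- hand a `b` downstream
  | Eff (m (Pipe a b m r))      -- run an effect in the base monad
  | Done r                      -- finished with result `r`

instance Monad m => Functor (Pipe a b m) where
  fmap = liftM

instance Monad m => Applicative (Pipe a b m) where
  pure  = Done
  (<*>) = ap

instance Monad m => Monad (Pipe a b m) where
  Done r    >>= k = k r
  Await f   >>= k = Await ((>>= k) . f)
  Yield b p >>= k = Yield b (p >>= k)
  Eff m     >>= k = Eff (fmap (>>= k) m)

await :: Pipe a b m a
await = Await Done

yield :: b -> Pipe a b m ()
yield b = Yield b (Done ())

-- The single composition: demand-driven, downstream runs first.
(<+<) :: Monad m => Pipe b c m r -> Pipe a b m r -> Pipe a c m r
Done r    <+< _          = Done r
Yield c p <+< up         = Yield c (p <+< up)
Eff m     <+< up         = Eff (fmap (<+< up) m)
Await f   <+< Yield b up = f b <+< up
Await f   <+< Eff m      = Eff (fmap (Await f <+<) m)
Await f   <+< Await g    = Await (\a -> Await f <+< g a)
Await _   <+< Done r     = Done r

runPipe :: Monad m => Pipe () b m r -> m r
runPipe (Done r)    = return r
runPipe (Await f)   = runPipe (f ())
runPipe (Yield _ p) = runPipe p
runPipe (Eff m)     = m >>= runPipe
```

With this single type, a producer, a transformer and a consumer are just Pipes whose input or output type happens to go unused, and `p3 <+< p2 <+< p1` needs no parenthesization because the composition is associative.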

About producers needing to be completely drained: that's actually not true. There is a strict composition defined in the pipes package, but I see it more as a curiosity than a useful operation, and I personally wouldn't have exposed it. It is not actually useful in practice and it doesn't guarantee resource finalization. The use case that you mention is implementable in pipes without any problem using lazy composition. Actually, you can convert any sink, source or conduit into a pipe in a way that preserves their termination behavior (see Control.Pipe.Conduit in pipes-extra).

I agree with +Michael Snoyman that built-in chunking should not be necessary. You can work with chunked data if you want, and in that case there are ways to structure pipes to pass unconsumed data along. The idea of having an API that abstracts chunking away is interesting, and I suspect it could be built on top of Pipes. Again, I think it's important to keep various abstractions separate and composable, so that each one can be reasoned about independently.

As for performance, not much time has been put into micro-optimizations, but I did run a couple of benchmarks (stolen from conduit, you can find them in pipes-extra), and pipes are in the same ballpark as the other libraries. Pure pipes (i.e. those that don't have monadic effects) seem to be a lot faster than their conduit counterparts, though I'm not sure why.

I agree that there is still some work to be done to make pipes usable in practice, but I think the basic structure of the library is sound, and as soon as the guarded pipes (http://pcapriotti.wordpress.com/2012/02/02/an-introduction-to-guarded-pipes/) concept is finalized, and initial libraries of combinators and IO pipes are released, they will be ready to be applied to real situations.
 
I think Paolo summed up most of what I wanted to say. In Pipes, exception handling (as in Control.Exception) can be delegated to the base monad with no special considerations. Same thing goes for EitherT/ErrorT (which I personally prefer). However, Paolo and I are working on automatic resource finalization. I think his blog post on guarded pipes is on the right track but we are reworking it to make it more elegant and easier to reason about. Our goal is that automatic resource finalization is layered on top of lazy finalization so that you don't have to drain resources to still get automatic finalization.

I believe that the strongest advantage of Pipes over other iteratee libraries is that they are easier to reason about both in terms of their performance characteristics and their behavior because they rely heavily on category theory abstractions. Everybody in this thread is an expert on Pipes because it's so easy to understand. Notice that in the Pipes documentation I never have to explain how the Pipe data type is implemented or how the monad instance works and you never see Stack Overflow questions asking to explain various vagaries of Pipes behavior. Paolo and I are working to ensure that the final resource finalization implementation is equally easy to reason about.
 
I'm sorry, but that's a lot of hand-waving. Firstly, there's more to exception handling than resource deallocation. What if you want to run a pipe and catch an exception? There doesn't seem to be any way to do so. And due to the nature of pipes (similar to enumerator), you will need to have your code living in the Pipe monad for the majority of your program.

I have first-hand experience with the pain this causes: we had to jump through hoops in Yesod to get exception handling right. I can go into details if you're not familiar with the situation.

As far as "everyone's an expert" and it's so simple... I'm sorry, but the documentation makes me very concerned about pipes. You have the breakdown between strict and lazy, and if you use them the wrong way, things break. It's all well and good that you have a Category instance, but that's an incredibly minor point, and Conduit could have one as well if we were willing to swap around type variables and make things inconsistent. And all we get out of that is the ability to replace (=$=) with (.).

As far as different types: that's an advantage of conduits, not a disadvantage. Type errors are direct and clear. It's easy to understand exactly what's going on: a Source just provides data without consuming it. I don't see where the clunkiness is that you're referring to, and such vague accusations are really not going to advance the argument at all.

But my main gripe about pipes: there's nothing serious to back it up. There are a whole bunch of claims about how composable it is, how it's a better enumerator/iteratee, but you've not actually defined the problems you're solving. I can surmise that you believe it makes things easy to reason about, but there's no real incentive to switch besides just trusting you that it's going to solve problems.

I've made it clear that I don't believe pipes can scale up to the real problems large projects face. I don't think pipes could handle something like an HTTP proxy elegantly. If you believe otherwise, I would recommend actually showing serious code demonstrations.
 
+Michael Snoyman FWIW, I get a "more elegant" feeling out of Pipes than conduits, and +Paolo Capriotti's post resonated a lot with my initial impression of conduits. It's just one man's opinion, but conduits feels ugly (enumerator even worse) and pipes seems nice.

I wouldn't be so quick to dismiss criticisms of conduits that come from this direction. Some abstractions are better than others.
 
Actually, I do tend to dismiss criticisms like that without any kind of backing. We hear them all the time in the Haskell world: people think imperatively, there's too much time spent worrying about types, etc. These are baseless criticisms, and they don't advance arguments at all.

My strong belief (and I have yet to be shown otherwise) is that the elegance of pipes (which I'm not denying) is due to the fact that it doesn't solve the real problems we have. There's no question that conduit could be made simpler. I could completely remove BufferedSource, and get rid of an entire extra abstraction. But that would mean the library can't do as much as it can now.

So instead of saying "pipes are more elegant than conduits" without backing, what I'm looking for is "pipes are more elegant, and they can do everything conduits do." At that point, you'll have my attention. Until then, it's a bunch of hot air.

Alternatively, you could prove that pipes do everything necessary and that some of the features of conduit needn't exist, but I think that will be a hard sell. Each feature of conduit came as the result of some actual problem we were trying to solve.
 
+Michael Snoyman I did say why I think conduits are clunky: there are 3 specialized types, each with a different interface, and different composition operators (what's the difference between `c1 =$= c2 =$ s` and `c1 $= c2 $= s`?). Pipes have 1 type, which is a monad, and composes normally. You already made it clear that you think conduit's way is better, so we'll just have to disagree on this one. :)

As for exception handling, you can of course catch exceptions outside the pipe. You can also catch exceptions inside if they originate within that pipe's IO actions. What you can't do is recover from exceptions of the whole pipeline inside a particular pipe. Is that what you were referring to?

You seem to be missing the point of the Category instance. It's not about using `.` for composition. It's about the fact that the composition is associative. That's what improves composability. You can't write combinators if the end result depends on the order in which the pieces are assembled. I agree about the lazy/strict distinction, and I believe that it is a mistake. I'd like to remove strict composition altogether, since it doesn't actually serve any purpose.

I really don't see why pipes wouldn't be able to scale to real problems. As I said, you can convert sources/sinks/conduits to pipes (except for buffered sources). That sort of shows that they are more (or equally) general. You don't need buffered sources with pipes, since you can simply compose the downstream pipe vertically if you want to feed data to different consumers. If the data is chunked, you can return the leftover portion from the first pipe and feed it back to the second.

But you're right that there's not enough code to back up our claims. We're working to fix that.
 
I misunderstood. I didn't realize that the clunkiness claim was directed at the different types. It's true: we'll have to agree to disagree. I don't think having separate operators is confusing. I also don't think that you're correct about different behaviors for different kinds of composition. Or more to the point: if you are correct, it's a universal problem that pipes can't solve.

What I mean is that there is an inherent issue of data loss that cannot be overcome. If you don't believe me, go read the relevant section of the conduit chapter; it spells out data loss in terms of plain lists.
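In plain-list terms, the argument can be sketched like this (my paraphrase with invented names, not the chapter's actual code):

```haskell
-- A toy consumer over a list stream: it returns a result plus the
-- unconsumed rest of the input.
type Consumer a r = [a] -> (r, [a])

takeN :: Int -> Consumer a [a]
takeN = splitAt

-- Threading the leftover from one consumer into the next loses nothing.
safePair :: Consumer a ([a], [a])
safePair xs =
  let (r1, xs')  = takeN 2 xs
      (r2, xs'') = takeN 2 xs'
  in ((r1, r2), xs'')

-- But once the stream is chunked and a fused stage discards the
-- leftover of a chunk, those elements are simply gone.
lossyPair :: Consumer a ([a], [a])
lossyPair xs =
  let (chunk, rest) = splitAt 3 xs    -- stream arrives in chunks of 3
      (r1, _lost)   = takeN 2 chunk   -- chunk leftover is dropped here
      (r2, rest')   = takeN 2 rest
  in ((r1, r2), rest')
```

Running both on `[1..6]` shows the difference: the safe version hands back `[5,6]` as leftover, while the lossy version has silently dropped the element 3.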

Let me give you a completely valid use case that demonstrates the shortcomings we had with enumerator, and which I assume based on everything I've seen apply to pipes. We have a web application that needs to read a request body, and pipe it through something which may throw an exception. For example, streamed parsing of XML. If there is an exception thrown by this pipeline, we need to catch it and return a 400 error message.

You've also pointed out another downside of pipes: it can't handle buffered sources. This is a major component of conduit. Compare WAI based on enumerator and http-enumerator with WAI based on conduit and http-conduit, and you'll see the complete change in API. pipes would not allow us to have this more elegant solution.

In other words: pipes may be elegant in the small, but will lead us back to ugliness in the large.

To be clear: based on all the comments I've seen so far, I still believe pipes was designed in a vacuum without taking actual problems into consideration. It's very easy to create elegant solutions under such circumstances.
 
I read the section on data loss, but I'm not sure why that implies that you need different composition operators.

Thanks for the examples. I'll take some time to study them and see if they actually bring out weaknesses of pipes.

> I still believe pipes was designed in a vacuum without taking actual problems into consideration

That might be true, but the same could be argued about Haskell. :)
 
Apologies, let me clarify: the reason we need different composition operators is because we have different types, nothing to do with data loss. Any differences in behavior would come down to the issue of data loss, which is a universal issue.

As a side point, removing chunking lessens the impact of data loss, which IMO makes pipes and conduit more resilient than iteratee/enumerator. (enumerator works around the issue at the combinator level. For an example, look at the implementation of concatMapM.) But there are still times when it crops up.
 
I just want to emphasize that this is NOT a competition. Paolo and I strive a lot to emulate the practicality of conduits and I feel we have a lot to learn from it. In fact, we are indebted to conduits because it takes the pressure off of our library to pragmatically deliver results to the Haskell community immediately and we have more freedom to experiment and try to "get it right". After all, Haskell needs a killer framework like Yesod right now to broaden its appeal and I wouldn't want to feel responsible for holding the community back because I was nit-picking on elegance (as I am wont to do).
 
I agree that there's no competition in the sense that we're personally trying to one-up each other, I feel the same way. And to be clear: I like pipes, and am very glad people are looking into alternate approaches to the problem. I'm truly hopeful you can come up with something that will either simplify conduits, or even replace them.

However, at a technical level, I think there is a competition, in the sense of which package will be used. I'm just worried that pipes will not cover all use cases, and we'll end up with a lot of people spending a lot of time trying to solve problems in pipes that those of us who used enumerator already tried to solve.
 
I suppose I have to disagree that pipes are more compositional than other libraries. For example, in iteratee Enumerators are just Kleisli arrows, and can be composed via Kleisli composition. You just apply iteratees to enumerators like you would any other function. I don't see what's non-compositional about it, or even what's not associative. As an example, most of iteratee's fancy combinators are simply a combination of plain old function composition (.) and running an iteratee.

In fact, in some ways pipes are less compositional. Consider Kleisli composition of enumerators. I don't see how this is possible with pipes. It would work if your producer has a non-Zero source type, but then what's to stop someone from accidentally writing `pipeFile "b" <-< pipeConsumer`? In iteratee, and I presume conduits as well, mistakes like this are compile-time errors. With pipes, it appears to me that invalid code like this is either accepted at compile time, or you have to restrict the available compositions.

At its heart, iteratee provides only two things: a monadic parser library and functions for feeding data to those parsers. It's really not any different from uu-parsinglib, attoparsec, or many similar libraries, except for providing more enumeration functions.
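The Kleisli point can be seen in miniature with a toy model (for illustration only; these are not the iteratee package's real types):

```haskell
import Control.Monad ((>=>))

-- In iteratee, an Enumerator is essentially  Iteratee -> m Iteratee,
-- i.e. a Kleisli arrow, so (>=>) concatenates two data sources.
-- Here the "iteratee" is just a running fold over Int.
newtype SumIter = SumIter { runningTotal :: Int }

type Enumerator m = SumIter -> m SumIter

-- Feed a list of elements to the iteratee.
enumList :: Monad m => [Int] -> Enumerator m
enumList xs (SumIter n) = return (SumIter (n + sum xs))

-- Kleisli composition of two enumerators: first one source, then the other.
enumBoth :: Monad m => Enumerator m
enumBoth = enumList [1, 2, 3] >=> enumList [10, 20]
```

Applying `enumBoth` to a fresh iteratee runs both sources in order, which is all that "composing enumerators" means here; associativity comes for free from (>=>).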
 
Those two pipes will not compose in the wrong direction since they should have types "() ~> ByteString" and "ByteString ~> Void". Using the same type for the left and right end of a pipeline is a mistake that has already been fixed in my pipes branch.

About composability, let me try to explain it in a different way. Kleisli composition of enumerators is basically equivalent to vertical composition of pipes, but I have no idea how you can mix monadic and "horizontal" composition freely with iteratees like you can with pipes. Generally speaking, it is a lot harder to wrap your head around how enumerators and enumeratees compose, at least for me. This might very well be due to lack of experience on my part, but I think it also speaks of the relative simplicity and clarity of the two approaches, which by the way are very similar in spirit and only differ in API and some minor implementation choices.

Furthermore, pipes have a sort of Arrow instance (not quite, unfortunately, but pretty close, see my blog post on monoidal instances for more details), which enable very high-level declarative specifications of arbitrary multi-channel compositions.

If you look at my pipes-extra repository, you can see what I mean by composability. No specific pipe definition requires using the constructors. Compare for example Control.Pipe.Combinators with Data.Iteratee.ListLike or Data.Enumerator.List. Yeah, I know it's not a completely fair comparison because iteratees work with chunked data, but the difference in simplicity and level of abstraction is apparent.

As a final silly example:

fib :: Monad m => Pipe () Int m r
fib = loopP . feed (Right (0, 1)) . forever $ do
  (a, b) <- awaitRight
  yield $ Left a
  yield $ Right (b, a + b)

Note that loopP is the only combinator there which is implemented by using the constructors (and corresponds to loop for Arrows). Of course this could be implemented with normal recursion, but I hope it gives an idea of the kinds of abstractions that are possible using pipes and their generalized arrow instance.
 
I have actually eliminated Zero from the next release of the library (scheduled roughly for a week from now) and all pipes are now type safe and you can compose them and run them safely as you described. pipeConsumer will end up with a polymorphic output type and pipeFile will have a polymorphic input type so that would type-check and run even if no information actually flows across their "boundary". runPipe now has the type:

runPipe :: (Monad m) => Pipe (Maybe a) b m r -> m r

... so that it can guarantee providing input since it can't guarantee at compile time that the pipe it runs doesn't call await. You are completely correct that v1.0 is unsafe for this reason and this was absolutely a flaw in the initial library release.

It turns out that the (Maybe a) in the input type of runPipe leads to some very elegant and symmetrical results grounded in category theory, and it led to useful bonus functionality I didn't anticipate. I'm still discussing this with Paolo to see how much of it we want to include in the next official release.

I think you are missing the huge potential of iteratee libraries. It's not just about streaming data for performance. It's about compositionality and modularity that is applicable to ANY programming project. Iteratees make it trivial to write modular code and to mix and match functionality. For example, I can write functions like:

email address = forever $ do
  x <- await
  lift $ sendEmailTo address (show x)

prompt = forever $ do
  x <- lift getLine
  yield x

And now I can just compose them and I have a program that creates an e-mail for everything I type on the command line:

runPipe $ email "user@example.com" <+< prompt

If I decide I instead want to fire up a GUI program to compose e-mails, I just write a new producer pipe:

composeMessage = do
  x <- lift $ getMessageFromGUI Program
  yield x

And if I want to send 10 emails using my GUI program I just write:

runPipe $ email "user@example.com" <+< replicateM_ 10 composeMessage

But maybe I prefer to first have a friend approve my e-mails first to make sure that I'm not sending out drunk e-mails to ex-girlfriends. I can just write:

verify = forever $ do
  x <- await
  lift $ emailToFriend x
  y <- lift waitForFriendResponse
  when y $ yield x

And now I just compose that in the middle:

runPipe $ email "user@example.com" <+< verify <+< replicateM_ 10 composeMessage

You might even note that I could try using the "email" pipe I defined before to send the e-mail off to my friend instead of defining a hard-coded function like "emailToFriend". Right now we are also integrating arrow functionality into pipes, based off of Paolo's blog post, that will let you build up complex pipe flow controls and let you fan out or create recursive pipes.

This is the future of general-purpose programming, and it is why Haskell shines: it makes it incredibly easy to mix different modules with very little integration code required. This is why I consider iteratees a "killer feature" of Haskell right up there with STM, and why I strive to develop a really elegant solution to try to convert other people to Haskell.