We built a custom software system for training large networks on large clusters of machines.
We're hoping to publish a paper in the near future with more details on the training system we've built, but the two main principles we use are
(a) partitioning a single model across multiple machines (model parallelism), and
(b) data parallelism for training, by stamping out multiple copies of these multi-machine models, all sharing a set of parameters over the network through a centralized parameter server service that serves fresh copies of the parameters and applies the gradient updates sent to it by the model replicas (a minimal sketch of this pattern is below).
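To make (b) a bit more concrete, here's a minimal, single-process sketch of the parameter-server pattern using a toy linear model. It's purely illustrative: the names (ParameterServer, replica_step, etc.) are made up for this example rather than the real system's API, the "replicas" run sequentially here instead of asynchronously over a network, and each replica is a single small model rather than one partitioned across machines as in (a).

```python
import numpy as np

class ParameterServer:
    """Toy stand-in for a parameter server: holds the authoritative
    parameters, hands out fresh copies, and applies pushed gradients."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr

    def get_params(self):
        # Serve a fresh copy of the current parameters to a replica.
        return self.params.copy()

    def apply_gradient(self, grad):
        # Apply a (possibly slightly stale) gradient pushed by a replica.
        self.params -= self.lr * grad


def replica_step(server, X, y):
    """One SGD step performed by a model replica on its own data shard."""
    w = server.get_params()                  # pull parameters
    pred = X @ w                             # toy linear model
    grad = 2.0 * X.T @ (pred - y) / len(y)   # squared-error gradient
    server.apply_gradient(grad)              # push gradient back


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0, 0.5])
    server = ParameterServer(dim=3)

    # Four "replicas", each with its own shard of the training data.
    shards = []
    for _ in range(4):
        X = rng.normal(size=(256, 3))
        y = X @ true_w + 0.01 * rng.normal(size=256)
        shards.append((X, y))

    for step in range(200):
        for X, y in shards:      # in the real system these run concurrently
            replica_step(server, X, y)

    print("learned parameters:", server.params)
```

In the real setup the parameter server is itself distributed across many machines and the replicas push their updates asynchronously over the network, so a gradient may have been computed from slightly out-of-date parameters; the toy version above ignores that and just runs the replicas in turn.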
A few diagrams of this are shown starting about halfway through the following slide deck (start with the slide "Scaling Deep Learning"): http://cra.org/uploads/documents/resources/snowbird2012_slides/dean.pdf
Given enough training data and computational cycles, it's definitely practical. I think these sorts of techniques would be very useful in the genomics domain, because of their ability to automatically identify complicated, high-level features and interactions from the raw data.