What Daniel Povey means is that CUBLAS is not a drop-in replacement for BLAS. It is conceptually analogous to BLAS, but it is not interface-compatible with it: BLAS operates on matrices in host memory, while CUBLAS operates on matrices in device (GPU) memory. You can't simply swap CUBLAS in for BLAS and expect old programs to run, because those programs pass pointers to host memory.
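To make the interface mismatch concrete, here is a minimal sketch comparing a CBLAS SGEMM call with its CUBLAS (v2 API) counterpart. The function names `host_sgemm` and `device_sgemm` are just illustrative; the key point is that the CUBLAS version needs a library handle and pointers into device memory, so an unmodified program's host pointers cannot simply be passed through:

```c
#include <cblas.h>      /* host BLAS (e.g. OpenBLAS, ATLAS) */
#include <cublas_v2.h>  /* device BLAS */

/* Host BLAS: A, B, C are ordinary pointers into host memory. */
void host_sgemm(int n, const float *A, const float *B, float *C) {
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
}

/* CUBLAS: d_A, d_B, d_C must already live in device memory, and
 * every call goes through a cublasHandle_t. Handing it host
 * pointers is an error, which is why CUBLAS cannot be swapped in
 * behind an unmodified BLAS-calling program. */
void device_sgemm(cublasHandle_t handle, int n,
                  const float *d_A, const float *d_B, float *d_C) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
}
```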
You could in principle write your own BLAS-compatible library that wraps CUBLAS: copy the arguments of each BLAS call to the GPU, call CUBLAS, and copy the result back to host memory. This would perform terribly, because it would transfer data to and from the GPU on every single linear algebra operation. GPU programs are only efficient if they transfer data once, do many operations on the GPU, and then transfer a result back; if every operation incurs a transfer, it will probably be slower than just running on the CPU. The high-level program really must be designed to be aware of transfers and to minimize them.
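For illustration, a sketch of such a hypothetical wrapper (the name `sgemm_via_gpu` is made up here) shows where the cost lands. Each call pays for allocation, two host-to-device copies, and one device-to-host copy, regardless of how little arithmetic the call performs:

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Hypothetical host-memory SGEMM (C = A*B, square n x n matrices
 * for brevity) that hides CUBLAS behind a BLAS-like interface.
 * The per-call transfers below are exactly the overhead that makes
 * this approach slow. */
void sgemm_via_gpu(int n, const float *A, const float *B, float *C) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);

    /* Transfer overhead paid on every single call: */
    cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cublasDestroy(handle);

    /* ...and paid again on the way back: */
    cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
```

A well-designed GPU program would instead allocate once, keep the matrices resident on the device across many operations, and copy back only final results.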
Don't use Octave. As far as I know, there is no GPU acceleration available for it; Octave is essentially a FOSS, somewhat stripped-down version of MATLAB. I personally would recommend using Python and Theano. For MATLAB itself, it's possible to get GPU acceleration using Jacket, but you have to pay for Jacket.