You should move your SSE code out of your source file and into a separate module. And the alignment issues that some of the older processors have were solved on newer ones long ago.
Still, it totally makes sense to use the aligned instructions for older processors, which suffer a performance penalty when facing unaligned data. On my AMD "Piledriver" FX-6300 this isn't the case anymore. I guess your system is even newer, isn't it?
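To make the aligned/unaligned distinction concrete, here is a minimal sketch in C with SSE intrinsics. It assumes an x86 compiler that ships `<xmmintrin.h>` (GCC, Clang). The function name `add4` and the helper `is_aligned16` are made up for illustration; they are not from your code.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_loadu_ps, ... */

/* Hypothetical helper: returns 1 if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical example: out[i] = a[i] + b[i] for i = 0..3.
 * Uses the aligned load/store only when all three pointers are 16-byte
 * aligned, otherwise falls back to the unaligned variants. */
static void add4(const float *a, const float *b, float *out)
{
    if (is_aligned16(a) && is_aligned16(b) && is_aligned16(out)) {
        /* Aligned variants: these fault if an address is not 16-byte aligned. */
        __m128 va = _mm_load_ps(a);
        __m128 vb = _mm_load_ps(b);
        _mm_store_ps(out, _mm_add_ps(va, vb));
    } else {
        /* Unaligned variants: work for any address, slower on older CPUs. */
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(out, _mm_add_ps(va, vb));
    }
}
```

On a recent CPU you could probably drop the branch and always use the unaligned variants; the check is only there to show both paths side by side.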
So it might be that you can use AVX instead of SSE to get around that problem. But anyway, if you'd like to see how I dealt with alignment problems, here is a link:
That's the one where I beat the GNU Scientific Library by a factor of 30. Thirty, yes, I didn't accidentally type a zero at the end. Should not be possible, but it was. Even without using assembly.
If you are using the YMM registers that come with the AVX instruction set extension, you go from 128 bits to 256 bits processed in parallel. That should give you a performance boost of at least 50%, if not more.
There are several data types that help with alignment, but the best way I found was the alignment attribute:
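As a rough sketch of what that can look like, assuming GCC or Clang and a C11 standard library: `__attribute__((aligned(16)))` is the compiler extension, and `aligned_alloc` is the C11 heap counterpart. The names `sse_buf`, `is_aligned` and `alloc_floats_aligned` are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* GCC/Clang extension: place this buffer on a 16-byte boundary,
 * which is what the aligned SSE loads and stores expect. */
static float sse_buf[8] __attribute__((aligned(16)));

/* Hypothetical helper: returns 1 if p sits on an n-byte boundary
 * (n must be a power of two). */
static int is_aligned(const void *p, size_t n)
{
    return ((uintptr_t)p & (n - 1)) == 0;
}

/* Hypothetical helper for heap memory: C11 aligned_alloc wants the size
 * to be a multiple of the alignment, so round the byte count up first.
 * The caller releases the memory with free(). */
static float *alloc_floats_aligned(size_t count, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t bytes = count * sizeof(float);
    bytes = (bytes + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, bytes);
}
```

For AVX and the YMM registers you would ask for 32 instead of 16 in both places.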