I'm dealing with a large pre-processing job and looking for advice on how to set up my pipeline.
I'm training a siamese network on the Labeled Faces in the Wild (LFW) dataset to do face verification. LFW has about 13,000 face images. A siamese network ends up learning a metric that maps each face into a low-dimensional embedding space, with similar faces clustering together. At test time I can take a given face, see where it lies in that space, and check whether it clusters near another face to make a positive/negative verification decision.
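For the verification step, the idea is roughly the sketch below; the `embed` function stands in for one branch of the trained network, and the distance threshold is a made-up value that would have to be tuned on held-out pairs:

```python
import numpy as np

def verify(embed, face_a, face_b, threshold=1.0):
    """Decide whether two face images show the same person.

    embed     -- placeholder for one branch of the trained siamese network,
                 mapping an image to a point in the learned embedding space
    threshold -- hypothetical distance cutoff, tuned on held-out pairs
    """
    distance = np.linalg.norm(embed(face_a) - embed(face_b))
    return distance < threshold  # close together -> same identity
```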
From experiments on a subset of the data, I've found that the best performance comes from pairing every face with every other face, so that I train on all possible positive and negative combinations. However, doing this with the full 13,000 images would be a huge processing task; see the quick scale check below.
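All-vs-all pairing grows quadratically with the number of images, which a standard-library one-liner makes obvious (the image counts are approximate dataset sizes, not exact):

```python
from math import comb

for n in (13_000, 500_000):  # approximate LFW and CASIA-WebFace image counts
    print(f"{n:>7,} images -> {comb(n, 2):,} unordered pairs")
# 13,000 images  ->      84,493,500 pairs (~84.5 million)
# 500,000 images -> 124,999,750,000 pairs (~125 billion)
```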
I'm currently pre-processing my data in Python: pairing the faces, shuffling the pairs, then writing them to a LevelDB database for use with Caffe. If I try to build every possible combination from the full 13K images, I run out of memory. I either need to get much cleverer about how I process this in Python, or move to some other solution that can handle every possible combination. What data-processing pipeline do people recommend for doing the pairing and shuffling on a dataset this size without exhausting available memory?
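For illustration, here's a minimal sketch of one way the pairing and writing could be streamed instead of built up front, assuming the py-leveldb bindings (plyvel would look similar). `serialize_pair` is a placeholder for however the two images and their same/different label get packed into a value (e.g. a serialized Caffe Datum), and the hash-prefixed keys are just one way to get a shuffled read order without an in-memory shuffle:

```python
import hashlib
import itertools
import leveldb  # assuming the py-leveldb bindings; plyvel would look similar

def write_pairs(db_path, image_keys, serialize_pair, batch_size=1000):
    """Stream every unordered pair of images into LevelDB without ever
    holding the full pair list in memory.

    serialize_pair -- placeholder: load the two images for (key_a, key_b)
                      and return the value bytes for this pair.
    """
    db = leveldb.LevelDB(db_path)
    batch = leveldb.WriteBatch()
    written = 0
    # itertools.combinations is lazy, so pairs are produced one at a time
    # instead of being materialized up front.
    for key_a, key_b in itertools.combinations(image_keys, 2):
        # LevelDB stores keys in sorted order, so prefixing each key with a
        # hash of the pair means a later sequential read comes back in an
        # effectively shuffled order.
        db_key = hashlib.md5(f"{key_a}|{key_b}".encode()).hexdigest().encode()
        batch.Put(db_key, serialize_pair(key_a, key_b))
        written += 1
        if written % batch_size == 0:
            db.Write(batch)
            batch = leveldb.WriteBatch()
    db.Write(batch)  # flush the final partial batch
```

With this shape, the only thing held in memory at any time is one batch of serialized pairs, though it obviously doesn't make the sheer number of writes any smaller.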
Eventually I'll move from LFW to the CASIA-WebFace dataset, which has about 500K images and therefore around 125 billion possible pairs, so whatever pipeline I build needs to scale up to that level.