Today I encountered something odd. While working on an assignment for my studies I noticed that a program was running very slowly (using only 30% CPU). This program builds a training set from a set of images, so a lot of I/O is involved. The weird thing is, when I called the training program in a different way, it used over 70% CPU (and finished in less time).
Looking at Activity Monitor, a lot of I/O was going on, both reading (1-2 MB/s) and writing (0.5 MB/s). I expected reading, but writing? The only difference between the two calls was the list of filenames the training program should read. In the fast case, it was reading all filenames, to train on all data. In the slow case, it was reading a somewhat smaller subset of these files, to train for K-fold cross-validation.
The thing is, the somewhat smaller subset was generated by picking random filenames, so it was unsorted. This caused the trainer to read the files non-sequentially (even though the order doesn't really matter). Apparently, this caused a lot of write load from updating the access times of the files being read. Because the directory holds 14380 files, updating the access times in a non-sequential order is apparently not very nice on disk I/O.
Sorting the smaller subset by filename brought the trainer's CPU usage to over 70%, with a correspondingly shorter run time as well!
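
For illustration, here is a minimal Python sketch of the fix: draw the random subset as before, then sort it before handing it to the trainer. The function name `pick_sorted_subset`, the `seed` parameter and the example filenames are made up for this sketch; they are not the actual assignment code.

```python
import random


def pick_sorted_subset(filenames, count, seed=None):
    """Pick `count` random filenames and return them sorted by name.

    Hypothetical helper: the random choice still decides WHICH files
    end up in the subset, but sorting makes the trainer visit them in
    the same order as a full, sorted directory listing would.
    """
    rng = random.Random(seed)  # seed only makes the example reproducible
    subset = rng.sample(list(filenames), count)  # sampling without replacement
    return sorted(subset)


# Example with made-up filenames, matching the size of the directory above.
all_files = [f"img_{i:05d}.png" for i in range(14380)]
training_files = pick_sorted_subset(all_files, 10000, seed=0)
```

Since the order of the training files doesn't matter to the trainer, sorting changes nothing about the result; it only changes the order in which the disk is hit.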