Version control is great. However, it is something I have to actively think about. When I am writing code, I can find natural points at which I have finished a task and am ready to commit. But I am in a different frame of mind while doing exploratory model fitting.
Saving the models and results
When I am playing with an idea, I’d rather spend those brain cycles on improving the model or tweaking parameters. And perhaps it is just me, but training models to fit data seems like a separate realm of thought, one where version control just doesn’t fit. Checking in the models feels akin to adding binaries to source control: a bit icky.
Also, each of these models can take a long time to train, so I do want to at least persist them to disk; I have accidentally killed my share of Python processes and screens with the model results still stored in variables.
Pickling takes care of the serialization. The only missing bit is coming up with filenames.
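Pickling a fitted object is only a couple of lines either way; a minimal sketch, where the model dict is a stand-in for whatever the training code actually produces (an sklearn estimator, learned weights, ...):

```python
import os
import pickle
import tempfile

# A hypothetical fitted model -- stands in for whatever the
# training code produces.
model = {"alpha": 0.1, "weights": [0.3, 0.7]}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Serialize the model to disk ...
with open(path, "wb") as f:
    pickle.dump(model, f)

# ... and load it back in a later session.
with open(path, "rb") as f:
    restored = pickle.load(f)
```
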
Hardest problem in CS
Naming things is one of the two (or three?) hardest problems in Computer Science.
These are the approaches that sprang to my mind:
Use a static file name
This means that only the most recent model is saved and all history is lost. It can also get tricky if you train different models in parallel (only one of them wins the race to write the file).
Use model parameters in the filename
This feels intuitive until you add or remove parameters, and the noise in the filenames comes back to bite you.
Use a GUID!
They are incomprehensible and noisy. Also, it is impossible to tell from the filename which files are the most recent ones.
Use the pid in the filename
A pid can be reused, and it stays the same during a single interactive session, so the files get overwritten anyway. Moreover, there is no sense of sequence in these names either.
Finally, I wrote seqfile to solve the problem for me. It is a Python library that generates sequential filenames from a template. It takes care to confirm that it has actually created a file before returning the name, so that no two processes accidentally end up writing to the same file. If used with the suffix options, it is smart about jumping over gaps in the sequence. If the filenames are more nuanced, a function can be supplied to generate them instead.
The API was inspired by the tempfile library.
At the end of my script, I use the following snippet:
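The exact call depends on seqfile's API, so here is a self-contained sketch of the same idea using only the standard library; the names next_free_file and results are illustrative, not part of seqfile:

```python
import os
import pickle
import tempfile

def next_free_file(prefix, suffix, folder="."):
    """Return (fd, name) for the next free name like results0.pkl,
    results1.pkl, ...  os.open with O_CREAT | O_EXCL is atomic: it
    fails if the file already exists, so two parallel sessions can
    never claim the same name."""
    i = 0
    while True:
        fname = os.path.join(folder, "%s%d%s" % (prefix, i, suffix))
        try:
            return os.open(fname, os.O_CREAT | os.O_EXCL | os.O_WRONLY), fname
        except FileExistsError:
            i += 1

# Hypothetical results from a model-fitting run.
results = {"alpha": 0.1, "loss": 0.42}

folder = tempfile.mkdtemp()
fd, fname = next_free_file("results", ".pkl", folder)
with os.fdopen(fd, "wb") as f:
    pickle.dump(results, f)

# A second run in parallel simply claims the next name in the sequence.
fd2, fname2 = next_free_file("results", ".pkl", folder)
os.close(fd2)
```
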
Thereafter, I keep changing the parameters in the code and re-running the analysis in IPython:
In : run -i my_models.py
This keeps saving my models as I tweak the parameters in the script, even if I run them in parallel across different screens.
Note that the O_CREAT | O_EXCL trick this library uses to create files may fail on old Linux kernels when writing to NFS.
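For reference, the trick itself is just an exclusive create: the check-if-absent and the creation happen atomically in the kernel on local filesystems. A minimal demonstration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "claim.pkl")

# The first exclusive create succeeds ...
fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
os.close(fd)

# ... and a second one on the same path fails, because O_EXCL
# refuses to open a file that already exists.
try:
    os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    claimed_twice = True
except FileExistsError:
    claimed_twice = False
```
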