As a part of working on large data projects, I have settled into a kind of workflow and I started wondering where I can improve it.
Step 0: ETL
I get the data and convert it into a form I can read. Since I have machines with 1.5 TB of RAM at my disposal (re: MPI-SWS), this translates to SQLite for most datasets of ~500 GB size. My favourite tools for the job are csv2sqlite and some other related scripts.
Most of the scripts work on compressed versions (i.e. bz2) of the raw data files.
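As a rough illustration of this step (not how csv2sqlite itself works), a bz2-compressed CSV can be streamed into SQLite with nothing but the standard library. The function name, the table name, and the assumptions that the file has a header row and that every row has as many fields as the header are mine.

```python
import bz2
import csv
import sqlite3


def load_bz2_csv(csv_path, db_path, table):
    """Stream a bz2-compressed CSV (with a header row) into an SQLite table."""
    # Text mode decompresses on the fly, so the raw file never has to be
    # unpacked on disk.
    with bz2.open(csv_path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Quote identifiers so unusual column names do not break the SQL.
        columns = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, columns))
                # executemany consumes the reader lazily, row by row.
                conn.executemany(
                    'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
                )
        finally:
            conn.close()
```

The columns are created without declared types, so every value is stored as text; a real import would probably want typed columns and indexes.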
Step 1: Exploration
Then I explore the data a bit with matplotlib.
Step 2: Modeling
The task of modelling slowly takes form with preprocessing, sub-setting, model creation, model testing, and finally with running the models.
Depending on the kind of model, there may be other steps involved here. For example, figuring out the cross-validation strategy or creating synthetic data.
Step 2.1: Subsetting, Preprocessing and Model Creation
These go hand in hand because the model dictates what the intermediate results look like. I subset the data to make testing the models feasible, so that everything can be done in memory.
At various points, it becomes apparent that some pre-processing can save a bunch of time.
Here, I like to save data in mat form, usually as sparse arrays, because the format is interoperable between MATLAB and Python. However, if I move to pickle or even np.savetxt, I can reap the benefits of the library. This is something I can try to improve.
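For reference, a minimal sketch of the mat round trip using SciPy's savemat and loadmat. The matrix, the variable name features and the file name are made up for illustration.

```python
import os
import tempfile

from scipy import sparse
from scipy.io import loadmat, savemat

# A small random sparse matrix standing in for a preprocessed feature matrix.
features = sparse.random(1000, 50, density=0.01, format="csc", random_state=0)

# Hypothetical output location; any writable path works.
mat_path = os.path.join(tempfile.mkdtemp(), "features.mat")

# savemat takes a dict mapping variable names to values; SciPy sparse
# matrices are written out as MATLAB sparse arrays.
savemat(mat_path, {"features": features})

# loadmat returns a dict keyed by variable name; the sparse variable comes
# back as a SciPy sparse object rather than a dense array.
loaded = loadmat(mat_path)["features"]
print(sparse.issparse(loaded), loaded.shape)
```

On the MATLAB side the same file can be read with load, which is what makes the format convenient for sharing intermediate results.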
Step 2.2: Testing the model
At this point, the methodology changes a bit.
I move all the code the model needs to run to one single Jupyter cell, save it to a file and then make it configurable using argparse.
Then I invoke the file from Jupyter using the %run -i magic.
This gives me the best of both worlds. Placing code in a file is good for version control and for sharing the script with other collaborators. However, the script still has all its variables in the global scope, and %run -i lets me bring those into the current Jupyter session, inspect them and figure out what the internal state of the program was at the point of exit (either through an exception or normally).
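A toy version of such a script might look like the sketch below. The file name experiment.py, the two flags and the "model" (a sample mean) are placeholders; the point is only that everything lives at module level instead of inside a main() function.

```python
# experiment.py (hypothetical file name)
import argparse
import random

parser = argparse.ArgumentParser(description="Toy experiment", allow_abbrev=False)
parser.add_argument("--n-samples", type=int, default=1000)
parser.add_argument("--seed", type=int, default=0)
# parse_known_args ignores unrecognised arguments instead of exiting, which
# is convenient when the script is launched from different environments.
args, _unknown = parser.parse_known_args()

# Everything below is deliberately kept in the global scope so that it can
# be inspected from the notebook after the script exits.
rng = random.Random(args.seed)
samples = [rng.gauss(0, 1) for _ in range(args.n_samples)]
sample_mean = sum(samples) / len(samples)
print("mean of {} samples: {:.4f}".format(len(samples), sample_mean))
```

Running %run -i experiment.py --n-samples 5000 in a notebook cell leaves args, samples and sample_mean defined in the session, ready to be inspected in the next cell.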
Hence, using a persistence framework like sacred is out of the question: I explicitly want all the local variables available in Jupyter for debugging, whereas sacred requires the code to live inside a function.
Here, to persist data, I use seqfile which lets me save files from different processes and even different clusters on an NFS without worrying about losing data.
Moreover, since I usually run the script with %run -i, I usually don’t have to read the data from the disk again and can carry straight on to the Results phase.
Nevertheless, if I have to look at a previous experiment’s results, I only have to load the last file saved by seqfile and look into it.
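The underlying idea can be sketched with the standard library alone. This is not seqfile's API or implementation, just an illustration of claiming the next free sequential file name atomically; the function names, the prefix and the suffix are made up, and whether exclusive creation is reliable on a particular NFS setup is something I am assuming here, not guaranteeing.

```python
import os


def claim_next_file(folder, prefix="result-", suffix=".pkl", max_tries=100000):
    """Create and return the first unused path <prefix><N><suffix> in folder."""
    for index in range(max_tries):
        path = os.path.join(folder, "{}{:05d}{}".format(prefix, index, suffix))
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so on a
            # filesystem that honours O_EXCL two processes cannot both
            # claim the same name.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # somebody else owns this index, try the next one
        os.close(fd)
        return path
    raise RuntimeError("no free file name found")


def latest_file(folder, prefix="result-", suffix=".pkl"):
    """Return the highest-numbered claimed path, or None if there is none."""
    names = sorted(
        name for name in os.listdir(folder)
        if name.startswith(prefix) and name.endswith(suffix)
    )
    return os.path.join(folder, names[-1]) if names else None
```

Each run calls claim_next_file once and writes its results to the returned path; latest_file then gives the file to load when revisiting the last experiment. The zero-padded index keeps lexicographic and numeric order in agreement.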
One thing I usually do run into is that reading local variables from a file clobbers some variables that I have already declared in the Jupyter notebook, and if I restart the kernel and run only some blocks, I usually lose track of which variables are declared. A Notebook extension which lists all variables in the current global environment (like the Workspace pane in MATLAB) would help me here.
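Short of a proper extension, IPython's built-in %whos magic prints a table of the variables in the interactive namespace. A rough hand-rolled equivalent, with a hypothetical helper name and a hard-coded list of IPython-injected names to skip, could look like this:

```python
import types

# Names IPython injects into the session; hard-coded here as an assumption.
IPYTHON_NAMES = {"In", "Out", "get_ipython", "exit", "quit"}


def list_workspace(namespace):
    """Return {name: type name} for user-level variables in a namespace dict."""
    hidden = (types.ModuleType, types.FunctionType, types.BuiltinFunctionType, type)
    return {
        name: type(value).__name__
        for name, value in sorted(namespace.items())
        if not name.startswith("_")
        and name not in IPYTHON_NAMES
        and not isinstance(value, hidden)  # skip modules, functions, classes
    }
```

Calling list_workspace(globals()) in a cell before and after a %run -i shows which names the script added, although it cannot tell which existing ones were overwritten.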
There is also the inverse problem of having undefined variables in the script which are conveniently supplied by the Jupyter notebook when it is run from inside a Notebook. Using Syntastic helps with weeding out this class of problems.
Step 3: Results
Finally, to analyse and process the results, I again rely on the Jupyter notebook, albeit a new one. The new notebook only runs the scripts as %run -i script.py --args in individual cells and receives the results loaded in a local variable. Plotting is again done using matplotlib
and the results are saved as