Splicer is intended to be run on a Pydra cluster. These are instructions and notes on running it and dealing with Pydra’s immaturity.
Components of Splicer can also be run manually from the command line.
Slow FTP Issues
The FTP server for PDB files is a very slow, rate-limited server located in the UK. The PDB files currently total 1.8 gigabytes for the 16,000 proteins in PGD, so downloading this much data from the FTP server can take a long time. This is handled in two ways:
Maintaining Connections Between Workunits
Each workunit consists of downloading and processing a PDB file. Rather than disconnecting from the FTP server after each workunit, connections are maintained until the last workunit is completed. This removes the overhead of connecting to and disconnecting from the server.
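The connection reuse described above can be sketched roughly as follows. This is a minimal illustration using Python's standard `ftplib`, not Splicer's actual code: the `PdbFetcher` class, its method names, and the `ftp_factory` parameter are all hypothetical, and the host is left as a parameter.

```python
from ftplib import FTP


class PdbFetcher:
    """Hypothetical helper that keeps one FTP session open across workunits."""

    def __init__(self, host, ftp_factory=FTP):
        self.host = host
        self._ftp_factory = ftp_factory  # injectable so it can be faked in tests
        self._ftp = None

    def connection(self):
        # Connect lazily for the first workunit, then reuse the same session.
        if self._ftp is None:
            self._ftp = self._ftp_factory(self.host)
            self._ftp.login()  # anonymous login
        return self._ftp

    def fetch(self, remote_path, local_path):
        # Download one file over the shared connection.
        ftp = self.connection()
        with open(local_path, "wb") as out:
            ftp.retrbinary("RETR " + remote_path, out.write)

    def close(self):
        # Call once, after the last workunit has completed.
        if self._ftp is not None:
            self._ftp.quit()
            self._ftp = None
```

Each workunit would call `fetch()`, and only the code that finishes the task would call `close()`, so the connect and login cost is paid once rather than once per file.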
Only Downloading New Files
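Before a file is downloaded, its modification date on the server is checked, so files that have not changed are skipped. Checking dates is very fast; the MODTIME command completes almost instantly. This prevents unneeded downloading.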
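A date check of this kind might look like the sketch below. It assumes the check maps onto the standard FTP `MDTM` command, sent through `ftplib`'s `sendcmd`. The function names `remote_mtime` and `needs_download` are illustrative and are not taken from Splicer.

```python
import os
from datetime import datetime, timezone


def remote_mtime(ftp, remote_path):
    """Return the server-side modification time as a UTC timestamp."""
    # An MDTM reply looks like "213 20100315123045"; some servers append
    # fractional seconds, so only the first 14 digits are parsed.
    reply = ftp.sendcmd("MDTM " + remote_path)
    stamp = reply.split()[1][:14]
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def needs_download(ftp, remote_path, local_path):
    """True when there is no local copy or the remote file is newer."""
    if not os.path.exists(local_path):
        return True
    return remote_mtime(ftp, remote_path) > os.path.getmtime(local_path)
```

Only a short control-channel exchange is needed per file, which is why a run over files that are already up to date finishes so quickly.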
Workunit Thrashing Problem
There is an outstanding bug in Pydra that causes the node to crash when workunits complete too quickly. Splicer includes an option to batch process proteins to ensure that this does not happen. Eventually, batching workunits will be an automatic feature of Pydra.
When re-running the task on Pydra, it is important to increase the workunit size to at least 500-1000. Because the date checks are very fast, a repeat run cycles through the existing proteins very quickly.
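The idea behind batching can be illustrated with a small sketch. This is not Splicer's actual option or implementation: the function name `batch_workunits` and its `size` parameter are hypothetical, and the default of 500 simply mirrors the lower bound suggested above.

```python
def batch_workunits(protein_ids, size=500):
    """Yield lists of at most `size` protein IDs, one list per workunit."""
    if size < 1:
        raise ValueError("size must be at least 1")
    protein_ids = list(protein_ids)
    # Slice the full list into consecutive chunks; the last may be shorter.
    for start in range(0, len(protein_ids), size):
        yield protein_ids[start:start + size]
```

With larger batches each workunit takes longer to complete, even when every protein in it only needs a quick date check, which keeps workunits from finishing fast enough to trigger the crash.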
Pydra logs most things that happen within it. A full task history can be viewed by clicking the history icon found on the Pydra tasks page. Clicking on a task instance gives you more details about the task, including which workunits were successful and what their arguments were.
Workunits are logged individually, and their logs are located in /var/logs/pydra/archive. The logs are aggregated from the nodes after the entire task is done.