Command Usage

Pull Into Place (PIP) is a protocol to design protein functional groups with
sub-angstrom accuracy.  The protocol is based on two ideas: 1) using restraints
to define the geometry you're trying to design and 2) using unrestrained
simulations to test designs.

Usage:
    pull_into_place <command> [<args>...]
    pull_into_place --version
    pull_into_place --help

Arguments:
    <command>
        The name of the command you want to run.  You only need to specify
        enough of the name to be unique.  Broadly speaking, there are two
        categories of scripts.  The first are part of the main design pipeline.
        These are prefixed with numbers so that you know the order to run them
        in.  The second are helper scripts and are not prefixed.

        01_setup_workspace                     cache_models
        02_setup_model_fragments               count_models
        03_build_models                        fetch_and_cache_models
        04_pick_models_to_design               fetch_data
        05_design_models                       make_web_logo
        06_manually_pick_designs_to_validate   plot_funnels
        06_pick_designs_to_validate            push_data
        07_setup_design_fragments
        08_validate_designs
        09_compare_best_designs

    <args>...
        The necessary arguments depend on the command being run.  For more
        information, pass the '--help' flag to the command you want to run.

Options:
    -v, --version
        Display the version of PIP that's installed.

    -h, --help
        Display this help message.

PIP's design pipeline has the following steps:

1. Define your project.  This entails creating an input PDB file and preparing
   it for use with Rosetta, creating a restraints file that specifies your
   desired geometry, creating a resfile that specifies which residues are
   allowed to design, and creating a loop file that specifies where backbone
   flexibility will be considered.

   $ pull_into_place 01_setup_workspace ...

2. Build a large number of models that plausibly support your desired geometry
   by running flexible backbone Monte Carlo simulations restrained to stay near
   said geometry.  The goal is to strike a balance between models that are
   realistic and models that satisfy your restraints.

   $ pull_into_place 02_setup_model_fragments ...
   $ pull_into_place 03_build_models ...

3. Filter out models that don't meet your quality criteria.

   $ pull_into_place 04_pick_models_to_design ...

4. Generate a number of designs for each model remaining.

   $ pull_into_place 05_design_models ...

5. Pick a small number of designs to validate.  Typically I generate 100,000
   designs and can only validate 50-100.  I've found that randomly picking
   designs according to the Boltzmann weight of their Rosetta score gives a
   nice mix of designs that are good but not too homogeneous.

   $ pull_into_place 06_pick_designs_to_validate ...
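
   The Boltzmann-weighted picking described here can be sketched in a few
   lines of Python.  This is a hypothetical illustration, not PIP's actual
   code; the design names and scores below are made up.

```python
import math
import random

def boltzmann_pick(designs, scores, n, temp=2.0, seed=None):
    """Pick n distinct designs, favoring low Rosetta scores.

    Each design is weighted by exp(-score / temp), so low-scoring
    designs are picked more often, but high-scoring ones still have
    a chance, which keeps the picked set from being too homogeneous.
    """
    rng = random.Random(seed)
    # Shift by the minimum score so the exponentials don't overflow.
    lo = min(scores)
    pool = {d: math.exp(-(s - lo) / temp) for d, s in zip(designs, scores)}
    picked = []
    while pool and len(picked) < n:
        names = list(pool)
        choice = rng.choices(names, weights=[pool[d] for d in names], k=1)[0]
        picked.append(choice)
        del pool[choice]  # pick without replacement
    return picked

designs = ["design_%d" % i for i in range(10)]
scores = [-320.0, -318.5, -310.2, -305.0, -301.7,
          -298.4, -295.1, -290.9, -288.3, -280.0]
picks = boltzmann_pick(designs, scores, n=3, seed=0)
```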

6. Validate the designs using unrestrained Monte Carlo simulations.  Designs
   that are "successful" will have a funnel on the left side of their score vs
   rmsd plots.

   $ pull_into_place 07_setup_design_fragments ...
   $ pull_into_place 08_validate_designs ...

7. Optionally take the decoys with the best geometry from the validation run
   (even if they didn't score well) and feed them back into step 3.  Second and
   third rounds of simulation usually produce much better results than the
   first, because the models being designed are more realistic.  Additional
   rounds of simulation give diminishing returns, and may be more affected by
   some of Rosetta's pathologies (e.g. its preference for aromatic residues).

   $ pull_into_place 04_pick_models_to_design ...
   $ pull_into_place 05_design_models ...
   $ pull_into_place 06_pick_designs_to_validate ...
   $ pull_into_place 07_setup_design_fragments ...
   $ pull_into_place 08_validate_designs ...

8. Generate a report summarizing a variety of quality metrics for each design.
   This report is meant to help you pick designs to test experimentally.

   $ pull_into_place 09_compare_best_designs ...

Step 1: Setup workspace

Query the user for all the input data needed for a design.  This includes a
starting PDB file, the backbone regions that will be remodeled, the residues
that will be allowed to design, and more.  A brief description of each field is
given below.  This information is used to build a workspace for this design
that will be used by the rest of the scripts in this pipeline.

Usage:
    pull_into_place 01_setup_workspace <workspace> [--remote] [--overwrite]

Options:
    --remote, -r
        Set up a link to a design directory on a remote machine, to help with
        transferring data between a workstation and a cluster.  Note: the
        remote and local design directories must have the same name.

    --overwrite, -o
        If a design with the given name already exists, remove it and replace
        it with the new design created by this script.

Step 2: Setup model fragments

Generate fragments for the initial model building simulations.  Note that it's
a little bit weird to use fragments even though the models are allowed to
design in these simulations.  Conformations that are common for the current
sequence but rare for the original one might not get sampled.  However, we
believe that the improved sampling that fragments offer outweighs this
potential drawback.

Usage:
    pull_into_place 02_setup_model_fragments <workspace> [options]

Options:
    -L, --ignore-loop-file
        Generate fragments for the entire input structure, not just for the
        region that will be remodeled as specified in the loop file.  This is
        currently necessary only if multiple loops are being remodeled.

    -m, --mem-free=MEM  [default: 2]
        The amount of memory (GB) to request from the cluster.  Bigger systems
        may need more memory, but making large memory requests can make jobs
        take much longer to come off the queue (since there may only be a few
        nodes with enough memory to meet the request).

    -d, --dry-run
        Print out the command-line that would be used to generate fragments,
        but don't actually run it.

Step 3: Build models

Build models satisfying the design goal.  Only the regions of backbone
specified by the loop file are allowed to move and only the residues specified
in the resfile are allowed to design.  The design goal is embodied by the
restraints specified in the restraints file.

Usage:
    pull_into_place 03_build_models <workspace> [options]

Options:
    --nstruct NUM, -n NUM   [default: 10000]
        The number of jobs to run.  The more backbones are generated here, the
        better the rest of the pipeline will work.  With too few backbones, you
        can run into a lot of issues with degenerate designs.

    --max-runtime TIME      [default: 12:00:00]
        The runtime limit for each model building job.

    --max-memory MEM        [default: 1G]
        The memory limit for each model building job.

    --test-run
        Run on the short queue with a limited number of iterations.  This
        option automatically clears old results.

    --clear
        Clear existing results before submitting new jobs.

Step 4: Pick models to design

Pick backbone models from the restrained loopmodel simulations to carry on
through the rest of the design pipeline.  The next step in the pipeline is to
search for the sequences that best stabilize these models.  Models can be
picked based on a number of criteria, including how well the model satisfies
the given restraints and how many buried unsatisfied H-bonds are present in
the model.  All of the criteria that can be used are described in the
"Queries" section below.

Usage:
    pull_into_place 04_pick_models_to_design [options]
        <workspace> <round> <queries>...

Options:
    --clear, -x
        Remove any previously selected "best" models.

    --recalc, -f
        Recalculate all the metrics that will be used to choose designs.

    --dry-run, -d
        Choose which models to pick, but don't actually make any symlinks.

Queries:
    The queries provided after the workspace name and round number are used to
    decide which models to carry forward and which to discard.  Any number of
    queries may be specified; only models that satisfy every query will be
    picked.  The query strings use the same syntax as the query() method of
    pandas DataFrame objects, which is pretty similar to Python syntax.
    Loosely speaking, each query must consist of a criterion name, a comparison
    operator, and a comparison value.  Only 5 criterion names are recognized:

    "restraint_dist"
        The average distance between all the restrained atoms and their target
        positions in a model.
    "loop_dist"
        The backbone RMSD of a model relative to the input structure.
    "buried_unsat_score"
        The change in the number of buried unsatisfied H-bonds in a model
        relative to the input structure.
    "dunbrack_score"
        The average Dunbrack score of any sidechains in a model that were
        restrained during the loopmodel simulation.
    "total_score"
        The total score of a model.

    Some example query strings:

    'restraint_dist < 0.6'
    'buried_unsat_score <= 4'
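
    In pandas these strings go straight to DataFrame.query(); the filtering
    semantics can be illustrated with a stdlib-only sketch (the metric values
    below are made up for illustration):

```python
# Hypothetical cached metrics for three models.
models = [
    {"id": "model_1", "restraint_dist": 0.4, "buried_unsat_score": 2},
    {"id": "model_2", "restraint_dist": 0.9, "buried_unsat_score": 1},
    {"id": "model_3", "restraint_dist": 0.5, "buried_unsat_score": 6},
]

# A model is carried forward only if it satisfies every query,
# here 'restraint_dist < 0.6' and 'buried_unsat_score <= 4'.
picked = [
    m["id"] for m in models
    if m["restraint_dist"] < 0.6 and m["buried_unsat_score"] <= 4
]
# picked == ["model_1"]
```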

Step 5: Design models

Find sequences that stabilize the backbone models built previously.  The same
resfile that was used for the model building step is used again for this step.
Note that the model building step already includes some design.  The purpose
of this step is to expand the number of designs for each backbone model.

Usage:
    pull_into_place 05_design_models <workspace> <round> [options]

Options:
    --nstruct NUM, -n NUM   [default: 100]
        The number of design jobs to run.

    --max-runtime TIME      [default: 0:30:00]
        The runtime limit for each design job.  The default value is
        set pretty low so that the short queue is available by default.  This
        should work fine more often than not, but you also shouldn't be
        surprised if you need to increase this.

    --max-memory MEM        [default: 1G]
        The memory limit for each design job.

    --test-run
        Run on the short queue with a limited number of iterations.  This
        option automatically clears old results.

    --clear
        Clear existing results before submitting new jobs.

Step 6: Pick designs to validate

Pick a set of designs to validate.  This is actually a rather challenging task
because so few designs can be validated.  Typically the decision is made based
on sequence identity and rosetta score.  It might be nice to add a clustering
component as well.

Usage:
    pull_into_place 06_pick_designs_to_validate
            <workspace> <round> [<queries>...] [options]

Options:
    --num NUM, -n NUM           [default: 50]
        The number of designs to pick.  The picking algorithm can get stuck
        and run for a long time if this is too close to the number of designs
        to pick from.

    --temp TEMP, -t TEMP        [default: 2.0]
        The parameter controlling how often low scoring designs are picked.

    --clear, -x
        Forget about any designs that were previously picked for validation.

    --recalc, -f
        Recalculate all the metrics that will be used to choose designs.

    --dry-run
        Don't actually fill in the input directory of the validation workspace.
        Instead just report how many designs would be picked.

Step 6’: Manually pick designs to validate

Manually provide designs to validate.

The command accepts any number of pdb files, which should already contain the
mutations you want to test.  These files are simply copied into the workspace
in question.  The files are copied (not linked) so they're less fragile and
easier to copy across the network.

Usage:
    pull_into_place 06_manually_pick_designs_to_validate [options]
        <workspace> <round> <pdbs>...

Options:
    --clear, -x
        Forget about any designs that were previously picked for validation.

Step 7: Setup design fragments

Generate fragments for the design validation simulations.  Each design has a
different sequence, so each input needs its own fragment library.  You can skip
this step if you don't plan to use fragments in your validation simulations,
but other algorithms may not perform as well on long loops.

Usage:
    pull_into_place 07_setup_design_fragments <workspace> <round> [options]

Options:
    -m, --mem-free=MEM  [default: 2]
        The amount of memory (GB) to request from the cluster.  Bigger systems
        may need more memory, but making large memory requests can make jobs
        take much longer to come off the queue (since there may only be a few
        nodes with enough memory to meet the request).

    -d, --dry-run
        Print out the command-line that would be used to generate fragments,
        but don't actually run it.

Step 8: Validate designs

Validate the designs by running unrestrained flexible backbone simulations.
Only regions of the backbone specified by the loop file are allowed to move.
The resfile used in the previous steps of the pipeline is not respected here;
all residues within 10A of the loop are allowed to pack.

Usage:
    pull_into_place 08_validate_designs <workspace> <round> [options]

Options:
    --nstruct NUM, -n NUM   [default: 500]
        The number of simulations to run per design.

    --max-runtime TIME      [default: 24:00:00]
        The runtime limit for each validation job.

    --max-memory MEM        [default: 1G]
        The memory limit for each validation job.

    --test-run
        Run on the short queue with a limited number of iterations.  This
        option automatically clears old results.

    --clear
        Clear existing results before submitting new jobs.

Step 9: Compare best designs

Create a nicely organized Excel spreadsheet comparing all of the validated
designs in the given workspace whose lowest scoring decoy falls within some
threshold of the target structure.

Usage:
    pull_into_place 09_compare_best_designs <workspace> [<round>] [options]

Options:
    -t, --threshold RESTRAINT_DIST   [default: 1.2]
        Only consider designs where the lowest scoring decoy has a restraint
        satisfaction distance less than the given threshold.

    -u, --structure-threshold LOOP_RMSD
        Limit how different two loops can be before they are placed in
        different clusters by the structural clustering algorithm.

    -q, --num-sequence-clusters NUM_CLUSTERS   [default: 0]
        Specify how many sequence clusters should be created.  If 0, the
        algorithm will try to detect on its own the number of clusters that
        best matches the data.

    -s, --subs-matrix NAME   [default: blosum80]
        Specify a substitution matrix to use for the sequence clustering
        metric.  Any name that is understood by biopython may be used.  This
        includes a lot of the BLOSUM and PAM matrices.

    -p, --prefix PREFIX
        Specify a prefix to prepend to all the files generated by this script.
        This is useful for distinguishing files generated by different runs.

    -v, --verbose
        Output sanity checks and debugging information for each calculation.
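
The --threshold filter amounts to grouping decoys by design, taking each
design's lowest scoring decoy, and keeping the design only if that decoy's
restraint distance is under the cutoff.  A stdlib sketch with made-up numbers
(this is an illustration of the filtering logic, not PIP's actual code):

```python
# (design, total_score, restraint_dist) for a handful of decoys.
decoys = [
    ("design_a", -310.0, 0.5), ("design_a", -305.0, 2.0),
    ("design_b", -312.0, 1.8), ("design_b", -300.0, 0.4),
]

def passing_designs(decoys, threshold=1.2):
    """Keep designs whose lowest scoring decoy has a restraint
    distance below the threshold."""
    best = {}
    for design, score, dist in decoys:
        # Remember each design's lowest scoring decoy.
        if design not in best or score < best[design][0]:
            best[design] = (score, dist)
    return sorted(d for d, (score, dist) in best.items() if dist < threshold)

# design_a's best decoy (-310.0) has restraint_dist 0.5 -> passes.
# design_b's best decoy (-312.0) has restraint_dist 1.8 -> fails.
passing = passing_designs(decoys)
```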

Cache models

Cache various distance and score metrics for each model in the given directory.
After being cached, a handful of these metrics are printed to the terminal to
show that things are working as expected.

Usage:
    pull_into_place cache_models <directory> [options]

Options:
    -r PATH, --restraints PATH
        Specify a restraints file that can be used to calculate the "restraint
        distance" metric.  If the directory specified above was created by the
        01_setup_workspace script, this flag is optional and will default to
        the restraints used in that pipeline.

    -f, --recalc
        Force the cache to be regenerated.

Count models

Count the number of models meeting the given query.

Usage:
    pull_into_place count_models <directories>... [options]

Options:
    --query QUERY, -q QUERY
        Specify which models to include in the count.

    --recalc, -f
        Recalculate all the metrics that will be used to choose designs.

    --restraints PATH
        The path to a set of restraints that can be used to recalculate the
        restraint_distance metric.  This is only necessary if the cache is
        being regenerated in a directory that is not a workspace.

Queries:
    The query string uses the same syntax as the query() method of pandas
    DataFrame objects, which is pretty similar to Python syntax.  Loosely
    speaking, each query must consist of a criterion name, a comparison
    operator, and a comparison value.  Only 5 criterion names are recognized:

    "restraint_dist"
        The average distance between all the restrained atoms and their target
        positions in a model.
    "loop_dist"
        The backbone RMSD of a model relative to the input structure.
    "buried_unsat_score"
        The change in the number of buried unsatisfied H-bonds in a model
        relative to the input structure.
    "dunbrack_score"
        The average Dunbrack score of any sidechains in a model that were
        restrained during the loopmodel simulation.
    "total_score"
        The total score of a model.

    Some example query strings:

    'restraint_dist < 0.6'
    'buried_unsat_score <= 4'

Fetch and cache models

Download models from a remote host then cache a number of distance and score
metrics for each one.  This script is meant to be called periodically during
long-running jobs, to reduce the amount of time you have to spend waiting to
build the cache at the end.

Usage:
    pull_into_place fetch_and_cache_models <directory> [options]

Options:
    --remote URL, -r URL
        Specify the URL to fetch data from.  You can put this value in a file
        called "rsync_url" in the local workspace if you don't want to specify
        it on the command-line every time.

    --include-logs, -i
        Fetch log files (i.e. stdout and stderr) in addition to everything
        else.  Note that these files are often quite large, so this may take
        significantly longer.

    --keep-going, -k
        Keep attempting to fetch and cache new models until you press Ctrl-C.
        You can run this command with this flag at the start of a long job, and
        it will incrementally cache new models as they are produced.

    --wait-time MINUTES, -w MINUTES     [default: 5]
        The amount of time to wait in between attempts to fetch and cache new
        models, if the --keep-going flag was given.

Fetch data

Copy design files from a remote source.  A common application is to copy
simulation results from the cluster to a workstation for analysis.  The given
directory must be contained within a workspace created by 01_setup_workspace.

Usage:
    pull_into_place fetch_data <directory> [options]

Options:
    --remote URL, -r URL
        Specify the URL to fetch data from.  You can put this value in a file
        called "rsync_url" in the local workspace if you don't want to specify
        it on the command-line every time.

    --include-logs, -i
        Fetch log files (i.e. stdout and stderr) in addition to everything
        else.  Note that these files are often quite large, so this may take
        significantly longer.

    --dry-run, -d
        Output the rsync command that would be used to fetch data.
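
Internally, fetch_data is a thin wrapper around rsync.  The sketch below shows
how such a command line might be assembled; the flags and exclude patterns are
assumptions for illustration, not taken from PIP's source:

```python
def build_fetch_command(remote_url, directory, include_logs=False,
                        dry_run=False):
    """Assemble a hypothetical rsync invocation for pulling a directory.

    The flags here (-avz, the *.out/*.err excludes) are illustrative
    guesses at what a tool like this might pass; PIP's real command
    may differ.
    """
    cmd = ["rsync", "-avz"]
    if dry_run:
        cmd.append("--dry-run")
    if not include_logs:
        # Log files are often large, so skip them unless asked for.
        cmd += ["--exclude=*.out", "--exclude=*.err"]
    cmd += ["%s/%s/" % (remote_url, directory), "%s/" % directory]
    return cmd

cmd = build_fetch_command("user@cluster:~/designs", "outputs")
```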

Plot funnels

Visualize the results from the loop modeling simulations in PIP and identify 
promising designs.

Usage:
    pull_into_place plot_funnels <pdb_directories>... [options]

Options:
    -F, --no-fork
        Do not fork into a background process.

    -f, --force
        Force the cache to be regenerated.

    -q, --quiet
        Build the cache, but don't launch the GUI.

This command launches a GUI designed to visualize the results for the loop 
modeling simulations in PIP and to help you identify promising designs.  To 
this end, the following features are supported:

1. Extract quality metrics from forward-folded models and plot them against 
   each other in any combination.

2. Easily visualize specific models by right-clicking on plotted points.  
   Add your own visualizations by writing `*.sho' scripts.

3. Plot multiple designs at once, for comparison purposes.

4. Keep notes on each design, and search your notes to find the designs you 
   want to visualize.

Generally, the only arguments you need are the names of one or more directories 
containing the PDB files you want to look at.  For example:

    $ ls -R
    .:
    design_1  design_2 ...

    ./design_1:
    model_1.pdb  model_2.pdb ...

    ./design_2:
    model_1.pdb  model_2.pdb ...

    $ pull_into_place plot_funnels design_*

This last command will launch the GUI.  If you specified more than one design 
on the command line, the GUI will have a panel on the left listing all the 
designs being compared.  You can control what is plotted by selecting one or 
more designs from this list.  The search bar at the top of this panel can be 
used to filter the list for designs that have the search term in their 
descriptions.  The buttons at the bottom can be used to save information about 
whatever designs are selected.  The "Save selected paths" button will save a 
text file listing the path to the lowest scoring model for each selected 
design.  The "Save selected funnels" button will save a PDF with the plot for 
each selected design on a separate page.

The upper right area of the GUI will contain a plot with different metrics on
the two axes where each point represents a single model.  You can right-click
on any point to take an action on the model represented by that point.
Usually this means visualizing the model in an external program, like PyMOL or
Chimera.  You can also run custom code by writing a script with the extension
*.sho that takes the path of a model as its only argument.  ``plot_funnels``
will search for scripts with this extension in every directory, starting with
the directory containing the model in question and walking up all the way to
the root of the file system.  Any scripts that are found are added to the menu
you get by right-clicking on a point, using simple rules (the first letter is
capitalized and underscores are converted to spaces) to convert the file name
into a menu item name.
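
The script discovery and menu naming described above could look something like
this sketch (a minimal stdlib illustration, not the GUI's actual code):

```python
from pathlib import Path

def find_sho_scripts(model_path):
    """Collect *.sho scripts from the model's directory up to the file
    system root, keyed by menu item name (first letter capitalized,
    underscores converted to spaces)."""
    start = Path(model_path).resolve().parent
    menu = {}
    for directory in [start, *start.parents]:
        for script in sorted(directory.glob("*.sho")):
            # e.g. "show_in_pymol.sho" -> "Show in pymol"
            name = script.stem.replace("_", " ")
            name = name[:1].upper() + name[1:]
            # Scripts closer to the model shadow ones found higher up.
            menu.setdefault(name, script)
    return menu
```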

The tool bar below the plot can be used to pan around, zoom in or out, save an 
image of the plot, or change the axes.  If the mouse is over the plot, its 
coordinates will be shown just to the right of these controls.  Below the plot 
is a text form which can be used to enter a description of the design.  These 
descriptions can be searched.  I like using the '+', '++', ... convention to 
rank designs so I can easily search for increasingly good designs.

Hotkeys:
    j,f,down: Select the next design, if there is one.
    k,d,up: Select the previous design, if there is one.
    i,a: Focus on the description form.
    z: Use the mouse to zoom on a rectangle.
    x: Use the mouse to pan (left-click) or zoom (right-click).
    c: Return to the original plot view.
    slash: Focus on the search bar.
    tab: Change the y-axis metric.
    space: Change the x-axis metric.
    escape: Unfocus the search and description forms.

Push data

Copy design files to a remote destination.  A common application is to copy
input files onto the cluster before starting big jobs.

Usage:
    pull_into_place push_data <directory> [options]

Options:
    --remote URL, -r URL
        Specify the URL to push data to.

    --dry-run, -d
        Output the rsync command that would be used to push data.