January 2015 Introduction to Python coding workshop

Introduction to coding in Python.

A workshop for Feng Lab Group meeting.

Slides link.

A Google Doc for sharing today

3:40am-5:30pm, January 15th, 2015. (1st session)

This is 1st session in series; see second session info here, and third, hands-on session info here

Getting started

Using this Web site

Reload to get the latest links!

A Google Doc for sharing today

Preparation

Read

Interactive notebooks: Sharing the code by Helen Shen. Nature. 2014 Nov 6;515(7525):151-2. doi: 10.1038/515151a. PMID: 25373681

UPDATE: now I’d probably add Programming: Pick up Python by Jeffrey M. Perkel. Nature. 2015 February 5;518:125–6.doi:10.1038/518125a. PMID:25653001 too

Tech prep

Be sure you have a modern, updated browser on your system. Preferably Chrome or Firefox.

Register and do the follow-up activation at SourceLair.

Register for Sagemath Cloud.

Be sure to have a good text editor on your computer. Sounds like you may have been using AquaMac in the past and so this shouldn’t be a problem. I highly recommend Sublime Text. However, for what we’ll be doing Thursday, even TextWrangler on a Mac will be sufficient. For those not on a Mac, I’d recommend Sublime Text or Notepad++ or jEdit.

Intro to technology

We’ll use as a group two technologies today.

The idea for using cloud-based tools is to make it easier upfront to get coding and then you can modify what you use as you develop your coding workflow preferences. (Sorry for needing two, but finding a good interface that has all the features desired and works on the Upstate network is not easy.)

Why Python

Slides

Slides link

The beginning is covering slides up to Where to code and run Python?

Where to code and run Python

On Your Desktop - Asterisks indicates those in this section I have used and they are in order of preference/familiarty in this particular section

In Your Browser/Cloud - Asterisks indicates those in this section I have used

  • Launch sessions via MyBinder.org Example can be launched from: Here and BLAST will work along with Python. Sessions based on R available elsewhere.
  • PythonAnywhere* (Others’ recent views at the Python subreddit)
  • Domino Data Lab* - enables cloud-based Jupyter Notebook (formerly called IPython Notebook) work along with many other languages and functionality.
  • Sagemath Cloud Jupyter Notebook (formerly called IPython Notebook) work An introduction can be found here
  • Azure notebooks
  • Wakari.io - Web-based Python Data Analysis* - Unfortunately, the IPython notebook aspect doesn’t seem to work when on Upstate’s network, but you can still use Python there.
  • Code with Mu: a simple Python editor for beginner programmers (Found in 2019. I don’t know if it is emulator or actual python.)
  • Trinket
  • ScienceBox - lacks free version as far as I know. All above have one.
  • Amazon Web Services*- has a free low level one for first year. All above ScienceBox have free version.
  • DigitalOcean - lacks free version as far as I know. All above ScienceBox have free version.
  • Heroku - lacks free version as far as I know. All above ScienceBox have free version.
  • Rackspace - lacks free version as far as I know. All above ScienceBox have free version.
  • Google App Engine - I am unsure of cost here. Try Google Cloud Platform free for 60 days
  • SourceLair* <– NO LONGER FREE

for installing IPython Notebook handling on your machine

from Exploratory Computing with Python

The three main options are Canopy Express (Mac, Windows, Linux), Anaconda (Mac, Windows, Linux), and PythonXY (Windows).

Python for today

Getting started, we’ll use Sourcelair and Sagemath Cloud

SourceLair

Guide to Python on Sourcelair

Sagemath Cloud enables cloud-based IPython notebook work. An introduction can be found [here](

Sagemath Cloud use is for the IPython Notebook ability

SourceLair, although cloud-based, is like traditional computing interfaces for Python and many languages.

We will come back to that.

Using Sourcelair to run Python Interactively

We will begin to cover

  • Basics of Python ( TO BE DONE Link to an ipynb with examples in it already)
    • variable types
      • strings
      • integers
      • floats
      • booleans
    • lists and dictionaries
    • slicing strings and indexing items in lists
    • math and comparative operators
    • conditionals (Slide of comparison from PCfB book, pg. 116?)
      • if
      • if-else
    • loops (Slide of comparison from PCfB book, pg. 117?)
      • for
      • while
    • type conversion
    • functions
    • file and string handling

We’ll break it up with some real world examples

Real World Examples I: Running Other’s Python code

Methods sections are good for finding what others used and then now you can use

NGS Analysis with Python

Once you can run Python, you can use others’ code

Github and materials and methods are great resources as well.

Example 1

Annotate: Annotation of single-nucleotide variants in the yeast genome

Alright let’s see if we can run this.

What so we need?

Looking at the excerpt from the documentation below starts to give you a feel for what you need.

=Required Files=
Four files are required by Annotate:

1) A BED-formatted file containing mutations. The first five columns of this file must be

chr start   stop    ref obs

which are the chromosome of the mutation, the start and stop positions (which can be the same number, for a single-base mutation), the reference allele at that position, and the observed allele (i.e., the mutation). Mutations are then listed one per line. Extra information can be included in the columns to the right of these. Mutation start and stop positions are 1-indexed.

2) A FASTA file containing the reference genome sequence, in which each chromosome is its own sequence.

3) A FASTA file containing the reference coding sequences, in which each gene is its own sequence. Currently, the positional information for each gene is parsed from the headers of this file. As such, it is very important to use the same file as is provided by SGD or with this software.

4) A .GFF-formatted file containing functional non-coding genomic regions.

Files 2-4 are supplied with Annotate. The most up-to-date versions of these files can be found at the Saccharomyces Genome Database (http://www.yeastgenome.org/download-data/sequence). These files are also based on the most recent S.cerevisiae genome sequence (Saccer3). As such, mutations in the file 1 should be based on Saccer3; mutations called in reference to Saccer2 or a different genome assembly will yield incorrect results.


==Running Annotate==
Annotate is run by executing the annotate.py script (as below), which includes passing some required options that direct the script to the files listed above.

python annotate.py --mutations FILE1 --genome FILE2 --coding FILE3 --non-coding FILE4

Depending on the processing speed of your system, Annotate takes about 60 seconds to process 1000 mutations.


==Required Options==
-m, --mutations : BED file containing mutations
-g, --genome : FASTA file containing genome sequence
-c, --coding: FASTA file of coding sequences
-n, --non-coding: GFF file containing non-coding regions


==Optional Options==
-h, --help : Print all options, usage information
--upstream : Distance (in bp) upstream of gene start codons to include in 5'-upstream annotation (default=500)

The limit of dragging and dropping from a local file system up to SourceLair is file size up to 5 Mb, therefor use the terminal within SourceLair to directly download the archived files to SourceLair.

There are two popular commands to do this, curl and wget. Both are illustrated, but you just need to do one. Additionally the command to unpack the arcvhive is included with each. Note that despite the extension indicating it is a tar.gz or gzipped tarball, it appears to only be a .tar archive and so you just need the tar command to extract.

curl http://depts.washington.edu/sfields/software/annotate/src/Annotate-0.1.tar.gz > Annotate.tar
tar -xf Annotate.tar

-OR-

wget http://depts.washington.edu/sfields/software/annotate/src/Annotate-0.1.tar.gz
tar -xf Annotate-0.1.tar.gz

Navigate into the unpacked directory now.

cd Annotate-0.1

You’ll see from the documentation that 3 of the four files needed to run this script on the yeast genome are included in the archive with the original source. Now that you have the script and the accompanying files unpacked, you just need some data listing mutations.

I am supplying a BED file containing mutations, called mutation_data.bed.

The wget command you need to run on the SourceLair command line terminal to download the file is given below. You could of course use curl if you prefer and can use the above command where it was used as a guide to writing it. It is important here that the resulting file have the name mutation_data.bed or be prepared to modify the commands illustrated below accordingly. (Alternatively you could click the name above and download it locally and then drag into the SoureLair file pane since it is a small file.)

wget https://gist.githubusercontent.com/fomightez/9c435b0f18bf659a4669/raw/54d514b1fa9ce57ec46c5527fbd1eaf3236943e0/mutation_data.bed

Unless you have previously installed the Biopython package in your current SourceLair project, you’ll get an error if you try to run the annotate.py script at this point.

wayne461@scripts:/mnt/project/Annotate-0.1$ python annotate.py
Traceback (most recent call last):
  File "annotate.py", line 242, in <module>
    from Bio import SeqIO
ImportError: No module named Bio

You simply need to install the needed package to your SourceLair project. (Packages or modules (sometimes called libraries) are simply code by others that provide useful functions and abilities that go beyond the bare bones Python and are generally specific for certain sorts of tasks. Not having them as part of bare bones Python cuts down on use of unnecessary resources. They generally need to be installed or put some place Python will look for them, and then you can import them at any time to use them. Many Python distributions include several packages are common. For example you can see all the modules/packages that come standard with PythonAnywhere here. Any distribution of Python will include a way of installing additional packages. For example, PythonAnywhere’s instructions are here. SourceLair’s are here.)

At the command line terminal of your SourceLair project, type the following to install the Biopython package.

pip install biopython

Now we should be set to run the annotate.py script.

Looks like from the documentation the command would be …

python annotate.py --mutations mutation_data.bed --genome S288C_reference_sequence_R64-1-1_20110203.fsa --coding orf_coding_all_R64-1-1_20110203.fasta --non-coding saccharomyces_cerevisiae_R64-1-1_20110208.gff.filtered

(Be careful to get it all as it is long and scrolls off to the right.)

Running that unexpectedly FAILS?!?!?!

Looks good according to documentation. What is going on?

Look at annotate.py file some. Specifically the docstring at the top and the text about arguments at the very end. Two of the arguments don’t match what the documentation says are the flags signalling them. The code is going to run what is in the code; it doesn’t know about the documentation page.

Let’s try that command again modifying it to match what the script itself says.

python annotate.py --input mutation_data.bed --genome S288C_reference_sequence_R64-1-1_20110203.fsa --sequences orf_coding_all_R64-1-1_20110203.fasta --non-coding saccharomyces_cerevisiae_R64-1-1_20110208.gff.filtered

To view the result you can type the following. (Alternatively you can double-click on the file mutation_data.bed.annotated in SourceLair’s file navigation panel. You may first need to reload the browser page to even see it listed in the file navigation pane though.)

head mutation_data.bed.annotated

You should see a breakdown of the possible impacts of each of the mutations listed in the provided input file.

Additional Note

Note the example mutation data on the main page describing the package seems unrelated to yeasts.

Example data listed as:

chr1   213941196  213941196  A  G
chr10  942363     942363     C  G

Although the example data included with source and discussed in the documentation is for yeast S. cerevisiae, the example data listed on the documentation cannot be yeast. Chromosome 1 of S. cerevisiae is only 230218 bp http://www.genome.jp/dbget-bin/www_bget?refseq+NC_001133

Chromsome X of S. cerevisiae is only 745751 bp http://www.ncbi.nlm.nih.gov/nuccore/BK006943

So both are out of the size range for the chromosomes listed and throws errors if you try to use those values with the data that comes with the download of the source file. The specific error is IndexError: list assignment index out of range.

Example 2

This one will be non-interactive.

HTSeq

see manuscript at http://biorxiv.org/content/biorxiv/early/2014/02/20/002824.full.pdf

Simon Anders, Paul Theodor Pyl, Wolfgang Huber HTSeq — A Python framework to work with high-throughput sequencing data Bioinformatics (2014), in print, online at doi:10.1093/bioinformatics/btu638

While the main purpose of HTSeq is to allow you to write your own analysis scripts, customized to your needs, there are also a couple of stand-alone scripts for common tasks that can be used without any Python knowledge. See the Scripts section in the overview below for what is available.

Real World Examples II: Using APIs

Plotly

Plotly

This will be a non-interactive demo of a script running on Domino Data Lab.

  1. Look at date and stats of plot here
  2. Script will be run.
  3. Look at date and stats of plot here

Additional help with plotting biological data via plotly - Exploratory bioinformatics with plot.ly and IPython notebook: Visualizing gene expression data (That notebook on github.)

Yeastmine

YeastMine has a Python web service API

More information about YeastMine:

I’ll demo Query Builder and where to get Python code fragments for designed queries.

Example of integrating YeastMine code fragments into your work

Plan:

Use Yeastmine to convert information into a table into more useful form.

Initial steps will be non-interactive in the interest of time.

Those steps will produce

table_4_gene_list = ["YBL091C-A", "YBL059W", "YBR090C", "YBR186W", "YBR219C", "YBR230C", "YCL005W-A_1", "YCL005W-A_2", "YCR028C-A", "YCR097W_2", "YDL219W", "YDL189W", "YDL137W", "YDL125C", "YDL082W", "YDL079C", "YDL064W", "YDR059C", "YDR099W", "YDR305C", "YDR318W", "YDR367W", "YDR381W", "YDR381C-A", "YDR535C", "YER003C", "YER007C-A", "YER014C-A", "YER044C-A", "YER131W", "YER179W", "YFL039C", "YFL034C-B", "YFL031W", "YFR045W", "YGL251C", "YGL187C", "YGL183C", "YGL033W", "YGR029W", "YGR183C", "YGR225W", "YHR012W", "YHR039C-A", "YHR041C", "YHR079C-A", "YHR123W", "YHR141C", "YHR218W", "YIL148W", "YIL111W", "YIL073C", "YIL004C", "YJL189W", "YJL041W", "YJL031C", "YJL024C", "YJR079W", "YJR094W-A", "YJR112W-A", "YKL006C-A", "YKR005C", "YLL050C", "YLR054C", "YLR078C", "YLR128W", "YLR199C", "YLR202C", "YLR211C", "YLR275W", "YLR333C", "YLR445W", "YML085C", "YML067C", "YML036W", "YML025C", "YML024W", "YML017W", "YMR194C-B", "YMR242C", "YMR292W", "YNL312W", "YNL138W-A", "YNL130C", "YNL066W", "YNL050C", "YNL044W", "YNR053C", "YOL047C", "snR17A", "YOR318C", "YPL241C", "YPL230W", "snR17B", "YPR010C-A", "YPR153W"]

Now replace the example list in the code below and run the code of finding_genes_in_list_with_SGD_Systematic_Name.py found below.

#!/usr/bin/env python

## USAGE: TAKES A LIST OF GENES PROVIDED IN THE LONG SGD SYSTEMATIC NAME FORM
## AND COLLECTS MORE USER FRIENDLY VERSION OF NAME AND INFORMATION FOR EACH GENE.

## Example input:
## ["YPR187W", "YPR202W"]

## Example output:
## S000006391 YPR187W RPO26 RNA POlymerase S. cerevisiae Rpo26p RPB6 ABC23 ORF RNA polymerase subunit ABC23; common to RNA polymerases I, II, and III; part of central core; similar to bacterial omega subunit
## S000006406 YPR202W None None S. cerevisiae None None ORF Putative protein of unknown function; similar to telomere-encoded helicases; down-regulated at low calcium levels; YPR202W is not an essential gene; transcript is predicted to be spliced but there is no evidence that it is spliced in vivo

# See the README.txt for this script at the link below for more information:
# https://github.com/fomightez/yeastmine

## IMPETUS FOR THIS SCRIPT:
## Kawashima et al. 2014 [http://www.ncbi.nlm.nih.gov/pubmed/24722551] HAD
## GOOD-SIZED GENE LISTS WITH SYSTEMATIC NAMES AS PART OF SOME TABLES AND I
# WANTED THE LIST IN A FORM THAT IS MORE INFORMATIVE AND HUMAN-READABLE.
##
## LATER ADDED THE CONCEPT OF BEING ABLE TO ADD FAVORITE GENES.
## CURRENTLY FAVORITE GENES USE THE SGD 'STANDARD NAME' BECAUSE THAT IS
## HOW I USUALLY TRACK THEM BUT YOU CAN CHANGE THAT BE PUTTING THEM IN THE
## FORM YOU'D LIKE AND ADJUSTING THE CONDITIONAL THAT CHECKS THEM AGAINST THE
## SGD GENE LIST.


list_to_get_info_for = ["YPR063C", "YPR098C", "YPR132W", "YPR170W-B", "YPR187W", "YPR202W"]

#OPTIONAL - SEE BELOW
#my_favorite_genes = ["NMD2", "MUD1", "TAN1"]

# The following two lines will be needed in every python script:
import intermine
from intermine.webservice import Service
service = Service("http://yeastmine.yeastgenome.org/yeastmine/service")

# Get a new query on the class (table) you will be querying:
query = service.new_query("Gene")

# The view specifies the output columns
query.add_view(
    "primaryIdentifier", "secondaryIdentifier", "symbol", "name",
    "organism.shortName", "proteins.symbol",  "sgdAlias", "featureType", "description"
)

# This query's custom sort order is specified below:
query.add_sort_order("Gene.secondaryIdentifier", "ASC")


print "primaryIdentifier\tsecondaryIdentifier\tsymbol\tname\torganism.shortName\tproteins.symbol\tsgdAlias\tfeatureType\tdescription"


for row in query.rows():
    if row["secondaryIdentifier"] in list_to_get_info_for:
    #LIST OF FAVORITE GENES AND ADD AN 'AND' CONDITION TO ABOVE LINE TO LIMIT TO YOUR FAVORITE GENES, like so:
    #if (row["secondaryIdentifier"] in list_to_get_info_for) & (row["symbol"] in my_favorite_genes):
        print row["primaryIdentifier"], row["secondaryIdentifier"], row["symbol"], row["name"], \
            row["organism.shortName"], row["proteins.symbol"], row["sgdAlias"], row["featureType"], \
             row["description"]

NCBI

Using the NCBI Entrez server via Biopython

See the Real World example #1 for a reminder of how to take the script below and run it on Sourcelair.com. The process is the same.

Unless you have previously installed the Biopython package in your current SourceLair project, you’ll get an error if you try to run the script below. (If you already did the examples under the Real World exercises set #1 then you should be all set unless you startd over with a new project folder since then.)

If you try to run the script and see the error below, you need to install the module again.

ImportError: No module named Bio

You simply need to install the needed package to your SourceLair project. (Packages or modules (sometimes called libraries) are simply code by others that provide useful functions and abilities that go beyond the bare bones Python and are generally specific for certain sorts of tasks. Not having them as part of bare bones Python cuts down on use of unnecessary resources. They generally need to be installed or put some place Python will look for them, and then you can import them at any time to use them. Many Python distributions include several packages are common. For example you can see all the modules/packages that come standard with PythonAnywhere here. Any distribution of Python will include a way of installing additional packages. For example, PythonAnywhere’s instructions are here. SourceLair’s are here.)

At the command line terminal of your SourceLair project, type the following to install the Biopython package.

pip install biopython

Now we should be set to run the script below to use the NCBI Entrez server via Biopython.

from Bio import Entrez
Entrez.email = "YOUR_EMAIL_GOES HERE" #so NCBI can contact you if you abuse system

protein_accn_numbers = ["ABR17211.1", "XP_002864745.1", "AAT45004.1", "XP_003642916.1" ]
protein_gi_numbers = []

print "The Accession numbers for protein sequence provided:"
print protein_accn_numbers

#ESearch
print "\nBeginning the ESearch..."
# BE CAREFUL TO NOT ABUSE THE NCBI SYSTEM.
# see http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec119 for information.
# For example, if searching with more than 100 records, you'd need to do this ESearch step
# on weekends or outside USA peak times.
for accn in protein_accn_numbers:
    esearch_handle = Entrez.esearch(db="protein", term=accn)
    esearch_result= Entrez.read(esearch_handle)
    esearch_handle.close()
    #print esearch_result
    #print esearch_result["IdList"][0]
    protein_gi_numbers.append(esearch_result["IdList"][0])
#print protein_gi_numbers

retrieved_mRNA_uids = []
#ELink
print "Beginning the ELink step..."
handle = Entrez.elink(dbfrom="protein", db="nuccore", LinkName="protein_nuccore_mrna", id=protein_gi_numbers)
result = Entrez.read(handle)
handle.close()
#print result
for each_record in result:
    mrna_id = each_record["LinkSetDb"][0]["Link"][0]["Id"]
    retrieved_mRNA_uids.append(mrna_id)
#print retrieved_mRNA_uids

#EPost
print "Beginning the EPost step..."
epost_handle = Entrez.epost(db="nuccore", id=",".join(retrieved_mRNA_uids))
epost_result = Entrez.read(epost_handle)
epost_handle.close()

webenv = epost_result["WebEnv"]
query_key = epost_result["QueryKey"]

#EFetch
print "Beginning the EFetch step..."
count = len(retrieved_mRNA_uids)
batch_size = 20
the_records = ""
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Fetching records %i thru %i..." % (start + 1, end))
    fetch_handle = Entrez.efetch(db="nuccore",
                                 rettype="fasta", retmode="text",
                                 retstart=start, retmax=batch_size,
                                 webenv=webenv,
                                 query_key=query_key)
    data = fetch_handle.read()
    fetch_handle.close()
    the_records = the_records + data
print the_records

Sources

The sources for the information used today came from those linked throughout the content.

However, certain sources deserve special highlighting as they were particularly useful in developing this workshop, contain a wealth of related resources, or are especially pertinent at this stage.

Going forward

Look into

Learning Python

Intermediate Python and Integrating it with Other Tools

ADVANCED

Epilogue to Session I

In response to questions raised during Session I. I look forward to another!

Comments and docstrings and placeholders

Comments

Python will disregard everything on a line after an # that is not within a string. You can comment out entire lines by beginning them with #.

Example:

sequence = 'GAATTC' #EcoRI site
# I like enzymes
print sequence

If you need to comment out blocks of code, you can take advantage of your text editor to add # to multiple lines at once. In Sublime Text you highlight the text block with the cursor and then use the Edit menu to naviagte to Toggle Block Comment, i.e., Edit menu>Comment>Toggle Block Comment`.

Alternatively you can use a docstring.

Docstrings

Typically docstrings are used below the first line of a a function, function def line to explain the function of the function. Example:

def function_name(variable_passed_in):
    """
    Calculate or do something

    Args:
        variable_passed_in: a variable represention something
    Returns:
        description of output
    Raises:
        TypeError: if variable_passed_in is not a number.
        ValueError: if variable_passed_in is negative.

    """
    pass

code example adapted from http://stackoverflow.com/questions/3898572/what-is-the-standard-python-docstring-format

Docstrings used elsewhere can be used for large comment blocks as well.

Docstrings can also be assigned as strings. Typically that approach is used when you want to print to stdout a a lot of text. Such as a guide to usage of your program in repsonse to certain input fromt the user.

Pass as a placholder

Note that in the function example adove, pass is used a placeholder for code to be fleshed out later.

pass can be useful for building the skeleton of your script as certain text editors and Interactive Development Environments (IDE) will not allow you to leave lines following a colon blank.

Installed modules/packages/libraries

Get installed modules/package command library command

help('modules')

works on sourcelair.com at the Python prompt. Interestingly, it doesn’t work on my Mac at the Python prompt (in Terminal) where I have Enthought Canopy installed as my version of Python. Enthought has a package manager and so they are listed under that if you open Entought’s Canopy gui analysis environment.

adapted from from http://stackoverflow.com/questions/739993/how-can-i-get-a-list-of-locally-installed-python-modules

Current scope and visualizing your coding steps

In Python interactive mode that comes up when you type Python:

- dir() will give you the list of in scope variables
- vars() gives you a dictionary of variables
- globals() will give you a dictionary of global variables
- locals() will give you a dictionary of local variables
- vars()

In IPython Notebook: type whos

adapted from http://stackoverflow.com/questions/633127/viewing-all-defined-variables

(For more on scope and Python’s namespaces, see A beginner’s guide to Python’s namespaces, scope resolution…)

Examining the current scope can be part of the more general process of debugging your script, and so I’ll touch on that too.

The Online Python Tutor or Codeskulptor’s visualization mode (Viz mode) will show values as you step through each line of your script.

For debugging real scripts in the typical Python environment, you can:

- add printing variables and messages. You can comment these out. You can even use a the `logging` module to control statements you can turn off at a document level. See lines 70-72 and line 115 of  https://github.com/fomightez/sequencework/blob/master/ConvertSeq/ConvertFASTAdnaSEQtoRNA.py for an example if it in action. See https://docs.python.org/2/howto/logging.html for information about the `logging` module in general.

- use a debugger module (see https://docs.python.org/2/library/pdb.html and  http://hplgit.github.io/bioinf-py/doc/pub/html/main_bioinf.html for some guidance in this)

- Enthought Canopy has a nice debugging implementation. You can see a video [here](https://www.enthought.com/products/canopy/canopy-python-debugger/) to get an idea of the features and how one uses these types of approaches to debug in general.