Bioinformatics-2014
Welcome to Bioinformatics
Contact:
Please send all correspondence relating to this course, as well as all completed computer labs, to: [course@johanneslab.org]
Course aim:
The aim is to provide an introduction to the following important topics in Bioinformatics:
Genome Projects (Lecture 1)
Genome Sequencing I (Lecture 2)
Genome Sequencing II (Lecture 3)
Detection of DNA Sequence Variation (Lecture 4)
Epigenomics I (Lecture 5)
Epigenomics II (Lecture 6)
Network analysis (Lecture 7)
Protein Alignment and Gene Prediction (Lecture 8)
Big Data handling (Lecture 9)
Jump to Lecture [1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9]
What can you find on this page:
This page provides information for each of the class lectures. It includes pdf versions of the lecture slides as well as links to download datasets and instructions for the computer labs. I am also providing supplemental material with each lecture, which is meant to reinforce the course material. More general online educational resources are listed at the bottom of this page. These include tutorials, webinars and presentations.
Jump to [General online and educational material]
Symposium:
At the end of this course you will give a presentation to the other students. In this presentation you will summarize an original scientific study that was highlighted in the News section of the journals Nature or Science. Your task is to show how the concepts and/or tools employed in the study relate to one or more topic(s) covered in our lectures. These symposium presentations are an important component of this course, which is reflected in their contribution to your final grade.
Jump to [Symposium]
Study Guide:
In this course you will be exposed to a lot of new information. To help you study for the exam we are putting together a study guide that will highlight the most important topics.
Jump to [Study Guide]
Grading scheme:
Your final grade is obtained as follows:
We will try to keep you up-to-date on your current grades and status of your assignments. See spreadsheet below.
Lecture 1: Course overview/Genome projects
Date:
8-5-2014
Lecture slides:
[pdf]
Computer lab:
n/a
Lecturer(s):
Frank Johannes | Marnix Medema
Summary:
In this lecture you will learn the key components of 'Genome Projects'. We will highlight these components in the context of the Human Genome Project. Finally, we will discuss ongoing genome projects in microbes, as microbial systems are an important focus of GBB.
Lecture 1: Supplemental material
Genome Browsers
Human Genome Project
Human Genome announcement at the White House
[Video]
Publications of the initial working draft human genome sequence
[IHGSC-Nature-2001 | Venter-Science-2001]
Lecture 2: Genome sequencing I
Date:
9-5-2014
Lecture slides:
[pdf-part1] [pdf-part2][NGS-supplemental-slides+videos]
Computer lab:
n/a
Lecturer(s):
Frank Johannes | Marianna Bevova
Summary:
DNA sequencing is the most important component of Genome Projects. In this lecture you will learn the principles of Sanger sequencing (First Generation Sequencing) as well as newer methods such as massively parallel sequencing (Next Generation Sequencing).
Lecture 2: Supplemental material
Principles of Sanger Sequencing
Review of Next Generation Sequencing (NGS) Methods
[Metzker-NRG-2010]
Key companies producing NGS technologies
[Roche 454 | LifeTechnologies (2-color encoding) | Illumina | Pacific Biosciences]
Whole Genome Shotgun Sequencing in action
[Craig Venter's Sorcerer II expedition]
Lecture 3: Genome sequencing II
Date:
12-5-2014
Lecture slides:
[pdf-part1][pdf-part2]
Computer lab:
[pdf][data link]
Lecturer(s):
Victor Guryev
Summary:
We discussed Next Generation Sequencing technologies (NGS) in the previous lecture. NGS produces millions of short sequenced reads. An important bioinformatic step is to quality-check these reads. It is important to remove any reads or samples that show experimental artifacts, which could stem from DNA preparation or from the NGS machine itself. Following this quality control step, short reads need to be aligned to a reference genome, which is a huge computational task. We will learn how this is done.
Lecture 3: Supplemental material
[none]
Lecture 4: Detection of DNA sequence variation
Date:
13-5-2014
Lecture slides:
[pdf-part1][pdf-part2][pdf-part3]
Computer lab:
[pdf]
Lecturer(s):
Frank Johannes | Victor Guryev
Summary:
Having determined the DNA sequence of a single (reference) genome, it is of interest to identify and characterize DNA sequence differences between individuals or species. In this lecture you will hear about various types of DNA sequence variations, and how these can be detected using older array-based technologies as well as newer Next Generation Sequencing technologies. We will highlight several ongoing large-scale projects that focus on DNA sequence variation in populations, with particular emphasis on the Genome of the Netherlands (GO.NL).
Lecture 4: Supplemental material
Large-scale projects to characterize DNA sequence variation in various populations
Humans: [International HapMap Project | 1000 Genomes Project |
Human Variome Project | GO.NL (Genome of the Netherlands) |
FarGen Project (Proposal to sequence the genomes of a whole population)]
Plants: [1001 Genomes Project (Arabidopsis) | 1KP Plant Genomes Project (transcriptomes of 1000 plant species)]
Vertebrates: [10K Genomes Projects]
Microorganisms: [The 10k Microbial Genomes Project]
Micro-arrays for SNP detection
[LaFramboise-NAR-2009]
Lecture 5: Epigenomics I
Date:
14-5-2014
Lecture slides:
[pdf-part1][pdf-part2][pdf-part3]
Computer lab:
[pdf][data1][data2]
Lecturer(s):
Frank Johannes | Maria Colomé-Tatché
Summary:
Biological motivation
All cells in your body have the same DNA sequence (save a few somatic mutations). Nonetheless, there is vast functional divergence between cells from different tissues or time-points during development. This is achieved by turning genes on or off in a tissue-dependent or time-dependent manner. The systematic (genome-wide) study of this "functional organization" of the genome has become known as "epigenomics". We will discuss large-scale projects that have been initiated to construct comprehensive epigenomic maps in humans and model organisms.
Measuring DNA methylation
The methylation of cytosines is one important layer of the epigenome. We will learn about whole genome bisulphite sequencing (WGBS-seq), which is an NGS-based method for measuring whole methylomes at basepair resolution.
Analyzing DNA methylation
We will discuss simple analysis strategies that are commonly used to determine the methylation state of individual cytosines from WGBS-seq measurements.
Lecture 5: Supplemental material
The mysterious epigenome: What lies beyond DNA?
[Video]
Epigenomics: Scientific background
[NCBI]
Epigenomics: centralized data deposit
[NCBI]
ENCODE Project information and resources
[Home (UCSC) | NHGRI | Ensembl | Nature ENCODE explorer]
A User's guide to ENCODE
[The ENCODE Project Consortium-PLoSBiology-2011]
modENCODE Project information and resources
Roadmap Epigenomics Program
[Home]
News article about 'ENCODE' and 'Roadmap Epigenomics Program'
Lecture 6: Epigenomics II
Date:
15-5-2014
Lecture slides:
[pdf-part1][pdf-part2][pdf-part3]
Computer lab:
[BS-seq-pdf][BS-seq-data]
[ChIP-seq-pdf][ChIP-seq-data (posted on Nestor)]
Lecturer(s):
Frank Johannes | Maria Colomé-Tatché
Summary:
Measuring histone modifications
In this lecture you will learn about methods for measuring the chemical states of histone proteins. The major technique is called Chromatin Immunoprecipitation followed by sequencing (ChIP-seq).
Analyzing histone modifications (single epigenome)
ChIP-seq data obtained from a single epigenome is a common experimental starting point. You will learn about two major ways of analyzing this data.
Analyzing histone modifications (multiple epigenomes)
Often researchers are interested in comparing two or more ChIP-seq datasets with each other. For instance, it may be of interest to compare the histone states of cancer cells with those of normal cells. Epigenetic differences between these two cell types may provide clues about the causes of cancer.
Lecture 6: Supplemental material
Review of ChIP-seq analysis
[Park-NRG-2009]
Review of DNA methylation analysis
[Laird-NRG-2009]
Lecture 7: Network analysis
Date:
16-5-2014
Lecture slides:
[pdf]
Computer lab:
[pdf]
Lecturer(s):
Pariya Berouzi
Summary:
The regulation of the genome is a complex process. We still know very little about it. However, we do know that many functional elements, such as genes, form interacting networks. That is, genes can be expressed together, or they are expressed in response to the expression of another gene, and so forth. Often it is important to infer (i.e. find) networks from a "sea of functional data". Ideally we would like analysis techniques that allow us to infer the direct and causal relationships between functional elements. In this lecture we are going to study graphical models as a statistical tool to encode the relationships among various functional elements. We will highlight the most common network analysis methods used in computational biology.
Lecture 7: Supplemental material
[none]
Lecture 8: Protein alignment/Gene prediction/Databases
Date:
19-5-2014
Lecture slides:
[part1][part2][part3]
Computer lab:
[pdf][doc]
Lecturer(s):
Peter Terpstra
Summary:
Knowing a DNA sequence allows us to derive the sequences of the proteins it encodes. Conversely, knowing the amino acid sequence of a protein allows us to map it back to the genic regions that produced the protein (and/or its homologs). This helps us to predict coding elements in raw DNA sequences and starts the genome “annotation” process. Together with knowledge of (e.g.) start, stop and splice sites, it allows us to “predict” genes based on sequence information alone. Knowing which databases contain all this information, and where to find them, is of course indispensable.
Protein alignment
We will discuss methods of protein alignment; what protein substitution scoring matrices are; how to extract protein “profiles” from multiple alignments, leading to protein families; and the difference between “profiles” and “motifs”. We will also briefly discuss how one could predict a protein's secondary or 3D structure from its amino acid sequence alone.
Gene prediction
We will discuss the layout of prokaryotic and eukaryotic genes; homology-based and “ab initio” methods of predicting the location of protein-coding genes in unannotated DNA sequences; what gene “signals” and “content” are; and the differences between predicting genes in prokaryotes and in eukaryotes.
Database searches
This will be an introduction to databases that contain DNA/RNA/protein sequence data as well as DNA variation, expression and metabolic data: where to find them and how to use them.
Lecture 8: Supplemental material
List of helpful links to online tutorials, software and databases
[pdf]
Lecture 9: Big data handling
Date:
20-5-2014
Lecture slides:
[pdf]
Computer lab:
[zip-file]
Lecturer(s):
Pieter
Summary:
Modern measurement technologies such as the Next Generation Sequencing methods discussed in this course generate a tremendous amount of molecular data. The storage, manipulation and dissemination of this data to end-users is a major bottleneck in gaining meaningful biological insights. In this lecture you will learn about ways to handle this so-called Big data.
Lecture 9: Supplemental material
For additional reading material about this topic see the links and publications listed in the following ppt slide:
[ppt]
Symposium: Day 1
Date:
21-5-2014
Instructions:
[pdf]
Groups presenting:
see spreadsheet below
Symposium: Day 2
Date:
23-5-2014
Instructions:
[pdf]
Groups presenting:
see spreadsheet below
General online and educational material
Online training courses at EBI-EMBL
[go there]
Overview of online software and databases at EBI-EMBL
[go there]
Video tutorial on how to use Ensembl tools
[go there]
Training and tutorials at NCBI
[go there]
NCBI educational resources
[go there]
Educational resources at the National Human Genome Research Institute
[go there]
Website of the US Public Broadcasting Service (PBS), with interesting NOVA documentaries. Note: the PBS website will not show them, but most of these videos can be viewed on YouTube.
[go there]
Study guide questions
If you really KNOW the answer to the following questions, you should be able to perform well on the final exam.
Lecture 1: Genome Projects
What are the core aims of genome projects?
What are genetic maps?
What are cytological maps?
What are physical maps?
How is each of these maps constructed?
What factors determine the relationship between genetic and physical distances?
What is "chromosome walking" used for?
What was/is the Human Genome Project (HGP)? (Be aware of the history discussed in class)
What were the landmark goals of the HGP and what was achieved?
Why is "finishing" a genome sequence so difficult?
What is synteny?
Why did the HGP also sequence other species (e.g. the mouse)?
What is metagenomics in microbes?
What are examples of metagenomics projects?
Lecture 2: DNA sequencing I
How does Sanger sequencing work?
How do you infer the sequence from the banding patterns on a gel?
In which ways did automated Sanger sequencing improve over manual methods?
What is a "trace file"?
What are the issues with reading/interpreting trace files?
How does hierarchical sequencing work?
What is a "tiling path"?
How does whole-genome shotgun sequencing work?
What are the advantages and disadvantages of hierarchical sequencing compared with whole-genome shotgun sequencing?
What features do NGS platforms (discussed in this lecture) share?
What are the main steps in Illumina Next Generation Sequencing (NGS)? (see also video posted online)
What are the disadvantages of NGS methods (such as Illumina's) that use clonal amplification of templates?
What does it mean to "barcode" reads? What implications does this have for setting up sequencing experiments?
What distinguishes "third" generation methods from current NGS platforms?
Lecture 3: DNA sequencing II
The analysis of NGS data usually starts with quality control (QC). What is meant by quality control?
Know how to interpret quality scores (Phred value, base quality, mapping quality).
Know how to interpret output from the FastQC software (see corresponding computer lab).
What are FastQ, SAM and BAM files?
NGS short reads need to be aligned to the reference genome. To optimize this step, aligners are often platform-specific. Why?
What is so special about aligning reads from an RNA-seq experiment?
What are some advantages of RNA-seq compared with micro-arrays?
Lecture 4: Detection of DNA sequence variation
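To practise interpreting quality scores, the following minimal Python sketch may help. It uses the standard Phred definition Q = -10 * log10(p), where p is the probability that a base call is wrong. It assumes that the FastQ file uses the Phred+33 encoding (ASCII code minus 33), which is common for recent Illumina data but should be checked for your own files. The function names and the example quality string are made up for illustration.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q into an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability p into a Phred score Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_fastq_qualities(quality_line, offset=33):
    """Decode a FastQ quality string into integer Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality_line]


if __name__ == "__main__":
    # Hypothetical quality string for a read of five bases.
    scores = decode_fastq_qualities("II5+!")
    print(scores)  # [40, 40, 20, 10, 0]
    for q in scores:
        print(q, phred_to_error_prob(q))
```

A score of 20 therefore corresponds to a 1 in 100 chance of a wrong base call, and a score of 30 to a 1 in 1000 chance.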
What are the different classes of DNA sequence variations?
SNPs can originate from transitions or from transversions. What is the difference?
Although there are twice as many possible transversions as transitions, empirical observations show that transversions are as frequent as transitions in the genome. What reasons for this were discussed in the lecture?
What is linkage disequilibrium (LD)?
What is a haplotype? And what contributes to haplotype diversity?
What is nucleotide diversity?
The lecture on variant calling was difficult. Just focus on the following, more general, questions:
What are VCF files?
How does an SNP detected in NGS data differ from a sequencing error?
Why is it important to mark duplicate reads when calling variants?
What is the difference between paired-end sequencing and mate-pair sequencing?
What is meant by the term "base coverage"?
What is meant by the term "physical coverage"?
What are the four main methods for structural variant (SV) detection? And what is their respective scope of application?
What are typical signatures of structural variation seen in the mapped paired-end reads?
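As a study aid for the transition/transversion questions, here is a small Python sketch. It relies only on the standard definitions: a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), and a transversion swaps a purine for a pyrimidine or vice versa. The function names are hypothetical. Enumerating all possible substitutions shows where the "twice as many possibilities" comes from.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError("ref and alt must be two different bases from A, C, G, T")
    pair = {ref, alt}
    # Both bases in the same chemical class means a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_possible_substitutions():
    """Count all 12 ordered base changes by class."""
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in permutations("ACGT", 2):
        counts[classify_substitution(ref, alt)] += 1
    return counts


if __name__ == "__main__":
    print(classify_substitution("A", "G"))  # transition
    print(classify_substitution("C", "A"))  # transversion
    print(count_possible_substitutions())   # {'transition': 4, 'transversion': 8}
```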
Lecture 5: Epigenomics I
Scientific background
What do we mean by the term "epigenome"?
What do we mean by the term "epigenetics"?
What are the main components of the epigenome discussed in class?
What is the rationale for studying epigenomics and epigenetics?
Measuring and analyzing DNA methylation: MeDIP-chip
There are two main approaches for measuring components of the epigenome, array-based and sequencing-based methods. One popular array-based method is MeDIP-chip. What does this term stand for? And how does it work?
The arrays (or chips) used in MeDIP-chip are often "tiling arrays". What is a "tiling array"?
Finite mixture models are popular methods for analyzing tiling array data. What are the main assumptions of this modeling approach?
What are the main modeling goals of this approach?
What are the key disadvantages of this approach in the context of tiling array data?
Measuring and analyzing DNA methylation: Bisulphite Sequencing
Why must unmethylated cytosines be converted into uracil with sodium bisulphite in order to detect DNA methylation by sequencing?
How many different types of sequences do you expect from one DNA fragment after bisulphite conversion and PCR amplification? And why?
What are the three types of mapping issues that we have with mapping (or aligning) bisulphite converted read sequences?
Programs that align (or map) read sequences first change the remaining cytosines of the read sequences and all the cytosines of the reference genome into thymines before performing the alignment. This is done on the computer. What is the benefit of this three-letter alignment?
What do we mean by the term "bisulphite conversion rate"?
Why do you need to determine the bisulphite conversion rate?
How do you calculate the bisulphite conversion rate?
How can we determine, after mapping, which cytosines are methylated and which ones are unmethylated?
What are the advantages and disadvantages of bisulphite sequencing compared to hybridization based methods like MeDIP-chip?
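The bisulphite questions above can be made concrete with a rough Python sketch. It makes three assumptions that are not taken from the lecture slides: (1) the in-silico conversion is a plain C-to-T replacement on one strand, (2) the conversion rate is estimated from cytosines in a control sequence known to be unmethylated, and (3) the methylation level of a cytosine is summarized as the fraction of covering reads that still show a C. All function names and numbers are hypothetical, and real pipelines handle both strands and use statistical tests.

```python
def in_silico_convert(seq):
    """Replace every C with T, giving the three-letter alphabet used for alignment."""
    return seq.upper().replace("C", "T")


def conversion_rate(converted, unconverted):
    """Fraction of known-unmethylated cytosines that were read as T.

    'converted' and 'unconverted' are read counts over control cytosines
    (assumption: the control sequence carries no methylation).
    """
    total = converted + unconverted
    if total == 0:
        raise ValueError("no reads cover the control cytosines")
    return converted / total


def methylation_level(c_reads, t_reads):
    """Fraction of reads at one cytosine that still show C (protected from conversion)."""
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no reads cover this cytosine")
    return c_reads / total


if __name__ == "__main__":
    print(in_silico_convert("ACGTCC"))  # ATGTTT
    print(conversion_rate(995, 5))      # 995 of 1000 control cytosines converted
    print(methylation_level(8, 2))      # 8 of 10 reads still show C
```

A conversion rate well below 1 means that some unmethylated cytosines escaped conversion and would be miscounted as methylated, which is why the rate needs to be checked before interpreting methylation levels.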
Lecture 6: Epigenomics II
What does ChIP-seq stand for?
How is ChIP-seq different from ChIP-chip?
When ChIP-seq peaks are narrow, a "peak calling" algorithm such as MACS is suitable. When ChIP-seq peaks are broad, Hidden Markov Models are suitable.
What is the meaning of the "initial probabilities" of a Markov Chain?
What does the "transition matrix" of a Markov Chain tell you?
Given the initial probabilities and the transition matrix, how do you calculate the probability of a given chain?
A key property of Markov Chains is that the past and the future are conditionally independent given the present. What does this mean?
What is it that is hidden in a Hidden Markov Model (HMM)?
In an HMM, what is the meaning of an "emission" probability?
What are the two key properties of ChIP-seq data that make the application of HMMs suitable for this data?
Again, a key property of Markov Chains is that the past and the future are conditionally independent given the present. But what does this mean in the context of ChIP-seq data?
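To check your understanding of how the probability of a chain is calculated, here is a minimal Python sketch. It uses the standard first-order Markov Chain factorization: the probability of a chain is the initial probability of its first state multiplied by the transition probability of every consecutive pair of states. The two state names and all numbers are invented for illustration and are not taken from the lecture.

```python
def chain_probability(states, initial, transition):
    """Probability of a sequence of states under a first-order Markov Chain.

    initial[s] is P(first state = s);
    transition[a][b] is P(next state = b | current state = a).
    """
    if not states:
        raise ValueError("the chain must contain at least one state")
    prob = initial[states[0]]
    # Multiply in one transition probability per consecutive pair of states.
    for prev, curr in zip(states, states[1:]):
        prob *= transition[prev][curr]
    return prob


if __name__ == "__main__":
    # Hypothetical two-state model for genomic bins.
    initial = {"enriched": 0.2, "background": 0.8}
    transition = {
        "enriched": {"enriched": 0.9, "background": 0.1},
        "background": {"enriched": 0.05, "background": 0.95},
    }
    chain = ["background", "background", "enriched", "enriched"]
    # 0.8 * 0.95 * 0.05 * 0.9
    print(chain_probability(chain, initial, transition))
```

Note that each factor depends only on the current and the previous state, which is the conditional independence property asked about above.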
Lecture 7: Network analysis
What is the main goal of network analysis in computational biology?
What is the biological interpretation of an "edge" in a network graph?
What is the biological interpretation of a "node" in a network graph?
What are the differences between an undirected graph and a directed acyclic graph (DAG)?
How is the concept of "conditional independence" defined in the context of an undirected graph?
How is the concept of "conditional independence" defined in the context of a directed acyclic graph?
Considering directed acyclic graphs, what is a "parent" and what is a "descendant"?
Given a graph, know how to write down the conditional independence relationships (see exercises in lecture and in computer lab).
Given the conditional independence relationships, know how to draw a graph (see exercises in lecture and in computer lab).
Be able to name some example applications of undirected graphs and directed acyclic graphs.
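For the questions on directed acyclic graphs, the following Python sketch may help you check your answers by hand. It assumes a DAG stored as a dictionary that maps each node to a list of its children; the node names A to D are hypothetical. A parent of a node has an edge pointing directly into it, and a descendant is any node that can be reached by following edges forward.

```python
def parents(graph, node):
    """Return the set of nodes with an edge pointing directly into 'node'."""
    return {p for p, children in graph.items() if node in children}


def descendants(graph, node):
    """Return every node reachable from 'node' by following directed edges."""
    seen = set()
    stack = list(graph.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, ()))
    return seen


if __name__ == "__main__":
    # Hypothetical DAG: A -> B, A -> C, B -> D, C -> D
    dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    print(parents(dag, "D"))      # B and C
    print(descendants(dag, "A"))  # B, C and D
    print(descendants(dag, "D"))  # empty set
```

In a DAG, each node is conditionally independent of its non-descendants given its parents, so these two helper functions are enough to list such relationships for a small graph.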
Lecture 8: Protein alignment, gene prediction, databases
What is an ORF?
How many reading frames (DNA triplets to amino acids) are there in a double-stranded piece of DNA?
What is included in the definition of a gene that is not included in the definition of a CDS (=CoDing Sequence)?
In eukaryotic genes: is the start codon of the CDS always in the first exon?
What does UTR stand for?
Do bacterial mRNAs have 3' or 5' UTRs?
In gene finding, what fact does the “codon usage statistic” use?
In gene predictions there are “Signals” (S) and “Content” (C). In this terminology, what would a "splice site" be? And what would an "exon" be?
What criteria would you use to predict (in prokaryotes) which ORF could code for a protein?
Why is gene prediction in eukaryotes in general less accurate than in prokaryotes?
What specific applications do the different versions of the Blast program have? For example, when would you use BlastX or tBlastN?
Amino acids can be grouped in various ways. Be able to name at least 3 of these groups.
What do the numbers in a Blosum substitution matrix quantify (or measure)?
How are the scores in this matrix interpreted?
Be able to distinguish databases as either primary or secondary.
What data is contained in the NCBI dbSNP database?
Be able to name at least 2 of the 3 data retrieval systems for consulting the text part of the public DNA/Protein databases.
What kind of problems do current sequence databases have?
What is a “low complexity region” in a protein/DNA sequence?
Know the notation for sequence motifs. For example, the “Prosite” aa motif is defined as K-x(1,3)-[LVIM].
What is a PSSM?
How do you derive a PSSM from a protein multiple alignment?
After the first database comparison, what is used in Psi-Blast as a query for the second round of database comparison?
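To get familiar with the motif notation, here is a small Python sketch that translates a Prosite-style pattern such as K-x(1,3)-[LVIM] into a regular expression and searches a sequence with it. It covers only a subset of the notation (x for any residue, square brackets for allowed residues, curly brackets for excluded residues, and repeat counts in parentheses), and the function name and test sequence are made up for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a Python regular expression.

    Supported subset: 'x' (any residue), [ABC] (one of), {ABC} (none of),
    and repeat counts such as (3) or (1,3). Anchors are not handled.
    """
    regex = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split off an optional repeat count such as (3) or (1,3).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError("empty pattern element")
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        regex.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(regex)


if __name__ == "__main__":
    regex = prosite_to_regex("K-x(1,3)-[LVIM]")
    print(regex)  # K.{1,3}[LVIM]
    hit = re.search(regex, "MAKTTLQ")  # hypothetical protein fragment
    print(hit.group() if hit else None)  # KTTL
```

The pattern reads as: a lysine (K), followed by one to three residues of any kind, followed by one of L, V, I or M.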
Lecture 9: Big data handling
For this lecture, I require you to be broadly familiar with the "summary" slides for each of the subtopics.
More specifically:
Know the rules for database normalization.
Be familiar with the DBMS exercise done in the computer lab.
THE END