Bioinformatics-2013
Welcome to Bioinformatics
Contact:
Please send all correspondence relating to this course, as well as all completed computer labs, to: [course@johanneslab.org]
Course aim:
The aim is to provide an introduction to the following important topics in Bioinformatics:
Genome Projects (Lecture 1)
Genome Sequencing (Lecture 2)
DNA Sequence Variation (Lecture 3)
Sequence Alignment (Lecture 4)
Protein Alignment and Gene Prediction (Lecture 5)
Epigenome Projects and tiling array analysis (Lecture 6)
Epigenome Sequencing and analysis (Lecture 7)
Data integration: experimental populations (Lecture 8)
Data integration: human populations (Lecture 9)
Big Data handling (Lecture 10)
Jump to Lecture [1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10]
What can you find on this page:
This page provides information for each of the class lectures. It includes pdf versions of the lecture slides as well as links to download datasets and instructions for the computer labs. I am also providing supplemental material with each lecture, which is meant to reinforce the course material. More general online educational resources are listed at the bottom of this page. These include tutorials, webinars and presentations.
Jump to [General online and educational material]
Symposium:
At the end of this course you will give a presentation to the other students. In this presentation you will summarize an original scientific study that was highlighted in the News section of the journals Nature or Science. Your task is to show how the concepts and/or tools employed in the study relate to one or more topic(s) covered in our lectures. These symposium presentations are an important component of this course, which is reflected in their contribution to your final grade.
Jump to [Symposium]
Study Guide:
In this course you will be exposed to a lot of new information. To help you study for the exam we are putting together a study guide that will highlight the most important topics.
Jump to [Study Guide]
Grading scheme:
Your final grade is obtained as follows:
We will try to keep you up-to-date on your current grades and status of your assignments. See spreadsheet below.
PART 1| Lecture 1: Course overview/Genome projects
Date:
21-5-2013
Lecture slides:
[pdf]
Computer lab:
n/a
Lecturer(s):
Frank Johannes | Marnix Medema
Summary:
In this lecture you will learn the key components of 'Genome Projects'. We will highlight these components in the context of the Human Genome Project. Finally, we will discuss ongoing genome projects in microbes, as microbial systems are an important focus of GBB.
PART 1 | Lecture 1: Supplemental material
Genome Browsers
Human Genome Project
Human Genome announcement at the White House
[Video]
Publications of the initial working draft human genome sequence
[IHGSC-Nature-2001 | Venter-Science-2001]
PART 1| Lecture 2: Genome sequencing
Date:
22-5-2013
Lecture slides:
[pdf]
Computer lab:
n/a
Lecturer(s):
Frank Johannes
Summary:
DNA sequencing is the most important component of Genome Projects. In this lecture you will learn the principles of Sanger sequencing (First Generation Sequencing) as well as newer methods such as massively parallel sequencing (Next Generation Sequencing).
PART 1 | Lecture 2: Supplemental material
Principles of Sanger Sequencing
Review of Next Generation Sequencing (NGS) Methods
[Metzker-NRG-2010]
Key companies producing NGS technologies
[Roche 454 | LifeTechnologies (2-color encoding) | Illumina ]
[HelicosBiosciences | Pacific Biosciences]
Whole Genome Shotgun Sequencing in action
[Craig Venter's Sorcerer II expedition]
PART 1| Lecture 3: DNA sequence variation
Date:
23-5-2013
Lecture slides:
[pdf-part1 | pdf-part2 | pdf-part3 | pdf-part4]
Computer lab:
[pdf | doc | data | ANSWERS | Supplemental tutorial]
Lecturer(s):
Frank Johannes | Victor Guryev
Summary:
Having determined the DNA sequence of a single (reference) genome, it is of interest to identify and characterize DNA sequence differences between individuals or species. In this lecture you will hear about various types of DNA sequence variations, and how these can be detected using older array-based technologies as well as newer Next Generation Sequencing technologies. We will highlight several ongoing large-scale projects that focus on DNA sequence variation in populations, with particular emphasis on the Genome of the Netherlands (GO.NL).
PART 1 | Lecture 3: Supplemental material
Large-scale projects to characterize DNA sequencing variation in various populations
Humans: [International HapMap Project | 1000 Genomes Project |
Human Variome Project | GO.NL (Genome of the Netherlands) |
FarGen Project (Proposal to sequence the genomes of a whole population)]
Plants: [1001 Genomes Project (Arabidopsis) | 1KP Plant Genomes Project (transcriptomes of 1000 plant species)]
Vertebrates: [10K Genomes Projects]
Microorganisms: [The 10k Microbial Genomes Project]
Micro-arrays for SNP detection
[LaFramboise-NAR-2009]
PART 1| Lecture 4: Sequence alignment
Date:
24-5-2013
Lecture slides:
[pdf]
Computer lab:
[pdf | doc | data | ANSWERS]
Lecturer(s):
Minh Anh Nguyen
Summary:
"Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences" (Wiki). Sequence alignment is a fundamental activity in bioinformatics. It is used for assembling whole genomes from smaller sequences, in mapping short Next Generation Sequencing reads to a reference genome, in protein analysis (see next lecture), in phylogenetic analysis (not covered), etc. In previous lectures, we have simply assumed that sequence alignment can be performed. In this lecture you will now learn how it actually works. You will be introduced to pairwise- and multiple- sequence alignment approaches.
PART 1 | Lecture 4: Supplemental material
Lecture summary and exercises
[pdf | pdf-with-ANSWERS]
Overview of pairwise- and multiple- sequence alignment tools
[Wiki]
PART 2| Lecture 5: Protein alignment/Gene prediction/Database searches
Date:
27-5-2013
Lecture slides:
[pdf-part1 | pdf-part2 | pdf-part3]
Computer lab:
n/a
Lecturer(s):
Peter Terpstra
Summary:
The DNA sequence is simply a string of letters that provides a template for producing "functional" products such as mRNAs that code for proteins. Proteins "make things happen" in the cell and produce observable phenotypes. Knowledge of the DNA sequence allows us to infer the sequence of proteins. Alternatively, knowing the amino acid sequence of a protein, we can map it back to the genic region that produced the protein (and/or its homologs). This helps us to predict coding elements in raw DNA sequences and starts the genome "annotation" process. Together with knowledge of (e.g.) start, stop and splice sites, it allows us to "predict" genes based on sequence information alone. Knowing where to find the databases that contain all this information is of course indispensable.
Protein alignment
We will discuss: methods of protein alignment; what protein substitution scoring matrices are; how to extract protein "profiles" from multiple alignments, leading to protein families; and the difference between "profiles" and "motifs". We will also briefly discuss how one could predict a protein's secondary or 3D structure from its amino acid sequence alone.
Gene prediction
We will discuss: the layout of prokaryotic and eukaryotic genes; homology-based and "ab initio" methods of predicting the location of protein-coding genes in unannotated DNA sequences; what gene "signals" and "content" are; and the differences between predicting prokaryotic and eukaryotic genes.
Database searches
This will be an introduction to databases that contain DNA/RNA/protein sequence data as well as DNA variation, expression and metabolic data: where to find them and how to use them.
PART 2 | Lecture 5: Supplemental material
Overview of online databases
[EBI-EMBL]
Computer Exercises with answers
[pdf]
PART 2| Lecture 6: Epigenome projects
Date:
28-5-2013
Lecture slides:
[pdf-part1 | pdf-part2]
Computer lab:
[pdf | doc | data | ANSWERS]
Lecturer(s):
Frank Johannes | Maria Colome Tatche
Summary:
All cells in your body have the same DNA sequence (save a few somatic mutations). Nonetheless, there is vast functional divergence between cells from different tissues or time-points during development. This is achieved by turning genes on or off in a tissue-dependent or time-dependent manner. The systematic (genome-wide) study of this "functional organization" of the genome has become known as "epigenomics". In this lecture you will learn the basics of epigenomics. The material will be presented in the context of the ENCODE project, which is the largest epigenomics project to date.
PART 2 | Lecture 6: Supplemental material
The mysterious epigenome: What lies beyond DNA?
[Video]
Epigenomics: Scientific background
[NCBI]
Epigenomics: centralized data deposit
[NCBI]
ENCODE Project information and resources
[Home (UCSC) | NHGRI | Ensembl | Nature ENCODE explorer]
A User's guide to ENCODE
[The ENCODE Project Consortium-PLoSBiology-2011]
modENCODE Project information and resources
Roadmap Epigenomics Program
[Home]
News article about 'ENCODE' and 'Roadmap Epigenomics Program'
PART 2| Lecture 7: Epigenome sequencing
Date:
29-5-2013
Lecture slides:
[pdf-part1 | pdf-part2 | pdf-part3]
Computer lab:
[pdf | doc | data (see Nestor) | ANSWERS]
Lecturer(s):
Frank Johannes | Lionel Morgado | Rene Wardenaar
Summary:
In this lecture you will learn about cutting-edge approaches for measuring specific components of epigenomes. The first method is called Chromatin Immunoprecipitation followed by sequencing (ChIP-seq). This method is used to determine transcription factor binding events as well as the presence/absence of certain histone modifications along the genome. You will also learn about Whole Genome Bisulphite Sequencing (WGBS-seq), which is used to determine DNA methylation states at single-base resolution. These methods rely on the Next Generation Sequencing technologies that were discussed in earlier lectures.
PART 2 | Lecture 7: Supplemental material
Review of ChIP-seq analysis
[Park-NRG-2009]
Review of DNA methylation analysis
[Laird-NRG-2009]
PART 3| Lecture 8: Data integration: experimental populations
Date:
30-5-2013
Lecture slides:
[pdf]
Computer lab:
[pdf | doc | data | ANSWERS]
Lecturer(s):
Frank Johannes
Summary:
It is remarkable that high-throughput measurements, particularly those generated by Next Generation Sequencing technologies, can nowadays be collected with such ease and at ever decreasing cost. As a biologist you must ask yourself: Is this type of data useful? And can it help me answer novel biological questions? In this lecture you will hear a biological story that illustrates the power of high-throughput DNA sequence and epigenomic data in understanding phenotypic inheritance in the model plant Arabidopsis. This case study should help you to see the relevance of what you have learned so far in the course.
PART 3 | Lecture 8: Supplemental material
PART 3| Lecture 9: Data integration: human populations
Date:
31-5-2013
Lecture slides:
[pdf]
Computer lab:
[pdf | doc | data | ANSWERS]
Lecturer(s):
Lude Franke
Summary:
Similar to the previous lecture (Lecture 8), this lecture will illustrate the use of high-throughput data in unraveling the causes of complex diseases in humans. Again, the goal of this lecture is to underline the relevance of the material that you have learned in the context of a real biological question.
PART 3 | Lecture 9: Supplemental material
PART 3| Lecture 10: Big data handling
Date:
3-6-2013
Lecture slides:
[pdf-part1 | pdf-part2]
Computer lab:
[doc-part1 | doc-part2 | doc-part3 | data | ANSWERS1 | ANSWERS2]
Lecturer(s):
Morris Swertz
Summary:
Modern measurement technologies such as the Next Generation Sequencing methods discussed in this course generate a tremendous amount of molecular data. The storage, manipulation and dissemination of this data to end-users is a major bottleneck in gaining meaningful biological insights. In this lecture you will learn about ways to handle this so-called Big Data.
PART 3 | Lecture 10: Supplemental material
For additional reading material about this topic see the links and publications listed in the following ppt slide:
[ppt]
Symposium: Day 1
Date:
4-6-2013
Instructions:
[pdf]
Groups presenting:
see spreadsheet below
Symposium: Day 2
Date:
5-6-2013
Instructions:
[pdf]
Groups presenting:
see spreadsheet below
General online and educational material
Online training courses at EBI-EMBL
[go there]
Overview of online software and databases at EBI-EMBL
[go there]
Video tutorial on how to use Ensembl tools
[go there]
Training and tutorials at NCBI
[go there]
NCBI educational resources
[go there]
Educational resources at the National Human Genome Research Institute
[go there]
Website of the US PBS (Public Broadcasting Service) with interesting NOVA documentaries. Note: the PBS website will not show them, but most of these videos can be viewed on YouTube.
[go there]
Study guide questions
If you really KNOW the answer to the following questions, you should be able to perform well on the final exam.
Lecture 1: Genome Projects
What are the core aims of genome projects?
What are genetic maps?
What are cytological maps?
What are physical maps?
How are each of these maps constructed?
What are factors determining the relationship between genetic and physical distances?
What is "chromosome walking" used for?
What was/is the Human Genome Project (HGP)? (Be aware of the history discussed in class)
What were the landmark goals of the HGP and what was achieved?
Why is "finishing" a genome sequence so difficult?
What is synteny?
Why did the HGP also sequence other species (e.g. the mouse)?
What is metagenomics in microbes?
What are examples of metagenomics projects?
Lecture 2: DNA sequencing
How does Sanger sequencing work?
How do you infer the sequence from the banding patterns on a gel?
In which ways did automated Sanger sequencing improve over manual methods?
What is a "trace file"?
What are the issues with reading/interpreting trace files?
How does hierarchical sequencing work?
What is a "tiling path"?
How does whole-genome shotgun sequencing work?
In comparison, what are their advantages and disadvantages?
What are the main steps in Illumina Next Generation Sequencing (NGS)?
What are the disadvantages of NGS methods (such as Illumina's) that use clonal amplification of templates?
Lecture 3: DNA sequence variation
Focus on lecture slides "pdf-part1"
What are the different classes of DNA sequence variations?
SNPs can originate from transitions or from transversions. What is the difference?
Although there are twice as many possible transversions as transitions, empirical observations show that transversions are only as frequent as transitions in the genome. What reasons for this were discussed in the lecture?
What are the main steps in Next Generation Sequencing (NGS) DNA sequence variant detection?
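The transition/transversion distinction asked about above can be illustrated with a short Python sketch. The function name `classify_snp` is hypothetical and is not taken from the lecture material; the sketch only encodes the standard definition (purine to purine or pyrimidine to pyrimidine is a transition, everything else is a transversion).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref == alt or ref not in valid or alt not in valid:
        raise ValueError("expected two different bases from A, C, G, T")
    # A transition stays within one chemical class (A<->G or C<->T).
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Enumerate all 12 possible substitutions and count each kind.
counts = {"transition": 0, "transversion": 0}
for r in "ACGT":
    for a in "ACGT":
        if r != a:
            counts[classify_snp(r, a)] += 1
print(counts)  # {'transition': 4, 'transversion': 8}
```

The enumeration confirms the "twice as many possibilities" statement: 8 of the 12 possible substitutions are transversions.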
Focus on lecture slides "pdf-part3"
What is paired-end sequencing?
What is meant by the term "base coverage"?
What is meant by the term "physical coverage"?
What are the main approaches for SV detection using NGS data? (see slide "Approaches for SV detection using NGS data")
Focusing on the "read pair" approach for detecting SV, what are typical signatures of structural variation seen in the mapped paired-end reads? (See slide "Signatures of structural variation". Be able to understand why you expect to see these signatures for a given SV type.)
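To make the two coverage terms above concrete, here is a minimal Python sketch. It assumes the commonly used definitions (base coverage counts sequenced bases, physical coverage counts whole fragments spanned by a read pair); the definitions on the lecture slides may differ in detail, and all numbers are hypothetical.

```python
def base_coverage(n_reads, read_length, genome_size):
    """Average number of sequenced bases covering each genome position."""
    return n_reads * read_length / genome_size


def physical_coverage(n_pairs, fragment_length, genome_size):
    """Average number of paired-end fragments spanning each genome position."""
    return n_pairs * fragment_length / genome_size


# Hypothetical example: 50,000 read pairs (2 x 100 bp) from 500 bp fragments,
# sequenced from a 1 Mb genome.
genome_size = 1_000_000
n_pairs = 50_000
print(base_coverage(2 * n_pairs, 100, genome_size))   # 10.0
print(physical_coverage(n_pairs, 500, genome_size))   # 25.0
```

Physical coverage is higher here because each fragment also counts the unsequenced stretch between the two reads of a pair, which is what the read pair approach to SV detection exploits.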
Lecture 4: Sequence alignment
Pairwise sequence alignment
Dotplots compare sequences, give a quick indication of repeated, deleted/inserted, reversed, exchanged and alignable regions. How do you construct a (filtered) dotplot with given window size and stringency? (Briefly interpret a given dotplot.)
How do you score a given pairwise alignment?
How do you construct a pairwise sequence alignment? (Needleman-Wunsch to construct global alignment and Smith-Waterman algorithm to construct local alignment.)
What are the differences between the two algorithms?
Why do we sometimes want to construct local alignments but not global alignments?
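The difference between the two algorithms can be seen in a small dynamic-programming sketch. This is an illustration under assumed scores (match +1, mismatch -1, linear gap penalty -1), not the exact scheme used in the lecture or the computer lab, and it returns only the optimal score without the traceback. The function name `alignment_score` is hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal alignment score of sequences a and b.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            F[i][0] = i * gap
        for j in range(1, cols):
            F[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = F[i - 1][j] + gap
            left = F[i][j - 1] + gap
            F[i][j] = max(diag, up, left)
            if local:
                # Local alignment: a cell never drops below zero, so an
                # alignment can start anywhere in either sequence.
                F[i][j] = max(F[i][j], 0)
                best = max(best, F[i][j])
    # Global: score of aligning both sequences end to end.
    # Local: best-scoring cell anywhere in the matrix.
    return best if local else F[-1][-1]


print(alignment_score("ACGT", "AGT"))                          # 2
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))     # 3
```

In the second call the shared segment "ACG" is found even though the flanking regions are unrelated, which is the situation in which a local alignment is preferred over a global one.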
Multiple sequence alignment
How do you score a given multiple sequence alignment? (Two types of scoring: sum-of-pairs score and tree-based score.)
How do you compute the distance between two (aligned) sequences? (We mostly used the Levenshtein distance.)
What is the meaning of an alignment profile?
How do you compute a profile for an alignment?
What is the general idea behind progressive alignment? (Recall that we need a guide tree to do progressive alignment.)
How do you align a profile and a sequence? (You should at least know the idea.) Being able to do this you can then align or add a single sequence to an existing alignment.
How do you align two profiles? (You should at least know the idea.) Being able to do this you can then align two existing alignments.
Given a guide tree, how is progressive alignment performed? (You should at least know the order in which to perform alignments of sequences along the guide tree in order to align all sequences.)
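As a hedged illustration of the profile questions above, the sketch below computes a profile as the per-column relative frequency of each symbol. This is one common definition and may differ from the one used in the lecture; the function name `alignment_profile` and the example alignment are hypothetical.

```python
from collections import Counter


def alignment_profile(alignment, alphabet="ACGT-"):
    """Per-column relative frequencies of each symbol in an alignment.

    `alignment` is a list of equal-length strings (gaps written as '-').
    Returns one dict per column, mapping symbol -> frequency.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    profile = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)
        profile.append({sym: counts[sym] / n for sym in alphabet})
    return profile


aln = ["ACG-", "ACGT", "AAGT", "ACGT"]
for position, freqs in enumerate(alignment_profile(aln), start=1):
    print(position, freqs)
```

Aligning a new sequence (or another profile) against such a profile then amounts to scoring each position against these column frequencies rather than against a single residue.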
Lecture 5: Protein alignment, gene prediction and database searches
On what basis can you group amino acids?
What is the BLOSUM matrix? And what do its entries mean?
What is a Position Specific Scoring Matrix?
What is a protein motif?
How can you search for a protein motif in a database? (What notation is used there?)
Why do you want to predict genes?
What is the architecture of a prokaryotic gene?
What is the architecture of a eukaryotic gene?
What aspects are taken into account when using computational gene prediction?
Have an overview of important databases(?)
What variants of the BLAST alignment programs exist? And for what purposes are they used?
Lecture 6: Epigenome projects and tiling array analysis
Scientific background
What do we mean by the term "epigenome"?
What do we mean by the term "epigenetics"?
What are the main components of the epigenome discussed in class?
What is the rationale for studying epigenomics and epigenetics?
Array-based analysis using finite mixture models
There are two main approaches for measuring components of the epigenome, array-based and sequencing-based methods. Array-based methods often use "tiling-arrays". What is a tiling array?
Variants of finite mixture models are popular methods for analyzing tiling array data. What are the main assumptions of this modeling approach?
What are the main modeling goals of this approach?
What are the key disadvantages of this approach in the context of tiling array data?
Array-based analysis using Hidden Markov Models (HMM)
What is the meaning of the "initial probabilities" of a Markov Chain?
What does the "transition matrix" of a Markov Chain tell you?
Given the initial probabilities and the transition matrix, how do you calculate the probability of a given chain?
A key property of Markov Chains is that the past and the future are conditionally independent given the present. What does this mean?
What is it that is hidden in a Hidden Markov Model (HMM), as opposed to a Markov Model?
In an HMM, what is the meaning of an "emission" probability?
What are the two key properties of tiling array data that make the application of HMMs suitable for this data?
Again, a key property of Markov Chains is that the past and the future are conditionally independent given the present. But what does this mean in the context of tiling array data?
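The calculation of a chain probability from the initial probabilities and the transition matrix can be sketched as follows. The state names ("E" for enriched, "B" for background) and all probabilities are made-up values for illustration, not numbers from the lecture.

```python
def chain_probability(states, initial, transition):
    """P(x1, ..., xn) = initial[x1] * product of transition[x(i-1)][x(i)]."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        # Markov property: each step depends only on the previous state.
        p *= transition[prev][cur]
    return p


# Hypothetical two-state chain along neighbouring tiling array probes.
initial = {"E": 0.2, "B": 0.8}
transition = {
    "E": {"E": 0.7, "B": 0.3},
    "B": {"E": 0.1, "B": 0.9},
}

# 0.8 * 0.9 * 0.1 * 0.7, i.e. approximately 0.0504
print(chain_probability(["B", "B", "E", "E"], initial, transition))
```

Because each factor involves only the current and the previous state, the probability of a long chain is a simple product. This is the practical consequence of the conditional independence property asked about above.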
Lecture 7: Epigenome sequencing and analysis
The ENCODE project
What was/is the rationale behind the ENCODE project?
What were/are the main findings?
Whole Genome Bisulphite Sequencing (WGBS-seq)
Why do you need to convert unmethylated cytosines into uracil with use of the chemical compound sodium bisulphite for the detection of DNA methylation using sequencing?
How many different types of sequences do you expect from one DNA fragment after bisulphite conversion and PCR amplification? And why?
What are the three types of mapping issues that we have with mapping (or aligning) bisulphite converted read sequences?
Programs that align (or map) read sequences first change the remaining cytosines of the read sequences and all the cytosines of the reference genome into thymines before performing the alignment. This is done on the computer. What is the benefit of this three-letter alignment?
What do we mean with the term "bisulphite conversion rate"?
Why do you need to determine the bisulphite conversion rate?
How do you calculate the bisulphite conversion rate?
How can we determine, after mapping, which cytosines are methylated and which ones are unmethylated?
What are the advantages and disadvantages of bisulphite sequencing compared to hybridization based methods like MeDIP-chip?
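Two of the computational steps asked about above can be sketched in a few lines of Python. The sketch assumes that the conversion rate is estimated from cytosines known to be unmethylated (for example a spiked-in unmethylated control), which is a common approach but may not be the exact procedure from the computer lab. It also handles only the C-to-T conversion of one strand; real aligners additionally treat the G-to-A conversion for the opposite strand. Function names and counts are hypothetical.

```python
def in_silico_convert(seq):
    """Replace every C by T, as done before three-letter alignment."""
    return seq.upper().replace("C", "T")


def conversion_rate(n_converted, n_unconverted):
    """Fraction of known-unmethylated cytosines that were read as T.

    n_converted:   reads showing T at these positions
    n_unconverted: reads still showing C at these positions
    """
    total = n_converted + n_unconverted
    if total == 0:
        raise ValueError("no reads cover the control cytosines")
    return n_converted / total


print(in_silico_convert("ACGTCC"))   # ATGTTT
print(conversion_rate(990, 10))      # 0.99
```

After the in silico conversion, a read matches the reference regardless of the methylation state of its cytosines. The original C/T calls are then compared with the unconverted reference after mapping to decide which cytosines were methylated.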
ChIP-seq
What does ChIP-seq stand for?
What is the difference between ChIP-seq and ChIP-chip?
What is the difference between IP DNA, Mock IP DNA and DNA from non-specific IP?
How can you determine if the sequencing depth of your ChIP-seq experiment is enough?
What are the 4 main steps in the ChIP-seq data analysis workflow?
What is meant by "peak calling"?
How can the quality of your data be checked visually? (Be familiar with the plots discussed in class.)
Lecture 8: Data integration: experimental populations
This presentation was from a project that I am involved in. The goal was to show you how high-throughput data can be integrated to answer novel biological questions. You do NOT need to know any of this material for the exam. Just be aware that your instructor is involved in some very cool projects : )
Lecture 9: Data integration: human populations
What is the difference between discrete (Mendelian) and continuous (complex) traits?
What is linkage disequilibrium?
Why is it not necessary to investigate all SNPs in the genome to study the impact of genetic variation on complex traits? (This has to do with linkage disequilibrium.)
To determine the most common SNP genotypes of an individual, do we need to sequence his/her genome?
In the context of human health, association mapping studies attempt to relate DNA sequence variation (most often at the level of SNPs) to variation in disease outcomes in a given population. One way to perform association mapping is to compare the allele frequencies in a sample of healthy individuals with the frequencies in a sample of unhealthy individuals. An association is claimed to exist if the allele frequencies significantly diverge from what is expected under Hardy-Weinberg equilibrium. What does this mean?
What statistical test can be used to test for significant associations? (Recall your computer lab.)
How do you calculate the allele frequencies from genotype frequencies?
What is the biological and clinical value of identifying DNA sequence variants that are associated with disease outcomes?
What is "Genetical Genomics"?
What is a cis-QTL?
What is a trans-QTL?
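The allele frequency and Hardy-Weinberg questions above can be illustrated with a short Python sketch. It computes allele frequencies from genotype counts and a chi-square statistic against Hardy-Weinberg expectations. The chi-square test is an assumption here, since the test used in the computer lab may have been a different one, and the genotype counts are made up.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (p, q): frequencies of alleles A and a from genotype counts."""
    n = n_AA + n_Aa + n_aa
    # Each AA individual carries two A alleles, each Aa individual one.
    p = (2 * n_AA + n_Aa) / (2 * n)
    return p, 1 - p


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic of observed genotype counts vs. HWE expectations.

    Assumes both alleles are present (otherwise an expected count is zero).
    """
    n = n_AA + n_Aa + n_aa
    p, q = allele_frequencies(n_AA, n_Aa, n_aa)
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_AA, n_Aa, n_aa]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical sample of 100 individuals: 30 AA, 40 Aa, 30 aa.
print(allele_frequencies(30, 40, 30))  # (0.5, 0.5)
print(hwe_chi_square(30, 40, 30))      # 4.0
```

With one degree of freedom, a statistic of 4.0 exceeds the 5% critical value of 3.84, so this hypothetical sample would be judged to deviate from Hardy-Weinberg proportions.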
Lecture 10: Big Data handling
What is meant by the term "Big Data"?
What are the key challenges of dealing with Big Data? And what are their solutions?
What are the main processing tools that are used to run large analyses automatically? (Recall your computer practical using GALAXY.)
What are key database challenges? And which database solutions are available? (Recall your computer practical using DBMS.)
THE END