Bioinformatics-2013

Welcome to Bioinformatics

Contact:

Please send all correspondence relating to this course, as well as all completed computer labs, to: [course@johanneslab.org]

Course aim:

The aim is to provide an introduction to the following important topics in Bioinformatics:

Jump to Lecture [1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10]

What can you find on this page:

This page provides information for each of the class lectures. It includes pdf versions of the lecture slides as well as links to download datasets and instructions for the computer labs. I am also providing supplemental material with each lecture, which is meant to reinforce the course material. More general online educational resources are listed at the bottom of this page. These include tutorials, webinars and presentations.

Jump to [General online and educational material]

Symposium:

At the end of this course you will give a presentation to the other students. In this presentation you will summarize an original scientific study that was highlighted in the News section of the journals Nature or Science. Your task is to show how the concepts and/or tools employed in the study relate to one or more of the topics covered in our lectures. These symposium presentations are an important component of this course, which is reflected in their contribution to your final grade.

Jump to [Symposium]

Study Guide:

In this course you will be exposed to a lot of new information. To help you study for the exam we are putting together a study guide that will highlight the most important topics.

Jump to [Study Guide]

Grading scheme:

Your final grade is obtained as follows:

We will try to keep you up-to-date on your current grades and the status of your assignments. See the spreadsheet below.

Grades-and-Assignments

PART 1| Lecture 1: Course overview/Genome projects

       

Date:

21-5-2013

Lecture slides:     

[pdf]

Computer lab:      

n/a

Lecturer(s):       

Frank Johannes | Marnix Medema

Summary:

In this lecture you will learn the key components of 'Genome Projects'. We will highlight these components in the context of the Human Genome Project. Finally, we will discuss ongoing genome projects in microbes, as microbial systems are an important focus of GBB.

PART 1 | Lecture 1: Supplemental material

Genome Browsers

[Ensembl | NCBI | UCSC]

Human Genome Project

[Summary | Detailed Timeline]

Human Genome announcement at the White House

[Video]

Publications of the initial working draft human genome sequence

[IHGSC-Nature-2001 | Venter-Science-2001]  

PART 1| Lecture 2: Genome sequencing

       

Date:

22-5-2013

Lecture slides:     

[pdf]

Computer lab:      

n/a

Lecturer(s):       

Frank Johannes

Summary:

DNA sequencing is the most important component of Genome Projects. In this lecture you will learn the principles of Sanger sequencing (First Generation Sequencing) as well as newer methods such as massively parallel sequencing (Next Generation Sequencing).

PART 1 | Lecture 2: Supplemental material         

Principles of Sanger Sequencing

[Video#1 | Video#2]    

Review of Next Generation Sequencing (NGS) Methods

[Metzker-NRG-2010]

Key companies producing NGS technologies

[Roche 454 | LifeTechnologies (2-color encoding) | Illumina ]

[HelicosBiosciences | Pacific Biosciences]

Whole Genome Shotgun Sequencing in action

[Craig Venter's Sorcerer II expedition]

PART 1| Lecture 3: DNA sequence variation

       

Date:

23-5-2013

Lecture slides:     

[pdf-part1 | pdf-part2 | pdf-part3 | pdf-part4]

Computer lab:      

[pdf | doc | data | ANSWERS | Supplemental tutorial]

Lecturer(s):       

Frank Johannes | Victor Guryev

Summary:      

Having determined the DNA sequence of a single (reference) genome, it is of interest to identify and characterize DNA sequence differences between individuals or species. In this lecture you will hear about various types of DNA sequence variations, and how these can be detected using older array-based technologies as well as newer Next Generation Sequencing technologies. We will highlight several ongoing large-scale projects that focus on DNA sequence variation in populations, with particular emphasis on the Genome of the Netherlands (GO.NL).

PART 1 | Lecture 3: Supplemental material    

Large-scale projects to characterize DNA sequence variation in various populations

 

Humans: [International HapMap Project | 1000 Genomes Project |

Human Variome Project | GO.NL (Genome of the Netherlands) |

FarGen Project (Proposal to sequence the genomes of a whole population)]  

Plants: [1001 Genomes Project (Arabidopsis) | 1KP Plant Genomes Project (transcriptomes of 1000 plant species)]

Vertebrates: [10K Genomes Projects]

Microorganisms: [The 10k Microbial Genomes Project]

Micro-arrays for SNP detection

[LaFramboise-NAR-2009]

PART 1| Lecture 4: Sequence alignment

       

Date:

24-5-2013

Lecture slides:     

[pdf]

Computer lab:      

[pdf | doc | data | ANSWERS]

Lecturer(s):       

Minh Anh Nguyen

Summary:

"Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences" (Wiki). Sequence alignment is a fundamental activity in bioinformatics. It is used for assembling whole genomes from smaller sequences, for mapping short Next Generation Sequencing reads to a reference genome, in protein analysis (see next lecture), in phylogenetic analysis (not covered), and so on. In previous lectures, we have simply assumed that sequence alignment can be performed. In this lecture you will learn how it actually works. You will be introduced to pairwise and multiple sequence alignment approaches.

PART 1 | Lecture 4: Supplemental material 

Lecture summary and exercises

[pdf | pdf-with-ANSWERS]

Overview of pairwise and multiple sequence alignment tools

[Wiki]
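
To reinforce the pairwise alignment material from this lecture, here is a minimal Python sketch of global alignment by dynamic programming (the Needleman-Wunsch algorithm). The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; they are not taken from the lecture slides or the computer lab.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b.

    Returns (score, aligned_a, aligned_b), where '-' marks a gap.
    The scoring values are illustrative defaults, not course settings.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table: each cell extends the best of three neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two letters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two sequence lengths. This is one reason why mapping millions of short reads to a whole genome relies on faster heuristic tools rather than on this exact method.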

PART 2| Lecture 5: Protein alignment/Gene prediction/Database searches

       

Date:

27-5-2013

Lecture slides:     

[pdf-part1 | pdf-part2 | pdf-part3]

Computer lab:      

n/a

Lecturer(s):       

Peter Terpstra

Summary:

The DNA sequence is simply a string of letters that provides a template for producing "functional" products such as mRNAs that code for proteins. Proteins "make things happen" in the cell and produce observable phenotypes. Knowledge of the DNA sequence allows us to infer the sequence of proteins. Conversely, knowing the amino acid sequence of a protein, we can map it back to the genic region that produced the protein (and/or its homologs). This helps us to predict coding elements in raw DNA sequences and starts the genome "annotation" process. Together with knowledge of (e.g.) start, stop and splice sites, it allows us to "predict" genes based on sequence information alone. Knowing where to find the databases that contain all this information is of course indispensable.

Protein alignment

We will discuss: methods of protein alignment; what protein substitution scoring matrices are; how to extract protein “profiles” from multiple alignments, leading to protein families; and the difference between “profiles” and “motifs”. We will also briefly discuss how one could predict a protein's secondary or 3D structure from its amino acid sequence alone.

Gene prediction

We will discuss: the layout of pro- and eukaryotic genes; homology and “ab initio” methods of predicting the location of protein-coding genes in unannotated DNA sequences; what gene “signals” and “content” are; and the differences between predicting genes in prokaryotes and in eukaryotes.
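
As a rough illustration of how start and stop "signals" alone can suggest candidate genes, the following Python sketch scans the three forward reading frames of a DNA string for open reading frames (ATG followed in-frame by a stop codon). It is a deliberately simplified, prokaryote-style example: it ignores the reverse strand, introns and splice sites, and all content statistics. The function name and the `min_codons` parameter are hypothetical and are not part of the course material.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs.

    start is the index of the A in ATG; end is exclusive and includes
    the stop codon. min_codons counts codons from ATG up to, but not
    including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None  # close the ORF and keep scanning
    return orfs


sequence = "CCATGAAATTTTAGCC"
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])
```

Real gene finders combine such signals with content measures (for example codon usage) and, for eukaryotes, with explicit models of exons and introns.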

Database searches

This will be an introduction to databases that contain DNA/RNA/protein sequence data as well as DNA variation, expression and metabolic data: where to find them and how to use them.

PART 2 | Lecture 5: Supplemental material

Overview of online databases

[EBI-EMBL]

Computer Exercises with answers

[pdf]

PART 2| Lecture 6: Epigenome projects

       

Date:

28-5-2013

Lecture slides:     

[pdf-part1 | pdf-part2]

Computer lab:      

[pdf | doc | data | ANSWERS]

Lecturer(s):       

Frank Johannes | Maria Colome Tatche

Summary:

All cells in your body have the same DNA sequence (save a few somatic mutations). Nonetheless, there is vast functional divergence between cells from different tissues or time-points during development. This is achieved by turning genes on or off in a tissue-dependent or time-dependent manner. The systematic (genome-wide) study of this "functional organization" of the genome has become known as "epigenomics". In this lecture you will learn the basics of epigenomics. The material will be presented in the context of the ENCODE project, which is the largest epigenomics project to date.

PART 2 | Lecture 6: Supplemental material         

The mysterious epigenome: What lies beyond DNA?

[Video]

Epigenomics: Scientific background

[NCBI]

Epigenomics: centralized data deposit

[NCBI]

ENCODE Project information and resources

[Home (UCSC) | NHGRI | Ensembl | Nature ENCODE explorer]

A User's guide to ENCODE

[The ENCODE Project Consortium-PLoSBiology-2011]     

modENCODE Project information and resources

[Home | NHGRI]

Roadmap Epigenomics Program

[Home]

News article about 'ENCODE' and 'Roadmap Epigenomics Program'

[The Scientist Magazine 2012]

PART 2| Lecture 7: Epigenome sequencing

       

Date:

29-5-2013

Lecture slides:     

[pdf-part1 | pdf-part2 | pdf-part3]

Computer lab:      

[pdf | doc | data (see Nestor) | ANSWERS]

Lecturer(s):       

Frank Johannes | Lionel Morgado | Rene Wardenaar

Summary: 

In this lecture you will learn about cutting-edge approaches for measuring specific components of epigenomes. The first method is called Chromatin Immunoprecipitation followed by sequencing (ChIP-seq). This method is used to determine transcription factor binding events as well as the presence or absence of certain histone modifications along the genome. You will also learn about Whole Genome Bisulphite Sequencing (WGBS-seq), which is used to determine DNA methylation states at single-base resolution. These methods rely on the Next Generation Sequencing technologies that were discussed in earlier lectures.
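
The idea behind reading methylation from bisulphite data can be sketched in a few lines of Python. Bisulphite treatment converts unmethylated cytosines so that they are read as T, while methylated cytosines are still read as C. At each reference cytosine, the fraction of reads showing C therefore estimates the methylation level. The sketch below assumes forward-strand reads that are already aligned without gaps and given as (start, sequence) pairs; the function name and input format are hypothetical simplifications, not the pipeline used in the computer lab.

```python
def methylation_levels(reference, reads):
    """Estimate the methylation level at each covered reference cytosine.

    reads is a list of (start, sequence) pairs: bisulphite-converted,
    forward-strand reads aligned to the reference without gaps.
    Returns {position: fraction of reads showing C}.
    """
    counts = {}  # position -> [reads showing C, reads showing C or T]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos >= len(reference) or reference[pos] != "C":
                continue  # only reference cytosines are informative
            if base not in ("C", "T"):
                continue  # other bases: sequencing error or variant, skip
            tally = counts.setdefault(pos, [0, 0])
            if base == "C":
                tally[0] += 1  # C retained: evidence for methylation
            tally[1] += 1
    return {pos: c / total for pos, (c, total) in sorted(counts.items())}


reference = "ACGTCA"
reads = [(0, "ACGTTA"), (0, "ACGTTA"), (1, "TGTCA")]
for position, level in methylation_levels(reference, reads).items():
    print(position, round(level, 2))
```

A real analysis must also handle the reverse strand (where the conversion appears as G to A), incomplete conversion, and the alignment of converted reads itself.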

PART 2 | Lecture 7: Supplemental material 

Review of ChIP-seq analysis

[Park-NRG-2009]       

Review of DNA methylation analysis

[Laird-NRG-2009]

PART 3| Lecture 8: Data integration: experimental populations

       

Date:

30-5-2013

Lecture slides:     

[pdf]

Computer lab:      

[pdf | doc | data | ANSWERS]

Lecturer(s):       

Frank Johannes

Summary:  

It is remarkable that high-throughput measurements, particularly those generated by Next Generation Sequencing technologies, can nowadays be collected with such ease and at ever-decreasing cost. As a biologist you must ask yourself: Is this type of data useful? And can it help me answer novel biological questions? In this lecture you will hear a biological story that illustrates the power of high-throughput DNA sequence and epigenomic data in understanding phenotypic inheritance in the model plant Arabidopsis. This case study should help you to see the relevance of what you have learned so far in the course.

PART 3 | Lecture 8: Supplemental material         

PART 3| Lecture 9: Data integration: human populations

       

Date:

31-5-2013

Lecture slides:     

[pdf]

Computer lab:      

[pdf | doc | data | ANSWERS]

Lecturer(s):       

Lude Franke

Summary:

Similar to the previous lecture (Lecture 8), this lecture will illustrate the use of high-throughput data in unraveling the causes of complex diseases in humans. Again, the goal of this lecture is to underline the relevance of the material that you have learned in the context of a real biological question.

PART 3 | Lecture 9: Supplemental material         

PART 3| Lecture 10: Big data handling

       

Date:

3-6-2013

Lecture slides:     

[pdf-part1 | pdf-part2]

Computer lab:      

[doc-part1 | doc-part2 | doc-part3 | data | ANSWERS1 | ANSWERS2]

Lecturer(s):       

Morris Swertz

Summary:

Modern measurement technologies such as the Next Generation Sequencing methods discussed in this course generate a tremendous amount of molecular data. The storage, manipulation and dissemination of this data to end-users is a major bottleneck in gaining meaningful biological insights. In this lecture you will learn about ways to handle this so-called Big Data.

PART 3 | Lecture 10: Supplemental material         

For additional reading material about this topic see the links and publications listed in the following ppt slide:

[ppt]

Symposium: Day 1

       

Date:

4-6-2013

Instructions: 

[pdf]

Groups presenting:

see spreadsheet below

Symposium-Groups-Day1

Symposium: Day 2

       

Date:

5-6-2013

Instructions: 

[pdf]

Groups presenting:

see spreadsheet below

Symposium-Groups-Day2

General online and educational material

Online training courses at EBI-EMBL

[go there]

Overview of online software and databases at EBI-EMBL

[go there]

Video tutorial on how to use Ensembl tools

[go there]

Training and tutorials at NCBI

[go there]

NCBI educational resources

[go there]

Educational resources at the National Human Genome Research Institute

[go there]

Website of the US broadcaster PBS (Public Broadcasting Service) with interesting NOVA documentaries. Note: the PBS website will not show them, but most of these videos can be viewed on YouTube.

[go there]

Study guide questions

If you really KNOW the answers to the following questions, you should be able to perform well on the final exam.

Lecture 1: Genome Projects

Lecture 2: DNA sequencing

Lecture 3: DNA sequence variation

Focus on lecture slides "pdf-part1"

Focus on lecture slides "pdf-part3"

Lecture 4: Sequence alignment

Pairwise sequence alignment

Multiple sequence alignment

Lecture 5: Protein alignment, gene prediction and database searches

Lecture 6: Epigenome projects and tiling array analysis

Scientific background

Array-based analysis using finite mixture models

Array-based analysis using Hidden Markov Models (HMM)

Lecture 7: Epigenome sequencing and analysis

The ENCODE project

Whole Genome Bisulphite Sequencing (WGBS-seq)

ChIP-seq

Lecture 8: Data integration: experimental populations

Lecture 9: Data integration: human populations

Lecture 10: Big Data handling
