Is He Guilty?: An Introduction to Working with Sequence Data and Analysis
Exploring HIV Evolution: An Opportunity to Do Your Own Research

Background information on HIV biology

 Is He Guilty?: An Introduction to Working with Sequence Data and Analysis

This is a short paper-and-pencil exercise to help you get warmed up to working with sequence data. The exercise is built around a famous investigation in the early 1990s in which a dentist in Florida was accused of spreading HIV to some of his patients during invasive dental procedures. In addition to using virus sequence data to determine whether the dentist was the source of the patients' HIV, you will get experience with:

  • the types of information that are associated with sequence data submitted to public research databases
  • the differences between working with nucleic acid sequences (DNA) and amino acid sequences (protein)
  • the ways we can read similarities and differences between sequences
  • how a multiple sequence alignment summarizes the comparisons of sequences
  • how a phylogenetic tree graphically represents the differences between sequences and can be used to develop hypotheses about their evolutionary relationships
  • how the evolutionary relationships between sequences can be used as forensic evidence
    Background on the case
    Epidemiological research into the source of HIV infection in an AIDS patient with no known risk factors identified one possible source: this patient had undergone an invasive procedure performed by a dentist with AIDS. Further research found that six other patients of this dentist were HIV-infected. A molecular analysis was done by the Stanford University School of Medicine and the Centers for Disease Control to determine whether the patients of the Florida dentist contracted the virus from him. By comparing the genetic sequences of a virus gene from blood samples of the dentist, his patients, and other HIV+ individuals in the community who did not have contact with the dentist, scientists worked to determine whether there was a relationship between the dentist's and the patients' viruses. We will use this scenario to learn about comparing sequences and inferring evolutionary relationships based on their similarities and differences.
    -Popular literature
    Gentile, B. (1991). Doctors with AIDS. Newsweek. 48-56.

    - An examination of the role of the doctor/patient relationship and whether a person's HIV status should be revealed. Includes stories of health care professionals who continued to work even after they knew they were HIV positive, and a look at the Florida dentist who infected his patients with HIV.

    -Scientific Literature
    Ou, C.Y.; Ciesielski, C.A.; Myers, G. et al. (1992). Molecular Epidemiology of HIV Transmission in a Dental Practice. Science. 256:1165-1171.

    - A molecular analysis done by the Stanford University School of Medicine and the Centers for Disease Control to determine whether six patients of the Florida dentist who were found to be HIV positive contracted the virus from him. Portions of the HIV proviral envelope gene from each of the seven patients, the dentist, and thirty-five HIV-infected people within the geographic area were amplified by polymerase chain reaction and sequenced. Accession numbers are given for the viruses used in the investigation so that the class can reproduce similar findings.

    Molecular epidemiology of HIV transmission in a dental practice.
    Science. 1992 May 22;256(5060):1165-71.
    PMID: 1589796; UI: 92271245
    234 sequences

    More Articles on the Florida Dentist

    Taking a look at sequence data

    The sequence data we are using comes from a public database called GenBank. Follow these links to take a look at a representative sequence record. [stored as a local file] [live from the Internet]

    You can also look at the abstract for the paper in which these sequences were published:

    Ou, C.Y.; Ciesielski, C.A.; Myers, G. et al. (1992). Molecular Epidemiology of HIV Transmission in a Dental Practice. Science. 256:1165-1171. [abstract]
    For this activity we have chosen 6 sequences to help you start exploring how genetic information can be used to determine whether the dentist was in fact the source of virus for the patients who have become HIV+. There are 4 sequences from patients, a dentist sequence, a sequence from someone who is HIV+ and lives in the area but has not had contact with the dentist (local control), and a sequence from an HIV+ individual who lives in a different part of the world (outgroup).

    Questions: Interpreting a Multiple Sequence Alignment

    While it is possible to compare raw sequence data manually, it quickly gets unwieldy when you are working with long sequences or many different sequences. Luckily, computers are very efficient at following instructions and performing mathematical operations. In this section you will interpret the output from a program that has performed a multiple sequence alignment on the 6 sequences you have been working with. The program "aligns" sequences by finding the best way to make their positions line up with one another and then color-codes the positions to characterize the types of differences between them. Information is also available about the percent difference between pairs of sequences.

    Multiple sequence alignment for the 6 HIV sequences
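    The percent difference between a pair of aligned sequences can be sketched in a few lines of Python. This is only an illustration of the idea, not the method used by the alignment program: the function name, the choice to skip columns containing a gap, and the three toy sequences are all assumptions made up for this example and are not taken from GenBank.

```python
def percent_difference(seq_a, seq_b, gap="-"):
    """Percent of compared alignment columns where two aligned sequences differ.

    Assumption for this sketch: columns with a gap in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # a gap column is not counted as a match or a mismatch
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * mismatches / compared


# Invented toy sequences (NOT real HIV data), already aligned to equal length.
toy_1 = "ACGTACGTAC"
toy_2 = "ACGTTCGTAC"
toy_3 = "ACG-ACGTAA"

print(percent_difference(toy_1, toy_2))  # 1 mismatch in 10 columns
print(percent_difference(toy_1, toy_3))  # 1 mismatch in 9 gap-free columns
```

    Whether gap columns should be skipped or counted as differences is a design choice; different programs handle it differently, so percent values from this sketch may not match a given tool's output.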

    Questions: Tree Reading

    Part of determining whether the dentist is the source of the patients' HIV is seeing how the sequences group together based on their similarity. The assumption is that sequences that are more similar are more closely related to one another; that is, they share a more recent common ancestor than the sequences from another group. It is possible to figure out the grouping patterns from a multiple sequence alignment, but we can also turn that over to the computer and allow it to generate a tree showing the relationships between the sequences.

    Tree of the relationships between the sequences
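    To see how a computer might turn similarities into groupings, here is a minimal sketch that repeatedly joins the two closest clusters (simple average-linkage clustering). It is an assumption-laden illustration, not the tree-building method used to produce the figure: the function names are hypothetical, and the five sequences are invented strings with neutral labels, not the sequences from this exercise.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage_tree(seqs):
    """Group sequences by repeatedly joining the two closest clusters.

    Returns a nested tuple; labels nested more deeply together are more similar.
    """
    # Each cluster key is a label or a nested tuple; the value lists its members.
    clusters = {name: [name] for name in seqs}
    while len(clusters) > 1:

        def avg_dist(pair):
            left, right = pair
            dists = [p_distance(seqs[i], seqs[j])
                     for i in clusters[left] for j in clusters[right]]
            return sum(dists) / len(dists)

        # Pick the pair of clusters with the smallest average distance.
        left, right = min(combinations(clusters, 2), key=avg_dist)
        members = clusters.pop(left) + clusters.pop(right)
        clusters[(left, right)] = members
    return next(iter(clusters))


# Invented toy sequences (NOT real HIV data), already aligned.
toy_seqs = {
    "seq_A": "ACGTACGTAC",
    "seq_B": "ACGTACGTAT",
    "seq_C": "ACGAACGAAT",
    "seq_D": "ATGTTCGGAC",
    "seq_E": "TTGACCGGCA",
}

# seq_A and seq_B differ at only one position, so they are joined first.
print(average_linkage_tree(toy_seqs))
```

    Real phylogenetic programs use more careful models of sequence change, but the nested result can be read the same way as the tree: sequences joined early share the most similarity.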



    Exploring HIV Evolution: An Opportunity to Do Your Own Research

    In this activity you will have the chance to develop your own questions and use the Biology Workbench for Students to answer them. The problem space is built around a rich set of HIV sequence data which is described below.

    The Markham et al. HIV-1 env Sequence Dataset

    Richard Markham and his colleagues (1998) published research on the pattern of HIV evolution and the rate of CD4 T-cell decline in the Proceedings of the National Academy of Sciences. In addition to the journal article, they submitted 666 nucleotide sequences to the GenBank database. They studied a 285 base pair region of the env gene. The gene product, membrane protein gp120, binds to the CD4 receptor site on T-lymphocytes and is involved in the entry of the virus into those cells. Markham et al. followed the evolution of this viral gene sequence in 15 subjects by collecting blood samples at six-month intervals for up to four years. For each visit, all the forms of the gene (clones) were sequenced and CD4 T-cell counts were made. This data set provides a rich resource for looking closely at the patterns of change in HIV over time.

    Summary of the data set - Data summary table

    Subjects: 15
    Number of visits: 3-9
    Number of clones per visit: 2-18
    Total number of sequences available: 666
    CD4 cell counts for each visit


    Markham RB, Wang WC, Weisstein AE, Wang Z, Munoz A, Templeton A, Margolick J, Vlahov D, Quinn T, Farzadegan H, Yu XF (1998). Patterns of HIV-1 evolution in individuals with differing rates of CD4 T cell decline. Proc. Natl. Acad. Sci. 95(21):12568-73.
    Pub Med ID: 98445411

    PNAS online: Vol. 95, Issue 21, 12568-12573, October 13, 1998

    The Research Scenarios

    HIV evolution scenario #1

    Your research group is working with molecular biologists and physicians to try to develop a drug therapy that is more effective at stopping infections by HIV viruses. The sequences we are working with code for a protein that sits on the outer surface of the virus and binds to a molecule on the T-cell surface, allowing the virus to enter the cell and ultimately destroy it (and with it our immune capabilities). Drugs developed to block the HIV binding protein have had limited short-term success. The major downfall is that the HIV genetic information changes rapidly (mutates), and each change in sequence alters how well the drug therapy interferes with HIV's ability to attach to the T-cell. What you need to do is look for patterns in the changes that occur in these HIV sequences over time. Are there certain positions that change more frequently than others? Do they change in predictable ways (show the same change)? You will not be able to study all the available sequences in the time we have. How will you decide which patients' data to work with?
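    One way to start on the question of which positions change most often is to count, for each column of an alignment, how many sequences differ from the most common residue. The sketch below assumes the clones are already aligned to equal length; the function name and the five toy clones are invented for illustration and are not part of the Markham et al. data set or the Biology Workbench.

```python
from collections import Counter


def column_variability(aligned_seqs):
    """For each alignment column, report the most common residue and how many
    sequences differ from it.

    Returns a list of (position, consensus_residue, n_variants, counts) tuples,
    with positions numbered from 1.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    report = []
    for pos in range(length):
        counts = Counter(s[pos] for s in aligned_seqs)
        consensus, n_consensus = counts.most_common(1)[0]
        n_variants = len(aligned_seqs) - n_consensus
        report.append((pos + 1, consensus, n_variants, dict(counts)))
    return report


# Invented toy clones (NOT real HIV data), already aligned.
toy_clones = ["ACGTAC", "ACGTAT", "ACCTAT", "ACGTAC", "ATGTAT"]

for pos, consensus, n_variants, counts in column_variability(toy_clones):
    if n_variants > 0:
        print(pos, consensus, n_variants, counts)
```

    Running the same count on clones from several visits, or from several patients, gives a first look at whether the same positions keep turning up as variable.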

    HIV evolution scenario #2

    Your research group is working with epidemiologists to try to understand the ways that HIV is transmitted through a population. Understanding the patterns of movement of the virus from one individual to another is an important step in designing community interventions, educational programs, and other public health approaches to stemming the spread of the HIV/AIDS epidemic. With the advent of inexpensive molecular biology tools, public health officials now have a new source of information for studying the transmission of a disease. Before they can make sense of the spread of the disease, they need to know something about how rapidly the virus is changing within an individual. They have turned to your research group for help understanding these patterns of change. Is the change occurring at a steady rate within an individual? Is the rate of change consistent between individuals?
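    A rough way to ask whether change is steady within an individual is to measure how far each visit's clones have drifted from a baseline sequence and then compare the change per month between consecutive visits. The sketch below is one possible approach under stated assumptions: the baseline is simply a sequence chosen from the first visit, the distance is the plain fraction of differing positions, and the function names, visit times, and toy clones are all invented for illustration.

```python
def p_distance(a, b):
    """Fraction of positions that differ between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def divergence_by_visit(baseline, visits):
    """Mean distance from the baseline at each visit.

    `visits` maps months since the first visit to a list of aligned clones.
    """
    return {
        months: sum(p_distance(baseline, c) for c in clones) / len(clones)
        for months, clones in sorted(visits.items())
    }


def rates_between_visits(divergence):
    """Change in mean divergence per month between consecutive visits."""
    points = sorted(divergence.items())
    return [
        (t0, t1, (d1 - d0) / (t1 - t0))
        for (t0, d0), (t1, d1) in zip(points, points[1:])
    ]


# Invented toy data (NOT real HIV data): two clones at each of three visits.
toy_baseline = "ACGTACGTAC"
toy_visits = {
    0: ["ACGTACGTAC", "ACGTACGTAC"],
    6: ["ACGTACGTAT", "ACGTACGTAC"],
    12: ["ACGAACGTAT", "ACGTTCGTAT"],
}

divergence = divergence_by_visit(toy_baseline, toy_visits)
for t0, t1, rate in rates_between_visits(divergence):
    print(f"months {t0}-{t1}: {rate:.4f} per month")
```

    If the per-month values are similar from interval to interval, the change looks steady; repeating the calculation for a second subject gives a first comparison between individuals.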

    Getting Started with the Biology Workbench for Students

    The Biology Workbench for Students

    If you search for "Markham and Wang" you will see the 666 sequences from this study and you can choose the ones that you would like to work with.


    Background Information on HIV Biology

    Handout on levels of information [genome/gene/sequence/structure]

    Information to accompany figure: The HIV genome is about 9,200 RNA bases long (HIV is a retrovirus). It has 10 genes (transcribed units) that code for 17 different protein products. We will look at part of the env (envelope) gene, which codes for 2 proteins that make up the outside of the protein coat, or envelope. One of those proteins (gp120) sits on the outer surface of the virus, binds to immune cells (via CD4 receptors on T-lymphocytes), and avoids antibodies. The other envelope protein (gp41) sticks through the viral membrane and holds the surface protein in place. We are not looking at the entire surface protein sequence (it is about 1.5K bases long), just one of the variable regions (V3), which is thought to be involved in making contact with the immune cells that the virus attacks.
    HIV biology background
    Cells Alive HIV tutorial