Using bioinformatics to mine cancer SNP microarray databases

We will extend the existing Case-It SNP microarray simulation, which includes cases based on prostate cancer, HIV resistance, and others. We are working to find one or more genes for which SNP microarray data are available so that we can build a new case for Case-It. The case will be extended using protein structure information from the RCSB Protein Data Bank and PyMOL. Students will map the location of the mutation studied in the case onto the protein structure. We are currently considering the BRCA-1 gene and will focus only on point mutations.

We may also direct students to protein structure information for the APC gene. This gene is the focus of a colon cancer case on Case-It and would lead to interesting discussions about determining protein structure, because the APC protein is intrinsically unstructured.

We would like to build a bioinformatics approach that will search databases for other SNP data sets that could be implemented in the Case-It simulation.

Initially, students would perform a small search manually. Next, they would write a simple program using Python that would find SNPs of interest.
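As a rough illustration of what such a first program might look like, the sketch below filters a small tab-delimited SNP table for point mutations in one gene. The table layout, the column names (snp_id, gene, ref, alt), and the function name find_point_mutations are assumptions made for this example, and the records are made-up placeholders rather than real database entries.

```python
import csv
import io

# Hypothetical tab-delimited SNP table. The IDs, gene names, and alleles
# are made-up placeholders, not real database records.
EXAMPLE_TABLE = """snp_id\tgene\tref\talt
snp_001\tGENE_A\tA\tG
snp_002\tGENE_B\tC\tT
snp_003\tGENE_A\tG\tGA
snp_004\tGENE_A\tT\tC
"""


def find_point_mutations(handle, gene):
    """Return the IDs of single-base substitutions recorded for one gene."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # A point mutation here means one reference base replaced by one base.
        is_point = len(row["ref"]) == 1 and len(row["alt"]) == 1
        if row["gene"] == gene and is_point:
            hits.append(row["snp_id"])
    return hits


print(find_point_mutations(io.StringIO(EXAMPLE_TABLE), "GENE_A"))
# prints ['snp_001', 'snp_004']
```

The same function would accept an open file instead of the in-memory string, so students could point it at a table they downloaded during the manual search.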

Finally, the project would be extended to include a module called ‘Concepts in Bioinformatics Programming’.

Concepts in Bioinformatics Programming

To help students acquire the skills needed to do data mining, we propose to develop a course in bioinformatics programming concepts. Instructional components of this course would include:

1)      Programming techniques (e.g., in C/C++, Java, Perl, Python).

2)      Basic molecular biology concepts; e.g., DNA (4 nucleotides), amino acids, codons, the genetic code, etc.

By the end of the course the students would be able to do things such as:

1)      Create random sequences

2)      Work with dynamic programming and backtracking

3)      Use scoring matrices

4)      Apply local and global alignment

5)      Use sum of pairs or minimum entropy

6)      Perform multiple sequence alignment (MSA), e.g., pairwise, tree, progressive, and star alignments

7)      Build profiles from an MSA

8)      Work with BLOSUM matrices

9)      Use suffix trees for MSA
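To give a sense of how dynamic programming, backtracking, and global alignment fit together, here is a minimal sketch of a Needleman-Wunsch style global alignment. The function name global_align and the simple match/mismatch/gap scores are assumptions chosen for illustration; a course version would more likely read scores from a substitution matrix such as BLOSUM.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return (score, aligned_a, aligned_b)."""

    def sub(x, y):
        # Substitution score for aligning character x with character y.
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table (dynamic programming step).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align pair
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Backtrack from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("ACGT", "AGT"))
# prints (2, 'ACGT', 'A-GT')
```

When several alignments share the best score, this sketch returns only one of them, preferring an aligned pair over a gap during backtracking.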

Potential homework projects:

1)      Create random sequences

2)      Use hashes to simplify programs

3)      Match sequences

4)      Translate codons into amino acids

5)      Use regular expressions for searching

6)      Suffix trees

7)      Multiple sequence alignment (MSA) implementations using other methods.
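Several of the smaller projects above could start from something like the following sketch, which generates a random DNA sequence, translates codons through a hash (a Python dict), and searches with a regular expression. The function names are hypothetical, and the codon table is deliberately partial (seven of the 64 codons) to keep the example short; codons missing from it are reported as X.

```python
import random
import re

BASES = "ACGT"

# Partial codon table for illustration only; the full genetic code has
# 64 codons. "*" marks a stop codon.
CODON_TABLE = {
    "ATG": "M", "TGG": "W", "AAA": "K", "GGC": "G",
    "TAA": "*", "TAG": "*", "TGA": "*",
}


def random_dna(length, seed=None):
    """Return a random DNA string; pass a seed for repeatable results."""
    rng = random.Random(seed)
    return "".join(rng.choice(BASES) for _ in range(length))


def translate(dna):
    """Translate codon by codon, ignoring any incomplete trailing codon."""
    usable = len(dna) - len(dna) % 3
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    # Codons absent from the partial table are shown as "X".
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_motif(dna, pattern):
    """Return start positions of non-overlapping regex matches."""
    return [m.start() for m in re.finditer(pattern, dna)]


print(random_dna(12, seed=1))            # a 12-base random sequence
print(translate("ATGAAATGGTAA"))         # prints MKW*
print(find_motif("ATGAAATGGTAA", "ATG")) # prints [0, 5]
```

Note that re.finditer reports non-overlapping matches only, so a motif that overlaps itself would need a lookahead pattern or a manual scan.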

