AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTCGATTAAAAAAAG
AGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTT
TATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATCACCACCATTACC
ATAAAAATTACAGAGTACATCTCTGTGTGGATCAACATCCATGAAACGCATTAG
ACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACA
CAGAAAAAAGCCCGCACCTGACGTCGGATAGCAGCTTCTGCACTGGTTATTTTT
AGGTTATCTGTGTGGATAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCA
Construction
of a graph from the DNA sequence given above
Every plant, animal,
bacteria and fungus has a DNA sequence composed of four different
bases called nucleotides. These nucleotides, Adenine (A), Cytosine
(C), Guanine (G) and Thymine (T), are wrapped up in the famous Watson-Crick
double helix. To describe a DNA sequence we list the sequence of nucleotides,
as above.
 |
A complete list
of nucleotides is called a genome. The genome of a bacterial cell
will typically have several million letters, while the human genome
contains over three billion nucleotides. It is known that some
of an organisms DNA codes for the creation of proteins. But scientists
still have no clear understanding of the function of much of the
DNA sequence. At first the apparent randomness of the genome led
people to believe that much of the sequence is irrelevant junk.
It is now thought that this is not the case and their function
may be uncovered if we can focus research on similar substrings.
However the sheer length of the genome and the number of matches
they contain is so great that it is not easy to make sense of
the data. |
Francis Clark is a PhD student
with Kevin Burrage in the Maths Department at the University of Queensland.
Working closely with biologists they are using graph theory to classify
similarities in DNA subsequences.
A graph is a set of nodes and
edges which connect some of the pairs of nodes. Francis took a DNA
sequence, represented subsequences as nodes and joined similar subsequences,
or nodes, by edges. So the section of DNA sequence given above can
be thought of as an ellipse where the triangle indicates the presence
of three similar subsequences and a line another two similar subsequences.
The graphs constructed for bacterial
genomes (4-6 million nucleotides) are complex with hundreds of thousands
of nodes and edges, and are too big to usefully project onto a two
dimensional page. So Francis refined his model to comprise smaller
disconnected subgraphs. In this way a subgraph may represent subsequences
for a particular known (or unknown) biological function, such as a
site where a particular protein can bind to. Perhaps, in the future,
graph theory can unlock the secrets of those parts of our genome historically
thought of as junk , so the complex workings of DNA can
be understood.
|