Short-cut is found to locate hidden genes in DNA

A needle in a haystack just about sums up the ongoing search for unknown genes secreted along the huge strands of DNA being catalogued each day by labs around the world. A NUI Galway research student has found a short-cut, however, using computer-based artificial intelligence to dig out that hidden needle, reports Dick Ahlstrom.

The goal is to be able to locate genes dotted along a DNA genome, explains Shaun Mahony, a second-year PhD candidate working in Galway's National Centre for Biomedical Engineering Science.

"These sequences are millions and billions of letters long," he says. "Only 2 per cent of the human genome is part of a gene. It is very difficult to separate them out."

Late last month he won the inaugural Embark/Institution of Engineers in Ireland computer-science research achievement award. Co-sponsored by the IEI and Embark, operated by the Irish Research Council for Science, Engineering and Technology, the award aims to recognise top quality work done by Embark scholarship researchers.

About 90 Embark scholars were eligible for the competition, and Mahony was one of 11 sent forward by the recipient third-level institutions for a final public presentation of their research projects. Aside from the kudos of the award, Mahony also received a €3,000 bursary to further his research aims.

Mahony graduated two years ago with a degree in electronic engineering, but searching through biomolecules was already familiar territory. He began a bioinformatics project at the end of his BSc and expanded this into his current PhD effort.

"Bioinformatics is the study of DNA sequences using computer programmes," explains Mahony. "It is a very young field. These DNA sequences have only been around for the last three or four years."

The human genome is about three billion "letters" or base-pairs long, and even a lowly bacterium has a genome about two million letters long, he says. The challenge is finding the important working DNA elements, the genes that produce the essential proteins used by the organism to sustain life.

Bioinformatics is the tool being used in this worldwide effort and Mahony is playing his part, writing two computer programmes that help dig out the genes from among all the so-called junk DNA. More importantly, he built in artificial intelligence, enabling his two programmes, RescueNet and Sombrero, to learn and improve their searching performance.

He explains how his programmes work by drawing an analogy with ordinary language. A sentence in a book full of gibberish could be equated to a gene sitting in a very long strand of DNA. A sentence can be made up of many different words, but a few, for example "a", "the", "is" and "are", would very often be part of a typical English sentence.

"It is the same in a DNA sequence," he says. "There are DNA words that are used more often in genes but not in non-gene sequences. My software picks out these words that are used more frequently."

RescueNet works on chunks two to five million base pairs long. The software is first "trained" by being given gene sequences that contain the telltale DNA words. It learns as it progresses, getting better and better at spotting where the genes lie in the genome.
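The "DNA words" idea can be illustrated with a minimal sketch. This is not RescueNet's code: it simply counts short words in known gene sequences and in non-gene sequences, then scores each word by how much more often it turns up in genes. The function names, the word length of three letters and the smoothing constant are all assumptions made for the illustration.

```python
from collections import Counter
import math


def kmer_counts(sequences, k):
    """Count every overlapping k-letter word across a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_words(gene_seqs, background_seqs, k=3, pseudocount=1.0):
    """Score each word by log2(frequency in genes / frequency in background).

    Hypothetical helper for illustration only. Positive scores mark words
    seen more often in the gene examples; the pseudocount avoids dividing
    by zero for words missing from one of the two sets.
    """
    gene = kmer_counts(gene_seqs, k)
    back = kmer_counts(background_seqs, k)
    vocab = set(gene) | set(back)
    gene_total = sum(gene.values()) + pseudocount * len(vocab)
    back_total = sum(back.values()) + pseudocount * len(vocab)
    scores = {}
    for word in vocab:
        g = (gene[word] + pseudocount) / gene_total
        b = (back[word] + pseudocount) / back_total
        scores[word] = math.log2(g / b)
    return scores


if __name__ == "__main__":
    # Toy "training" data, invented for the example.
    scores = enriched_words(["ATGATGATG"], ["CCCCCCCCC"], k=3)
    for word, score in sorted(scores.items(), key=lambda item: -item[1]):
        print(word, round(score, 2))
```

In this toy run the word "ATG" gets a positive score and "CCC" a negative one, which is the kind of signal a trained gene finder could then look for in unlabelled DNA.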

"My software is flexible enough to pick out the pattern of these words," says Mahony.

RescueNet is about 90 to 92 per cent accurate, close to the best current systems, which are 95 per cent accurate.

"My software has particular advantages in some genomes, the hard-case genomes," he adds. The DNA language is made up of four word elements referred to by the letters A, C, G and T. Mahony's hard cases include "high G-C genomes" that have a high proportion of G-C combinations.

The Sombrero package is new and yet to be published in the scientific literature. It is designed to find "gene switches" that initiate gene activity. "These switches or promoter sequences are very small compared to genes," says Mahony. Sombrero searches through DNA blocks about 10,000 base pairs long for promoter sequences only six to 20 base pairs long.
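To give a feel for the scale of that search, here is a rough sketch of one crude way to look for a short shared word across several DNA blocks. It is a stand-in technique, not Sombrero's method: it just lists the fixed-length words that appear in most of the blocks. The function name, the six-letter word length and the threshold are assumptions for illustration.

```python
from collections import Counter


def shared_words(blocks, k=6, min_fraction=0.8):
    """Return the k-letter words present in at least min_fraction of blocks.

    Hypothetical helper: each block counts a word once, however many
    times the word occurs inside that block.
    """
    presence = Counter()
    for block in blocks:
        seen = {block[i:i + k] for i in range(len(block) - k + 1)}
        presence.update(seen)
    threshold = min_fraction * len(blocks)
    return sorted(word for word, n in presence.items() if n >= threshold)


if __name__ == "__main__":
    # Three invented blocks, each with the same six-letter word planted in it.
    blocks = ["AAATATAATCCC", "GGGTATAATGGG", "CTCTATAATCAC"]
    print(shared_words(blocks, k=6, min_fraction=1.0))
```

Exact matching like this breaks down quickly on real data, where promoter sequences vary from gene to gene, which is one reason a pattern-learning approach is attractive.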

Both packages are based on software artificial intelligence known as "self-organising maps", a form of neural network algorithm, says Mahony. "It is good at clustering data, for finding regions of similarity," he adds. This also enables the software to learn as it goes along.
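For readers curious about what a self-organising map does, the following is a minimal one-dimensional sketch, assuming each DNA fragment is first turned into a vector of two-letter word frequencies. It is not taken from RescueNet or Sombrero; the function names, map size, learning rate and neighbourhood schedule are all illustrative choices.

```python
from itertools import product
import math
import random


def kmer_vector(seq, k=2):
    """Turn a DNA string into a vector of k-letter word frequencies."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {w: 0 for w in words}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    total = max(len(seq) - k + 1, 1)
    return [counts[w] / total for w in words]


def best_match(nodes, vec):
    """Index of the map node closest to vec (squared Euclidean distance)."""
    return min(range(len(nodes)),
               key=lambda j: sum((n - v) ** 2 for n, v in zip(nodes[j], vec)))


def train_som(data, n_nodes=4, epochs=50, lr0=0.5, seed=0):
    """Train a tiny 1-D self-organising map on a list of equal-length vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = max(n_nodes / 2.0, 1.0)
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighbourhood shrinks
        for vec in rng.sample(data, len(data)):     # shuffled pass over data
            winner = best_match(nodes, vec)
            for j, node in enumerate(nodes):
                # Nodes near the winner on the map move most towards vec.
                influence = math.exp(-((j - winner) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    node[d] += lr * influence * (vec[d] - node[d])
    return nodes


if __name__ == "__main__":
    fragments = ["ACACACAC", "CACACACA", "GTGTGTGT", "TGTGTGTG"]  # invented
    data = [kmer_vector(s) for s in fragments]
    som = train_som(data)
    for s, vec in zip(fragments, data):
        print(s, "-> node", best_match(som, vec))
```

After training, each node holds a typical word-frequency pattern, and fragments with similar patterns tend to be assigned to the same or neighbouring nodes, which is the clustering behaviour described above.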

"By the time it finishes training, it knows the (DNA) words it should search for."