How to grep GenBank files for structural genes

Posted on August 1, 2013

Today, I was trying to find all DNA constructs in the lab that confer kanamycin resistance. Of course, the first tool I thought of was grep! Unfortunately, genbank stores sequence information like this:

ORIGIN
        1 TAGTTATTAA TAGTAATCAA TTACGGGGTC ATTAGTTCAT AGCCCATATA TGGAGTTCCG
       61 CGTTACATAA CTTACGGTAA ATGGCCCGCC TGGCTGACCG CCCAACGACC CCCGCCCATT
      121 GACGTCAATA ATGACGTATG TTCCCATAGT AACGCCAATA GGGACTTTCC ATTGACGTCA
      181 ATGGGTGGAG TATTTACGGT AAACTGCCCA CTTGGCAGTA CATCAAGTGT ATCATATGCC
      241 AAGTACGCCC CCTATTGACG TCAATGACGG TAAATGGCCC GCCTGGCATT ATGCCCAGTA

Clearly, the newlines, spaces and numbering make it so a simple grep ATGATTGAACAAGAT *.gb won’t work. After a little self-instruction in basic unix tools, I hacked together this alternative solution that pulls out the sequence from the genbank file before piping it into grep.

find . -name .ape -print0 | xargs -t -0 -I {} sh -c ’tr -d “[0-9]” < $1 | sed s/.ORIGIN// | grep -i $2’ – {} ATGATTGAACAAGAT

This nifty command recursively searches through subdirectories for filenames ending in .ape. Then, the translate utility deletes all newlines, spaces and digits of those files. Sed deletes everything before ORIGIN, and finally, grep searches for the DNA string we are interested in (ATGATTGAACAAGAT are the first 15 bp of the kanamycin resistance gene).