Perl Weekly Challenge 90, Part 1


Part 1 of this week's challenge reads:

DNA is a long, chainlike molecule which has two strands twisted into a double helix. The two strands are made up of simpler molecules called nucleotides. Each nucleotide is composed of one of the four nitrogen-containing nucleobases cytosine (C), guanine (G), adenine (A) and thymine (T).


Write a script to print the nucleobase count in the given DNA sequence. Also print the complementary sequence where thymine (T) on one strand is always facing an adenine (A) and vice versa; guanine (G) is always facing a cytosine (C) and vice versa.

To get the complementary sequence use the following mapping:

T => A; A => T; G => C; C => G


First, we won’t restrict ourselves to printing the complement of a single DNA sequence; that would be too easy, as it would turn the exercise into a glorified Hello, World! program. Instead, we will be reading the sequences from STDIN.

We will solve this using different languages. Most languages have a built-in method to calculate the length of a string, but they differ in how the complementary sequence can be calculated:

  • Languages which have a translate (or, in Perl speak, transliterate) operation (bash, Perl, Python, and Ruby).
  • Languages which can replace all occurrences of a pattern, aka Perl’s s///g operation (AWK, Node.js, and SQL).
  • Languages in which you have to replace the characters one by one, either in situ (C), or when the character is read (Befunge-93).




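bash

while read -r DNA
do  # Count the characters; sed strips the padding some wc versions add.
    printf '%s' "$DNA" | wc -c | sed -e 's/ *//'
    # Swap A <-> T and C <-> G.
    printf '%s\n' "$DNA" | tr ATCG TAGC
done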

We read a sequence from the input, then use wc -c to count the characters in the sequence. The result is piped into sed to remove the leading whitespace that wc typically leaves behind.

To get the complementary sequence, we pipe the sequence into tr which replaces each of the characters.


Perl

print y/TAGC/ATCG/, "\n", $_ while <>;

Here, we take the sequence from the input, and apply the y operator (y is just an alternative name for tr). We make use of the fact that y returns the number of characters it replaces — which is exactly the required length. We then print $_, which now contains the complementary sequence.
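To make the counting trick concrete, here is a minimal Python sketch of the same idea. Python's translate does not return a count, so the hypothetical helper complement_and_count (a name made up for this illustration, not part of the original solutions) counts the characters covered by the table itself. For a line holding only A, C, G, and T plus a trailing newline, that count equals the sequence length.

```python
def complement_and_count(line):
    """Return (number of characters translated, translated line)."""
    table = str.maketrans("TAGC", "ATCG")
    # str.maketrans returns a dict keyed by code points, so a membership
    # test tells us whether a character would be translated. The trailing
    # newline is not in the table and is therefore not counted.
    count = sum(1 for ch in line if ord(ch) in table)
    return count, line.translate(table)


count, result = complement_and_count("ACGT\n")
print(count)           # 4
print(result, end="")  # TGCA
```

The newline survives the translation untouched, which is why the Perl one-liner can print $_ without adding a newline of its own.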


Python

import fileinput

# Translation table mapping each base to its complement.
complement = str.maketrans("ATCG", "TAGC")

for line in fileinput.input():
    line = line.rstrip()
    print(len(line))
    print(line.translate(complement))

We iterate over the input, and remove the trailing newline. Then it’s a matter of printing the length (len), and the result of the translation.


Ruby

while line = gets do
     line = line . rstrip
     puts line . length
     puts line . tr('ATCG', 'TAGC')
end

Very similar to the Python solution, but the tr operation is much less verbose.

Global substitutions

Here, we replace groups of identical letters one by one. But we cannot simply replace each A by a T, followed by replacing each T by an A, as that will leave us with just As, and no Ts. Instead, we replace each A by an x, then each T by an A, and finally each x by a T. And we do something similar for the swap of C and G.
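The pitfall is easy to reproduce. The following Python sketch contrasts the naive two-step swap with the placeholder approach; the function names are made up for this illustration, and the placeholders x and y are assumed not to occur in the input.

```python
def naive_swap(seq):
    # After the first replace every A has become a T, so the second
    # replace turns *all* Ts (old and new) into As.
    return seq.replace("A", "T").replace("T", "A")


def complement(seq):
    # Park each A in the placeholder 'x' before touching the Ts,
    # then do the same for G and C with the placeholder 'y'.
    return (seq.replace("A", "x").replace("T", "A").replace("x", "T")
               .replace("G", "y").replace("C", "G").replace("y", "C"))


print(naive_swap("AATT"))   # AAAA
print(complement("AATT"))   # TTAA
print(complement("ACGT"))   # TGCA
```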


AWK

{
    print length;
    gsub ("A", "x");
    gsub ("T", "A");
    gsub ("x", "T");
    gsub ("G", "y");
    gsub ("C", "G");
    gsub ("y", "C");
    print
}

Pretty straightforward. Print the length of the sequence, then do the character replacements as explained above, and print the modified sequence.


Node.js

  require      ("fs")
. readFileSync (0)               // Read all.
. toString     ()                // Turn it into a string.
. split        ("\n")            // Split on newlines.
. filter       (_ => _ . length) // Filter out empty lines.
. map          (_ => {
    process . stdout . write (_ . length + "\n" +
                              _ . replace (/A/g, 'x')
                                . replace (/T/g, 'A')
                                . replace (/x/g, 'T')
                                . replace (/G/g, 'y')
                                . replace (/C/g, 'G')
                                . replace (/y/g, 'C') + "\n")
})

We gobble up the entire input, then split on newlines, and throw away any empty lines. Then, for each line of the input, we print the length, followed by the line after we’ve done all the replacements.


SQL

Here we assume the sequences we need to process are stored in the column sequence of a table DNA.

SELECT  LENGTH (sequence) || char (10) ||
        REPLACE (
            REPLACE (
                REPLACE (
                    REPLACE (
                        REPLACE (
                            REPLACE (sequence, 'A', 'x'),
                        'T', 'A'),
                    'x', 'T'),
                'G', 'y'),
            'C', 'G'),
        'y', 'C')
  FROM  DNA;

Here the replacements are nested, and not chained.
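For anyone who wants to try the nested query without setting up a database, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. It assumes SQLite as the engine (the || operator and char (10) suggest it, but the text above does not say so), and the helper name run_query is made up for this example.

```python
import sqlite3

# The same nested replacements as above; the innermost REPLACE runs first.
QUERY = """
SELECT LENGTH(sequence) || char(10) ||
       REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
           sequence, 'A', 'x'), 'T', 'A'), 'x', 'T'),
                     'G', 'y'), 'C', 'G'), 'y', 'C')
  FROM DNA
"""


def run_query(sequences):
    """Load the sequences into a throwaway table DNA and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE DNA (sequence TEXT)")
    conn.executemany("INSERT INTO DNA (sequence) VALUES (?)",
                     [(seq,) for seq in sequences])
    rows = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return rows


for row in run_query(["ACGT"]):
    print(row)
```

Each result row holds the length and the complement, separated by the newline produced by char (10).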

One by one


C

# include <stdlib.h>
# include <stdio.h>
# include <string.h>

int main (void) {
    char *  line    = NULL;
    size_t  len     = 0;

    while (getline (&line, &len, stdin) != -1) {
        char * line_ptr = line;

        while (*line_ptr) {
            switch (*line_ptr) {
                /* Replace the characters */
                case 'A': *line_ptr = 'T'; break;
                case 'T': *line_ptr = 'A'; break;
                case 'C': *line_ptr = 'G'; break;
                case 'G': *line_ptr = 'C'; break;
                /* We don't want to count the newline */
                default:  *line_ptr =  0;  break;
            }
            line_ptr ++;
        }
        /* strlen returns a size_t, hence %zu */
        printf ("%zu\n%s\n", strlen (line), line);
    }
    free (line);

    return (0);
}

We read lines of input, and for each line, we process the line character by character. Characters A, T, C, and G are replaced by T, A, G, and C. We set the trailing newline to the NUL character, so strlen can do its work properly.
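The same in-place idea can be sketched in Python with a mutable bytearray standing in for the C buffer. This is only an approximation under stated assumptions: Python has no NUL-terminated strings, so where the C code overwrites the newline with NUL, the hypothetical complement_in_place below truncates the buffer at the first byte that is not A, T, C, or G.

```python
# Map each base (as a byte value) to its complement.
COMPLEMENT = {
    ord("A"): ord("T"),
    ord("T"): ord("A"),
    ord("C"): ord("G"),
    ord("G"): ord("C"),
}


def complement_in_place(buf):
    """Complement the bases in buf in place and return the sequence length."""
    for i in range(len(buf)):
        if buf[i] in COMPLEMENT:
            buf[i] = COMPLEMENT[buf[i]]
        else:
            # Plays the role of the C default branch: everything from the
            # first other byte (normally the newline) onward is cut off.
            del buf[i:]
            break
    return len(buf)


line = bytearray(b"ACGT\n")
print(complement_in_place(line))  # 4
print(line.decode())              # TGCA
```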


Befunge-93

          "        "        "        "
 +        T        A        C        G
 1        "        "        "        "
 ^  $ ,   <        <        <        <

Here we have a loop starting in the top left corner. We read a character from the input ~, and then check whether the character equals A: :"A"-!. The : duplicates the top of the stack (that is, the character read by ~), "A" pushes the character A on the stack, and - subtracts the two values, leaving a true value on the stack if the values are not equal, and a false value if they are. ! inverts this.

We then make a decision: #v_. The # causes us to skip the v command. _ sends the program counter to the right if the top of the stack is false, and to the left otherwise. If we go to the left, the v sends us downward. This means that if we did not read an A, we do the same thing to see whether we read a T (and then check for a G, or a C). If we did read an A, we follow the path downward: a T is pushed on the stack ("T"), then we go left (<), print the top of the stack (the T we just pushed), and pop a value from the stack ($; the value we pop is the one we read with the ~ command). Up we go (^), and then we increment a counter (1+), which was initialized to 0 by the very first command. We’re now back at the start of the loop.

Handling a T, C, or G goes in a similar way.

If we have not read one of the four characters, we must have read a newline. We then reach the end game: ,.55+,@. We start with printing the top of the stack (,), which is the newline read. Then we print the top of the stack (as a number) with the . command. This number is the count of characters from the sequence. 55+ puts two fives on the stack, and replaces them with their sum. The following , prints this value (as a character), which means a final newline is printed. @ terminates the program.
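Since Befunge control flow can be hard to follow, here is a rough Python sketch of the logic described above. It is not a translation of the program, just the same steps in order; the function name befunge_style and the stream arguments are made up for this illustration. Like the Befunge program, it handles a single line, printing the complement first and the count afterwards.

```python
import io


def befunge_style(stream, out):
    count = 0                     # the counter initialized by the first command
    while True:
        ch = stream.read(1)       # ~ reads one character
        if ch == "A":             # :"A"-! followed by the #v_ decision
            out.write("T")
        elif ch == "T":
            out.write("A")
        elif ch == "G":
            out.write("C")
        elif ch == "C":
            out.write("G")
        else:
            # End game: anything else is taken to be the newline.
            out.write(ch)         # , prints the character read
            out.write(str(count)) # . prints the counter as a number
            out.write("\n")       # 55+, prints a final newline
            return                # @ terminates the program
        count += 1                # 1+ increments the counter


result = io.StringIO()
befunge_style(io.StringIO("ACGT\n"), result)
print(result.getvalue(), end="")
```

Because each character is written as soon as it is read, the count can only be printed after the sequence, unlike the other solutions, which print the length first.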


Find the complete programs on GitHub.
