suicune (re305)

Apparently it’s not efficient enough..

Author: david942j

The challenge binary was a program written in the Crystal programming language, which is a compiled language that has syntax very similar to Ruby’s. If familiar with Go or Rust binaries, one would expect the resulting binary to be full of checks and unknown data structures.

Apart from that, an output file is given, containing the following string.

04dd5a70faea88b76e4733d0fa346b086e2c0efd7d2815e3b6ca118ab945719970642b2929b18a71b28d87855796e344d8

program behaviour

 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune
Usage: ./suicune <flag> <key>

 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune aaaaa 1
0205d82dba

The program takes 2 arguments, namely the flag and a key, and outputs a hex-encoded string. So it looks like the program performs some encryption onto our flag.

Playing around with the program a little more, it can be deduced that the provided key generates a key stream that will be xored with the flag to create the output.

 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune aaaaa 1
0205d82dba
 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune aaaab 1
0205d82db9
 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune aaaac 1
0205d82db8

 ➤ ~/ctfs/hitcon19/suicune ➤ python
Python 2.7.15rc1 (default, Nov 12 2018, 14:31:15)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> hex(0xba ^ ord('a'))
'0xdb'
>>> hex(0xb9 ^ ord('b'))
'0xdb'
>>> hex(0xb8 ^ ord('c'))
'0xdb'

 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune aaaaa 1
0205d82dba
 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune aaaba 1
0205d82eba
 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune aaaca 1
0205d82fba

 ➤ ~/ctfs/hitcon19/suicune ➤ python
Python 2.7.15rc1 (default, Nov 12 2018, 14:31:15)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> hex(0x2d ^ ord('a'))
'0x4c'
>>> hex(0x2e ^ ord('b'))
'0x4c'
>>> hex(0x2f ^ ord('c'))
'0x4c'

And here, it can be seen that a different key will result in an output that is very different.

 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune aaaa 1
a546f6b0
 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune aaaa 2
d96895c0

However, here it is found that the length of the flag is also a factor in generating the key stream.

 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune aaaa 1
a546f6b0
 ➤ ~/ctfs/hitcon19/suicune ➤ ./suicune aaaaa 1
0205d82dba

Now it is rather clear that our task is to find the key that was used to encrypt the flag to give the output string (given in a file as mentioned earlier) provided as part of the challenge. Since the length of the output string is 49 (after hex decoding), and we know that the flag starts with "hitcon", we just need to find a key that encrypts "hitcon..." (49 chars in total) to 04dd5a70faea...".

Sounds quite straightforward, I can just brute-force the key until I find the one used to encrypt the flag. However, as mentioned in the challenge description, it is not efficient enough and the program seems to take forever to run when the length of the flag is 49.

reverse engineering

Since the program is not efficient enough, I need to reverse the program to know the algorithm used to encrypt the flag. Then, I can code a more efficient version of it, and can brute-force the key to get the flag.

strings

Before starting to reverse, I ran strings on the binary and found some interesting words

 ➤ ~/ctfs/hitcon19/suicune ➤ strings ./suicune
...
Crystal::System::Random::urandom
Crystal::Signal::child_handler
Crystal::Hasher::seed
...

Doing a Google search, I found a reference to the Crystal programming language. Not particularly useful but helpful to have a better understanding of what I am working with.

decompilation

Opening the binary in Ghidra, I found a __crystal_main function which I believe to contain the main code responsible for the program. And this function has more than 1500 lines of pseudocode…

But for compiled languages like this (Rust, Go, etc), most of the code is normally just checking for exceptions, thread management, garbage collection, basically things that are irrelevant to the program. So the job here is to identify parts of code that matter.

At this point, my idea of the program looks like this

void main() {
    input_flag = argv[0];
    input_key = int(argv[1]);

    ...
}

i.e. no idea what it is doing.

Before attempting to reverse the program, I first identified the variables that contain my inputs and renamed them, in order to easily recognize while reversing later.

printing output

First thing I did was to identify where the output was printed, because any code after that would definitely be useless. Because the binary was not stripped, I was able to easily find the line that was responsible for printing the output.

Now the program looks like this

void main() {
    input_flag = argv[0];
    input_key = int(argv[1]);
    
    ...

    print(output);
}

hex encoding

Right above the line of printing the output, there is a loop that contains around 1000 lines. I placed breakpoints at various points around and within that loop, while monitoring the memory.

Apparently before the loop, the raw form of the output (before hex-encoding it) is already present in memory. Turns out this loop is responsible for hex-encoding the output.

void main() {
    input_flag = argv[0];
    input_key = int(argv[1]);
    
    ...

    output = hex_encode(encrypted);
    print(output);
}

After realizing this, I am down to less than 500 lines of decompilation output.

encryption routine

Above the hex-encoding code is another loop, which runs for 16 times.

void main() {
    input_flag = argv[0];
    input_key = int(argv[1]);
    
    ...

    for (int i = 0; i < 16; ++i) {
        ...
    }

    output = hex_encode(encrypted);
    print(encrypted);
}

At the bottom of the loop, there was a loop that performs byte-by-byte xor of two strings, then another loop that reverses the result and stores it in another string.

Once again, setting a breakpoint there and monitor the program memory, I found out that my input flag was part of the xor routine, and after being xored, the string was reversed.

void main() {
    input_flag = argv[0];
    input_key = int(argv[1]);
    
    ...

    for (int i = 0; i < 16; ++i) {
        ...

        input_flag = xor(input_flag, key_stream);
        input_flag = reverse(input_flag);
    }

    output = hex_encode(encrypted);
    print(encrypted);
}

key stream generation

Continuing to move upwards in the code, again setting breakpoints at different places, I found a loop that was the “inefficient part”, as the program freezes once it reaches that part. Expecting it to be complicated, I decided to skip this part first.

void main() {
    input_flag = argv[0];
    input_key = int(argv[1]);
    
    ...

    for (int i = 0; i < 16; ++i) {
        ...

        while(...) {

        }

        input_flag = xor(input_flag, key_stream);
        input_flag = reverse(input_flag);
    }

    output = hex_encode(encrypted);
    print(encrypted);
}

Moving to the part before it, it is actually quite straightforward. Something like a random number generator is seeded based on the input key, then used to generate the key stream. After generating the key stream, it is truncated to match the length of the input flag.

void seed_rng(key) {
    rng = rng * 0x5851f42d4c957f2d + 0x5851f42d4c957f2d + 1;
}

void rng_next() {
    tmp = rng;
    rng = rng * 0x5851f42d4c957f2d + 1;
    c1 = tmp >> 0x3b;
    u1 = (tmp >> 0x12 ^ tmp) >> 0x1b;
    return u1 >> c1 | u1 << (0x20 - c1);
}

void main() {
    input_flag = argv[0];
    key = int(argv[1]);

    seed_rng(key);

    for (int i = 0; i < 16; ++i) {
        key_stream = [0, 1, 2, ..., 0xff];
        for (int i = 255; i > 0; --i) {
            index = rng_next();
            swap(key_stream[i], key_stream[index % (i + 1)]);
        }
        key_stream = keystream[0:len(input_flag)];

        // not sure what this does
        something = rng_next() + rng_next() << 32;

        while(...) {

        }

        input_flag = xor(input_flag, key_stream);
        input_flag = reverse(input_flag);
    }

    output = hex_encode(encrypted);
    print(encrypted);
}

There was another random number generated after generating the key stream but I was not sure what was it used for.

the final part

Now the only thing left is to know what the inefficient part of code is doing.

I was not really keen on reversing the code as there were a lot of things going on. So I decided to try to deduce the algorithm by looking at the changes to the key stream. After collecting enough data, I realized that it was sorting the key stream in descending order.

void seed_rng(key) {
    rng = rng * 0x5851f42d4c957f2d + 0x5851f42d4c957f2d + 1;
}

void rng_next() {
    tmp = rng;
    rng = rng * 0x5851f42d4c957f2d + 1;
    c1 = tmp >> 0x3b;
    u1 = (tmp >> 0x12 ^ tmp) >> 0x1b;
    return u1 >> c1 | u1 << (0x20 - c1);
}

void main() {
    input_flag = argv[0];
    key = int(argv[1]);

    seed_rng(key);

    for (int i = 0; i < 16; ++i) {
        key_stream = [0, 1, 2, ..., 0xff];
        for (int i = 255; i > 0; --i) {
            index = rng_next();
            swap(key_stream[i], key_stream[index % (i + 1)]);
        }
        key_stream = keystream[0:len(input_flag)];

        sort_descending(key_stream);

        input_flag = xor(input_flag, key_stream);
        input_flag = reverse(input_flag);
    }
    output = hex_encode(encrypted);
    print(encrypted);
}

Looks like I’m done. I quickly wrote a C++ program that mimics the code above and brute force the key to find the flag.

nope

However, after running the program for quite long, there was still no sign of the flag. Something must have went wrong. I decided to properly look into the part earlier, to see what actually happens to the key stream.

Setting breakpoints at various part of the code that involves strings, I found out that the key stream is changed to its next lexicographical permutation in every iteration of the loop. For the example of bacd,

bacd -> badc -> bcad -> bcda -> ...

Recall the random number mentioned earlier, which I did not know the purpose of. That was the number of times to perform the permutation. Either that or the program will stop changing the key stream if it reaches its maximum, i.e. dbca for the example above, which was why I mistaken it to be sorting the key stream in descending order.

void seed_rng(key) {
    rng = rng * 0x5851f42d4c957f2d + 0x5851f42d4c957f2d + 1;
}

void rng_next() {
    tmp = rng;
    rng = rng * 0x5851f42d4c957f2d + 1;
    c1 = tmp >> 0x3b;
    u1 = (tmp >> 0x12 ^ tmp) >> 0x1b;
    return u1 >> c1 | u1 << (0x20 - c1);
}

void main() {
    input_flag = argv[0];
    key = int(argv[1]);

    seed_rng(key);

    for (int i = 0; i < 16; ++i) {
        key_stream = [0, 1, 2, ..., 0xff];
        for (int i = 255; i > 0; --i) {
            index = rng_next();
            swap(key_stream[i], key_stream[index % (i + 1)]);
        }
        key_stream = keystream[0:len(input_flag)];

        something = rng_next() + rng_next() << 32;
        key_stream = nth_permutation(key_stream, something);

        input_flag = xor(input_flag, key_stream);
        input_flag = reverse(input_flag);
    }
    output = hex_encode(encrypted);
    print(encrypted);
}

solving

Because the random number generated to determine the number of permutations is a 64-bit number, this explains why it takes forever to run the program when the input flag is long.

There is a good explanation on how to do the permutations under linear time complexity with respect to the length of the input flag. Some minor changes need to be made to the algorithm suggested in the article because it was assumed that the string starts in sorted order, which is not the case in this challenge.

Apart from that there is nothing interesting to be discussed regarding the solution. I just needed to reimplement the encryption routine and brute force the key to find the flag.

There’s a solution by the author written in Crystal, and mine attached below.

#include <algorithm>
#include <vector>

typedef unsigned long long ull;
typedef unsigned int uint;
typedef unsigned char uchar;

ull mult = 0x5851f42d4c957f2d;
ull rng;

char* START = "hitc";
char target[] = "\x04\xdd\x5a\x70\xfa\xea\x88\xb7\x6e\x47\x33\xd0\xfa\x34\x6b\x08\x6e\x2c\x0e\xfd\x7d\x28\x15\xe3\xb6\xca\x11\x8a\xb9\x45\x71\x99\x70\x64\x2b\x29\x29\xb1\x8a\x71\xb2\x8d\x87\x85\x57\x96\xe3\x44\xd8";
int SIZE = strlen(target);

// things to generate xor_key
std::vector<uchar> xor_key;
std::vector<uint> fact(SIZE);
bool is_max = false;

uint rng_next() {
    ull tmp = rng;
    rng = rng * mult + 1;
    uchar c1 = tmp >> 0x3b;
    uint u1 = (tmp >> 0x12 ^ tmp) >> 0x1b;
    return u1 >> c1 | u1 << (0x20 - c1);
}

void gen_fact(ull n, ull i) {
    if (i == SIZE) {
        if (n > 0)
            is_max = true;
        return;
    };
    fact[SIZE - i - 1] = n % (i + 1);
    gen_fact(n / (i + 1), i + 1);
}

void nth_perm(ull n) {
    gen_fact(n, 0);

    std::vector<uchar> key_copy1(xor_key.begin(), xor_key.begin() + SIZE);
    std::sort(key_copy1.begin(), key_copy1.end());
    std::vector<uchar> key_copy2(key_copy1);
    std::vector<uint> offset;

    for (int i = 0; i < SIZE; ++i) {
        uint index = std::distance(key_copy1.begin(), std::find(key_copy1.begin(), key_copy1.end(), xor_key[i]));
        offset.push_back(index);
        key_copy1.erase(key_copy1.begin() + index);
    }

    for (int i = 0; i < SIZE; ++i) {
        uint idx = SIZE - i - 1;
        fact[idx] += offset[idx];
        
        if (idx > 0) {
            fact[idx - 1] += fact[idx] / (i + 1);
            fact[idx] %= (i + 1);
        }
        else {
            if (fact[idx] >= (i + 1)) {
                is_max = true;
            }
        }
    }

    if (is_max) {
        std::sort(xor_key.begin(), xor_key.end(), std::greater<uchar>());
        is_max = false;
    }
    else {
        for (int i = 0; i < SIZE; ++i) {
            xor_key[i] = key_copy2[fact[i]];
            key_copy2.erase(key_copy2.begin() + fact[i]);
        }
    }
}

int main() {
    for (ull key_val = 0; key_val < 65536; ++key_val) {
        rng = key_val * mult + (mult + 1);
        std::vector<uchar> flag(target, target + SIZE);

        for (int ii = 0; ii < 16; ++ii) {
            std::vector<uchar> keygen(256);
            for (uint i = 0; i < 256; ++i) {
                keygen[i] = i;
            }

            for (int i = 255; i > 0; --i) {
                uint index = rng_next();
                std::swap(keygen[i], keygen[index % (i + 1)]);
            }

            xor_key = std::vector<uchar>(keygen.begin(), keygen.begin() + SIZE);
            ull n = (ull) rng_next() + ((ull) rng_next() << 32);

            if (n != 0) {
                nth_perm(n);
            }

            for (int i = 0; i < SIZE; ++i) {
                flag[i] ^= xor_key[i];
            }
            
            std::reverse(flag.begin(), flag.end());
        }

        if (memcmp(flag.data(), START, strlen(START)) == 0) {
            puts("---------------");
            printf("---  %d  ---\n", key_val);
            puts("---------------");
            puts((char *) flag.data());
            break;
        }
    }

}

random notes

One annoying thing about this binary is many variables are reused for different scenarios, making my variable names useless at different parts of the code 😩.

gef commands that I seldom use, trace-run and memory are quite helpful in understanding the binary better.

gdb script to brute force

At first I didn’t want to reimplement the whole program in C++ so decided to write a GDB script to brute force it. Because the only slow part is the nth-permutation part, I could dump the original key stream from memory and rearrange it myself. Unfortunately it was too slow…

# gdb -q -ex "source dump_keys.py" -ex quit ./suicune

def enable_redirect_output():
    """Redirect all GDB output to `to_file` parameter. By default, `to_file` redirects to `/dev/null`."""
    gdb.execute("set logging overwrite")
    gdb.execute("set logging file {:s}".format("dumped.txt"))
    gdb.execute("set logging redirect on")
    gdb.execute("set logging on")
    return

def disable_redirect_output():
    """Disable the output redirection, if any."""
    gdb.execute("set logging off")
    gdb.execute("set logging redirect off")
    return

gdb.execute('gef config context.enable 0')
gdb.execute('pie break *0xcb60')

target = [0x04, 0xdd, 0x5a, 0x70, 0xfa, 0xea]
# target = [0x0e, 0x0c]

for key in range(1, 2):
    flag = b"dani"
    size = 4
    gdb.execute('pie run {:s} {:d}'.format(flag.decode(), key))

    s = flag.ljust(size)

    print(key)

    for _ in range(16):
        addr = int(gdb.execute('p *(void**)($r14+0x10)', to_string=True).split()[-1], 16)
        keys = list(read_memory(addr, size))
        keys = sorted(keys)[::-1]

        print([hex(k) for k in keys])
        
        s = [s[i] ^ keys[i] for i in range(len(s))][::-1]

        print([hex(c) for c in s])
        gdb.execute('c')

    print(key)
    print([hex(c) for c in s])

    if target == s[:len(flag)]:
        print ("!!!!!!!!!!!!")
        print ("!!!!!!!!!!!!")
        print ("!!!!!!!!!!!!")
        print ("!!!!!!!!!!!!")
        print ("FOUND KEY")
        print(key)
        print ("!!!!!!!!!!!!")
        print ("!!!!!!!!!!!!")
        print ("!!!!!!!!!!!!")
        print ("!!!!!!!!!!!!")
        break

gdb commands to help while reversing

At the start of reversing, as mentioned earlier, I wanted to identify the parts of the code that matter. One way I tried was to automate stepping through the code until certain bytes such as the output is in memory. Honestly this wasn’t really useful in the end…

Another thing I tried to do was automate stepping through the instructions and print out the addresses to see which loop was taking forever to complete. This was kind of helpful.

def clear_printed():
    global printed
    printed = ""

def new_print(msg):
    global printed
    printed += msg + "\n"
    # print(msg)

def get_printed():
    global printed
    return printed

class StepTilFind(GenericCommand):
    """Step until find a pattern"""

    _cmdline_ = "stf"
    _syntax_  = "{:s} PATTERN".format(_cmdline_)

    @only_if_gdb_running
    def do_invoke(self, argv):
        argc = len(argv)
        if argc != 1:
            info(str(argc))
            self.usage()
            return

        pattern = argv[0]
        old_print = globals()['gef_print']
        globals()['gef_print'] = new_print

        while gef_current_instruction(current_arch.pc).mnemonic != "ret":
            clear_printed()
            gdb.execute("grep {:s}".format(pattern))
            if "In" in get_printed():
                break
            gdb.execute("ni")

        globals()['gef_print'] = old_print

        return

class TraceTilRet(GenericCommand):
    """Trace until return"""

    _cmdline_ = "ttr"
    _syntax_  = "{:s}".format(_cmdline_)

    @only_if_gdb_running
    def do_invoke(self, argv):
        hide_context()

        out_file = open("trace.txt", "w")
        base = int(gdb.execute("p $_pie()", to_string=True).split()[2], 16)

        while gef_current_instruction(current_arch.pc).mnemonic != "ret":
            gdb.execute("ni")

            remark = ""
            if gef_current_instruction(current_arch.pc).mnemonic[0] == "j":
                remark = "branch"

            out_file.write("{:x} ({:s})\n".format(current_arch.pc - base, remark))

        out_file.close()

        unhide_context()
        return

if __name__ == "__main__":
    register_external_command(StepTilFind())
    register_external_command(TraceTilRet())

I guess these scripts I made helped me better understand the binary in some way, otherwise they were a waste of time.