Root-me 1 is a nice place to learn about many different aspects of cyber-security.
Within the information security field, there is steganography, the art of hiding information without being noticed. One principle in cryptography, Kerckhoffs's principle, says that when hiding a message, the secret must not be the algorithm but the key. Steganography often plays against this principle: the goal is to hide a message, without being discovered, inside genuine-looking documents such as images, text, PDFs, videos and many others. Even if you notice there is hidden information, it is often obfuscated, making it difficult to read.
To solve the Root-me challenges, sometimes you need to code, sometimes you need to use the right tool to extract the information. In most cases, you can recode the tools yourself. It takes some time, but it is feasible, and it teaches you about different encoding formats (base64, ASCII, binary, …) and a few algorithms.
However, sometimes you can't, even with the best imagination possible. There is one steganography challenge where I was stuck: I tried many things, and none worked. I asked for help, and finally found THE TOOL on Github which solved the challenge in a second.
This tool is called plainsight 2.
Because I had spent a lot of time trying solutions, I wanted to know why I failed to find the right one, and I started looking at the code.
Maybe the code was “efficient”, but it was barely readable:
if ... elif ... elif ...
After some time, I finally understood what it was doing. Because it was so hard to understand, I decided to recode it.
It was a `python2` script. The first thing I did was to convert it to `python3`, by updating the syntax of a few obsolete constructs like `xrange`. Despite these few changes, the results of the two scripts were not compatible. I found the issue, and this one cannot really be patched (I had trouble getting the information necessary to fill the gap).
Therefore, in this post, I will explain how plainsight works and present my re-implementation.
Plainsight combines two concepts: Huffman coding and \(n\)-gram text generation. We will detail these two concepts in the following subsections.
Huffman coding (encoding a message using a Huffman tree) is a method to compress a signal before transmitting it. Sending a text message letter by letter, byte by byte, is inefficient: some letters occur more frequently than others. Huffman coding exploits this distribution asymmetry to reduce the number of bits exchanged.
Computers speak in bits. They only use `0` and `1` to compute, store and exchange things. To encode our 26-letter alphabet, one solution is to encode each letter over 5 bits. With these 5 bits, we can encode at most \(2^5=32\) symbols, which is sufficient for our 26 letters. For instance, we would code `A` as `00000`, `B` as `00001`, and `Z` as `11001`. We would have 6 unused codes like `11011` or `11111`, which is one source of inefficiency.
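These fixed-length codes are easy to enumerate; here is a small illustrative snippet in Python (not part of plainsight):

```python
# Fixed-length 5-bit codes for the 26 letters: A -> 00000, B -> 00001, ..., Z -> 11001
codes = {chr(ord("A") + i): format(i, "05b") for i in range(26)}
print(codes["A"], codes["B"], codes["Z"])  # 00000 00001 11001
```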
In the previous example, we used letters because we can recreate all possible words with them.
However, if we have a limited corpus of words, we can decide to use bits to encode words directly.
If you play Shifumi, or rock, paper, scissors, you can decide to encode `paper` as `00`, `rock` as `01`, and `scissors` as `10`. This drastically reduces the size of the transmitted signal, as a 4-, 5- or 8-letter word (which would require \(8 \times (4/5/8)\) bits in ASCII) is encoded using only two bits.
Finding the symbol set is the first step to compress a signal.
The compression efficiency can be improved by playing on the code length (i.e., the number of bits allocated to one symbol). The main idea is to allocate short codes to frequent symbols and longer ones to infrequent symbols. As the codes no longer all have the same length, the message can only be decoded sequentially. We cannot use multiple threads, one processing the first half and another the second half, because the bit stream might be cut in the middle of a symbol.
If we have multiple codes of different lengths, we would like the reading to be non-ambiguous. This is where the magic of trees occurs: they allow us to represent the codes and to guarantee that no code is the prefix of another. For instance, if the code `01` is allocated, then we have the guarantee that `011` and `010` do not exist.
To find the codes, we need to build a Huffman tree.
Suppose there is a list of symbols \(S = \{s\}\) with respective frequencies \(\{f_s\}\). At the start, we create \(|S|\) unlinked nodes, one for each symbol, with weight \(w_s = f_s\).
To construct the tree, we loop over the following steps until there is only one node left:

1. Take `n_a` and `n_b`, the two available nodes with the lowest weights (`w_a <= w_b`).
2. Create a new node `n_c` with weight `w_a + w_b`, with left child `n_a` and right child `n_b`.
3. Remove `n_a` and `n_b` from the working set of available nodes and add `n_c` to the set.

Now the tree is built, and all the symbols we want to encode are the leaves of the tree.
We built this tree to get the codes. When the tree is built, you create the code of a symbol by moving from the root to its leaf:

- each time you move to the left child, append `0` to the current code;
- each time you move to the right child, append `1` to the current code.

It does not matter whether `0` means left or right, as long as it stays consistent for a given node. When you have found the codes, you need to transmit them to the listening party (otherwise, how could they know that `01` is `paper`?). Then, you can exchange messages efficiently.
Here is a simple `python` implementation. The inputs are the frequencies (just a list of numbers representing word counts or frequencies). The result is a list of codes, where `code[i]` corresponds to the item `i` with frequency `freqs[i]`.
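A minimal sketch of such a function, using Python's `heapq` module (the exact implementation may organize things differently):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Return a list of bit strings: codes[i] is the code of the symbol with frequency freqs[i]."""
    if len(freqs) == 1:
        return ["0"]                       # degenerate case: a single symbol
    tie = count()                          # tie-breaker so the heap never compares trees
    # Heap entries: (weight, tie, tree), where tree is a symbol index (leaf)
    # or a (left, right) pair (internal node).
    heap = [(f, next(tie), i) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w_a, _, n_a = heapq.heappop(heap)  # the two nodes with the lowest weights
        w_b, _, n_b = heapq.heappop(heap)
        heapq.heappush(heap, (w_a + w_b, next(tie), (n_a, n_b)))
    codes = [""] * len(freqs)

    def assign(node, prefix):
        if isinstance(node, int):          # leaf: store the accumulated code
            codes[node] = prefix
        else:                              # internal node: 0 on one side, 1 on the other
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")

    assign(heap[0][2], "")
    return codes

# Top-7 English letter frequencies (E, A, R, I, O, T, N), see the table below
print(huffman_codes([11.2, 8.5, 7.6, 7.5, 7.1, 7.0, 6.7]))
```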
Let's look at English letter frequencies 3. The most frequent letters are listed in this table:
Letter | Frequency |
---|---|
E | 11.2% |
A | 8.5% |
R | 7.6% |
I | 7.5% |
O | 7.1% |
T | 7.0% |
N | 6.7% |
The corresponding Huffman tree is:
You can see that only `E` gets the shortest code. This is because the differences between the frequencies are not very large. Nonetheless, compared to a fixed 3-bit encoding, we save 1 bit about 11% of the time.
The obtained code depends on the alphabet size and the frequencies. If we look at the corresponding trees for the top-6 and top-5, we get:
and
In all these cases, `E` always gets a short code of length 2. Depending on the case, it gets code `01`, `10` or `11`. Therefore, it is important to keep in memory and share the code used.
If you want to know more about Huffman Trees, you can have a look at:
If you perform letter-frequency analysis, you can find in which language a text has been written without looking at any word. However, this is one-way: you cannot generate an English text by just picking letters at random according to the frequencies. The same goes for words: `Cat reform bottle need` is made of English words but is nonsense. There is a need for coherency.
To improve text likelihood, one idea is to look at letters or words over a fixed window. If you take a window of size \(2\), you will learn which letters can follow each other: you will learn that `a-a` almost never occurs while `w-h` is very frequent. The generated words will look like words with many mistakes in them. If you increase the window, they will look better, and at some point you will get real words. The same goes for words, where you will move from nonsense sentences to something realistic.
As for letter frequencies, the idea is to look at tuple frequencies. An \(n\)-gram is a sequence of \(n\) characters or \(n\) words. For instance, `The cat is grey` is equivalent to the list `[The, cat, is, grey]`. There are three possible bi-grams:

- `(The, cat)`
- `(cat, is)`
- `(is, grey)`

two possible tri-grams:

- `(The, cat, is)`
- `(cat, is, grey)`

and one 4-gram.
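Extracting these \(n\)-grams from a list of words is straightforward; for instance (an illustrative helper, not plainsight's code):

```python
def ngrams(words, n):
    """Return the list of n-grams: tuples of n consecutive words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cat is grey".split()
print(ngrams(sentence, 2))  # [('The', 'cat'), ('cat', 'is'), ('is', 'grey')]
print(ngrams(sentence, 3))  # [('The', 'cat', 'is'), ('cat', 'is', 'grey')]
print(ngrams(sentence, 4))  # [('The', 'cat', 'is', 'grey')]
```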
To generate a text that looks natural, it must have properties similar to regular text.
The first quantity to look at is word frequency.
To generate a text, we cannot pick words at random from a dictionary. Words tend to be distributed according to a power law, or Zipf's law: a few words are highly frequent while most of the others almost never occur. If you pick words uniformly at random from a dictionary, most of your picks will be words that almost never appear in real text.
Next, we have grammar. There are rules that govern where words can be placed in a sentence; frequencies are not sufficient. Coding grammar rules explicitly is possible, but time-consuming. To avoid this, we can instead look at \(n\)-grams: if we have a large corpus, we know which words can validly follow a given sequence.
If \(P(w_k | w_0, w_1, ..., w_{k-1}) > 0\), then the word \(w_k\) is a candidate. If \(k\) is large, we often have enough context to select a word that is valid according to grammar rules.
Suppose our corpus is made of:
The cat is grey
The cat is sleeping
The dog is black
`(The, dog)` + `(dog, is)` + `(is, grey)` gives `The dog is grey`, which is a valid sentence in our language.
However, if our corpus is:
The fire is hot
The ice is blue
`The ice is hot` and `The fire is blue` are valid in terms of grammar, but wrong in terms of truth. Using large \(n\)-grams plus a large corpus prevents grammar errors and aberrant sentences.
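To make this concrete, here is one possible way to collect, for each context word, the words that may follow it together with their counts (a bi-gram sketch built on the cat-and-dog corpus; `TOP` and `BOT` are start/end-of-sentence markers introduced in the next example, and plainsight itself handles deeper contexts):

```python
from collections import defaultdict, Counter

TOP, BOT = "<TOP>", "<BOT>"   # markers for the start and the end of a sentence

def build_model(sentences):
    """Map each context word to a Counter of the words that follow it (bi-gram model)."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = [TOP] + sentence.split() + [BOT]
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

corpus = ["The cat is grey", "The cat is sleeping", "The dog is black"]
model = build_model(corpus)
print(model["The"])  # Counter({'cat': 2, 'dog': 1})
print(model["is"])   # Counter({'grey': 1, 'sleeping': 1, 'black': 1})
```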
To construct a text with \(n\)-grams, you exploit the \(n-1\) previous words to find the \(n^{th}\) one.
This is where Huffman trees are used a second time. Given a context `w_1, w_2, ..., w_(n-1)`, you have `k` options `w_(o1), w_(o2), ..., w_(ok)`, each with a particular frequency. Using a Huffman tree built for this particular context, you can associate a binary code to each of these `k` candidate words. The next bits of the message to hide then select which candidate becomes the next word `w_n`.
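Below is a sketch of this encoding loop, reusing the `build_model` and `huffman_codes` helpers from the earlier snippets (candidates are ranked by decreasing frequency with ties broken alphabetically, and the end-of-sentence marker is rendered as a period; the actual plainsight code differs in its details):

```python
def encode(bits, model):
    """Hide a bit string by generating text from the bi-gram model (sketch)."""
    out, context = [], TOP
    while bits:
        # Words that may follow the context, most frequent first, ties broken alphabetically.
        candidates = sorted(model[context].items(), key=lambda kv: (-kv[1], kv[0]))
        if len(candidates) == 1:
            word = candidates[0][0]                 # single choice: no bit is consumed
        else:
            codes = huffman_codes([freq for _, freq in candidates])
            for (cand, _), code in zip(candidates, codes):
                # Zero-padding lets the very last, partial code still match.
                if (bits + "0" * len(code)).startswith(code):
                    word, bits = cand, bits[len(code):]
                    break
        out.append("." if word == BOT else word)    # render the end-of-sentence marker
        context = TOP if word == BOT else word      # reset the context after a sentence
    return " ".join(out)

cover = encode("011010", model)
print(cover)  # "The dog is grey . The dog is black" with this sketch's conventions
```

The generated sentences depend on the 0/1 convention and on how ties are broken when building the codes; the walkthrough below uses a different convention, so it produces a different cover text for the same bits.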
Taking back our cat and dog example, we get the following tree for bi-grams. We added the symbols `TOP` and `BOTTOM` (`BOT`) to mark the start or end of a sentence. This allows resetting the context memory when reaching the end of a sentence.
As an example, we want to encode `011010`. At the start, we know that we must start a sentence, so we follow the path `Root -> TOP`. Here, there is a single choice, because all sentences start with `The`, so no bit can be encoded.
Because we use bi-grams, we have one word of memory, i.e. `C=[The]`. We follow the path `Root -> The`. Here, we have two options: `cat` with code `0` and `dog` with code `1` (because `cat` is more frequent than `dog`). Because our first bit is `0`, we select `cat`. Now, the current message is `M=[The, cat]`, the context is `C=[cat]`, and the leftover bits are `11010`.
We follow the path `Root -> cat`. As with `The`, there is a single choice. We then follow the path `Root -> is`. Here, there are three choices. Because they all have the same frequency, we break the tie with a fixed ordering rule (we get code `00` for `grey`, `01` for `black`, and `1` for `sleeping`). The next leftover bit is `1`, so we select `sleeping`. Now, the current message is `M=[The, cat, is, sleeping]`, the context is `C=[sleeping]`, and the leftover bits are `1010`.
Now, there is a single choice, which ends the sentence, and we start again from `TOP`. Repeating the process, we get:

- `The cat is sleeping. The dog` + leftover `010`
- `The cat is sleeping. The dog is black` + leftover `0`
- `The cat is sleeping. The dog is black. The cat` + no leftover bits

Decoding is as easy as encoding. If I just summarize the context, the bits recovered and the leftover message at each step, we have:
- `C=[TOP]`, `B=""`, `M="The cat is sleeping BOT TOP The dog is black BOT TOP The cat"`
- `C=[The]`, `B=""`, `M="cat is sleeping BOT TOP The dog is black BOT TOP The cat"`
- `C=[cat]`, `B="0"`, `M="is sleeping BOT TOP The dog is black BOT TOP The cat"`
- `C=[is]`, `B="0"`, `M="sleeping BOT TOP The dog is black BOT TOP The cat"`
- `C=[sleeping]`, `B="01"`, `M="BOT TOP The dog is black BOT TOP The cat"`
- `C=[BOT]`, `B="01"`, `M="TOP The dog is black BOT TOP The cat"`
- `C=[TOP]`, `B="01"`, `M="The dog is black BOT TOP The cat"`
- `C=[The]`, `B="01"`, `M="dog is black BOT TOP The cat"`
- `C=[dog]`, `B="011"`, `M="is black BOT TOP The cat"`
- `C=[is]`, `B="011"`, `M="black BOT TOP The cat"`
- `C=[black]`, `B="01101"`, `M="BOT TOP The cat"`
- `C=[BOT]`, `B="01101"`, `M="TOP The cat"`
- `C=[TOP]`, `B="01101"`, `M="The cat"`
- `C=[The]`, `B="01101"`, `M="cat"`
- `C=[cat]`, `B="011010"`, `M=""`
And we recovered our initial message.
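A symmetric decoding loop replays every word transition of the cover text and collects the bits it recognizes (again a sketch reusing the helpers from the previous snippets):

```python
def decode(text, model):
    """Recover the hidden bits from a cover text generated by encode (sketch)."""
    bits, context = "", TOP
    for token in text.split():
        word = BOT if token == "." else token       # map the period back to the marker
        candidates = sorted(model[context].items(), key=lambda kv: (-kv[1], kv[0]))
        if len(candidates) > 1:                     # only ambiguous transitions carry bits
            codes = huffman_codes([freq for _, freq in candidates])
            bits += dict(zip([w for w, _ in candidates], codes))[word]
        context = TOP if word == BOT else word
    return bits

print(decode(cover, model))  # 011010 (trailing padding zeros may appear for other inputs)
```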
A previous implementation of plainsight can be found on Github.
At the start, we just wanted to modify a few operators so the `python2` script could be used with `python3`. Changing `xrange` into `range` was quite easy. However, we got a problem:
In both implementations, the \(n\)-gram trees are coded using dictionaries. However, `python2` and `python3` dictionaries are not implemented the same way, especially concerning the ordering of keys:

- in `python2`, dictionary keys are ordered according to a hash function;
- in `python3`, dictionary keys are ordered by insertion, i.e., if `dog` is added before `cat`, then `dog` is the first key and `cat` the second.

Look at these two examples:
With `python2`:

```python
dic = {"dog": 1, "cat": 3, "frog": 5}
print(dic)
> {'dog': 1, 'frog': 5, 'cat': 3}
```
With `python3`:

```python
dic = {"dog": 1, "cat": 3, "frog": 5}
print(dic)
> {'dog': 1, 'cat': 3, 'frog': 5}
```
This creates the incompatibility.
For compatibility, our implementation sorts items deterministically:
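For instance, something along these lines (a sketch of the idea rather than the exact code from the repository):

```python
# Instead of relying on the interpreter's dictionary ordering, rank the children
# of a node explicitly: by decreasing count, then alphabetically.
def ordered_children(children):
    """children: dict mapping each candidate word to its count in the current context."""
    return sorted(children.items(), key=lambda item: (-item[1], item[0]))

print(ordered_children({"dog": 1, "cat": 3, "frog": 5}))
# [('frog', 5), ('cat', 3), ('dog', 1)] -- identical under python2 and python3
```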
`plainsight I`, the original version, doesn't use Huffman trees. Instead, when a choice can be made to complete an \(n\)-gram, it encodes the possible continuations with a fixed-length code (which depends on the \(n-1\) previous words). If \(W\) is the set of possible words following the current \((n-1)\)-gram, with \(|W| = s > 1\), then there exists a \(k_s\) such that \(2^{k_s} \leq s < 2^{k_s+1}\). The \(2^{k_s}\) most frequent words get an associated binary code.
As an example, if we have \(7\) candidates, then \(k_s = 2\) and only \(4\) of these \(7\) candidates can be used to encode bits. The most frequent one gets the code `00`, the next `01`, the third `10` and the last `11`.
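An illustrative sketch of this fixed-length scheme (`fixed_length_codes` is a hypothetical helper, not the original code):

```python
import math

def fixed_length_codes(candidates):
    """candidates: possible next words, most frequent first.
    Keep only the 2**k most frequent ones and give each a k-bit code."""
    k = int(math.log2(len(candidates)))      # 2**k <= len(candidates) < 2**(k+1)
    usable = candidates[:2 ** k]
    return {word: format(i, "0{}b".format(k)) for i, word in enumerate(usable)}

print(fixed_length_codes(["w1", "w2", "w3", "w4", "w5", "w6", "w7"]))
# {'w1': '00', 'w2': '01', 'w3': '10', 'w4': '11'} -- the 3 least frequent words are unused
```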
Because some words are not selected for coding, the richness of the generated text will be smaller. This is not a big problem as the code will always run and always be able to encode a message.
Even if only \(2^{k_s}\) of the \(s\) items carry information, it does not mean that the \(s - 2^{k_s}\) remaining items are discarded: the intermediate nodes of the tree (neither root nor leaf) are still possible paths.
Look at the following figure, which represents the 3-grams of a synthetic corpus:
This is the state when we have added all the \(n\)-grams to the tree. The nodes are sorted from left to right by decreasing frequency.
If we look at node `G`, it can be accessed using the path `Root -> C -> B -> G`. After removing `C` from the context, the only possible continuation is `Root -> B -> G -> x`.
If we prune the tree, we get:
where the path `Root -> B -> G` does not exist anymore, forcing the script to abort the current sentence. Therefore, you would get broken sentences with the first version. In our implementation, this does not occur anymore, as all nodes are taken into account.
If you select a small tree depth (`2` or `3`), the text won't look very natural. On the other hand, if the depth is too large, the \(n\)-gram possibilities are less numerous. As a consequence, many transitions become deterministic (e.g., `This + is => a`), so fewer bits can be hidden per word, which leads in the end to very large messages. For instance, if you look at the tree of depth \(3\):
you can see that there are two final forks: `(TOP, The) -> (cat|dog)` and `(cat, is) -> (grey|sleeping)`. The bits of the message to hide can only be encoded using these two transitions.
There are three possible sentences:

- `The cat is grey`, which encodes `00`
- `The cat is sleeping`, which encodes `01`
- `The dog is black`, which encodes `1`
On average, we consume \(1.6\) bits per sentence, which is less than with the tree of depth `2` from the previous example, where each sentence could encode two to three bits. If you have one byte to hide (one ASCII character), you need at least 4 sentences of 4 words each.
If you use very large \(n\)-grams, the beginning of the sentence will often condition all the following words. You will always be able to encode your message, but it will take a lot of space.
The \(n\)-gram size must be selected according to the corpus size (and corpus diversity).
Over the Sherlock Holmes corpus with `depth=5`, we found:
ah jack she said i have just been in to see if i can be of any assistance to our new neighbors why do you look at me like that jack you are not angry with me
ah jack she said i have just been in to see if i could catch a glimpse of the strange face which had looked out at me on the day before as i stood there imagine my surprise mr holmes when the door suddenly opened and my wife walked out
You can see how similar the two outputs are at the beginning.
Running it with `depth=3`:
ah jack she cried
ah jack she cried with a startled face
ah jack she cried trust me jack she said i at last however there came an overwhelming desire to see her slinking into her own husband spoke to her however before i ran into the other side of the gnu fdl 3
ah jack she cried i swear that this mystery comes to an end from now you are surprised said she and i had to choose between you and in my direction and her footsteps coming up the lane but i did not even knock when i left her in peace having no success she goes again next morning and her duty
With a smaller depth, `said` is no longer used. In the corpus, `jack she cried` might occur more frequently than `jack she said`, but a sentence starting with `ah` is never followed by `cried` later in the corpus.
In both versions, bits can be hidden whenever there are at least two choices in the current context node of the tree.
In the first version, only \(2^k\) of the \(n\) children nodes can be selected. This reduces the diversity of the messages that can be constructed. Nevertheless, assigning the codes is much faster in this version, as we only need to rank nodes by frequency.
In the second version, any of the children can be selected. Infrequent words encode more bits than frequent ones, but it is unlikely that a long code word matches exactly the current bit string to encode. One disadvantage of this method is its sensitivity to the corpus: if some sentences are discarded from the corpus, the vocabulary set may change and the code words may be reassigned.
The efficiency of a steganographic system corresponds to the number of bits hidden versus the size of the cover message. Because we use whole words to encode bits, the encoding power is very low. So this kind of system is quite inefficient, but fun after all.
The second aspect of steganographic evaluation is the detection threshold. Compared to cryptographic systems, `plainsight` is a kind of secret-key steganographic system: without the corpus and the depth, you cannot recover the message.
In this implementation, the word transitions encode the bits. It is possible to work at different levels:

- Instead of text, you can work on graphs: according to a ranking/scoring function, you order the neighborhood of your current node / current path, and jump to a valid neighbor. With this approach, you can generate other kinds of genuine-looking objects.
- You can also work on a fixed set of elements, like paper, rock, scissors, or create images, time series, …

There are many other ways to play with this kind of encoding method.
In this post, we explained how `plainsight` works to generate genuine-looking texts hiding secret messages. When reading and testing the initial code, I discovered that the implementation was language-dependent, preventing compatibility with other implementations. This compatibility problem was linked to the way keys are ordered within dictionaries.
In the new version, the ordering is deterministic and no longer depends on the language implementation, which makes us independent from a single tool. We also simplified the code drastically, keeping roughly the same number of lines while adding documentation and blank lines. You can find the code on Github 6.
>> You can subscribe to my mailing list here for a monthly update. <<