
Plainsight II

Steganography is the art of hiding a message without being noticed. To better understand one of these methods, I reimplemented in `python3` a tool initially written in `python2`. However, the outputs of the two versions differed. In this article, I explain how the message-hiding mechanism works, and why the two scripts led to different results.



Introduction

Root-me 1 is a nice place to learn about different aspects of cyber-security.

Within the information security field, there is steganography, the art of hiding information without being noticed. One principle in cryptography, Kerckhoffs’ principle, says that the secret protecting a message must not be the algorithm; it must be the key. Steganography often plays against this principle. The goal is to hide a message, without it being discovered, inside genuine-looking documents like images, text, PDFs, videos and many others. Even if you notice there is hidden information, it is often obfuscated, making it difficult to read.

To solve the Root-me challenges, sometimes you need to code, sometimes you need to use the right tool to get the information. In most cases, you can recode the tools yourself. It takes some time, but it is feasible. This teaches you about the different encoding formats (base64, ASCII, binary, …) and a few algorithms.

However, sometimes you can’t, even with the best imagination possible. There was one steganography challenge where I was stuck: I tried many things, and none worked. I asked for help, and finally found THE TOOL on Github which solved the challenge in a second.

This tool is called plainsight 2. Because I had spent a lot of time trying solutions, I wanted to know why I failed to find one myself, and started looking at the code.

Maybe the code was “efficient”, but it was barely readable:

  • one function to do many different things (encoding and decoding) with many if ... elif ... elif ...
  • pointers everywhere
  • ambiguous variable names
  • bit counters whose usefulness is quite limited
  • redundancy.

After some time, I finally understood what it was doing. Because it was so difficult to understand how it worked, I decided to recode it.

It was a python2 script. The first thing I did was to convert it to python3, by changing the syntax of a few outdated constructs like xrange. Despite these few changes, the two scripts’ results were not compatible. I found the issue, and it cannot be patched (I could not get the information necessary to fill the gap).

Therefore, in this post, I will:

  1. Detail how plainsight works
  2. Explain what makes the two versions incompatible
  3. Describe the improvements that have been made and those that could be made.


How Plainsight Works

Plainsight combines two concepts together:

  • Huffman coding;
  • \(n\)-grams / Markov chain text generation method.

We will detail these two concepts in the following subsections.

A Short Introduction to Huffman Trees

Huffman coding (encoding a message using Huffman trees) is a method to compress a signal before transmitting it. Sending a text message letter by letter, byte by byte, is inefficient: some letters occur more frequently than others. Huffman coding exploits this distribution asymmetry to reduce the number of bits exchanged.

Knowing the Alphabet

Computers speak in bits: they only use 0 and 1 to compute and store things. To encode our 26-letter alphabet, one solution is to encode each letter over 5 bits. With these 5 bits, we can encode at most \(2^5=32\) symbols, which is sufficient for our 26 letters. For instance, we would code A as 00000, B as 00001, and Z as 11001. We would have 6 unused codes like 11011 or 11111, which is one source of inefficiency.
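As a quick illustration, here is a minimal python sketch of such a fixed-width encoding (the helper to_5bit is written for this post, it is not something plainsight uses):

def to_5bit(text):
    """Encode uppercase letters over 5 bits each (fixed-width code)."""
    return "".join(format(ord(c) - ord("A"), "05b") for c in text.upper())

print(to_5bit("az"))  # 0000011001 -> A is 00000, Z is 11001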

In the previous example, we used letters because we can recreate all possible words with them. However, if we have a limited corpus of words, we can decide to use bits to encode words directly. If you play Shifumi, or rock, paper, scissors, you can decide to encode paper as 00, rock as 01, and scissors as 10. This drastically reduces the size of the transmitted signal, as a 4-, 5- or 8-letter word (which would require \(8 \times 4\), \(8 \times 5\) or \(8 \times 8\) bits in ASCII) is encoded using only two bits.

Knowing the Frequencies

Finding the symbol set is the first step to compress a signal.

The compression efficiency can be improved by playing on the code length (i.e., the number of bits allocated to one symbol). The main idea is to allocate short codes to frequent symbols and longer ones to infrequent ones. As the codes no longer all have the same length, the message can only be decoded sequentially. We cannot use multiple threads, one processing the first half and another the second half, because the bit stream might be cut in the middle of a symbol.

If we have multiple codes of different lengths, we would like the reading to be non-ambiguous. This is where the magic of trees occurs: they allow us to represent the codes and to guarantee that no code is the prefix of another. For instance, if the code 01 is allocated, then we have the guarantee that 011 and 010 do not exist.

Algorithm

To find the codes, we need to build a Huffman tree.

Building the Tree

Suppose there is a list of symbols \(S = \{s\}\) with respective frequencies \(\{f_s\}\). At the start, we create \(|S|\) unlinked nodes, one for each symbol, with weight \(w_s = f_s\).

To construct the tree, we loop over these instructions’ set until there is one node left:

  1. Select n_a and n_b, two nodes with the lowest weight (w_a < w_b)
  2. Create their parent node n_c with weight w_a + w_b, with left child n_a and right child n_b.
  3. Remove n_a and n_b from the working set of available nodes and add n_c to the set.

Now the tree is built, and all the symbols we want to encode are the leaves of the tree.

Getting the Codes

We need to build the tree to get the codes. Once the tree is built, you create the code of a symbol by moving from the root to its leaf:

  • If you go left, append 0 to the current code
  • If you go right, append 1 to the current code

It does not matter if 0 is left or right, as long as it stays consistent for a given node.

When you have found the codes, you need to transmit them to the listening party (otherwise, how could they know that 01 is paper?). Then, you can exchange messages efficiently.

Python Code

Here is a simple python implementation. The input is a list of frequencies (just a list of numbers representing word counts or frequencies). The result is a list of codes, where code[i] corresponds to the item i with frequency freqs[i].

def get_huffman_codes(freqs):
    """
    Compute the Huffman codes corresponding to the given word frequencies.

    :param freqs: a list of frequencies (not necessarily ordered)
    :return: the list of binary codes associated to each frequency
    """
    lst = sorted(map(lambda x: (x[1], [x[0]]), enumerate(freqs)))

    # Initialize code words
    dic_code = dict([(ID, '') for ID, _ in enumerate(lst)])

    while len(lst) > 1:
        # The two nodes with the lowest frequencies; the lowest one gets a 1
        m1, m0 = lst[0], lst[1]

        # Iterate over all the IDs and prepend a 1 or a 0
        for c in m1[1]:
            dic_code[c] = "1" + dic_code[c]
        for c in m0[1]:
            dic_code[c] = "0" + dic_code[c]

        f_tot  = m1[0] + m0[0]
        ID_tot = m1[1] + m0[1]  # List concatenation

        # Re-sorting the whole list at each step is not optimal for very
        # large word lists, but it does the job for small inputs.
        lst = sorted([(f_tot, ID_tot)] + lst[2:])

    return list(dic_code.values())

Example

Let’s look at English letter frequencies 3.

The most frequent letters are listed in this table:

Letter Freq
E 11.2 %
A 8.5 %
R 7.6 %
I 7.5 %
O 7.1 %
T 7.0 %
N 6.7 %
The corresponding Huffman tree is:

You can see that only E gets the shortest code. This is because the differences between the frequencies are not very large. Nonetheless, compared to a fixed 3-bit encoding, we save 1 bit 11% of the time.
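We can check this with the get_huffman_codes function defined above. With the 0/1 convention used in that snippet (the lower-frequency branch gets a 1), a run gives:

freqs = [11.2, 8.5, 7.6, 7.5, 7.1, 7.0, 6.7]   # E, A, R, I, O, T, N
codes = get_huffman_codes(freqs)
print(dict(zip("EARIOTN", codes)))
# {'E': '11', 'A': '000', 'R': '001', 'I': '010', 'O': '011', 'T': '100', 'N': '101'}

The exact bit patterns depend on the left/right convention; what matters is the code lengths: 2 bits for E and 3 bits for the others.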

The obtained code depends on the alphabet size and the frequencies. If we look at the corresponding trees for the top-6 and top-5, we get:

and

In all these cases, E always gets a short code of length 2. Depending on the case, it gets code 01, 10 or 11. Therefore, it is important to keep in memory and share the code used.

Other Readings

If you want to know more about Huffman Trees, you can have a look at:

  • Huffman Coding, Alistair Moffat, University of Melbourne 4
  • Chapter 2: Huffman Coding 5

\(n\)-grams and Text Generation

If you perform a letter-frequency analysis of a text, you can find in which language it was written without looking at any word. However, this is one-way: you cannot generate an English text just by picking letters at random according to their frequencies. The same goes for words: Cat reform bottle need is made of English words but is nonsense. There is a need for coherency.

To improve text likelihood, one idea is to look at letters or words over a fixed window. If you take a window of size \(2\), you will learn which letters can follow each other: you will learn that a-a never occurs while w-h is very frequent. The generated words will look like words with many mistakes in them. If you increase the window, they will look even better, and at some point you will get real words. The same goes for words, where you will move from a nonsense sentence to something realistic.

As for letter frequencies, the idea is to look at tuple frequencies. An \(n\)-gram is a sequence of \(n\) characters or \(n\) words.

For instance, The cat is grey is equivalent to the list [The, cat, is, grey]. There are three possible bi-grams:

  • (The, cat)
  • (cat, is)
  • (is, grey)

and two possible tri-grams:

  • (The, cat, is)
  • (cat, is, grey)

and one 4-gram.
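As a minimal sketch (the ngrams helper is written for this post, it is not part of plainsight), extracting these tuples in python is straightforward:

def ngrams(words, n):
    """Return the list of n-grams (tuples of n consecutive words)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cat is grey".split()
print(ngrams(sentence, 2))  # [('The', 'cat'), ('cat', 'is'), ('is', 'grey')]
print(ngrams(sentence, 3))  # [('The', 'cat', 'is'), ('cat', 'is', 'grey')]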

Generating Natural-Looking Text

To generate a text that looks natural, it must have properties similar to regular text.

The first quantity to look at is word frequency.

To generate a text, we cannot just pick words at random from a dictionary. Words tend to be distributed according to a power law (Zipf’s law): a few words are highly frequent, while most of the rest almost never occur. If you pick words uniformly at random from a dictionary, you will mostly get words that almost never appear in real text.

Next, we have grammar. There are rules that govern where words can be located in a sentence; frequencies are not sufficient. Coding grammar rules is possible, but time consuming. To avoid this, we can instead look at \(n\)-grams. If we have a large corpus, we know which words can validly follow a given sequence.

If \(P(w_k | w_0, w_1, ..., w_{k-1}) > 0\), then the word \(w_k\) is a candidate. If \(k\) is large, we often have enough context to select a word that is valid according to grammar rules.
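As a hedged sketch of this idea (build_model is a helper written for this post, not plainsight’s actual structure), we can collect, for each context, the words that follow it in the corpus together with their counts, using the small corpus from the examples below:

from collections import Counter, defaultdict

def build_model(sentences, n=2):
    """Map each (n-1)-word context to a Counter of the words following it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            model[context][words[i + n - 1]] += 1
    return model

corpus = ["The cat is grey", "The cat is sleeping", "The dog is black"]
model = build_model(corpus, n=2)
print(model[("The",)])  # Counter({'cat': 2, 'dog': 1})
print(model[("is",)])   # Counter({'grey': 1, 'sleeping': 1, 'black': 1})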

Examples

Suppose our corpus is made of:

  • The cat is grey
  • The cat is sleeping
  • The dog is black

(The, dog) + (dog, is) + (is, grey) gives The dog is grey, which is a valid sentence in our language.

However, if our corpus is:

  • The fire is hot
  • The ice is blue

The ice is hot and The fire is blue are valid in terms of grammar, but wrong in terms of truth.

Using large n-grams plus a large corpus prevents grammar errors and aberrant sentences.

Mixing n-grams Text Generation with Huffman Coding

To construct a text with \(n\)-grams, you exploit the \(n-1\) previous words to find the \(n\)-th word.

  • If there is a single possibility, you have no choice to make (and no bit can be encoded).
  • If there are multiple possibilities, you can select one of them.

It is in this second case that Huffman trees are used. Given a context w_1, w_2, ..., w_(n-1), you have k options w_(o1), w_(o2), ..., w_(ok), each with a particular frequency. Using a Huffman tree built for this particular context, you can associate a weight w_ox to each of these k words and derive a binary code.

  • When you encode a message, you select the word whose code word matches the binary sequence you want to hide (see the sketch below).
  • When you decode a message, you recover the code associated with the word w_n.
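To make this concrete, here is a possible sketch of a single encoding transition (encode_step is our own helper, reusing the get_huffman_codes function defined earlier; the real plainsight code is organized differently):

def encode_step(candidates, bits):
    """
    candidates: dict mapping each possible next word to its frequency.
    bits: remaining bit string to hide.
    Returns the chosen word and the leftover bits.
    """
    if len(candidates) == 1:                 # a single choice encodes no bit
        return next(iter(candidates)), bits
    # Deterministic order: by decreasing frequency, then lexicographic
    items = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    codes = get_huffman_codes([f for _, f in items])
    for (word, _), code in zip(items, codes):
        if bits.startswith(code):
            return word, bits[len(code):]
    # End of the hidden message: any remaining word would do
    return items[0][0], ""

print(encode_step({"cat": 2, "dog": 1}, "011010"))  # ('cat', '11010')

This mirrors the first non-trivial transition of the encoding example below.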

Encoding Example

Going back to our cat and dog example, we get the following tree for bi-grams. We added the symbols TOP and BOTTOM to mark the start and the end of a sentence. This allows resetting the context memory when arriving at the end of a sentence.

As an example, we want to encode 011010.

At the start, we know that we must start a sentence. We follow the path Root -> TOP. Here, there is a single choice because all sentences start with The, so no bit can be encoded. Because we use bi-grams, we have one word of memory, i.e. C=[The].

We follow the path Root -> The. Here, we have two options: cat with code 0 and dog with code 1 (because cat is more frequent than dog). Because our first bit is 0, we select cat. Now, the current message is M=[The, cat], the context C=[cat], and the leftover bits 11010.

We follow the path Root -> cat. Same as with The: a single choice.

We follow the path Root -> is. Here, there are three choices. Because they all have the same frequency, ties are broken using the lexicographic order (we get code 00 for grey, 01 for black, and 1 for sleeping).

We select sleeping, so now, the current message is M=[The, cat, is, sleeping], the context C=[sleeping], and the leftover bits 1010.

Now, there is a single choice, which ends the sentence. We start again with TOP.

Repeating the process, we have:

  • The cat is sleeping. The dog + 010
  • The cat is sleeping. The dog is black + 0
  • The cat is sleeping. The dog is black. The cat + ``

Decoding Example

Decoding is as easy as encoding. Summarizing the context, the recovered bits and the leftover message at each step, we have:

  • C=[TOP], B="", M="The cat is sleeping BOT TOP The dog is black BOT TOP The cat"
  • C=[The], B="", M="cat is sleeping BOT TOP The dog is black BOT TOP The cat"
  • C=[cat], B="0", M="is sleeping BOT TOP The dog is black BOT TOP The cat"
  • C=[is], B="0", M="sleeping BOT TOP The dog is black BOT TOP The cat"
  • C=[sleeping], B="01", M=" BOT TOP The dog is black BOT TOP The cat"
  • C=[BOT], B="01", M="TOP The dog is black BOT TOP The cat"
  • C=[TOP], B="01", M="The dog is black BOT TOP The cat"
  • C=[The], B="01", M="dog is black BOT TOP The cat"
  • C=[dog], B="011", M="is black BOT TOP The cat"
  • C=[is], B="011", M=" black BOT TOP The cat"
  • C=[black], B="01101", M="BOT TOP The cat"
  • C=[BOT], B="01101", M="TOP The cat"
  • C=[TOP], B="01101", M="The cat"
  • C=[The], B="01101", M="cat"
  • C=[cat], B="011010", M=""

And we recovered our initial message.
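The decoding step can be sketched in the same way (decode_step is again our own helper, reusing get_huffman_codes): at each transition with several candidates, we recover the code of the word that was observed.

def decode_step(candidates, observed_word):
    """Return the bits revealed by observing observed_word in this context."""
    if len(candidates) == 1:                 # a single choice carries no bit
        return ""
    items = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    codes = get_huffman_codes([f for _, f in items])
    return codes[[w for w, _ in items].index(observed_word)]

print(decode_step({"cat": 2, "dog": 1}, "cat"))                      # 0
print(decode_step({"grey": 1, "black": 1, "sleeping": 1}, "black"))  # 01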

Plainsight I vs II

A previous implementation of plainsight can be found on Github.

A Sorting Problem

At the start, we just wanted to modify a few constructs so the python2 script could be used with python3. Changing xrange into range was quite easy. However, we got a problem:

\[M \xrightarrow[enc]{python2} M_{end} \xrightarrow[dec]{python2} M\] \[M \xrightarrow[enc]{python3} M'_{end} \xrightarrow[dec]{python3} M\] \[M \xrightarrow[enc]{python3} M'_{end} \xrightarrow[dec]{python2} \text{Unreadable stuff}\] \[M \xrightarrow[enc]{python2} M_{end} \xrightarrow[dec]{python3} \text{Unreadable stuff'}\]

In both implementations, the \(n\)-gram trees are coded using dictionaries. However, python2 and python3 dictionaries are not implemented the same way, especially concerning the ordering of the keys:

  • In python2, dictionary keys are ordered according to the internal hash function.
  • In python3 (3.7+), dictionary keys keep their insertion order. I.e., if dog is added before cat, then dog is the first key, and cat the second.

Look at these two examples:

With python2:

dic = {"dog": 1, "cat": 3, "frog":5}
print(dic)
> {'dog': 1, 'frog': 5, 'cat': 3}

With python3:

dic = {"dog": 1, "cat": 3, "frog":5}
print(dic)
> {"dog": 1, "cat": 3, "frog":5}

This creates the incompatibility.

For compatibility, our implementation sorts items deterministically:

  • First, items are sorted by frequency;
  • If two items have the same frequency, the lexicographic order is used (see the example below).
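For example, such a deterministic ordering of the children of a node can be obtained with a simple sort key (this is a sketch; the actual implementation may differ slightly):

counts = {"dog": 1, "cat": 3, "frog": 3}
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ordered)  # [('cat', 3), ('frog', 3), ('dog', 1)], whatever the python version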

Plainsight I: Simple Codes

plainsight I doesn’t use Huffman trees. Instead, when a choice can be made to finish an \(n\)-gram, it encodes the candidates with fixed-length codes (whose length depends on the \(n-1\) previous words).

If \(W\) is the set of possible words following the current \((n-1)\)-gram, with \(|W|=s > 1\), then there exists a \(k_s\) such that \(2^{k_s} \leq s < 2^{k_s+1}\). The \(2^{k_s}\) most frequent words get an associated binary code. As an example, if we have \(7\) candidates, \(k_s=2\), therefore only \(4\) of these \(7\) candidates can be used to encode bits. The most frequent one gets the code 00, the next 01, the third 10 and the last 11.
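A hedged sketch of this fixed-length scheme (fixed_length_codes is our name for it, not plainsight I’s actual function) could look like:

def fixed_length_codes(candidates):
    """
    Assign k-bit codes to the 2**k most frequent candidates,
    with k such that 2**k <= len(candidates) < 2**(k+1).
    candidates: dict mapping each word to its frequency.
    """
    s = len(candidates)
    if s < 2:
        return {}                            # a single candidate encodes nothing
    k = s.bit_length() - 1                   # 2**k <= s < 2**(k+1)
    ranked = sorted(candidates, key=lambda w: (-candidates[w], w))
    return {word: format(i, f"0{k}b") for i, word in enumerate(ranked[:2**k])}

print(fixed_length_codes({f"w{i}": 7 - i for i in range(7)}))
# {'w0': '00', 'w1': '01', 'w2': '10', 'w3': '11'}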

Because some words are not selected for coding, the richness of the generated text will be smaller. This is not a big problem as the code will always run and always be able to encode a message.

Even though only \(2^{k_s}\) of the \(s\) items carry coding information, it does not mean that the \(s-2^{k_s}\) remaining items are discarded. The intermediate nodes of the tree (neither root nor leaf) are still possible paths.

Look at the following figure, which represents the 3-grams of a synthetic corpus:

This is the state when we have added all the \(n\)-grams to the tree. The nodes are sorted from left to right by decreasing frequency.

If we look at node G, it can be accessed using the path Root -> C -> B -> G. The only possible path after removing C from the context is Root -> B -> G -> x. If we prune the tree, we get:

where the path Root -> B -> G does not exist anymore, forcing the script to abort the current sentence. Therefore, you could get broken sentences with the first version. In our implementation, this does not occur anymore, as all nodes are taken into account.

Discussion

Selecting the Right Depth

If you select a small tree depth (2 or 3), the text won’t look very natural. On the other hand, if the depth is too large, there are fewer n-gram possibilities. As a consequence:

  • You encode fewer bits per word transition because you have a lower number of choices (for instance, you would always get This + is => a)
  • There are more “trivial” words, i.e. words with no alternative, which encode no bits at all.

which in the end leads to very large messages. For instance, if you look at the tree of depth \(3\):

you can see that there are 2 final forks: (TOP, The) -> (cat|dog) and (cat, is) -> (grey|sleeping). To consume the bits of the message to hide, they can only be encoded using these two transitions. There are three possible sentences:

  • The cat is grey which encodes 00
  • The cat is sleeping which encodes 01
  • The dog is black which encodes 1

On average, we consume \(1.6\) bits per sentence, which is a smaller value than with a tree of depth 2, where each sentence could encode 2 to 3 bits (previous example). If you have one byte to hide (one ASCII character), you need at least 4 sentences of 4 words.

If you use very large \(n\)-grams, the beginning of the sentence will often condition all the following words. You will always be able to encode your message, but it will take a large amount of space.

The \(n\)-gram size must be selected according to the corpus size (and corpus diversity).

Illustration

Over the Sherlock Holmes corpus and depth=5, we found:

  • ah jack she said i have just been in to see if i can be of any assistance to our new neighbors why do you look at me like that jack you are not angry with me
  • ah jack she said i have just been in to see if i could catch a glimpse of the strange face which had looked out at me on the day before as i stood there imagine my surprise mr holmes when the door suddenly opened and my wife walked out

You can see how similar the two sentences are at the beginning.

Running it with depth=3:

  • ah jack she cried
  • ah jack she cried with a startled face
  • ah jack she cried trust me jack she said i at last however there came an overwhelming desire to see her slinking into her own husband spoke to her however before i ran into the other side of the gnu fdl 3
  • ah jack she cried i swear that this mystery comes to an end from now you are surprised said she and i had to choose between you and in my direction and her footsteps coming up the lane but i did not even knock when i left her in peace having no success she goes again next morning and her duty

With a smaller depth, said is no longer used. In the corpus, jack she cried might occur more frequently than jack she said. But when the sentence starts with ah, it is never followed by cried later.

Plainsight I vs II

In both versions, bits can be hidden whenever there are at least two choices at the current context node of the tree.

In the first version, only \(2^k\) of the \(n\) children nodes can be selected. This reduces the diversity of the possible messages that can be constructed. Nevertheless, assigning the codes is much faster in this version, as we only need to rank nodes by frequency.

In the second version, any of the children can be selected. Infrequent words encode more bits than frequent ones, but it is unlikely that a long code word matches exactly the current bit string to encode. One of the disadvantages of this method is its stability with regard to the corpus: if some sentences are removed from the corpus, the vocabulary set may change and code words may be reassigned.

Efficiency

The efficiency of a steganographic system corresponds to the number of hidden bits versus the size of the cover message. Because we use whole words to encode bits, the encoding power is very low. So this kind of system is quite inefficient, but fun after all.
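To give an order of magnitude (a rough estimate based on the depth-3 toy example above): a 4-word sentence is roughly 20 characters, i.e. about \(8 \times 20 = 160\) bits of cover text, and hides on average about \(1.6\) bits, so the efficiency is around \(1.6 / 160 \approx 1\%\), and it gets even lower with larger \(n\)-grams on a real corpus.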

The second aspect for steganographic evaluation is the detection threshold.

  • If the tree depth is low, the text is quite unreadable and easily detected, but has a good encoding performance.
  • With large \(n\)-grams, it is more difficult to identify at first glance, but the encoding performance is lower.

Compared to cryptographic systems, plainsight is a kind of secret-key steganographic system, as without the corpus and the depth, you cannot recover the message.

Coding Vector

In this implementation, the word transition encodes the bits. It is possible to work at different levels:

  • transitions between letters (creating neologisms);
  • transitions between sentences. For instance, we could generate poems, like haikus, where the order of the sentences doesn’t affect the meaning of the text.

Instead of text, you can work on graphs. According to a ranking/scoring function, you order the neighborhood of your current node / current path, and jump to a valid neighbor. With this approach, you can generate:

  • reading lists: using a citation graph and ranking papers by notoriety, an ordered list of papers can hide information. As a paper may have many citations, it is possible to hide a lot of information in one jump.
  • wedding lists / invitations: using social network friend connections.

You can also work on a fixed set of elements, like paper, rock, scissors, or create images, time series, … There are many other ways to play with this kind of encoding method.

Conclusion

In this post, we explained how plainsight works to generate genuine-looking texts hiding secret messages. When reading and testing the initial code, I discovered that the implementation was language dependent, preventing compatibility with other implementations. This compatibility problem was linked to the way keys are ordered within dictionaries.

In this new version, the ordering no longer depends on the language implementation, which makes us independent of a single tool. We also drastically simplified the code, keeping the same number of lines while adding documentation and many blank lines. You can find the code on Github 6.


