Stonehill College

Data Structures and Discrete Mathematics Learning Community

Professors Ralph Bravaco and Shai Simonson

Cryptography Laboratory

Introduction

In this lab, you will experiment with cryptographic techniques that are the foundation for today’s e-commerce and internet communication.

In World War II the war behind the war was all about the British trying to intercept and decode German transmissions. (Check here for the contributions of the famous pioneer computer scientist, Alan Turing in this effort.).   Here is the recent apology from the British government.

A simple version of the codes used in World War II works like this.  The alphabet is shifted over by some number of letters, so that “a” might become “f”, and “b” would be “g”, etc., wrapping all the way around.  We could call that the f-shift.  The substitutions for the f-shift are shown below:

abcdefghijklmnopqrstuvwxyz
fghijklmnopqrstuvwxyzabcde

For example, the word “shai” is encoded into “xmfn” using the f-shift.  Note that the f-shift is nothing more than adding 5 to each letter’s ASCII value.  You need to be careful to wrap around appropriately modulo 26.

If that was all there was to encoding, then a code breaker would only have to try 25 possibilities to break the code.  A code breaker simply tries translating the message using each of the letter shifts until one translation looks like a real message.

The Germans were more clever in creating their encodings.  The trick they used (more or less) was to have each subsequent letter use a different shift.  After a while these shifts would cycle.  The sequence of shifts was represented by a code word consisting of letter shifts.  For example, if the codeword was “remarkable”, then the first letter of the message would shift by 17, the second letter by 4, …, the 7th letter by 0 etc.  The 11th letter would shift by 17 again like the first and the cycle would start to repeat.

For example, the first few characters of the sentence “We meet in New York for a rendezvous” would be encoded as “Ni ye…”.  Note that spaces between words were not actually transmitted or encoded.

Checkpoint:  Make sure you can encode the rest of the sentence yourself by hand.

For a codeword like “remarkable” of length ten, there are now 2610 different possible encodings.  The longer the sequence of the cycle, the more possibilities there are.

Breaking the Code

Once one knows the length of the cycle, there are a number of techniques for breaking the code without having to try all the possibilities. One idea is to divide the letters of the encoded message up into 10 groups, one for each shift in the cycle.  The 1st, 11th, 21st etc make the first group.  The 2nd, 22nd, 32nd etc. make the second group.  And so on.  The letters in each group are encoded using the same shift.

We try to decode the letters a group at a time.  For each group, we try all 26 possible shifts, but since the letters in each group are scattered throughout the message, we are unlikely to learn anything by doing this.  The chance of something looking familiar will be very small.

There is a way to learn something about the message nevertheless.  Every language has a characteristic statistical frequency for each letter.  For example, in English the letter “e” is the most common.  For a given group, we compare the frequencies of the letters in each of the 26 decodings to the expected frequencies.  Matching the two sets of frequencies helps identify the shift or at least narrows down the number of possibilities from 26 to just a few.

The Germans actually used a more complex version of cycles and shifts that required a more sophisticated computer-aided decoding effort.  For one thing, they allowed an arbitrary substitution of 26 letters rather than a single shift value.  See a description and a simulator of their Enigma machines.  This movie explains all the details about the Enigma hardware:   rotors, reflectors, and plugboard.  lThe end of the story was that the British were successful in their efforts to crack the Enigma code, and the Allies prevailed.

Program 1.  A cycle of shifts is described by a codeword like REMARKABLE, where each letter represents a shift, A is 0, B is 1, … and Z is 25.  Write a program that takes a text and a codeword, and produces the encrypted text.  You will need to use the % (mod) operator and perhaps an if-statement or two to make sure you wrap around appropriately.   You could keep the codeword in a circularly linked list, but that is not necessary.  For simplicity, you may assume that all letters are uppercase.
Program 2.  Modify your program so that it can decode as easily as it encodes. All you need to do is add an option to decode, and then do subtraction rather than addition.

Make a note of how symmetric this is!  This was the hallmark of cryptography since ancient times, until the last twenty years.  If you knew how to encode something, then you knew how to decode it.  You might be thinking “well duh!!”

Determining the Cycle Length

An ingenious method to determine the cycle length was discovered by Philip Friedman whose paper is described by David Kahn in his seminal history of cryptography  as "the most important single publication in cryptology."   Friedman defines the index of coincidence, a statistical measure of text, that for normal English is about 6.6%, but for a random collection of letters is only about 3.8%.

To determine the index of coincidence of a particular piece of text, take the text, rotate it by some arbitrary number of places, and write the rotated text underneath the original text.  For example, the sentence below is rotated by 58 characters and the rotated version appears beneath it.

alanturingbreakscodeslikenobodysbusinessbuthispersonallifesadlybecameeverybodysbusiness
dysbusinessbuthispersonallifesadlybecameeverybodysbusinessalanturingbreakscodeslikenobo

Consider the number of places in which the same letter occurs in both strings of text.  In this example, there are 6 such "coincidences".  The index of coincidence is the ratio of coincidences to the total letters in the text.  In this  example, it is 6/87, which equals approximately 6.9%, a rate similar to the 6.6% of normal English text. You will not likely see 6.6% with every shift, but you will see 6.6% more surely if you average over all the shifts, ranging from 1 through the length_of_the_text - 1.  In the example above, if you shift once, twice, three times, etc., through 86, and average all the indices of coincidence, you will get a value quite close to 6.6%.

An important point to notice is that when a text is encrypted with a single letter codeword, the index of coincidence of the encrypted text is identical to that of the original.  This is because with a fixed single value shift, two letters match before the shift if and only if they match after the shift.  In contrast, the index of coincidence of text encrypted with a codeword of length greater than one, would likely be closer to the 3.8% you expect with random letters. The exception is when we happen to rotate the text by a multiple of the codeword's length.  In this case, the letters that are lined up underneath each other are encoded using the same shift, and therefore the usual English index of coincidence would be expected. 

Hence, to determine (or at least make a good guess at) the codeword length is to rotate the encrypted text by 1, 2, 3 etc. symbols, and examine the results looking for an index of coincidence closer to 6.6% than 3.8%. As we mentioned, you may not see the 6.6% with one particular codeword length rotation but you will likely see it if you consider an average over a larger number of rotations, each of which is a multiple of the codeword length.  Therefore: When considering a codeword of k symbols, pay special attention to rotations that are multiples of k.  You will get better results by looking for 6.6% in the average of these multiples

For example, say you have an encrypted text of 100 symbols, and you calculate an index of coincidence for each rotation from 1 through 99.  Furthermore, assume the data starts like this:

Rotation        Index of Coincidence
1                    1%
2                    3%
3                    5%
4                    8%
5                    3%
6                    7%
7                    2%
8                    1%
9                    8%
10                  4%
...                    ...

This fragment of the data seems to suggest that a likely codeword length is 3, because the average of rotations 3, 6, and 9 equals 6.6%.  To confirm this, you should consider the rest of the data, (3, 6, 9, ... 99), along with averages of multiples of other lengths, such as (2, 4, 6, ..., 98),  (4, 8, 12, ..., 96), (5, 10, 15, 95), etc.  The reasonable candidates for the actual codeword length, are those with multiples that average closest to 6.6%.  Remember, that since this method is statistical, it is not guaranteed to identify the cycle length -- just to point out likely candidates. 

Program 3.  Write a program that takes encrypted text and calculates the index of coincidence for rotations of 1 through the length_of_text - 1.  A chart like the one above should be printed.   Create test data by cutting some English text of at least 150 characters off the Internet, and encrypting the text with a codeword of your choice using Program 1.  Do nut use the example above "alanturingbreakscodeslie..." because it is too short to get very good data.

Analysis:  Assuming that the codeword length is between two and ten inclusive, use your program to help you guess the exact codeword length by calculating the average index of coincidence for all multiples of codeword lengths two through ten, and searching for an average index of coincidence close to 6.6%.  In your report, include the original text, encrypted text, codeword, the data generated by your program, and your analysis of finding the codeword's length. Include the average index of coincidence for multiples of lengths 2 through 10.

Public Key Cryptography - A Revolution in Cryptography

Cryptographic methods do not have to be symmetrical!  Today a new kind of encoding is used which is called Public Key, one-way, or sometimes trapdoor cryptography.  In this new method, the whole world is able to encode information, but unless someone has more information, then they still cannot decode a message.

This new method is what allows e-commerce to flourish without fear of a security breach.  Here’s how this trapdoor technology is used.   Say I want to send my credit card to Amazon.com to buy a book, I encode my number with a publicly published method that anyone else could use, but only Amazon.com can decode it!

What if I don’t trust that I am actually talking to Amazon.com?  That is, I suspect that a thief posing as Amazon, sent me a fake encoding algorithm, and then they will decode it and get my number!  In that case, we do the process in reverse.  I ask Amazon to send me a message encoded with their secret encoding that anyone can decode with their public algorithm.  If I decode the message and it states, “Hi I am Amazon.com”, I know it had to come from them, because nobody else would have known how to encode it correctly.  This is called authentication and is described in nice detail by VeriSign, the largest digital signature provider, at http://www.verisign.com/docs/pk_intro.html.

The Mathematics Behind Public Key Cryptography

Ironically, this new extremely practical technology is based on one of the oldest branches of pure mathematics, famous for its beauty and scarcity of practical applications:  number theory.

The details of public-key cryptography are based on the RSA algorithm, by Rivest, Shamir, and Adelman.  We will review the mathematics here briefly here and end with a few experiments and programs.

Euclid’s Algorithm and Greatest Common Divisors

The greatest common divisor of two integers is the largest integer that divides both evenly.  For example, the greatest common divisor (gcd) of 24 and 15 is 3. One way to calculate the gcd of two numbers, a and b, is to simply try all the numbers starting from the smaller of the two given numbers down to one.  The first of these numbers that divides both a and b evenly, is the greatest common divisor.
Program 4.  Write a program to find the greatest common divisor by trying each number from the lower number and working downwards towards 1.

Program 5.  Write a program to find the greatest common divisor by listing the prime factors for each number, and taking the intersection of the two lists.   For example, if the two numbers are 24 and 40, then the prime factors of 24 are: 2, 2, 2, 3, and the prime factors of 40 are:  2, 2, 2, 5.  The intersection of these two lists is 2, 2, 2, so the gcd is 2x2x2 = 8. A nice way to do this is to store the prime factors sorted in two arrays, and then use an algorithm that resembles the "merging" algorithm. This takes time proportional to the sum of the lengths of the two lists of primes.  A less efficient algorithm would take time proportional to the product of teh lengths of teh two lists of primes.

Although the methods used in both program 4 and 5 are intuitive and seem natural, neither is the best way to find greatest common divisors.  Both these algorithms take time proportional to the size of the two numbers. It will be important to us later to be able to find the greatest common divisor efficiently without factoring the given numbers.

Euclid (300 B.C.E.) invented a method to compute greatest common divisors that bears his name to this day.  Euclid's algorithm in time proportional to the number of digits in the two numbers - a tremendous improvement over Programs 4 and 5.  In Euclid’s algorithm, we use recursion.  He proved that assuming a>b, then the gcd(a,b) = gcd(b, a mod b).  An example is shown below.

a
b
   
123
28
28
11
11
6
6
5
5
1
1
0
When the smaller number, b, equals 0, then the answer is a.  In this case, the gcd(123, 28) = 1.
Program 6.  Write a recursive program to implement Euclid’s algorithm.  Write an iterative program to implement Euclid’s algorithm.

Experiment.  Test your four programs on a variety (2 or 3) of pairs of numbers, time them, and discuss the results.  Make sure to use some very large (long int) sized numbers to see the true differences in time complexity.  Make sure not to time any printing or unrelated work that would inaccurately spoil the timing.  In Java, you can use System.nanoTime() (returns a long),  to get the current CPU time and calculating time elapsed becomes a matter of subtracting the start time from the finish time.

Note that theoretically Euclid’s algorithm (whether recursive or iterative) takes time proportional to the number of digits in the numbers, while the other two algorithms take time proportional to the numbers themselves (an exponential explosion in complexity). This theory should let you check whether your results make sense. Discuss the results of your experiments with respect to the theory.

Connection to Cryptography

It is not obvious that Euclid’s algorithm has a lot to do with RSA public-key cryptography, but it does.  You will need to be patient.

A theorem of Euclid that we will not prove here, states that if the gcd(a,b) = m, then there are two integers  x and y such that ax + by = m.  These values x and y can be computed constructively using Euclid’s algorithm.  Here is an example using the numbers from the previous example.

123  =  28(4) + 11
28  =  11(2) + 6
11  =  6(1) + 5
6  =   5(1) + 1
The greatest common divisor was 1.  Now we work our way backwards:
1  = 6 – 5(1)
1  = 6 – (11 – 6(1))(1)  = 6(2) – 11(1)
1  =  (28–11(2))(2) – 11(1)   = 28(2) – 11(5)
1 = 28(2) – (123 – 28(4))(5) = 28(22) – 123(5)
Each step gives us the x and y for the pair of numbers on that level, until finally we arrive back at the original pair of numbers 28 and 123, where the x and y are 22 and –5 respectively.  Study this example, and try to understand exactly what is going on.  (You may notice that the solutions for x and y are not unique.  In class, we will discuss more details and variations of this method.  But for the purposes of this lab, you won’t need to know anything more about it.)
Checkpoint:  To make sure you get the idea above, compute by hand, the integers x and y, such that ax + by = gcd(a,b) for each pair a, b below.
(99, 101)    (10, 35)      (7, 12)       (36, 42)

Although the next program is only three lines long, you will likely need a hint trying to figure out how to relate the computation of the next row’s x,y values to the previous row’s. 

Program 7.   Write a recursive program to calculate the integers x and y, such that ax + by = gcd(a,b) for a given pair a and b.  This program will be crucial for the RSA cryptography algorithm.  Try your program out on the numbers 233987973 and 41111687.

Hints for program 7:   If gcd(a,b) = a mod b, we are at the last row, and we can return (x,y) = (1, -(a/b) ).  Otherwise, we must recursively compute the function with the inputs (b, a mod b).  Call the outputs of this recursive call x' and y'.  Then the original function can return (x,y) = (y', x' - (a/b)y').  Try this on the example above for 28 and 123, and I will review the example in class if necessary.  Note also that an easier base case that goes one level deeper is "if b = 0, then the gcd is a, and you can return (x,y) = (1, 0).

 Public Key Cryptography – A First Attempt

This section describes a method that is a simple version of RSA, but has the drawback that the private key can be computed from the public key using program 6 that you wrote earlier.  Study this simple version first, because although it does not quite work because of the drawback mentioned, it will give you a chance to understand what is really going on before tackling RSA proper.

All the public key cryptographic methods encode one integer into another, so we assume that our messages are first converted to a sequence of integers.  The exact method of conversion is not trivial but it won’t concern us here.

To encode a number, we will need the public key.  This consists of two integers, for example 5 and 17.  The second integer must be prime, and the first must be relatively prime to the second integer minus one.  In this case, 17 is prime, and 5 is relatively prime to 16.  Recall that two numbers are relatively prime if and only if their greatest common divisor is equal to one.

Encoding and Decoding Numbers – Public and Private Keys

Now to encode a number, say 6, using this key.  All we do is calculate the remainder of 65 after dividing by 17.  You can check that this equals 7.  To decode 7 back into 6, we need to have the private key.  The private key also consist of two numbers, one of which is really public, namely the prime 17, and one of which is really private namely 13.  Then we calculate the remainder of 713 after dividing by 17, and we get 6.

 Where does the 13 come from?  Why does this work?

Thirteen is the solution to the equation 5x = 1 mod 16.

What happened here is that we computed 65(13) mod 17 = 713 mod 17 =  6 mod 17.  Since 5(13) = 1 mod 16, we can write 5(13) = 16(4) + 1.  This means that that 616(4) + 1 = 6 mod 17, and equivalently 616(4) = 1 mod 17.  It turns out that this must happen as long as 5(13) = 1 mod 16, and 5 has no common factors with 16.  There is a theorem to justify this.

Fermat’s Little Theorem

Pierre de Fermat, was an amateur but great mathematician of the early 17th century.  Fermat’s Little Theorem, not to be confused with his more famous Last Theorem states:

If p is a prime, and is not divisible by p, then p evenly divides ap-1 - 1.  For example, if p equals 17, and a equals 6, then 17 evenly divides 616 – 1, or 616 = 1 mod 17.  The proof is left for the discrete math class.   It is elegant but irrelevant for this lab.  It is the result that we need.  If 616 = 1 mod 17, then certainly 616(4) = 1 mod 17, and our encoding and decoding will work.

Computing Inverses

It doesn’t matter if you understand all the mathematics here, but you do need to understand how to calculate this number 13.  Recall that thirteen was the solution to the equation, 5x = 1 mod 16.  That means we must find two numbers x and y such that 5x – 16y = 1.  Sound familiar?  It is just Euclid’s algorithm!

The numbers 5 and 13 are called inverses of each other mod 16.  That means they multiply together to equal one.  The computation of an inverse is just what you did in program 6.

Now What?  If a person wanted to calculate the private key from the public key, they could do it by just using your program 6 and calculating the inverse of the private key mod 16.  What’s wrong is that it is possible to calculate inverses quickly and efficiently. You wrote a program to do it yourself!  So if someone publishes a public key (5, 17) you could have your program compute the private key (13, 17).  That is not good enough.  We want the decoding to be harder than the encoding.  We do not want anyone to be able to find 13 so easily.  Only the person who holds this private key can decode.

 A Second Attempt at Public Key Cryptography

The first attempt described above is based on work by Diffie and Hellman in 1976 (see http://www.verisign.com/docs/pk_intro.html) and was known before the contributions of Rivest, Shamir, and Ademan (RSA) in 1977. The RSA contribution is described next.
Now we start with two prime numbers, p and q, say 2 and 17, and compute their product pq = 34.  We calculate (p-1)(q-1) = 16, and we choose a value that has no common factors with 16, say 5.  The public key becomes the pair of numbers 34 and 5.

Encoding and Decoding

The encoding and decoding is done just like before.  For example, to encode 6, we compute 65 mod 34 = 24 and to decode 24 we compute 2413 mod 34 = 6.  As before, thirteen is the solution to the equation 5x = 1 mod 16.

The difference between this and our first attempt is that previously, 16 was simply one less than the public prime, so that the equation, 5x = 1 mod 16, was easily constructed (and solved by program 6).  But now the only way to calculate 16, is to factor the number 34 into 2 and 17 and compute (2-1)(17-1).  The factoring part is hard!  Nobody knows how to factor numbers quickly.  The best method is proportional to the given numbers.  Contrast this with the calculation of the inverse that is exponentially faster, proportional only to the number of digits in the given numbers.

Of course, anybody can factor 34, but in practice the two primes that are chosen are on the order of 100 digits each.  This makes all currently known factoring algorithms take centuries.  If you come up with an efficient algorithm (like the one for inverses) that can factor numbers, you will be famous!

Checkpoint
a.  Create your own public and private codes for doing RSA encrypting.  Make sure they satisfy all the appropriate constraints.
b. Assuming that each character in a message is represented by its ASCII value, encode the message “Too much work!”.

Program 8.  Write a program to encode numbers using RSA given the public key (e, m).  Recall that encoding a number w is done by calculating we mod m.  Make sure to use the fast modular exponentiation algorithm (Egyptian-style) we discussed in class.  That will allow you to deal with very large numbers efficiently.  Raising x to the yth power can be done in time proportional to log y.  You can test your program with your own numbers. 

Program 9.  Write a program that factors the public key pq, into p and q, computes (p-1)(q-1), computes the inverse to find the private key, and decodes a message.  Use your program to solve the following problem.  This program should use programs 7 and 8.  Be careful when using the results of program 7.  Recall, assuming that program 6 returns x and y such that bx + ay = 1 where b is the exponent and a is the mod value, then if x is positive the secret decoding exponent is x.  However, if x is negative then the secret decoding exponent is a-|x|, or simply a+x.  Test your program on the puzzle below.

Puzzle.  Cracking the UFO Message.
A public code is found etched on a rock on Mars:  (7, 1147).  The message {128, 1040, 129, 1144, 788, 735, 570, 875} is received from outer space on one of the billion machines running the Extraterrestrial Life Detector Screen Saver distributed among the world’s PC’s.  Assuming that this message was encrypted with the public code found on Mars, crack the code and decode the numbers.  (Note that this problem uses small primes, hence giving your program a fighting chance to succeed before we are all dead.)

Note that to use RSA on integers that are really large (which is the whole idea), you need to use a "Big Integer" class in order to avoid overflow.  You can get away with regular Java integers (int) for Challenge 3 and "long" Java integers for Challenge 4 below, but to proceed further on more realistic data, you would need a class that allows you to use integers of arbitrary size.  For C++ programs, here is one stack class that implements Big Integer and here is another.  For Java there is built-in BigInteger class.

For years, RSA published public challenges to factor some very large numbers (hundreds of digits).  Many of these were solved and prizes were paid.  Recently, they shut down the challenges.  Nobody knows exactly why.  Here is a nice explanation of the public RSA challenges.

Cryptographic Decoding Challenges

Challenge 1.

A code word less than six letters long gives an encoding of a well known cerebral song:  The code word is the name of a famous dog from the movie featuring the song.
BQHIERPVBZXOPORHASACNFLQHBYSKFBBZKBHAHASYZHKXFLQHBLIEHBBZKBHAHASKOBB
PWMVMVXHACNUAHLWWPXHAWGYBBBQHIERUSTBHHASKZBBVCEBBTBCGZRVTRTPKOBB

What is the original text? What is the code word?

Challenge 2.

A code word less than six letters long gives an encoding of a math joke about factorials, attributed incorrectly to Woody Allen:
BOQWMNWOOQSSFHQKASRLAGOWRAADWRKAONCIOHKJKCYHVYWHLLC

What is the text?  What is the code word?

Challenge 3.

RSA encoding of nine ascii values, using public key (10555, 21971), gives:
16912    19531    20676    16912    6613    2348    17835    15770    15770

The original text is a famous cartoon hunter.  What is the original text?

Challenge 4.

RSA encoding using public key (5555551, 118513313) gives:
80217189    107242213    96490860    79543571    25953566

What are the original numbers?


Enrichment Essay

We will screen the movie Breaking the Code with Derek Jacobi and The Imitation Game with Benedict Cumberbatch, both showing, in different ways, the role of Alan Turing in the British decryption of German codes in World War II.   We will run also some friendly WWII style cryptographic competitions with prizes for the winners.

Write your impressions of the two movies. Which did you like better and why?   Which did you feel was more authentic and why?