Monday, September 30, 2019

My crypto submission to HackTheBox

I greatly enjoy completing cryptography CTF's and so figured I could try creating my own simple challenge to submit to HackTheBox. There are quite a few crypto submissions though, and only a certain amount are allowed to be active at a time, so even if my challenge does get accepted, it could be quite a while until I'm able to link it to anybody, so instead lets talk about it here for now instead.



I remember my first exposure to crypto and CTF's in general was Google's 2018 summer CTF and the challenge 'dm collision' in which you were given a python script 'not_des.py', a slight alteration on the DES one-way compression function, and find collisions in it, ie two inputs which yielded the same output. This challenge was far outside of my abilities, and still would be difficult for me today, however I did find a write up for it online (really recommend this if you're interested in the challenge: https://fortenf.org/e/ctfs/crypto/2018/06/26/google-ctf-2018-dm-collision.html) and found everything being explained incredibly interesting, it undoubtedly lit the spark for my interest in crypto, and more specifically breaking it. As this challenge was more or less about DES, I ended up reading up a lot on DES and it's vulnerabilities, and so it's something I already know a lot about to make a challenge on. As such, my challenge submitted to HTB involves taking a ciphertext encrypted using DES, and finding it's plaintext.



So our ciphertext comes in base64 format as is standard for raw data like this and decodes into non-printable characters. If we count the number of characters, we find that it's a multiple of 8, thus indicating it's likely a ciphertext encrypted by a block-cipher.

The challenge description is where the real clue on where to go with this is given.

'We've uncovered an old encrypted message that seems to date back to the early 70's. Can you help us crack it?'

The 'leap of faith', characteristic to CTF's, is realizing that the only block-cipher to exist in the early 70's was DES, and so that must be the algorithm used (hence the challenge name 'Old School'). From there the challenger is intended to research DES (the wikipedia page would suffice) to find it's unique property in regards to weak keys. That is, keys wherein the operation of encryption becomes the same as decryption. Due to DES's symmetric nature of operation, keys such as all 0's or all 0xFF's will mean that whatever ciphertext is created using them will be able to be decrypted by simply encrypting the ciphertext again with the same key. A challenger could take this knowledge to iterate over all known weak keys as listed on wikipedia and try encrypting the ciphertext, checking against the output for anything legible.

f
Well, not directly, but Microsoft do use a bad implementation of 'tripple' DES that can be reduced down to a single DES operation, which is still used in wifi authentication protocols today!

I've demonstrated here a basic python script that can be used (with the help of a package called pyDes) to decrypt the cipher. Obviously without already knowing the key was all '0xFF's you would require a loop of all known weak keys, but this is redundant for the example.

Creating this challenge was a lot of fun, and taught me a little about how to properly store and output raw binary data with non-printable characters, as usually I'm only dealing with the other end of things and only trying to take a decrypt things, rather than share encrypted data. I hope HTB can approve the challenge, in which case I may have to take this post down less people go looking for the flag!

HackTheBox Challenge: Infinite Decent


This challenge was a lot of fun and the highest ranked (in terms of difficulty) one I've completed. It begins with two python scripts and an email, containing an RSA public key and messaged encrypted by the corresponding private key.

Not gonna lie, reading this made me feel like a real 1337 hacker, was pretty great

So lets break this email down. This company uses a typical secure communication system wherein a shared secret is sent via one time RSA key pairs, which is then used as the seed of a random number generator (a mersenne twister is a common pseudo-random number generator). This random number generator provides the shared key in future communications encrypted and decrypted with AES.

When you create an RSA key pair, you need two very large prime numbers sufficiently apart from each other. Generating these can take a long amount of time, so what this developer has done is generate the first prime in the standard way, then rather than doing the same again for the second prime has simply started counting in either direction until another large prime is found, as can be seen in the provided script fasterprimes.py:


So the main vulnerability here is that the generated primes in the public key are very close. This means we can easily factor the modulus 'n' value provided in the public key. There are two options - Fermat's factoring algorithm, or the much simpler approach of just taking the square root of n and then counting up and down until two primes that multiply to give the original modulus are found. Writing up a script for this is very simple and can be soon below:


An important thing to take into account though is that as we're dealing with very large numbers we can't treat them like normal using standard python int types, as this would result in a loss of precision at the least significant bits, throwing off our results. Instead we have to use a wrapper for the general multi-precision library called gmpy and keep our numbers as 'mpz' types, such that they'll remain accurate after carrying out operations on them. Lets run our script and see our results:


So now we have our RSA prime numbers, time to reconstruct the private key and decrypt our message, conveniently given in integer form already. This can again be done with a very basic script that uses the Euclidean algorithm for finding greatest common divisors such that we can directly obtain the private exponent d.

Saving the output into a file 'presharedkey' we get:


So, just a simple integer. Lets take a look at the 'AESBootStrap.py' script to see if there are any hints on how we might use this:


Left out is a standard mersenne twister, as implemented on wikipedia. It seems pretty obvious then that we're to take this number and use each triplet of digits as seed values for the random number generator. In this implementation each output from the generator is and'd against the hex 0xFF, or '11111111', so only the first 8 bits are included.


So our output for the AES key is a series of integers, each within a single byte. Converting to ASCII seems like the obvious route from here.


They're really making us work for this one aren't they? Lets convert from base64...


And there we have it! Our flag, finally. This was an incredibly fun challenge that I really enjoyed doing, so props to the creator.

Friday, September 20, 2019

Research and architecture ideas for maths editor

I've been working on a personal project that I'm quite fond of, and it's taken probably the most research and generally just thinking about how to approach each section of any project I've ever taken up. As such I wanted to write out my thoughts on how I structured and designed the workings of the program here.

The program itself is essentially a text editor made specifically for mathematical expressions. I've always found it frustrating that I chose one of the few subjects at university that still require a pen and paper (that is, physics), while everyone else gets to use laptops, so I've decided to try and solve the problem of writing and handling mathematical expressions being much harder on a keyboard than on paper with this program. The core idea is that from a terminal with ncurses you can use a set of keybinds to write an arbitrary expression out in free form quickly and easily, then switch to 'operation mode, much like default/insert mode in vim, to use a new set of keybinds to carry out operations that follow mathematical law on your written expression. The program automatically simplifies expression for you, so if your expression is simply 'x' and you add an x, it'll know to present the resulting expression as '2x', rather than 'x+x'. Likewise if you add a y, you'll be left with 'x+y'. It should be noted that this is not wolframalpha though - you cannot ask it to *solve* mathematical equations for you, by making certain variables the subject or finding roots etc. You still do all of the working out, as you would on paper, the goal here is that when you command an operation to be done, the equation that's returned is the same as what you would had wrote if you were writing it yourself on paper.

So lets get into the details of how this can be done. Currently only the backend is being worked on before the interface is started, so this will only deal with how that works, however I'll come back and add to this blog as more progress is made.

What we're essentially writing is a basic interpreter. We need something that can take an arbitrary string, cut it down into component parts (it's tokens), figure out how these tokens are to be parsed and then carry evaluation. The tokeniser is simple, as it just needs to spit out numbers, variables and operators as it sees them. The only real job is to collect whole numbers as one token, so that '10+5' is returned as '10', '+', '5' rather than '1', '0', '+', '5', but this too is trivial. It also needs to add in implicit operations left out by common short hand such as '10x', so that '10', '*', 'x' is returned, but this is again trivial.

Once we have our tokens then it's a matter of figuring out how to parse them. In interpreters for languages, an abstract syntax tree is used, so lets use something like that. Maths is very simple in this respect and so a parse tree is a natural choice, where each binary operator like a plus has two children which can either be numbers, variables or more operators that eventually terminate with numbers or variables. This of course changes slightly for more advanced operations, like trig or log functions, but they are only unary operators rather than binary, so they would just have one child instead of two. Likewise calculus operations can be considered unary operations. However, the core of the program is first being build with only 'basic' algebra in mind, as to simply work as a proof of concept for the approach being taken.
I've drawn out examples of what the tree should look like given a typical expression:
It's parsed simply by taking the bottom leaves, applying the operator stored in their parent, then replacing the key in that parent node with the result. and recursively doing so until only the root node and therefore result of the operation remains.

However, we're getting ahead of ourselves here. Before we can parse the tree, we need to build it from a stream of tokens. This was initially quite difficult to think about as how could you cleanly take an arbitrary expression like the one seen at the top of the above picture, and construct a tree that when parsed would evaluate the expression in the correct order according to operator precedence? There seems to be no obvious way, based on an *in*fix expression at least. What we can do to make life easier is convert our given equation to *post*fix (aka reverse polish notation) and then simply build the tree recursively using a modified shunting-yard algorithm for postfix expression evaluation, the modification here of course being rather than direct evaluation - which the shunting-yard algorithm isn't equipped for when dealing with algebraic variables - simply constructing a parse tree instead. What makes postfix notation perfect is that it already contains the correct evaluation precedence within it, so brackets don't ever need to be used, meaning you can add nodes to a tree directly rather than having to on the fly deal with bracketing order. I did think about how this could be possible, but also so much more of a mess that the extra step to convert to postfix notation is certainly worth it. It's also worth noting though that what's great about how the parse tree is constructed is that even though brackets are lost in postfix conversion, they can be recovered in the resulting evaluated expression by adding them back in to the rest of what's already been evaluated whenever the next operator seen has a higher evaluation precedence than the previous one. 

So, we have our parse tree constructed in a beautiful and recursive fashion thanks for postfix notation, now for the meaty part of actual evaluation, and where this program in my mind becomes unique. Every term involved in the parse tree - regular numbers, variables or a mix - can be split into three distinct parts: the variable, the coefficient and the exponent. The variable and exponent parts are the determining factors on if combination is possible, while coefficients are just regular numbers and can be operated on each other at will. As such using a 'term' class for every token seems like the best idea, that can store all this information as it gets manipulated over the evaluation and used for checking for simplification opportunities. However things get more complicated when you can't combine two terms and have to leave operators in - for instance the 'x+y' example gave earlier. This case can be tackled in a similar fashion though, it's just a collection of terms, and can be treated like it's own term in evaluation. As such container classes can be used to store different incompatible terms and which operators are connecting them, and when these container terms are evaluated they too can be added to larger container terms, where each term can be retrieved recursively. For instance, 'x*(x+1)' will iterate over the container term '(x+1)' and multiply 'x' to every individual term inside - 'x' and '1'.

I believe the core idea of this part to be solid, however currently there are still some implementation issues I'm considering. Namely, the best way to actually structure these container term classes that makes them easiest to parse. My initial idea was to hold separate variables in the class for numerator, denominator and exponent, but this causes evaluation to get messy with a lot of condition checks required at every turn, so I'm thinking there should be a better and cleaner way to approach this, which is currently where I'm at on this project. I will be excited to update this post as new advancements in the program comes along, though!

HackTheBox Challenge: Weak RSA



This challenge was very important to breaking me into crypto and handling of binary data. The intended way, due to it being an easy level challenge, was to read up on wikipedia about RSA, find out about the existence of the wiener attack, then download a script to run on the given ciphertext to extract the flag. I however assumed it was intended to be exploited from scratch, so I ended up spending far more time than was expected on creating my own python script (with help of a guide on how the wiener attack works) to complete the challenge, and I’m glad I did! The process introduced me to a lot of new concepts that I’ve then been able to use to make harder crypto challenges much easier, and will undoubtedly be useful for future programming projects.



The challenge itself simply gives you a ciphertext and an ascii armored public key, you have to figure out the rest. To me at the time this was quite a lot, as I didn’t really understand the differences between the three major crypto functions (symmetric, asymmetric, one-way compression) to begin with, let alone the details on how they worked. Luckily in comparison to the others, RSA has fairly straight forward internal workings, as it just relies on a small system of equations, rather than a large internal and complicated process like block ciphers or hash functions.

Using a package called pyCrypto I could import the given public key and print it’s public exponent and modulus. I created a small script just to have handy so I could quickly reread the values whenever needed.



After becoming familiar with the fundamentals of RSA and getting a way to access the public key’s elements directly, I started looking into common attack methods on wikipedia. The first was about how that for encryption carried out with a small public exponent (e), say around 3, the original plaintext could be retrieved from the ciphertext by simply taking it’s e’th root. Now as we can see, the public exponent used here was very large, and so this attack wouldn’t be feasible, however this did strike me as odd - how can you take the root of a text file? Well, in the same way RSA itself is applied to any type of text – by breaking it down into one single number and applying the mathematical equations on it, and this was the first time I really realized the know seemingly obvious fact that everything is just a bunch of numbers. Furthermore, with a bit of research and asking around, I also learnt about the algorithms OS2IP (octet stream to integer primitive) and I2OSP (integer to octet stream primitive) which can take either strings of text or whole numbers respectively and convert them from one to the other, which are obviously vital in an RSA encryption and decryption process.

Armed with this new knowledge the next challenge was dealing with encoding and decoding of text data using Python. This was again a new area to me but obviously very important in crytpo and programming generally. A big hurdle was simply realizing that data is actually ‘encoded' when it’s in raw byte form, and ‘decoded’ when in ascii/unicode, at least by python’s definitions anyway. Figuring out when to work with bytes and when to work with characters, when to use little or big endian (another concept I only came across due to this challenge) and also how to encode bytes back to unicode correctly in a preserving format – it turns out printing or piping from stdout to say the base64 utility will discard all non-printable characters and thus alter the data you’re working with. This happens with both Python and C! So, to get around that you also have to encode your bytes to base64 while they’re still in the memory of the program itself, rather than piping through stdout to the base64 tool, and then all your data will be preserved and can be reopened in the same way. This is all unrelated from crypto directly, but if you’re to write programs to carry out exploits from scratch is still vital knowledge that I had no idea about before attempting this.

Eventually I came across the wiener attack and realized that must be the one, as nothing else had worked so far. Learning about the maths was fun, and implementing it in python was fairly trivial.




Though messy, the output of our script contains our flag!

I came out of this now knowing about the internal workings of RSA, how encryption carries out mathematical operations on text, and how to properly handle encoded data to be manipulated or outputted in a program. Even though it was the ‘easiest of the challenges I’ve done, it probably took the longest and taught me the most!