You're about to click on a download button and notice a weird-looking code next to it. It doesn't seem to have anything to do with how big the file is, or to be something that you should make a note of. So you go ahead, get the file, and disaster! It doesn't seem to be the same as what you were expecting.
If only there was a quick way to see if the item you've just downloaded is exactly the same as the file that was on the website. Well, there is. Welcome to the world of checksums!
Just what the heck is a checksum?
TL;DR: A checksum is a number, in the form of a binary or hexadecimal value, that's been derived from a data source. The important bits to know: a checksum is typically much smaller than the data source, and it's also almost entirely unique, meaning that the chances of some other data giving exactly the same checksum are extremely low.
Let's have a look at some examples, the first of which is a simple text file (below), containing some critical information! All files contain data that covers more than just, say, the text we can see -- there will be bits allocated to telling us what type of file it is, how the data is arranged, and so on...
All of this gets handled in the process of creating the checksum, and we'll show you how it works and how you can do it yourself later in this article.
But for now, let's have a look at the value we get:
By itself, that code doesn't tell us anything. We can't reverse 'hack' it to figure out the pattern of ones and zeroes that the text file consisted of. However, it is supposed to be specific to that particular file, so now let's alter the original text file by rearranging some of the words.
The image above clearly shows that it's still the same text, and so technically the same data, but the sequence of the bits is now different. And the checksum this time round is:
Notice how it's the same length -- this is a key aspect of the process of getting the code -- but it's an entirely different checksum. Same data, different order, totally new checksum.
But perhaps that should have been expected; after all, the changes to the file weren't entirely trivial. So let's see what happens when we change just one letter in the whole thing: see if you can spot which one!
Cue the drum roll in the background, as we look at the checksum for this barely altered file.
That change of just one letter has once again given us another unique code. When it comes to checksums, that's the whole point of the system: any changes to an original data source, no matter how small they are, should result in a wholly new checksum, making it extremely easy to see if something has been altered.
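If you'd like to try the same experiment yourself, here's a minimal Python sketch using the standard hashlib module. The three sample strings are hypothetical stand-ins (they are not the text file shown in the screenshots), and sha256_hex is just a helper name chosen for this example.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 checksum of a string as 64 hexadecimal characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical sample text, standing in for the article's test file.
samples = {
    "original":   "The launch code is hidden under the stairs",
    "reordered":  "The code launch is hidden under the stairs",
    "one letter": "The launch code is hidden under the stairz",
}

# Hash each version and keep the results for comparison.
checksums = {label: sha256_hex(text) for label, text in samples.items()}

for label, checksum in checksums.items():
    print(f"{label:>10}: {checksum}")
```

All three lines of output should be the same length, yet share no obvious pattern, even though the last sample differs from the first by a single character.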
With that out of the way, let's see how it all works then!
The tech behind the check
At the heart of a checksum is the software algorithm that's used to create the codes we saw. In the case of our examples, we used a very common one known as SHA-256 (Secure Hash Algorithm - 256 bits). This algorithm is a type of cryptographic hash function (CHF), with the source data labelled as the message, and the output being called the hash value or just hash (the checksum, in this case).
Developed by the NSA and released nearly 20 years ago, SHA-256 belongs to a class of CHFs that are in widespread use around the world. Their popularity is down to the fact that they work quickly and they're resilient against attempts to 'hack' the code -- although there are much better ones available these days.
Each algorithm has its own way of doing things, but we'll just focus on what SHA-256 does. The process always gives a hash of a fixed length (256 bits in this case), regardless of how large the message is, although technically it's actually 8 values, each 32 bits in size.
So the checksum for our test1 file is actually 798B3808 4999FA50 E7D1861E 07E45F4E 3AA39668 DC6A12A8 4A058CAA A32DE0EB. This has been written in hexadecimal -- writing it out as a string of 256 ones and zeroes would be very tedious!
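To see the "8 values, each 32 bits" layout for yourself, the sketch below splits a SHA-256 digest into eight uppercase hexadecimal groups, the same way the checksum above is written. It's only an illustration: sha256_words is a made-up helper name and the input bytes are arbitrary sample data.

```python
import hashlib

def sha256_words(data: bytes) -> list:
    """Return the SHA-256 hash of data as eight 32-bit words in hexadecimal."""
    digest = hashlib.sha256(data).digest()  # 32 bytes = 256 bits
    # Slice the digest into 4-byte (32-bit) pieces and show each one in hex.
    return [digest[i:i + 4].hex().upper() for i in range(0, len(digest), 4)]

# Arbitrary sample input, not the article's test1 file.
words = sha256_words(b"hypothetical sample data")
print(" ".join(words))
```

Whatever you feed in, you should always get eight groups of eight hexadecimal characters.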
The first step in the algorithm's sequence is to process the message so that it becomes a collection of blocks, each 512 bits in size. For files that aren't whole-number multiples of 512, or if the file is smaller than this size, a trick called padding is employed. This is where a whole stack of zeroes is added after the message's bits are finished, to bring the total up to a whole-number multiple of 512.
For example, let's say we're trying to find the checksum of a file that's 10145 bits in total size. This would slice up into 19 whole blocks (9728 bits), with 417 bits spilling over into a 20th block and leaving 95 bits to fill. To indicate where the data ends and the padding starts, the string of bits that makes up the source has a 1 added to the end. So here, the padding would add 30 zeroes.
Hang on, why isn't it 94? The very final portion of the last block is a special 64-bit number: the length of the original file. That means, for our example, the 20th block would have to finish with the binary value of 10145, resulting in the message only requiring 30 bits of empty space to fill.
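The padding sums are easy to get wrong by hand, so here's a small Python sketch that works them out for any message length. The function name and the dictionary keys are invented for this example; the layout it assumes is the one described above (message, a single 1 bit, zeroes, then the 64-bit length).

```python
BLOCK_BITS = 512   # SHA-256 processes the message in 512-bit blocks
LENGTH_BITS = 64   # the last 64 bits of the final block hold the message length

def sha256_padding(message_bits: int) -> dict:
    """Break down how a message of the given length (in bits) gets padded."""
    whole_blocks, leftover = divmod(message_bits, BLOCK_BITS)
    # After the message comes a single 1 bit, then zeroes, then the length.
    # The number of zeroes is whatever makes the total a multiple of 512.
    zero_bits = (BLOCK_BITS - LENGTH_BITS - (message_bits + 1)) % BLOCK_BITS
    total_bits = message_bits + 1 + zero_bits + LENGTH_BITS
    return {
        "whole_blocks": whole_blocks,
        "leftover_bits": leftover,
        "zero_bits": zero_bits,
        "total_blocks": total_bits // BLOCK_BITS,
    }

breakdown = sha256_padding(10145)
for name, value in breakdown.items():
    print(f"{name}: {value}")
```

One quirk worth knowing: if the leftover bits don't leave enough room for the 1 bit and the 64-bit length, the padding spills into an extra block, which the modulo in the sketch takes care of.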
Once that's done, the algorithm takes the very first 512-bit block and slices it up into 16 portions, each one 32 bits in length; each of these values will be used in the hash calculation process.
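As a rough illustration of that slicing step, the Python sketch below pads a tiny three-byte message by hand into a single 512-bit block and then unpacks it into 16 values of 32 bits each. The message "abc" and the helper name split_block are assumptions made for the example.

```python
import struct

def split_block(block: bytes) -> tuple:
    """Split one 512-bit (64-byte) block into sixteen 32-bit integers."""
    if len(block) != 64:
        raise ValueError("a SHA-256 block is exactly 64 bytes (512 bits)")
    # ">16I" means sixteen big-endian unsigned 32-bit integers.
    return struct.unpack(">16I", block)

# Hypothetical message, padded by hand: the data, a 1 bit (0x80 is a 1
# followed by seven 0s), zero bytes, then the message length in bits
# stored as a 64-bit number.
message = b"abc"
zero_bytes = 64 - len(message) - 1 - 8
block = (message + b"\x80" + b"\x00" * zero_bytes
         + (len(message) * 8).to_bytes(8, "big"))

portions = split_block(block)
for index, value in enumerate(portions):
    print(f"portion {index:2d}: {value:08X}")
```

The first portion holds the message bytes plus the marker bit, and the last one holds the length (24 bits for this message).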
Up until this point, this is the easy part: the rest of the process involves a lot of math.
It's all well beyond the scope of this article but if you're interested in digging into it in more detail, you can read more about it here. But to give you a brief overview, it involves creating a starting hash first, using the first 8 prime numbers. These are run through an equation to give a 256-bit long value that's then modified over and over, as the rest of the algorithm works its way through all of the portions, in every block, from the processed source data.
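For the curious, the starting hash is simple enough to reproduce. The article only says the primes are "run through an equation"; in the published SHA-256 specification, that equation is taking the first 32 bits of the fractional part of each prime's square root. The Python sketch below follows that definition, using integer math to avoid rounding errors (it needs Python 3.8 or newer for math.isqrt). The function name is made up for the example.

```python
import math

FIRST_EIGHT_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]

def initial_hash_values() -> list:
    """Derive SHA-256's eight starting values from the first eight primes."""
    values = []
    for prime in FIRST_EIGHT_PRIMES:
        # isqrt(prime << 64) equals floor(sqrt(prime) * 2**32). Keeping only
        # the low 32 bits discards the whole-number part of the square root
        # and leaves the first 32 bits of its fractional part.
        values.append(math.isqrt(prime << 64) & 0xFFFFFFFF)
    return values

for prime, value in zip(FIRST_EIGHT_PRIMES, initial_hash_values()):
    print(f"prime {prime:2d} -> {value:08x}")
```

Put side by side, those eight 32-bit values make up the 256-bit starting hash that the rest of the algorithm then modifies, block by block.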
Sounds horribly complicated, yes? For a modern CPU, though, it's a piece of cake.
It takes no more than a dozen or so processor cycles, for every byte of source data, to generate the hash.
So what can you do with a checksum?
TL;DR: A checksum allows you to easily check the integrity of the data that makes up a file.
Picture this scenario: you need to download an important file that's critical to operating a computer. Really critical, so much so that you don't want it to have any errors or glitches in it. You've also got a slow and unstable internet connection, and you're worried that it might affect the file as it downloads.
The host of the file knows all of this, so they run a checksum algorithm on the file and put the answer on the download webpage. Once you've got it, you can run the same process and compare the values -- if they're the same, you'll know the file you downloaded is all okay.
And this is the primary use of a checksum: checking the integrity of the data that makes up a file. It can be done manually, as we'll see very shortly, or it can be part of an automated operation. Valve uses checksums on the Steam platform as part of the file verification process.
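An automated check along those lines might look something like the Python sketch below. It isn't how Steam or any particular site does it; the function names are invented for illustration, and the file is read in chunks so that even very large downloads don't need to fit in memory.

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 checksum of a file as lowercase hexadecimal."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Feed the file to the hash a chunk at a time until nothing is left.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published(path: str, published: str) -> bool:
    """Compare a file's checksum with the value shown on the download page."""
    # Hosts may list the hash in upper or lower case, so normalize it first.
    return file_sha256(path) == published.strip().lower()
```

You would call matches_published with the path to your download and the checksum copied from the web page: True means the file arrived intact, False means something changed along the way.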
Run your own checksum
All of the major operating systems have a built-in checksum tool, too.
To run a checksum in Windows, the easiest way is to use PowerShell: right-click on the Start Menu button or press Win+X, then pick PowerShell from the menu. If you're running an older Windows version, you can download PowerShell from here.
Enter the command Get-FileHash followed by the file location. Alternatively, enter the command and then drag and drop the file into the PowerShell window. Here's how our first test file was done.
By default, PowerShell uses SHA-256 to produce the checksum, but you can use others such as SHA-512 or MD5. Each will produce a different hash, but it will still be unique to that file. To use a different function, add the parameter -Algorithm followed by its name (for example, -Algorithm MD5).
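To get a feel for how the different functions compare, this short Python sketch hashes the same (arbitrary, made-up) data with four algorithms from the standard hashlib module and reports the size of each result. The helper name hash_with is just for this example.

```python
import hashlib

ALGORITHMS = ("md5", "sha1", "sha256", "sha512")

def hash_with(name: str, data: bytes) -> str:
    """Hash data with the named algorithm and return the hexadecimal result."""
    return hashlib.new(name, data).hexdigest()

# Arbitrary sample input; any bytes will do.
data = b"hypothetical sample data"

# Each hexadecimal character represents 4 bits of the hash.
sizes = {name: len(hash_with(name, data)) * 4 for name in ALGORITHMS}

for name in ALGORITHMS:
    print(f"{name:>6} ({sizes[name]} bits): {hash_with(name, data)}")
```

Same input, four completely different hashes of four different lengths, which is why you need to know which function the file host used before comparing anything.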
For checksums, using a different hash function doesn't bring any significant benefits, although some of the older ones (e.g. MD5, SHA-1) have been shown to produce the same hash for different files -- an issue that's known as a collision. Newer algorithms are more resilient to collisions, which is why PowerShell defaults to SHA-256.
The main reason for needing to switch to a different function is that the file host has chosen to use something other than SHA-256, so you'll need to use the same one in order to compare the hashes.
Comparing two long strings of numbers and letters can be a bit difficult to do, but with a tiny bit of programming, you can make PowerShell evaluate the checksums for you. Let's use the above MD5 code as an example and pretend that the original file's hash actually ended with the number 8.
The image below shows the lines of code you need to input, using Shift+Enter after each one.
See how it says 'False'? That's telling you that the file isn't the same. If you're certain that you have the correct hash for the file you want, then all suspicion falls on the data.
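The same kind of comparison can be sketched in Python. One practical wrinkle is that tools format hashes differently (PowerShell prints uppercase, while shasum prints lowercase), so this hypothetical helper strips spaces and ignores case before comparing. The sample value is the SHA-256 checksum quoted earlier for the test1 file, and the "tampered" version simply swaps its final character for an 8.

```python
def same_checksum(expected: str, actual: str) -> bool:
    """Compare two hexadecimal checksums, ignoring spaces and letter case."""
    def normalize(value: str) -> str:
        return "".join(value.split()).lower()
    return normalize(expected) == normalize(actual)

# The test1 checksum as written in the article, in grouped uppercase hex.
published = ("798B3808 4999FA50 E7D1861E 07E45F4E "
             "3AA39668 DC6A12A8 4A058CAA A32DE0EB")

# The same value as a terminal tool might print it.
computed = "798b38084999fa50e7d1861e07e45f4e3aa39668dc6a12a84a058caaa32de0eb"

# A pretend hash whose last character has been changed to 8.
tampered = "798b38084999fa50e7d1861e07e45f4e3aa39668dc6a12a84a058caaa32de0e8"

print(same_checksum(published, computed))
print(same_checksum(published, tampered))
```

The first comparison reports a match despite the different formatting; the second fails because of that single altered character.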
Note that a checksum can't tell you how the files are different -- it's a very binary test, if you'll pardon the pun. But it's a useful tool and there are some very specific checksum functions (such as check digit and check bit) that are used all the time to hunt out errors in data.
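To give a flavor of how simple a check bit can be, here's a sketch of one common form, the even parity bit: a single extra bit chosen so that the total number of 1s comes out even. This is an illustrative example rather than anything used elsewhere in the article, and the function name is made up.

```python
def even_parity_bit(data: bytes) -> int:
    """Return the bit (0 or 1) that makes the total count of 1s even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2

# Hypothetical one-byte message: the letter A is 01000001 in binary.
sent = b"A"
parity = even_parity_bit(sent)

# Simulate a glitch that flips one bit in transit (01000001 -> 01000011).
received = bytes([sent[0] ^ 0b00000010])

print("parity sent:    ", parity)
print("parity received:", even_parity_bit(received))
```

If the parity worked out on arrival doesn't match the one that was sent, at least one bit has flipped. Unlike SHA-256, though, a parity bit will miss any error that flips an even number of bits.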
Microsoft has made PowerShell available for macOS 10.13 or newer, and Linux, too, but if the thought of using something that originated with Windows gives you the heebie-jeebies, know that you can do the same natively on either OS, too.
For Mac users, you need to fire up the Terminal app, which is in the Utilities folder in Applications. The command to enter is shasum -a 256, followed by the path to the file you want to check (or just drag and drop the file into the Terminal window).
The shasum instruction is the equivalent to Get-FileHash in PowerShell, and the '-a 256' part is there to indicate which algorithm to use: 1 for SHA-1, 256 for SHA-256, and 512 for SHA-512.
Notice how it's given us the same checksum for the test file, as we got using PowerShell in Windows? That's the real power of it: no matter what computer or file system you use, as long as the algorithm is the same, you'll always get hash values that can be directly compared.
If you favor the delights of Linux, you'll be pleased to know that it's the same process as above -- fire up the Terminal and enter sha1sum, sha256sum, or sha512sum followed by the file's address to generate the required hash.
Once again, you can see that we've got the same checksum for our text file. All runs are doing the exact same math to create the hash, so none of this should have come as a surprise, but it's comforting to know that checksums can be done on any computing device.
Adding power to your downloads
Given how quick and easy checksums are, it's perhaps a little surprising that we don't carry them out more often or at all.
While the likes of Steam handle the process for us automatically, we are reliant on file hosts providing accurate checksums for the data they provide. In the case of TechSpot downloads, for example, we don't explicitly provide a checksum but the tools that we use to certify that downloads are clean, such as VirusTotal, use checksums to verify files' integrity and aggregate data when several parties scan the same file over time.
Some websites provide checksums for every file, whereas others only do it for important or very large items (e.g. Microsoft in their secure download sections), but it's becoming an increasingly rare sight. There are various possible reasons for this, such as people simply not being aware of them.
But where hosts do offer it, at least you now know how you can use the hash -- any extra thing that gives you a bit more peace of mind is always a good thing.