A Better Way to Protect Your Database IDs
A Better Way to Protect Your IDs #
Your web application has strong authentication, and every resource checks whether the user is actually authorized to access it. So why bother if the user actually knows the internal IDs of the models she is accessing?
Issue #1: Leak of Business Intelligence Data #
Your IDs might expose more information than you realize. For instance, when you place an order in a web shop, you will probably be redirected to a success page with your order_id as a query parameter (or similar).
From this ID you can probably estimate how many orders the shop has processed so far. But it gets worse. If you place another order, say 14 days later, and it gets the ID 7921, you can deduce from the difference between the two IDs that they receive about 4 orders a day. This is business intelligence data you may not want your competition to know (see also this article for a more thorough discussion of this issue).
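The estimate above is simple arithmetic, sketched here in Java. The first order ID (7865) is an assumption chosen for illustration so that the numbers match the "about 4 orders a day" in the text; only 7921 and the 14 days come from the article.

```java
class OrderRateEstimate {

    /** Average number of orders per day implied by two sequential order IDs. */
    static double ordersPerDay(long firstId, long laterId, long daysBetween) {
        return (double) (laterId - firstId) / daysBetween;
    }

    public static void main(String[] args) {
        long firstId = 7865; // assumed for illustration, not taken from the article
        long laterId = 7921; // the ID quoted in the text
        System.out.println(ordersPerDay(firstId, laterId, 14)); // prints 4.0
    }
}
```

Anyone who can place two orders can run this calculation, which is why sequential IDs in URLs are worth a second thought.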
Issue #2: Brute Force Guessing of IDs #
Often resources are public, but are protected by the fact that nobody knows the correct identifier. Think of a naive photo sharing site where you can make a single photo accessible via a link that contains its ID.
Now this ID might be totally random, or it might follow a sequence. An attacker could easily check a reasonable interval from e.g. 10989900 to 10990000 and see if any of these links work.
Issue #3: Leak of Personal Information Through IDs #
Sometimes you are required to share your user IDs with a statistics tool or another external service. If this ID is the primary key in your DB and is also used as the ID in the shop, it can probably be used in other parts of the site as a user_id parameter and might leak personal information. Think of a message board where users usually have a public profile.
What can be done? #
Solution #1: Using UUIDs #
A simple solution which tackles most of these issues is to use a random value from a very large numeric range instead of a sequence in a comparatively small one (e.g. a 64-bit integer). An implementation of this concept is the universally unique identifier, or UUID. Version 4 is based on random numbers and is the one you want to use. Because 6 bits are reserved for version and variant metadata, it cannot use all 128 bits and is limited to 122 random bits (which does not make a big difference in real-world applications). Its canonical text representation is 32 hexadecimal digits in five hyphen-separated groups of 8, 4, 4, 4 and 12 digits.
The range of 122 bits is so huge that you can pick any such number randomly and have a nearly 100% chance of it being unique in your context. In fact, you are probably the first person to ever generate this exact number. Note, though, that a UUID gives no guarantee whatsoever of being truly random; most implementations are, however. See this post for more info on the issue.
It is now practically infeasible for an attacker to guess your IDs (although guessing is theoretically still possible), and if the UUIDs are truly random, there is no sequence to observe (this fixes Issue #1 and #2).
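To get a feeling for the numbers, here is a minimal Java sketch. It generates a version 4 UUID with the JDK and compares the 101 candidates of the sequential interval from Issue #2 with a rough upper bound for guessing a random UUID. The figures of one billion existing IDs and one trillion guesses are assumptions picked purely for illustration.

```java
import java.util.UUID;

class UuidSketch {

    /**
     * Rough upper bound for the chance that at least one of `guesses` random
     * tries hits one of `existingIds` truly random version 4 UUIDs.
     */
    static double guessProbability(double existingIds, double guesses) {
        double space = Math.pow(2, 122); // 122 random bits in a version 4 UUID
        return Math.min(1.0, existingIds * guesses / space);
    }

    public static void main(String[] args) {
        // The JDK generates version 4 UUIDs from a cryptographically strong source.
        UUID id = UUID.randomUUID();
        System.out.println(id + " (version " + id.version() + ")");

        // Sequential IDs: the interval from Issue #2 has only 101 candidates.
        System.out.println(10990000L - 10989900L + 1);

        // Random IDs: assumed 1e9 stored UUIDs and 1e12 guesses.
        System.out.println(guessProbability(1e9, 1e12));
    }
}
```

Even under these generous assumptions for the attacker, the resulting probability stays far below anything that matters in practice.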
The downside of UUIDs is that they may be slower or more expensive to index in your DB if you use them as primary key, and adding a new column can be a hassle if you use them as correlation ID. Read here for a more in-depth discussion about UUIDs used in databases.
Also, you still expose an internal ID to the public, which, if used as primary key, cannot change. The need to change it might not arise often, but it does happen. You may also still be prone to Issue #3.
Solution #2: Mask your IDs #
The idea is not new; in fact, there is a very popular set of libraries called HashIds which tries to tackle this exact issue. The basic principle is this:
Before publishing your IDs, encode or encrypt them in a way that makes the underlying value incomprehensible for any client not knowing the secret key.
In my opinion HashIds has, among others, two main issues:
- No real security: it is more like a home-brew keyed encoding scheme and has no forgery protection, which means an attacker can still easily brute-force IDs without understanding them.
Improved ID Protection: ID-Mask #
Unsatisfied with these properties, I tried to create an improved version, called ID-Mask, which tackles these and other issues and comes with a reference implementation in Java. The basic features are:
- Support of all types usually used for IDs
- Strong cryptography with forgery protection
- Optional randomized IDs
Note that with this approach (and with HashIds), there is no possibility of collisions since no compression happens (as it would with a hash).
Full Type-Support for IDs #
If you think about it, there are only a handful of common types used as identifier:
- 64-bit integers (often called long)
- UUIDs (which are essentially 128-bit numbers)
- Arbitrary precision integers (called BigInteger in Java)
If we restrict the arbitrary precision part to around 128 bits, we can group these into two basic ID types: 64-bit and 128-bit IDs. All of those data types (and some more exotic types for specific use cases) are supported by the library.
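The grouping into 64-bit and 128-bit IDs becomes obvious once the values are written out as bytes. The following sketch uses only JDK classes; the class and method names are made up for this illustration and are not part of the library.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

class IdBytes {

    /** A 64-bit ID becomes 8 bytes (big-endian). */
    static byte[] toBytes(long id) {
        return ByteBuffer.allocate(Long.BYTES).putLong(id).array();
    }

    /** A UUID becomes 16 bytes: most significant half first. */
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    static long toLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    static UUID toUuid(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        long mostSignificant = buffer.getLong();
        long leastSignificant = buffer.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }

    public static void main(String[] args) {
        System.out.println(toBytes(7921L).length);             // 8 bytes
        System.out.println(toBytes(UUID.randomUUID()).length); // 16 bytes
    }
}
```

Everything that follows (encryption, authentication, encoding) can then operate on plain byte arrays of one of these two sizes.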
Strong Cryptography with Forgery Protection #
Instead of encoding the ID with a shuffled alphabet, we can just use proven cryptography: the Advanced Encryption Standard (AES) and a hash-based message authentication code (HMAC), the latter of which protects the IDs from being altered by an attacker. All the caller needs to provide is a strong secret key. Fortunately, since the key ID is encoded in the masked ID, the secret key can be replaced if it gets compromised. There is a slight difference in the scheme for 64-bit and 128-bit IDs to optimize for output size. For the interested reader, here is the full explanation of the scheme. There is also a discussion on crypto.stackexchange.com.
In addition to solving the main Issue #1, these properties also protect from the attack described in Issue #2: brute forcing. With a so-called authentication tag (i.e. the HMAC) attached to the ID, it is now extremely unlikely for an attacker to generate a valid ID.
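The general idea of "encrypt, then authenticate" can be sketched with the JDK alone. Note that this is not the actual ID-Mask scheme (which is described in the linked explanation and additionally encodes a key ID); the class name, the 8-byte tag length and the use of two separate keys are assumptions made for this illustration.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class AuthenticatedIdSketch {
    private static final int BLOCK = 16; // AES block size in bytes
    private static final int TAG = 8;    // truncated HMAC length in bytes (assumption)

    private final SecretKeySpec encKey;
    private final SecretKeySpec macKey;

    AuthenticatedIdSketch(byte[] aesKey, byte[] hmacKey) {
        this.encKey = new SecretKeySpec(aesKey, "AES");
        this.macKey = new SecretKeySpec(hmacKey, "HmacSHA256");
    }

    String mask(long id) {
        try {
            // The 8-byte ID is zero-padded to exactly one AES block.
            byte[] block = ByteBuffer.allocate(BLOCK).putLong(id).array();
            byte[] ciphertext = aes(Cipher.ENCRYPT_MODE, block);
            byte[] out = ByteBuffer.allocate(BLOCK + TAG).put(ciphertext).put(tag(ciphertext)).array();
            return Base64.getUrlEncoder().withoutPadding().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    long unmask(String masked) {
        try {
            byte[] in = Base64.getUrlDecoder().decode(masked);
            if (in.length != BLOCK + TAG) throw new SecurityException("unexpected length");
            byte[] ciphertext = Arrays.copyOfRange(in, 0, BLOCK);
            byte[] receivedTag = Arrays.copyOfRange(in, BLOCK, in.length);
            // Constant-time comparison; reject before decrypting anything.
            if (!MessageDigest.isEqual(receivedTag, tag(ciphertext))) {
                throw new SecurityException("authentication tag mismatch");
            }
            return ByteBuffer.wrap(aes(Cipher.DECRYPT_MODE, ciphertext)).getLong();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Only a single block is ever processed, so no chaining mode is involved.
    private byte[] aes(int mode, byte[] block) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
        cipher.init(mode, encKey);
        return cipher.doFinal(block);
    }

    private byte[] tag(byte[] data) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(macKey);
        return Arrays.copyOf(mac.doFinal(data), TAG);
    }
}
```

A call like `new AuthenticatedIdSketch(aesKey, hmacKey).mask(7921L)` returns the same URL-safe string every time for the same keys, and `unmask` rejects any string whose tag does not match, so an attacker cannot simply iterate over candidate values.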
Support for Randomized IDs #
For IDs of domain models it makes sense that they are deterministic (i.e. do not change for the same model over time), so that the client can check for equality, for example for caching. However, some use cases benefit from IDs that produce random-looking output, which makes it hard for an attacker to compare or replicate IDs.
An example would be shareable links. In the photo sharing scenario from above, the link would contain a masked value instead of the actual ID.
Masking the same ID two more times results in unrelated-looking output.
This solves the problem described in Issue #3 (leak of personal information): you can generate randomized IDs for e.g. your user_id, which cannot be used to find context information on e.g. your main site since they do not match. You are, however, still able to map the masked IDs back to the original users.
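Randomized masking can be illustrated with a fresh random nonce per call. The sketch below swaps in AES-GCM from the JDK, which combines encryption and authentication; it is not the scheme ID-Mask uses and makes no attempt to keep the output short. All names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class RandomizedIdSketch {
    private static final int NONCE = 12;     // nonce length in bytes
    private static final int TAG_BITS = 128; // GCM authentication tag length
    private static final int TOTAL = NONCE + Long.BYTES + TAG_BITS / 8;

    private final SecretKeySpec key;
    private final SecureRandom random = new SecureRandom();

    RandomizedIdSketch(byte[] aesKey) {
        this.key = new SecretKeySpec(aesKey, "AES");
    }

    String mask(long id) {
        try {
            // A new random nonce per call makes every output look unrelated.
            byte[] nonce = new byte[NONCE];
            random.nextBytes(nonce);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, nonce));
            byte[] ciphertext = cipher.doFinal(ByteBuffer.allocate(Long.BYTES).putLong(id).array());
            byte[] out = ByteBuffer.allocate(TOTAL).put(nonce).put(ciphertext).array();
            return Base64.getUrlEncoder().withoutPadding().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    long unmask(String masked) {
        byte[] in = Base64.getUrlDecoder().decode(masked);
        if (in.length != TOTAL) throw new SecurityException("unexpected length");
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, in, 0, NONCE));
            // doFinal verifies the tag and fails if the input was altered.
            return ByteBuffer.wrap(cipher.doFinal(in, NONCE, in.length - NONCE)).getLong();
        } catch (GeneralSecurityException e) {
            throw new SecurityException("invalid or forged ID", e);
        }
    }
}
```

Two calls to `mask` with the same ID yield different strings, yet both map back to the same original value, which is the property that makes such IDs suitable for sharing with external services.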
Adapt the Encoding to Your Needs #
The reference implementation supports a wide array of encodings, which can be chosen according to your requirements (short IDs vs. readability vs. avoiding words); all of them are URL-safe, of course.
Example using 64-bit IDs:
with optional formatting for better readability:
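How the choice of encoding affects length and readability can be sketched with JDK classes. The example encodes 8 raw bytes as hex and as URL-safe Base64 and optionally groups the result with dashes. The bytes here are just the plain value 7921 for illustration; in practice the masked bytes would be encoded. The helper names are hypothetical and not the library's API.

```java
import java.nio.ByteBuffer;
import java.util.Base64;

class EncodingSketch {

    /** Lowercase hex: 2 characters per byte. */
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** URL-safe Base64 without padding: roughly 4 characters per 3 bytes. */
    static String base64Url(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    /** Inserts a dash after every `size` characters for better readability. */
    static String group(String text, int size) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0 && i % size == 0) sb.append('-');
            sb.append(text.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(7921L).array();
        System.out.println(hex(bytes));           // 0000000000001ef1
        System.out.println(group(hex(bytes), 4)); // 0000-0000-0000-1ef1
        System.out.println(base64Url(bytes));     // 11 characters
    }
}
```

Hex is the easiest to read aloud, while Base64 gives the shortest output; grouping trades a few extra characters for readability.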
To avoid the problem of randomly occurring (English) words in the masked IDs, which could create embarrassing URLs, a Base32 dialect was added with a custom alphabet that contains no vowels and omits other problematic letters and numbers.
Encoding optimized to not contain words
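The principle behind such a word-safe encoding is a Base32 encoder with a custom 32-character alphabet. The alphabet below is an assumption made up for this sketch (digits 2 to 9, lowercase consonants without l and y, and a few uppercase consonants); it is not the alphabet the library uses. Decoding is omitted for brevity.

```java
class WordSafeBase32 {
    // Hypothetical alphabet: 8 digits + 19 lowercase consonants + 5 uppercase consonants.
    private static final String ALPHABET = "23456789bcdfghjkmnpqrstvwxzGHJKM";

    static String encode(byte[] data) {
        if (ALPHABET.length() != 32) throw new IllegalStateException("alphabet must have 32 characters");
        StringBuilder sb = new StringBuilder();
        int buffer = 0;       // collects input bits; only the low bits are used
        int bitsInBuffer = 0; // number of bits not yet emitted
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bitsInBuffer += 8;
            // Emit one character for every complete group of 5 bits.
            while (bitsInBuffer >= 5) {
                int index = (buffer >> (bitsInBuffer - 5)) & 0x1F;
                sb.append(ALPHABET.charAt(index));
                bitsInBuffer -= 5;
            }
        }
        // Left-align the remaining bits in a final 5-bit group.
        if (bitsInBuffer > 0) {
            int index = (buffer << (5 - bitsInBuffer)) & 0x1F;
            sb.append(ALPHABET.charAt(index));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(new byte[] {(byte) 0xFF})); // prints MH
    }
}
```

Because the alphabet contains neither vowels nor "y", the output cannot spell ordinary English words, at the cost of being slightly longer than Base64 (13 characters for 8 bytes).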
And More #
Currently, the reference implementation is quite fast with around 2–8µs per ID encryption. There is a built-in cache to improve performance for frequently recurring IDs. Additionally, there are default implementations for Java Jackson JSON Serializer and JAX-RS ParamConverter.
Code Example #
Here is a simple example using a 64-bit ID:
For more see the readme of the GitHub project.
There are many reasons why you may want to protect your internal IDs: they may leak business intelligence, allow brute forcing to find hidden content, or leak context or personal information. A possible solution is to use an additional ID column in your DB containing UUIDs, which solves many, but not all, of the issues and may not be feasible with millions of data sets. Another option is to encrypt your IDs to protect their value. There is a set of libraries called HashIds which tries to use this approach but currently has some major issues. I implemented a new approach, called ID-Mask, with a reference implementation in Java which supports a wide variety of data types usually used as IDs, uses strong cryptography and also supports generating randomized IDs. The library is ready to use and can be found on GitHub.