There are many reasons why you may want to protect your internal IDs: they may leak business intelligence, allow brute forcing, or leak context about personal information.
A new approach to protecting your internal IDs with a strong cryptographic scheme and many other useful features, inspired by HashIds.
Security, Id, Java, Hashids, Database
A Better Way to Protect Your IDs
Your web application uses strong authentication, and every resource checks whether the user is actually authorized to access it. So why bother if the user actually knows the internal IDs of the models she is accessing?
Your IDs might expose more information than you realize. For instance, when you place an order in a web shop, you will probably be redirected to a success page with your order_id as a query parameter (or similar):
From this you can probably estimate how many orders they have processed. But it gets worse. If you place another one, let's say 14 days later, and it gets the id 7921, you can deduce that they receive about 4 orders a day. This is business intelligence data you may not want your competition to know (see also this article for a more thorough discussion of this issue).
Often resources are public, but are protected by the fact that nobody knows the correct identifier. Think about a naive photo sharing site where you can make a single photo accessible via a link:
Now this id might be totally random, or it might follow a sequence. An attacker could easily check a reasonable interval from e.g. 10989900 to 10990000 and see if any of these links work.
Sometimes you are required to share your user IDs with a statistics tool or other external service. If this ID is your primary key in the DB and is used as the id in the shop, it can probably be used in other parts of the site as a user_id parameter and might leak personal information. Think about a message board where users usually have a public profile.
A simple solution which tackles most of the issues is to not use _a sequence in a small numeric range_ (i.e. a 64-bit integer) but to use a random value from _a big numeric range_. An implementation of such a concept is the universally unique identifier or UUID. Version 4 is based on random numbers and is the one you want to use. Because it carries some metadata, it cannot use all of the 128 bits and is limited to 122 bits (which does not make a big difference in real-world applications). Its text representation is 32 hexadecimal digits in five hyphen-separated groups (8-4-4-4-12), as defined in RFC 4122.
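As a minimal sketch of what this looks like in Java: the JDK ships a version 4 generator in `java.util.UUID`. The class and method names below (`UuidV4Sketch`, `newPublicId`) are made up for illustration only.

```java
import java.util.UUID;

final class UuidV4Sketch {

    /** Returns a random (version 4) UUID from the JDK's built-in generator. */
    static UUID newPublicId() {
        return UUID.randomUUID();
    }

    public static void main(String[] args) {
        UUID id = newPublicId();
        // 36 characters: 32 hex digits in five hyphen-separated groups (8-4-4-4-12)
        System.out.println(id);
        // version() is 4 for random UUIDs; variant() is 2 for the RFC 4122 layout
        System.out.println("version=" + id.version() + " variant=" + id.variant());
    }
}
```

The version and variant fields are the metadata that take up 6 of the 128 bits.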
The range of 122 bits is so huge that you can pick any such number randomly and have a nearly 100% chance of it being unique in your context. _In fact, you are probably the first person to ever generate this exact number._ Note, though, that a UUID gives no guarantee whatsoever of being truly random; most implementations are, however. See this post for more info on the issue.
It is now absolutely infeasible for an attacker to guess your ids (but guessing _is_ theoretically still possible), and if the UUIDs are truly random, no sequence can be observed (fixes _Issue #1 and #2_).
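To put a rough number on "infeasible", here is a small sketch that computes an upper bound (a simple union bound) on an attacker's chance of hitting any valid ID by random guessing. The attacker speed and the number of valid IDs are hypothetical values chosen only for illustration.

```java
final class GuessBoundSketch {

    /**
     * Union bound: the chance that at least one of {@code guesses} random guesses
     * hits one of {@code validIds} existing IDs in a space of 2^bits values
     * is at most guesses * validIds / 2^bits (capped at 1).
     */
    static double successBound(double guesses, double validIds, int bits) {
        return Math.min(1.0, guesses * validIds / Math.pow(2, bits));
    }

    public static void main(String[] args) {
        // Hypothetical attacker: one billion guesses per second for a whole year,
        // against a hypothetical one billion valid random UUIDs (122 random bits).
        double guessesPerYear = 1e9 * 60 * 60 * 24 * 365;
        System.out.println(successBound(guessesPerYear, 1e9, 122));
    }
}
```

Even under these generous assumptions the bound is only a few parts in a trillion, whereas the same attacker would exhaust a 32-bit range almost immediately.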
The downside of using UUIDs is that they may be slower or more expensive to index in your DB if you use them as primary keys, and it might be a hassle to create a new column if you use them as correlation ids. Read here for a more in-depth discussion of UUIDs used in databases.
Also, you still expose an internal id to the public, which, if used as primary key, cannot change. The need to change it might not come up often, but it does. You may also still be prone to _Issue #3_.
The idea is not new; in fact, there is a very popular set of libraries called _HashIds_ which tries to tackle this exact issue. The basic principle is this:
Before publishing your IDs, encode or encrypt them in a way that makes the underlying value incomprehensible for any client not knowing the secret key.
In my opinion HashIds has, among others, two main issues:
No real security: it is more like a home-brew keyed encoding scheme, and it has no forgery protection, which means an attacker can still easily brute force IDs without understanding them.
Unsatisfied with these properties, I tried to create an improved version called **ID-Mask**, which tackles these and other issues and comes with a reference implementation in Java. The basic features are:
Support of all types usually used for IDs
Strong cryptography with forgery protection
Optional randomized IDs
Note that with this approach (and with HashIds), _there is no possibility of collision_, since no compression happens (as it would with a hash).
If you think about it, there are only a handful of common types used as identifier:
64-bit integers (often called
UUIDs (which are essentially 128-bit numbers)
Arbitrary precision integers (called BigInteger in Java)
If we somewhat restrict the arbitrary precision part to around 128 bits, we can group these into two basic id types: 64-bit and 128-bit IDs. All of those data types (and some more exotic types for specific use cases) are supported by the library.
Instead of encoding the ID with a shuffled alphabet, we can just use proven cryptography: the Advanced Encryption Standard (AES) and a hash-based message authentication code (HMAC), the latter of which protects the IDs from being altered by an attacker. All the caller needs to provide is a strong secret key. Fortunately, since the key id is encoded in the ID, the secret key can be changed if it gets compromised. There is a slight difference in the scheme for 64-bit and 128-bit IDs to optimize for output size. For the interested reader, here is the full explanation of the scheme. There is also a discussion on crypto.stackexchange.com.
In addition to solving the main _Issue #1_, these properties also protect against the attack described in _Issue #2: brute forcing_. With a so-called _authentication tag_ (i.e. the HMAC) attached to the id, it is now extremely unlikely for an attacker to generate a valid ID.
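To make the idea of "encrypt, then authenticate" more concrete, here is a minimal sketch for a 64-bit ID using only the JDK. It is not the ID-Mask wire format: the class name, the block layout (8 ID bytes plus 8 zero bytes), the 8-byte truncated tag, and the use of two separate keys are all assumptions made for illustration, and key versioning is left out entirely.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative only: NOT the scheme used by ID-Mask. */
final class MaskedIdSketch {
    private static final int BLOCK = 16;     // one AES block
    private static final int TAG_LENGTH = 8; // truncated HMAC-SHA256 tag, in bytes

    private final SecretKeySpec encKey;
    private final SecretKeySpec macKey;

    MaskedIdSketch(byte[] encKeyBytes, byte[] macKeyBytes) {
        this.encKey = new SecretKeySpec(encKeyBytes, "AES"); // 16, 24 or 32 bytes
        this.macKey = new SecretKeySpec(macKeyBytes, "HmacSHA256");
    }

    String mask(long id) {
        try {
            // 8 bytes of ID followed by 8 zero bytes fill exactly one AES block
            byte[] block = ByteBuffer.allocate(BLOCK).putLong(id).array();
            byte[] cipherText = aes(Cipher.ENCRYPT_MODE, block);
            byte[] out = ByteBuffer.allocate(BLOCK + TAG_LENGTH)
                    .put(cipherText).put(tag(cipherText)).array();
            return Base64.getUrlEncoder().withoutPadding().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    long unmask(String masked) {
        try {
            byte[] in = Base64.getUrlDecoder().decode(masked);
            if (in.length != BLOCK + TAG_LENGTH) throw new IllegalArgumentException("invalid id");
            byte[] cipherText = Arrays.copyOfRange(in, 0, BLOCK);
            byte[] receivedTag = Arrays.copyOfRange(in, BLOCK, in.length);
            // Verify the tag (constant-time compare) before decrypting anything
            if (!MessageDigest.isEqual(tag(cipherText), receivedTag)) {
                throw new IllegalArgumentException("invalid id");
            }
            return ByteBuffer.wrap(aes(Cipher.DECRYPT_MODE, cipherText)).getLong();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Single-block AES; ECB is only acceptable here because there is exactly one block
    private byte[] aes(int mode, byte[] block) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
        cipher.init(mode, encKey);
        return cipher.doFinal(block);
    }

    private byte[] tag(byte[] cipherText) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(macKey);
        return Arrays.copyOf(mac.doFinal(cipherText), TAG_LENGTH);
    }
}
```

In this sketch the same ID always maps to the same masked value, and any modified value is rejected because its tag no longer matches. A real implementation would typically derive both keys from the single secret key the caller provides.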
With IDs for domain models it makes sense that these are deterministic (i.e. do not change for the same model over time) so that the client can check for equality, for example for caching. However, some use cases benefit from IDs that produce random-looking output, making it hard for an attacker to compare or replicate IDs.
An example would be shareable links. Using the same photo sharing scenario as above, instead of the actual value the id would just look like this:
Using the same id, generating two more masked IDs results in unrelated-looking output:
Using this method, the problem described in _Issue #3: leak of personal information_ can be solved by generating randomized IDs for e.g. your user_id, which cannot be used to find context information on e.g. your main site since they do not match. You are, however, still able to map the ids back to the original users.
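The randomized behaviour can be sketched with a fresh random nonce per masked ID. For brevity this sketch swaps in AES-GCM, a standard authenticated encryption mode from the JDK, instead of the AES plus HMAC construction described earlier; it is not the ID-Mask format, and the class name and the sample id are hypothetical.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative only: randomized masking with AES-GCM, NOT the ID-Mask scheme. */
final class RandomizedIdSketch {
    private static final int NONCE_LENGTH = 12; // bytes, the usual GCM nonce size
    private static final int TAG_BITS = 128;

    private final SecretKeySpec key;
    private final SecureRandom random = new SecureRandom();

    RandomizedIdSketch(byte[] keyBytes) {
        this.key = new SecretKeySpec(keyBytes, "AES"); // 16, 24 or 32 bytes
    }

    String mask(long id) {
        try {
            // A fresh nonce per call makes every output for the same id look unrelated
            byte[] nonce = new byte[NONCE_LENGTH];
            random.nextBytes(nonce);
            Cipher gcm = Cipher.getInstance("AES/GCM/NoPadding");
            gcm.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, nonce));
            // doFinal returns the 8-byte ciphertext followed by the 16-byte tag
            byte[] sealed = gcm.doFinal(ByteBuffer.allocate(Long.BYTES).putLong(id).array());
            byte[] out = ByteBuffer.allocate(nonce.length + sealed.length)
                    .put(nonce).put(sealed).array();
            return Base64.getUrlEncoder().withoutPadding().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    long unmask(String masked) {
        byte[] in = Base64.getUrlDecoder().decode(masked);
        if (in.length != NONCE_LENGTH + Long.BYTES + TAG_BITS / 8) {
            throw new IllegalArgumentException("invalid id");
        }
        try {
            Cipher gcm = Cipher.getInstance("AES/GCM/NoPadding");
            gcm.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, in, 0, NONCE_LENGTH));
            byte[] plain = gcm.doFinal(in, NONCE_LENGTH, in.length - NONCE_LENGTH);
            return ByteBuffer.wrap(plain).getLong();
        } catch (GeneralSecurityException e) {
            // A failed tag check (altered or forged id) ends up here
            throw new IllegalArgumentException("invalid id", e);
        }
    }

    public static void main(String[] args) {
        byte[] keyBytes = new byte[16];
        new SecureRandom().nextBytes(keyBytes);
        RandomizedIdSketch masker = new RandomizedIdSketch(keyBytes);
        long photoId = 10989943L; // hypothetical id from the photo sharing scenario
        System.out.println(masker.mask(photoId));
        System.out.println(masker.mask(photoId)); // differs from the line above
    }
}
```

Both printed values map back to the same id via `unmask`, but a third party holding only the masked values cannot tell that they belong together.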
The reference implementation supports a wide array of encodings which may be chosen to suit various requirements (short IDs vs. _readability_ vs. _should not contain words_), all of them being _url-safe_ of course.
Example using a 64-bit ID:
with optional formatting for better readability:
To avoid the problem of **randomly occurring (English) words** in the masked IDs, which could create embarrassing URLs, a Base32 dialect was added with a custom alphabet containing **no vowels** and omitting **other problematic letters** and numbers. For example, these could look like this:
Encoding optimized to not contain words
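How such a word-free encoding might work can be sketched with a plain Base32 bit-packing routine and a custom alphabet. The 32-character alphabet below is a hypothetical one picked for this example (no vowels, no `y`, no `l`, and without the digits `0`, `1`, `3`, `4` that can stand in for vowels); it is not the alphabet used by the reference implementation.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative only: Base32 with a hypothetical alphabet that cannot spell words. */
final class NoVowelBase32Sketch {
    // 6 digits + 19 lowercase consonants + 7 uppercase consonants = 32 symbols
    static final String ALPHABET = "256789" + "bcdfghjkmnpqrstvwxz" + "BCDFGHJ";

    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder((data.length * 8 + 4) / 5);
        int buffer = 0; // bit accumulator; only the lowest 'bits' bits are meaningful
        int bits = 0;
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) { // emit one symbol per 5 bits
                sb.append(ALPHABET.charAt((buffer >>> (bits - 5)) & 31));
                bits -= 5;
            }
        }
        if (bits > 0) { // pad the last group with zero bits
            sb.append(ALPHABET.charAt((buffer << (5 - bits)) & 31));
        }
        return sb.toString();
    }

    static byte[] decode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int buffer = 0;
        int bits = 0;
        for (char c : text.toCharArray()) {
            int value = ALPHABET.indexOf(c);
            if (value < 0) throw new IllegalArgumentException("illegal character: " + c);
            buffer = (buffer << 5) | value;
            bits += 5;
            if (bits >= 8) { // emit one byte per 8 collected bits
                out.write((buffer >>> (bits - 8)) & 0xFF);
                bits -= 8;
            }
        }
        return out.toByteArray(); // leftover padding bits are dropped
    }

    public static void main(String[] args) {
        byte[] raw = {0x12, 0x34, 0x56, 0x78, (byte) 0x9A, (byte) 0xBC, (byte) 0xDE, (byte) 0xF0};
        System.out.println(encode(raw)); // 13 symbols for 8 bytes
    }
}
```

Since a single-case alphabet without vowels has at most 31 letters and digits, this sketch mixes in a few uppercase consonants to reach the 32 symbols Base32 needs; the output is somewhat longer than Base64 in exchange for avoiding accidental words.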
Currently the reference implementation is quite fast, at around 2–8µs per ID encryption. There is a built-in cache to improve performance for frequently recurring IDs. Additionally, there are default implementations for the Jackson JSON serializer and JAX-RS.
Here is a simple example using a 64-bit ID:
For more, see the README of the GitHub project.
There are many reasons why you may want to protect your internal IDs: they may **leak business intelligence**, **allow for brute forcing** to find hidden content, or leak context about **personal information**. A possible solution is to use an additional ID column in your DB using **UUIDs**, which solves many, but not all, of the issues and **may not be feasible** with millions of data sets. Another option is to **encrypt your IDs** to protect their value. There is a set of libraries called **HashIds** which tries to use this approach but currently has some major issues. I implemented a new approach, called **ID-Mask**, with a reference implementation in Java, which **supports a wide variety of data types** usually used as IDs, uses **strong cryptography**, and also supports the generation of **randomized IDs**. The library is ready to use and can be found on GitHub.
This article was published on 4/18/2020 on medium.com.
Patrick Favre-Bulle 2020