I want to create an unique and short id(it could allow the collision) with users input(such as 'today is beautiful!' and etc.)
but I find the MD5 is too long(32 bits) so there are any other methods for me to solve this problem.
You can choose whatever hashing algorithm you like and trim it to the length you need.
E.g.: last 6 digits of the MD5.
Related
I have a table with a varchar column named key, which is supposed to hold a unique, 8-char random string, which is going to be used as an unique identifier by users. This field should be generated and saved on creation of objects, I have a question about how to create it:
Most of recommendations point to UUID field, but it's not applicable for me because it's too long, and if just get a subset of it then there's no guarantee of uniqueness.
Currently I've just implemented a loop in my backend (not DB), which generates a random string and tries to insert it to DB, and retries if the string turns out to be not unique. But I feel that this is just a really bad practice.
What's the best way to do this?
I'm using Postgresql 9.6
UPDATE:
My main concern is to remove the loop that retries to find a random, short string (or number, doesn't matter) that is unique in that table. AFAIK the solution should be a way to generate the string in DB itself. The only thing that I can find for Postgresql is uuid and uuid-ossp that does something like this, but uuid is way too long for my application, and I don't know of any way to have a shorter representation of uuid without compromising it's uniqueness (and I don't think it's possible theoretically).
So, how can I remove the loop and it's back-and-forth to DB?
Encryption is guaranteed unique, it has to be otherwise decryption would not work. Provided you encrypt unique inputs, such as 0, 1, 2, 3, ... then you are guaranteed unique outputs.
You want 8 characters. You have 62 characters to play with: A-Z, a-z, 0-9 so convert your binary output from the encryption to a base 62 number.
You may need to use the cycle walking technique from Format-preserving encryption to handle a few cases.
Can anyone suggest me the best possible options that I can use in pentaho to suit my requirement. The requirement is we need to convert first_name & last_name attributes into hash and load the hash values for these columns into the user table to support the business reports. For the reports the actual values for these columns are not needed, the reporting code only checks for NULL values in first_name & last_name columns, and validates length of these fields.
I tried converting the fields to hash using Add checksum transformation but wasn't sure about which type of checksum to use (CRC 32, ADLER 32, MD5, SHA-1). Any suggestions?
source & target DB is PostgreSql not sure if it's needed.
Thanks in advance.
Hashing and encryption are not the same thing.
It seems you want a one-way hash. What hash you choose depends mainly on how much you care about collisions. If you don't care that multiple names could generate the same hash, a short fast hash like CRC32 is fine. If you do care about collisions then I'd use at least MD5.
According to Amazon Redshift docs, the passwords must be at least 8 chars, and contain at least one uppercase letter, one lowercase letter, and one number.
Is there a way to disable this for a database?
We do not need such stringent requirements.
Also, the docs are unclear, but if I don't specify VALID UNTIL 'something' then it is valid forever, right? (The docs say you can also use VALID UNTIL 'infinity' but don't explain what would happen if you don't include VALID UNTIL at all)
You cannot modify the Redshift password criteria.
If you are referring to ALTER USER ... [VALID UNTIL], the validity date is not a required field. The password will remain valid forever.
By using the md5 function, you can get around the lengh/char requirements:
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_USER.html
In particular, quoting from the above page:
To specify an MD5 password, follow these steps:
Concatenate the password and user name.
For example, for password ez and user user1, the concatenated string is ezuser1.
Convert the concatenated string into a 32-character MD5 hash string. You can use any MD5 utility to create the hash string. The following example uses the Amazon Redshift MD5 Function and the concatenation operator ( || ) to return a 32-character MD5-hash string.
select md5('ez' || 'user1');
md5
153c434b4b77c89e6b94f12c5393af5b
Concatenate 'md5' in front of the MD5 hash string and provide the concatenated string as the md5hash argument.
create user user1 password 'md5153c434b4b77c89e6b94f12c5393af5b';
Log on to the database using the user name and password.
For this example, log on as user1 with password ez.
I'm working with PostgreSQL.
I need to transform "http://www.xyz.com/some_uri/index1.html" in something like "scdfdsffd"(some unique key, based on the URL that is a unique key in the table).
By other words... the URL is a unique key in the table, but I need to generate a small unique key based on the URL.
What can I do with standard PostgreSQL 8.4?
Best Regards,
Several methods:
a) Why not use an auto-incrementing column or sequence generator to generate unique integers per insert? If you have less than 100 million URLs, your identifiers are short and easy to remember. However, if that's not an option (e.g. because you don't want people guessing IDs and attacking the database that way):
b) The built-in MD5() function may help:
INSERT INTO table (pkey, url) VALUES (MD5('http://...'), 'http://...');
MD5() is a hash function and will most likely give you a unique identifier per URL. I say "most likely" because you get a 128-bit hash from MD5, and the likelihood of a hash collision is on the order of 2^-128 (about 10^-55).
If you need smaller identifiers you can chop the result from MD5 down to a smaller number of characters, but you could potentially significantly increase the chance of a hash collision depending on which characters you take.
[Note: timestamp answer redacted since it in no way solves the original problem. -BobG]
So there is this nice picture in the hash maps article on Wikipedia:
Everything clear so far, except for the hash function in the middle.
How can a function generate the right index from any string? Are the indexes integers in reality too? If yes, how can the function output 1 for John Smith, 2 for Lisa Smith, etc.?
That's one of the key problems of hashmaps/dictionaries and so on. You have to choose a good hash function. A very bad but fast hash function could be the length of the keys. You instantly see, that you will get a lot of collisions (different keys, but same hash). Another bad hash function could be the ASCII value of the first character of your key. Lot's of collisions, too.
So you need a function that is a lot better than those two. You could add (xor) all ASCII values of the key characters and mix the length in for instance. In practice you often depend on the values (fields) of the object that you want to hash (same values give same hash => value type). For reference types you can mix in a memory location for instance.
In your example that's just simplified a lot. No real hash function would map these keys to sequential numbers.
Maybe you want to read one of my previous answers to hashmaps
A simple hash function may be as follows:
$hash = $string[0] % HASH_TABLE_SIZE;
This function will return a number between 0 and HASH_TABLE_SIZE - 1, depending on the first letter of the string. This number can be used to go to the correct position in the hash table.
A real hash function will consider all letters in a string, and it will be designed so that there is an even spread among the buckets.
The hash function most often (but not necessarily always) outputs an integer within wanted range (often parameter to the hash function). This integer can be used as an index. Notice that hash function cannot be guaranteed to always produce unique result when given different data to hash. This is called hash collision and hash algorithm must always handle it in some way.
As for your specific question, how a string becomes a number. Any string is composed of characters (J, o, h, n ...) and characters can be interpreted as numbers (in computers). ASCII and UTF standards bind certain values to certain characters, so result is deterministic and always the same on all computers. So the hash function does operation on these characters that processes them as numbers and comes up with another number (output). You could for example simply sum all the values and use modulo operation to range-limit the resulting value.
This would be quite a horrible hashing function because for example "ab" and "ba" would get same result. Design of hash function is difficult and so one should use some ready-made algorithm unless situation dictates some other solution.
There's a really good article on how hash functions (and colision detection/resolution) on MSDN:
Part 2: The Queue, Stack, and Hashtable
You can skip down to the header Compressing Ordinal Indexing with a Hash Function
There are some bits and pieces that are .NET specific (when they talk about which Hash algorithm .NET uses by default) but for the most part it is language agnostic.
All that is required of a hash function is that it returns the same integer given the same key. Technically, a hash function that always returns '1' is not incorrect.