Why can't you just reverse the algorithm like you could reverse a math function? How is it possible to make an algorithm that isn't reversible?
And if you use a rainbow table, why does using a salt make it impossible to crack? If you generate a rainbow table by brute force, it enumerates every possible plaintext value (up to some length), which would end up including the salt for each possible password and each possible salt (since the salt and password just come together as a single piece of text).
MD5 is designed to be cryptographically irreversible. In this case, the most important property is that it is computationally infeasible to find the reverse of a hash, but it is easy to find the hash of any data. For example, let's think about just operating on numbers (a binary file, after all, can be interpreted as just one very long number).
Let's say we have the number "7", and we want to take the hash of it. Perhaps the first thing we try as our hash function is "multiply by two". As we'll see, this is not a very good hash function, but we'll try it, to illustrate a point. In this case, the hash of the number will be "14". That was pretty easy to calculate. But now, if we look at how hard it is to reverse it, we find that it is also just as easy! Given any hash, we can just divide it by two to get the original number! This is not a good hash, because the whole point of a hash is that it is much harder to calculate the inverse than it is to calculate the hash (this is the most important property in at least some contexts).
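In code, the problem is plain to see; a minimal Python sketch of this toy "hash" shows that both directions are equally cheap:

```python
def bad_hash(x):
    return x * 2       # computing the "hash" is trivial...

def bad_unhash(h):
    return h // 2      # ...but so is reversing it, which disqualifies it

assert bad_hash(7) == 14
assert bad_unhash(14) == 7
```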
Now, let's try another hash. For this one, I'm going to have to introduce the idea of clock arithmetic. On a clock, there isn't an infinite supply of numbers; it just goes from 0 to 11 (remember, 0 and 12 are the same on a clock). So if you "add one" to 11, you just get zero. You can extend the ideas of multiplication, addition, and exponentiation to a clock. For example, 8+7=15, but 15 on a clock is really just 3! So on a clock, you would say 8+7=3. Likewise 6*6=36, but on a clock, 36=0, so 6*6=0! For powers, you can do the same thing: 2^4=16, but on a clock, 16 is just 4, so 2^4=4. Now, here's how it ties into hashing. How about we try the hash function f(x)=5^x, but with clock arithmetic? As you'll see, this leads to some interesting results. Let's try taking the hash of 7 as before.
We see that 5^7=78125, but on a clock, that's just 5 (if you do the math, you'll see that we've wrapped around the clock 6510 times). So we get f(7)=5. Now, the question is: if I told you that the hash of my number was 5, would you be able to figure out that my number was 7? Well, it's actually very hard to calculate the reverse of this function in the general case. People much smarter than me have proved that in certain cases, reversing this function is way harder than calculating it forward. (EDIT: Nemo has pointed out that this in fact has not been "proven"; the only guarantee you get is that a lot of smart people have tried for a long time to find an easy way to do so, and none of them have succeeded.) The problem of reversing this operation is called the "Discrete Logarithm Problem". Look it up for more in-depth coverage. This is at least the beginning of a good hash function.
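If you want to play with this yourself, Python's three-argument pow() does clock arithmetic (modular exponentiation) directly; a quick sketch:

```python
def clock_hash(x, base=5, modulus=12):
    # f(x) = base**x on a "clock" with `modulus` hours
    return pow(base, x, modulus)

print(clock_hash(7))   # 5, because 5**7 = 78125 = 6510*12 + 5
```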
With real-world hash functions, the idea is basically the same: you find some function that is hard to reverse. People much smarter than me have engineered MD5 and other hashes so that, as far as anyone has been able to show, they are extremely hard to reverse.
Now, perhaps the thought has already occurred to you: "it would be easy to calculate the inverse! I'd just take the hash of every number until I found the one that matched!" For the case where the numbers are all less than twelve, this would be feasible. But for the analog of a real-world hash function, imagine all the numbers involved are huge. The idea is that it is still relatively easy to calculate the hash of these large numbers, but searching through all possible inputs becomes infeasible much more quickly. What you've stumbled upon is still a very important idea, though: searching through the input space for an input which gives a matching output. Rainbow tables are a more complex variation on the idea, which use precomputed tables of input-output pairs in smart ways in order to make it possible to quickly search through a large number of possible inputs.
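That search strategy is easy to write down; a sketch for the clock example (note that on such a tiny clock several inputs share the same hash, so the search may find a different preimage than the one you started from):

```python
def brute_force_preimage(target, base=5, modulus=12):
    # Try every possible input until one hashes to the target.
    for x in range(modulus):
        if pow(base, x, modulus) == target:
            return x
    return None

# Returns 1, since 5**1 is also 5 on this clock; with a realistic
# modulus (hundreds of digits wide), this loop would never finish.
print(brute_force_preimage(5))
```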
Now let's say that you are using a hash function to store passwords on your computer. The idea is this: the computer just stores the hash of the correct password. When a user tries to log in, you compare the hash of the input password to the hash of the correct password. If they match, you assume the user has the correct password. This is advantageous because if someone steals your computer, they still don't have access to your password, just the hash of it. Because the hash function was designed by smart people to be hard to reverse, they can't easily retrieve your password from it.
An attacker's best bet is a brute-force attack, where they try a bunch of passwords. Just like you might try the numbers less than 12 in the previous problem, an attacker might try every password composed of just numbers and letters and fewer than 7 characters long, or every word which shows up in a dictionary. The important thing here is that he can't try all possible passwords, because there are way too many possible 16-character passwords, for example, to ever test. So the point is that an attacker has to restrict the passwords he tests, otherwise he will never get through even a small percentage of them.
Now, as for a salt, the idea is this: what if two users had the same password? They would have the same hash. If you think about it, the attacker doesn't really have to crack every user's password individually. He simply goes through every possible input password and compares the hash to all the hashes. If it matches one of them, then he has found a new password. What we'd really like to force him to do is calculate a new hash for every user+password combination he wants to check. That's the idea of a salt: you make the hash function slightly different for every user, so he can't reuse a single set of precomputed values for all users. The most straightforward way to do this is to tack some random string onto each user's password before you take the hash, where the random string is different for each user. So, for example, if my password is "shittypassword", my hash might show up as MD5("6n93nshittypassword"), and if your password is "shittypassword", your hash might show up as MD5("fa9elshittypassword"). This little bit "fa9el" is called the "salt", and it's different for every user; for example, my salt is "6n93n". Now, the salt is just stored on your computer as well. When you try to log in with the password X, the computer can just calculate MD5("fa9el"+X) and see if it matches the stored hash.
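Here is a hedged sketch of that whole scheme in Python, using MD5 only because it's this answer's running example (a real system would reach for a dedicated password-hashing function); the helper names are my own:

```python
import hashlib
import secrets

def make_record(password):
    # Pick a fresh random salt for this user; store it beside the hash.
    salt = secrets.token_hex(4)
    digest = hashlib.md5((salt + password).encode()).hexdigest()
    return salt, digest

def check_login(salt, stored_digest, attempt):
    # Recompute MD5(salt + attempt) and compare with what we stored.
    return hashlib.md5((salt + attempt).encode()).hexdigest() == stored_digest

salt, digest = make_record("shittypassword")
assert check_login(salt, digest, "shittypassword")
assert not check_login(salt, digest, "wrongguess")
```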
So the basic mechanics of logging in remain unchanged, but an attacker now faces a more daunting challenge: rather than a list of MD5 hashes, they have a list of MD5 sums and salts. They essentially have two options:
1. They can ignore the fact that the hashes are salted and try to crack the passwords with their lookup table as is. However, the chances that they'll actually crack a password are much reduced. For example, even if "shittypassword" is on their list of inputs to check, most likely "fa9elshittypassword" isn't. To retain even a small fraction of the chance of cracking a password that they had before, they'll need to test orders of magnitude more possible passwords.
2. They can recalculate the hashes on a per-user basis. So rather than calculating MD5(passwordguess), for each user X they calculate MD5(Salt_of_user_X + passwordguess). Not only does this force them to calculate a new hash for each user they want to crack, but most importantly, it prevents them from using precalculated tables (like rainbow tables, for example), because they can't know what Salt_of_user_X is beforehand, so they can't precalculate the hashes to test.
So basically, if they are trying to use precalculated tables, a salt vastly increases the number of possible inputs they have to test in order to crack the password, and even if they aren't using precalculated tables, it still slows them down by a factor of N, where N is the number of users whose passwords you are storing.
Hopefully this answers all your questions.
Think of 2 numbers from 1 to 9999. Add them. Now tell me the final digit.
I can't, from that information, deduce which numbers you originally thought of. That is a very simple example of a one-way hash.
Now, I can think of two numbers which give the same result, and this is where this simple example differs from a 'proper' cryptographic hash like MD5 or SHA1. With those algorithms, it should be computationally difficult to come up with an input which produces a specific hash.
One big reason you can't reverse the hash function is because data is lost.
Consider a simple example function: 'OR'. If you apply that to your input data of 1 and 0, it yields 1. But now, if you know the answer is '1', how do you back out the original data? You can't. It could have been 1,1 or maybe 0,1, or maybe 1,0.
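A three-line demonstration of that information loss: three distinct inputs collapse onto one output, so no function can take you back.

```python
# Three different input pairs, one indistinguishable output.
for a, b in [(1, 1), (0, 1), (1, 0)]:
    print(a, b, "->", a | b)   # all three print "-> 1"
```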
As for salting and rainbow tables: yes, theoretically, you could have a rainbow table which would encompass all possible salts and passwords, but practically, that's just too big. If you tried every possible combination of lowercase letters, uppercase letters, numbers, and twelve punctuation symbols, up to 50 characters long, that's (26+26+10+12)^50 = 74^50 ≈ 2.9 x 10^93 different possibilities. That's more than the number of atoms in the visible universe.
The idea behind rainbow tables is to calculate the hashes for a bunch of possible passwords in advance, and passwords are much shorter than 50 characters, so it's possible to do so. That's why you want to add a salt in front: if you tack '57sjflk43380h4ljs9flj4ay' onto the front of the password, then while someone may have already computed the hash for "pa55w0rd", no one will have already calculated the hash for '57sjflk43380h4ljs9flj4aypa55w0rd'.
I don't think the MD5 output retains the whole of the input - so you can't work backwards to find the original thing that was MD5-ed.
MD5 is 128-bit; that's 3.4*10^38 possible values.
The total number of eight-character passwords:
only lowercase characters and numbers: 36^8 = 2.8*10^12
lower- and uppercase and numbers: 62^8 = 2.18*10^14
You have to store 8 bytes for the password and 16 for the MD5 value, so that's 24 bytes total per entry.
So you need approximately 67,000 GB or 5,200,000 GB of storage for your table (and a true rainbow table trades some of that storage for extra computation by chaining hashes).
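All of those figures can be reproduced with a few lines of integer arithmetic:

```python
print(f"{2 ** 128:.2e} possible MD5 values")   # 3.40e+38

bytes_per_entry = 8 + 16                       # password + MD5 digest
for space in (36 ** 8, 62 ** 8):               # lowercase+digits, mixed case+digits
    print(f"{space * bytes_per_entry / 1e9:,.0f} GB")
# ~67,707 GB and ~5,240,163 GB, matching the estimates above
```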
The only reason that it's actually possible to figure out passwords is because people use obvious ones.
I want to crack the preimage of a SHA-256 hash, it is an exercise and my only hint is: Concatenation of four visible words.
I have tried to google the hash and have put it into several online crackers / rainbow tables already. I think brute force is not an option, because four words should be too long in total, even if each word on its own is short.
So the only thing left would be a dictionary attack, right? But four-word dictionaries should be too large to search through; I have tried to generate some via hashcat-utils/combinator.bin and got a RAM overflow at about 50 GB even for a short input dictionary. For the most popular English nouns (top 100) I have created small four-word dictionaries, with no success either.
Any ideas how to approach this further?
PS: visiblevisiblevisiblevisible is unfortunately not the answer - I tried these puns as well. :-)
You should try shorter wordlists like google-10000-english.txt or wordlists relating to hints from the exercise.
Have a look at this article too: https://www.netmux.com/blog/cracking-12-character-above-passwords
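One way to dodge the combinator RAM blow-up is to never materialize the combined dictionary at all and instead stream four-word candidates straight through hashlib. A sketch, assuming google-10000-english.txt sits in the working directory; the TARGET value is a placeholder for your exercise's hex digest:

```python
import hashlib
from itertools import product

TARGET = "replace-with-the-target-sha256-hex-digest"  # placeholder

with open("google-10000-english.txt") as f:
    words = [w.strip() for w in f if w.strip()]

# Candidates are generated lazily, so memory use stays flat.
for combo in product(words, repeat=4):
    candidate = "".join(combo)
    if hashlib.sha256(candidate.encode()).hexdigest() == TARGET:
        print("found:", candidate)
        break
```

Be warned that 10,000 words still means 10^16 four-word combinations, so trimming the list with hints from the exercise matters far more than the tooling.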
I wanted to know if it's possible to derive a method to generate a cipher or hash if I have a large data sample of the ciphered text and its corresponding ASCII text.
An example of the ciphered text is: 01jvaWf0SJRuEL2HM5xHVEV6C8pXHQpLGGg2gnnkdZU=
That would translate to: 12540991
The ASCII text contains only numbers.
I would think it is possible, since we're dealing with only numbers and I do have a sample of the ciphers and their ASCII translations.
But I am not sure where to start looking, or maybe I am wrong and such a thing is not possible.
What do you guys think ?
If you are trying to derive the original algorithm that generated a given set of values and hashes, you could try mainstream algorithms and see if you get any hits; if not, it may be impossible, or simply take too much time, to find. The most common homegrown schemes tend to be a combination of a global salt + a unique random salt + multiple iterations of a common hashing function such as SHA-256.
If you are trying to invert a mainstream hashing function, that would be impossible; they're one-way functions, and you can't find the original text given the hash value. If you still want the original text, you would need to iterate over all the possible values to determine which one generated that hash. Since the inputs here are only numbers, that isn't so bad: just build up a lookup table using whichever algorithm was used, with the hash as the key and the text that generated it as the value; once done, simply look up the hash to find the original text. This is a precomputed lookup-table attack.
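A sketch of that table for purely numeric inputs. The encoding here is a guess: the 44-character, '='-padded sample in the question is the right shape for a base64-encoded 32-byte digest, so SHA-256 is assumed; if the real scheme is salted, iterated, or an actual cipher, this will never match.

```python
import base64
import hashlib

TARGET = "01jvaWf0SJRuEL2HM5xHVEV6C8pXHQpLGGg2gnnkdZU="  # sample from the question

def digest_b64(text):
    # Guess: base64(SHA-256(text)); swap in whichever algorithm fits your data.
    return base64.b64encode(hashlib.sha256(text.encode()).digest()).decode()

# For a one-off lookup just scan; 8-digit numbers are only 10**8 candidates.
# For repeated lookups, store digest_b64(str(n)) -> n in a dict instead.
for n in range(10 ** 8):
    if digest_b64(str(n)) == TARGET:
        print("plaintext:", n)
        break
```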
What you're describing is what's called a known-plaintext attack. This is a form of cryptanalysis, so it is certainly possible, although good one-way hash algorithms are designed to be resistant to it.
While it's possible, it is unlikely to be practical against well-known hashing algorithms unless you are an expert in cryptography and an experienced code-breaker--and even then, it's not what one might call a short-term project.
A homegrown algorithm or simple encoding scheme is another matter, of course. If your question is "Is it possible?", then the answer is "Yes."
Given a word I want to get the list of most frequent predecessors and successors of the word in English language.
I have developed code that does bigram analysis on any corpus (I used the Enron email corpus) and can predict the most frequent next possible word, but I want some other solution because
a) I want to check the working / accuracy of my prediction
b) Corpus- or dataset-based solutions fail for an unseen word
For example, given the word "excellent" I want to get the words that are most likely to come before excellent and after excellent
My question is whether any particular service or api exists for the purpose?
Any solution to this problem is bound to be a corpus-based method; you just need a bigger corpus. I'm not aware of any web service or library that does this for you, but there are ways to obtain bigger corpora:
Google has published a huge corpus of n-grams collected from the English part of the web. It's available via the Linguistic Data Consortium (LDC), but I believe you must be an LDC member to obtain it. (Many universities are.)
If you're not an LDC member, try downloading a Wikipedia database dump (get enwiki) and training your predictor on that.
If you happen to be using Python, check out the nice set of corpora (and tools) delivered with NLTK; a small sketch follows at the end of this answer.
As for the unseen words problem, there are ways to tackle it, e.g. by replacing all words that occur less often than some threshold by a special token like <unseen> prior to training. That will make your evaluation a bit harder.
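If you do go the NLTK route, here's a minimal sketch with the Brown corpus (any corpus slots in the same way) that counts, for each word, what most often precedes and follows it:

```python
from collections import Counter, defaultdict
import nltk

nltk.download("brown", quiet=True)
words = [w.lower() for w in nltk.corpus.brown.words()]

before, after = defaultdict(Counter), defaultdict(Counter)
for prev, nxt in nltk.bigrams(words):
    after[prev][nxt] += 1
    before[nxt][prev] += 1

print(before["excellent"].most_common(5))   # most frequent predecessors
print(after["excellent"].most_common(5))    # most frequent successors
```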
You have got to give some more instances or context of the "unseen" word so that the algorithm can make some inference.
One indirect way could be to read the rest of the words in the sentence and look in a dictionary for the words alongside which they are encountered.
In general, you can't expect the algorithm to learn and understand the inference the first time. Think about yourself: if you were given a new word, how well could you make out its meaning? Probably by looking at how it has been used in the sentence and how good your understanding is; you make an educated guess, and over a period of time you come to understand the meaning.
I just re-read the original question and I realize the answers, mine included, got off base. I think the original poster just wanted to solve a simple programming problem, not look for datasets.
If you list all distinct word-pairs and count them, then you can answer your question with simple math on that list.
Of course you have to do a lot of processing to generate the list. While it's true that if the total number of distinct words is as high as 30,000 then there are a billion possible pairs, I doubt that in practice there are that many. So you can probably make a program with a huge hash table in memory (or on disk) and just count them all. If you don't need the insignificant pairs, you could write a program that periodically flushes out the less important ones while scanning. You can also segment the word list and generate pairs of a hundred words versus the rest, then the next hundred, and so on, calculating in passes.
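A sketch of that counting approach, assuming a plain-text corpus file named corpus.txt:

```python
from collections import Counter

pairs = Counter()
with open("corpus.txt") as f:          # hypothetical corpus file
    for line in f:
        tokens = line.lower().split()
        pairs.update(zip(tokens, tokens[1:]))

# Simple math on the list answers the original question:
word = "excellent"
print([(p, c) for p, c in pairs.most_common() if p[0] == word][:5])  # successors
print([(p, c) for p, c in pairs.most_common() if p[1] == word][:5])  # predecessors
```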
My original answer follows; I'm leaving it because it's my own related question:
I'm interested in something similar (I'm writing an entry system that suggests word completions and punctuation, and I would like it to be multilingual).
I found a download page for Google's ngram files, but they're not that good; they're full of scanning errors: 'i's become '1's, words run together, etc. Hopefully Google has improved their scanning technology since then.
The just-download-wikipedia-unpack-it-and-strip-the-xml idea is a bust for me; I don't have a fast computer (heh, I have a choice between an Atom netbook here and an Android device). Imagine how long it would take me to unpack a 3-gigabyte bz2 file into what, 100 gigabytes of XML? Then process it with Beautiful Soup and filters that the author admits crash partway through each file and need to be restarted.
For your purpose (previous and following words) you could create a dictionary of real words and filter the ngram lists to exclude the mis-scanned words. One might hope that the scanning was good enough that you could exclude mis-scans by only taking the most popular words... but I saw some signs of constant mistakes.
The ngram datasets are here by the way http://books.google.com/ngrams/datasets
This site may have what you want http://www.wordfrequency.info/
I'm working on a project designing a Core Data system for searching and cataloguing images and documents. One of the objects in my data model is a 'key word' object. Every time I add a new key word, I first want to run through all of the existing keywords to make sure it doesn't already exist in the current context.
I've read in posts here and in a lot of my reading elsewhere that string comparison is a far more expensive operation than some other comparison operations. Since I could easily end up having to check many thousands of words before a new addition, I'm wondering if it would be worth using some method that represents the key word strings numerically for the purpose of this check. Possibly breaking down each character in the string into a number formed from the UTF code for each character and then storing that in an ID property for each key word.
I was wondering if anyone else thought any benefit might come from this approach or if anyone else had any better ideas.
What you might find useful is a suitable hash function to convert your text strings into (probably) unique numbers. (You might still have to check for collisions.)
Comparing intrinsic numbers in C code is much faster for several reasons. It avoids the Objective-C runtime dispatch overhead. It requires accessing less total memory. And the executable code for each comparison is usually just an instruction or three, rather than a loop with incrementers and several decision points.
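Language differences aside, the pattern is compare-the-number-first and fall back to the string only on a hash collision; a Python sketch with made-up keywords:

```python
# Precompute one integer key per keyword; compare integers thereafter.
buckets = {}                              # hash value -> keywords sharing it
for kw in ["portrait", "landscape", "invoice"]:
    buckets.setdefault(hash(kw), []).append(kw)

def already_exists(candidate):
    bucket = buckets.get(hash(candidate))
    # Integer miss: definitely new. Integer hit: confirm with a string compare.
    return bucket is not None and candidate in bucket

print(already_exists("invoice"))   # True
print(already_exists("receipt"))   # False
```

Note that hash-table containers (NSSet/NSDictionary in your case, or Python's set here) already do exactly this internally, so a plain membership test often gets you the same win for free.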