Hash functions and polynomial division [closed] - hash

I understand that a CRC verifies data integrity by producing a checksum, which is the result of polynomial long division. I've heard hash values referred to as hash checksums, so my question is whether hash functions use some sort of polynomial division as well. I know they break the data up into blocks (the way block ciphers do), so my guess would be that hash functions create some relationship between the polynomial check value and how the data is divided into the different blocks. Can someone let me know if I'm way off base here?

A CRC is a hash function, but there are many other ways to implement a hash function. The other ways generally do not use polynomial division, though there are some that use a CRC as a part of the hash calculation, in order to make use of hardware CRC instructions. Most hash functions use a long, convoluted series of ands, nots, exclusive-ors, integer additions, multiplications, and modulos.
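For a feel of what that looks like in practice, here is a minimal sketch (in Python, using the well-known constants of the FNV-1a scheme) of a hash built purely from XORs, multiplications, and masking, with no polynomial division anywhere:

    def fnv1a_32(data: bytes) -> int:
        # FNV-1a: a simple, non-cryptographic hash; the state is mixed with XOR and multiply.
        h = 2166136261                        # 32-bit FNV offset basis
        for byte in data:
            h ^= byte                         # fold the next input byte into the state
            h = (h * 16777619) & 0xFFFFFFFF   # multiply by the FNV prime, keep 32 bits
        return h

    print(hex(fnv1a_32(b"hello world")))      # a 32-bit digest of the input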

Related

How do you do stratified sampling across different groups, when creating train and test sets, in pyspark? [closed]

I am looking for a solution to split my data to Test and Train sets but I want to have all the levels of my categorical variable in both test and train.
My variable has 200 levels and the data is 18 million records. I tried the sampleBy function with fractions of 0.8 and could get the training set, but I had difficulties getting the test set: there is no row index in Spark, and even after creating a key, using a left join or subtract to derive the test set is very slow!
I want to do a groupBy on my categorical variable and randomly sample each category, and if there is only one observation for a category, put it in the train set.
Is there a default function or library to help with this operation?
A pretty hard problem.
I don't know of a built-in function which will help you get this. Using sampleBy and then a subtraction would work, but as you said, it would be pretty slow.
Alternatively, I wonder if you can try this*:
1. Use window functions: add a row number within each category and move every row with rownum = 1 into a separate dataframe, which you will add back into your training set at the end.
2. On the remaining data, use randomSplit (a dataframe function) to divide it into training and test.
3. Add the separated data from Step 1 to training.
This should work faster; a rough pyspark sketch of the idea is below.
*(I haven't tried it before! Would be great if you can share what worked in the end!)
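A rough, untested pyspark sketch of those steps (df and the column name category are placeholders for your actual DataFrame and categorical variable):

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Number the rows randomly within each category.
    w = Window.partitionBy("category").orderBy(F.rand(seed=42))
    df_num = df.withColumn("rn", F.row_number().over(w))

    # Step 1: set aside one row per category; it always ends up in training,
    # so a category with a single observation can never land only in the test set.
    guaranteed = df_num.filter(F.col("rn") == 1).drop("rn")
    rest = df_num.filter(F.col("rn") > 1).drop("rn")

    # Step 2: randomly split whatever is left.
    train_rest, test = rest.randomSplit([0.8, 0.2], seed=42)

    # Step 3: add the per-category rows from Step 1 back into training.
    train = train_rest.union(guaranteed)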

What prevents me from reversing a hash function? [closed]

What actually prevents me from reversing a hash function and generating a possible input from a hash that will have the same hash?
I understand that hash functions are one-way functions, which means I cannot recover the real input from its hash.
I googled it a lot, and I found a lot of people explaining it with this simple example hash function:
hash(x) = x % 7
I can't recover the input (x) from the hash here, but if I know the hash, I can generate a possible input from it that will have the same hash:
unhash(h) = some_random_integer * 7 + h
The value of some_random_integer does not matter at all. For example, unhash(3) could be 24, and hash(24) is 3!
One more example that I found is:
hash(x, y) = x * y
So like the previous example, I cannot find the real input (x and y) from the hash but I can find a possible input that will have the same hash:
x = hash / some_random_integer
y = hash / x
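For concreteness, both toy "inversions" can be written out as a few lines of Python (the helper names are made up for this illustration):

    def hash_mod(x):
        return x % 7

    def unhash_mod(h, some_random_integer=3):
        # Any integer works; the result is just one of infinitely many preimages.
        return some_random_integer * 7 + h

    assert hash_mod(unhash_mod(3)) == 3       # unhash_mod(3) == 24, and 24 % 7 == 3

    def hash_mul(x, y):
        return x * y

    def unhash_mul(h, some_divisor=2):
        # Works whenever some_divisor evenly divides h.
        x = h // some_divisor
        y = h // x
        return x, y

    x, y = unhash_mul(hash_mul(6, 4))         # e.g. (12, 2), which also multiplies to 24
    assert x * y == hash_mul(6, 4)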
When, for example, a malicious hacker gains access to a database full of hashed passwords, he would be able to log in to a hacked account just by generating a possible input that produces the same hash as the password! It does not have to be the exact original password.
I know that real hash functions are a lot more complicated than these examples, but I cannot think of a math operation that cannot be reversed this way (or maybe there are some?).
What actually prevents me from reversing real hash functions this way? (like MD5, SHA1, etc...)
By hash function, the assumption is that you are referring to cryptographic hash functions such as the SHA family.
The design of a cryptographic hash function keeps you from reversing it; that is a basic criterion of the design.
There are other types of hash functions, such as dictionary hash functions, that may be quite simple, but even these usually lose portions of the input. hash(x) = x % 7 is an example of such a simple hash function.
In the case of password hashing, brute forcing must be taken into account, that is, trying passwords from lists of frequently used passwords and fuzzing. The usual solution is to use a hash function that consumes substantial CPU time, often by iterating a hash function for about 100 ms. PBKDF2 is such a function and is recommended by NIST for password hashing.
Additionally, the input may be larger than the output, and some information is intentionally lost.
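As an illustration of the point about deliberately slow password hashing, Python's standard library exposes PBKDF2 directly; the salt and iteration count below are just example choices:

    import hashlib
    import os

    password = b"correct horse battery staple"
    salt = os.urandom(16)            # a fresh random salt per password
    iterations = 600_000             # chosen so one hash takes a noticeable fraction of a second

    # PBKDF2-HMAC-SHA256: iterating the hash makes each brute-force guess expensive.
    digest = hashlib.pbkdf2_hmac("sha256", password, salt, iterations)
    print(digest.hex())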

Understanding term deterministic and non random [closed]

I am confused about a situation which is presented on the following slide:
The last sentence says that:
It is important to note that deterministic does not mean that x_t is non-random.
What does this mean? If A and B are random variables, then x_t must be random, right?
I think the point may be that nature may choose randomly among different paths, but once you know which path has been chosen you can predict future values of x_t on the path from past values x_{t-1}, etc. So e.g. nature may flip a coin to choose between the following two paths: x_t=0 for all t, and x_t=1 for all t. Then if you don't know the path, x_t is indeed random. But once you know x_{t-1}, you know x_t.
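A tiny simulation of that coin-flip picture (purely illustrative, not taken from the slide):

    import random

    def draw_path(length=5):
        # Nature randomises exactly once, choosing which deterministic path we are on...
        level = random.choice([0, 1])
        # ...but along the chosen path every x_t is perfectly predictable from the past.
        return [level] * length

    print(draw_path())   # either [0, 0, 0, 0, 0] or [1, 1, 1, 1, 1]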

What's the difference between NOT second preimage resistant and NOT collision resistant [closed]

By definition, Not 2nd-preimage resistant means: there exists at least one x (which is known) such that it is easy to find another x', such that h(x) = h(x').
Meanwhile, Not collision resistant indicates: it is easy to find at least one pair (x, x') such that h(x) = h(x').
I don't see any difference here, anyone can tell? Or do I give the wrong definitions?
And, it is said that "Not collision resistant not necessarily means Not 2nd-preimage resistant", why is that?
Putting this into another answer because it's just too much to type for a comment.
The definition of 2nd-preimage-resistant is that you have h(x) and x, and can't create a different x' with the same hash.
The definition of preimage-resistant (without second!) is that you have only h(x), and can't find any x that hashes to it.
And the definition of collision resistant is that you have nothing, and may choose any x and x' yourself, as long as h(x) = h(x').
If you use the hash to sign a plaintext message, you need 2nd-preimage resistance, but not collision resistance. It doesn't matter to you if someone can find two colliding messages that produce a hash that is different from yours, but you want to make sure no one is able to craft a different message that has the same hash as yours, even if they know your plaintext.
If you use the hash to store hashed passwords, you don't care about collision resistance, and you don't care about 2nd-preimage resistance; preimage resistance is all you need. If an attacker knows one password, you don't really care whether he can use that password to find a different one.
So these were two examples where collision resistance is not required, but preimage-resistance or 2nd-preimage-resistance is.
As to "Not collision resistant not necessarily means Not 2nd-preimage resistant", why is that? , consider the hash function if x has less then 24 bits, then h(x)=0, else h(x)=sha256(x). This is very obviously not collision resistant (choose any 2 words that have less than 4 letters), but, as long as your text is longer, this function is preimage-resistant and 2nd-preimage-resistant (assuming sha256 hasn't been broken yet).
2nd-preimage resistant means there's no (easy) way to find a second x (called x') with h(x') = h(x) when you are given x and h(x).
Collision resistant means there's no (easy) way to find any pair (x, x') with h(x) = h(x'), where you are free to choose both x and x'.
So the second property is easier to break, because the attacker gets to pick both inputs. Think about what happened to MD5 a while ago: there's an algorithm that finds pairs of input bytes that produce the same output. But this works only for specially constructed input, not for arbitrary input. So, while it is possible to find messages that collide, the generic case "x is some specific message, find a second message that has the same MD5 as x" has not been solved yet.

Quicksort (JAVA) [closed]

Let's say you have an array of size n with randomly generated elements and you want to use quicksort to sort the array. For large enough n (say 1,000,000), in order to speed up quicksort, it would make sense to stop recursing when the subarray gets small enough and use insertion sort instead. In such an implementation, the base case for quicksort is some value base > 1. What would be the optimal base value to choose, and why?
Think about the time complexity of quicksort (average and worst case) and the time complexity of other sorts that might do better for small n.
Try starting with Wikipedia - it has good starting info about comparing the two algorithms. When you have a more specific question, feel free to come back.
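If it helps to experiment, here is a hybrid sort sketch (in Python rather than Java; the cutoff of 16 is a common starting point to benchmark, not a proven optimum):

    import random

    CUTOFF = 16   # try values in roughly the 8-64 range and measure on your own machine

    def insertion_sort(a, lo, hi):
        # Cheap for small ranges: little overhead and very cache friendly.
        for i in range(lo + 1, hi + 1):
            key = a[i]
            j = i - 1
            while j >= lo and a[j] > key:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = key

    def quicksort(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        while lo < hi:
            if hi - lo + 1 <= CUTOFF:
                insertion_sort(a, lo, hi)   # base case: hand small ranges to insertion sort
                return
            pivot = a[random.randint(lo, hi)]
            i, j = lo, hi
            while i <= j:                   # partition around the pivot value
                while a[i] < pivot:
                    i += 1
                while a[j] > pivot:
                    j -= 1
                if i <= j:
                    a[i], a[j] = a[j], a[i]
                    i += 1
                    j -= 1
            quicksort(a, lo, j)             # recurse into the left part
            lo = i                          # iterate on the right part instead of recursing

    data = [random.randint(0, 10**6) for _ in range(100_000)]
    quicksort(data)
    assert data == sorted(data)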