I read:
Hashing is the transformation of arbitrary-size input into a fixed-size value. We use hashing algorithms to perform hashing operations, i.e. to generate the hash value of an input.
Vector embeddings do much the same thing: they convert an input into a vector of fixed dimension. I am trying to understand the difference between the two.
A hash encoding can use any function that maps a string to a pseudo-random, effectively unique number. When creating vector embeddings, by contrast, we try to use domain knowledge and the context in which the string occurs in the corpus, so that related inputs end up with similar vectors.
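To make the contrast concrete, here is a minimal sketch. The `toy_embedding` function is a hypothetical stand-in (a 26-dimensional letter-count vector), not a learned embedding; it is only meant to show that an embedding-like mapping keeps similar strings close, while a hash such as SHA-256 scatters them.

```python
import hashlib
import math


def sha256_as_int(text):
    """Fixed-size (256-bit) hash of an arbitrary-length string."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def bit_difference(a, b):
    """Number of bits that differ between the SHA-256 hashes of a and b."""
    return bin(sha256_as_int(a) ^ sha256_as_int(b)).count("1")


def toy_embedding(text):
    """Hypothetical toy embedding: counts of the letters a-z (fixed dimension 26)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    a, b = "hashing", "hashings"
    print("differing hash bits (out of 256):", bit_difference(a, b))
    print("toy embedding cosine similarity:", cosine(toy_embedding(a), toy_embedding(b)))
```

Two nearly identical strings should differ in roughly half of their hash bits, while their toy vectors stay close to each other. A real embedding model learns this closeness from context rather than from letter counts.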
Related
I am about to create a vector of size n containing zeros and ones. I want to encrypt all the elements of the vector, but I am wondering whether the encrypted elements reveal information about which are zeros and which are ones. Is there a specific cryptosystem under which encrypted zeros and ones are indistinguishable in their ciphertext form?
I think I found the answer. If we encrypt the elements of the vector with a randomized (probabilistic) encryption scheme, such as RSA with randomized padding, the same plaintext produces a different ciphertext each time. So we can encrypt each element of the vector this way, and the ciphertexts of zeros and ones cannot be told apart by comparing them.
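As an illustration of randomized encryption of single bits, here is a minimal sketch using the Goldwasser-Micali scheme rather than RSA with padding, because it fits in a few lines of standard-library Python. The primes `P` and `Q` are toy values chosen for readability and are far too small to be secure; treat this as a demonstration of the idea only.

```python
import math
import secrets

# Toy parameters (assumption: illustration only, NOT secure).
# Both primes are congruent to 3 mod 4, so -1 is a quadratic
# non-residue modulo each of them.
P, Q = 499, 547
N = P * Q          # public modulus
X = N - 1          # public non-residue (-1 mod N)


def encrypt_bit(bit):
    """Encrypt 0 or 1 as y^2 * X^bit mod N with fresh randomness y."""
    while True:
        y = secrets.randbelow(N - 2) + 2
        if math.gcd(y, N) == 1:
            break
    return (y * y * pow(X, bit, N)) % N


def decrypt_bit(ciphertext):
    """Use the secret prime P: a quadratic residue mod P means the bit was 0."""
    return 0 if pow(ciphertext, (P - 1) // 2, P) == 1 else 1


if __name__ == "__main__":
    vector = [0, 1, 1, 0, 1, 0, 0, 1]
    encrypted = [encrypt_bit(b) for b in vector]
    print(encrypted)
    print([decrypt_bit(c) for c in encrypted])
```

Encrypting the same bit twice almost always yields different ciphertexts, which is the property the answer relies on. In practice you would use a vetted library and realistic key sizes.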
I assume that MATLAB vectors/matrices carry some metadata about their dimensions, size and length, so length(a) should be a very fast operation if a is a vector. Since the MATLAB documentation does not discuss complexity in general, is there any way to confirm this?
You are correct. "Under the hood" MATLAB stores and maintains a size for all array types, and the length operator merely retrieves this value. It isn't quite a simple variable reference, because length has to look at all size dimensions and pick the largest, so it is O(n) in the number of dimensions.
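The behaviour described above can be sketched in Python. MATLAB documents `length(a)` as `max(size(a))` for nonempty arrays and 0 for empty ones; the function name `matlab_style_length` below is hypothetical, and the tuple stands in for the stored size metadata.

```python
def matlab_style_length(size):
    """Mimic MATLAB's length() given the stored size tuple of an array.

    The cost depends only on the number of dimensions, not on the
    number of elements in the array.
    """
    if not size or 0 in size:
        return 0          # empty arrays have length 0
    return max(size)      # otherwise the largest dimension


if __name__ == "__main__":
    print(matlab_style_length((1, 5)))      # row vector with 5 elements
    print(matlab_style_length((3, 4, 2)))   # 3-by-4-by-2 array
    print(matlab_style_length((0, 3)))      # empty array
```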
I want to train an SVM classifier in MATLAB for threat detection. The training data is in an Excel file and contains both numeric and text fields/columns. When I import this data into MATLAB, it ends up in either table or cell format. How do I convert it to matrix format?
P.S.: The xlsread function does not import the text data.
There are 4 types of attributes in data: numerical, discrete, nominal and ordinal. You can read more about them here. First, run a statistical analysis of each feature in your dataset to get the basic statistics, such as mean, median, max, min and variable type, and, for nominal or ordinal features, the distinct values. That gives you a pretty good idea of what you are dealing with. Then, according to the variable type, you can decide which vectorization to use. If it is a numerical variable, you can bin it into classes and apply feature scaling. If it is an ordinal variable, you can assign numbers that follow its logical order. If it is a nominal variable, you can give each distinct value its own numeric label. At this stage you are just checking how much each feature contributes to the final prediction.
My advice: use the Weka GUI tool to visualize the data. Then you can preprocess the data column by column.
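The per-feature analysis described above might look like the following sketch. The rows, column names and the `SEVERITY_ORDER` mapping are all hypothetical sample data, not taken from the asker's file.

```python
import statistics
from collections import Counter

# Hypothetical sample rows standing in for the imported spreadsheet.
rows = [
    {"bytes_sent": 120.0, "severity": "low", "protocol": "tcp"},
    {"bytes_sent": 4300.0, "severity": "high", "protocol": "udp"},
    {"bytes_sent": 870.0, "severity": "medium", "protocol": "tcp"},
    {"bytes_sent": 95.0, "severity": "low", "protocol": "icmp"},
]

# Ordinal feature: numbers follow the logical order of the labels.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2}


def summarize(rows, column):
    """Return basic statistics for a numeric column, or value counts otherwise."""
    values = [row[column] for row in rows]
    if all(isinstance(v, (int, float)) for v in values):
        return {
            "type": "numeric",
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
        }
    return {"type": "categorical", "counts": dict(Counter(values))}


if __name__ == "__main__":
    for column in ("bytes_sent", "severity", "protocol"):
        print(column, summarize(rows, column))
    print([SEVERITY_ORDER[row["severity"]] for row in rows])
```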
You need to transform your text fields into numeric ones using dummy variables or another technique, or drop them entirely if they are actually IDs (e.g. patient name for medical data, record number, respondent UUID for a survey, etc.).
This would actually be easier in R or Python+Pandas, but in MATLAB you will need to perform the encoding yourself, working from the cell array towards a matrix. Or you can try this toolbox.
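A minimal sketch of dummy-variable (one-hot) encoding, written in plain Python so the steps are visible. The records and column names are hypothetical; the `record_id` column is dropped simply by not listing it.

```python
# Hypothetical mixed-type records; "record_id" is an identifier to drop.
records = [
    {"record_id": "r1", "bytes_sent": 120, "protocol": "tcp"},
    {"record_id": "r2", "bytes_sent": 4300, "protocol": "udp"},
    {"record_id": "r3", "bytes_sent": 870, "protocol": "tcp"},
]


def one_hot_matrix(records, numeric_cols, nominal_cols):
    """Build a purely numeric matrix: numeric columns first, then dummies.

    Columns not named in either list (for example IDs) are left out.
    """
    categories = {c: sorted({r[c] for r in records}) for c in nominal_cols}
    header = list(numeric_cols) + [
        f"{c}={v}" for c in nominal_cols for v in categories[c]
    ]
    matrix = []
    for r in records:
        line = [float(r[c]) for c in numeric_cols]
        for c in nominal_cols:
            # One 0/1 indicator per distinct value of the nominal column.
            line.extend(1.0 if r[c] == v else 0.0 for v in categories[c])
        matrix.append(line)
    return header, matrix


if __name__ == "__main__":
    header, matrix = one_hot_matrix(records, ["bytes_sent"], ["protocol"])
    print(header)
    for line in matrix:
        print(line)
```

The same idea carries over to a MATLAB cell array: collect the distinct values of each text column, then emit one 0/1 column per value.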
I need to create a 16 bit hash from a 32 bit number, and I'm trying to determine if a simple modulus 2^16 is appropriate.
The hash will be used in a 2^16 entry hash table for fast lookup of the 32 bit number.
My understanding is that if the data space has a fairly even distribution, a simple mod 2^16 is fine: it shouldn't result in too many collisions.
In this case, my 32 bit number is the result of a modified adler32 checksum, using 2^16 as M.
So, in a general sense, is my understanding correct, that it's fine to use a simple mod n (where n is hashtable size) as a hashing function if I have an even data distribution?
And specifically, will adler32 give a random enough distribution for this?
Yes, if your 32-bit numbers are uniformly distributed over all possible values, then a modulo n of those will also be uniformly distributed over the n possible values.
Whether the results of your modified checksum algorithm are uniformly distributed is an entirely different question. That will depend on whether the input you are applying the algorithm to is long enough to roll over the sums several times. If you are applying the algorithm to short strings that don't roll over the sums, then the result will not be uniformly distributed.
If you want a hash function, then you should use a hash function. Neither Adler-32 nor any CRC is a good hash function. There are many very fast and effective hash functions available in the public domain. You can look at CityHash.
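The point about short strings can be checked with a small experiment. This sketch uses the standard `zlib.adler32` (modulus 65521), not the asker's modified variant, and uses BLAKE2b from `hashlib` purely as a convenient standard-library stand-in for a well-mixing hash; it is not the CityHash suggested above.

```python
import hashlib
import zlib
from itertools import product
from string import ascii_lowercase

TABLE_SIZE = 1 << 16  # 2^16-entry hash table


def adler_bucket(data):
    """Bucket index from standard Adler-32 reduced mod 2^16."""
    return zlib.adler32(data) % TABLE_SIZE


def blake2_bucket(data):
    """Bucket index from a 16-bit BLAKE2b digest (stand-in for a real hash)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=2).digest(), "big")


# All 3-letter lowercase strings: short keys that never roll over the sums.
keys = ["".join(t).encode("ascii") for t in product(ascii_lowercase, repeat=3)]
adler_buckets = {adler_bucket(k) for k in keys}
blake2_buckets = {blake2_bucket(k) for k in keys}

if __name__ == "__main__":
    print("keys:", len(keys))
    print("distinct buckets via adler32 mod 2^16:", len(adler_buckets))
    print("distinct buckets via blake2b (16 bits):", len(blake2_buckets))
```

For these keys the low 16 bits of Adler-32 are just 1 plus the sum of the bytes, so thousands of keys pile into a few dozen buckets, while the mixing hash spreads them across the table.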
I would like to ask whether there is a weight distribution equation for hash functions.
For example, in channel coding theory there is a weight enumerator equation for Reed-Solomon codes, which gives the number of codewords of weight i.
Thanks
If you mean a cryptographic hash function, then certainly not. Ideally, a cryptographic hash function can output any value, so every word of the given length is a possible output.
Reed-Solomon codes are linear codes, so the minimum weight of a nonzero codeword equals the minimum distance of the code. That structure is in no way similar to a hash function.
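To illustrate the difference, here is a small sketch that enumerates the weight distribution of a linear code and compares it with the Hamming weights of hash digests. It uses the binary [7,4] Hamming code instead of a Reed-Solomon code, only because it is small enough to enumerate by hand; the generator matrix `G` is one standard systematic choice.

```python
import hashlib
from collections import Counter
from itertools import product

# Systematic generator matrix of the binary [7,4] Hamming code.
G = [
    (1, 0, 0, 0, 1, 1, 0),
    (0, 1, 0, 0, 1, 0, 1),
    (0, 0, 1, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1, 1),
]


def encode(message):
    """Multiply a 4-bit message by G over GF(2)."""
    return tuple(
        sum(m * g for m, g in zip(message, column)) % 2 for column in zip(*G)
    )


def code_weight_distribution():
    """Count codewords by Hamming weight over all 16 messages."""
    return Counter(sum(encode(m)) for m in product((0, 1), repeat=4))


def digest_weight(data):
    """Hamming weight of the 256-bit SHA-256 digest of data."""
    return bin(int.from_bytes(hashlib.sha256(data).digest(), "big")).count("1")


if __name__ == "__main__":
    print(sorted(code_weight_distribution().items()))
    weights = [digest_weight(str(i).encode()) for i in range(1000)]
    print("mean SHA-256 digest weight:", sum(weights) / len(weights))
```

The code has only weights 0, 3, 4 and 7, a structure its weight enumerator captures exactly. The digest weights, by contrast, simply cluster around half the digest length, as they would for random bit strings.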