from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('google_news.bin', binary=True)
print(model['the']) # this prints the 300D vector for the word 'the'
The code loads the google_news binary file into model.
My question is: how does line 3 compute its output from a binary file, given that binary files contain only 0s and 1s?
I'm not sure exactly what the question is here, but I assume you're asking how to load the binary file into your Python app? You can use gensim, for example, which has built-in tools to decode the binary:
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('google_news.bin', binary=True)
print(model['the']) # this prints the 300D vector for the word 'the'
EDIT
I feel your question is more about binary files in general? This does not seem related to word2vec specifically. Anyway, in a word2vec binary file each line is a pair of a word and its weights in binary format. First the word is decoded into a string by reading characters until the byte for "space" is met. Then the rest is decoded from binary into floats. We know the number of floats because word2vec binary files start with a header, such as "3000000 300", which tells us there are 3M words and each word is a 300-D vector.
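To make that concrete, here is a minimal sketch of decoding such a file by hand (assuming the standard word2vec binary layout described above; 'google_news.bin' is the placeholder path from the question):

import numpy as np

# A minimal sketch of manually decoding a word2vec binary file.
with open('google_news.bin', 'rb') as f:
    # The ASCII header, e.g. b"3000000 300\n": vocab size and dimension.
    vocab_size, dim = map(int, f.readline().split())
    for _ in range(vocab_size):
        # Read bytes until the space that terminates the word.
        word = b''
        while True:
            ch = f.read(1)
            if ch == b' ':
                break
            word += ch
        # The original word2vec writer puts a newline after each vector,
        # so every word but the first carries a leading newline; strip it.
        word = word.strip().decode('utf-8')
        # The next dim * 4 bytes are the vector's 32-bit floats.
        vec = np.frombuffer(f.read(dim * 4), dtype=np.float32)
        if word == 'the':
            print(vec)  # the same 300-D vector the gensim call prints
            break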
A binary file is organized as a series of bytes, each 8 bits. Read more about binary on the Wikipedia page.
The decimal number 0.0056 becomes, in binary:
00111011 10110111 10000000 00110100
So here there are 4 bytes making up one float. How do we know? Because we assume the file encodes 32-bit floats.
What if the binary file represents 64-bit precision floats? Then the decimal 0.0056 becomes, in binary:
00111111 01110110 11110000 00000110 10001101 10111000 10111010 11000111
Yes, twice the length because twice the precision. So when we decode the word2vec file, if the weights are 300-D and 64-bit encoded, there should be 8 bytes representing each number, and a word embedding would take 300*64 = 19,200 binary digits in each line of the file. Get it?
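You can verify both bit patterns above yourself; a quick sketch using Python's struct module (the '>' just forces the byte order to match the printouts above):

import struct

# Pack 0.0056 as a 32-bit and a 64-bit IEEE 754 float and print the bits.
f32 = struct.pack('>f', 0.0056)  # 4 bytes, single precision
f64 = struct.pack('>d', 0.0056)  # 8 bytes, double precision
print(' '.join(f'{b:08b}' for b in f32))
print(' '.join(f'{b:08b}' for b in f64))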
You can Google "how do binary digits work"; there are millions of examples.
Related
I want to read a matrix of doubles from a file in MATLAB using the following syntax:
readmtx(fname,nrows,ncols,precision)
All the inputs here are quite familiar to me except precision. The precision of an int is 'int16'. What is the precision of a double?
In this case, the documentation states:
Both binary and formatted data files can be read. If the file is binary, the precision argument is a format string recognized by fread. Repetition modifiers such as '40*char' are not supported. If the file is formatted, precision is a fscanf and sscanf-style format string of the form '%nX', where n is the number of characters within which the formatted data is found, and X is the conversion character such as 'g' or 'd'. Fortran-style double-precision output such as '0.0D00' can be read using a precision string such as '%nD', where n is the number of characters per element. This is an extension to the C-style format strings accepted by sscanf. Users unfamiliar with C should note that '%d' is preferred over '%i' for formatted integers. MATLAB syntax follows C in interpreting '%i' integers with leading zeros as octal. Formatted files with line endings need to provide the number of trailing bytes per row, which can be 1 for platforms with carriage returns or linefeed (Macintosh, UNIX®), or 2 for platforms with carriage returns and linefeeds (DOS).
In addition, it is helpful to look at the table summary in the fread documentation, which lists the supported precision strings; for double-precision values the string is 'double' (8 bytes per element).
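For intuition, here is roughly what reading an nrows-by-ncols matrix of doubles from a raw binary file amounts to, sketched with numpy ('matrix.bin' and the dimensions are made up):

import numpy as np

# Each double-precision value occupies 8 bytes (IEEE 754 float64),
# which is what a 'double' precision string selects.
nrows, ncols = 4, 3  # hypothetical dimensions
data = np.fromfile('matrix.bin', dtype=np.float64, count=nrows * ncols)
matrix = data.reshape(nrows, ncols)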
I need to write data to a binary file. While the data as a whole is byte-aligned, it is made of several fragments which are not aligned to bytes:
The total size of the data is 96 bits comprised of:
RGB color: 3*8 bit numbers (24bit),
1st property value: 7bit number
2nd property value: 7bit number
number of objects: 26bit number
memory offset: 32bit number.
totaling to 96bit or 12B
The reason the data is split this way is that each number has significance, and it's easier to create the data by putting the numbers separately in their correct order. I'm using fwrite for this, but the function only allows writing numbers in whole bytes. I found a way around it by using a "hack":
num=red;
num=num*2^8+green;
num=num*2^8+blue;
num=num*2^7+first_prop_val;
num=num*2^7+second_prop_val;
num=num*2^26+number_of_objects;
fwrite(fid, num,'uint64');
fwrite(fid, memory_offset,'uint32');
This works because all the numbers are positive, but it's ugly. Is there a less "hacky" way to accomplish what I need?
* The property values take 7 bits because they range from 0 to 100; giving them an extra bit just to align the data would mean room for fewer objects, since all the bits need to be counted.
For a signed 32bit integer, you can get the binary representation using:
bi=dec2bin(typecast(int32(-23),'uint32'),32)=='1'
Signed: you can remove the leading bits as long as they all equal the (n+1)-th bit from the right (they are just sign extension):
bw=7
assert(all(bi(1:end-bw+1)) | all(~bi(1:end-bw+1)))
bi=bi(end-bw+1:end)
For an unsigned one, use:
bi=dec2bin(uint32(23),32)=='1'
Unsigned: you can remove the leading zeros.
assert(all(~bi(1:end-bw)))
bi=bi(end-bw+1:end)
Put this into a function, convert all your integers, concatenate them into one binary array, cut it into 32-bit parts, and write those using the uint32 format.
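The same idea sketched in Python, in case it helps to see it outside MATLAB (the field values are hypothetical; the widths are the ones from the question):

import struct

# Render each field as a fixed-width bit string, concatenate,
# and write the result as big-endian 32-bit words.
def to_bits(value, width):
    return format(value, '0%db' % width)

fields = [(255, 8), (128, 8), (0, 8),         # RGB
          (100, 7), (42, 7),                  # the two 7-bit properties
          (1000000, 26), (0xDEADBEEF, 32)]    # object count, memory offset
bits = ''.join(to_bits(v, w) for v, w in fields)   # 96 bits total
assert len(bits) % 32 == 0
words = [int(bits[i:i+32], 2) for i in range(0, len(bits), 32)]
packed = struct.pack('>%dI' % len(words), *words)  # 12 bytes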
I'm writing an Ama file generating simulator. In the Ama specifications, the length of a field is given in bytes (not bits) as BCD16, BCD12, etc. (BCD{a number}).
For fields marked BCD16, the actual length is 8 bytes. Can anyone please tell me what BCD16 means?
I know BCD is binary-coded decimal, but I don't understand what BCD16 means.
It looks like a 2-byte (16-bit) BCD encoding. See here.
This is 2 bytes and can store a 4-digit BCD16-encoded number, with one BCD digit stored in each half byte (aka nibble).
Example: 0011-0110 0010-0101 is 3-6 2-5, which is 3x100 + 6x10 + 2x1 + 5x0.1 = 362.5.
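A quick sketch of that nibble-by-nibble decoding (the trailing decimal place is an assumption carried over from the quoted example):

# Decode the two-byte BCD example above, one digit per nibble.
data = bytes([0b00110110, 0b00100101])      # the bits 0011-0110 0010-0101
digits = []
for byte in data:
    digits += [byte >> 4, byte & 0x0F]      # high nibble, then low nibble
print(digits)                               # [3, 6, 2, 5]
# With one assumed decimal place, as in the example: 362.5
print(int(''.join(map(str, digits))) / 10)  # 362.5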
I think after talking with a senior I found a definition.
BCD{x} means that a single digit takes x bits.
With that, if it is BCD3, it would take three bits to represent a single digit. Of course, with that we cannot represent every combination, but we can represent some very large values.
E.g.: in BCD1 we can represent 11111111 using one byte.
This is what I found. If there is something wrong with these definitions, please correct me.
I'm generating ~1 million text files containing arrays of doubles, tab delimited (these are simulations for research). Example output below. I expect each million text files to total ~5 TB, which is unacceptable. So I need to compress.
However, all my data analysis will be done in MATLAB, and every MATLAB script will need to access all million of these text files. I can't decompress the whole million using C++ and then run the MATLAB scripts, because I lack the disk space. So my question is: are there some very simple, easy-to-implement algorithms or other ways of reducing my text file sizes, so that I can write the compression in C++ and read it in MATLAB?
Example text file:
0.0220874 0.00297818 0.000285954 1.70E-05 1.52E-07
0.0542912 0.00880725 0.000892849 6.94E-05 4.51E-06
0.0848582 0.0159799 0.00185915 0.000136578 7.16E-06
0.100415 0.0220033 0.00288016 0.000250445 1.38E-05
0.101889 0.0250725 0.00353148 0.000297856 2.34E-05
0.0942061 0.0256 0.00393893 0.000387219 3.01E-05
0.0812377 0.0238492 0.00392418 0.000418365 4.09E-05
0.0645259 0.0206528 0.00372185 0.000419891 3.23E-05
0.0487525 0.017065 0.00313825 0.00037539 3.68E-05
If it matters.. the complete text files represent joint probability mass functions, so they sum to 1. And I need lossless compression.
UPDATE: Here is an IDIOT'S guide to writing binary in C++ and reading it in MATLAB, with some very basic explanation along the way.
C++ code to write a small array to a binary file.
#include <cstdio>   // fopen, fwrite and fclose live here
using namespace std;
int main()
{
    float writefloat;
    const int rows=2;
    const int cols=3;
    float JPDF[rows][cols];
    JPDF[0][0]=.19493;
    JPDF[0][1]=.111593;
    JPDF[0][2]=.78135;
    JPDF[1][0]=.33333;
    JPDF[1][1]=.151535;
    JPDF[1][2]=.591355;
JPDF is an array of type float that I save 6 values to. It's a 2x3 array.
    FILE * out_file;
    out_file = fopen ( "test.bin" , "wb" );
The first line makes a pointer of type FILE named out_file. The second line, fopen, says to make a new file for writing (the 'w' of the second parameter), and to make it a binary file (the 'b' of "wb").
    fwrite(&rows,sizeof(int),1,out_file);
    fwrite(&cols,sizeof(int),1,out_file);
Here I encode the size of my array (# rows, # cols). Note that we pass fwrite the addresses of the variables rows and cols, not the variables themselves (& takes the address). The second parameter tells it how many bytes each element occupies; since rows and cols are both ints, I use sizeof(int). The '1' says write 1 such element. And out_file is our pointer to the file we're writing to.
    for (int i=0; i<cols; i++)
    {
        for (int j=0; j<rows; j++)
        {
            writefloat=JPDF[j][i];
            fwrite (&writefloat, sizeof(float), 1, out_file);
        }
    }
    fclose (out_file);
    return 0;
}
Now I iterate through my array and write each value's bytes to the file. The indexing looks a little backwards in that the inner loop runs down each column rather than across a row. We'll see why in a sec. Again, I write the address of writefloat, which takes on the value of the current array element in each iteration. Since each array element is a float, I use sizeof(float) here instead of sizeof(int).
Just to be incredibly, stupidly clear, here's a diagram of how I think of the file we've just created.
[4 bytes: rows][4 bytes: cols][4 bytes: JPDF[0][0]][4 bytes: JPDF[1][0]] ...
[4 bytes: JPDF[1][2]]
...where each chunk of bytes is written in binary (0s and 1s).
To interpret in MATLAB:
FID=fopen('test.bin');
sizes=fread(FID,2,'int')
FID sort of works like a pointer here. (Secretly, it probably is one.) Then we use fread, which operates very similarly to C's fread. FID is our 'pointer' to our file. The 'int' tells the function the type, and therefore the size, of each chunk to read. So sizes=fread(FID,2,'int') says: from FID, read 2 chunks of int size each, and return the 2 elements in vector form. Now sizes(1)=rows and sizes(2)=cols.
s=fread(FID,[sizes(1) sizes(2)],'float')
The next part wasn't completely clear to me originally; I thought I'd have to tell fread to skip the 'header' of my binary file that contains the row/col info. However, it secretly maintains a position pointer to where you left off. So now I read out the rest of the binary file, using the fact that I know the dimensions of the array. Note that while the second parameter [M,N] is [rows,cols], fread fills the result in column order, which is why we wrote the array data in column order.
The one caveat is that I think I can only use the MATLAB precisions 'int' and 'float' if the architecture of the C++ program is concordant with MATLAB's (e.g., both are 64-bit, or both are 32-bit); explicit precisions like 'int32' and 'float32' would sidestep this. But I'm not sure about this.
The output is:
sizes =
2
3
s =
0.194930002093315 0.111593000590801 0.781350016593933
0.333330005407333 0.151535004377365 0.59135502576828
To do better than four bytes per number, you need to determine to what precision you need these numbers. Since they are probabilities, they are all in [0,1]. You should be able to specify a precision as a power of two, e.g. that you need to know each probability to within 2^-n of the actual. Then you can simply multiply each probability by 2^n, round to the nearest integer, and store just the n bits in that integer.
In the worst case, I can see that you are never showing more than six digits for each probability. You can therefore code them in 20 bits, assuming a constant fixed precision past the decimal point. Multiply each probability by 2^20 (1048576), round, and write out 20 bits to the file. Each probability will take 2.5 bytes. That is smaller than the four bytes for a float value.
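A toy sketch of that 20-bit fixed-point scheme in Python (not tuned for speed; the row of values is taken from the example file, and all values are assumed to be well below 1):

# Encode probabilities as 20-bit fixed-point integers, 2.5 bytes each.
def encode(probs, n_bits=20):
    bits = ''.join(format(round(p * (1 << n_bits)), '0%db' % n_bits)
                   for p in probs)
    bits += '0' * (-len(bits) % 8)   # pad to a whole number of bytes
    return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

def decode(data, count, n_bits=20):
    bits = ''.join(format(b, '08b') for b in data)
    return [int(bits[i*n_bits:(i+1)*n_bits], 2) / (1 << n_bits)
            for i in range(count)]

row = [0.0220874, 0.00297818, 0.000285954, 1.70e-05, 1.52e-07]
decoded = decode(encode(row), len(row))
assert all(abs(a - b) <= 2**-20 for a, b in zip(row, decoded))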
And either way is way smaller than the average of 11.3 bytes per value in your example file.
You can get better compression even than that if you can exploit known patterns in your data. Assuming that there are any. I see that in your example, on each line the values go down by some factor at each step. If that is real and not just an artifact of the generation of the example, then you can successively use fewer bits for each sample. Also if the first sample is really always less than 1/8, then you can drop the top three bits off that one, since those bits would always be zero. If the second column is always less than 1/32, you can drop the first five bits off all of those. And so on. Assuming that the magnitudes in the example are maximums across all of the data sets (obviously not true, but just using that as an illustrative case), and assuming you need six decimal digits after the decimal point, I could code each row of six values in 50 bits, for an average of a little over one byte per probability.
And for one last smidgen of compression: since the values add to one, you don't have to store the last value.
Matlab can read binary files. Why not save your files as binary instead of text?
Saving each number as a float would only require 4 bytes (a float is 32 bits on essentially every platform); you could use doubles, but it appears that you aren't using the full double resolution anyway. Under your current scheme, each digit of every number consumes a byte of space. All of your numbers are easily 4+ characters long, some as long as 10 characters. Implementing this change should cut down your file sizes by more than 50%.
Additionally, you might consider using a more elegant data format like HDF5 (more here), which both supports compression and is supported by MATLAB.
Update:
There are lots of examples of how to write binary files in C++; just google it.
Additionally, to read a binary file into MATLAB, simply use fread.
The difference between representing a number as ASCII vs binary is really simple. All files are written using binary; the difference is in how that information gets interpreted. Text files are generally read as ASCII, which provides a nice mapping between an 8-bit word and a character. When you see a string like "255", what you have is an array of bytes where each byte encodes one character of the string. However, when you are storing numbers, it's really wasteful to store each digit using a different byte. A single byte can store values between 0 and 255. So why use three bytes to store the string "255" when a single byte can store the value 255?
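To see the waste concretely (a trivial sketch; the value comes from the example data):

import struct

text = b'0.0220874'                     # 9 bytes: one byte per character
binary = struct.pack('<f', 0.0220874)   # 4 bytes: one IEEE 754 float
print(len(text), len(binary))           # prints: 9 4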
You can always go ahead and zip everything using a standard library like zlib. Afterwards you could use a custom DLL written in C++ that unzips your data in chunks you can manage. So basically:
Data --> Zip --> Dll (Loaded by Matlab via LoadLibrary) --> Matlab
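The compression half of that pipeline is essentially a one-liner with zlib; a sketch of the round trip (shown in Python for brevity, though the pipeline above would do this from C++):

import zlib

# Compress a tab-delimited chunk and get it back intact (lossless).
original = b'0.0220874\t0.00297818\t0.000285954\t1.70E-05\t1.52E-07\n'
compressed = zlib.compress(original, 9)
assert zlib.decompress(compressed) == original

On a chunk this short the header overhead can outweigh the savings; the gain shows up on whole files, where repeated digits and delimiters compress well.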
I am trying to unpack a variable containing a string received from a spectrum analyzer:
#42404?û¢-+Ä¢-VÄ¢-oÆ¢-8æ¢-bÉ¢-ôÿ¢-+Ä¢-?Ö¢-sÉ¢-ÜÖ¢-¦ö¢-=Æ¢-8æ¢-uô¢-=Æ¢-\Å¢-uô¢-?ü¢-}¦¢-=Æ¢-)...
The format is real 32, which uses four bytes to store each value. The number #42404 represents 4 extra bytes present, and 2404/4 = 601 points collected. The data starts after #42404. Now, when I receive this into a string variable,
$lp = ibqry($ud,":TRAC:DATA? TRACE1;*WAI;");
I am not sure how to convert this into an array of numbers... Should I use something like the following?
@dec = unpack("d", $lp);
I know this is not working, because I am not getting the right values, and the number of data points is certainly not 601...
First, you have to strip the #42404 off and hope none of the following binary data happens to be an ASCII number.
$lp =~ s{^#\d+}{};
I'm not sure what format "real 32" is, but I'm going to guess it's a single-precision floating point, which is 32 bits long. Looking at the pack docs, d is "double precision float"; that's 64 bits. So I'd try f, which is "single precision".
#dec = unpack("f*", $lp);
Whether your data is big or little endian is a problem. d and f use your computer's native endianness. You may have to force endianness using the > and < modifiers.
#dec = unpack("f*>", $lp); # big endian
#dec = unpack("f*<", $lp); # little endian
If the first 4 encodes the number of remaining digits (2404) before the floats, then something like this might work:
my @dec = unpack "x a/x f>*", $lp;
The x skips the leading #, the a/x reads one digit and skips that many characters after it, and the f>* parses the remaining string as a sequence of 32-bit big-endian floats. (If the output looks weird, try using f<* instead.)
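For comparison, here is the same parse sketched with Python's struct module (the "#42404" prefix is the IEEE 488.2 definite-length block header that SCPI instruments emit; the function and variable names are made up):

import struct

def parse_block(raw):
    # Block header: b'#', one digit giving the length-field width,
    # then that many digits giving the payload length in bytes.
    assert raw[:1] == b'#'
    n_digits = int(raw[1:2])                       # e.g. 4
    length = int(raw[2:2 + n_digits])              # e.g. 2404 bytes
    payload = raw[2 + n_digits:2 + n_digits + length]
    # 2404 / 4 = 601 big-endian 32-bit floats; try '<' if values look odd.
    return struct.unpack('>%df' % (length // 4), payload)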