Hashing into 7-Bucket table using Open and Closed hash tables

I am working on some questions about open and closed hash tables, and have run into a problem.
I have to show how the 7-bucket hash table looks when it is filled in using Open hashing, and closed hashing, using the following input.
1,8,27,64,125,216,343
With the hash function h(k) = k mod 7
I believe that for open hashing, the resulting table would look like this:
0 -> 343
1 -> 1 -> 8 -> 64
2 ->
3 ->
4 ->
5 ->
6 -> 27 -> 125 -> 216
I understand this type of table. However, for a closed table, I know that you are supposed to just stick the item in the next available bucket. I have included what I think the closed hash table should look like, right before you go to insert 125.
0 ->
1 -> 1
2 -> 8
3 -> 64
4 ->
5 ->
6 -> 27
So now I have to insert 125. 125 mod 7 is 6. But there is a collision in the 6 bucket. So now I would move to the next open bucket. But there isn't one. Do I just restart at the beginning of the hash table, and insert it into bucket 0?

Yes, you would "loop" around and start at the beginning. There's a very good explanation of linear probing here:
Consider the situation mentioned above where data 'F' has the same hash code as data 'D'. In order to resolve the collision, the add algorithm will need to probe the table in order to find the first free space (after 'C').
If the probe loops back, and finally reaches the same element that it started at, it means that the hash table is full, and can no longer hold any more data. The addition operation will fail.
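To make the wrap-around concrete, here is a minimal Python sketch of closed hashing (open addressing) with linear probing. The 7 buckets, the hash function h(k) = k mod 7 and the input values come from the question; everything else is just illustration.
def insert(table, key):
    """Insert key using linear probing; wrap around past the last bucket."""
    n = len(table)
    start = key % n              # h(k) = k mod n
    for step in range(n):        # probe at most n buckets
        i = (start + step) % n   # the % n is the wrap-around
        if table[i] is None:
            table[i] = key
            return True
    return False                 # every bucket probed: table is full

table = [None] * 7
for k in [1, 8, 27, 64, 125, 216, 343]:
    insert(table, k)

print(table)  # [125, 1, 8, 64, 216, 343, 27]
Note how 125 hashes to bucket 6, finds it occupied by 27, and wraps around into bucket 0; the later keys 216 and 343 then have to probe further along the table before landing in buckets 4 and 5.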

Related

KDB - Is there a limit to the number of functions called at one time when updating tables?

I’ve been running a number of functions to update a table and I keep adding more functions as I wish to update and call other various items. I have not run into any issues yet (currently at 7 functions) but I’m mindful that there may be a limit. I did find that there is a limit of 8 parameters for a single function but nothing noting a limit on the below. If not, great. I wanted to be mindful as I scale up.
updateTable: FuncG FuncF FunE FuncD FuncC FuncB FuncA ::; // max number of functions?
t: updateTable t;
I made a fake update statement with loads of function calls, and it seems like you're fine:
q)t:([]a:1 2 3)
q)f:{x+1}
q)value "update ",(raze 1000#enlist"f "),"a from t"
a
----
1001
1002
1003
One thing you might want to do is make a single function composed from a list of your functions:
q)f:{x+1}
q)g:{2*x}
q)h:{x+1+2}
q)(('[;])/)(f;g;h)
{x+1}{2*x}{x+1+2}
q)composed:(('[;])/)(f;g;h)
q)t:([]a:1 2 3)
q)update composed a from t
a
--
9
11
13
so that you only have a single function in your update statement, and it should scale.

kdb q - efficiently count tables in flatfiles

I have a lot of tables stored in flat files (in a directory called basepath) and I want to check their number of rows. The best I can do right now is:
c:([] filename:system "ls ",basepath;
tablesize:count each get each hsym `$basepath,/:system "ls ",basepath)
which loads each table entirely into memory and then performs the count (that's quite slow). Is saving as splayed tables the only way to make this faster (because I would only load 1 column and count that) or is there a trick in q that I can use?
Thanks for the help
If you have basepath defined as a string of the path to directory where all your flat tables are stored then you can create a dictionary of the row counts as follows:
q)cnt:{count get hsym x}
q)filename:key hsym `$basepath
q)filename!cnt each filename
t| 2
g| 3
This is where I have flat tables t and g saved in my basepath directory. This stops you from having to use system commands, which are often less efficient.
The function cnt takes the path of each flat table (as a symbol) and returns the number of rows; the table is read to do the count but is not kept in memory.
The best solution if you have control of the process of saving such files down is to add an extra step of saving the meta information of the row count somewhere separate at the same time as saving the raw data. This would allow you to quickly access the table size from this file instead of reading the full table in each time.
However, note that to avoid pulling them into memory at all you would have to instead use read1 and look at the headers on the binary data. As you said it would be better to save as a splayed table and read in one column.
UPDATE: I would not recommend doing this and strongly suggest doing the above, but for curiosity, after looking into using read1, here's an example of what a hacky solution might look like:
f:{
b:read1(y;0;x);
if[not 0x62630b~b[2 4 5];'`$"not a table"];
cc:first first((),"i";(),4)1:b 7+til 4;
if[null ce:first where cc=sums 0x0=11 _ b;:.z.s[x*2;y]];
c:`$"\000" vs "c"$b[11+til ce];
n:first first((),"i";(),4)1:b[(20+ce)+til 4];
:`columns`rows!(c;n);
}[2000]
The q binary file format isn't documented anywhere; the only way to figure it out is to save different things and see how the bytes change. It's also subject to change between versions - the above is written for 3.5 and is probably valid for 3.0-3.5 only, not the latest 3.6 release or anything 2.X.
The given code works in the following way:
reads a chunk from the front of the file
validates that it looks like a flat unkeyed table (flip[98] of a dict[99] with symbol[11] keys)
reads the count of symbols in the list of columns as a little endian 4 byte int
scans through the null terminated strings for that many zero bytes
if the columns are so numerous or verbose that we don't have them all in this chunk, it will double the size of the chunk and try again
turn the strings into symbols
using the offset we get from the end of the column list, skip a bit more of the header for the mixed list of columns
then read the count from the header of the first column
Hope this answers your question!
From experimenting with the binary files, it seems that the table count is saved as part of the binary file when you save down a flat file, taking up 4 bytes after the initial object type and column headings which will vary from table to table.
`:test set ([]a:1 2 3;b:4 5 6;c:7 8 9;aa:10 11 12;bb:13 14 15)
q)read1 `:test
0xff016200630b0005000000610062006300616100626200000005000000090003000000
(the byte offsets of interest are 0, 7, 11 and 31)
bytes   | example                    | meaning
--------|----------------------------|---------------------------------------------------------------
0 - 6   | 0xff016200630b00           | object is a flat table
7 - 10  | 0x05000000                 | number of columns (5)
11 - 22 | 0x610062006300616100626200 | ascii values of the column names ("a", "b", "c", "aa", "bb"), each name followed by a one-byte null separator
23 - 30 | 0x0000050000000900         | 8 bytes that can be skipped
31 - 34 | 0x03000000                 | 4 bytes for the row count of the first column (3)
This should help you understand the function that Fiona posted.
The binary is saved down little-endian, meaning the least-significant byte comes first and the most-significant byte is the right-hand-most. Doing this in decimal for the number 100 would give 001, with the 100s (most significant) on the right, then the 10s, and finally the 1s on the left. In the binary file, each pair of hex digits is one byte.
You can use 1: to read in the contents of a binary file, with additional arguments in the list specifying the offset - where to start reading from, and how many bytes to read. In our case we want to start at byte 31 and read in 4 bytes, specifying the output should be an integer and to cut the input into separate 4 byte chunks.
q)first first (enlist "i";enlist 4)1:(`:test;31;4)
3i
Converting the little-endian bytes into an integer gives us the row count. Since this only has to read in 4 bytes instead of the whole file it is a lot quicker.
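If you want to sanity-check that layout outside q, the same 4-byte little-endian read can be done in any language. Here is a hypothetical Python equivalent, using the offset 31 and the example file test from above:
import struct

# read 4 little-endian bytes at offset 31 of the flat-table file
with open("test", "rb") as fh:
    fh.seek(31)
    (rows,) = struct.unpack("<i", fh.read(4))   # "<i" = little-endian 4-byte int
print(rows)  # 3 for the example table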
For a table with 10000 rows and 2 columns there is not much difference:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10000;31;4)
0
q)\t count get `:test10000
0
For a table with 100m rows and 2 columns:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10m;31;4)
0
q)\t count get `:test10m
2023
If you have a splayed table instead, you can read in the number of elements in one of the columns from bytes 8-11 like so, assuming the column is a simple list:
q)first first (enlist "i";enlist 4)1:(`:a;8;4)
3i
You can read more about reading in from binary files here https://code.kx.com/q/ref/filenumbers/#1-binary-files
You can make what you currently have more efficient by using the following
counttables:{count each get each hsym `$basepath}
This will improve the speed of the count by not including the extra read in of the data as well as the join which you are currently doing. You are correct though that if the tables were saved splayed you would only have to read in the one column, making it much more efficient.
If your tables are stored uncompressed there's probably something quite hacky you could do with a read1 on the headers within the file until you find the first column header.
But v hacky :-(
Are you responsible for saving these down? Can you keep a running state as you do?

Difference between Matlab JOIN vs. INNERJOIN

In SQL, JOIN and INNER JOIN mean the same thing. In Matlab, they are different commands. Just from perusing the documentation thus far, they appear on the surface to fulfill the same general function, with possible differences in the details, as controlled by parameters. I am slogging through the individual examples and may (or may not) find the fundamental difference. However, I feel that the difference should not be a subtlety that users have to ferret out of the examples. These are two separate commands, and the documentation should make it clear up front why they are both needed. Would anyone be able to chime in about the key difference? Perhaps it could become a request to place it front and centre in the documentation.
I've empirically characterized the difference between JOIN and INNERJOIN (some would refer to this as reverse engineering). I'll summarize from the perspective of one who is comfortable with SQL. As I am new to SQL-like operations in Matlab, I've only been able to test drive it to a limited degree, but the INNERJOIN appears to join records in the same manner as SQL. Since SQL is a pretty open language, the behavioural specification of INNERJOIN is readily available, and I won't dwell on that. It's Matlab's JOIN that I need to suss out.
In short, from my testing, Matlab's JOIN seems to "join" the rows in the two operand tables in a manner more like Excel's VLOOKUP rather than any of the JOINs in SQL. In general, the main differences with SQL joins seem to be (i) that the right hand table cannot have repeating values in the columns used for matching up rows between the two tables and (ii) all combinations of values in the key columns of the left hand table must show up in the right hand table.
Here is the empirical testing. First, prepare the test tables:
a=array2table([
1 2
3 4
5 4
],'VariableNames',{'col1','col2'})
b=array2table([
4 7
4 8
6 9
],'VariableNames',{'col2','col3'})
c=array2table([
2 10
4 8
6 9
],'VariableNames',{'col2','col3'})
d=array2table([
2 10
4 8
6 9
6 11
],'VariableNames',{'col2','col3'})
a2=array2table([
1 2
3 4
5 4
20 99
],'VariableNames',{'col1','col2'})
Here are the tests:
>> join(a,b)
Error using table/join (line 111)
The key variable for B must have unique values.
>> join(a,c)
ans =
    col1    col2    col3
    ____    ____    ____
     1       2      10
     3       4       8
     5       4       8
>> join(a,d)
Error using table/join (line 111)
The key variable for B must have unique values.
>> join(a2,c)
Error using table/join (line 130)
The key variable for B must contain all values in the key
variable for A.
The first thing to notice is that JOIN is not a symmetric operation with respect to the two tables.
It seems that the 2nd table argument is used as a lookup table. Unlike SQL joins, Matlab throws an error if it can't find a match in the 2nd table [see join(a2,c)]. This is somewhat hinted at in the documentation, though not entirely clearly. For example, it says that the key values must be common to both tables, but join(a,c) clearly shows that the tables do not have to have common key values. On the contrary, just as one would expect of a lookup table, entries in the 2nd table that aren't matched do not throw errors.
Another difference with SQL joins is that records that cause the key values to replicate in the 2nd table are not allowed in Matlab's join [see join(a,b) & join(a,d)]. In contrast, the fields used for matching records between tables aren't even referred to as keys in SQL, and hence can have non-unique values in either of the two tables. The disallowance of repeated key values in the 2nd table is consistent with the view of the 2nd table as a lookup table. On the other hand, repetition of key values is permitted in the 1st table.

Cannot allocate memory for a column of compound floats on a partitioned table

I have a partitioned table in my hdb that includes a column containing large lists of floats (at most 400 floats per element). eg each element looks like
(100.0 1.0 ...)
When trying to select on this column from days where there are particularly high numbers of rows I get an error saying
'./2015.02.07/table/column# Cannot allocate memory
The same error arises from a query like:
select column[;0] from table where date=2015.02.07
even though on days with fewer rows this query returns the first value of each element in the column.
Is there a way to stream this column in a select to decrease the memory requirements of holding the whole column in memory for a large day?
EDIT
.Q.ind on large days fails with the same error.
ie given I can work with 2015.02.01 but not 2015.02.02:
.Q.ind[select from table where date=2015.02.01;enlist 1]
is fine but
.Q.ind[select from table where date=2015.02.02;enlist 1]
fails with
{0!$[#.Q.pm;p3;(?).]#[x;0;p1[;y;z]]}
'./2015.02.10/table/column2#: Cannot allocate memory
#
.[?]
(+`time`sym`column1`column2!`:./2015.02.02/table;();0b;())
I should note I am using the free 32-bit version
I think this is all just a combination of the free 32-bit memory limitation, the fact that your row counts are possibly large, and the fact that (unavoidably) something must be pulled entirely into memory when retrieving data from a column - whether it is the column itself that gets entirely pulled in (in the non-nested case) or the nested-index column that gets entirely pulled in.
Another thing to consider is that kdb uses powers-of-two (buddy) memory allocation. Even if today's table only contains one more row than yesterday's, the memory requirements per column could double. Take a simple example:
In the free 32bit version (windows) you can create this many floats and it only uses ~1.07gb of memory
q)\ts 134217726?1.0
3093 1073741952
However, try to generate one extra float and you hit a memory limit
q)\ts 134217727?1.0
wsfull
So even a small difference in row count between one day and the next can be very significant if you're near the boundary of an allocatable power of two.
--DISCLAIMER-- the following is hacky and is only intended for debugging!
You can actually manually try to access the data from the nested list, though you may still have memory issues here anyway.
Create a nested table and splay it
q)tab:([] col1:(101 102 103f;104 105f;106 107 108 109 110f;111 112f))
q)tab
col1
--------------------
101 102 103f
104 105f
106 107 108 109 110f
111 112f
q)
q)`:test/ set tab
`:test/
You can try to read in the indices from the nested-index file
q)2_first (enlist "j";enlist 8)1:`:test/col1
3 5 10 12
So the indices for splitting the full list of floats (the col1# file) are index 3, index 5, index 10, and so on
Say I want the first 3 rows
q)myrows:3#2_first (enlist "j";enlist 8)1:`:test/col1
q)myrows
3 5 10
then I know that I need the first 10 floats from the col1# file and need to split them at index 3 and 5. Then I can read the col1# file partially and split it correctly
q)(0,-1_myrows) cut raze (enlist "f";enlist 8)1:(`$":test/col1#";0;8*last myrows)
101 102 103f
104 105f
106 107 108 109 110f
But this is precisely what KDB does under the covers anyway so I suspect that you'll still have trouble even reading in the nested-index file in the first place.
Check this debug/hack and see if you can partially read that way. But obviously it's not a long-term solution!
Nested columns make querying in the usual way difficult, as the # file also needs to be loaded into memory (even with a [;0])
Your best bet is to map a date partition in, and then select within that chunk by chunk, e.g. a million rows at a time (or whatever is sensible given the size of the nested floats).
Perhaps also consider 32bit floats, if some decimal accuracy can be sacrificed.
EDIT
So after comments I guess the best way is to go through each partition a number of lines at a time with .Q.ind
Just to give my 2 cents on this, I had a similar error but with a 64-bit instance.
I suspected that the memory needed to be de-fragmented as it was running for almost a year.
Bouncing the instance solved the issue, and released a lot of virtual memory

Process two space delimited text files into one by common column [duplicate]

This question already has answers here:
merge two files by key if exists in the first file / bash script [duplicate]
I have two text files that look like:
col1 primary col3 col4
blah 1 blah 4
1 2 5 6
...
and
colA primary colC colD
a 1 7 27
foo 2 11 13
I want to merge them into a single wider table, such as:
primary col1 col3 col4 colA colC colD
1 blah blah 4 a 7 27
2 1 5 6 foo 11 13
I'm pretty new to Perl, so I'm not sure what the best way is to do this.
Note that column order does not matter, and there are a couple million rows. Also my files are unfortunately not sorted.
My current plan unless there's an alternative:
For a given line in one of the files, scan the other file for the matching row and append them both as necessary into the new file. This sounds slow and cumbersome though.
Thanks!
Solution 1.
Read the smaller of two files line by line, using a standard CPAN delimited-file parser like Text::CSV_XS to parse out columns.
Save each record (as arrayref of columns) in a hash, with your merge column being the hash key
When done, read the larger of two files line by line, using a standard CPAN delimited-file parser like Text::CSV_XS to parse out columns.
For each record, find the join key field, find the matching record from your hash storing the data from file#1, merge the 2 records as needed, and print.
NOTE: This is pretty memory intensive as the entire smaller file will live in memory, but it won't require you to read one of the files millions of times (see the sketch below).
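Here is a rough Python sketch of Solution 1, assuming whitespace-delimited files, a key column literally named "primary", and the hypothetical file names file1.txt and file2.txt; a Perl version would do the same thing with Text::CSV_XS and a hash.
def read_rows(path):
    """Yield one dict per data line of a whitespace-delimited file."""
    with open(path) as fh:
        header = fh.readline().split()
        for line in fh:
            yield dict(zip(header, line.split()))

KEY = "primary"

# 1. load the smaller file into a hash (dict) keyed on the merge column
lookup = {row[KEY]: row for row in read_rows("file2.txt")}

# 2. stream the larger file, merge matching records and print them
for row in read_rows("file1.txt"):
    match = lookup.get(row[KEY])
    if match is not None:
        merged = {**row, **match}        # columns from both files
        print(" ".join(merged.values()))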
Solution 2.
Sort file1 (using Unix sort or some simple Perl code) into "file1.sorted"
Sort file2 (using Unix sort or some simple Perl code) into "file2.sorted"
Open both files for reading. Loop until both are fully read:
Read 1 line from each file into the buffer if the buffer for that file is empty (buffer being simply a variable containing the next record).
Compare indexes between 2 lines.
If index1 < index2, write the record for file1 into output (without merging) and empty buffer1. Repeat step 3
If index1 > index2, write the record for file2 into output (without merging) and empty buffer2. Repeat.
If index1 == index2, merge 2 records, write the merged record into output and empty out both buffers (assuming the join index column is unique. If not unique, this step is more complicated).
NOTE: this does NOT require you to keep the entire file in memory, aside from sorting the files (which CAN be done in a memory-constrained way if you need to). A sketch of the merge loop follows below.
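And a minimal Python sketch of Solution 2's merge loop, assuming both files are already sorted on the key column (the second column, as in the samples above), that keys are unique, and that the hypothetical file names file1.sorted and file2.sorted exist:
def records(path):
    """Yield (key, fields) pairs from a sorted, whitespace-delimited file."""
    with open(path) as fh:
        next(fh)                       # skip the header line
        for line in fh:
            fields = line.split()
            yield fields[1], fields    # column 2 ("primary") is the join key

it1, it2 = records("file1.sorted"), records("file2.sorted")
r1, r2 = next(it1, None), next(it2, None)
while r1 and r2:
    if r1[0] < r2[0]:                                  # keys compared as strings
        print(" ".join(r1[1])); r1 = next(it1, None)   # file1-only record
    elif r1[0] > r2[0]:
        print(" ".join(r2[1])); r2 = next(it2, None)   # file2-only record
    else:
        print(" ".join(r1[1] + r2[1]))                 # merge on equal keys
        r1, r2 = next(it1, None), next(it2, None)
# any leftover records in the longer file would be drained the same way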