How do you get Matlab mapreduce to use the key passed from the map function - matlab

Here is a sample reduce function from the Matlab documentation:
function MeanDistReduceFun(intermKey, intermValIter, outKVStore)
sumLen = [0 0];
while hasnext(intermValIter)
sumLen = sumLen + getnext(intermValIter);
end
add(outKVStore, 'Mean', sumLen(1)/sumLen(2));
end
This creates a final dataset tagged by the key Mean. However, I would like to dynamically generate the key based off the unique keys from the map stage. Can I simply use intermKey in place of 'Mean' in the add function, or should I include the key in intermValIter somehow and extract it?

The short answer: Yes, you can use intermKey to use the unique keys passed by the map function.
The long answer: Between the map and reduce stages, all of the unique keys and their associated values are stored as ValueIterator objects:
http://www.mathworks.com/help/matlab/ref/valueiterator-object.html
Each ValueIterator object is associated with a single key. So once this intermediate grouping is done, the reduce function is called a single time for each unique key passed by the mapper, and intermValIter contains all of the values associated with the unique key intermKey. Therefore, specifying intermKey will use each of the unique keys passed by the mapper.
As the above doc link mentions, the only interaction you have with the ValueIterator object is by using hasnext and getnext to loop through the values contained in intermValIter.

Related

Creating an array of columns from an array of column names in data flow

How can I create an array of columns from an array of column names in dataflow?
The following creates an array of sorted columns with and exception of the last column:
sort(slice(columnNames(), 1, size(columnNames()) - 1), compare(#item1, #item2))
I want to get an array of the columns for this array of column names. I tried this:
toString(byNames(sort(slice(columnNames(), 1, size(columnNames()) - 1), compare(#item1, #item2))))
But I keep getting the error:
Column name function 'byNames' does not accept column or argument parameters
Please can anyone help me with a workaround for this?
Update--
It seems using ColumnNames() in any way (directly or assigning it to a parameter) seems to be leading to error. As at runtime on Spark it is fed to the byNames() function. Due to unavailability of a way to re-introduce as parameter or assign variable in Data Flow directly, see below which works for me.
Have empty string array type parameter in DataFLow
Use sha2 function as usual in derived column with parameter sha2(256,byNames($cols))
Create pipeline, there use getMetadata to get Structure from which you can get column names.
For each column, inside ForEach activity append to a variable.
Next, connect to DataFLow and pass the variable containing Column names.
The documentation for the byNames function states 'Computed inputs are not supported but you can use parameter substitutions'. This explains why you should use a parameter as input to create the array used in the byNames function:
Example: Where $cols parameter hold the list of columns.
sha2(256,byNames(split($cols,',')))
You can use computed columns names as input by creating the array prior to using in function. Instead of creating the expression in-line in the function call, set the column values in a parameter prior and then use it in your function directly afterwards.
For a parameter $cols of type array:
$cols = sort(slice(columnNames(), 1, size(columnNames()) - 1), compare(#item1, #item2))
toString(byNames(sort(slice($cols, compare(#item1, #item2))))
Refer: byNames

Key, Value, Hash and Hash function for HashTable

I'm having trouble understanding what the Hash Function does and doesn't do, as well as what exactly a Bucket is.
From my understanding:
A HashTable is a data structure that maps keys to values using a Hash Function.
A HashFunction is meant to map data from an array of arbitrary/unknown size to a data array of fixed size.
There can be duplicate Values in the original data array, but this is irrelevant.
Each Value will have a unique Key. Thus, each Key has exactly 1 Value.
The HashFunction will generate a HashCode for each (Value, Key) pair. However, Collisions can occur in which multiple (Value, Key) pairs map to the same HashCode.
This can be remedied by using either Chaining/Open Addressing methods.
The HashCode is the index value indicating the position of a particular entry from the original data array within the Bucket array.
The Bucket array is the fixed data array constructed that will contain the entries from the original array.
My questions:
How are the Keys generated for each value? Is the HashFunction meant to generate both Key and HashCode values for each entry? Does each Bucket thus contain only one entry (assuming a Chaining implementation to remedy Collision)?
How are the Keys generated for each value?
Key is not generated, it is provided by you and serves as an input to the hash function which in turn converts that key into index of hash table. Simply speaking:
H(key)=index
so the value you are looking for is:
hash_table[index] = value
Is the HashFunction meant to generate HashCode values for each entry?
It all depends on the implementation of hash function and hash table. Some hash functions might generate a hashcode out of provided key and then for example take its modulo(size) where size is the size of hash table, in order to get the index. Others might convert the key directly into index. In either case the ultimate goal of hash function is to find the location of searched data within hash table in constant time.
Does each Bucket thus contain only one entry (assuming a Chaining implementation to remedy Collision)?
Ideally each key should be mapped to a unique index but mostly that's not the case since the number of buckets (i.e. indices) is far smaller than the number of keys so the average length of a chain per bucket (i.e. number of collisions per bucket) is no.of keys/no.of indices

Do MATLAB tables remove the need for dictionaries?

MATLAB tables let you index into any column/field using the row name, e.g., MyTable.FourthColumn('SecondRowName'). Compared to this, dictionaries (containers.Map) seem primitive, e.g., it serves the role of a 1-column table. It also has its own dedicated syntax, which slows down the thinking about how to code.
I'm beginning to think that I can forget the use of dictionaries. Are there typical situations for which that would not be advisable?
TL;DR: No. containers.Map has uses that cannot be replaced with a table. And I would not choose a table for a dictionary.
containers.Map and table have many differences worth noting. They each have their use. A third container we can use to create a dictionary is a struct.
To use a table as a dictionary, you'd define only one column, and specify row names:
T = table(data,'VariableNames',{'value'},'RowNames',names);
Here are some notable differences between these containers when used as a dictionary:
Speed: The struct has the fastest access by far (10x). containers.Map is about twice as fast as a table when used in an equivalent way (i.e. a single-column table with row names).
Keys: A struct is limited to keys that are valid variable names, the other two can use any string as a key. The containers.Map keys can be scalar numbers as well (floating-point or integer).
Data: They all can contain heterogeneous data (each value has a different type), but a table changes how you index if you do this (T.value(name) for homogeneous data, T.value{name} for heterogeneous data).
Syntax: To lookup the key, containers.Map provides the most straight-forward syntax: M(name). A table turned into a dictionary requires the pointless use of the column name: T.value(name). A struct, if the key is given by the contents of a variable, looks a little awkward: S.(name).
Construction: (See the code below.) containers.Map has the most straight-forward method for building a dictionary from given data. The struct is not meant for this purpose, and therefore it gets complicated.
Memory: This is hard to compare, as containers.Map is implemented in Java and therefore whos reports only 8 bytes (i.e. a pointer). A table can be more memory efficient than a struct, if the data is homogeneous (all values have the same type) and scalar, as in this case all values for one column are stored in a single array.
Other differences:
A table obviously can contain multiple columns, and has lots of interesting methods to manipulate data.
A stuct is actually a struct array, and can be indexed as S(i,j).(name). Of course name can be fixed, rather than a variable, leading to S(i,j).name. Of the three, this is the only built-in type, which is the reason it is so much more efficient.
Here is some code that shows the difference between these three containers for constructing a dictionary and looking up a value:
% Create names
names = cell(1,100);
for ii=1:numel(names)
names{ii} = char(randi(+'az',1,20));
end
name = names{1};
% Create data
values = rand(1,numel(names));
% Construct
M = containers.Map(names,values);
T = table(values.','VariableNames',{'value'},'RowNames',names);
S = num2cell(values);
S = [names;S];
S = struct(S{:});
% Lookup
M(name)
T.value(name)
S.(name)
% Timing lookup
timeit(#()M(name))
timeit(#()T.value(name))
timeit(#()S.(name))
Timing results (microseconds):
M: 16.672
T: 23.393
S: 2.609
You can go simpler, you can access structs using string field:
clear
% define
mydata.('vec')=[2 4 1];
mydata.num=12.58;
% get
select1='num';
value1=mydata.(select1); %method 1
select2='vec';
value2=getfield(mydata,select2) %method 2

How are same hash vs same key handled?

This question is not specific to any programming language, I am more interested in a generic logic.
Generally, associative maps take a key and map it to a value. As far as I know, implementations require the keys to be unique otherwise values get overwritten. Alright.
So let us assume that the above is done by some hash implementation.
What if two DIFFERENT keys get the same hash value? I am thinking of this in the form of an underlying array whose indices are in a result of hash on said keys. It could be possible that more than one unique key gets mapped to the same value yes? If so, how does such an implementation handle this?
How is handling same hash different from handling same key? Since same key results in overwriting and same hash HAS to retain the value.
I understand hashing with collision, so I know chaining and probing. Do implementations iterate over the current values which are hashed to a particular index and determine if the key is the same?
While I was searching for the answer I came across these links:
1. What happens when a duplicate key is put into a HashMap?
2. HashMap with multiple values under the same key
They don't answer my question however. How do we distinguish between same hash vs same key?
By comparing the keys. If you look at object-oriented implementations of hash maps, you'll find that they usually require two methods to be implemented on the key type:
bool equal(Key key1, Key key2);
int hash(Key key);
If only the hash function can be given and no equality function, that restricts the hash map to be based on the language's default equality. This is not always desirable as sometimes keys need to be compared with a different equality function. For example, if the keys are strings, an application may need to do a case-insensitive comparison, and then it would pass a hash function that converts to lowercase before hashing, and an equal function that ignores case.
The hash map stores the key alongside each corresponding value. (Usually, that's a pointer to the key object that was originally stored.) Any lookup into the hash map has to make a key comparison after finding a matching hash, to verify that the key actually matches.
For example, for a very simple hash map that stores a list in each bucket, the list would be a list of (key, value) pairs, and any lookup compares the keys of each list entry until it finds a match. In pseudocode:
Array<List<Pair<Key, Value>>> buckets;
Value lookup(Key k_sought) {
int h = hash(k_sought);
List<Pair<Key, Value>> bucket = buckets[h];
for (kv in bucket) {
Key k_found = kv.0;
Value v_found = kv.1;
if (equal(k_sought, k_found)) {
return v_found;
}
}
throw Not_found;
}
You can not tell what a key is from the index, so no you can not iterate over the values to find any information about the keys. You will either have to guarantee 0 collisions or store the information that was hashed to give the index.
If you only have values stored in your structure, there is no way to tell if they have the same key or just the same hash. You will need to store the key along with the value to know.

Cuckoo Hashing: What is the best way to detect collisions in hash functions?

I implemented a hashmap based on cuckoo hashing.
My hash functions take values of any length and return keys of type long. To match the keys to my array size n, I do key % n.
I'm thinking about following scenario:
Insert value A with key A.key into location A.key % n
Find value B with key A.key
So for this example I get the entry for value A and it is not recognized that value B hasn't even been inserted. This happens if my hash function returns the same key for two different values. Collisions with different keys but same locations are no problem.
What is the best way to detect those collisions?
Do I have to check every time I insert or search an item if the original values are equal?
As with most hashing schemes, in cuckoo hashing, the hash code tells you where to look in the table for the element in question, but the expectation is that you store both the key and the value in the table so that before returning the stored value, you first check the key stored at that slot against the key you're looking for. That way, if you get the same hash code for two objects, you can determine which object was stored at that slot.