How to get permission levels for Files/Folders/Documents in SharePoint using REST endpoints?

I am using the following REST endpoint while trying to get permission levels for a particular file/document:
https://<web url>/_api/Web/GetFileByServerRelativeUrl('<path to the file>')/getlimitedwebpartmanager(scope=1 or 0)
I am able to get hold of the file successfully. But how should I get hold of the permission levels now?

What I do to get the permission levels is use the "$expand" query parameter set to "ListItemAllFields/RoleAssignments/XXXX", where XXXX is Member, RoleDefinitionBindings, and so forth, expanding the chain as far as I need to get the user and permission level info. For example,
.../GetFileByServerRelativeUrl('')?$expand=ListItemAllFields/RoleAssignments/Member,ListItemAllFields/RoleAssignments/RoleDefinitionBindings,ListItemAllFields/RoleAssignments/Member/Users
It took a lot of web searching to figure out that the "$expand" query parameter even existed, but it's very useful for getting all the info needed in one GET. I subsequently add a "$select" parameter to the query to filter out just the pieces that my application uses.
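For instance, a combined query might look along these lines (the $select fields shown here are only an illustration of the pattern, not taken from my actual application; substitute whatever fields you need):
.../GetFileByServerRelativeUrl('')?$expand=ListItemAllFields/RoleAssignments/Member,ListItemAllFields/RoleAssignments/RoleDefinitionBindings&$select=ListItemAllFields/RoleAssignments/Member/Title,ListItemAllFields/RoleAssignments/RoleDefinitionBindings/Name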
Edit:
Also look at Radu's answer and my follow-up answer below. The answer to the question depends on what you're trying to accomplish. To quote part of my answer:
If you're dredging for all permissions held by all users, use my approach. However, you'll need sufficient rights to get that information. If you want to know the base permissions of the user you're using to make the request, use his approach. I have both use cases in my work so I, in fact, do both.

The accepted solution didn't work for me. It seems that a user doesn't by default have permission to see RoleAssignments. But here's what I ended up doing.
I grabbed the effectiveBasePermissions by getting a response from
/_api/web/getFileByServerRelativeUrl('my/relative/path')/ListItemAllFields/effectiveBasePermissions
(I think this works for any item/folder in SharePoint by appending /ListItemAllFields/effectiveBasePermissions to the URL.)
So, this returns the permission that the current user has for that file/item. The Permissions "object" is a set of 35 flags which are encoded in 64 bits (not all 64 bits are used - only 35). And these bits are provided by the endpoint as two integers (32 bits each) named:
Low - representing the first 32 bits
High - the next 32 bits in sequence, up to 64
Now, to see if the user has, for instance, editing permissions on that file, we need to look at the corresponding bit in this Low-High sequence. You can find what each bit in the sequence means, below:
ViewListItems = 1,
AddListItems = 2,
EditListItems = 3,
DeleteListItems = 4,
ApproveItems = 5,
OpenItems = 6,
ViewVersions = 7,
DeleteVersions = 8,
CancelCheckout = 9,
ManagePersonalViews = 10,
ManageLists = 12,
ViewFormPages = 13,
AnonymousSearchAccessList = 14,
Open = 17,
ViewPages = 18,
AddAndCustomizePages = 19,
ApplyThemeAndBorder = 20,
ApplyStyleSheets = 21,
ViewUsageData = 22,
CreateSSCSite = 23,
ManageSubwebs = 24,
CreateGroups = 25,
ManagePermissions = 26,
BrowseDirectories = 27,
BrowseUserInfo = 28,
AddDelPrivateWebParts = 29,
UpdatePersonalWebParts = 30,
ManageWeb = 31,
AnonymousSearchAccessWebLists = 32,
UseClientIntegration = 37,
UseRemoteAPIs = 38,
ManageAlerts = 39,
CreateAlerts = 40,
EditMyUserInfo = 41,
EnumeratePermissions = 63
As you can see, some bits don't mean anything (like bits 11, 33, 34, etc.).
So, we need to look at bit number 3 (for editing) from the first 32 bits to tell whether the user can edit this file/item/folder; we can just ignore the High bits. We must make a bitwise comparison with the integer which, in binary, has only the 3rd bit set to 1 (the rest being 0). In this case that is 100 (the binary representation of 4). There is a formula for this, actually: binaryNr = 2^(bitIndex - 1). For our example, bitIndex is 3.
Now that we have the binaryNr we use it as a mask to find out if the third bit in the low sequence is 0 or 1:
hasEditingPermission = (binaryNr | LowSequence) == LowSequence
and here's a handy pseudocode function for the whole thing (^ is the power operator):
function hasPermission(low, high, bitIndex) {
    var sequence = low;
    if (bitIndex > 32) {
        sequence = high;
        bitIndex -= 32;
    }
    return ((2^(bitIndex - 1)) | sequence) == sequence
}

Radu made a great point in his answer for a different use case than I had given in my answer. If you're dredging for all permissions held by all users, use my approach. However, you'll need sufficient rights to get that information. If you want to know the base permissions of the user you're using to make the request, use his approach. I have both use cases in my work so I, in fact, do both.
However, as I commented to Radu in his answer, I had some trouble using his hasPermission() function. I'm adding a new answer here to provide an example of why it didn't work for me:
I'm definitely not an expert bit twiddler but, in Java at least, 1 << (bitIndex-1) is not equivalent to 2 ^ (bitIndex-1), contrary to what Radu asserted in his comment. Perhaps the overall expressions were intended to accomplish the same thing, so here's an example of where I discovered the XOR approach didn't work for me.
Example:
The permissions held by the user correspond to the permission level "Limited Access" (low = 134287360). I want to check whether the user has the "Open" permission, bit 17. In binary, the low value (which becomes "sequence" in hasPermission()) is 1000000000010001000000000000. As you will see, bit 17 is set. Running the expression ((2^(bitIndex - 1)) | sequence) yields 1000000000010001000000010010, which obviously does not equal sequence, as required to get a true response from hasPermission().
So, not understanding or knowing for sure exactly what was intended by Radu's XOR approach, I decided to use a more straightforward, less obtuse way of testing for the presence of a bit. Like so: return (sequence & (1 << (bitIndex - 1))) != 0; Taking 1, shifting it bitIndex-1 places to the left, and then doing a bitwise AND not only makes it obvious what I'm trying to accomplish, it also works in every case I've tested.
Not being a bit twiddler as I said, I briefly considered whether there might be problems with signed values etc. (I'm using ints for high and low) but I don't think my approach really would suffer from any of that since I'm not shifting the ints themselves and the remainder of my logic is simply bitwise.
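For reference, here is a minimal sketch of that AND-mask test in Python (my own translation of the approach; the function and variable names are just illustrative):

def has_permission(low, high, bit_index):
    sequence = low
    if bit_index > 32:          # bits 33..64 live in the High integer
        sequence = high
        bit_index -= 32
    # build a mask with only bit number bit_index set, then test it with AND
    return (sequence & (1 << (bit_index - 1))) != 0

# "Limited Access" example from above: low = 134287360, "Open" is bit 17
print(has_permission(134287360, 0, 17))  # True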

$.ajax({
    url: _spPageContextInfo.webAbsoluteUrl + "/_api/web/getFileByServerRelativeUrl('<your file or folder url>')/ListItemAllFields/effectiveBasePermissions",
    type: "GET",
    headers: { Accept: "application/json;odata=verbose" },
    success: function (data) {
        var permissions = new SP.BasePermissions();
        permissions.initPropertiesFromJson(data.d.EffectiveBasePermissions);
        var hasPermissions = permissions.has(SP.PermissionKind.<any level to check>);
        // for example: var bool_has_editListItems = permissions.has(SP.PermissionKind.editListItems);
    }
});
To check more levels, see the SP.PermissionKind list.

Related

Transferring arrays/classes/records between locales

In a typical N-Body simulation, at the end of each epoch, each locale would need to share its own portion of the world (i.e. all bodies) to the rest of the locales. I am working on this with a local-view approach (i.e. using on Loc statements). I encountered some strange behaviours that I couldn't make sense out of, so I decided to make a test program, in which things got more complicated. Here's the code to replicate the experiment.
proc log(args...?n) {
    writeln("[locale = ", here.id, "] [", datetime.now(), "] => ", args);
}

const max: int = 50000;

record stuff {
    var x1: int;
    var x2: int;
    proc init() {
        this.x1 = here.id;
        this.x2 = here.id;
    }
}

class ctuff {
    var x1: int;
    var x2: int;
    proc init() {
        this.x1 = here.id;
        this.x2 = here.id;
    }
}

class wrapper {
    // The point is that the total size (in bytes) of the data in `r`, `c` and `a`
    // is the same here, because the record and the class hold two ints per index.
    var r: [{1..max / 2}] stuff;
    var c: [{1..max / 2}] owned ctuff?;
    var a: [{1..max}] int;
    proc init() {
        this.a = here.id;
    }
}

proc test() {
    var wrappers: [LocaleSpace] owned wrapper?;
    coforall loc in LocaleSpace {
        on Locales[loc] {
            wrappers[loc] = new owned wrapper();
        }
    }
    // rest of the experiment further down.
}
Two interesting behaviours happen here.
1. Moving data
Now, each instance of wrapper in the array wrappers should live in its own locale. Specifically, the references (wrappers) will live in locale 0, but the internal data (r, c, a) should live in the respective locale. So we try to move some of it from locale 1 to locale 3, like so:
on Locales[3] {
    var timer: Timer;
    timer.start();
    var local_stuff = wrappers[1]!.r;
    timer.stop();
    log("get r from 1", timer.elapsed());
    log(local_stuff);
}
on Locales[3] {
    var timer: Timer;
    timer.start();
    var local_c = wrappers[1]!.c;
    timer.stop();
    log("get c from 1", timer.elapsed());
}
on Locales[3] {
    var timer: Timer;
    timer.start();
    var local_a = wrappers[1]!.a;
    timer.stop();
    log("get a from 1", timer.elapsed());
}
Surprisingly, my timings show that:
Regardless of the size (const max), the time to send the array and the record stays constant, which doesn't make sense to me. I even checked with chplvis, and the size of the GET actually increases, but the time stays the same.
The time to send the class field increases with the size, which makes sense, but it is quite slow and I don't know which case to trust here.
2. Querying the locales directly.
To demystify the problem, I also query the .locale.id of some variables directly. First, we query the data, which we expect to live in locale 2, from locale 2:
on Locales[2] {
    var wrappers_ref = wrappers[2]!; // This is always 1 GET from 0, okay.
    log("array",
        wrappers_ref.a.locale.id,
        wrappers_ref.a[1].locale.id
    );
    log("record",
        wrappers_ref.r.locale.id,
        wrappers_ref.r[1].locale.id,
        wrappers_ref.r[1].x1.locale.id
    );
    log("class",
        wrappers_ref.c.locale.id,
        wrappers_ref.c[1]!.locale.id,
        wrappers_ref.c[1]!.x1.locale.id
    );
}
And the result is:
[locale = 2] [2020-12-26T19:36:26.834472] => (array, 2, 2)
[locale = 2] [2020-12-26T19:36:26.894779] => (record, 2, 2, 2)
[locale = 2] [2020-12-26T19:36:27.023112] => (class, 2, 2, 2)
Which is expected. Yet, if we query the locale of the same data on locale 1, then we get:
[locale = 1] [2020-12-26T19:34:28.509624] => (array, 2, 2)
[locale = 1] [2020-12-26T19:34:28.574125] => (record, 2, 2, 1)
[locale = 1] [2020-12-26T19:34:28.700481] => (class, 2, 2, 2)
Implying that wrappers_ref.r[1].x1 lives in locale 1, even though it should clearly be on locale 2. My only guess is that by the time .locale.id is executed, the data (i.e. the .x1 of the record) has already been moved to the querying locale (1).
So, all in all, the second part of the experiment led to a secondary question, whilst not answering the first part.
NOTE: all experiments are run with -nl 4 in the chapel/chapel-gasnet Docker image.
Good observations, let me see if I can shed some light.
As an initial note, any timings taken with the gasnet Docker image should be taken with a grain of salt since that image simulates the execution across multiple nodes using your local system rather than running each locale on its own compute node as intended in Chapel. As a result, it is useful for developing distributed memory programs, but the performance characteristics are likely to be very different than running on an actual cluster or supercomputer. That said, it can still be useful for getting coarse timings (e.g., your "this is taking a much longer time" observation) or for counting communications using chplvis or the CommDiagnostics module.
With respect to your observations about timings, I also observe that the array-of-class case is much slower, and I believe I can explain some of the behaviors:
First, it's important to understand that any cross-node communications can be characterized using a formula like alpha + beta*length. Think of alpha as representing the basic cost of performing the communication, independent of length. This represents the cost of calling down through the software stack to get to the network, putting the data on the wire, receiving it on the other side, and getting it back up through the software stack to the application there. The precise value of alpha will depend on factors like the type of communication, choice of software stack, and physical hardware. Meanwhile, think of beta as representing the per-byte cost of the communication where, as you intuit, longer messages necessarily cost more because there's more data to put on the wire, or potentially to buffer or copy, depending on how the communication is implemented.
In my experience, the value of alpha typically dominates beta for most system configurations. That's not to say that it's free to do longer data transfers, but that the variance in execution time tends to be much smaller for longer vs. shorter transfers than it is for performing a single transfer versus many. As a result, when choosing between performing one transfer of n elements vs. n transfers of 1 element, you'll almost always want the former.
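As a toy illustration of that point (Python, with made-up constants rather than anything measured on a real network):

# assumed, illustrative costs; real alpha and beta depend on the system
alpha = 1e-6    # per-message overhead, seconds
beta = 1e-9     # per-byte cost, seconds
n = 50_000      # elements of 8 bytes each, as in the example program

one_bulk_transfer = alpha + beta * (8 * n)   # pay alpha once, beta for every byte
n_small_transfers = n * (alpha + beta * 8)   # pay alpha n times
print(one_bulk_transfer)   # ~0.0004 s
print(n_small_transfers)   # ~0.05 s

With these (made-up) constants, the n single-element transfers come out over a hundred times slower, even though the same number of bytes moves in both cases.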
To investigate your timings, I bracketed your timed code portions with calls to the CommDiagnostics module as follows:
resetCommDiagnostics();
startCommDiagnostics();
...code to time here...
stopCommDiagnostics();
printCommDiagnosticsTable();
and found, as you did with chplvis, that the number of communications required to localize the array of records or array of ints was constant as I varied max, for example:
locale | get | execute_on
-------|-----|-----------
0      | 0   | 0
1      | 0   | 0
2      | 0   | 0
3      | 21  | 1
This is consistent with what I'd expect from the implementation: That for an array of value types, we perform a fixed number of communications to access array meta-data, and then communicate the array elements themselves in a single data transfer to amortize the overheads (avoid paying multiple alpha costs).
In contrast, I found that the number of communications for localizing the array of classes was proportional to the size of the array. For example, for the default value of 50,000 for max, I saw:
locale | get   | put   | execute_on
-------|-------|-------|-----------
0      | 0     | 0     | 0
1      | 0     | 0     | 0
2      | 0     | 0     | 0
3      | 25040 | 25000 | 1
I believe the reason for this distinction relates to the fact that c is an array of owned classes, in which only a single class variable can "own" a given ctuff object at a time. As a result, when copying the elements of array c from one locale to another, you're not just copying raw data, as with the record and integer cases, but also performing an ownership transfer per element. This essentially requires setting the remote value to nil after copying its value to the local class variable. In our current implementation, this seems to be done using a remote get to copy the remote class value to the local one, followed by a remote put to set the remote value to nil, hence, we have a get and put per array element, resulting in O(n) communications rather than O(1) as in the previous cases. With additional effort, we could potentially have the compiler optimize this case, though I believe it will always be more expensive than the others due to the need to perform the ownership transfer.
I tested the hypothesis that owned classes were resulting in the additional overhead by changing your ctuff objects from being owned to unmanaged, which removes any ownership semantics from the implementation. When I do this, I see a constant number of communications, as in the value cases:
locale | get | execute_on
-------|-----|-----------
0      | 0   | 0
1      | 0   | 0
2      | 0   | 0
3      | 21  | 1
I believe this represents the fact that once the language has no need to manage the ownership of the class variables, it can simply transfer their pointer values in a single transfer again.
Beyond these performance notes, it's important to understand a key semantic difference between classes and records when choosing which to use. A class object is allocated on the heap, and a class variable is essentially a reference or pointer to that object. Thus, when a class variable is copied from one locale to another, only the pointer is copied, and the original object remains where it was (for better or worse). In contrast, a record variable represents the object itself, and can be thought of as being allocated "in place" (e.g., on the stack for a local variable). When a record variable is copied from one locale to the other, it's the object itself (i.e., the record's fields' values) which are copied, resulting in a new copy of the object itself. See this SO question for further details.
Moving on to your second observation, I believe that your interpretation is correct, and that this may be a bug in the implementation (I need to stew on it a bit more to be confident). Specifically, I think you're correct that what's happening is that wrappers_ref.r[1].x1 is being evaluated, with the result being stored in a local variable, and that the .locale.id query is being applied to the local variable storing the result rather than the original field. I tested this theory by taking a ref to the field and then printing locale.id of that ref, as follows:
ref x1loc = wrappers_ref.r[1].x1;
...x1loc.locale.id...
and that seemed to give the right result. I also looked at the generated code which seemed to indicate that our theories were correct. I don't believe that the implementation should behave this way, but need to think about it a bit more before being confident. If you'd like to open a bug against this on Chapel's GitHub issues page, for further discussion there, we'd appreciate that.

Heart Rate Value in BLE

I am having a hard time getting a valid value out of the HR characteristics. I am clearly not handling the values properly in Dart.
Example Data:
List<int> value = [22, 56, 55, 4, 7, 3];
Flags Field:
I convert the first item in the main byte array to binary to get the flags
22 = 10110 (as binary)
this leads me to believe that it is U16 (bit[0] is == 1)
HR Value:
Because it is 16 bit, I am trying to get the bytes at indexes 1 and 2. I then try to buffer them into a ByteData. From there I convert them to Uint16 with the Endian set to Little. This is giving me a value of 14136. Clearly I am missing something fundamental about how this is supposed to work.
Any help in clearing up what I am not understanding about how to process the 16 bit BLE values would be much appreciated.
Thank you.
/*
 * Constructor - constructs the heart rate value from a BLE message
 */
HeartRate(List<int> values) {
  var flags = values[0];
  var s = flags.toRadixString(2);
  List<String> flagsArray = s.split("");
  int offset = 0;
  // Determine whether it is U16 or not
  if (flagsArray[0] == "0") {
    // Since it is Uint8 I will only get the first value
    var hr = values[1];
    print(hr);
  } else {
    // Since UTF 16 is two bytes I need to combine them
    // Create a buffer with the first two bytes after the flags
    var buffer = new Uint8List.fromList(values.sublist(1, 3)).buffer;
    var hrBuffer = new ByteData.view(buffer);
    var hr = hrBuffer.getUint16(0, Endian.little);
    print(hr);
  }
}
Your updated data looks much better. Here's how to decode it, and the process you'd use to figure this out yourself from scratch.
Determine the format
The Bluetooth site has been reorganized recently (~2020), and in particular they got rid of some of the document viewers, which makes things much harder to find and read IMO. All the documentation is in the Heart Rate Service (HRS) document, linked from the main GATT page, but for just parsing the format, the best source I know of is the XML for org.bluetooth.characteristic.heart_rate_measurement. (Since the reorganization, I don't know how you can find this page without searching for it. It doesn't seem to be linked anymore.)
Byte 0 - Flags: 22 (0001 0110)
Bits are numbered from LSB (0) to MSB (7).
Bit 0 - Heart Rate Value Format: 0 => UINT8 beats per minute
Bit 1-2 - Sensor Contact Status: 11 => Supported and detected
Bit 3 - Energy Expended Status: 0 => Not present
Bit 4 - RR-Interval: 1 => One or more values are present
The meaning of RR-intervals is explained in the HRS document, linked above. It sounds like you just want the heart rate value, so I won't go into them here.
Byte 1 - UINT8 BPM: 56
Since Bit 0 of flags was 0, this is the beats per minute. 56.
Bytes 2-5 - UINT16 RR Intervals: 55, 4, 7, 3
You probably don't care about these, but there are two UINT16 values here (there can be an arbitrary number of RR-Interval values). BLE is always little-endian, so [55, 4] is 1,079 (55 + (4 << 8)), and [7, 3] is 775 (7 + (3 << 8)).
I believe the docs are a little confusing on this one. The XML suggests that these values are in seconds, but the comments say "Resolution of 1/1024 second." The normal way to express this would be <BinaryExponent>-10</BinaryExponent>, and I'm certain that's what they meant. So these would be:
RR0: 1.05s (1079/1024)
RR1: 0.76s (775/1024)
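If it helps, here is a minimal decoding sketch in Python for the example packet, following the layout above (my own illustration, not library code; it skips the Energy Expended field since bit 3 is 0 in this packet):

def parse_heart_rate(data):
    flags = data[0]
    hr_is_uint16 = flags & 0x01         # bit 0: 0 => UINT8 BPM, 1 => UINT16 BPM
    rr_present = (flags >> 4) & 0x01    # bit 4: RR-Intervals present
    offset = 1
    if hr_is_uint16:
        bpm = data[offset] | (data[offset + 1] << 8)  # little-endian UINT16
        offset += 2
    else:
        bpm = data[offset]
        offset += 1
    rr_intervals = []
    if rr_present:
        while offset + 1 < len(data):
            raw = data[offset] | (data[offset + 1] << 8)
            rr_intervals.append(raw / 1024)  # resolution of 1/1024 second
            offset += 2
    return bpm, rr_intervals

print(parse_heart_rate([22, 56, 55, 4, 7, 3]))  # (56, [1.0537109375, 0.7568359375])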

Does a function cache exist in Matlab?

In Python we have lru_cache as a function wrapper. Add it to your function and the function will only be evaluated once per different input argument.
Example (from Python docs):
@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

>>> [fib(n) for n in range(16)]
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610]
>>> fib.cache_info()
CacheInfo(hits=28, misses=16, maxsize=None, currsize=16)
I wonder whether a similar thing exists in Matlab? At the moment I am using cache files, like so:
function result = fib(n)
% FIB example like the Python example. Don't implement it like that!
cachefile = ['fib_', num2str(n), '.mat'];
try
    load(cachefile);
catch e
    if n < 2
        result = n;
    else
        result = fib(n-1) + fib(n-2);
    end
    save(cachefile, 'result');
end
end
The problem I have with doing it this way, is that if I change my function, I need to delete the cachefile.
Is there a way to do this where MATLAB realises when I have changed the function, so that the cache is invalidated?
Since MATLAB R2017a, this is available:
https://nl.mathworks.com/help/matlab/ref/memoizedfunction.html
a = memoize(@sin)
I've created something like this for my own personal use: a CACHE class. (I haven't documented the code yet though.) It appears to be more flexible than Python's lru_cache (I wasn't aware of that, thanks) in that it has several methods for adjusting exactly what gets cached (to save memory) and how the comparisons are made. It could still use some refinement (@Daniel's suggestion to use the containers.Map class is a good one – though it would limit compatibility with old Matlab versions). The code is on GitHub so you're welcome to fork and improve it.
Here is a basic example of how it can be used:
function Output1 = CacheDemo(Input1,Input2)
persistent DEMO_CACHE
if isempty(DEMO_CACHE)
    % Initialize cache object on first run
    CACHE_SIZE = 10; % Number of input/output patterns to cache
    DEMO_CACHE = CACHE(CACHE_SIZE,Input1,Input2);
    CACHE_IDX = 1;
else
    % Check if input pattern corresponds to something stored in cache
    % If not, return next available CACHE_IDX
    CACHE_IDX = DEMO_CACHE.IN([],Input1,Input2);
    if ~isempty(CACHE_IDX) && DEMO_CACHE.OUT(CACHE_IDX) > 0
        [~,Output1] = DEMO_CACHE.OUT(CACHE_IDX);
        return;
    end
end
% Perform computation
Output1 = rand(Input1,Input2);
% Save output to cache at CACHE_IDX
DEMO_CACHE.OUT(CACHE_IDX,Output1);
I created this class to cache the results from time-consuming stochastic simulations and have since used it to good effect in a few other places. If there is interest, I might be willing to spend some time documenting the code sooner as opposed to later. It would be nice if there was a way to limit memory use as well (a big consideration in my own applications), but getting the size of arbitrary Matlab datatypes is not trivial. I like your idea of caching to a file, which might be a good idea for larger data. Also, it might be nice to create a "lite" version that does what Python's lru_cache does.

How to select the last column of numbers from a table created by FoldList in Mathematica

I am new to Mathematica and I am having difficulties with one thing. I have this Table that generates 13 numbers (12 numbers plus 1 starting number) 10,000 times. I need to create a Histogram from all 10,000 of the 13th numbers. I hope it's quite clear; it's quite tricky to explain.
This is the table:
F = Table[(Xi = RandomVariate[NormalDistribution[], 12];
Mu = -0.00644131;
Sigma = 0.0562005;
t = 1/12; s = 0.6416;
FoldList[(#1*Exp[(Mu - Sigma^2/2)*t + Sigma*Sqrt[t]*#2]) &, s,
Xi]), {SeedRandom[2]; 10000}]
The result for the histogram could be a table that collects all the 13th numbers into one table - then it would be quite easy to create a histogram. Maybe with Select? Or maybe you know other ways to solve this.
You can access different parts of a list using Part or (depending on what parts you need) some of the more specialised commands, such as First, Rest, Most and (the one you need) Last. As noted in the comments, Histogram[Last /@ F] or Histogram[F[[All, -1]]] will work fine.
Although it wasn't part of your question, I would like to note some things you could do for your specific problem that will speed it up enormously. You are defining Mu, Sigma, etc. 10,000 times, because they are inside the Table command. You are also recalculating (Mu - Sigma^2/2)*t + Sigma*Sqrt[t] 120,000 times, even though it is a constant, because you have it inside the FoldList inside the Table.
On my machine:
F = Table[(Xi = RandomVariate[NormalDistribution[], 12];
Mu = -0.00644131;
Sigma = 0.0562005;
t = 1/12; s = 0.6416;
FoldList[(#1*Exp[(Mu - Sigma^2/2)*t + Sigma*Sqrt[t]*#2]) &, s,
Xi]), {SeedRandom[2]; 10000}]; // Timing
{4.19049, Null}
This alternative is ten times faster:
F = Module[{Xi, beta}, With[{Mu = -0.00644131, Sigma = 0.0562005,
t = 1/12, s = 0.6416},
beta = (Mu - Sigma^2/2)*t + Sigma*Sqrt[t];
Table[(Xi = RandomVariate[NormalDistribution[], 12];
FoldList[(#1*Exp[beta*#2]) &, s, Xi]), {SeedRandom[2];
10000}] ]]; // Timing
{0.403365, Null}
I use With for the local constants and Module for the things that are redefined within the Table (Xi) or are calculations based on the local constants (beta). This question on the Mathematica StackExchange will help explain when to use Module versus Block versus With. (I encourage you to explore the Mathematica StackExchange further, as this is where most of the Mathematica experts are hanging out now.)
For your specific code, the use of Part isn't really required. Instead of using FoldList, just use Fold. It only retains the final number in the folding, which is identical to the last number in the output of FoldList. So you could try:
FF = Module[{Xi, beta}, With[{Mu = -0.00644131, Sigma = 0.0562005,
t = 1/12, s = 0.6416},
beta = (Mu - Sigma^2/2)*t + Sigma*Sqrt[t];
Table[(Xi = RandomVariate[NormalDistribution[], 12];
Fold[(#1*Exp[beta*#2]) &, s, Xi]), {SeedRandom[2];
10000}] ]];
Histogram[FF]
Calculating FF in this way is even a little faster than the previous version. On my system Timing reports 0.377 seconds - but such a difference from 0.4 seconds is hardly worth worrying about.
Because you are setting the seed with SeedRandom, it is easy to verify that all three code examples produce exactly the same results.
Making my comment an answer:
Histogram[Last /@ F]

perfect hash function

I'm attempting to hash the values
10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0
I need a function that will map them to an array that has a size of 13 without causing any collisions.
I've spent several hours thinking this over and googling and can't figure this out. I haven't come close to a viable solution.
How would I go about finding a hash function of this sort? I've played with gperf, but I don't really understand it and I couldn't get the results I was looking for.
If you know the exact keys, then it is trivial to produce a perfect hash function:
int hash(int n) {
    switch (n) {
        case 10: return 0;
        case 100: return 1;
        case 32: return 2;
        // ...
        default: return -1;
    }
}
Found One
I tried a few things and found one semi-manually:
(n ^ 28) % 13
The semi-manual part was the following ruby script that I used to test candidate functions with a range of parameters:
t = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]
(1..200).each do |i|
  t2 = t.map { |e| (e ^ i) % 13 }
  puts i if t2.uniq.length == t.length
end
On some platforms (e.g. embedded), the modulo operation is expensive, so % 13 is better avoided. But an AND operation on the low-order bits is cheap, and equivalent to modulo by a power of 2.
I tried writing a simple program (in Python) to search for a perfect hash of your 11 data points, using simple forms such as ((x << a) ^ (x << b)) & 0xF (where & 0xF is equivalent to % 16, giving a result in the range 0..15, for example). I was able to find the following collision-free hash which gives an index in the range 0..15 (expressed as a C macro):
#define HASH(x) ((((x) << 2) ^ ((x) >> 2)) & 0xF)
Here is the Python program I used:
data = [ 10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0 ]
def shift_right(value, shift_value):
"""Shift right that allows for negative values, which shift left
(Python shift operator doesn't allow negative shift values)"""
if shift_value == None:
return 0
if shift_value < 0:
return value << (-shift_value)
else:
return value >> shift_value
def find_hash():
def hashf(val, i, j = None, k = None):
return (shift_right(val, i) ^ shift_right(val, j) ^ shift_right(val, k)) & 0xF
for i in xrange(-7, 8):
for j in xrange(i, 8):
#for k in xrange(j, 8):
#j = None
k = None
outputs = set()
for val in data:
hash_val = hashf(val, i, j, k)
if hash_val >= 13:
pass
#break
if hash_val in outputs:
break
else:
outputs.add(hash_val)
else:
print i, j, k, outputs
if __name__ == '__main__':
find_hash()
Bob Jenkins has a program for this too: http://burtleburtle.net/bob/hash/perfect.html
Unless you're very lucky, there's no "nice" perfect hash function for a given dataset. Perfect hashing algorithms usually use a simple hashing function on the keys (using enough bits so it's collision-free) then use a table to finish it off.
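For instance, here is a minimal sketch in Python of that two-step idea for the keys in this question (the modulus 33 simply happens to be collision-free on this particular key set, and the table layout is just illustrative):

keys = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]

M = 33  # a modulus that happens to give no collisions for these keys
assert len({k % M for k in keys}) == len(keys)

# second step: a table compresses the sparse values 0..32 into dense indices 0..10
table = [-1] * M
for index, k in enumerate(keys):
    table[k % M] = index

def perfect_hash(x):
    return table[x % M]  # dense index of x; -1 for a non-key that hits an empty slot

print([perfect_hash(k) for k in keys])  # [0, 1, 2, ..., 10]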
Just some quasi-analytical ramblings:
In your set of numbers, eleven in all, three are odd and eight are even.
Looking at the simplest forms of hashing - %13 - will give you the following hash values:
10 - 10,
100 - 9,
32 - 6,
45 - 6,
58 - 6,
126 - 9,
3 - 3,
29 - 3,
200 - 5,
400 - 10,
0 - 0
Which, of course, is unusable due to the number of collisions. Something more elaborate is needed.
Why state the obvious?
Considering that the numbers are so few, any elaborate - or rather, "less simple" - algorithm will likely be slower than either the switch statement or (which I prefer) simply searching through an unsigned short/long vector of eleven positions and using the index of the match.
Why use a vector search?
You can fine-tune it by placing the most often occurring values towards the beginning of the vector.
I assume the purpose is to plug in the hash index into a switch with nice, sequential numbering. In that light it seems wasteful to first use a switch to find the index and then plug it into another switch. Maybe you should consider not using hashing at all and go directly to the final switch?
The switch version of hashing cannot be fine-tuned and, due to the widely differing values, will cause the compiler to generate a binary search tree which will result in a lot of comparisons and conditional/other jumps (especially costly) which take time (I've assumed you've turned to hashing for its speed) and require space.
If you want to speed up the vector search further and are using an x86 system, you can implement a vector search based on the assembler instructions repne scasw (short)/repne scasd (long), which will be much faster. After a setup time of a few instructions you will find the first entry in one instruction and the last in eleven, followed by a few instructions of cleanup. This means 5-10 instructions best case and 15-20 worst. This should beat switch-based hashing in all but maybe one or two cases.
I did a quick check, and using the SHA256 hash function and then doing modular division by 13 worked when I tried it in Mathematica. For C++, this function should be in the OpenSSL library. See this post.
If you were doing a lot of hashing and lookup though, modular division is a pretty expensive operation to do repeatedly. There is another way of mapping an n-bit hash value into i-bit indices. See this post by Michael Mitzenmacher about how to do it with a bit shift operation in C. Hope that helps.
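For illustration, that multiply-and-shift mapping looks roughly like this in Python (my sketch of the general idea, not code from that post):

def map_to_range(h, n_buckets, hash_bits=32):
    # treats h as a hash_bits-bit fixed-point fraction and scales it into
    # 0..n_buckets-1, replacing the modulo with a multiply and a shift
    return (h * n_buckets) >> hash_bits

print(map_to_range(0xDEADBEEF, 13))  # 11, some bucket in 0..12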
Try the following, which maps your 11 values to unique indices between 0 and 12:
(1369%(n+1))%13
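A quick Python check (in the spirit of the Ruby search earlier in the thread) confirms that this maps the 11 keys to distinct indices:

t = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]
indices = [(1369 % (n + 1)) % 13 for n in t]
print(indices)                      # [5, 4, 3, 9, 12, 8, 1, 6, 7, 10, 0]
print(len(set(indices)) == len(t))  # True: no collisions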