Terminology for creating an ID using the position in a graph (adjacency data, hashing)?

Is there a term for the process of creating an ID from connectivity information, so that elements can be matched based on their neighbors?
This can be as simple as looping over the items in an array and accumulating each item's next and previous neighbors (possibly with bit-shifting, XOR, etc.).
Another example is an order-independent hash over the nodes connected by edges in a graph.
I've used this multiple times, but I don't know if there is a term for it.
Typically the process starts by assigning an ID to each element (often a number created by hashing the element's contents).
Then iterate:
- Store a copy of all IDs, to prevent reading values modified during the current pass.
- Loop over each element and create a new ID from its own value combined with the IDs of its connected elements.
With each iteration, the range of influence elements have on each other increases, following a triangle-number sequence.
Is there a term for each iteration?
Is there a term for this entire process?
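For concreteness, here is a minimal sketch of the kind of thing I mean, in Python (the mixing scheme and names are arbitrary, not a known standard):

```python
def refine_ids(values, neighbors, iterations):
    """Iteratively mix each element's ID with the IDs of its neighbors.

    values:    list of hashable element contents
    neighbors: neighbors[i] is a list of indices adjacent to element i
    """
    ids = [hash(v) for v in values]        # initial ID from the contents
    for _ in range(iterations):
        snapshot = list(ids)               # copy, so a pass never reads
                                           # values it has just written
        for i, adjacent in enumerate(neighbors):
            combined = 0
            for j in adjacent:             # XOR is order-independent, so
                combined ^= snapshot[j]    # edge order doesn't matter
            ids[i] = hash((snapshot[i], combined))
    return ids
```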

Questions about LSH (locality-sensitive hashing) and minhashing implementation

I'm trying to implement this paper:
Browser Fingerprint Coding Methods Increasing the Effectiveness of User Identification in the Web Traffic
I have a few questions about the LSH algorithm in general and about the proposed implementation:
1. The LSH algorithm is used only when you have a lot of documents to compare with each other (because it is supposed to put the similar ones in the same bucket, from what I gathered). If, for example, I have a new document and I want to calculate its similarity to the others, do I have to re-run the LSH algorithm from scratch, including the new document?
2. In 'Mining of Massive Datasets', Ch. 3, it is said that for LSH we should use one hash function per band. Each hash function creates n buckets. So, for the first band, we are going to have n buckets. For the second band onward, am I supposed to keep using the same hash function (so that I keep using the same buckets as before) or a different one (ending up with m >> n buckets)?
3. This question is related to the previous one. If I use the same hash function for all the bands, then I'll have n buckets. No problem there. But if I have to use different hash functions (one per band), I'm going to end up with a lot of different buckets. Am I supposed to measure the similarity for each pair in each bucket? (If I only have to use one hash function, this isn't a problem.)
4. In the paper, I understood most of the algorithm except for its ending. Basically, two signature matrices are created (one for stable features and one for unstable features) via minhashing. Then LSH is applied to the first matrix to obtain a list of candidate pairs. So far so good. What happens at the end? Do they perform LSH on the second matrix? How is the result of the first LSH pass used? I cannot see the relationship between the first and the second LSH passes. The output of the final step is supposed to be a list of candidate pairs, right? And all I have to do is compute the Jaccard similarity for them and apply a threshold, right?
Thanks for your answers!
I got a partial answer to my question (still missing an answer to question 4):
1. No. You would keep the bucket structure and hash the new doc into it. Then compare it with only those docs in one of the buckets it fell into.
2. No. You HAVE to use different hash functions and a different set of buckets for each hash function.
3. This is irrelevant because of the answer to (2).
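To make sure I understand the banding mechanics, here is how I picture it in Python, assuming the minhash signatures are already computed (the class and names are mine, not the paper's):

```python
import hashlib
from collections import defaultdict

def band_key(signature, band, rows_per_band):
    """Hash one band of a signature; mixing in the band index means each
    band effectively gets its own hash function and its own bucket space."""
    start = band * rows_per_band
    chunk = tuple(signature[start:start + rows_per_band])
    return hashlib.sha1(repr((band, chunk)).encode()).hexdigest()

class LSHIndex:
    def __init__(self, bands, rows_per_band):
        self.bands = bands
        self.rows = rows_per_band
        # one bucket table per band (answer 2)
        self.tables = [defaultdict(list) for _ in range(bands)]

    def add(self, doc_id, signature):
        for b in range(self.bands):
            self.tables[b][band_key(signature, b, self.rows)].append(doc_id)

    def candidates(self, signature):
        """Hash a new signature into the existing buckets (no rebuild
        needed, answer 1) and return docs sharing at least one bucket."""
        out = set()
        for b in range(self.bands):
            out.update(self.tables[b].get(band_key(signature, b, self.rows), ()))
        return out
```

If that is right, the candidate pairs coming out of such an index would then be checked with the actual Jaccard similarity against a threshold, which is what I assumed in question 4.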

Show number of elements in multiple sets in a chart

I created about 10 sets from my Tableau data. I want to show the number of elements in all the sets in one chart, for example a bubble chart or a bar chart. When I move a single set onto the sheet, select Number of Records, and filter to the 'In' elements, I can see the number of elements in that set; however, I want to see the number of records in multiple sets simultaneously.
When I try to put multiple sets on, for example, a bubble chart, Tableau creates one single bubble instead of multiple bubbles.
Sets are very useful, but they may not be the best approach when you have a large number of similar groupings to compare side by side, if you are using them as dimensions.
Remember that the purpose of dimensions is to partition your data into non-overlapping blocks prior to aggregating measures. Since a data row may belong to multiple sets, using sets as dimensions doesn't fit the application you describe (but using sets as filters or as building blocks for calculations might).
So here is one approach that will give you some flexibility. Define a calculated field for each set that returns 1 if the record is in the set and null otherwise (one way to think of a set is as a boolean function). For Set 1:
Number of Set 1 Records:
if [Set_1] then 1 end
Then you can use SUM([Number of Set 1 Records]) as a measure as desired, and use Measure Values to display multiple such measures together.
This way your set definitions are used for calculating your measures, but not for partitioning the data rows.
If your sets are completely defined by a condition, and this is the only way you use them, you could simplify by putting the condition directly in the calculated fields above and not creating the corresponding sets at all.
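For example, if Set 1 were defined by a condition such as [Sales] > 1000 (a placeholder condition, purely for illustration), the calculated field could skip the set entirely:
if [Sales] > 1000 then 1 end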

How are hash tables linear on items that hashed to the same or a colliding value?

I was looking at this StackOverflow answer to understand hashing better, and saw the following (regarding the fact that we would need to get a bucket's size in constant time):
if you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though--it's linear on the number of items that hashed to the same or a colliding value.
What does it mean that it's "linear on the number of items that hashed to the same or a colliding value"? Wouldn't it be linear in the total number of items in the hash table, since it's possible that it will need to walk through every value during linear probing? I don't see why it would only have to go through the ones that collided.
For example, suppose I am using linear probing (step size 1) on a hash table, and I have keys that all hash to unique values mapping to the odd index slots 1, 3, 5, 7, 9, and so on. Then I insert many keys that all hash to 2, so I fill up all the even index slots with those keys. If I wanted to know how many keys hash to 2, wouldn't I need to go through the entire hash table? But then I'm not just iterating through items that hashed to the same or a colliding value, since the keys in the odd index slots don't collide with anything.
A hash table is conceptually similar to an array (the table) of linked lists (the buckets in the table). The difference is in how you manage and access that array: a function generates a number that is used to compute the array index.
Once two elements are placed in the same bucket (the same computed value, i.e. a collision), the problem turns into a search in a list. The number of elements in that list is hopefully lower than the total number of elements in the hash table (meaning that other elements live in other buckets).
However, you are skipping the important introduction in that paragraph:
If you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though -- it's linear on the number of items that hashed to the same or a colliding value.
Linear probing is a different implementation of a hash table in which you don't use any list (chain) for your collisions. Instead, you find the nearest available slot in the array, starting from the expected position and moving forward. The more populated the array is, the higher the chance that the next position is in use too, so you just keep searching. Those positions are occupied by items that hashed to the same or a colliding value, although you never know (and don't really care) which of the two cases applies, unless you explicitly inspect the hash of the element stored there.
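A toy sketch of that behaviour (illustrative Python, not taken from the quoted answer):

```python
class LinearProbeTable:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [None] * capacity           # None marks an empty slot

    def insert(self, key):
        if all(s is not None for s in self.slots):
            raise RuntimeError("table full")
        i = hash(key) % self.capacity
        while self.slots[i] is not None:         # probe forward on collision
            i = (i + 1) % self.capacity
        self.slots[i] = key

    def count_hashed_to(self, home):
        """Count keys whose home slot is `home`. The walk must cover the
        whole contiguous run of non-empty slots (the cluster), because keys
        with *other* home slots can be interleaved within it."""
        i, n = home, 0
        while self.slots[i] is not None:
            if hash(self.slots[i]) % self.capacity == home:
                n += 1
            i = (i + 1) % self.capacity
        return n
```

In the question's example the odd and even slots have merged into one big cluster, so the walk really does visit nearly the whole table; that is the degenerate case where "same or colliding" covers every item.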
This CppCon presentation video gives a good introduction to, and an in-depth analysis of, hash tables.

Guidance on Merge Sort Algorithm

I'm currently working on a class assignment to create a merge sort algorithm using MIPS assembly language. I'll paste the instructions for the assignment to make sure my interpretation of what I'm supposed to do is correct.
The Instructions:
Convert "merge" in Assignment 3 (Assignment 3 is a merge algorithm that takes two ordered lists merges all elements of that list into one long ordered list, I've already completed this) into a subroutine. Write a "main" program to perform merge-sorting of a list of integers by calling "merge" repeatedly ( ** Im assuming this means calling my previous Assignment 3**). For example, if the sorting program takes (6,5,9,1,7,0,-3,2) as input, it will produce a sorted list (-3,0,1,2,4,6,7,9).
The original unsorted list of integers should be received from the keyboard input. Your program should first ask for the user to input the number of integers in the original list, and then ask for inputting all the integers. The total number of integers to be sorted by this program should be a power of 2. This means, the program should work with a list of 2, 4, 8, 16, or 32 (...) integers (but your program needs only to handle up to 32 integers).
Now, my merge algorithm takes two ordered lists but this assignment takes only 1 list. However, the link bellow explains merge sort in a way where the original unsorted list is progressively divided up into individual elements and then it works backwards and puts the elements in order. (Since my assignment 3 (the merge algorithm that i already have) takes two ordered lists would it be possible for me to simply do one iteration of my merge algorithm right after dividing the unsorted list into two unsorted lists??)
Basically, calling my merge algorithm at the second step in the bellow link.
https://www.tutorialspoint.com/data_structures_algorithms/merge_sort_algorithm.htm
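In Python terms, here is roughly the bottom-up structure I think the assignment is describing: treat each element as an ordered list of length 1 and merge adjacent runs in passes (a sketch only; my actual implementation has to be in MIPS):

```python
def merge(a, b):
    """Merge two already-ordered lists into one ordered list
    (the role played by the Assignment 3 subroutine)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def merge_sort_bottom_up(items):
    """Start from runs of length 1 (trivially ordered) and repeatedly
    merge adjacent runs. With a power-of-two count every pass pairs up
    evenly, which is why the assignment restricts the input size."""
    runs = [[x] for x in items]
    while len(runs) > 1:
        runs = [merge(runs[k], runs[k + 1]) for k in range(0, len(runs), 2)]
    return runs[0]

print(merge_sort_bottom_up([6, 5, 9, 1, 7, 0, -3, 2]))
# [-3, 0, 1, 2, 5, 6, 7, 9]
```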
Thank you so much!

Tableau Dual Axis with different filters

I am trying to create a graph with two lines, using two filters from the same dimension.
I have a dimension with 20+ values. I'd like one line to show data for just one selected value and the other line to show data excluding that same value.
I've tried the following:
- Creating a duplicate/copy of the dimension, filtering the original with the first value and the copy with the second. When I do this, the chart disappears.
- Creating a calculated field that tries to split the measures up. This isn't letting me track the count.
I want this on the same axis. The best I've been able to do is create two sheets, one with each filter, and stack them in a dashboard.
My end user wants the lines in the same visual, otherwise I'd be happy with the dashboard approach; either way, I'd like to know how to do this.
It is a little hard to tell exactly what you want to achieve, but the problem with filtering is a common one.
The important principle is that Tableau filters the whole dataset row by row. So duplicating the dimension you want to filter won't help, as a filter on the original dimension will also remove the corresponding rows of the copy. Any solution has to be clever enough to work around this.
One solution is to build two new fields that use a calculation rather than a filter to create the result. Say you have a dimension [size] with values ranging from 1 to 10, and you want to compare the total number of rows including and excluding the value 5. You could create a new field using a formula like if [size] <> 5 then 1 else 0 end.
Summing the new field gives a count of the rows that don't contain a 5, and this can be compared directly with a row count over the original [size] field, which gives the number including the value 5.
This basic principle can be extended to much more complex logic. The essential point is to realise that filters act on every row of your data and can't, by themselves, show comparisons with alternative filter choices in a single visualisation.
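Applied to the question's scenario, that means two calculated measures along these lines (the dimension name [Category] and the value 'X' are placeholders for your own):
Selected Value Count: if [Category] = 'X' then 1 else 0 end
All Others Count: if [Category] <> 'X' then 1 else 0 end
Plotting the SUM of each, via Measure Values, puts both lines on the same axis.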
Depending on the nature of your problem, there may be other solutions worth looking at, including sets and groups, but you would need to provide more specific details for users here to tell you whether they would be useful.
We can make a set out of the values of the dimension and then place it on the required shelf. You will then have your dimension, which will plot accordingly, and the set, which will contain data as per the requirement; with a filter you don't have that independence of choosing what data to show.