Guidance on Merge Sort Algorithm - merge

Currently working on a class assignment to create a merge sort algorithm using MIPS assembly language. I'll paste the assignment's instructions to make sure my interpretation of what I'm supposed to do is correct.
The Instructions:
Convert "merge" in Assignment 3 (Assignment 3 is a merge algorithm that takes two ordered lists and merges all of their elements into one long ordered list; I've already completed this) into a subroutine. Write a "main" program to perform merge-sorting of a list of integers by calling "merge" repeatedly (**I'm assuming this means calling my previous Assignment 3**). For example, if the sorting program takes (6,5,9,1,7,0,-3,2) as input, it will produce a sorted list (-3,0,1,2,5,6,7,9).
The original unsorted list of integers should be received from keyboard input. Your program should first ask the user to input the number of integers in the original list, and then ask for all the integers. The total number of integers to be sorted by this program should be a power of 2; that is, the program should work with a list of 2, 4, 8, 16, or 32 (...) integers (but your program only needs to handle up to 32 integers).
Now, my merge algorithm takes two ordered lists, but this assignment takes only one list. However, the link below explains merge sort in a way where the original unsorted list is progressively divided up into individual elements, and then it works backwards, merging the elements together in order. Since my Assignment 3 merge algorithm takes two ordered lists, would it be possible for me to simply do one iteration of my merge algorithm right after dividing the unsorted list into two unsorted lists?
Basically, I'd be calling my merge algorithm at the second step in the below link.
https://www.tutorialspoint.com/data_structures_algorithms/merge_sort_algorithm.htm
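To make the idea concrete, here is the bottom-up scheme I have in mind, sketched in Python rather than MIPS; merge(a, lo, mid, hi) is just a stand-in for my Assignment 3 subroutine:

```python
def merge(a, lo, mid, hi):
    # Stand-in for the Assignment 3 subroutine: merges the two ordered
    # runs a[lo:mid] and a[mid:hi] into one ordered run, in place.
    left, right = a[lo:mid], a[mid:hi]
    i = j = 0
    for k in range(lo, hi):
        if j >= len(right) or (i < len(left) and left[i] <= right[j]):
            a[k] = left[i]
            i += 1
        else:
            a[k] = right[j]
            j += 1

def merge_sort(a):
    # Bottom-up: runs of width 1 are trivially ordered lists, so merge
    # applies immediately, then to widths 2, 4, 8, ...
    width = 1
    while width < len(a):
        for lo in range(0, len(a), 2 * width):
            mid = min(lo + width, len(a))
            hi = min(lo + 2 * width, len(a))
            merge(a, lo, mid, hi)
        width *= 2
    return a

print(merge_sort([6, 5, 9, 1, 7, 0, -3, 2]))  # [-3, 0, 1, 2, 5, 6, 7, 9]
```

For a power-of-2 length this does exactly log2(n) passes of pairwise merges, which seems to be what the assignment's power-of-2 constraint is hinting at.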
Thank you so much!


Questions about LSH (locality-sensitive hashing) and minhashing implementation

I'm trying to implement this paper:
Browser Fingerprint Coding Methods Increasing the Effectiveness of User Identification in the Web Traffic
I have a couple of questions about the LSH algorithm in general and about the proposed implementation:
1. The LSH algorithm is used only when you have a lot of documents to compare with each other (because it is supposed to put the similar ones in the same bucket, from what I understood). If, for example, I have a new document and I want to calculate its similarity with the others, I have to relaunch the LSH algorithm from scratch, including the new document, correct?
2. In 'Mining of Massive Datasets', Ch. 3, it is said that for LSH we should use one hash function per band, and each hash function creates n buckets. So, for the first band, we are going to have n buckets. From the second band onward, am I supposed to keep using the same hash function (so that I keep using the same buckets as before) or another one (ending up with m >> n buckets)?
3. This question is related to the previous one. If I use the same hash function for all the bands, then I'll have n buckets. No problem there. But if I have to use different hash functions (one per band), I'm going to end up with a lot of different buckets. Am I supposed to measure the similarity for each pair in each bucket? (If I only have to use one hash function, then this is not a problem.)
4. In the paper, I understood most of the algorithm except for its end. Basically, two signature matrices are created (one for stable features and one for unstable features) via minhashing. Then, they use LSH on the first matrix to obtain a list of candidate pairs. So far so good. What happens at the end? Do they perform LSH on the second matrix? How is the result of the first LSH used? I cannot see the relationship between the first and the second LSH. The output of the final step is supposed to be a list of candidate pairs, right? And all I have to do is compute the Jaccard similarity on them and set a threshold, right?
Thanks for your answers!
I got a partial answer to my question (still missing question 4):
1. No. You would keep the bucket structure and hash the new doc into it. Then compare it only with those docs that fell into one of the same buckets.
2. No. You HAVE to use different hash functions and a different set of buckets for each hash function.
3. This is irrelevant because of the answer to (2).
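To check my understanding of answers (1) and (2), here is a sketch in Python; BANDS, ROWS, and the tuple-keyed dict standing in for "one hash function and one bucket set per band" are my illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

BANDS, ROWS = 4, 5                                  # illustrative sizes
tables = [defaultdict(list) for _ in range(BANDS)]  # one bucket set per band

def add_doc(doc_id, signature):
    # len(signature) == BANDS * ROWS; hashing each band into its own
    # table stands in for "one hash function per band" (answer 2).
    for b in range(BANDS):
        band = tuple(signature[b * ROWS:(b + 1) * ROWS])
        tables[b][band].append(doc_id)

def candidates(signature):
    # A new doc is hashed into the existing structure (answer 1); its
    # candidate pairs are the docs sharing a bucket in at least one band.
    found = set()
    for b in range(BANDS):
        band = tuple(signature[b * ROWS:(b + 1) * ROWS])
        found.update(tables[b].get(band, []))
    return found
```

Only the candidates returned this way need their actual (Jaccard) similarity computed, which is what keeps the pair comparisons tractable.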

Structural Sharing in Scala Vector

Structural sharing in Scala List is straightforward and easy to understand. But Scala Vector is a more complicated data structure than a list. How is structural sharing achieved in Scala Vector?
Vector is basically a tree (trie) with 32-wide branching at each level. If you have a Vector of, say, 3000 elements and you want to index element 2045, for example, which converts to 11111111101 in binary, it will decompose it into 5-bit blocks to use as indices into the tree: 1 in the first branch, then 11111 (i.e. 31) in the next, and finally 11101 (i.e. 29) in the terminal branch, and then there's the data.
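Illustratively, the index arithmetic looks like this (Python just for demonstration; the real implementation lives inside the Scala standard library):

```python
def path_indices(i, levels=3):
    # Split an index into 5-bit blocks, most significant first; each
    # block selects one child in a 32-wide level of the trie.
    return [(i >> (5 * level)) & 0x1F for level in reversed(range(levels))]

print(path_indices(2045))  # [1, 31, 29]
```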
Given this structure, it's easy to see how to structurally share things: you can share any sub-trees that aren't changed. So if you make a new vector with a different element 2045, you don't have to change all 3000 elements but recreate "only" three arrays of size 32: the terminal one is replaced by a copy with its element 29 updated; then its parent has to be replaced by a copy with this new child at index 31; then its parent has to be replaced with the correct subtree at index 1.
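As a sketch of that path copying (again illustrative Python, assuming a three-level tree of 32-wide lists rather than the actual Vector internals):

```python
def updated(node, index, value, level):
    # Copy only this 32-wide array; every child we don't descend into
    # is shared, unchanged, between the old and the new tree.
    copy = list(node)
    i = (index >> (5 * level)) & 0x1F
    copy[i] = value if level == 0 else updated(node[i], index, value, level - 1)
    return copy

# new_root = updated(root, 2045, x, level=2)  # recreates exactly 3 arrays
```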
Now, this provides quite a lot of structural sharing as long as you have far more than 32 elements in your vector, but it's still a pretty big overhead. Because of this, additions to the end of the vector are special-cased so that you just add to the existing array. The old Vectors still point to that array, but they think the end is earlier (and that part is unchanged) so it works out okay.
There's a more complex but similar scheme that allows addition at the front of a vector (basically, by leaving space at the front and keeping track of where to point via indices and offsets, in addition to the indexing scheme).
The trick as implemented doesn't work to allow alternating addition to both front and back, though, so there you effectively rebuild the trees every addition. Making a version with even better structural sharing would be possible, but it would probably be a bit slower to access.

Hash a Sequence of positive/negative integers

I have a file with millions of lines (actually it's an online stream of data, meaning we receive it line by line). Each line consists of an unsorted array of integers (positive and negative); there is no limit on the size of each number, the lengths of the lines differ, and a line may contain duplicate values.
I want to remove the duplicate lines (if two lines have the same values, regardless of their order, we consider them duplicates). Is there any good hashing function for this?
We want to do this in O(n), where n is the number of lines (we can assume that the maximum possible number of elements in each line is a constant, e.g. at most 100 elements per line).
I've read some of the questions posted here on Stack Overflow and I also googled it, but most of them cover the cases where the arrays have the same length, or the integers are all positive, or even, or sorted. Is there any way to solve this in the general case?
My solution:
First we sort each line using a linear-time sorting algorithm, e.g. counting sort; then we concatenate the values into a string and use MD5 hashing to put it into a hash set. If the hash is not in the set, we add it; if it's already in the set, we compare the new array against the arrays that produced the same hash value.
Problem with this solution: sorting with counting sort takes a lot of space, since there is no limit on the numbers, and collisions are possible.
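Sketched in Python, this is what I mean, except using an ordinary comparison sort per line instead of counting sort, which sidesteps the space problem (with at most ~100 elements per line, the per-line sort is bounded by a constant, so the whole stream stays O(n) in the number of lines); MD5 is just one hash choice:

```python
import hashlib

seen = set()

def is_duplicate(nums):
    # Sorting makes the key order-insensitive for each line.
    canon = ",".join(map(str, sorted(nums)))
    key = hashlib.md5(canon.encode()).hexdigest()
    if key in seen:
        return True
    seen.add(key)
    return False

print(is_duplicate([3, -1, 2]))   # False
print(is_duplicate([-1, 2, 3]))   # True: same values, different order
```

If hash collisions must be ruled out entirely, store the canonical sorted tuples themselves in the set instead of their digests.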
The problem with using a hashing algorithm on a set of data this large is that you have a high probability of two different lines hashing to the same value. You want to stay in O(n), but I am not sure that is possible given the size of the data and the accuracy needed. If you use heapsort, which is space efficient, and then traverse down the sorted data removing consecutive lines that are the same, you could accomplish this in O(n log n).

Example where quicksort crashes

I know that quicksort is not a stable method; namely, for equal elements, a member of the array may not end up at the position it had relative to its duplicates. I need an example of an array (in which elements are repeated several times) for which quicksort does not preserve that order (for example, with the three-way partitioning method). I have not been able to find such an example array on the internet; could you help me?
Sure, I can use other kinds of sorting methods for this problem (like heap sort, merge sort, etc.), but my aim is to know, with a real-world example, what kind of data carries this risk for quicksort, because as far as I know it is one of the most useful methods and is used often.
Quicksort shouldn't crash no matter what array it is given.
When a sorting algorithm is called 'stable' or 'not stable', it does not refer to safety of the algorithm or whether or not it crashes. It is related to maintaining relative order of elements that have the same key.
As a brief example, if you have:
[9, 5, 7, 5, 1]
Then a 'stable' sorting algorithm should guarantee that in the sorted array the first 5 is still placed before the second 5. Even though for this trivial example there is no visible difference, there are cases in which it matters, such as when sorting a table based on one column (you want rows with equal keys to keep the order they had before).
See more here: http://en.wikipedia.org/wiki/Stable_sort#Stability
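To make this concrete, here is the example above tagged with letters and run through a plain Lomuto-partition quicksort, sketched in Python (library sorts and real three-way partitioning variants differ in detail, but equal keys can be reordered the same way):

```python
def quicksort(a, lo=0, hi=None):
    # Plain Lomuto partition on the key (first tuple element) only.
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[hi][0]
    i = lo
    for j in range(lo, hi):
        if a[j][0] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    quicksort(a, lo, i - 1)
    quicksort(a, i + 1, hi)

rows = [(9, 'a'), (5, 'b'), (7, 'c'), (5, 'd'), (1, 'e')]
quicksort(rows)
print(rows)  # [(1, 'e'), (5, 'd'), (5, 'b'), (7, 'c'), (9, 'a')]
```

The keys come out sorted, but the two 5s appear as (5, 'd') before (5, 'b'), the reverse of their input order: that is exactly the instability, even though nothing crashed.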

Hash function for an integer sequence

Say there is a list of permutations. Each permutation is a long list of integers. Let's consider a sample permutation and call it samplePerm. My task is to find out whether the list contains samplePerm. I think a hash function technique would be a good idea. Since the permutations are very large (more than 10000 items), the polynomial variant (like the one used for strings) is useless. Does anybody know the best practice?
UPDATE:
The order of integers in a permutation is a key criterion! All permutations consist of the same numbers.
The solution is to divide the integers into groups, treat each group as a string by concatenating its integers, and apply a hash function (see Java's String.hashCode() for an algorithm) to each group. Finally, the resulting numbers can be added together. This last step may produce collisions, so this is where a better idea is required :)
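Note that adding the group hashes together discards the order between groups (swapping two whole groups leaves the sum unchanged), even though the update says order is the key criterion. A simple order-sensitive alternative is a single polynomial hash reduced modulo a large prime, which stays cheap even for 10000+ items; the base and modulus below are arbitrary illustrative choices:

```python
def perm_hash(perm, base=1_000_003, mod=(1 << 61) - 1):
    # Horner's rule: h = (((p0*base + p1)*base + p2)...) mod prime.
    # Order-sensitive, fixed-size result regardless of the list length.
    h = 0
    for x in perm:
        h = (h * base + x) % mod
    return h

print(perm_hash([3, 1, 2]) == perm_hash([1, 3, 2]))  # False: order matters
```

Equal hashes still have to be confirmed by comparing the permutations themselves, since collisions remain possible.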