Scala working with very large list

I have a List with about 1176^3 elements.
Doing something like
val x = list.length
takes hours.
With 1271256 elements it is fine and takes just a few seconds.
Does anyone have an idea how to speed this up?

List is probably the wrong data structure if you need length, because length on a List is O(n): it has to walk every cell, so it takes longer the longer the list is.
Vector is probably a better choice if you need to call length, because it reports its length in constant time and supports random access in effectively constant time.
This does not mean that List is a poor structure in general, just that it may not be the best fit here.

To add to gpampara's answer, in cases like these you may actually be able to justify using an Array, since it has the lowest overhead per item stored and O(1) access to elements and length determination (since it's recorded in the array header itself).
Array has many down-sides, but I consider it justifiable when memory overhead is a primary consideration (and when a fixed-size collection whose size is known at the time of creation is feasible).
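As a rough illustration of the two answers above, here is a minimal Scala sketch. The element count and all value names are assumptions chosen for the example; the size is far smaller than the 1176^3 elements in the question.

```scala
// Illustrative size only; all names below are hypothetical.
val n = 100000

// List is singly linked: length has to walk every cell, so it is O(n).
val asList: List[Int] = List.range(0, n)

// Vector and Array know their size, so length does not traverse anything.
val asVector: Vector[Int] = asList.toVector
val asArray: Array[Int] = asList.toArray

// If a List must be kept, compute the length once and reuse the value.
val cachedLength: Int = asList.length

// lengthCompare stops after at most 11 cells here instead of walking the whole list.
val hasMoreThanTen: Boolean = asList.lengthCompare(10) > 0
```

If only a size check is needed ("is the list longer than k?"), lengthCompare avoids the full traversal even when the data stays in a List.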

Related

A question about HashMap's time complexity

This is not a code question; I am just trying to understand the concept of hash maps and time complexity.
I think I know how HashMaps, HashSets, etc. work, and I think I understand why HashMap.get runs in constant time. But when we have a very big HashMap, the indices where the values get stored should overlap. When two hash codes resolve to the same index, they get stored in a linked list at this index, right?
Couldn't it be that all elements get stored at one index as a linked list? Shouldn't HashMap.get then run in O(n) in the worst case?

Yes, the worst-case time complexity of HashMap.get is O(n), where n is the number of elements stored in one bucket. The worst case occurs when all n elements of the HashMap end up in the same bucket.
However, this can be improved if each bucket is stored as a balanced binary search tree instead of a linked list. In that case the worst-case time complexity is O(log(n)). Java 8's HashMap takes this approach for large buckets.
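The worst case can be provoked on purpose. The following is a minimal sketch in Scala, assuming a hypothetical key type (CollidingKey) whose hashCode is deliberately constant, so that every key falls into the same bucket. The map still returns correct results, because equality is checked inside the bucket, but each lookup has to search through the colliding entries.

```scala
import scala.collection.mutable

// Hypothetical key type: the constant hashCode forces every key into one bucket.
// equals is still the case-class equality on id, so lookups remain correct.
final case class CollidingKey(id: Int) {
  override def hashCode: Int = 42
}

// Builds a map with n keys that all collide (helper name is an assumption).
def buildCollidingMap(n: Int): mutable.HashMap[CollidingKey, Int] = {
  val m = mutable.HashMap.empty[CollidingKey, Int]
  for (i <- 0 until n) m(CollidingKey(i)) = i * 10
  m
}

val colliding = buildCollidingMap(2000)
val found: Option[Int] = colliding.get(CollidingKey(1999))
```

With a well-distributed hashCode the same code would spread the keys over many buckets, which is what makes the expected cost of get constant.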

Time Complexity of Counting Edges in OrientDB

I would like to retrieve the total number of incoming or outgoing edges of a given type to/from a vertex in OrientDB. The obvious method is to construct a query using count() and inE(MyEdgeType), outE(MyEdgeType), or bothE(MyEdgeType). However, I am concerned about time complexity; if this operation is O(N) rather than O(1), I might be better off storing the number in the database rather than using count() each time I need it, because I anticipate the number of edges in question becoming very large. I have searched the documentation, but it does not seem to list the time complexities of OrientDB's functions. Also, I am unsure of whether to use in/out/both or inE/outE/bothE; I presume the E versions will be faster, but depending on how OrientDB stores edges under the hood, that may be wrong.
Is counting the set of incoming/outgoing/both edges of a given type to/from a vertex a constant-time operation--and if not, what is its time complexity? To be most efficient, do I need to use inE/outE/bothE, or in/out/both? Or is there some other method entirely that I've missed?

In OrientDB this is O(1): inE()/outE()/bothE().size() with or without the edge label as parameter.

What is the right data structure for a queue that supports Min, Max operations in O(1) time?

What is the right data structure for a queue that supports Enqueue, Dequeue, Peek, Min, and Max operations and performs all of them in O(1) time?
The most obvious data structure is a linked list, but then Min and Max would be O(n). A priority queue looks like another good choice, but Enqueue and Dequeue should work in the normal fashion of a queue (FIFO).
Another option that comes to mind is a heap, but I cannot quite figure out how one can design a queue with Min and Max operations using heaps.
Any help is much appreciated.

The data structure you seek cannot be designed, if min() and max() actually change the structure. If min() and max() are similar to peek(), and provide read-only access, then you should follow the steps in this question, adding another deque similar to the one used for min() operations for use in max() operation. The rest of this answer assumes that min() and max() actually remove the corresponding elements.
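For the read-only case, the idea of keeping one extra deque per extreme can be sketched as follows. This is an illustration only: the class name MinMaxQueue is an assumption, it relies on scala.collection.mutable.ArrayDeque (Scala 2.13 or later), and enqueue is O(1) amortized rather than O(1) in the worst case.

```scala
import scala.collection.mutable

// Hypothetical class: FIFO queue with read-only min and max.
final class MinMaxQueue[A](implicit ord: Ordering[A]) {
  private val items = mutable.ArrayDeque.empty[A] // arrival order
  private val mins = mutable.ArrayDeque.empty[A]  // non-decreasing; head is the current min
  private val maxs = mutable.ArrayDeque.empty[A]  // non-increasing; head is the current max

  def enqueue(x: A): Unit = {
    items.append(x)
    // Older candidates larger than x can never be the minimum while x is queued.
    while (mins.nonEmpty && ord.gt(mins.last, x)) mins.removeLast()
    mins.append(x)
    // Older candidates smaller than x can never be the maximum while x is queued.
    while (maxs.nonEmpty && ord.lt(maxs.last, x)) maxs.removeLast()
    maxs.append(x)
  }

  def dequeue(): A = {
    val x = items.removeHead()
    // The departing element is a current extreme only if it sits at a candidate head.
    if (ord.equiv(mins.head, x)) mins.removeHead()
    if (ord.equiv(maxs.head, x)) maxs.removeHead()
    x
  }

  def peek: A = items.head
  def min: A = mins.head
  def max: A = maxs.head
  def isEmpty: Boolean = items.isEmpty
}
```

Each element enters and leaves each candidate deque at most once, which is where the amortized constant cost comes from.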
Since you require enqueue() and dequeue(), elements must be added and removed by order of arrival (FIFO). A simple double-ended queue (either linked or using a circular vector) would provide this in O(1).
But the elements to be added could change the current min() and max(); however, when removed, the old min() and max() values should be restored... unless they were removed in the interim. This restriction forces you to keep elements sorted somehow. Any sorting structure (min-heap, max-heap, balanced binary tree, ...) will require at least O(log n) to find the position of a new arrival.
Your best bet is to pair a balanced binary tree (for min() and max()) with a doubly-linked list. Your tree nodes would store a set of pointers to the list nodes, sorted by whatever key you use in min() and max(). In Java:
// N your node class; can return K, comparable, used for min() and max()
LinkedList<N> list; // sorted by arrival
TreeMap<K,HashSet<N>> tree; // sorted by K
On enqueue(), you would add a new node to the end of list, and add that same node, by its key, to the HashSet in its node in tree. O(log n).
On dequeue(), you would remove the node from the start of list, and from its HashSet in its node in tree. O(log n).
On min(), you would look for the first element in the tree. This is O(log n) in a plain balanced tree, or O(1) if the tree keeps a pointer to its smallest node. If you need to remove it, you have the pointer to the linked list, so O(1) on that side; but O(log n) to re-balance the tree if it was the last element with that specific K.
On max(), the same logic applies, except that you would be looking for the last element in the tree. So O(log n).
On peek(), looking at but not extracting the first element in the queue would be O(1).
You can simplify this (by removing the HashSet) if you know that all keys will be unique. However, this does not impact asymptotic costs: they would all remain the same.
In practice, the difference between O(log n) and O(1) is so low that the default map implementation in C++'s STL is O(log n)-based (Tree instead of Hash).
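A Scala sketch of the removing variant follows. It is not a literal translation of the design above: instead of a linked list with node pointers, it numbers elements by arrival and keeps two sorted structures, one ordered by arrival number and one ordered by key. The class and member names are assumptions, and every operation is roughly O(log n).

```scala
import scala.collection.mutable

// Hypothetical class: FIFO queue whose min and max can also be removed.
final class RemovableMinMaxQueue[K](implicit ord: Ordering[K]) {
  private var nextSeq = 0L
  // Arrival number -> key; the smallest arrival number is the front of the queue.
  private val byArrival = mutable.TreeMap.empty[Long, K]
  // (key, arrival number) pairs sorted by key; the arrival number breaks ties.
  private val byKey = mutable.TreeSet.empty[(K, Long)]

  def enqueue(k: K): Unit = {
    byArrival(nextSeq) = k
    byKey += ((k, nextSeq))
    nextSeq += 1
  }

  // FIFO removal: drop the oldest element from both structures.
  def dequeue(): K = {
    val (seq, k) = byArrival.head
    byArrival -= seq
    byKey -= ((k, seq))
    k
  }

  def removeMin(): K = removeEntry(byKey.head)
  def removeMax(): K = removeEntry(byKey.last)

  def size: Int = byArrival.size

  private def removeEntry(e: (K, Long)): K = {
    byKey -= e
    byArrival -= e._2
    e._1
  }
}
```

Because duplicates get distinct arrival numbers, equal keys coexist without needing a separate per-key set.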

Any comparison-based data structure that can retrieve and remove Min or Max must spend at least O(log n) on Insert or on Remove to maintain the elements in partially sorted order. The data structures that achieve this are called priority queues.
The basic priority queue supports Insert, Max, and RemoveMax. There are a number of ways to build one, but binary heaps are the usual choice.
Supporting all of Insert, Min, RemoveMin, Max, and RemoveMax with a single priority queue is more complex. A way to do this with a single data structure, adapted from a binary heap, is described in the paper:
Atkinson, Michael D., et al. "Min-max heaps and generalized priority queues." Communications of the ACM 29.10 (1986): 996-1000.
It is fast and memory-efficient, but requires a good amount of care to implement correctly.

This structure DOES NOT exist!
There is a simple way to prove this.
As we all know, comparison-based sorting needs on the order of n log n comparisons in the worst case.
But if the structure you describe existed, and removing the max (or min) were also O(1), it would give a way to sort:
Enqueueing every element one by one costs O(n).
Removing the max (or min) element one by one costs O(n).
This means sorting could be done in O(n), which is IMPOSSIBLE.
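The reduction in this argument can be sketched in Scala. Since the hypothetical O(1) structure does not exist, scala.collection.mutable.PriorityQueue stands in for it here (its insert and remove are O(log n)); the function name sortViaRemoveMin is an assumption.

```scala
import scala.collection.mutable

// Sorting by "insert everything, then remove the minimum n times".
// Any structure with O(1) insert and O(1) remove-min would make this O(n).
def sortViaRemoveMin(xs: Seq[Int]): List[Int] = {
  // PriorityQueue is a max-heap; reversing the ordering makes dequeue return the minimum.
  val pq = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
  pq ++= xs
  val out = List.newBuilder[Int]
  while (pq.nonEmpty) out += pq.dequeue()
  out.result()
}
```

With the heap the total cost is O(n log n), which matches the lower bound for comparison-based sorting.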

Assumptions:
that you only care about performance and not about space / memory / ...
A solution:
That the index is a set, not a list (will work for list, but may need some extra love)
You could do a queue and a hash table side by side.
Example:
Let's say the order is 5 4 7 1 8 3
Queue -> 547183
Hash table -> 134578
Enqueue:
1) Take your object and insert it into the hash table in the right bucket. Min / Max will always be the first and last index. (see sorted hash tables)
2) Next, insert into your queue like normal.
3) You can / should link the two. One idea would be to use the hash table value as a pointer to the queue.
Both operations with large hash table will be O(1)
Dequeue:
1) Pop the first element O(1)
2) remove element from hash table O(1)
Min / Max:
1) Look at your hash table. Depending on the language used, you could in theory find it by looking at the head of the table, or the tail of the table.
For a better explanation of sorted hash tables, https://stackoverflow.com/questions/2007212
Note:
I would like to note that there is no "normal" data structure that I know of that will do what you require. However, that does not mean it is not possible. If you attempt to implement the data structure, you will most likely have to build it for your own needs and will not be able to use currently available libraries. You may have to use a very low-level language like assembly to achieve this, but C or Java might be enough if you're good with those languages.
Good luck
EDITED:
I did not explain sorted hash tables, so added a link to another SO to explain them.

I wonder which is faster between list.contains() and set.contains() in scala

I have a very large collection which contains more than a million String elements. I very often need to check whether an incoming String is in this collection or not.
I wonder which collection is better to use, List or Set? And why?

Set will be faster, since it is based either on a tree structure (complexity is O(height of the tree), which is O(log n) for a balanced tree) or on hashing (complexity is close to O(1)). On the other hand, contains for List has complexity O(n), where n is the size of the list.
So Set should be faster when there are many calls to contains().
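A minimal Scala sketch of the difference, with made-up data (the element values and helper names are assumptions for illustration):

```scala
// Illustrative data; the real collection would hold the million-plus Strings.
val words: List[String] = List.tabulate(100000)(i => s"word$i")

// Building the Set costs one pass over the data; it is then reused for every lookup.
val wordSet: Set[String] = words.toSet

// List.contains scans from the head until it finds a match: O(n) per call.
def inList(s: String): Boolean = words.contains(s)

// Set.contains on the default hash-based Set is close to O(1) per call.
def inSet(s: String): Boolean = wordSet.contains(s)
```

Both functions return the same answers; only the cost per call differs, so the one-time conversion pays off once lookups are frequent.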

Preferred Scala collection for progressively removing random items?

I have an algorithm which takes many iterations, each of which scores items in a collection and removes the one with the highest score.
I could populate a Vector with the initial population, continually replacing it as a var, or choose a mutable collection as a val. Which of the mutable collections would best fit the bill?

You could consider a DoubleLinkedList, which has a convenient remove() method to remove the current list cell.

I think a Map (or its close relative, the Set) might do well. It doesn't have indexed access, but that doesn't seem to be what you want. If you go for a TreeMap, you'll even get an ordered collection.
However, might I point out that your algorithm seems to call for a heap? A heap is optimized for repeatedly finding and removing the maximum element (or minimum, if you invert the comparison used to build the heap). Scala's standard library provides one as scala.collection.mutable.PriorityQueue, and a heap is also easily implemented with an array.
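A minimal sketch of the heap approach using scala.collection.mutable.PriorityQueue. The Item type, its fields, and the function name are assumptions made for the example, and the sketch assumes that an item's score does not change while it sits in the heap.

```scala
import scala.collection.mutable

// Hypothetical item type with a precomputed score.
final case class Item(name: String, score: Int)

// Repeatedly removes the highest-scoring item and records the removal order.
def removalOrder(items: Seq[Item]): List[String] = {
  // PriorityQueue is a max-heap with respect to the given ordering.
  val heap = mutable.PriorityQueue.empty[Item](Ordering.by[Item, Int](_.score))
  heap ++= items
  val out = List.newBuilder[String]
  while (heap.nonEmpty) out += heap.dequeue().name // O(log n) per removal
  out.result()
}
```

If every iteration rescores all remaining items, the heap order becomes stale and would have to be rebuilt each time, so this fits best when scores are fixed or change rarely.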