How can I improve on peach by using the scan adverb? - kdb

I have a list of times, and I want to count how many are within a given time window from each time. i.e. for each time, how many of the following times are less than 10 minutes ahead.
This is what I have so far, and it works well for small lists (count ts below ~10000), but even using peach it struggles when the count is above this, and I get wsfull errors.
q)ts:asc `time$10?10000000
q)ts where each {(x<=y) and (y<x+00:10)}[;ts] peach ts
00:10:20.526 00:11:41.084 00:15:59.360 00:20:15.625
00:11:41.084 00:15:59.360 00:20:15.625
00:15:59.360 00:20:15.625
,00:20:15.625
,01:11:14.831
02:14:36.999 02:17:47.700
02:17:47.700 02:25:44.267 02:27:02.389
02:25:44.267 02:27:02.389 02:28:16.790
02:27:02.389 02:28:16.790
,02:28:16.790
I have tried using scan and over, but can't figure out how to stop the iteration when I need to.

EDIT - If it's just the count you're after then all you need is:
q)1+(ts bin ts+00:10)-til count ts
1 3 2 1 1 2 2 1 1 1
OLD ANSWER - If you're trying to actually generate the list of times (not sure why you need to do that), then no matter what you do you're going to eat up a good bit of memory generating a large list of potentially large lists of times. Also, peach may not be useful, since the time gained by outsourcing to other threads might be undone by the time needed to send the results back to the main thread. And any form of iteration/loop is likely to be slow, since it will be acting atomically.
Having said that, the best solution would be to make use of bin, especially if your list is sorted. For example, either of these two should give you the list of times and they scale a bit better (again, you shouldn't need to generate the lists if you're just using them to count - see edit above):
ts t+til each 1+(ts bin ts+00:10)-t:til count ts
{y[1]#y[0]_x}[ts] each t,'1+(ts bin ts+00:10)-t:til count ts
but they still involve generating lists of lists of indices and they will still add up.
Note that bin (which gives the index of the last item within 10 minutes of each item) is incredibly fast and memory-efficient, even if the list is in the tens of millions:
q)ts:asc `time$10000000?10000000
q)
q)\ts ts bin ts+00:10
160 201326768
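
For anyone wanting the same technique outside q, here's a minimal sketch in Scala, assuming times stored as a sorted array of milliseconds; it mirrors 1+(ts bin ts+00:10)-til count ts by binary-searching for the last element inside each window:

object WindowCount {
  // q's bin: index of the last element <= target in a sorted array, or -1 if none.
  def bin(sorted: Array[Long], target: Long): Int = {
    val i = java.util.Arrays.binarySearch(sorted, target)
    if (i >= 0) {
      var j = i                 // binarySearch may land on any duplicate,
      while (j + 1 < sorted.length && sorted(j + 1) == target) j += 1
      j                         // so step forward to the last equal one
    } else -i - 2               // not found: insertion point minus one
  }

  // For each time, how many of the following times (itself included)
  // fall within the window: 1 + (ts bin ts+window) - til count ts.
  def counts(ts: Array[Long], window: Long): Array[Int] =
    ts.indices.map(i => bin(ts, ts(i) + window) - i + 1).toArray
}

// e.g. WindowCount.counts(Array(0L, 100L, 200L, 10000L), 600L) gives Array(3, 2, 1, 1)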

How to model very large work queues in Akka?

I am writing a scala script to download all items from the hacker news API. There are ~12M items, each being a JSON of ~200 bytes.
I identified the following issues:
Storing the data: I tried to save each item as a single JSON file, but it became very hard just to list them (using Linux, ext4 file system). So I changed to appending JSON items to multiple (100) files (by taking the item's id modulo 100).
Keeping track of what has been downloaded, because I want to be able to stop/continue the application. First I tried writing the downloaded ids to a text file, but it turned out a little buggy. So now I just read all the items and collect the ids. (It works.)
All this is done with 1 Master actor and an arbitrary number of Worker actors (tens). The Master holds a Queue[Int] of jobs, and the Workers ask it for work.
The problem I am having is fairly simple but I haven't been able to solve it in a nice way.
I can collect the ids from items already downloaded in a list. But what I really need is the complement to that set; I need all the items I have not downloaded, up to the highest item id.
I tried using a range (1 to maxItemId) and subtracting the set of done jobs, but it is really slow. Reaaaaaaally slow.
Now I am using a Stream, and when a worker asks for a job, I check whether the stream's head (the next job) has already been done. If it hasn't, I give it to the Worker; otherwise I check the next one.
The problem with this approach is that I can not put jobs back at the stream if they fail. That would be easy with the Queue; but then again I am having trouble just setting up the queue with millions of items.
What could be a better approach to this? I don't think the issues here are trivial, since this is a very large number of tasks to perform and keep track of, but it shouldn't be this hard either.
Thanks!
As far as I understand your question, I don't think you need a very complicated data structure here.
Assuming your ids are sequential from 1 to maxItemId, you can use an Array[Boolean] of size maxItemId to keep track of processed items. You initialize this array by reading the processed ids, and you find the next job by searching for the next false entry.
Assuming that your maxItemId is around 12M, iterating over all items is pretty much instantaneous.
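A minimal sketch of that idea (the class and method names are just illustrative):

class JobTracker(maxItemId: Int, doneIds: Iterable[Int]) {
  private val done = new Array[Boolean](maxItemId + 1) // index 0 unused
  doneIds.foreach(id => done(id) = true)               // initialize from what's on disk
  private var cursor = 1                               // everything below has been handed out

  // Next id that still needs downloading, or None when we've run out.
  def nextJob(): Option[Int] = {
    while (cursor <= maxItemId && done(cursor)) cursor += 1
    if (cursor > maxItemId) None
    else { val id = cursor; cursor += 1; Some(id) }
  }

  def markDone(id: Int): Unit = done(id) = true
  // A failed job is simply never marked done; collect failures in a small
  // retry queue and hand those out before advancing the cursor.
}

At 12M ids the array costs about 12 MB (a java.util.BitSet would cut that to roughly 1.5 MB), and a full scan over it is effectively instantaneous.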

Database and item orders (general)

I'm currently experimenting with a nodejs-based app, where I put in a list of books and each one is posted on a forum automatically every x minutes.
Now my question is about order of these things posted.
I use mongodb (not sure if this changes the question or not) and I just add a new entry for every item to be posted. Normally, things are posted in the exact order I add them.
However, for the web interface of this experimental thing, I made a re-ordering interaction where I can simply drag and drop elements to reorder them.
My question is: how can I reflect this change to the database?
Or more in general terms, how can I order stuff in general, in databases?
For instance, if I drag the 1000th item to 1st place, all the entries between 1 and 1000 need to be edited in the db. That does not seem like a valid and proper solution to me.
Any enlightenment is appreciated.
An elegant way might be lexicographic sorting. Introduce a String attribute for each item. Make the initial length of the values large enough to accommodate the estimated number of items. E.g., if you expect 1000 items, let the keys be baa, bab, bac, ... bba, bbb, bbc, ...
Then, when an item is moved from where it is to another place between two items, assign a value to the sorting attribute of the moved item that is roughly midway (lexicographically) between those items. So to move an item between dei and dej, give it the value deim. To move an item between fadd and fado, give it the value fadi.
Keys starting with a were initially not used to leave space for elements that get dragged before the first one. Never use the key a, as it will be impossible to move an element before this one.
Of course, the characters used may vary according to the sort order provided by the database.
This solution should work fine as long as elements are not reordered extremely frequently. In a worst case scenario, this may lead to longer and longer attribute values. But if the movements are somewhat equally distributed, the length of values should stay reasonable.
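A minimal sketch of the key generation (keyBetween is a hypothetical helper, shown in Scala; it assumes lowercase a-z keys, lo < hi, and that hi is not just lo followed by a run of a's, which is exactly the impossible case the warning about the key a describes):

// Returns a key lexicographically strictly between lo and hi.
// lo may be "" when inserting at the very front (keys will shrink toward
// "a", hence the advice never to allocate "a" itself).
def keyBetween(lo: String, hi: String): String = {
  val sb = new StringBuilder
  var i = 0
  var result: String = null
  while (result == null) {
    val l: Int = if (i < lo.length) lo.charAt(i) else 'a' - 1 // virtual low bound
    val h: Int = if (i < hi.length) hi.charAt(i) else 'z' + 1 // virtual high bound
    if (h - l > 1) {
      result = sb.append(((l + h) / 2).toChar).toString // room for a midpoint char
    } else {
      sb.append(math.max(l, 'a').toChar)                // no room: copy and go one level deeper
      i += 1
    }
  }
  result
}

This reproduces the examples above: keyBetween("dei", "dej") gives "deim" and keyBetween("fadd", "fado") gives "fadi".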

Are "swap move factories" worth the effort?

I noticed that for problems such as Cloudbalancing, move factories exist to generate moves and swaps. A "move move" transfers a cloud process from one computer to another. A "swap move" swaps any two processes from their respective computers.
I am developing a timetabling application.
A subjectTeacherHour (a combination of subject and teacher) has
only a subset of Periods to which it may be assigned. If Jane teaches 6 hours at a class, there are 6 subjectTeacherHours, each of which has to be allocated a Period from a possible 30 Periods of that class; unlike the cloudbalance example, where a process can move to any computer.
Only one subjectTeacherHour may be allocated to a given Period (naturally).
The solver tries to place subjectTeacherHours into eligible Periods until an optimal combination is found.
Pros
The manual seems to recommend it.
...However, as the traveling tournament example proves, if you can remove
a hard constraint by using a certain set of big moves, you can win
performance and scalability...
...The [version with big moves] evaluates a lot less unfeasible
solutions, which enables it to outperform and outscale the simple
version...
...It's generally a good idea to use several selectors, mixing fine
grained moves and coarse grained moves:...
While only one subjectTeacherHour may be allocated to a given Period, the solver must temporarily break such a constraint to discover that swapping two certain Period allocations leads to a better solution. A swap move removes this brick wall between those two states.
So a swap move can help lead to better solutions much faster.
Cons
A subjectTeacherHour has only a subset of Periods to which it may be assigned, so finding the intersecting (common) hours between any two subjectTeachers is a bit tough (but doable in an elegant way: Good algorithm/technique to find overlapping values from objects' properties?).
Will it give me only small gains in time and optimality?
I am also worried about the crazy interactions that having two kinds of moves may cause, leading to getting stuck at a bad solution.
Swap moves are crucial.
Consider 2 courses assigned to a room which is fully booked. Without swapping, the solver would have to break a hard constraint to move 1 course to a conflicting room and choose that move as the step (which is unlikely).
You can use the built-in generic swap MoveFactory. If you write your own, you can make the swap move's isDoable() return false when it would move either side to an ineligible period.
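A rough sketch of that doability check (Lecture, eligiblePeriods and period are hypothetical stand-ins for your planning entities, not real OptaPlanner API; OptaPlanner moves are usually written in Java, this is just the logic):

case class Lecture(id: Int, eligiblePeriods: Set[Int], var period: Int)

class SwapPeriodsMove(a: Lecture, b: Lecture) {
  // Not doable if nothing would change, or if either side
  // would land on a period it is not eligible for.
  def isDoable: Boolean =
    a.period != b.period &&
      a.eligiblePeriods.contains(b.period) &&
      b.eligiblePeriods.contains(a.period)

  def doMove(): Unit = {
    val tmp = a.period; a.period = b.period; b.period = tmp
  }
}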

dijit.form.select dropdown is very slow

Loading 3000 values into a dijit.form.Select control takes a long time; the browser hangs even at 500 values. How can I overcome this problem?
Any assistance would be really appreciated.
Thanks,
Karthihck k.
Loading 3,000 of anything into a web page is always going to be slow.
Although there are twisted ways to overcome this limitation, it may not be worth it: your user is definitely not going to like scrolling through 3,000 items to pick one!
I'd suggest you break this drop-down list into two (or three) levels, and have no more than 20-30 choices each. In one of my own projects with thousands of list items, I had to go with four levels; otherwise performance was abysmal.
If you only have one long list to work with, consider breaking it down by the starting letter into 26 groups, like a phone list. At least you'll have only 100-200 in each group.
Now, if you really really want to load such a long list, consider not using dijit.form.Select as it is just a simple wrapper for the <select> tag. You're essentially inserting one <option> tag at a time, killing performance. You have two choices:
Create the list of <option> tags yourself off-line, then insert into the <select> element in one go.
Consider dijit.form.FilteringSelect instead.
Now, I definitely don't endorse doing the above. You've been warned!

How to approximate processing time?

It's common to see messages like "Installation will take 10 min approx." etc. in desktop applications. So I wonder how I can calculate an approximation of how much time a certain process will take. Of course I won't install anything, but I want to update some internal data, and depending on the user's usage this might take some time.
Is this possible in an iPhone app? How do Cocoa developers do this? Would it be the same way in iPhone apps?
Thanks in advance.
UPDATE: I want to rewrite/edit some files on disk. Most of the time these files are not the same size, so I cannot simply time the first iteration and extrapolate the rest from that.
Is there any API that helps on calculating this?
If you have some list of things to process, each "thing" - usually better to measure a group of 10 or so "things" - is a unit of work. Your goal is to see how long it takes to process a single group and report the estimated time to completion.
One way is to create an NSDate at the start of each group and a new one at the end (the top and bottom of your for loop) for each group. Multiply the difference in seconds by however many groups you have left (minus the one you just processed) and that should be a reasonable estimate of the time remaining.
Of course this gets more complicated if one "thing" takes a lot longer to process than another "thing" - the above approach assumes all things take the same amount of time. In that case, you may need to keep a moving average across the last n "things" (or groups thereof).
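A minimal sketch of that moving-window estimate (in Scala just to show the arithmetic; on the iPhone you'd take the timestamps with NSDate as described above, and the group size and window length are arbitrary choices):

import scala.collection.mutable

class EtaEstimator(totalGroups: Int, window: Int = 10) {
  private val recent = mutable.Queue.empty[Double] // seconds per recent group
  private var done = 0

  def groupFinished(seconds: Double): Unit = {
    recent.enqueue(seconds)
    if (recent.size > window) recent.dequeue()     // keep only the last n groups
    done += 1
  }

  // Average of the window times the number of groups left.
  def secondsRemaining: Option[Double] =
    if (recent.isEmpty) None
    else Some(recent.sum / recent.size * (totalGroups - done))
}

// usage: time each group and feed the estimator
// val t0 = System.nanoTime(); process(group)
// estimator.groupFinished((System.nanoTime() - t0) / 1e9)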
A more detailed response would require more details about your model and what work you're performing.