I have built a service in MarkLogic which is consumed (GET method) by a downstream application.
The REST endpoint has four parameters: startDate, endDate, seqStart and seqLength.
The total number of records to be sent through the REST endpoint is 1.5M, and we are sending them in batches of 25,000.
I have noticed that the elapsed time differs between two executions with the same batch size but different sequence starts:
1) seqStart=1       seqLength=25000  elapsed time: 40s
2) seqStart=100000  seqLength=25000  elapsed time: 70s
Why am I getting different elapsed times for REST calls with the same seqLength but different seqStart values?
I am using fn:subsequence in my CTS query. Is this normal behavior, or do I need to change something in the service?
This is actually a common issue, not only in MarkLogic but in many other DBMS and search engine systems too.
You can locally run a query like this to verify it:
fn:subsequence(cts:search(fn:doc(), cts:true-query()), 1, 10)
And then compare the elapsed time to a query such as:
fn:subsequence(cts:search(fn:doc(), cts:true-query()), 1000000, 10)
Essentially the problem is that MarkLogic has to resolve the entire query and then individually generate and page through each result until it finishes building the results of the page/batch you requested.
The only way to speed this up is to calculate the total number of pages/batches and then iterate in reverse order once you seek a page past the mid-point of the result set.
But pages toward the center of the result set will always be the slowest.
Something like the following should work to return pages closer to the end of the result set faster by building the pages in reverse order:
let $pageNumber := 1000
let $pageSize := 25000
let $resultCount := xdmp:estimate(cts:search(fn:doc(), cts:true-query()))
let $totalPages := fn:ceiling($resultCount div $pageSize)
let $middlePage := $totalPages div 2
let $reverseOrder := $pageNumber gt $middlePage
let $searchOrder := if ($reverseOrder) then "ascending" else "descending"
(: in reverse order, page N from the front is page ($totalPages - N + 1) of the reversed sequence :)
let $start :=
  if ($reverseOrder)
  then ($totalPages - $pageNumber) * $pageSize + 1
  else ($pageNumber - 1) * $pageSize + 1
return fn:subsequence(cts:search(fn:doc(), cts:true-query(), $searchOrder), $start, $pageSize)
There are some other novel tricks you can use as well to build your export faster.
If the data set you are exporting is completely static or unsorted, you could put a unique incremental ID into each document, such as 1, 2, etc. Then you would just need to run:
cts:search(fn:doc(),
    cts:and-query((
        cts:element-range-query(xs:QName("id"), ">=", $start),
        cts:element-range-query(xs:QName("id"), "<=", $end)
    )))
That would return just the results that belong in the page.
If the set is unsorted, or rather order doesn't matter, then this approach is valid for a bulk export. If order does matter, then the time and difficulty of keeping the IDs up to date is probably not worth it unless changes to the data set are highly infrequent.
Another approach you can look at is using smaller batches, running multiple workers/exporters in parallel, and then stitching the export back together yourself after it completes. It sounds like you're doing something similar to this already; I'm just suggesting you continue to scale it out with more parallel workers. The problem you may run into is an incomplete export if the data set changes before the export finishes. A client-side sketch of this approach follows.
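For illustration, a minimal parallel-fetch sketch in Python (the URL, dates and worker count are placeholders; the parameter names are the ones from the question):

import concurrent.futures
import requests

BASE_URL = "http://host:8000/my-service"  # hypothetical endpoint URL
BATCH = 25000
TOTAL = 1500000  # 1.5M records, as in the question

def fetch(seq_start):
    resp = requests.get(BASE_URL, params={
        "startDate": "2020-01-01",  # placeholder dates
        "endDate": "2020-12-31",
        "seqStart": seq_start,
        "seqLength": BATCH,
    })
    resp.raise_for_status()
    return seq_start, resp.content

# fetch the batches in parallel, then stitch them back together in order
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    parts = dict(pool.map(fetch, range(1, TOTAL + 1, BATCH)))

export = b"".join(parts[s] for s in sorted(parts))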
I'm trying to find a query language for our product team, so they can create "red flags" based on complex queries over a collection.
As they are not familiar with code, I've tried looking at a JSONiq solution, but it seems unmaintained, and I couldn't find a simple solution for MongoDB.
So is there a simple alternative? Can Mongo aggregation "stages" accomplish something like the following example (if so, how)?
itemCount = number of total contributionItems
if itemCount > 5
    foreach item
        if (number of items with the same party) / itemCount > 0.8
            save that party as party1
PH1 = party1
for each contributionItem
    if (contributionItem.party != party1)
        add item to array
PH2 = array[item.party]
JSONiq, as a language, is alive and maintained. The specification is not updated often because it is stable. There are a few implementations available, documented on the language website, and these may vary over time (I am unsure whether any currently supports MongoDB specifically, though).
A JSONiq version of your query, as I understand it, would look like:
let $contribution-items := collection("contribution-items")
let $count := count($contribution-items)
where $count gt 5
let $party1 :=
    for $item in $contribution-items
    group by $party := $item.party
    where count($item) gt (0.8 * $count)
    return $party
where exists($party1)
return [ $contribution-items[$$.party ne $party1] ]
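For example (hypothetical data): given six contribution items of which five have party "A", $count is 6 (greater than 5), "A" accounts for 5 items (more than 0.8 × 6 = 4.8), so $party1 binds to "A" and the result is an array containing only the single non-"A" item.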
I'm trying to inject users into a scenario in such a way that it will keep inserting users until every single entry of the feed file is used, since the feed file contains login information. I would like all the users in the feed file to log in. Right now, all I can think of is two possible approaches.
Here, I insert the number of rows in the feed file at once:
scenario("Verified_Login")
.exec(LoginScenario.scn)
.inject(atOnceUsers(number_of_entries_in_feedfile))
Here, I insert a very high time duration, for example 100 seconds, and then make the feed file circular:
scenario("Verified_Login")
.exec(LoginScenario.scn)
.inject(atOnceUsers(1), constantUsersPerSec(1) during (100 seconds))
The problem with the first approach is that I have to find the number of entries in the feed file, which can be tedious as there could be thousands. The problem with the second is that entries could, and probably will, be repeated. So, is there a way to keep injecting users until the feed file runs out of entries?
According to this source from last year, Stéphane Landelle, Gatling's lead contributor, says that you must provide enough data for a simulation to complete using this method.
The post I linked from Stéphane does suggest simply reading the length of the file and using that to drive the number of users, as you already mentioned in your question.
I suggest you read the post, as it gives an alternate method of achieving what you want. It seems to be as close as you will get, unless things have changed.
Here is their code.
val systemsIdentifier = jdbcFeeder(databaseUrl, databaseUser, databasePassword, sql_systemsIdentifier)
val recordCount = systemsIdentifier.records.size
val userCount = 60 // arbitrary: how many virtual users to spread the records across

val comScn = scenario("My scenario")
  .repeat(recordCount / userCount) { // each user works through its share of the records
    feed(systemsIdentifier)
      .exec(performActionsChain)
  }

setUp(comScn.inject(rampUsers(userCount) over (60 seconds))).protocols(httpConf)
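If your feed is a CSV file rather than JDBC, the same trick applies. A minimal sketch against the Gatling 2.x-era API (the file name is hypothetical):

val loginFeeder = csv("logins.csv")       // hypothetical feed file
val userCount = loginFeeder.records.size  // derive the user count from the file itself

val scn = scenario("Verified_Login")
  .feed(loginFeeder)
  .exec(LoginScenario.scn)

setUp(scn.inject(atOnceUsers(userCount)))

This avoids counting the entries by hand: each injected user consumes exactly one record, so the feed file is used up exactly once without wrapping around.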
I am using MSXML v3.0 in a VB 6.0 application. The application calculates the sum of the Amount value across all Transaction nodes using a For Each loop, as shown below:
Set subNodes = docXML.selectNodes("//Transaction")
For Each subNode In subNodes
    total = total + Val(subNode.selectSingleNode("Amount").nodeTypedValue)
Next
This loop is taking too much time; sometimes it takes 15-20 minutes for 60 thousand nodes.
I am looking for an XPath/DOM solution to eliminate this loop, probably something like:
docXML.selectNodes("//Transaction").Sum("Amount")
or
docXML.selectNodes("Sum(//Transaction/Amount)")
Any suggestion to get this sum faster is welcome.
// Open the XML.
XPathDocument docNav = new XPathDocument(@"c:\books.xml");

// Create a navigator to query with XPath.
XPathNavigator nav = docNav.CreateNavigator();

// Find the sum.
// This expression uses standard XPath syntax.
string strExpression = "sum(/bookstore/book/price)";

// Use the Evaluate method to return the evaluated expression.
Console.WriteLine("The price sum of the books is {0}", nav.Evaluate(strExpression));
source: http://support.microsoft.com/kb/308333
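If you need to stay in VB 6.0 with MSXML, note that selectNodes only accepts expressions returning node-sets, so it cannot evaluate sum() directly. An untested sketch of a workaround is to let a tiny inline XSLT stylesheet compute the sum via transformNode (requires a reference to Microsoft XML, v3.0):

' Compute sum(//Transaction/Amount) with an inline XSLT stylesheet
Dim xsl As New MSXML2.DOMDocument30
Dim total As Double

xsl.loadXML "<xsl:stylesheet version=""1.0"" " & _
    "xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">" & _
    "<xsl:output method=""text""/>" & _
    "<xsl:template match=""/"">" & _
    "<xsl:value-of select=""sum(//Transaction/Amount)""/>" & _
    "</xsl:template></xsl:stylesheet>"

total = Val(docXML.transformNode(xsl))

As the next answer explains, replacing // with a specific path (e.g. sum(/x/y/Transaction/Amount)) should speed this up further.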
Any solution that uses the XPath // pseudo-operator on an XML document with 60000+ nodes is going to be quite slow, because //x causes a complete traversal of the tree starting at the root of the document.
The solution can be sped up significantly if a more exact XPath expression is used that doesn't include the // pseudo-operator.
If you know the structure of the XML document, always use a specific chain of location steps -- never //.
If you provide a small example, showing the specific structure of the document, then many people will be able to provide a faster solution than any solution that uses //.
For example, if it is known that all Transaction elements can be selected using this XPath expression:
/x/y/Transaction
then the evaluation of
sum(/x/y/Transaction/Amount)
is likely to be significantly faster than sum(//Transaction/Amount).
Update:
The OP has revealed in a comment that the structure of the XML file is quite simple.
Accordingly, I tried the following with an XML document of 60,000 Transaction nodes:
sum(/*/*/Amount)
With .NET's XslCompiledTransform (yes, I used XSLT as the host for the XPath engine) this took 220ms (milliseconds), that is 0.22 seconds, to produce the sum.
With MSXML3 it takes 334 seconds.
With MSXML6 it takes 76 seconds -- still quite slow.
Conclusion: This is a bug in MSXML3 -- try to upgrade to another XPath engine, such as the one offered by .NET.
I'm having performance problems when inserting many rows into a GTK treeview (using PyGTK) - or when modifying many rows. The problem is that the model seems to get resorted after each change (insert/modification). This causes the GUI to hang for multiple seconds. Leaving the model unsorted by commenting out model.set_sort_column_id(SOME_ROW_INDEX, gtk.SORT_ASCENDING) eliminates these problems.
Therefore, I would like to disable the sorting while I'm inserting or modifying the model, and re-enable it afterwards. Unfortunately, sorting can't be disabled with model.set_sort_column_id(-1, gtk.SORT_ASCENDING). Using the freeze/thaw functions doesn't work either:
treeview.freeze_child_notify()
try:
    for row in model:
        # ... change something in row ...
finally:
    treeview.thaw_child_notify()
So, how can I disable the sorting? Or is there a better method for bulk inserts/modifications?
Solution
Thanks to the information and links bobince provided in his answer, I checked out some of the alternatives:
1) Dummy sorting
tv.freeze_child_notify()
sortSettings = model.get_sort_column_id()
model.set_default_sort_func(lambda *unused: 0) # <-- can also use None but that is slower!
# model.set_default_sort_func(lambda *unused: 1) <-- slow
# model.set_default_sort_func(lambda *unused: -1) <-- crash (access violation in gtk_tree_store_move_after?!)
model.set_sort_column_id(-1, gtk.SORT_ASCENDING)
# change rows
model.set_sort_column_id(*sortSettings)
tv.thaw_child_notify()
This brought the time down from about 11 seconds to 2 seconds. Wow! But it could be better; this was only for 1000 rows.
2) Removing model while updating
tv.set_model(None)
# change rows
tv.set_model(model)
No noticeable difference: 11 seconds.
3) Dummy sorting and the cool generator trick from the PyGTK FAQ
def gen():
    tv.freeze_child_notify()
    sortSettings = model.get_sort_column_id()
    model.set_default_sort_func(lambda *unused: 0)
    model.set_sort_column_id(-1, gtk.SORT_ASCENDING)
    i = 0
    for row in rowsToChange:
        i += 1
        # change something
        if i % 200 == 0:
            # freeze/thaw not really necessary here as sorting is wrong
            # because of the default sort function
            yield True
    model.set_sort_column_id(*sortSettings)
    tv.thaw_child_notify()
    yield False

g = gen()
if g.next():  # run once now, remaining iterations when idle
    gobject.idle_add(g.next)
The result: the same estimated 2 seconds as in solution 1), but the GUI remains responsive during this time. I prefer this method. The modulo 200 can be tweaked to make the GUI more or less responsive if needed.
Maybe it's even possible to subclass gtk.TreeStore to get better results? Haven't tried that yet, but option 3) is good enough for now.
Sounds like you're nearly there. See the FAQ for further notes. In particular, you should also set the default sort function (you can now use None as well as the dummy compare lambda in that example, for better performance) to ensure there is no sorting, and remove the model from the treeview for the duration of the operations.
If it's a lot of changes you may be better off creating and setting a complete new model.
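For completeness, a minimal sketch of that last suggestion (the column types, data source and sort column are placeholders):

# build a fresh model off-screen, then swap it in with a single call,
# so the treeview never resorts row by row
new_model = gtk.ListStore(str, int)        # placeholder column types
for item in items_to_show:                 # placeholder data source
    new_model.append(item)
new_model.set_sort_column_id(0, gtk.SORT_ASCENDING)  # sort once, at the end
treeview.set_model(new_model)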
I'm writing a code-generation script for the UN/LOCODE system. The database has unique 3-letter/number codes within every country. For example, it contains "EE TLL": EE is the country (Estonia) and TLL is the unique code inside Estonia. "AR TLL" can also exist (the country code and the 3-letter/number code are stored separately). Codes are in capital letters.
The database is fairly big and already contains a huge number of locations, and the user also has the option of entering the 3-letter/number code him/herself (it is automatically checked against the database before submission).
Finally, neither 0 nor 1 may be used (possible confusion with O and I).
What I'm searching for is the most efficient way to pick the next available code when none is provided.
What I've come up with:
1) Check from AAA to 999, but each code would then require a new query (slow?).
2) Store all 40,000 possibilities in an array and subtract the used codes that are already in the database... but that uses too much memory IMO (not sure what I'm talking about here actually; maybe 40,000 isn't such a big number).
3) Generate a random code, check whether it exists, and start over if it does. That's just risk-taking.
Is there some magic MySQL query/PHP script that can get me the next available code?
I would go with number 2; it is simple, and 40,000 is not a big number.
To make it more efficient, you can store a number representing each 3-letter code. The conversion should be trivial because you have a total of 34 characters (A-Z and 2-9).
I would go for option 1 (i.e. do a sequential search), adding a table that gives the last assigned code per country (i.e. such that AAA..code are all assigned already). When assigning a new code through the sequential scan, that table gets updated; for user-assigned codes, it remains unmodified.
If you don't want to issue repeated queries, you can also write this scan as a stored routine.
To simplify iteration, it might be better to treat the three-letter codes as numbers (as Shawn Hsiao suggests), i.e. give a meaning to A-Z = 0..25, and 2..9 = 26..33. Then, XYZ is the number X*34^2+Y*34+Z == 23*1156+24*34+25 == 27429. This should be doable using standard MySQL functions, in particular using CONV.
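A minimal PHP sketch of that mapping (A-Z = 0..25, 2-9 = 26..33); note the alphabet is not contiguous in base 36, so the conversion is done by hand here rather than with CONV:

$alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789';

function codeToNumber($code) {
    global $alphabet;
    $n = 0;
    for ($i = 0; $i < strlen($code); $i++)
        $n = $n * 34 + strpos($alphabet, $code[$i]);
    return $n;
}

function numberToCode($n) {
    global $alphabet;
    $code = '';
    for ($i = 0; $i < 3; $i++) {
        $code = $alphabet[$n % 34] . $code;
        $n = (int) ($n / 34);
    }
    return $code;
}

echo codeToNumber('XYZ'); // 27429, as computed above
echo numberToCode(27429); // XYZ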
I went with the 2nd option. I was also able to write a script that matches the country name as closely as possible: for Tartu it will try to match T**, then TA*, and if possible TAR; if not, it will try TAT, as T is the next letter after R in Tartu.
The code is quite extensive, I'll just post the part that takes the first possible code:
$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789';
$length = strlen($allowed);
$codes = array();

// store all possibilities in a huge array
for ($i = 0; $i < $length; $i++)
    for ($j = 0; $j < $length; $j++)
        for ($k = 0; $k < $length; $k++)
            $codes[] = substr($allowed, $i, 1).substr($allowed, $j, 1).substr($allowed, $k, 1);

// collect the codes already used for this country
$used = array();
$query = mysql_query("SELECT code FROM location WHERE country = '$country'");
while ($result = mysql_fetch_array($query))
    $used[] = $result['code'];

// array_diff() preserves keys, so re-index before taking the first entry
$remaining = array_values(array_diff($codes, $used));
$code = $remaining[0];
Thanks for your opinion, this will be the key to transport codes all over the world :)