MarkLogic cts:cluster Inconsistent Results - cluster-analysis

I am trying to use the MarkLogic cts:cluster() function to create groups of documents based on their contents. It usually seems to work OK, but each time that I use the function on the same set of nodes and the same configuration, I get different clusters returned, and quite often I get a single large cluster returned, indicating that the function just gave up I guess.
My question is why do I get different results each time I use the function on the same set of nodes with the same configuration settings? There must be some third, random factor that is influencing the results, but I have not seen any mention of it, nor what its purpose might be. Any ideas?

Related

Postgres query plan changes for identical query when using pg_hint_plan

(Postgres 11.7)
I'm using the Rows pg_hint_plan hint to dynamically fix a bad row-count estimate.
My query accepts an array of arguments, which get unnested and joined onto the rest of the query as a predicate. By default, the query planner always assumes this array-argument contains 100 records, whereas in reality this number could be very different. This bad estimate was resulting in poor query plans. I set the number of rows definitively from within the calling application, by changing the hint text per query.
This approach seems to work sometimes, but I see some strange behaviour testing the query (in DBeaver).
If I start with a brand new connection, when I explain the query (or indeed just run it), the hint seems to be ignored for the first 6 executions, but thereafter it starts getting interpreted correctly. This is consistently reproducible: I see the offending row count estimates change on the 7th execution on a new connection.
More interestingly, the query also uses some (immutable) functions to do some lookup operations. If I remove these and replace them with an equivalent CTE or sub-select, this strange behaviour seems to disappear, and the hints are evaluated correctly all the time, even on a brand new connection.
What could be causing it to not honour the pg_hint_plan hints until after 6 requests have been made in that session? Why does the presence of the functions have a bearing on the hints?
Since you are using JDBC, try setting the prepareThreshold connection parameter to 1, as detailed in the documentation.
That will make the driver use a server prepared statement as soon as possible, ond it seems like this extension only works in that case.

Unit tests producing different results when using PostgreSQL

Been working on a module that is working pretty well when using MySQL, but when I try and run the unit tests I get an error when testing under PostgreSQL (using Travis).
The module itself is here: https://github.com/silvercommerce/taxable-currency
An example failed build is here: https://travis-ci.org/silvercommerce/taxable-currency/jobs/546838724
I don't have a huge amount of experience using PostgreSQL, but I am not really sure why this might be happening? The only thing I could think that might cause this is that I am trying to manually set the ID's in my fixtures file and maybe PostgreSQL not support this?
If this is not the case, does anyone have an idea what might be causing this issue?
Edit: I have looked again into this and the errors appear to be because of this assertion, which should be finding the Tax Rate vat but instead finds the Tax Rate reduced
I am guessing there is an issue in my logic that is causing the incorrect rate to be returned, though I am unsure why...
In the end it appears that Postgres has different default sorting to MySQL (https://www.postgresql.org/docs/9.1/queries-order.html). The line of interest is:
The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on
In the end I didn't really need to test a list with multiple items, so instead I just removed the additional items.
If you are working on something that needs to support MySQL and Postgres though, you might need to consider defining a consistent sort order as part of your query.

google dataprep (clouddataprep by trifacta) tip: jobs will not be able to run if they are to large

During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a number of unions or a joins with these sets, dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that DataPrep simply takes all your GUI based inputs and translates it into Apache Beam code. When you pass in a reference data set, it probably writes some AB code that turns the reference data set into a side-input (https://beam.apache.org/documentation/programming-guide/). DataFlow will perform a Parellel Do (ParDo) function where it takes each element from a PCollection, stuffs it into a worker node, and then applies the side-input data for transformation.
So I am pretty sure if the reference sets get too big (which can happen with Joins), the underlying code will take an element from dataset A, pass it to a function with side-input B...but if side-input B is very big, it won't be able to fit into the worker memory. Take a look at the Stackdriver logs for your job to investigate if this is the case. If you see 'GC (Allocation Failure)' in your logs this is a sign of not enough memory.
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of Join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.

Minizinc, Gecode, how to get an identical solutions across distributed servers, with multi-solution model?

I am using minizinc and gecode to solve a minimization problem in a distributed fashion. I have multiple distributed servers that solve the same model with identical input and I want all the servers to get the same solution.
The problem is that model has multiple solutions, which periodically causes servers to come up with different solutions independently. It is not significant which solution will be chosen, as long as it is identical among all servers. I am also using "-p" arguments with gecode to use multiple threads (if it is relevant).
Is there a way that I could address this issue?
For example, I was thinking about outputting all the solutions and then sort them alphanumerically on each server.
Thanks!
If the search strategy in the model does not contain randomisation, then, assuming all versioning is the same, a single thread executing of Gecode should always return the same answer for the same model and instance data. It does not matter if it's on a different node. Using single threaded execution is the easiest way of ensuring that the same solution is found on all nodes.
If you are however want to use multiple threads, no such guarantee can be made. Due to the concurrency of the program, the execution path can be different every run and a different solution might be found each time.
Your suggestion of sorting the solution is possible, but will come at a price. There are two ways of doing this. You can either find all solutions, using the -a flag, and sort them afterwards or you can change your model to force the solution to be the first solution if you would sort them. This second option can be achieved by changing the search strategy. Both these solutions can be very costly and might (more than) exponentially increase the runtime.
If you are concerned about runtime at all, then I suggest you take Patrick Trentin's advice and run the model on a master node and distribute the solution. This will be the most efficient in computational time and most likely as efficient in runtime.

When's the time to create dedicated collections in MongoDB to avoid difficult queries?

I am asking a question that I assume does not have a simple black and white question but the principal of which I'm asking is clear.
Sample situation:
Lets say I have a collection of 1 million books, and I consistently want to always pull the top 100 rated.
Let's assume that I need to perform an aggregate function every time I perform this query which makes it a little expensive.
It is reasonable, that instead of running the query for every request (100-1000 a second), I would create a dedicated collection that only stores the top 100 books that gets updated every minute or so, thus instead of running a difficult query a 100 times every second, I only run it once a minute, and instead pull from a small collection of books that only holds the 100 books and that requires no query (just get everything).
That is the principal I am questioning.
Should I create a dedicated collection for EVERY query that is often
used?
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough
to leave as is?
Is there any guidelines for best practice in those types of
situations?
Is there a point where if a query runs so often and the data doesn't
change very often that I should keep the data in the server's memory
for direct access? Even if it's a lot of data? How much is too much?
Lastly,
Is there a way in MongoDB to cache results?
If so, how can I tell it to fetch the cached result, and when to regenerate the cache?
Thank you all.
Before getting to collection specifics, one does have to differentiate between "real-time data" vis-a-vis data which does not require immediate and real-time presenting of information. The rules for "real-time" systems are obviously much different.
Now to your example starting from the end. The cache of query results. The answer is not only for MongoDB. Data architects often use Redis, or memcached (or other cache systems) to hold all types of information. This though, obviously, is a function of how much memory is available to your system and the DB. You do not want to cripple the DB by giving your cache too much of available memory, and you do not want your cache to be useless by giving it too little.
In the book case, of 100 top ones, since it is certainly not a real time endeavor, it would make sense to cache the query and feed that cache out to requests. You could update the cache based upon a cron job or based upon an update flag (which you create to inform your program that the 100 have been updated) and then the system will run an $aggregate in the background.
Now to the first few points:
Should I create a dedicated collection for EVERY query that is often used?
Yes and no. It depends on the amount of data which has to be searched to $aggregate your response. And again, it also depends upon your memory limitations and btw let me add the whole server setup in terms of speed, cores and memory. MHO - cache is much better, as it avoids reading from the data all the time.
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
I dont think anyone can really black and white answer to that question for your system. Is a complicated query just an $aggregate? Or is it $unwind and then a whole slew of $group etc. options following? this is really up to the dataset and how much information must actually be read and sifted and manipulated. It will effect your IO and, yes, again, the memory.
Is there a point where if a query runs so often and the data doesn't change very often that I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
See answers above this is directly connected to your other questions.
Finally:
Is there any guidelines for best practice in those types of situations?
The best you can do here is to time the procedures in your code, monitor memory usage and limits, look at the IO, study actual reads and writes on the collections.
Hope this helps.
Use a cache to store objects. For example in Redis use Redis Lists
Redis Lists are simply lists of strings, sorted by insertion order
Then set expiry to either a timeout or a specific time
Now whenever you have a miss in Redis, run the query in MongoDB and re-populate your cache. Also since cache resids in memory therefore your fetches will be extremely fast as compared to dedicated collections in MongoDB.
In addition to that, you don't have to keep have a dedicated machine, just deploy it within your application machine.