Why does this ScalaQuery statement only delete the odd rows?

When attempting to delete a batch of records, only the odd rows are deleted!
val byUser = Orders.createFinderBy(_.userID)
byUser(id).mutate(_.delete)
If I instead print the record, I get the correct number of rows.
byUser(id).mutate{x => x.echo}
I worked around the issue like this, which generates the desired SQL.
(for{o <- Orders if o.userID is id.bind } yield o).delete
But, why or how does the mutate version affect only the odd rows?

I've had a dig around in the source code and it seems to be as @RexKerr says: an iterator is used to process the elements, applying the deletions as it iterates (the while loop in the mutate method here):
https://github.com/rjmac/scala-query/blob/master/src/main/scala/org/scalaquery/MutatingInvoker.scala
Interestingly there is a previousAfterDelete flag that can be used to force the iterator backwards after each deletion. This appears to be set to true for Access databases (see the AccessQueryInvoker class) but not others:
https://github.com/rjmac/scala-query/blob/master/src/main/scala/org/scalaquery/ql/extended/AccessDriver.scala
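To make the mechanism concrete, here is a minimal sketch in plain Scala (it does not use ScalaQuery; the names are illustrative, the ArrayBuffer stands in for the driver's result-set cursor, and the flag mirrors the previousAfterDelete behaviour described above):

import scala.collection.mutable.ArrayBuffer

// Simulates a cursor that deletes the current row and then moves forward.
// Without stepping back after a delete, the row that shifts into the current
// position is never visited, so only every other row gets deleted.
def deleteAll(rows: ArrayBuffer[Int], previousAfterDelete: Boolean): Unit = {
  var i = 0
  while (i < rows.length) {
    rows.remove(i)                    // the mutate(_.delete) step on the current row
    if (!previousAfterDelete) i += 1  // cursor advances past the row that just shifted in
  }
}

val a = ArrayBuffer(1, 2, 3, 4, 5, 6)
deleteAll(a, previousAfterDelete = false)
println(a)  // ArrayBuffer(2, 4, 6): rows 1, 3 and 5 were deleted, the rest were skipped

val b = ArrayBuffer(1, 2, 3, 4, 5, 6)
deleteAll(b, previousAfterDelete = true)
println(b)  // ArrayBuffer(): the cursor re-reads each position, so every row is deleted

Run as a Scala script or in the REPL, the first call reproduces the "odd rows only" symptom and the second shows why forcing the iterator backwards after each deletion fixes it.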
I would recommend downloading the sources and debugging the code. Perhaps this flag should be set for the database vendor you are using. I'd also consider filing a bug report:
http://scalaquery.org/community.html
PS: I know this is an old question, but I answered it just in case anyone else has had this problem.

Related

Get count(null) as zero in Grafana - InfluxDB data source

Is it possible to set the value of count to zero when the result to which count is applied is null?
SELECT count(status) FROM (
    SELECT last("P-status") AS "status"
    FROM "autogen"."Pl"
    WHERE ("Platform" = 'Database-plat' AND "P-status" = 'ERROR') AND time >= now() - 1m
    GROUP BY time(500ms), "Node" fill(0)
)
In this case, if the inner query returns null (for all the Nodes), count doesn't return any value, since fill is ignored. I need the value to be zero, so that any other operation I have to perform on the returned result can still be done.
If possible, how can it be done?
I know this is an old question, but as it is still very relevant and I have been struggling with this problem for over a year now, I'd like to answer this unanswered question with the current status of the issue according to my research:
The GitHub issues here and here imply that this problem has been known since 2016 and is not really acknowledged by the contributors as a problem; the rationales given for the implementation (like "it's not a bug, it's a feature, because of ambiguities with multiple series") are questionable and could easily be addressed with special rules for unique series identification, but there has not been much activity despite heavy interest from the user community. Another point is that they have published version 2.x, which relies more on their new query language (Flux), so it is very likely they have more or less abandoned the 1.x branch with InfluxQL (except perhaps for QL backwards compatibility in 2.x and some minor updates, but I'm not sure).
Meanwhile I have updated Grafana several times, but I had to stick with InfluxDB 1.x for a couple of reasons, and the Flux support changed at some point (the Flux plugin was deprecated and Flux was folded into the standard InfluxDB data source, but the latter doesn't really work), so Flux in Grafana has basically not been working for a while now. I had hoped for better handling of the counting problem there, but now I'm out of luck regarding counting anything in InfluxDB reliably. I even tried some tricks with the sum() function, fancy grouping, and dummy values that I then had to subtract again, and whatnot, but it always boiled down to the same conclusion: InfluxDB can do a lot, but counting just doesn't work.
It's very unsatisfying, but there doesn't seem to be a way to achieve the "eager" goal of counting data points without a system of bloated queries, excessive use of strange rules, dummy values, and the risk that any query might break at any time, or break if you need to query only a specific time frame (where a dummy-value workaround might not work). And given the priority this has been assigned, it might not be fixed in the near future.

Dataset.unpersist() unexpectedly affects count of other RDDs

I ran into a strange problem where calling unpersist() on one Dataset affects the count of another Dataset in the same block of code. Unfortunately this happens during a complex long-running job with many Datasets so I can't summarize the whole thing here. I know this makes for a difficult question, however let me try to sketch it out. What I'm looking for is some confirmation that this behavior is unexpected and any ideas about why it may be occurring or how we can avoid it.
Edit: This problem as reported occurs on Spark 2.1.1, but does not occur on 2.1.0. The problem is 100% repeatable, but only in my project with thousands of lines of code and data; I'm working to distill it down to a concise example but have not yet been able to, and I will post any updates or re-submit my question if I find something. The fact that the exact same code and data works in 2.1.0 but not 2.1.1 leads me to believe it is due to something within Spark.
val claims:Dataset = // read claims from file
val accounts:Dataset = // read accounts from file
val providers:Dataset = // read providers from file
val payers:Dataset = // read payers from file
val claimsWithAccount:Dataset = // join claims and accounts
val claimsWithProvider:Dataset = // join claims and providers
val claimsWithPayer:Dataset = // join claimsWithProvider and payers
claimsWithPayer.persist(StorageLevel.MEMORY_AND_DISK)
log.info("claimsWithPayer = " + claimsWithPayer.count()) // 46
// This is considered unnecessary intermediate data and can leave the cache
claimsWithAccount.unpersist()
log.info("claimsWithPayer = " + claimsWithPayer.count()) // 41
Essentially, calling unpersist() on one of the intermediate data sets in a series of joins affects the number of rows in one of the later data sets, as reported by Dataset.count().
My understanding is that unpersist() should remove data from the cache, but shouldn't it leave the count and contents of other data sets unaffected? This is especially surprising since I explicitly persist claimsWithPayer before I unpersist the other data.
I believe the behaviour you are experiencing is related to the change described as "UNCACHE TABLE should un-cache all cached plans that refer to this table".
I think you may find more information in SPARK-21478 Unpersist a DF also unpersists related DFs, where Xiao Li said:
This is by design. We do not want to use the invalid cached data.
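If you need claimsWithPayer to survive that cascading un-cache, one possible workaround (just a sketch, assuming Spark 2.1+ and reusing the placeholder names from the question; spark and log are assumed to be in scope) is to cut the lineage with Dataset.checkpoint before unpersisting the intermediate data, so the cached result no longer refers to claimsWithAccount:

import org.apache.spark.storage.StorageLevel

// checkpoint(eager = true) materialises the Dataset to the checkpoint directory and
// truncates its lineage, so un-caching an upstream Dataset should no longer
// invalidate or recompute it.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
val stableClaimsWithPayer = claimsWithPayer.checkpoint(eager = true)
stableClaimsWithPayer.persist(StorageLevel.MEMORY_AND_DISK)
log.info("claimsWithPayer = " + stableClaimsWithPayer.count())

claimsWithAccount.unpersist()  // should no longer affect the checkpointed data
log.info("claimsWithPayer = " + stableClaimsWithPayer.count())

Whether this is acceptable depends on the job: checkpointing writes the data out, which costs I/O, but it gives a count that stays stable when upstream caches are dropped.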

False Error 522: circular reference

I have a large spreadsheet: 700+ rows, each having references to the previous row. I use reference functions: ROW(), COLUMN() and INDIRECT(), ADDRESS(). (Yes, I have considered fixing values every 50-100 rows to reduce calculation trail.)
Until recently I used OpenOffice.org and it worked fine. LibreOffice, however, seems to give up after some rows when the file is opened, and further calculations become Error 522. Sometimes a change makes it recalculate everything, the errors disappear, and they don't reappear when I undo the change. I have also found out about Ctrl+Shift+F9 (which must force a recalculation), which also makes the errors disappear.
Even though the file has been saved and re-saved by LibreOffice several times, it still reports a false Error 522 when I open it, so it doesn't seem to be a compatibility problem.
Is the problem that a very long, branched-out calculation trail makes the software think it will never get to the initial values and must therefore be circular? (Which my idea of fixing values would solve.) Or could there be something else I may have missed?
UPDATE
I don't see how INDEX() would help. I want to refer to the cell immediately above, or a cell from the row immediately above. Cell d46 could point to d45 or b45 or $a45, and that would work when copying a row, but not when inserting or deleting a row: if you insert a row just above, references pointing one row above would start pointing two rows above, so each time I would have to edit the formulae. Each row contains several references to the row just above, so I thought the easiest way would be INDIRECT(ADDRESS(ROW()-1,COLUMN())) for the same column, or INDIRECT(ADDRESS(ROW()-1,1)) for column A... Any better solutions?
I do not know the specifics of the problem, but it sounds like it would help to simplify the formulas, as you suggested.
Another possibility is to write macros to handle some of the calculation work. Besides Basic, macros can be written in Java, which you seem to be familiar with. Macros can be called from a spreadsheet function, or called when the document is loaded.
It may also help to use a more powerful tool such as LibreOffice Base with MySQL. Often spreadsheets that need a lot of INDIRECT() and ADDRESS() are really using database-type logic.

Find First and First Difference in Progress 4GL

I'm not clear about the queries below and am curious to know what the difference between them is, even though both retrieve the same results. (Database used: sports2000.)
FOR EACH Customer WHERE State = "NH",
    FIRST Order OF Customer:
    DISPLAY Customer.Cust-Num NAME Order-Num Order-Date.
END.

FOR EACH Customer WHERE State = "NH":
    FIND FIRST Order OF Customer NO-ERROR.
    IF AVAILABLE Order THEN
        DISPLAY Customer.Cust-Num NAME Order-Num Order-Date.
END.
Please explain the difference.
As AquaAlex says, your first snippet is a join (the "," part of the syntax makes it a join) and has all of the pros and cons he mentions. There is, however, a significant additional "con": the join is being made with FIRST, and FOR ... FIRST should never be used.
FOR LAST - Query, giving wrong result
It will eventually bite you in the butt.
FIND FIRST is not much better.
The fundamental problem with both statements is that they imply that there is an order of which your desired record is the FIRST instance. But no part of the statement specifies that order. So in the event that there is more than one record that satisfies the query, you have no idea which record you will actually get. That might be OK if the only reason you are doing this is to probe whether at least one record exists and you have no intention of actually using the record buffer. But if that is the case, then CAN-FIND() would be a better statement to use.
There is a myth that FIND FIRST is supposedly faster. If you believe this, or know someone who does, I urge you to test it. It is not true. It is true that in the case where the query matches a large set of records, adding FIRST is faster -- but that is not apples to apples. That is throwing away the bushel after randomly grabbing an apple. And if you code like that, your apple now has magical properties which will lead to impossible-to-cure bugs.
OF is also problematic. OF implies a WHERE clause based on the compiler guessing that fields with the same name in both tables and which are part of a unique index can be used to join the tables. That may seem reasonable, and perhaps it is, but it obscures the code and makes the maintenance programmer's job much more difficult. It makes a good demo but should never be used in real life.
Your first statement is a join, which means less network traffic. And you will only receive records where both the customer and the order exist, so you do not need to do any further checks. (MORE EFFICIENT)
The second statement will retrieve each customer and then, for each customer found, do a FIND on Order. Because there may not be an order, you also need an additional check (IF AVAILABLE). This is a less efficient way to retrieve the records and will result in much more unwanted network traffic and more statements being executed.

Why is ScalaCheck discarding so many generated values in my specification?

I have written a ScalaCheck test case within Specs2. The test case gives up because too many tests were discarded. However, it doesn't tell me why they were discarded. How can I find out why?
Set a breakpoint on the org.scalacheck.Gen.fail method and see what is calling it.
Incidentally, in my case the problem was twofold:
I had set maxDiscarded to a value (1) that was too small, because I was being too optimistic - I didn't realise that ScalaCheck would start at a collection of size 0 by default even if I asked for a non-empty collection (I don't know why it does this).
I was generating collections of size 1 and up even though, as I later realised, they should have been of size 2 and up for what I was trying to test, which was causing further discards in later generators built on that generator (see the sketch below).
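For reference, here is a minimal sketch of that second point (the generator and property names are illustrative, not taken from my original code): filtering with suchThat discards every generated value that fails the predicate, while building values that satisfy the constraint by construction produces no discards at all.

import org.scalacheck.{Gen, Prop, Properties}

object DiscardSketch extends Properties("DiscardSketch") {

  // Discard-prone: lists of size 0 or 1 are generated first and then thrown away.
  val atLeastTwoFiltered: Gen[List[Int]] =
    Gen.listOf(Gen.choose(0, 100)).suchThat(_.size >= 2)

  // Discard-free: two mandatory elements are generated up front.
  val atLeastTwoByConstruction: Gen[List[Int]] = for {
    a    <- Gen.choose(0, 100)
    b    <- Gen.choose(0, 100)
    rest <- Gen.listOf(Gen.choose(0, 100))
  } yield a :: b :: rest

  property("filtered generator can exhaust the discard budget") =
    Prop.forAll(atLeastTwoFiltered)(xs => xs.size >= 2)

  property("constructive generator never discards") =
    Prop.forAll(atLeastTwoByConstruction)(xs => xs.size >= 2)
}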