Concise way to filter on two child attributes in ArangoDB (AQL / Spring Data ArangoDB)

In ArangoDB I have documents in a trip collection, which are related to documents in a driver collection via edges in a tripToDriver collection, and to documents in a departure collection via edges in a departureToTrip collection.
To fetch trips where their driver has a given idNumber and their associated departure has a startTime after a supplied date/time, I've successfully written the following AQL:
FOR doc IN trip
  LET drivers = (FOR v IN 1..1 OUTBOUND doc tripToDriver RETURN v)
  LET departures = (FOR v IN 1..1 INBOUND doc departureToTrip RETURN v)
  FILTER drivers[0].idNumber == '999999-9999' AND departures[0].startTime >= '2018-07-30'
  RETURN doc
But I wonder if there is a more concise / elegant way to achieve the same results?
A related question, since I'm using Spring Data ArangoDB: Is it possible to achieve this result with derived queries?
For a single relation I was able to create a query like:
Iterable<Trip> findTripsByDriversIdNumber(String driverId);
but I haven't had luck incorporating the departure relation into this signature (maybe because it's inbound?).

First of all, your query only works if every trip has exactly one connected driver and departure: you fetch all linked drivers, but only check the first one found.
If this matches your model it is totally fine, but I would recommend doing the idNumber/startTime checks within the sub-queries. Because we only need to know that at least one driver/departure fits our filter condition, we can add a LIMIT 1 to each sub-query and return true. That is all the FILTER in the main query needs.
FOR doc IN trip
  FILTER (FOR v IN 1..1 OUTBOUND doc tripToDriver FILTER v.idNumber == @idNumber LIMIT 1 RETURN true)[0]
  FILTER (FOR v IN 1..1 INBOUND doc departureToTrip FILTER v.startTime >= @startTime LIMIT 1 RETURN true)[0]
  RETURN doc
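If you want to keep this AQL while staying inside Spring Data ArangoDB, it can be attached to a repository method. A minimal sketch, assuming arangodb-spring-data's @Query annotation with positional bind parameters (@0, @1); the interface and method names are hypothetical:
import com.arangodb.springframework.annotation.Query;
import com.arangodb.springframework.repository.ArangoRepository;

public interface TripRepository extends ArangoRepository<Trip> {

    // The two sub-query FILTERs from the AQL above; @0 and @1 are bound
    // to the method parameters by position.
    @Query("FOR doc IN trip"
            + " FILTER (FOR v IN 1..1 OUTBOUND doc tripToDriver FILTER v.idNumber == @0 LIMIT 1 RETURN true)[0]"
            + " FILTER (FOR v IN 1..1 INBOUND doc departureToTrip FILTER v.startTime >= @1 LIMIT 1 RETURN true)[0]"
            + " RETURN doc")
    Iterable<Trip> findByDriverIdNumberAndDepartureStartTimeAfter(String idNumber, String startTime);
}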
I tried to solve your case with a derived query. It would work if there weren't a bug in the current arangodb-spring-data release. The bug is already fixed, but not yet released. You can already test it using a snapshot version of arangodb-spring-data (1.3.1-SNAPSHOT or 2.3.1-SNAPSHOT, depending on your Spring Data version; see supported versions).
The following derived query method worked for me with the snapshot version.
Iterable<Trip> findByDriversIdNumberAndDeparturesStartTimeGreaterThanEqual(String idNumber, LocalDate startTime);
To make the derived query work, you need the following annotated fields in your Trip class:
@Relations(edges = TripToDriver.class, direction = Direction.OUTBOUND, maxDepth = 1)
private Collection<Driver> drivers;

@Relations(edges = DepartureToTrip.class, direction = Direction.INBOUND, maxDepth = 1)
private Collection<Departure> departures;
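For context, those fields live on the entity class itself. A minimal sketch of Trip, assuming the usual arangodb-spring-data @Document annotation and Spring Data's @Id (collection name taken from the question; getters omitted):
import org.springframework.data.annotation.Id;
import com.arangodb.springframework.annotation.Document;

@Document("trip")
public class Trip {

    @Id
    private String id;

    // ... plus the two @Relations fields shown above ...
}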
I also created a working example project on GitHub.

Related

How to obtain the output of the pipeline and perform read&write to Cloud Firestore

I am using Apache Beam to consume logs from Pub/Sub that contain pageview traffic information. Each page has a unique ID, and when a pageview log arrives from Pub/Sub, Cloud Dataflow collects them in a constant windowed manner and counts them. At the end of the combiner, we get something like this:
12345, 2
12456, 1
15213, 1
...
As I understand it, ParDo is a Beam transform for generic parallel processing. After the combine step, I want to implement a transform that queries Cloud Firestore for the existing pageview ID, takes the current view count, adds the new count to it, and performs a write operation to update the view count, one by one, from the combined output shown above. Any suggestions?
Below is my code so far for UpdateIntoFireStore. When I get the query results, it seems wrong to need a for loop to read them (it will only ever be one row, since the pageview ID is unique):
class UpdateIntoFireStore(beam.DoFn):
    def process(self, element):
        listingid, count = element
        doc_ref = db.collection('listings').where('listingid', u'==', '12345')
        try:
            docs = doc_ref.get()
            for doc in docs:
                print doc
        except NotFound:
            print(u'No such document!')
I solved it. There is no need for a loop to retrieve the data; I should fetch the particular document by its name instead:
doc_ref = db.collection(u'listings').document(listingid)
try:
    doc = doc_ref.get()
    doc_dict = doc.to_dict()
    self.cur_count = doc_dict[u'count']
    doc_ref.update({
        u'count': self.cur_count + count
    })
except NotFound:
    doc_ref.set({'count': count})
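For anyone doing the same from the Firestore Java client, the equivalent read-then-update-or-create flow looks roughly like this (a sketch only; the helper name is hypothetical and the Beam plumbing and error handling are omitted):
import com.google.cloud.firestore.DocumentReference;
import com.google.cloud.firestore.DocumentSnapshot;
import com.google.cloud.firestore.Firestore;
import java.util.Collections;

// Hypothetical helper mirroring the Python DoFn above.
static void updateViewCount(Firestore db, String listingId, long count) throws Exception {
    DocumentReference docRef = db.collection("listings").document(listingId);
    DocumentSnapshot snapshot = docRef.get().get(); // ApiFuture#get() blocks for the result
    if (snapshot.exists()) {
        Long current = snapshot.getLong("count");  // current stored count
        docRef.update("count", current + count);   // add the new window's count
    } else {
        docRef.set(Collections.singletonMap("count", (Object) count)); // first sighting of this ID
    }
}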

Thinking Sphinx indexing performance

I have a large index definition that takes too long to index. I suspect the main problem is caused by the many LEFT OUTER JOINs generated.
I saw this question, but can't find documentation about using source: :query, which seems to be part of the solution.
My index definition and the resulting query can be found here: https://gist.github.com/jonsgold/fdd7660bf8bc98897612
How can I optimize the generated query to run faster during indexing?
The 'standard' sphinx solution to this would be to use ranged queries.
http://sphinxsearch.com/docs/current.html#ex-ranged-queries
... splitting the query up into lots of small parts, so the database server has a better chance of being able to run each one (rather than one huge query).
But I have no idea how to actually enable that in Thinking Sphinx; I can't see anything in the documentation. I could help you edit the sphinx.conf, but I'm also not sure how TS would cope with you manually editing the config file.
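For reference, in raw sphinx.conf terms ranged fetching is driven by the sql_query_range and sql_range_step directives (per the Sphinx docs linked above; the source name, table, and columns here are only illustrative):
source incidents_source
{
  # Sphinx first fetches the ID bounds, then runs sql_query repeatedly
  # in chunks of sql_range_step, substituting $start and $end.
  sql_query_range = SELECT MIN(id), MAX(id) FROM incidents
  sql_range_step  = 1000
  sql_query       = SELECT id, name FROM incidents WHERE id >= $start AND id <= $end
}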
This is the solution that worked best (from the linked question). Basically, you can remove a piece of the main sql_query and define it separately as a sql_joined_field in the sphinx.conf file.
It's important to add all relevant SQL conditions to each sql_joined_field (such as sharding indexes by modulo on the ID). Here's the new definition:
ThinkingSphinx::Index.define(
  :incident,
  with: :active_record,
  delta?: false,
  delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
  indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
  ...
  has
  ...
end
ThinkingSphinx::Index.define(
  :incident,
  with: :active_record,
  delta?: true,
  delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
  indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND incidents.delta = 1 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
  ...
  has
  ...
end
The magic that defines the field site as a separate query is the option source: :query at the end of the line.
Notice the core index definition has the parameter delta?: false, while the delta index definition has the parameter delta?: true. That's so I could use the condition WHERE incidents.delta = 1 in the delta index and filter out irrelevant records.
I found sharding didn't perform any better, so I reverted to one unified index.
See the whole index definition here: https://gist.github.com/jonsgold/05e2aea640320ee9d8b2.
Important to remember!
The Sphinx document ID offset must be handled manually. That is, whenever an index for another model is added or removed, my calculated document IDs will change and must be updated.
So, in my example, if I added an index for a different model (not :incident), I would have to run rake ts:configure to find out my new offset and change incidents.id * 51 + 7 accordingly.

Slick: Filtering all records which have a joda DateTime date equal to today

One way to achieve it would be like this:
val now = DateTime.now
val today = now.toLocalDate
val tomorrow = today.plusDays(1)
val startOfToday = today.toDateTimeAtStartOfDay(now.getZone)
val startOfTomorrow = tomorrow.toDateTimeAtStartOfDay(now.getZone)
val todayLogItems = logItems.filter(logItem =>
  logItem.MyDateTime >= startOfToday && logItem.MyDateTime < startOfTomorrow
).list
Is there any way to write the query more concisely? Something along the lines of:
logItems.filter(_.MyDateTime.toDate == DateTime.now.toDate).list
I'm asking this because in LINQ to NHibernate that is achievable (Fetching records by date with only day part comparison using nhibernate).
Unless the Slick joda mapper adds support for comparisons, you are out of luck unless you add it yourself. If you want to give it a shot, these may be helpful pointers:
* http://slick.typesafe.com/doc/2.0.0/userdefined.html
* http://slick.typesafe.com/doc/2.0.0/api/#scala.slick.lifted.ExtensionMethods
* https://github.com/slick/slick/blob/2.0.0/src/main/scala/scala/slick/lifted/ExtensionMethods.scala
I created a ticket to look into it in Slick at some point: https://github.com/slick/slick/issues/627
You're confusing matters by working with LocalDateTimes instead of using LocalDates directly:
val today = LocalDate.now
val todayLogItems = logItems.filter(_.MyDateTime.toLocalDate isEqual today)
UPDATE
A major clarification is needed on the question here: Slick was only mentioned in passing, by way of a tag.
However... Slick is central to this question, which hinges on the fact that the filter operation is actually translated into an SQL query by way of PlainColumnExtensionMethods.
I'm not overly familiar with the library, but this must surely mean that you're restricted to operations which can be executed in SQL. As this is a Column[DateTime], you must therefore compare it to another DateTime.
As for the LINQ example, it seems to recommend first fetching everything and then proceeding as per my example above (performing the comparison in Scala and not in SQL). This is an option, but I suspect you won't want the performance cost that it entails.
UPDATE 2 (just to clarify)
There is no answer.
There's no guarantee that your underlying database has the ability to do an equality check between dates and timestamps, so Slick can't rely on such an ability existing.
You're stuck between a rock and a hard place. Either do the range check between timestamps as you already are, or pull everything from the query and filter it in Scala, with the heavy performance cost that this would likely involve.
FINAL UPDATE
To refer to the Linq/NHibernate question you referenced, here are a few quotes:
You can also use the date function from Criteria, via SqlFunction
It depends on the LINQ provider
I'm not sure if NHibernate LINQ provider supports...
So the answers there seem to be either:
* Relying on NHibernate to push the date coercion logic into the DB, perhaps silently crippling performance (by fetching all records and filtering locally) if this is not possible
* Relying on you to write custom SQL logic
The best-case scenario is that NHibernate could translate date/timestamp comparisons into timestamp range checks. Doing something like that is quite a deep question about how Slick (and slick-joda-mapper) handle comparisons; the fact that you'd use it in a filter is incidental.
You'd need an extremely compelling use case to write a feature like this yourself, given the risk of creating complicated bugs. You'd be better off:
* splitting the column into separate date/time columns
* adding the date as a calculated column (maybe in a view)
* using custom SQL (or a stored proc) for the query
* sticking with the range check
* using a helper function
In the case of a helper:
def equalsDate(dt: LocalDate) = {
  val start = dt.toDateTimeAtStartOfDay()
  val end = dt.plusDays(1).toDateTimeAtStartOfDay()
  (col: Column[DateTime]) => {
    col >= start && col < end
  }
}
val isToday = equalsDate(LocalDate.now)
val todayLogItems = logItems.filter(x => isToday(x.MyDateTime))

Fetch object by plain SQL query with SORM

Is it possible in SORM to fetch items with a plain SQL query instead of building the query with the DSL?
For example, is there an API for doing something like
val metallica = Db.query[Artist].fromString("SELECT * FROM artist WHERE name = ?", "Metallica").fetchOne() // Option[Artist]
instead of
val metallica = Db.query[Artist].whereEqual("name", "Metallica").fetchOne() // Option[Artist]
Since populating an entity with collections and other structured values involves fetching data from multiple tables in an unjoinable way, an API for fetching it directly will most probably never get exposed. However, another approach to this problem is currently being considered.
Here's how it could be implemented:
val artists : Seq[Artist]
= Db.fetchWithSql[Artist]("SELECT id FROM artist WHERE name = ?", "Metallica")
If this issue gets notable support either here or, even better, here, it will probably get implemented in the next minor release.
Update
Implemented in 0.3.1.
If you want to fetch only one object (by two or more arguments), you can also do the following,
using the SORM Querier:
Db.query[Artist].where(Querier.And(Querier.Equal("name", "Name"), Querier.Equal("surname", "surname"))).fetchOne()
or just
Db.query[Artist].whereEqual("name", "Name").whereEqual( "surname","surname").fetchOne()

QueryDSL: querying relations and properties

I'm using QueryDSL with JPA.
I want to query some properties of an entity, it's like this:
QPost post = QPost.post;
JPAQuery q = new JPAQuery(em);
List<Object[]> rows = q.from(post).where(...).list(post.id, post.name);
It works fine.
If I want to query a relation property, e.g. comments of a post:
List<Set<Comment>> rows = q.from(post).where(...).list(post.comments);
It's also fine.
But when I want to query relations and simple properties together, e.g.
List<Object[]> rows = q.from(post).where(...).list(post.id, post.name, post.comments);
then something goes wrong, generating bad SQL syntax.
I realized that it's not possible to query them together in one SQL statement.
Is it possible that QueryDSL would somehow deal with relations and generate additional queries (just like what hibernate does with lazy relations), and load the results in?
Or should I just query twice, and then merge both result lists?
P.S. What I actually want is each post with its comments' ids, so a function to concat each post's comment ids would be even better. Is this kind of expression possible?
q.list(post.id, post.name, post.comments.all().id.join())
generating subquery SQL like (select group_concat(c.id) from comments as c inner join post where c.id = post.id)
Querydsl JPA is restricted to the expressivity of JPQL, so what you are asking for is not possible with Querydsl JPA. You can, though, try to express it with Querydsl SQL. Also, since you don't project entities, but literals and collections, it might work just fine.
Alternatively you can load the Posts with only the Comment ids loaded and then project the id, name and comment ids to something else. This should work when accessors are annotated.
The simplest thing would be to query for Posts and use fetchJoin for comments, but I'm assuming that's too slow for your use case.
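For reference, a fetch join in this generation of the Querydsl API would look roughly like the following (entity names taken from the question; a sketch, not a tested drop-in):
JPAQuery q = new JPAQuery(em);
List<Post> posts = q.from(post)
        // fetch() marks the join as a JPQL fetch join, so comments are
        // loaded together with their posts in one query
        .leftJoin(post.comments).fetch()
        .where(post.name.isNotNull()) // illustrative filter
        .distinct()
        .list(post);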
I think you ought to simply project the required properties of posts and comments and group the results by hand (if required). E.g.
QPost post = QPost.post;
QComment comment = QComment.comment;
List<Tuple> rows = q.from(post)
    // Or leftJoin if you want also posts without comments
    .innerJoin(comment).on(comment.postId.eq(post.id))
    .orderBy(post.id.asc()) // Could be used to optimize grouping
    .list(new QTuple(post.id, post.name, comment.id));

Map<Long, PostWithComments> results = new LinkedHashMap<Long, PostWithComments>();
for (Tuple row : rows) {
    PostWithComments res = results.get(row.get(post.id));
    if (res == null) {
        res = new PostWithComments(row.get(post.id), row.get(post.name));
        results.put(res.getPostId(), res);
    }
    res.addCommentId(row.get(comment.id));
}
NOTE: You cannot use limit or offset with this kind of query.
As an alternative, it might be possible to tune your mappings so that 1) Comments are always lazy proxies so that (with property access) Comment.getId() is possible without initializing the actual object, and 2) batch fetching* is used on Post.comments to optimize collection fetching. This way you could just query for Posts and then access the ids of their comments with little performance hit. In most cases you shouldn't even need those lazy proxies unless your Comment is very fat. That kind of code would certainly look nicer without low-level row handling, and you could also use limit and offset in your queries. Just keep an eye on your query log to make sure everything works as intended.
*) Batch fetching isn't directly supported by JPA, but Hibernate supports it through mapping and EclipseLink through query hints.
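For the Hibernate case, such a mapping could look like this minimal sketch (the Post/Comment layout is assumed from the question; @BatchSize is a Hibernate-specific annotation):
import java.util.Set;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.OneToMany;
import org.hibernate.annotations.BatchSize;

@Entity
public class Post {

    @Id
    @GeneratedValue
    private Long id;

    private String name;

    // Collections are lazy by default; @BatchSize lets Hibernate initialize
    // the comments of up to 16 Posts with a single query instead of one per Post.
    @OneToMany(mappedBy = "post")
    @BatchSize(size = 16)
    private Set<Comment> comments;
}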
Maybe some day Querydsl will support this kind of result-grouping post-processing out of the box...