Merge vertex lists in Gremlin (OrientDB)

I am new to graph databases. I wrote a query to get the leaves of a tree, and I also have a list of IDs. I want to merge both lists of vertices into a new one, remove duplicates, and sum a property over the result. I cannot merge the first two sets of vertices:
g.V().hasLabel('Group').has('GroupId','G001').repeat(
outE().inV()
).emit().hasLabel('User').as('UsersList1')
.V().has('UserId', within('001','002')).as('UsersList2')
.select('UsersList1','UsersList2').dedup().values('petitions').sum().unfold()
Regards

There are several things wrong in your query:
You call V().has('UserId', within('001','002')) once for every user found by the first part of the traversal.
The traversal could emit more than just the leaves.
select('UsersList1','UsersList2') creates pairs of users.
values('petitions') tries to access the property petitions on each pair, which will always fail.
The correct approach would be:
g.V().has('User', 'UserId', within('001','002')).fold().
union(unfold(),
V().has('Group','GroupId','G001').
repeat(out()).until(hasLabel('User'))).
dedup().
values('petitions').sum()

I didn't test it, but I think the following will do:
g.V().union(
hasLabel('Group').has('GroupId','G001').repeat(
outE().inV()
).until(hasLabel('User')),
has('UserId', within('001','002')))
.dedup().values('petitions').sum()
In order to get only the tree leaves, it is better to use until. Using emit will output all inner tree nodes as well.
union merges the two inner traversals.
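The difference between emit and until, and the union-then-dedup idea, can be sketched in plain Python. This is only a toy model, not the Gremlin API: the graph, the vertex names, and the petitions values are made up for illustration.

```python
# Toy model of the traversal logic (hypothetical data, not Gremlin).
children = {
    "G001": ["G002", "u001"],
    "G002": ["u003", "u002"],
}
label = {"G001": "Group", "G002": "Group",
         "u001": "User", "u002": "User", "u003": "User", "u004": "User"}
petitions = {"u001": 1, "u002": 2, "u003": 3, "u004": 4}

def descend_emit(start):
    """Roughly repeat(out()).emit(): yield every vertex reached, inner nodes included."""
    stack = list(children.get(start, []))
    while stack:
        v = stack.pop()
        yield v
        stack.extend(children.get(v, []))

def descend_until_user(start):
    """Roughly repeat(out()).until(hasLabel('User')): stop at User vertices and yield only those."""
    stack = list(children.get(start, []))
    while stack:
        v = stack.pop()
        if label[v] == "User":
            yield v
        else:
            stack.extend(children.get(v, []))

def merged_users(group_id, user_ids):
    """Union of the explicit ID list and the group's users, with duplicates removed."""
    merged = list(user_ids) + list(descend_until_user(group_id))
    return list(dict.fromkeys(merged))  # dedup, keeping first occurrence

users = merged_users("G001", ["u001", "u004"])
total = sum(petitions[u] for u in users)
```

Here descend_emit("G001") also yields the inner group G002, while descend_until_user("G001") yields only users. u001 appears in both inputs but is counted once in total.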

Related

Querying on multiple LINKMAP items with OrientDB SQL

I have a class that contains a LINKMAP field called links. This class is used recursively to create arbitrary hierarchical groupings (something like the time-series example, but not with the fixed year/month/day structure).
A query like this:
select expand(links['2017'].links['07'].links['15'].links['10'].links) from data where key='AAA'
Returns the actual records contained in the last layer of "links". This works exactly as expected.
But a query like this (note the 10,11 in the second to last layer of "links"):
select expand(links['2017'].links['07'].links['15'].links['10','11'].links) from data where key='AAA'
Returns two rows of the last layer of "links" instead:
{"1000":"#23:0","1001":"#24:0","1002":"#23:1"}
{"1003":"#24:1","1004":"#23:2"}
Using unionAll or intersect (with or without UNWIND) results in this single record:
[{"1000":"#23:0","1001":"#24:0","1002":"#23:1"},{"1003":"#24:1","1004":"#23:2"}]
But nothing I've tried (including various attempts at "compound" SELECTs) will get the expand to work as it does with the original example (i.e. return the actual records represented in the last LINKMAP).
Is there a SQL syntax that will achieve this?
Note: Even this (slightly modified) example from the ODB docs does not result in a list of linked records:
select expand(records) from
(select unionAll(years['2017'].links['07'].links['15'].links['10'].links, years['2017'].links['07'].links['15'].links['11'].links) as records from data where key='AAA')
Ref: https://orientdb.com/docs/2.2/Time-series-use-case.html
I'm not sure of what you want to achieve, but I think it's worth trying with values():
select expand(links['2017'].links['07'].links['15'].links['10','11'].links.values()) from data where key='AAA'
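What the values() suggestion is aiming for can be illustrated with a small Python model. It says nothing about how OrientDB evaluates the query. It only shows the intended reshaping: two maps of key to record ID become one flat list of record IDs, which is what expand would then need to resolve. The function name is hypothetical and the data is copied from the two rows shown in the question.

```python
# The two LINKMAP rows returned by the query in the question.
rows = [
    {"1000": "#23:0", "1001": "#24:0", "1002": "#23:1"},
    {"1003": "#24:1", "1004": "#23:2"},
]

def flatten_link_values(maps):
    """Collect the record IDs from every map into a single flat list."""
    return [rid for m in maps for rid in m.values()]

record_ids = flatten_link_values(rows)
```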

Gremlin query to find the count of a label for all the nodes

Sample query
The following query returns the count of vertices with a given label, say "Asset", below a particular vertex (id 0):
g.V().hasId(0).repeat(out()).emit().hasLabel('Asset').count()
But I need this count for every node in the graph, computed with the same condition.
I am able to do it for one node at a time, but my requirement is to get, for all nodes, the count of nodes below them that have that label, say 'Asset'.
So I am expecting something like
{ v[0]:2
{v[1]:1}
{v[2]:1}
}
where v[1] and v[2] each have one node labelled "Asset" under them, making the overall count for v[0] equal to 2.
There's a few ways you could do it. It's maybe a little weird, but you could use group()
g.V().
group().
by().
by(repeat(out()).emit().hasLabel('Asset').count())
or you could do it with select() and then you don't build a big Map in memory:
g.V().as('v').
map(repeat(out()).emit().hasLabel('Asset').count()).as('count').
select('v','count')
if you want to maintain hierarchy you could use tree():
g.V(0).
repeat(out()).emit().
tree().
by(project('v','count').
by().
by(repeat(out()).emit().hasLabel('Asset').count())).select(values))
Basically you get a tree from vertex 0 and then apply a project() over that to build that structure per vertex in the tree. I had a different way to do it using union but I found a possible bug and had to come up with a different method (actually Gremlin Guru, Daniel Kuppitz, came up with the above approach). I think the use of project is more natural and readable so definitely the better way. Of course as Mr. Kuppitz pointed out, with project you create an unnecessary Map (which you just get rid of with select(values)). The use of union would be better in that sense.
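The per-vertex counting that group() and select() perform can be modelled in plain Python. This is a sketch over a made-up five-vertex graph, not Gremlin; the labels and edges are assumptions chosen to match the expected output in the question.

```python
# Hypothetical graph: 0 -> 1 -> 3 and 0 -> 2 -> 4, where 3 and 4 are Assets.
out_edges = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: []}
label = {0: "Site", 1: "Site", 2: "Site", 3: "Asset", 4: "Asset"}

def descendants(v):
    """Roughly repeat(out()).emit(): yield every vertex reachable from v, once per path."""
    for child in out_edges.get(v, []):
        yield child
        yield from descendants(child)

def asset_counts():
    """For each vertex, count the descendants labelled 'Asset'."""
    return {v: sum(1 for d in descendants(v) if label[d] == "Asset")
            for v in out_edges}

counts = asset_counts()
```

Like the traversal, this walks paths rather than distinct vertices, so in a graph where an Asset is reachable along two routes it would be counted twice.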

Titan - Cassandra: Process entire set of vertices of a given type without running out of memory

I'm new to Titan and looking for the best way to iterate over the entire set of vertices with a given label without running out of memory. I come from a strong SQL background so I am still working on switching my way of thinking away from SQL-type thinking. Let's say I have 1 million profile vertices. I would like to iterate over each one and perform some type of statistical analysis of the information linked to each profile. I don't really care how long the entire analysis process takes, but I need to iterate over all of the profiles. In SQL I would do SELECT * FROM MY_TABLE, using a scroll-sensitive result, fetch the next result, grab and process the info linked to that row, then fetch the next result. I also don't care if the result is real-time accurate as it is just for gathering general stats, so if a new profile is added during iteration and I miss it, that's ok.
Even if there is a way to grab all the values for a given property, that would probably work too because then I could go through that list and grab each vertex by its ID for example.
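I believe Titan does lazy loading, so you should be able to just iterate over the whole graph:
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal;
import org.apache.tinkerpop.gremlin.structure.Vertex;

// The traversal is an iterator, so vertices are fetched as you consume them.
GraphTraversal<Vertex, Vertex> it = graph.traversal().V();
while (it.hasNext()) {
    Vertex v = it.next();
    // Do what you want here
}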
Another option would be to use the range step so that you explicitly choose the range of vertices you need. For example:
List<Vertex> vertices = graph.traversal().V().range(0, 3).toList();
//Do what you want with your batch of vertices.
With regard to getting vertices of a specific type, you can query vertices based on their properties. For example, if you have a property "TYPE" that defines the type you are interested in, you can query for those vertices with:
graph.traversal().V().has("TYPE", "A"); //Gets vertices of type A
graph.traversal().V().has("TYPE", "B"); //Gets vertices of type B
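The two strategies above (consume a lazy iterator, or work in fixed-size batches) can be combined. A minimal Python sketch of the pattern follows; vertex_stream is a hypothetical stand-in for a lazily evaluated traversal and does not talk to Titan.

```python
from itertools import islice

def vertex_stream(n):
    """Stand-in for a lazy traversal: produces one vertex id at a time."""
    for i in range(n):
        yield i

def batches(iterator, size):
    """Pull at most `size` items at a time, so only one batch is held in memory."""
    it = iter(iterator)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

processed = []
for batch in batches(vertex_stream(7), 3):
    # Per-batch statistics would go here; this sketch just records batch sizes.
    processed.append(len(batch))
```

Batching over a single iterator avoids re-running the query for each window, which repeated range(offset, offset + n) calls may do depending on the backend.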

Tableau: Create a table calculation that sums distinct string values (names) when condition is met

I am getting my data from a denormalized table, where I keep names and actions (among other things). I want to create a calculated field that returns the number of distinct workgroup names, but counts a workgroup only when more than five actions are present in the DB for it.
Here's how I have done it when I wanted to check if certain action has been registered for workgroup:
WINDOW_SUM(COUNTD(IF [action] = "ADD" THEN [workgroup_name] END))
When I try to do a similar thing with a count, I get "Cannot mix aggregate and non-aggregate arguments":
WINDOW_SUM(COUNTD(IF COUNT([Number of Records]) > 5 THEN [workgroup_name] END))
I know that there's problem with the IF clause, but don't know how to fix it.
How to change the IF to be valid? Maybe there's an easier way to do it, that I am missing?
EDIT:
(after Inox's response)
I know that my problem is mixing aggregate with non-aggregate fields. I can't use filter to do it, because I want to use it later as a part of more complicated view - filtering would destroy the whole idea.
No, the problem is mixing aggregated arguments (e.g., sum, count) with non-aggregated ones (e.g., any field used directly). That is what you are doing when you combine COUNT([Number of Records]) with [workgroup_name].
If your goal is to know how many unique workgroup_name values have more than 5 records (which is what your code suggests), I think it's easier to filter first and then count.
So first you drag workgroup_name to Filter, go to tab conditions, select By field, Number of Records, Count, >, 5
This way you'll filter only the workgroup_name that has more than 5 records.
Now you can go with a simple COUNTD(workgroup_name)
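The filter-then-count logic can be written out in plain Python to make the two levels of aggregation explicit. The rows are invented sample data; in Tableau the same two steps are done by the condition filter and COUNTD.

```python
from collections import Counter

# Hypothetical denormalized rows: (workgroup_name, action).
rows = ([("alpha", "ADD")] * 6
        + [("beta", "ADD")] * 2
        + [("gamma", "DEL")] * 7)

# Step 1: aggregate per workgroup (the condition filter: COUNT > 5).
records_per_group = Counter(name for name, _ in rows)
busy_groups = {name for name, n in records_per_group.items() if n > 5}

# Step 2: count the distinct names that survive the filter (COUNTD).
busy_group_count = len(busy_groups)
```

The per-group count has to be computed before the distinct count, which is why a single expression that mixes COUNT(...) with the raw [workgroup_name] field is rejected.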
EDIT: After clarification
Okay, then you need to add a marker that is fixed in your database, so table calculations won't help you.
By definition table calculation depends on the fields that are on the worksheet (and how you decide to use those fields to partition or address), and it's only calculated AFTER being called in a sheet. That way, each time you call the function it will recalculate, and for some analysis you may want to do, the fields you need to make the table calculation correct won't be there.
Same thing applies to aggregations (counts, sums,...), the aggregation depends, well, on the level of aggregation you have.
In this case it's better that you manipulate your data prior to connecting it to Tableau. I don't see a direct way (a single calculated field that would solve your problem). What can be done is to generate a db from Tableau (with the aggregation of number of records for each workgroup_name) then export it to csv or mdb and then reconnect it to Tableau. But if you can manipulate your database outside Tableau, it's usually a better solution

groupBy toList element order

I have a RichPipe with several fields, let's say:
'sex
'weight
'age
I need to group by 'sex and then get a list of tuples ('weight and 'age). I then want to do a scanLeft operation on the list for each group and get a pipe with 'sex and 'result. I currently do this by doing
pipe.groupBy('sex) {_.toList('weight -> 'weights).toList('age -> 'ages)}
and then zipping the two lists together. I'm not sure this is the best possible way, and also I'm not sure if the order of the values in the lists is the same, so that when I zip the two lists the tuples don't get mixed up with wrong values. I found nothing about this in the documentation.
Ok, so it looks like I've answered my own question.
You can simply do
pipe.groupBy('sex) {_.toList[(Int, Int)](('weight, 'age) -> 'list)}
which results in a list of tuples. Would've saved me a lot of time if the Fields API Reference mentioned this.
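Why collecting tuples is safer than zipping two separate lists can be seen in a small Python model of the grouping. This is not Scalding code; the rows and the running-sum operation used for the scanLeft step are assumptions, since the question does not say which operation is applied.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical rows: (sex, weight, age).
rows = [("f", 60, 30), ("m", 80, 40), ("f", 55, 25)]

def group_pairs(rows):
    """Group by sex, keeping each (weight, age) pair together as one element."""
    groups = defaultdict(list)
    for sex, weight, age in rows:
        groups[sex].append((weight, age))
    return dict(groups)

def running_weight(pairs):
    """scanLeft-style running sum of weights, starting from 0."""
    return list(accumulate((w for w, _ in pairs), initial=0))

grouped = group_pairs(rows)
result = {sex: running_weight(pairs) for sex, pairs in grouped.items()}
```

Because weight and age travel as a single tuple, there is no second list whose ordering could drift from the first.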