Gremlin query search by search key across multiple vertex properties - titan

I am trying to write a search query where the input is a search key, and the requirement is to find the vertices where the given key matches the value of any of two or more property keys of the vertex. For example, assuming that I have user vertices in my graph DB with the following property keys:
1) User first name
2) User last name
3) User email
Now, given a search key 'xyz', I have to search across the user vertices where any of the above three property keys matches the value 'xyz'. This is how I have approached the problem:
g.V.has('ENTITY_TYPE', 'USER').or(_().has('USER_EMAIL', Text.REGEX, '.*xyz.*'), _().has('USER_FNAME', Text.REGEX, '.*xyz.*'), _().has('USER_LNAME', Text.REGEX, '.*xyz.*')).dedup();
I have created the required mixed indices (three separate mixed indices) for USER_EMAIL, USER_FNAME and USER_LNAME as follows:
key = m.makePropertyKey("USER_EMAIL").dataType(String.class).make();
m.buildIndex("serachbyemail",Vertex.class).addKey(key).buildMixedIndex("search");
key = m.makePropertyKey("USER_FNAME").dataType(String.class).make();
m.buildIndex("searchbyfname",Vertex.class).addKey(key).buildMixedIndex("search");
key = m.makePropertyKey("USER_LNAME").dataType(String.class).make();
m.buildIndex("typemixed",Vertex.class).addKey(key).buildMixedIndex("search");
This works fine, but I want to know whether this is the best approach to this kind of problem, or if there is a better way to do it. I am using the Gremlin Java API to write the above query, and dedup() to remove the duplicate vertices.

The 3 indices won't help to answer your query efficiently. It's better to create a single index that covers all 3 fields (that doesn't mean your query has to have a condition on all fields) and issue a direct index query:
Sample graph:
g = TitanFactory.open("conf/titan-cassandra-es.properties")
m = g.getManagementSystem()
user = m.makeVertexLabel("USER").make()
email = m.makePropertyKey("USER_EMAIL").dataType(String.class).make()
fname = m.makePropertyKey("USER_FNAME").dataType(String.class).make()
lname = m.makePropertyKey("USER_LNAME").dataType(String.class).make()
m.buildIndex("users", Vertex.class).addKey(email).addKey(fname).addKey(lname).indexOnly(user).buildMixedIndex("search")
m.commit()
ElementHelper.setProperties(g.addVertexWithLabel("USER"), "USER_EMAIL", "foo#bar.com", "USER_FNAME", "foo", "USER_LNAME", "bar")
ElementHelper.setProperties(g.addVertexWithLabel("USER"), "USER_EMAIL", "foo#xyz.com", "USER_FNAME", "foo", "USER_LNAME", "bar")
ElementHelper.setProperties(g.addVertexWithLabel("USER"), "USER_EMAIL", "abc#bar.com", "USER_FNAME", "foo", "USER_LNAME", "xyz")
ElementHelper.setProperties(g.addVertexWithLabel("USER"), "USER_EMAIL", "foo#baz.com", "USER_FNAME", "xyz", "USER_LNAME", "bar")
ElementHelper.setProperties(g.addVertexWithLabel("USER"), "USER_EMAIL", "xyz#bar.com", "USER_FNAME", "xyz", "USER_LNAME", "xyz")
g.commit()
Direct index query:
gremlin> g.indexQuery("users", 'v."USER_EMAIL":/.*xyz.*/ v."USER_FNAME":/.*xyz.*/ v."USER_LNAME":/.*xyz.*/').vertices()*.getElement()._().map()
==>{USER_FNAME=xyz, USER_LNAME=xyz, USER_EMAIL=xyz@bar.com}
==>{USER_FNAME=xyz, USER_LNAME=bar, USER_EMAIL=foo@baz.com}
==>{USER_FNAME=foo, USER_LNAME=xyz, USER_EMAIL=abc@bar.com}
==>{USER_FNAME=foo, USER_LNAME=bar, USER_EMAIL=foo@xyz.com}
As you can see, I also replaced ENTITY_TYPE with a vertex label. The label can help to keep your index as small as possible: if, for example, another type of vertex (e.g. PROFILE) also used the property USER_EMAIL, it wouldn't make it into the index (since the index was created using .indexOnly(user)).

How to parameterize a column for aggregation in Power BI desktop?

I have users who would like to be able to modify what columns a table aggregates by. My issue is that I seem unable to do this in Power BI. I basically want to be able to do the following in SQL:
SELECT
    <OrgLevel1>,
    <OrgLevel2>,
    SUM([Revenue])
FROM [Data]
GROUP BY
    <OrgLevel1>,
    <OrgLevel2>
;
where the user can change <OrgLevel1> and/or <OrgLevel2> to be any of { "(All)", [Department], [Product] }.
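For a concrete instance (a hypothetical rendering, with the "(All)" choice emitted as an empty-string constant that is simply left out of the GROUP BY, and the alias [OrgLevel2] chosen for illustration): if the user picks [Department] for <OrgLevel1> and "(All)" for <OrgLevel2>, the goal query would collapse to:
SELECT
    [Department],
    '' AS [OrgLevel2],
    SUM([Revenue])
FROM [Data]
GROUP BY
    [Department]
;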
The issue may be related to this post: https://community.powerbi.com/t5/Desktop/Calculated-Column-Table-Change-Dynamically-According-to-Slicer/m-p/655991#M314800
Here's a link to a workbook that illustrates this issue: TestParameterizeGroupby.pbix (hosted by Google Drive). I've also included field definitions below with screenshots. Thanks for any help.
Problem
[Org Level 1] and [Org Level 2] fields are not recalculating from the users' selection. Only the default values are shown.
Expected result in table
"Org Level 1", "Org Level 2", "Revenue"
"(All)", "(All)", 28
Note
The purpose is to have parameterizable organization level fields so that the report user can aggregate by all, department, product, or both in either order.
Table and column definitions
'Data' = DATATABLE(
    "Department", STRING,
    "Product", STRING,
    "Revenue", DOUBLE,
    {
        {"DeptA", "ProdX", 5.0},
        {"DeptA", "ProdY", 6.0},
        {"DeptB", "ProdX", 10.0},
        {"DeptB", "ProdY", 7.0}
    }
)
'Data'[Org Level 1] = SWITCH(
    'Org Level 1 Parameter'[Org Level 1 Parameter Value],
    0, "(All)",
    1, [Department],
    2, [Product]
)
// Problem: [Org Level 1] and [Org Level 2] fields are not recalculating from the users' selection. Only the default values are shown.
'Org Level 1' = DATATABLE(
    "Org Level 1", STRING,
    "Org Level 1 Parameter", INTEGER,
    {
        {"(0) (All)", 0},
        {"(1) Department", 1},
        {"(2) Product", 2}
    }
)
'Org Level 1 Parameter'[Org Level 1 Parameter] = GENERATESERIES(0, 2, 1)
'Org Level 1 Parameter'[Org Level 1 Parameter Value] = SELECTEDVALUE('Org Level 1 Parameter'[Org Level 1 Parameter], 1)
Table 'Org Level 1' has a 1-1 relationship with 'Org Level 1 Parameter' on column [Org Level 1 Parameter].
The user selects the value for 'Data'[Org Level 1] by selecting the value for 'Org Level 1'[Org Level 1].
Tables and columns for [Org Level 2] are defined in the same way as [Org Level 1].
Screenshots of the report, data, and model views were included in the original post.
Cross-reference to post in Power BI forum:
Power BI Forum: How to parameterize a column for aggregation
One solution to this is to add two parameters with a list of values and use their values in Power Query M code to modify the database query. Let's assume that you have a table Data with columns Department, Product and Revenue. For simplicity I will add one more column, named DummyColumn, with all rows having the same value (e.g. null); I will explain why later in this post. So the table looks like this:
Then in your report, specify a query when adding this table to your model (let's assume we will import it, but in general you can do this in DirectQuery too):
Now, if you look at the M code, you will see the above query there:
Source = Sql.Database(".", "StackOverflow", [Query=" select ....
Now define a couple of parameters that the end-user can use to select how the data should be aggregated. Let's name them Level 1 and Level 2:
The value of a parameter can be used in M by parameter name, and & is used to concatenate strings. So if there is a parameter Name with value Samuel, the expression "Hello, " & Name & "!" will be evaluated as Hello, Samuel!. The idea is to check the value of our parameters and modify the database query accordingly.
In the select part, we will substitute the name of the selected field, or '' (empty string) in case of <All> (I surrounded the parameter values with angle brackets to make it easier to distinguish parameter values from database field names). So the expression should look like:
"select " & (if #"Level 1" = "<Department>" then "Department" else ...) (and so on)
Because there is a space in our parameter's name, we need to surround it with #" and ": a parameter named Level1 could be referenced simply as Level1 in the code, but Level 1 becomes #"Level 1".
The group by part is a bit trickier. We have to add a comma between field names, optionally add a field name, or even omit the group by entirely (in case both parameters are set to <All>). To simplify this, I added one dummy column, with all rows having the same value (e.g. null), and always group by this column. This way building the group by clause is much simpler - in case the parameter value is not <All>, we just append , fieldname. So the code could look like this:
"group by DummyColumn" & (if #"Level 1" = "<Department>" then ", Department" else ...) (and so on)
So the final M code is this:
let
Source = Sql.Database(".", "StackOverflow", [Query="select#(lf) " & (if #"Level 1" = "<Department>" then "Department" else if #"Level 1" = "<Product>" then "Product" else "''") & " as [Org Level 1]#(lf) , " & (if #"Level 2" = "<Department>" then "Department" else if #"Level 2" = "<Product>" then "Product" else "''") & " as [Org Level 2]#(lf) , SUM(Revenue) as Revenue#(lf)from Data#(lf)group by DummyColumn" & (if #"Level 1" = "<Department>" then ", Department" else if #"Level 1" = "<Product>" then ", Product" else "") & (if #"Level 2" = "<Department>" then ", Department" else if #"Level 2" = "<Product>" then ", Product" else "")])
in
Source
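To make the template concrete: with Level 1 set to <Department> and Level 2 set to <All>, the expression above should render a native query along these lines (reconstructed from the M template; #(lf) is M's line-feed escape, so the actual whitespace differs slightly):
select Department as [Org Level 1]
     , '' as [Org Level 2]
     , SUM(Revenue) as Revenue
from Data
group by DummyColumn, Department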
Now the end-user can change parameter values, by clicking Edit Queries -> Edit Parameters:
And select how to group the data:
By default, Power BI Desktop will warn you the first time a particular query is executed:
If you want to turn this off, go to File -> Options and settings -> Options -> (GLOBAL) Security and make sure Require user approval for new native database queries is not selected:
When the end-user changes parameter values, the data will change too, e.g.:
Or:
And so on...
This trick works well in Power BI Desktop when every user has their own copy of the .pbix file. However, if you publish it, changing parameter values is not very convenient (you must go to the dataset's settings) and, more importantly, changing parameter values affects all users who are looking at the report. You can also use this technique to modify Table.Group statements generated by the Power Query Editor, in case you want to aggregate the data in Power BI itself, but changing the database query is easier and more flexible.
If you want to enable this scenario for concurrent multi-user use of published reports, you can use slicers and What-if parameters. Unfortunately, What-if parameters can only be numeric (you can't define a list of values there), so you have to use measures to "decode" the int value of the parameter and write some DAX code to perform the different aggregations accordingly. It is more work, but if it is needed, it can be done too.

Extracting all keys from a JSON object unless a certain key has a value

My Postgres jsonb-fu isn't that great, but I'd appreciate some help with a query I am trying to put together.
I have this rudimentary query to extract the names of all keys in the _doc's 'answers' key. The jsonb data looks something like this:
_doc = {
    "answers": {
        "baz": true,
        "qux": true,
        "other": "How do i find this"
    }
}
and a query might look like this:
SELECT ss.foo, count(DISTINCT (ss.bar)) FROM (
    SELECT (_doc::jsonb -> 'bar')::text as bar,
           jsonb_object_keys(_doc::jsonb -> 'answers' -> 'foo') as foo
    FROM public."table_name"
) ss
WHERE ss.foo IS NOT NULL
GROUP BY ss.foo;
So really the output here would be the number of times each key of answers appears.
("baz" = 1, "qux" = 1, "other" = 1)
Here is my problem, I want to get the number of times each key appears, apart from in the case of other. In that case I want to get the number of times its contents appears. So I want the result to be
("baz" = 1, "qux" = 1, "How do i find this" = 1)
If possible I would love some help structuring this query.
Thank you
demo:db<>fiddle for several json records
demo:db<>fiddle for one json record which has the same key twice (strictly not recommended!)
Use the json_each_text() function to get the key/value pairs. After that, take the key, or the value in the case of 'other', selecting through a CASE expression:
SELECT
CASE WHEN elems.key = 'other' THEN elems.value
ELSE elems.key
END AS key,
COUNT(*)
FROM data,
json_each_text(jsondata -> 'answers') AS elems
GROUP BY 1
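Adapted to the question's schema (assuming the table is public."table_name" and _doc is a jsonb column, so the jsonb_each_text() variant of the function applies; otherwise keep the ::jsonb cast), the query might look like this:
SELECT
    CASE WHEN elems.key = 'other' THEN elems.value
         ELSE elems.key
    END AS key,
    COUNT(*)
FROM public."table_name",
     jsonb_each_text(_doc -> 'answers') AS elems
GROUP BY 1;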

OrientDB - Creating an edge using rid's from index queries

I'm trying to create edges between existing vertices queried by their indexed IDs, similar to the first answer here, but using this index lookup query instead of the label query:
CREATE EDGE cite
FROM
(SELECT FROM index:<className>.<indexName> WHERE key = "<keyString>")
TO
(SELECT FROM index:<className>.<indexName> WHERE key = "<keyString>")
This gives me the following error: com.orientechnologies.orient.core.exception.OCommandExecutionException: Source vertex '#-1:-1' not exists
Possibly relevant:
When I just query SELECT FROM index:<className>.<indexName> WHERE key = "<keyString>" by itself it returns an array object structured like:
[ { '#type': 'd',
    key: '<keyString>',
    rid: { cluster: <actual cluster>, position: <actual position> },
    '#rid': { cluster: -1, position: -1 } } ]
I'm guessing that the error has something to do with the CREATE EDGE query using the #rid instead of the rid but I'm not sure.
The query successfully creates the edges if I simply use the #<actual cluster>:<actual position> instead of the SELECT subquery.
Any ideas what I might be doing wrong?
Edit: In the interest of replicability, I have the same problem in the GratefulDeadConcerts database when I (1) add a property name to the V class schema, (2) create a unique index nameIndex on V using the name property, and then (3) use the following query:
create edge followed_by from (select from index:nameIndex where key = 'HEY BO DIDDLEY') to (select from index:nameIndex where key = 'IM A MAN')
Why don't you query the class directly?
CREATE EDGE cite
FROM
(select from Class where field = '<keyString>')
TO
(select from Class where field = '<keyString>')
A select from an index returns a temporary document as the result set, with key and rid fields.
You can try the following, but I don't know if it will work:
SELECT expand(rid) FROM index:<className>.<indexName> WHERE key = "<keyString>"
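Combining that with the original statement, the full edge creation would then read (untested, as noted above; same placeholder convention as before):
CREATE EDGE cite
FROM
(SELECT expand(rid) FROM index:<className>.<indexName> WHERE key = "<keyString>")
TO
(SELECT expand(rid) FROM index:<className>.<indexName> WHERE key = "<keyString>")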

Select most reviewed courses starting from courses having at least 2 reviews

I'm using Flask-SQLAlchemy with PostgreSQL. I have the following two models:
class Course(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    course_name = db.Column(db.String(120))
    course_description = db.Column(db.Text)
    course_reviews = db.relationship('Review', backref='course', lazy='dynamic')

class Review(db.Model):
    __table_args__ = (db.UniqueConstraint('course_id', 'user_id'), {})
    id = db.Column(db.Integer, primary_key=True)
    review_date = db.Column(db.DateTime)  # default=db.func.now()
    review_comment = db.Column(db.Text)
    rating = db.Column(db.SmallInteger)
    course_id = db.Column(db.Integer, db.ForeignKey('course.id'))
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'))
I want to select the courses that are most reviewed, starting with at least two reviews. The following SQLAlchemy query worked fine with SQLite:
most_rated_courses = db.session.query(models.Review, func.count(models.Review.course_id)).\
    group_by(models.Review.course_id).\
    having(func.count(models.Review.course_id) > 1).\
    order_by(func.count(models.Review.course_id).desc()).all()
But when I switched to PostgreSQL in production it gives me the following error:
ProgrammingError: (ProgrammingError) column "review.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT review.id AS review_id, review.review_date AS review_...
^
'SELECT review.id AS review_id, review.review_date AS review_review_date, review.review_comment AS review_review_comment, review.rating AS review_rating, review.course_id AS review_course_id, review.user_id AS review_user_id, count(review.course_id) AS count_1 \nFROM review GROUP BY review.course_id \nHAVING count(review.course_id) > %(count_2)s ORDER BY count(review.course_id) DESC' {'count_2': 1}
I tried to fix the query by adding models.Review to the GROUP BY clause, but it did not work:
most_rated_courses = db.session.query(models.Review, func.count(models.Review.course_id)).\
    group_by(models.Review.course_id).\
    having(func.count(models.Review.course_id) > 1).\
    order_by(func.count(models.Review.course_id).desc()).all()
Can anyone please help me with this issue? Thanks a lot.
SQLite and MySQL both have the behavior that they allow a query that has aggregates (like count()) without applying GROUP BY to all other columns, which in terms of standard SQL is invalid: if more than one row is present in an aggregated group, the database has to pick one of them to return, essentially at random.
So your query for Review basically returns to you the first "Review" row for each distinct course id - for course id 3, say, if you had seven "Review" rows, it's just choosing an essentially random "Review" row within the group of "course_id=3". I gather the answer you really want, "Course", is available here because you can take that semi-randomly selected Review object and just call ".course" on it, giving you the correct Course, but this is a backwards way to go.
But once you move to a proper database like PostgreSQL, you need to use correct SQL. The data you need from the "review" table is just the course_id and the count, nothing else, so query just for that (first, assume we don't actually need to display the counts; that comes in a minute):
most_rated_course_ids = session.query(
    Review.course_id,
).\
    group_by(Review.course_id).\
    having(func.count(Review.course_id) > 1).\
    order_by(func.count(Review.course_id).desc()).\
    all()
but that's not your Course object - you want to take that list of ids and apply it to the course table. We first need to keep our list of course ids as a SQL construct, instead of loading the data - that is, turn it into a derived table by converting the query into a subquery (change the word .all() to .subquery()):
most_rated_course_id_subquery = session.query(
    Review.course_id,
).\
    group_by(Review.course_id).\
    having(func.count(Review.course_id) > 1).\
    order_by(func.count(Review.course_id).desc()).\
    subquery()
one simple way to link that to Course is to use an IN:
courses = session.query(Course).filter(
    Course.id.in_(most_rated_course_id_subquery)).all()
but that's essentially going to throw away the "ORDER BY" you're looking for and also doesn't give us any nice way of actually reporting on those counts along with the course results. We need to have that count along with our Course so that we can report it and also order by it. For this we use a JOIN from the "course" table to our derived table. SQLAlchemy is smart enough to know to join on the "course_id" foreign key if we just call join():
courses = session.query(Course).join(most_rated_course_id_subquery).all()
then to get at the count, we need to add that to the columns returned by our subquery along with a label so we can refer to it:
most_rated_course_id_subquery = session.query(
    Review.course_id,
    func.count(Review.course_id).label("count")
).\
    group_by(Review.course_id).\
    having(func.count(Review.course_id) > 1).\
    subquery()

courses = session.query(
    Course, most_rated_course_id_subquery.c.count
).join(
    most_rated_course_id_subquery
).order_by(
    most_rated_course_id_subquery.c.count.desc()
).all()
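For reference, that last query should render to SQL roughly like the following (a sketch: the course column list is abbreviated, and SQLAlchemy generates its own subquery alias such as anon_1):
SELECT course.*, anon_1.count
FROM course
JOIN (
    SELECT review.course_id AS course_id,
           count(review.course_id) AS count
    FROM review
    GROUP BY review.course_id
    HAVING count(review.course_id) > 1
) AS anon_1 ON course.id = anon_1.course_id
ORDER BY anon_1.count DESC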
A great article I like to point out to people about GROUP BY and this kind of query is SQL GROUP BY techniques, which points out the common need for the "select from A join to (subquery of B with aggregate/GROUP BY)" pattern.

JPQL: sort queryresult by "best matches" possible?

I have the following question/problem:
I'm using JPQL (JPA 2.0 and EclipseLink) and I want to create a query that gives me the results sorted the following way:
First the best matches, then the inferior matches should appear.
My objects are based on a simple class called 'Person' with the attributes:
{String Id,
String forename,
String name}
For example if I'm searching for "Picol" the result should look like:
[{129, Picol, Newman}, {23, Johnny, Picol},{454, Picolori, Newta}, {4774, Picolatus, Larimus}...]
PS: I already thought about using two queries, the first searching with "equals" and the second with "like", although I'm not quite sure how to connect both query results...?
Hoping for your help, and thanks in advance,
Florian
If, as your question seems to imply, you only have two groups (first group: forename or name equals the searched string; second group: forename or name merely contains the searched string), and if all the persons in a given group have the same "match score", then using two queries is indeed a good solution.
First query :
select p from Person p where p.foreName = :param or p.name = :param
Second query :
select p from Person p where (p.foreName like :paramSurroundedWithPercent
or p.name like :paramSurroundedWithPercent)
and p.foreName != :param
and p.name != :param
Execute both queries (each returning a List<Person>), and append all the elements of the second list to the first one (using the addAll() method).