In my OrientDB database I have a document class A which has 4 fields and a relationship:
id
parentId
source
terminal
Rel
I need to select the source of the root element of all the elements that have a terminal set. An example would be:
A1: (Rel: NULL, id: 1, parentId: NULL, source: test)
A2: (Rel: A1, id: 2, parentId: 1)
A3: (Rel: A2, id: 3, parentId: 2)
A4: (Rel: A3, id: 4, parentId: 3, terminal: no1)
A5: (Rel: A4, id: 5, parentId: 4)
What I need to get is:
terminal: no1, source: test
What I can do right now is get all the sources, but I do not know which terminals they belong to:
SELECT source FROM (TRAVERSE A.Rel FROM (SELECT FROM A WHERE terminal IS NOT NULL) WHILE $depth <= 99) WHERE parentId IS NULL
I tried playing with LET but was not able to make it work the way I wanted to.
EDIT
SELECT FROM A
orientdb {GratefulDeadConcerts}> select from A
----+-----+----+--------+------+--------+-----
# |#RID |id |terminal|source|parentId|Rel
----+-----+----+--------+------+--------+-----
0 |#15:0|1 |null |test |null |null
1 |#15:1|2 |null |null |1 |#15:0
2 |#15:2|3 |null |null |2 |#15:1
3 |#15:3|4 |no1 |null |3 |#15:2
4 |#15:4|5 |null |null |4 |#15:3
----+-----+----+--------+------+--------+-----
First try with LET
orientdb {GratefulDeadConcerts}> SELECT source, $terminal FROM ( TRAVERSE A.Rel FROM ( SELECT FROM A WHERE terminal IS NOT NULL LET $parent.$parent.$terminal = terminal ) ) WHERE parentId IS NULL
----+-----+------
# |#RID |source
----+-----+------
0 |#-2:1|test
----+-----+------
Try this (sorry, without the database it's hard to test):
SELECT source, $terminal FROM (
TRAVERSE A.Rel FROM (
SELECT FROM A WHERE terminal IS NOT NULL LET $parent.$parent.$t = terminal
) WHILE parentId IS NOT NULL
) WHERE parentId IS NULL
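A variation that might keep each terminal paired with its root's source is to run the traverse per record inside a LET block. This is an untested sketch: it assumes $parent.$current is resolved inside the LET subquery, $rootSource is just an illustrative variable name, and the LET result is a collection, so the source may still need to be unwrapped (e.g. with an index):
SELECT terminal, $rootSource FROM A
LET $rootSource = (
    SELECT source FROM (
        TRAVERSE A.Rel FROM $parent.$current WHILE $depth <= 99
    ) WHERE parentId IS NULL
)
WHERE terminal IS NOT NULL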
Related
I have a dataframe with an enum field named A (values are 0 or 1) and another field B. I would like to implement the scenario below:
if `B` is null:
    count(when `A` is 0) and set a column name `xx`
    count(when `A` is 1) and set a column name `yy`
if `B` is not null:
    count(when `A` is 0) and set a column name `zz`
    count(when `A` is 1) and set a column name `mm`
How can I do this with Spark Scala?
It's possible to conditionally populate columns in this way; however, the final output DataFrame requires an expected schema.
Assuming all of the scenarios you detailed are possible in one DataFrame, I would suggest creating each of the four columns: "xx", "yy", "zz" and "mm" and conditionally populating them.
In the below example I've populated the values with either "found" or "", primarily to make it easy to see where the values are populated. Using true and false here, or another enum, would likely make more sense in the real world.
Starting with a DataFrame (since you didn't specify the type of "B", I have gone for an Option[String], i.e. nullable, for this example):
import spark.implicits._  // assumes a SparkSession is in scope as `spark`

val df = List(
(0, None),
(1, None),
(0, Some("hello")),
(1, Some("world"))
).toDF("A", "B")
df.show(false)
gives:
+---+-----+
|A |B |
+---+-----+
|0 |null |
|1 |null |
|0 |hello|
|1 |world|
+---+-----+
and to create the columns:
import org.apache.spark.sql.functions.{col, when}

df
.withColumn("xx", when(col("B").isNull && col("A") === 0, "found").otherwise(""))
.withColumn("yy", when(col("B").isNull && col("A") === 1, "found").otherwise(""))
.withColumn("zz", when(col("B").isNotNull && col("A") === 0, "found").otherwise(""))
.withColumn("mm", when(col("B").isNotNull && col("A") === 1, "found").otherwise(""))
.show(false)
gives:
+---+-----+-----+-----+-----+-----+
|A |B |xx |yy |zz |mm |
+---+-----+-----+-----+-----+-----+
|0 |null |found| | | |
|1 |null | |found| | |
|0 |hello| | |found| |
|1 |world| | | |found|
+---+-----+-----+-----+-----+-----+
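Since the question asks for counts rather than per-row flags, a minimal sketch (reusing the df above, and assuming a single-row summary is acceptable) could aggregate the four counts directly:
import org.apache.spark.sql.functions.{col, count, when}

// `when` without `otherwise` yields null for non-matching rows,
// which count() ignores, so each aggregate counts only the matching rows
val counts = df.agg(
  count(when(col("B").isNull    && col("A") === 0, 1)).as("xx"),
  count(when(col("B").isNull    && col("A") === 1, 1)).as("yy"),
  count(when(col("B").isNotNull && col("A") === 0, 1)).as("zz"),
  count(when(col("B").isNotNull && col("A") === 1, 1)).as("mm")
)
counts.show(false)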
I have the following working solution, tested in a Databricks notebook.
var maxcol = udf((col1: Long, col2: Long, col3: Long) => {
var res = ""
if (col1 > col2 && col1 > col3) res = "col1"
else if (col2 > col1 && col2 > col3) res = "col2"
else res = "col3"
res
})
val someDF = Seq(
(8, 10, 12, "bat"),
(64, 61, 59, "mouse"),
(-27, -30, -15, "horse")
).toDF("number1", "number2", "number3", "word")
.withColumn("maxColVal", greatest("number1", "number2", "number3"))
.withColumn("maxColVal_Name", maxcol(col("number1"), col("number2"), col("number3")))
display(someDF)
Is there any way to make this generic? I have a use case where a variable number of columns is passed to this UDF, and I still want the name of the column holding the max value as output, unlike above where I have hard-coded the column names 'col1', 'col2' and 'col3' in the UDF.
Use the following:
import org.apache.spark.sql.functions.{col, greatest, lit, map, udf}

val df = List((1,2,3,5,"a"),(4,2,3,1,"a"),(1,20,3,1,"a"),(1,22,22,2,"a")).toDF("mycol1","mycol2","mycol3","mycol4","mycol5")
//list all your columns among which you want to find the max value
val colGroup = List(df("mycol1"),df("mycol2"),df("mycol3"),df("mycol4"))
//list column value -> column name of the columns among which you want to find max value column NAME
val colGroupMap = List(df("mycol1"),lit("mycol1"),
df("mycol2"),lit("mycol2"),
df("mycol3"),lit("mycol3"),
df("mycol4"),lit("mycol4"))
var maxcol = udf((colVal: Map[Int,String]) => {
colVal.max._2 //you can easily find the column name of the max column value
})
df.withColumn("maxColValue",greatest(colGroup:_*)).withColumn("maxColVal_Name",maxcol(map(colGroupMap:_*))).show(false)
+------+------+------+------+------+-----------+--------------+
|mycol1|mycol2|mycol3|mycol4|mycol5|maxColValue|maxColVal_Name|
+------+------+------+------+------+-----------+--------------+
|1 |2 |3 |5 |a |5 |mycol4 |
|4 |2 |3 |1 |a |4 |mycol1 |
|1 |20 |3 |1 |a |20 |mycol2 |
|1 |22 |22 |2 |a |22 |mycol3 |
+------+------+------+------+------+-----------+--------------+
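To make this fully generic, the arguments to greatest and map can be built from a plain list of column names instead of being written out by hand. A hedged sketch, assuming the candidate columns all share the same numeric type (Int here, matching the UDF's Map[Int, String] signature):
import org.apache.spark.sql.functions.{col, greatest, lit, map, udf}

// the candidate columns can now vary at runtime
val candidateCols = Seq("mycol1", "mycol2", "mycol3", "mycol4")

// columns for greatest(), and alternating value/name pairs for map()
val colGroup    = candidateCols.map(col)
val colGroupMap = candidateCols.flatMap(c => Seq(col(c), lit(c)))

// same idea as above: pick the map entry with the largest key (the value) and return its name
val maxcol = udf((colVal: Map[Int, String]) => colVal.max._2)

df.withColumn("maxColValue", greatest(colGroup: _*))
  .withColumn("maxColVal_Name", maxcol(map(colGroupMap: _*)))
  .show(false)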
I have a Spark dataframe like this:
id1 id2 attrname attr_value attr_valuelist
1 2 test Yes Yes, No
2 1 test1 No Yes, No
3 2 test2 value1 val1, Value1,value2
4 1 test3 3 0, 1, 2
5 3 test4 0 0, 1, 2
11 2 test Yes Yes, No
22 1 test1 No1 Yes, No
33 2 test2 value0 val1, Value1,value2
44 1 test3 11 0, 1, 2
55 3 test4 0 0, 1, 2
val df = sqlContext.sql("select id1, id2, attrname, attr_value, attr_valuelist from dftable")
I want to check whether attr_value exists in attr_valuelist, and keep only the rows where it does not:
id1 id2 attrname attr_value attr_valuelist
4 1 test3 3 0, 1, 2
22 1 test1 No1 Yes, No
33 2 test2 value0 val1, Value1,value2
44 1 test3 11 0, 1, 2
You can simply use contains on your dataframe:
import org.apache.spark.sql.functions._
df.filter(!(col("attr_valuelist").contains(col("attr_value")))).show(false)
You should get the following output:
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|3 |2 |test2 |value1 |val1, Value1,value2|
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+
If you want to ignore case, you can simply use the lower function:
df.filter(!(lower(col("attr_valuelist")).contains(lower(col("attr_value"))))).show(false)
You should get:
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+
You can define a user-defined function (UDF) in Spark to test whether the value of one column is contained in the value of another column, like this:
import org.apache.spark.sql.functions.udf

def contains = udf((attr: String, attrList: String) => attrList.contains(attr))
def notContains = udf((attr: String, attrList: String) => !attrList.contains(attr))
You can tweak the contains function as you want, and then select from your dataframe like this:
df.where(contains(df("attr_value"), df("attr_valuelist")))
df.where(notContains(df("attr_value"), df("attr_valuelist")))
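Note that both approaches above treat attr_valuelist as a single string, so a value can also match as a substring of a longer token. If the list is comma separated and an exact (trimmed, case-insensitive) token match is wanted, a hedged UDF sketch could look like this (notInList is just an illustrative name):
import org.apache.spark.sql.functions.{col, udf}

// true when attr_value does not match any comma-separated token exactly
val notInList = udf { (value: String, list: String) =>
  !list.split(",").map(_.trim.toLowerCase).contains(value.trim.toLowerCase)
}

df.filter(notInList(col("attr_value"), col("attr_valuelist"))).show(false)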
I want to select all vertices that are connected to another vertex. I am currently using the traverse function in OrientDB. Consider the following example:
> create class professor extends V
> create class course extends V
> insert into professor set name='Smith'
Inserted record 'professor#14:0{name:Smith} v1'
> insert into course set name='Calculus'
Inserted record 'course#15:0{name:Calculus} v1'
> create class teaches extends E
> create edge teaches from #14:0 to #15:0
Created edge '[teaches#16:0{out:#14:0,in:#15:0} v3]'
Now when I try to traverse to find the course(s) that professor Smith teaches I use the following command:
> traverse out_teaches from #14:0
----+-----+---------+-----+-----------+-----+-----
# |#RID |#CLASS |name |out_teaches|out |in
----+-----+---------+-----+-----------+-----+-----
0 |#14:0|professor|Smith|[size=1] |null |null
1 |#16:0|teaches |null |null |#14:0|#15:0
----+-----+---------+-----+-----------+-----+-----
Why does this return the edge and not the vertex (course) that I am looking for? What is the appropriate command to return the vertex instead? I want the record for 'Calculus' to be returned.
I expanded your graph a bit to try your query.
If you only want the vertices connected to a starting vertex via the 'teaches' edge, you should use SELECT with EXPAND(out()/in()/both()), since TRAVERSE is more useful when you wish to explore the graph at different depths (in my case "Smith" has the #rid #11:0):
select expand(out('teaches')) from (select from Professor where name='Smith')
----+-----+------+------------+----------+----------
# |#RID |#CLASS|name |in_teaches|in_follows
----+-----+------+------------+----------+----------
0 |#12:0|course|Calculus |[size=1] |[size=1]
1 |#12:1|course|Astrophysics|[size=1] |[size=1]
2 |#12:2|course|Law |[size=2] |[size=1]
----+-----+------+------------+----------+----------
or with select expand(out('teaches')) from #11:0 you will obtain the same result:
----+-----+------+------------+----------+----------
# |#RID |#CLASS|name |in_teaches|in_follows
----+-----+------+------------+----------+----------
0 |#12:0|course|Calculus |[size=1] |[size=1]
1 |#12:1|course|Astrophysics|[size=1] |[size=1]
2 |#12:2|course|Law |[size=2] |[size=1]
----+-----+------+------------+----------+----------
or you can obtain all the vertices connected to the professor "Smith":
select expand(out()) from professor where name="Smith"
----+-----+----------+------------+----------+----------+------------+----------
# |#RID |#CLASS |name |in_teaches|in_follows|in_studiesAt|in_worksAt
----+-----+----------+------------+----------+----------+------------+----------
0 |#12:0|course |Calculus |[size=1] |[size=1] |null |null
1 |#12:1|course |Astrophysics|[size=1] |[size=1] |null |null
2 |#12:2|course |Law |[size=2] |[size=1] |null |null
3 |#16:0|university|Cambridge |null |null |[size=1] |[size=1]
----+-----+----------+------------+----------+----------+------------+----------
Your query traverse out_teaches from #11:0 seems to list the starting vertex and all of the connected edges with their IN and OUT vertices:
----+-----+---------+-----+-----------+-----------+-----+-----
# |#RID |#CLASS |name |out_teaches|out_worksAt|out |in
----+-----+---------+-----+-----------+-----------+-----+-----
0 |#11:0|professor|Smith|[size=3] |[size=1] |null |null
1 |#13:0|teaches |null |null |null |#11:0|#12:0
2 |#13:1|teaches |null |null |null |#11:0|#12:1
3 |#13:2|teaches |null |null |null |#11:0|#12:2
----+-----+---------+-----+-----------+-----------+-----+-----
I also tried traverse out_teaches from professor and the result is similar to the previous query:
----+-----+---------+-----+-----------+-----------+-----+-----
# |#RID |#CLASS |name |out_teaches|out_worksAt|out |in
----+-----+---------+-----+-----------+-----------+-----+-----
0 |#11:0|professor|Smith|[size=3] |[size=1] |null |null
1 |#13:0|teaches |null |null |null |#11:0|#12:0
2 |#13:1|teaches |null |null |null |#11:0|#12:1
3 |#13:2|teaches |null |null |null |#11:0|#12:2
4 |#11:1|professor|Green|[size=1] |[size=1] |null |null
5 |#13:3|teaches |null |null |null |#11:1|#12:2
----+-----+---------+-----+-----------+-----------+-----+-----
The correct syntax for selecting the courses (at least in OrientDB 2.1) would be based on out('teaches'). For example:
> select expand(out('teaches')) from (select from Professor where name='Smith')
----+-----+------+--------+----------
# |#RID |#CLASS|name |in_teaches
----+-----+------+--------+----------
0 |#12:0|Course|Calculus|[size=1]
----+-----+------+--------+----------
That is, there's just one vertex, as expected.
Please note that 'traverse' is used for a different purpose. It involves an iterative procedure for traversing graphs.
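For instance, TRAVERSE pays off when you want to walk more than one hop; an untested sketch against the classes above might be:
TRAVERSE out() FROM (SELECT FROM professor WHERE name = 'Smith') WHILE $depth <= 2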
out_teaches
"out_teaches" is a reference to an edge. Using OrientDB 2.1.7, the response I obtained for your "out_teaches" query is as follows:
> select expand(out_teaches) from (select from Professor where name='Smith')
----+-----+-------+-----+-----
# |#RID |#CLASS |out |in
----+-----+-------+-----+-----
0 |#13:0|teaches|#11:0|#12:0
----+-----+-------+-----+-----
Again, this is what one would expect - an edge.
Your query works fine for me.
In my case the RIDs are #11:0 for the professor, #12:0 for the course and #13:0 for teaches.
Just rerun your query once again, or try the below:
traverse both('teaches') from #12:0
I am struggling with maybe the simplest problem ever; my SQL knowledge pretty much keeps me from achieving it. I am trying to build an SQL query that should show JobTitle, Note and NoteType. Here is the thing: the first job doesn't have any notes, but we should still see it in the results, and system notes should never, ever be displayed. The expected result should look like this:
Result:
--------------------------------------------
|ID |Title |Note |NoteType |
--------------------------------------------
|1 |FirstJob |NULL |NULL |
|2 |SecondJob |CustomNot1|1 |
|2 |SecondJob |CustomNot2|1 |
|3 |ThirdJob |NULL |NULL |
--------------------------------------------
My query (doesn't work; it doesn't display the third job):
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN NOTE N ON N.JobId = J.ID
WHERE N.NoteType IS NULL OR N.NoteType = 1
My Tables:
My JOB Table
----------------------
|ID |Title |
----------------------
|1 |FirstJob |
|2 |SecondJob |
|3 |ThirdJob |
----------------------
My NOTE Table
--------------------------------------------
|ID |JobId |Note |NoteType |
--------------------------------------------
|1 |2 |CustomNot1|1 |
|2 |2 |CustomNot2|1 |
|3 |2 |SystemNot1|2 |
|4 |2 |SystemNot3|2 |
|5 |3 |SystemNot1|2 |
--------------------------------------------
This can't be true together (NoteType can't be NULL as well as 1 at the same time):
WHERE N.NoteType IS NULL AND N.NoteType = 1
You may want to use OR instead to check if NoteType is either NULL or 1.
WHERE N.NoteType IS NULL OR N.NoteType = 1
EDIT: With the corrected query, your third job will still not be retrieved: its JobId matches, but the row gets filtered out by the WHERE condition.
Try the below as a workaround to get the third job with NULL values.
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN
    ( SELECT JobId, Note, NoteType FROM NOTE
      WHERE NoteType IS NULL OR NoteType = 1 ) N
ON N.JobId = J.ID
Just exclude the system notes and use a sub-select:
select * from job j
left outer join (
select * from note where notetype!=2
) n
on j.id=n.jobid;
If you filter the joined table in the WHERE clause, the LEFT OUTER JOIN effectively behaves like an INNER JOIN.
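Equivalently, the extra condition can be moved into the join condition itself, which keeps the outer join intact (a sketch against the JOB and NOTE tables above):
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN NOTE N
    ON N.JobId = J.ID
   AND N.NoteType = 1
Jobs that have only system notes, or no notes at all, then still appear with NULL Note and NoteType.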