df_hrrchy
|lefId |Lineage |
|-------|--------------------------------------|
|36326 |["36326","36465","36976","36091","82"]|
|36121 |["36121","36908","36976","36091","82"]|
|36380 |["36380","36465","36976","36091","82"]|
|36448 |["36448","36465","36976","36091","82"]|
|36683 |["36683","36465","36976","36091","82"]|
|36949 |["36949","36908","36976","36091","82"]|
|37349 |["37349","36908","36976","36091","82"]|
|37026 |["37026","36908","36976","36091","82"]|
|36879 |["36879","36465","36976","36091","82"]|
df_trans
|tranID | T_Id |
|-----------|-------------------------------------------------------------------------|
|1000540 |["36121","36326","37349","36949","36380","37026","36448","36683","36879"]|
df_creds
|T_Id |T_val |T_Goal |Parent_T_Id |Parent_Val |parent_Goal|
|-------|-------|-------|---------------|----------------|-----------|
|36448 |100 |1 |36465 |200 |1 |
|36465 |200 |1 |36976 |300 |2 |
|36326 |90 |1 |36465 |200 |1 |
|36091 |500 |19 |82 |600 |4 |
|36121 |90 |1 |36908 |200 |1 |
|36683 |90 |1 |36465 |200 |1 |
|36908 |200 |1 |36976 |300 |2 |
|36949 |90 |1 |36908 |200 |1 |
|36976 |300 |2 |36091 |500 |19 |
|37026 |90 |1 |36908 |200 |1 |
|37349 |100 |1 |36908 |200 |1 |
|36879 |90 |1 |36465 |200 |1 |
|36380 |90 |1 |36465 |200 |1 |
Desired Result
|T_id  |children                                 |T_Val |T_Goal |parent_T_id |parent_Goal |trans_id |
|------|-----------------------------------------|------|-------|------------|------------|---------|
|36091 |["36976"]                                |500   |19     |82          |4           |1000540  |
|36465 |["36448","36326","36683","36879","36380"]|200   |1      |36976       |2           |1000540  |
|36908 |["36121","36949","37026","37349"]        |200   |1      |36976       |2           |1000540  |
|36976 |["36465","36908"]                        |300   |2      |36091       |19          |1000540  |
|36683 |null                                     |90    |1      |36465       |1           |1000540  |
|37026 |null                                     |90    |1      |36908       |1           |1000540  |
|36448 |null                                     |100   |1      |36465       |1           |1000540  |
|36949 |null                                     |90    |1      |36908       |1           |1000540  |
|36326 |null                                     |90    |1      |36465       |1           |1000540  |
|36380 |null                                     |90    |1      |36465       |1           |1000540  |
|36879 |null                                     |90    |1      |36465       |1           |1000540  |
|36121 |null                                     |90    |1      |36908       |1           |1000540  |
|37349 |null                                     |100   |1      |36908       |1           |1000540  |
Code Tried
from pyspark.sql import functions as F
from pyspark.sql import DataFrame
from pyspark.sql.functions import explode, collect_set, expr, col, collect_list, array_contains, lit
from functools import reduce

dfs = []
for row in df_trans.rdd.toLocalIterator():
    # def find_nodemap(row):
    # Every id that appears in the lineage of this transaction's T_Ids.
    df_hy_set = (df_hrrchy.filter(df_hrrchy.lefId.isin(row["T_Id"]))
        .select(explode("Lineage").alias("Terrs"))
        .agg(collect_set(col("Terrs")).alias("hierarchy_list"))
        .select(F.lit(row["tranID"]).alias("trans_id"), "hierarchy_list")
    )
    # Children of each parent within that hierarchy.
    df_childrens = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
        .select("T_id", "T_val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
        .groupBy("parent_T_id").agg(collect_list("T_id").alias("children"))
    )
    # df_creds rows restricted to that hierarchy.
    df_filter_creds = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
        .select("T_id", "T_val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
    )
    # Attach the children list to every node (null for leaves).
    df_nodemap = (df_filter_creds.alias("A").join(df_childrens.alias("B"), col("A.T_id") == col("B.parent_T_id"), "left")
        .select("A.T_id", "B.children", "A.T_val", "A.T_Goal", "A.parent_T_id", "A.parent_Goal", "A.trans_id")
    )
    display(df_nodemap)
    # dfs.append(df_nodemap)
# df = reduce(DataFrame.union, dfs)
# display(df)
My problem: this is a bad design. df_trans holds millions of rows, and looping through the DataFrame row by row takes forever. Can I do this without looping? I tried a couple of other methods but could not get the desired result.
You certainly need to process the entire DataFrame in batch, not iterate row by row.
The key point is to "reverse" df_hrrchy, i.e. from the parent lineage obtain the list of children for every T_Id:
val df_children = df_hrrchy.withColumn("children", slice($"Lineage", lit(1), size($"Lineage") - 1))
  .withColumn("parents", slice($"Lineage", 2, 999999))
  .select(explode(arrays_zip($"children", $"parents")).as("rels"))
  .distinct
  .groupBy($"rels.parents".as("T_Id"))
  .agg(collect_set($"rels.children").as("children"))
df_children.show(false)
+-----+-----------------------------------+
|T_Id |children |
+-----+-----------------------------------+
|36091|[36976] |
|36465|[36448, 36380, 36326, 36879, 36683]|
|36976|[36465, 36908] |
|82 |[36091] |
|36908|[36949, 37349, 36121, 37026] |
+-----+-----------------------------------+
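To see what the slice / arrays_zip / explode combination does, here is a minimal plain-Python sketch (not Spark) of the same pairing, using the lineages from the question. The function name `children_by_parent` is made up for illustration. Each lineage element is paired with its successor, which gives (child, parent) edges that are then grouped by parent.

```python
from collections import defaultdict

# Lineages copied from df_hrrchy in the question (leaf first, root last).
lineages = [
    ["36326", "36465", "36976", "36091", "82"],
    ["36121", "36908", "36976", "36091", "82"],
    ["36380", "36465", "36976", "36091", "82"],
    ["36448", "36465", "36976", "36091", "82"],
    ["36683", "36465", "36976", "36091", "82"],
    ["36949", "36908", "36976", "36091", "82"],
    ["37349", "36908", "36976", "36091", "82"],
    ["37026", "36908", "36976", "36091", "82"],
    ["36879", "36465", "36976", "36091", "82"],
]


def children_by_parent(lineages):
    """Map each parent id to the set of its direct children."""
    children = defaultdict(set)
    for lineage in lineages:
        # zip(lineage, lineage[1:]) pairs every element with its successor,
        # i.e. (child, parent); this plays the role of slice + arrays_zip + explode.
        for child, parent in zip(lineage, lineage[1:]):
            children[parent].add(child)  # the set gives distinct + collect_set
    return dict(children)


result = children_by_parent(lineages)
for parent, kids in sorted(result.items()):
    print(parent, sorted(kids))
```

The grouping is the same as in the df_children output above; only the order inside each list may differ, since neither a set nor collect_set guarantees ordering.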
Then expand the list of T_Ids in df_trans and also include all T_Ids from the hierarchy:
val df_trans_map = df_trans.withColumn("T_Id", explode($"T_Id"))
  .join(df_hrrchy, array_contains($"Lineage", $"T_Id"))
  .select($"tranID", explode($"Lineage").as("T_Id"))
  .distinct
df_trans_map.show(false)
+-------+-----+
|tranID |T_Id |
+-------+-----+
|1000540|36976|
|1000540|82 |
|1000540|36091|
|1000540|36465|
|1000540|36326|
|1000540|36121|
|1000540|36908|
|1000540|36380|
|1000540|36448|
|1000540|36683|
|1000540|36949|
|1000540|37349|
|1000540|37026|
|1000540|36879|
+-------+-----+
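The expansion step can be sketched in plain Python as well (again not Spark, and on a cut-down input with only two of the lineages and two of the T_Ids from the question). The helper name `expand_transactions` is hypothetical. For every T_Id of a transaction, each lineage that contains it contributes all of its members.

```python
# Cut-down sample: two lineages and two T_Ids taken from the question's data.
lineages = [
    ["36326", "36465", "36976", "36091", "82"],
    ["36121", "36908", "36976", "36091", "82"],
]
transactions = {"1000540": ["36121", "36326"]}


def expand_transactions(transactions, lineages):
    """Return the distinct (tranID, T_Id) pairs, ancestors included."""
    pairs = set()
    for tran_id, t_ids in transactions.items():
        for t_id in t_ids:  # explode(T_Id)
            for lineage in lineages:
                if t_id in lineage:  # array_contains(Lineage, T_Id)
                    # explode(Lineage); the set takes care of distinct
                    pairs.update((tran_id, member) for member in lineage)
    return pairs


trans_map = expand_transactions(transactions, lineages)
for pair in sorted(trans_map):
    print(pair)
```

Note that the condition matches any lineage containing the id, not only lineages whose leaf is that id, which is also how the array_contains join above behaves.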
With this, a simple join produces the final result:
df_trans_map.join(df_creds, Seq("T_Id"))
.join(df_children, Seq("T_Id"), "left_outer")
.show(false)
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
|T_Id |tranID |T_val|T_Goal|Parent_T_Id|Parent_Val|parent_Goal|children |
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
|36976|1000540|300 |2 |36091 |500 |19 |[36465, 36908] |
|36091|1000540|500 |19 |82 |600 |4 |[36976] |
|36465|1000540|200 |1 |36976 |300 |2 |[36448, 36380, 36326, 36879, 36683]|
|36326|1000540|90 |1 |36465 |200 |1 |null |
|36121|1000540|90 |1 |36908 |200 |1 |null |
|36908|1000540|200 |1 |36976 |300 |2 |[36949, 37349, 36121, 37026] |
|36380|1000540|90 |1 |36465 |200 |1 |null |
|36448|1000540|100 |1 |36465 |200 |1 |null |
|36683|1000540|90 |1 |36465 |200 |1 |null |
|36949|1000540|90 |1 |36908 |200 |1 |null |
|37349|1000540|100 |1 |36908 |200 |1 |null |
|37026|1000540|90 |1 |36908 |200 |1 |null |
|36879|1000540|90 |1 |36465 |200 |1 |null |
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
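The two joins behave differently, which a small plain-Python sketch can make visible. It uses a hypothetical miniature of the intermediate results (a few rows taken from the tables above) and a made-up helper `build_nodemap`. The join with df_creds is an inner join, so id 82, which has no row of its own in df_creds, drops out; the join with df_children is a left outer join, so leaves are kept with a null children list.

```python
# Miniature inputs shaped like df_trans_map, df_creds and df_children.
trans_map = [
    ("1000540", "36976"),
    ("1000540", "36465"),
    ("1000540", "36326"),
    ("1000540", "82"),
]
creds = {
    "36976": {"T_val": 300, "T_Goal": 2, "Parent_T_Id": "36091", "parent_Goal": 19},
    "36465": {"T_val": 200, "T_Goal": 1, "Parent_T_Id": "36976", "parent_Goal": 2},
    "36326": {"T_val": 90, "T_Goal": 1, "Parent_T_Id": "36465", "parent_Goal": 1},
}
children = {
    "36976": ["36465", "36908"],
    "36465": ["36448", "36380", "36326", "36879", "36683"],
}


def build_nodemap(trans_map, creds, children):
    """Inner-join trans_map with creds, then left-join the children lists."""
    rows = []
    for tran_id, t_id in trans_map:
        if t_id not in creds:
            continue  # inner join: ids without a creds row are dropped
        row = {"T_Id": t_id, "tranID": tran_id, **creds[t_id]}
        row["children"] = children.get(t_id)  # left outer join: None for leaves
        rows.append(row)
    return rows


nodemap = build_nodemap(trans_map, creds, children)
for row in nodemap:
    print(row)
```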
You need to rewrite this to use the full cluster. Using toLocalIterator means you aren't fully utilizing the cluster for shared work.
The code below was not run, as you didn't provide a workable data set to test with. If you do, I'll run the code to make sure it's sound.
from pyspark.sql import functions as F

# Explode the T_Id array: one row per (trans_id, T_Id). This creates a lot of short-lived
# rows, but the flip side is that the whole cluster does the work instead of the driver.
df_trans_expld = df_trans.select(F.col("tranID").alias("trans_id"), F.explode("T_Id").alias("T_Id"))
# Explode Lineage: one row per (lefId, lineage member).
df_hrrchy_expld = df_hrrchy.select("lefId", F.explode("Lineage").alias("Lineage"))
# Join on the exploded data (this replaces the per-row filter), then collect the
# distinct lineage members per transaction.
df_hy_set = (df_trans_expld
    .join(df_hrrchy_expld, df_hrrchy_expld.lefId == df_trans_expld.T_Id, "left")
    .groupBy("trans_id")
    .agg(F.collect_set("Lineage").alias("hierarchy_list"))
)
# Same logic as the loop body from here down, except that trans_id is kept as a
# grouping and join key, because df_hy_set now holds every transaction.
df_filter_creds = (df_creds.join(df_hy_set, F.expr("array_contains(hierarchy_list, T_Id)"))
    .select("T_Id", "T_val", "T_Goal", "Parent_T_Id", "parent_Goal", "trans_id")
)
df_childrens = (df_filter_creds
    .groupBy("trans_id", "Parent_T_Id").agg(F.collect_list("T_Id").alias("children"))
)
df_nodemap = (df_filter_creds.alias("A")
    .join(df_childrens.alias("B"),
          (F.col("A.T_Id") == F.col("B.Parent_T_Id")) & (F.col("A.trans_id") == F.col("B.trans_id")),
          "left")
    .select("A.T_Id", "B.children", "A.T_val", "A.T_Goal", "A.Parent_T_Id", "A.parent_Goal", "A.trans_id")
)
# No need to append/union data: df_nodemap is now a single DataFrame.
I'd have to look into this more, but I'm pretty sure your existing code pulls all the data through the driver, which really slows things down. This version makes use of all the executors to complete the work.
There may be another optimization: getting rid of array_contains and using a join instead. I'd have to look at the explain plan to see whether you could get even more performance out of it. I don't remember off the top of my head; you are avoiding a shuffle, so it may be better as is.
OrientDB's console.sh has an INDEXES command, which gives a list of all existing indexes, like so:
+----+-------------------+-----------------+-------+------------+-------+-----------------+
|# |NAME |TYPE |RECORDS|CLASS |COLLATE|FIELDS |
+----+-------------------+-----------------+-------+------------+-------+-----------------+
|0 |dictionary |DICTIONARY |0 | |default| |
|1 |OFunction.name |UNIQUE_HASH_INDEX|11 |OFunction |default|name(STRING) |
|2 |ORole.name |UNIQUE |3 |ORole |ci |name(STRING) |
|3 |OUser.name |UNIQUE |1 |OUser |ci |name(STRING) |
|4 |UserRole.Desc |UNIQUE |3 |UserRole |default|Desc(STRING) |
+----+-------------------+-----------------+-------+------------+-------+-----------------+
| |TOTAL | |18 | | | |
+----+-------------------+-----------------+-------+------------+-------+-----------------+
Is there a way to get this information via the API (or a SQL query)?
I contacted OrientDB directly, and @lvca told me about "metadata:indexmanager", which contains the index information I was looking for:
select expand(indexes) from metadata:indexmanager
Here's an up-to-date link to the documentation:
https://orientdb.com/docs/last/SQL.html#query-the-available-indexes
With this query you can get all the schema metadata:
SELECT expand(classes) from metadata:schema
I obtain a resultant DataFrame, say result, after performing some computations. When I write it to Amazon S3, specific cells come out blank. The top 5 rows of the written output look like this:
_________________________________________________________
|var30 |var31 |var32 |var33 |var34 |var35 |var36|
--------------------------------------------------------
|-0.00586|0.13821 |0 | |1 | | |
|3.87635 |2.86702 |2.51963 |8 |11 |2 |14 |
|3.78279 |2.54833 |2.45881 | |2 | | |
|-0.10092|0 |0 |1 |1 |3 |1 |
|8.08797 |6.14486 |5.25718 | |5 | | |
---------------------------------------------------------
But when I run result.show(), I am able to see the values:
_________________________________________________________
|var30 |var31 |var32 |var33 |var34 |var35 |var36|
--------------------------------------------------------
|-0.00586|0.13821 |0 |2 |1 |1 |6 |
|3.87635 |2.86702 |2.51963 |8 |11 |2 |14 |
|3.78279 |2.54833 |2.45881 |2 |2 |2 |12 |
|-0.10092|0 |0 |1 |1 |3 |1 |
|8.08797 |6.14486 |5.25718 |20 |5 |5 |34 |
---------------------------------------------------------
Also, the blanks show up in the same cells every time I run it.
Use this to save the data to S3:
DataFrame.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("s3n://Yourpath")
For anyone who might come across this issue, I can tell you what worked for me.
I was joining one DataFrame (say inputDF) with another (deltaDF) based on some logic and storing the result in an output DataFrame (outDF). I hit the same problem: I could see a record in outDF.show(), but after writing this DataFrame to a Hive table or persisting outDF (using outDF.persist(StorageLevel.MEMORY_AND_DISK)), that particular record was missing.
SOLUTION: I persisted inputDF (inputDF.persist(StorageLevel.MEMORY_AND_DISK)) before joining it with deltaDF. After that, the outDF.show() output was consistent with the Hive table that outDF was written to.
P.S. I am not sure how this solved the issue. It would be great if someone could explain it, but the above worked for me.
I'm using OrientDB's UI/Query tool to analyze some graph data, and I've spent a couple of days unsuccessfully trying to unwind two arrays.
The unwind clause works just fine for one array but I can't seem to get the output I'm looking for when trying to unwind two arrays.
Here's a simplified example of my data:
#class | amt | storeID | customerID
transaction $4 1 1
transaction $2 1 1
transaction $6 1 4
transaction $3 1 4
transaction $2 2 1
transaction $7 2 1
transaction $8 2 2
transaction $3 2 2
transaction $4 2 3
transaction $9 2 3
transaction $10 3 4
transaction $3 3 4
transaction $4 3 5
transaction $10 3 5
Each customer is a document with the following information:
#class | customerID | State
customer 1 NY
customer 2 NJ
customer 3 PA
customer 4 NY
customer 5 NY
Each store is a document with the following information:
#class | storeID | State | Zip
store 1 NY 1
store 2 NJ 3
store 3 NY 2
Assuming I did not have storeID (nor wanted to create it), I want to recover a flattened table with the following distinct values: name of the store, city, account numbers, and the total amount spent.
The query would hopefully generate something like the table below (for a given depth value).
State | Zip | customerID
NY 1 4
NY 1 5
NY 2 1
NY 2 4
NJ 3 1
NJ 3 2
NJ 3 3
I've tried various expand/flatten/unwind operations but I can't seem to get my query to work.
Here's the query I have that recovers the State and Zip as two arrays and flattens the customerID:
SELECT out().State as State,
out().Zip as Zip,
customerID
FROM ( SELECT EXPAND(IN())
FROM (TRAVERSE * FROM
( SELECT FROM transaction)
)
) ;
Which yields,
State | Zip | customerID
[NY, NY, NJ, NJ] [1,1,2,2] 1
[NY, NY, NJ, NJ] [1,1,2,2] 1
[NY, NY, PA, PA] [1,1,3,3] 4
[NY, NY, PA, PA] [1,1,3,3] 4
... .... ....
Which is not what I'm looking for. Can someone provide a little help on how I can flatten/unwind these two arrays all together?
I tried your case with this structure (based on your example):
I used these queries to retrieve State, Zip and customerID (not as arrays):
Query 1:
SELECT State, Zip, in('transaction').customerID AS customerID FROM Store
ORDER BY Zip UNWIND customerID
----+------+-----+----+----------
# |#CLASS|State|Zip |customerID
----+------+-----+----+----------
0 |null |NY |1 |1
1 |null |NY |1 |1
2 |null |NY |1 |4
3 |null |NY |1 |4
4 |null |NY |2 |4
5 |null |NY |2 |4
6 |null |NY |2 |5
7 |null |NY |2 |5
8 |null |NJ |3 |1
9 |null |NJ |3 |1
10 |null |NJ |3 |2
11 |null |NJ |3 |2
12 |null |NJ |3 |3
13 |null |NJ |3 |3
----+------+-----+----+----------
Query 2:
SELECT inV('transaction').State AS State, inV('transaction').Zip AS Zip,
outV('transaction').customerID AS customerID FROM transaction ORDER BY Zip
----+------+-----+----+----------
# |#CLASS|State|Zip |customerID
----+------+-----+----+----------
0 |null |NY |1 |1
1 |null |NY |1 |1
2 |null |NY |1 |4
3 |null |NY |1 |4
4 |null |NY |2 |4
5 |null |NY |2 |4
6 |null |NY |2 |5
7 |null |NY |2 |5
8 |null |NJ |3 |1
9 |null |NJ |3 |1
10 |null |NJ |3 |2
11 |null |NJ |3 |2
12 |null |NJ |3 |3
13 |null |NJ |3 |3
----+------+-----+----+----------
EDITED
With the following query you can retrieve the average and the total spent for every storeID (per customerID):
SELECT customerID, storeID, avg(amt) AS averagePerStore, sum(amt) AS totalPerStore
FROM transaction GROUP BY customerID,storeID ORDER BY customerID
----+------+----------+-------+---------------+-------------
# |#CLASS|customerID|storeID|averagePerStore|totalPerStore
----+------+----------+-------+---------------+-------------
0 |null |1 |1 |3.0 |6.0
1 |null |1 |2 |4.5 |9.0
2 |null |2 |2 |5.5 |11.0
3 |null |3 |2 |6.5 |13.0
4 |null |4 |1 |4.5 |9.0
5 |null |4 |3 |6.5 |13.0
6 |null |5 |3 |7.0 |14.0
----+------+----------+-------+---------------+-------------
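As a cross-check of the GROUP BY logic, here is a small plain-Python sketch (not OrientDB) that groups the sample transactions from the question by (customerID, storeID) and computes the same average and total. The function name `spend_per_customer_store` is made up for illustration.

```python
from collections import defaultdict

# (amt, storeID, customerID) rows copied from the sample data in the question.
transactions = [
    (4, 1, 1), (2, 1, 1), (6, 1, 4), (3, 1, 4),
    (2, 2, 1), (7, 2, 1), (8, 2, 2), (3, 2, 2), (4, 2, 3), (9, 2, 3),
    (10, 3, 4), (3, 3, 4), (4, 3, 5), (10, 3, 5),
]


def spend_per_customer_store(rows):
    """Return {(customerID, storeID): (average, total)}, ordered by key."""
    groups = defaultdict(list)
    for amt, store_id, customer_id in rows:
        groups[(customer_id, store_id)].append(amt)  # GROUP BY customerID, storeID
    return {
        key: (sum(amts) / len(amts), float(sum(amts)))  # avg(amt), sum(amt)
        for key, amts in sorted(groups.items())  # ORDER BY customerID
    }


summary = spend_per_customer_store(transactions)
for (customer_id, store_id), (average, total) in summary.items():
    print(customer_id, store_id, average, total)
```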
Hope it helps
I'm attempting to create a report which will pull data from multiple inventory databases, and based on the item counts in the various systems provide a Cost Estimate from another data source. The report has two levels of grouping, and I need to sum the cost values within each group only (not a total for all groups).
As an illustration, my result set looks like this
MODEL |COMPONENT |Sys1 |Sys2 |Sys3 |Sys1ID |Match |Cost
=====================================================================
Car | | | | | | |620 <--- Sum of component costs
|Wheel |4 |8 |10 | |Sys1 |40
|wheel1
|wheel2
|wheel3
|Brakes |0 |9 |11 | |Sys2 |80
|Horn |0 |0 |50 | |Sys3 |500
---------------------------------------------------------------------
Truck | | | | | | |980 <--- Sum of component costs
|Wheel |0 |0 |10 | |Sys3 |400
|Brakes |0 |9 |11 | |Sys2 |80
|Horn |0 |0 |50 | |Sys3 |500
The table shows the unique ID for any matching items in Sys1, and the Sys1 column contains a COUNT() aggregate for these items. The Sys2 and Sys3 columns return material counts based on a lookup of Model+Component in two other data sources.
The Match column indicates which inventory to service from, in order of priority, based on whether assets exist (Sys1, then Sys2, then Sys3). Finally, the Cost column is populated based on a lookup to a fourth dataset, which is formatted as:
MODEL |COMPONENT |System |Cost
=====================================
Car |Wheel |Sys1 |40
Car |Wheel |Sys2 |50
Car |Wheel |Sys3 |300
Car |Brakes |Sys1 |60
Car |Brakes |Sys2 |80
Car |Brakes |Sys3 |900
...
Everything is set up and working, apart from the Sum of Costs within the group. Does anyone know how to resolve this?
For reference, my original attempt to sum these used this expression:
=SUM(ReportItem!Estimate_Cost.Value)
which yields the following error message:
The Value expression for the textrun 'Textbox39.Paragraphs[0].TextRuns[0]'
uses an aggregate function on a report item. Aggregate functions can be used
only on report items contained in page headers and footers.