Spark Scala: fill empty values according to the result of a self-joined dataframe query - scala

I am struggling to write Spark Scala code that fills the rows whose coverage is empty, using a self join with conditions.
This is the data:
+----+--------------+----------+--------+
| ID | date_in_days | coverage | values |
+----+--------------+----------+--------+
| 1 | 2020-09-01 | | 0.128 |
| 1 | 2020-09-03 | 0 | 0.358 |
| 1 | 2020-09-04 | 0 | 0.035 |
| 1 | 2020-09-05 | | |
| 1 | 2020-09-06 | | |
| 1 | 2020-09-19 | | |
| 1 | 2020-09-12 | | |
| 1 | 2020-09-18 | | |
| 1 | 2020-09-11 | | |
| 1 | 2020-09-16 | | |
| 1 | 2020-09-21 | 13 | 0.554 |
| 1 | 2020-09-23 | | |
| 1 | 2020-09-30 | | |
+----+--------------+----------+--------+
Expected result:
+----+--------------+----------+--------+
| ID | date_in_days | coverage | values |
+----+--------------+----------+--------+
| 1 | 2020-09-01 | -1 | 0.128 |
| 1 | 2020-09-03 | 0 | 0.358 |
| 1 | 2020-09-04 | 0 | 0.035 |
| 1 | 2020-09-05 | 0 | |
| 1 | 2020-09-06 | 0 | |
| 1 | 2020-09-19 | 0 | |
| 1 | 2020-09-12 | 0 | |
| 1 | 2020-09-18 | 0 | |
| 1 | 2020-09-11 | 0 | |
| 1 | 2020-09-16 | 0 | |
| 1 | 2020-09-21 | 13 | 0.554 |
| 1 | 2020-09-23 | -1 | |
| 1 | 2020-09-30 | -1 | |
+----+--------------+----------+--------+
What I am trying to do:
For each distinct ID (dataframe partitioned by ID), sorted by date:
When a row's coverage column is null (let's call it rowEmptyCoverage):
Find in the DF the first row with date_in_days > rowEmptyCoverage.date_in_days and with coverage >= 0. Let's call it rowFirstDateGreater.
Then, if rowFirstDateGreater.values > 500, set rowEmptyCoverage.coverage to 0. Set it to -1 otherwise.
I am kind of lost mixing when, join, and where...

I am assuming that you mean values > 0.500 and not values > 500. The logic also remains a little unclear; here I assume that you are searching in the order of the date_in_days column and not in the order of the dataframe.
In any case, we can refine the solution to match your exact need. The overall idea is to use a Window to fetch, for each row, the next date for which the coverage is not null, check whether values meets the desired criterion, and update coverage accordingly.
It goes as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val win = Window.partitionBy("ID").orderBy("date_in_days")
  .rangeBetween(Window.currentRow, Window.unboundedFollowing)

df
  // create a struct binding coverage and values together
  .withColumn("cov_str", when('coverage.isNull, lit(null))
    .otherwise(struct('coverage, 'values)))
  // find the first row (starting from the current date, in order of
  // date_in_days) for which the coverage is not null
  .withColumn("next_cov_str", first('cov_str, ignoreNulls = true) over win)
  // update coverage: keep the original value if not null, put 0 if values
  // meets the criterion (which you can change), and -1 otherwise
  .withColumn("coverage", coalesce(
    'coverage,
    when($"next_cov_str.values" > 0.500, lit(0)),
    lit(-1)
  ))
  .show(false)
+---+-------------------+--------+------+-----------+------------+
|ID |date_in_days |coverage|values|cov_str |next_cov_str|
+---+-------------------+--------+------+-----------+------------+
|1 |2020-09-01 00:00:00|-1 |0.128 |null |[0, 0.358] |
|1 |2020-09-03 00:00:00|0 |0.358 |[0, 0.358] |[0, 0.358] |
|1 |2020-09-04 00:00:00|0 |0.035 |[0, 0.035] |[0, 0.035] |
|1 |2020-09-05 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-06 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-11 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-12 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-16 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-18 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-19 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-21 00:00:00|13 |0.554 |[13, 0.554]|[13, 0.554] |
|1 |2020-09-23 00:00:00|-1 |null |null |null |
|1 |2020-09-30 00:00:00|-1 |null |null |null |
+---+-------------------+--------+------+-----------+------------+
You may then drop the helper columns with drop("cov_str", "next_cov_str"), but I leave them here for clarity.
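For reference, here is a minimal sketch of how the sample df could be built to try this out. This setup is an assumption of mine, not something from the original post; the Options simply make coverage and values nullable:

import org.apache.spark.sql.functions.to_date
import spark.implicits._

// hypothetical construction of the question's sample data
val df = Seq(
  (1, "2020-09-01", None, Some(0.128)),
  (1, "2020-09-03", Some(0), Some(0.358)),
  (1, "2020-09-04", Some(0), Some(0.035)),
  (1, "2020-09-05", None, None),
  (1, "2020-09-21", Some(13), Some(0.554)),
  (1, "2020-09-23", None, None)
  // ... the remaining dates follow the same pattern
).toDF("ID", "date_in_days", "coverage", "values")
  // parse the string dates so the window can order them chronologically
  .withColumn("date_in_days", to_date($"date_in_days"))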

Related

DB2/AS400 SQL Pivot

I have a problem with pivot tables and I don't understand what to do.
My table is as follows:
|CODART|MONTH|QT |
|------|-----|----|
|ART1 |1 |100 |
|ART2 |1 |30 |
|ART3 |1 |30 |
|ART1 |2 |10 |
|ART4 |2 |40 |
|ART3 |4 |50 |
|ART5 |4 |60 |
I would like to get a summary table by month:
|CODART|1 |2 |3 |4 |5 |6 |7 |8 |9 |10 |11 |12 |
|------|---|---|---|---|---|---|---|---|---|---|---|---|
|ART1 |100|10 | | | | | | | | | | |
|ART2 |30 | | | | | | | | | | | |
|ART3 |30 | | |50 | | | | | | | | |
|ART4 | |40 | | | | | | | | | | |
|ART5 | | | |60 | | | | | | | | |
|TOTAL |160|50 | |110| | | | | | | | |
Too many requests? :-)
Thanks for the support
WITH MYTAB (CODART, MONTH, QT) AS
(
VALUES
('ART1', 1, 100)
, ('ART2', 1, 30)
, ('ART3', 1, 30)
, ('ART1', 2, 10)
, ('ART4', 2, 40)
, ('ART3', 4, 50)
, ('ART5', 4, 60)
)
SELECT
CASE GROUPING (CODART) WHEN 0 THEN CODART ELSE 'TOTAL' END AS CODART
, SUM (CASE MONTH WHEN 1 THEN QT END) AS "1"
, SUM (CASE MONTH WHEN 2 THEN QT END) AS "2"
, SUM (CASE MONTH WHEN 3 THEN QT END) AS "3"
, SUM (CASE MONTH WHEN 4 THEN QT END) AS "4"
-- ... same pattern for months 5 through 11 ...
, SUM (CASE MONTH WHEN 12 THEN QT END) AS "12"
FROM MYTAB T
GROUP BY ROLLUP (T.CODART)
ORDER BY GROUPING (T.CODART), T.CODART
|CODART|1  |2  |3  |4  |12 |
|------|---|---|---|---|---|
|ART1  |100|10 |   |   |   |
|ART2  |30 |   |   |   |   |
|ART3  |30 |   |   |50 |   |
|ART4  |   |40 |   |   |   |
|ART5  |   |   |   |60 |   |
|TOTAL |160|50 |   |110|   |
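Purely as an illustration (not part of the DB2 answer): since the surrounding questions use Spark, roughly the same month pivot can be sketched in Scala. Note that Spark does not allow pivot after rollup, so the TOTAL row would need a separate aggregation unioned on:

import org.apache.spark.sql.functions._
import spark.implicits._

val mytab = Seq(
  ("ART1", 1, 100), ("ART2", 1, 30), ("ART3", 1, 30),
  ("ART1", 2, 10), ("ART4", 2, 40), ("ART3", 4, 50), ("ART5", 4, 60)
).toDF("CODART", "MONTH", "QT")

// one column per month 1..12; absent (CODART, MONTH) combinations come out as null
val pivoted = mytab
  .groupBy("CODART")
  .pivot("MONTH", 1 to 12)
  .agg(sum("QT"))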

How to add some values in a dataframe in Scala Spark?

Here is the dataframe I have for now; suppose there are 4 days in total: {1, 2, 3, 4}:
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 4 | 3 |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
+-------------+----------+------+
And what I want is
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | null |
| 1 | 4 | 3 |
| 2 | 1 | null |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
| 2 | 4 | null |
+-------------+----------+------+
Is there some way I can get this?
Say df1 is our main table:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |4 |3 |
|2 |2 |4 |
|2 |3 |5 |
+---+----+-----+
We can use the following transformations:
import org.apache.spark.sql.functions._

val data = df1
  // first, group by key and aggregate to a sequence between 1 and 4 (your
  // number of days; sequence requires Spark 2.4+)
  .groupBy("key")
  .agg(sequence(lit(1), lit(4)).as("Time"))
  // explode the sequence, thus creating every 'Time' per 'key'
  .withColumn("Time", explode(col("Time")))
  // finally, left join back to the main table on 'key' and 'Time'
  .join(df1, Seq("key", "Time"), "left")
To get this output:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |3 |null |
|1 |4 |3 |
|2 |1 |null |
|2 |2 |4 |
|2 |3 |5 |
|2 |4 |null |
+---+----+-----+
Which should be what you are looking for, good luck!
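If the number of days is not fixed, a small variant (just a sketch, assuming Time is an integer column and Spark 2.4+ for sequence) can derive the bounds from the data instead of hardcoding 1 and 4:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// read the smallest and largest Time present in the data
val Row(lo: Int, hi: Int) = df1.agg(min("Time"), max("Time")).head

val filled = df1
  .select("key").distinct()
  // one row per (key, Time) over the full observed range
  .withColumn("Time", explode(sequence(lit(lo), lit(hi))))
  .join(df1, Seq("key", "Time"), "left")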

Converting a dataframe column into one-hot-encoder-like columns

I am trying to find a solution to convert a specific column into one-hot-encoded columns. For example:
|Content|type|
|-------|----|
|alpha  | A  |
|beta   | B  |
|gamma  | C  |
|theta  | A  |
|zeta   | C  |
|neta   | B  |
And what I am trying to do is the following:
|Content|type_A|type_B|type_C|
|-------|------|------|------|
|alpha  | 1    | 0    | 0    |
|beta   | 0    | 1    | 0    |
|gamma  | 0    | 0    | 1    |
|theta  | 1    | 0    | 0    |
|zeta   | 0    | 0    | 1    |
|neta   | 0    | 1    | 0    |
I think pivot is what you are looking for:
import org.apache.spark.sql.functions.count
import spark.implicits._

val df = Seq(
  ("alpha", "A"),
  ("beta", "B"),
  ("gamma", "C"),
  ("theta", "A"),
  ("zeta", "C"),
  ("neta", "B")
).toDF("Content", "type")

// pivot on 'type', counting occurrences; fill absent combinations with 0
val result = df.groupBy("Content")
  .pivot("type")
  .agg(count("type"))
  .na.fill(0)
Output:
+-------+---+---+---+
|Content|A |B |C |
+-------+---+---+---+
|neta |0 |1 |0 |
|beta |0 |1 |0 |
|gamma |0 |0 |1 |
|theta |1 |0 |0 |
|zeta |0 |0 |1 |
|alpha |1 |0 |0 |
+-------+---+---+---+
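If you need the exact type_A/type_B/type_C headers from the question, one way (a small sketch building on result above) is to prefix the pivoted columns:

// rename every pivoted column to "type_<value>", leaving 'Content' untouched
val renamed = result.columns.foldLeft(result) { (acc, c) =>
  if (c == "Content") acc else acc.withColumnRenamed(c, s"type_$c")
}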

Postgres select from table and spread evenly

I have 2 tables. The first table contains information about the object; the second table contains related objects. Second-table objects have 4 types (let's call them A, B, C, D).
I need a query that does something like this:
|table1 object id | A |value for A| B |value for B| C |value for C| D |value for D|
|       1         | 12| cat       | 13| dog       | 2 | house     | 43| car       |
|       1         | 5 | lion      |   |           |   |           |   |           |
The column "table1 object id" in real table is multiple columns of data from table 1(for single object its all the same, just repeated on multiple rows because of table 2).
Where 2nd table is in form
|type|value|table 1 object id| id |
|A |cat | 1 | 12|
|B |dog | 1 | 13|
|C |house| 1 | 2 |
|D |car | 1 | 43 |
|A |lion | 1 | 5 |
I hope this makes clear what I want.
I have tried using AND, OR, and JOIN. This does not seem like something that can be done with crosstab.
EDIT
Table 2
|type|value|table 1 object id| id |
|A |cat | 1 | 12|
|B |dog | 1 | 13|
|C |house| 1 | 2 |
|D |car | 1 | 43 |
|A |lion | 1 | 5 |
|C |wolf | 2 | 6 |
Table 1
| id | value1 | value 2|value 3|
| 1 | hello | test | hmmm |
| 2 | bye | test2 | hmm2 |
Result
|value1| value2| value3| A| value| B |value| C|value | D | value|
|hello | test | hmmm |12| cat | 13| dog |2 | house | 43| car |
|hello | test | hmmm |5 | lion | | | | | | |
|bye | test2 | hmm2 | | | | |6 | wolf | | |
I hope this explains a bit better what I want to achieve.
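One general way to achieve this kind of spread is to rank the related rows within each (object, type) pair and pivot one rank per output row. Here is a rough sketch of that technique in Spark Scala (the language of the other examples on this page) rather than Postgres; the table layout and the ordering by id are assumptions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// hypothetical mirror of table 2: (type, value, object_id, id)
val t2 = Seq(
  ("A", "cat", 1, 12), ("B", "dog", 1, 13), ("C", "house", 1, 2),
  ("D", "car", 1, 43), ("A", "lion", 1, 5), ("C", "wolf", 2, 6)
).toDF("type", "value", "object_id", "id")

// rank the rows within each (object_id, type) pair ...
val ranked = t2.withColumn("rn",
  row_number() over Window.partitionBy("object_id", "type").orderBy("id"))

// ... then pivot so each rank becomes one output row, yielding one
// (id, value) column pair per type; table 1 can then be joined on object_id
val spread = ranked
  .groupBy("object_id", "rn")
  .pivot("type", Seq("A", "B", "C", "D"))
  .agg(first("id").as("id"), first("value").as("value"))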

NDepend query methods/types in framework assembly being used by other assemblies/types

I am trying to determine which types or methods in a base framework assembly are being used by other assemblies in the application system, but I cannot seem to find a straightforward query to do that.
What I have to do is first determine which assemblies are directly using the framework assembly, then manually list them in a second query:
SELECT TYPES FROM ASSEMBLIES "IBM.Data.DB2"
WHERE IsDirectlyUsedBy "ASSEMBLY:FirstDirectUsedByAssembly"
OR IsDirectlyUsedBy "ASSEMBLY:SecondDirectUsedByAssembly"
OR IsDirectlyUsedBy "ASSEMBLY:ThirdDirectUsedByAssembly"
OR IsDirectlyUsedBy "ASSEMBLY:FourthDirectUsedByAssembly"
Is there a better/faster way to query for this?
Additionally, the query results are focused on the matched types only, and the exported dependency graph or matrix only shows details of those. How can I render a graph that shows those types or methods together with the dependent types/methods from other assemblies that are consuming them?
UPDATE
I cannot use a query like
SELECT METHODS/TYPES WHERE IsPublic AND !CouldBeInternal
because it returns strange results involving obfuscated types within the IBM.Data.DB2 assembly.
SELECT TYPES
FROM ASSEMBLIES "IBM.Data.DB2"
WHERE IsPublic AND !CouldBeInternal
48 items
--------------------------------------------+-----------------+
types                                       |# IL instructions|
--------------------------------------------+-----------------+
IBM.Data.DB2.ae+m                           |0                |
IBM.Data.DB2.ae+x                           |0                |
IBM.Data.DB2.ae+f                           |0                |
IBM.Data.DB2.ae+ac                          |0                |
IBM.Data.DB2.ae+aa                          |0                |
IBM.Data.DB2.ae+u                           |0                |
IBM.Data.DB2.ae+z                           |0                |
IBM.Data.DB2.ae+e                           |0                |
IBM.Data.DB2.ae+b                           |0                |
IBM.Data.DB2.ae+g                           |0                |
IBM.Data.DB2.ae+ab                          |0                |
IBM.Data.DB2.ae+h                           |0                |
IBM.Data.DB2.ae+r                           |0                |
IBM.Data.DB2.ae+p                           |0                |
IBM.Data.DB2.ae+ad                          |0                |
IBM.Data.DB2.ae+i                           |0                |
IBM.Data.DB2.ae+j                           |0                |
IBM.Data.DB2.ae+t                           |0                |
IBM.Data.DB2.ae+af                          |0                |
IBM.Data.DB2.ae+k                           |0                |
IBM.Data.DB2.ae+l                           |0                |
IBM.Data.DB2.ae+y                           |0                |
IBM.Data.DB2.ae+a                           |0                |
IBM.Data.DB2.ae+q                           |0                |
IBM.Data.DB2.ae+n                           |0                |
IBM.Data.DB2.ae+d                           |0                |
IBM.Data.DB2.ae+c                           |0                |
IBM.Data.DB2.ae+ae                          |0                |
IBM.Data.DB2.ae+o                           |0                |
IBM.Data.DB2.ae+w                           |0                |
IBM.Data.DB2.ae+s                           |0                |
IBM.Data.DB2.ae+v                           |0                |
IBM.Data.DB2.DB2Command                     |2 527            |
IBM.Data.DB2.DB2Connection                  |3 246            |
IBM.Data.DB2.DB2DataAdapter                 |520              |
IBM.Data.DB2.DB2DataReader                  |4 220            |
IBM.Data.DB2.DB2_UDF_PLATFORM               |0                |
IBM.Data.DB2.DB2Enumerator+DB2EnumInstance  |19               |
IBM.Data.DB2.DB2Enumerator+DB2EnumDatabase  |15               |
IBM.Data.DB2.DB2Error                       |98               |
IBM.Data.DB2.DB2ErrorCollection             |55               |
IBM.Data.DB2.DB2Exception                   |185              |
IBM.Data.DB2.DB2Parameter                   |1 853            |
IBM.Data.DB2.DB2ParameterCollection         |1 383            |
IBM.Data.DB2.DB2RowUpdatedEventHandler      |0                |
IBM.Data.DB2.DB2RowUpdatedEventArgs         |14               |
IBM.Data.DB2.DB2Type                        |0                |
IBM.Data.DB2.DB2XmlReader                   |500              |
--------------------------------------------+-----------------+
Sum:                                        |14 635           |
Average:                                    |304.9            |
Minimum:                                    |0                |
Maximum:                                    |4 220            |
Standard deviation:                         |868.22           |
Variance:                                   |753 808          |
--------------------------------------------+-----------------+
Our code does not use those types and enums directly.
This query returns the methods (respectively the types) that are public and could not be internal. Hence, it returns the methods/types that are indeed used outside of their declaring assembly:
SELECT METHODS/TYPES WHERE IsPublic AND !CouldBeInternal