fields comparison in org-tables - org-mode

I want to check whether a row header and a column header are the same, like this:
| | A | B | C |
| A | X | 0 | 0 |
| B | 0 | X | 0 |
| C | 0 | 0 | X |
If I use the following formula:
#+TBLFM: @<<$<<..@>$> = if ($1==@1,X,0)
then I get the following:
| | A | B | C |
| A | X | A = B ? X : 0 | A = C ? X : 0 |
| B | B = A ? X : 0 | X | B = C ? X : 0 |
| C | C = A ? X : 0 | C = B ? X : 0 | X |
Any ideas what's going wrong?

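Your formula compares the headers as Calc symbols (variables). A==A is trivially true, but Calc cannot decide A==B for two unknown variables, so it returns the whole unevaluated symbolic expression.
Quoting the references in the formula makes Calc treat the row and column headers as strings: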
| | A | B | C |
| A | X | 0 | 0 |
| B | 0 | X | 0 |
| C | 0 | 0 | X |
#+TBLFM: @<<$<<..@>$> = if ("$1"=="@1",X,0)
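For comparison, here is the same header-matching logic outside Org, as a minimal Python sketch. The function name `header_match_table` is made up for illustration. Because the headers are ordinary strings here, the equality test always evaluates to a definite true or false instead of staying symbolic.

```python
def header_match_table(labels):
    """Return one row per label: 'X' where row and column labels match, else 0."""
    return [["X" if row == col else 0 for col in labels] for row in labels]

labels = ["A", "B", "C"]
table = header_match_table(labels)
for label, row in zip(labels, table):
    print(label, row)
```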


Conditionally lag value over multiple rows

I am trying to find cases where one type of error causes multiple sequential instances of a second type of error on a vehicle. For example, if there are two vehicles, 'a' and 'b', and vehicle a has an error of type 1 ('error_1') on day 0, it can cause errors of type 2 ('error_2') on days 1, 2, 3, and 4. I want to create a variable named cascading_error that shows every consecutive error_2 following an error_1. Note that in the case of vehicle b, it is possible to have an error_2 without a preceding error_1, in which case the value for cascading_error should be 0.
Here's what I've tried:
from pyspark.sql import Window
from pyspark.sql import functions as F
vals = [('a',0,1,0),('a',1,0,1),('a',2,0,1),('a',3,0,1),('a',4,0,1),('b',0,0,0),('b',1,0,0),('b',2,0,1),('b',3,0,1)]
df = spark.createDataFrame(vals, ['vehicle','day','error_1','error_2'])
w = Window.partitionBy('vehicle').orderBy('day')
# First pass: flag an error_2 that directly follows an error_1.
df = df.withColumn('cascading_error', F.lag(df.error_1).over(w) * df.error_2)
# Second pass: try to extend the flag to the next consecutive error_2.
df = df.withColumn('cascading_error', F.when((F.lag(df.cascading_error).over(w)==1) & (df.error_2==1), F.lit(1)).otherwise(df.cascading_error))
This is my result
| vehicle | day | error_1 | error_2 | cascading_error |
| ------- | --- | ------- | ------- | --------------- |
| a | 0 | 1 | 0 | null |
| a | 1 | 0 | 1 | 1 |
| a | 2 | 0 | 1 | 1 |
| a | 3 | 0 | 1 | 0 |
| a | 4 | 0 | 1 | 0 |
| b | 0 | 0 | 0 | null |
| b | 1 | 0 | 0 | 0 |
| b | 2 | 0 | 1 | 0 |
| b | 3 | 0 | 1 | 0 |
The code is generating the correct cascading_error value on days 1 and 2 for vehicle a, but not on days 3 and 4, which should also be 1. It seems that the logic of combining cascading_error with error_2 to update cascading_error only works for a single row, not sequential ones.
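The rule described above is stateful: whether a row counts depends on whether the chain was still running on the previous row, which a single `lag` cannot see. A minimal plain-Python sketch of that rule follows (not Spark code). The function name `cascading_errors` and the `active` flag are illustrative assumptions, the first row of each vehicle gets 0 rather than null, and the day 4 row for vehicle a is included to match the result table.

```python
from itertools import groupby
from operator import itemgetter

def cascading_errors(rows):
    """rows: iterable of (vehicle, day, error_1, error_2) tuples.
    Returns the rows with a trailing cascading_error flag (0 or 1)."""
    out = []
    ordered = sorted(rows, key=itemgetter(0, 1))  # by vehicle, then day
    for _, group in groupby(ordered, key=itemgetter(0)):
        active = False  # True while a chain started by error_1 is still running
        for vehicle, day, error_1, error_2 in group:
            cascading = 1 if (error_2 == 1 and active) else 0
            out.append((vehicle, day, error_1, error_2, cascading))
            # The chain continues after an error_1 or after a cascading error_2.
            active = error_1 == 1 or cascading == 1
    return out

vals = [('a', 0, 1, 0), ('a', 1, 0, 1), ('a', 2, 0, 1), ('a', 3, 0, 1), ('a', 4, 0, 1),
        ('b', 0, 0, 0), ('b', 1, 0, 0), ('b', 2, 0, 1), ('b', 3, 0, 1)]
result = cascading_errors(vals)
for row in result:
    print(row)
```

With this input, vehicle a is flagged on days 1 through 4 and vehicle b is never flagged, because its error_2 rows are not preceded by an error_1.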

SQL query help: Calculate max of previous rows in the same query

For each row where B = C = D = 1, I want to find the max of A among the previous rows where B = C = D = 1 (excluding the row itself), with rows taken in chronological order.
Data in table looks like this:
|Grp id | B | C | D | A | time |
+-------+---- +-----+-----+------+------+
| 111 | 1 | 0 | 0 | 52 | t |
| 111 | 1 | 1 | 1 | 33 | t+1 |
| 111 | 1 | 1 | 1 | 34 | t+2 |
| 111 | 1 | 1 | 1 | 22 | t+3 |
| 111 | 0 | 0 | 0 | 12 | t+4 |
| 222 | 1 | 1 | 1 | 16 | t |
| 222 | 1 | 0 | 0 | 18 | t2+1 |
| 222 | 1 | 1 | 0 | 13 | t2+2 |
| 222 | 1 | 1 | 1 | 12 | t2+3 |
| 222 | 1 | 1 | 1 | 09 | t2+4 |
| 222 | 1 | 1 | 1 | 22 | t2+5 |
| 222 | 1 | 1 | 1 | 19 | t2+6 |
The table above is the result of the query below, which uses left joins. The joins are required by my project.
SELECT Grp id, B, C, D, A, time, xxx
FROM "DCR" dcr
LEFT JOIN "DCM" dcm ON "Id" = dcm."DCRID"
LEFT JOIN "DC" dc ON dc."Id" = dcm."DCID"
ORDER BY dcr."time"
The Result column needs to be computed with the rule described above. It has to be calculated in the same pass, since only the previous rows should be considered. The xxx above needs to be replaced by a subquery or expression that produces the result.
And the result table should look like this:
|Grp id | B | C | D | A | time |Result|
+-------+---- +-----+-----+------+------+------+
| 111 | 1 | 0 | 0 | 52 | t | - |
| 111 | 1 | 1 | 1 | 33 | t+1 | - |
| 111 | 1 | 1 | 1 | 34 | t+2 | 33 |
| 111 | 1 | 1 | 1 | 22 | t+3 | 34 |
| 111 | 0 | 0 | 0 | 12 | t+4 | - |
| 222 | 1 | 1 | 1 | 16 | t | - |
| 222 | 1 | 0 | 0 | 18 | t2+1 | - |
| 222 | 1 | 1 | 0 | 13 | t2+2 | - |
| 222 | 1 | 1 | 1 | 12 | t2+3 | 16 |
| 222 | 1 | 1 | 1 | 09 | t2+4 | 16 |
| 222 | 1 | 1 | 1 | 22 | t2+5 | 16 |
| 222 | 1 | 1 | 1 | 19 | t2+6 | 22 |
The column could be computed with a window function:
CASE WHEN b = 1 AND c = 1 AND d = 1
THEN max(a) FILTER (WHERE b = 1 AND c = 1 AND d = 1)
OVER (PARTITION BY "Grp id" ORDER BY time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
END
I didn't test it.
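To make the intended semantics concrete, here is a small plain-Python sketch of "max of A over earlier qualifying rows in the same group". It is only a model of the logic, not a replacement for the SQL. The function name `previous_max` is hypothetical, and the sample rows are the group 222 rows from the question, assumed to be already in time order.

```python
def previous_max(rows):
    """rows: list of (grp_id, b, c, d, a) tuples, already in time order per group.
    Returns one result per row: max of `a` over earlier qualifying rows, or None."""
    best = {}      # grp_id -> max of `a` seen so far among qualifying rows
    results = []
    for grp_id, b, c, d, a in rows:
        qualifies = (b == 1 and c == 1 and d == 1)
        # Read the running max BEFORE updating it, so the current row is excluded.
        results.append(best.get(grp_id) if qualifies else None)
        if qualifies:
            best[grp_id] = max(a, best.get(grp_id, a))
    return results

rows = [
    (222, 1, 1, 1, 16), (222, 1, 0, 0, 18), (222, 1, 1, 0, 13), (222, 1, 1, 1, 12),
    (222, 1, 1, 1, 9), (222, 1, 1, 1, 22), (222, 1, 1, 1, 19),
]
results = previous_max(rows)
print(results)
```

Reading the running maximum before updating it plays the same role as ending a window frame one row before the current row.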

how to get multiple rows from one row in spark scala

I have a DataFrame in Spark like the one below, and I want to turn every code column into its own row, keyed by the first column, id.
| id | code1 | code2 | code3 | code4 | code5 |
| 1 | A | B | C | D | E |
| 1 | M | N | O | P | Q |
| 1 | P | Q | R | S | T |
| 2 | P | A | C | D | F |
| 2 | S | D | F | R | G |
I want the output in the format below:
| id | code |
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
| 1 | E |
| 1 | M |
| 1 | N |
| 1 | O |
| 1 | P |
| 1 | Q |
| 1 | P |
| 1 | Q |
| 1 | R |
| 1 | S |
| 1 | T |
| 2 | P |
| 2 | A |
| 2 | C |
| 2 | D |
| 2 | F |
| 2 | S |
| 2 | D |
| 2 | F |
| 2 | R |
| 2 | G |
Can anyone please help me get the above output with Spark and Scala?
Using the array, explode and drop functions should give you the desired output:
df.withColumn("code", explode(array("code1", "code2", "code3", "code4", "code5")))
.drop("code1", "code2", "code3", "code4", "code5")
As suggested by undefined_variable, you can just use select:
df.select($"id", explode(array("code1", "code2", "code3", "code4", "code5")).as("code"))
Another option:
df.select(col("id"), explode(split(concat_ws(",", col("code1"), col("code2"), col("code3"), col("code4"), col("code5")), ",")).as("code"))
Basically, the idea is to first concatenate all the required columns into one string, then split that string back into an array and explode it.
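The reshaping itself is independent of Spark. As a rough illustration, the same wide-to-long step can be written in plain Python; the variable names `rows` and `long_rows` are assumptions, and the sample uses three of the rows from the question.

```python
rows = [
    (1, "A", "B", "C", "D", "E"),
    (1, "M", "N", "O", "P", "Q"),
    (2, "P", "A", "C", "D", "F"),
]

# One (id, code) pair per code column, keeping the original row order.
long_rows = [(row_id, code) for row_id, *codes in rows for code in codes]

for pair in long_rows:
    print(pair)
```

Each input row with five code columns yields five output pairs, which is what explode does with the array built from those columns.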

pyspark PipelinedRDD fit to Dataframe column

First of all, I'm new to the Python and Spark world.
I have homework from university, but I'm stuck at one point.
I clustered my data, and now I have my clusters in a PipelinedRDD
after this:
cluster = r: kmeansModelMllib.predict(r))
cluster = [2,1,2,0,0,0,1,2]
Now I have cluster and my DataFrame dataDf, and I need to add cluster as a new column to dataDf.
I have: I need:
+---+---+---+ +---+---+---+-------+
| x | y | z | | x | y | z |cluster|
+---+---+---+ +---+---+---+-------+
| 0 | 1 | 1 | | 0 | 1 | 1 | 2 |
| 0 | 0 | 1 | | 0 | 0 | 1 | 1 |
| 0 | 8 | 0 | | 0 | 8 | 0 | 2 |
| 0 | 8 | 0 | | 0 | 8 | 0 | 0 |
| 0 | 1 | 0 | | 0 | 1 | 0 | 0 |
+---+---+---+ +---+---+---+-------+
You can add an index to both sides using zipWithIndex, join on it, and convert back to a DataFrame.
swp = lambda x: (x[1], x[0])
cluster.zipWithIndex().map(swp).join(dataDf.rdd.zipWithIndex().map(swp)) \
.values().toDF(["cluster", "point"])
In some cases it should be possible to use zip:
cluster.zip(dataDf.rdd).toDF(["cluster", "point"])
You can follow with .select("cluster", "point.*") to flatten the output.
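The index-join idea can be modelled with plain Python lists, which may help when checking the logic without a cluster. This is only a sketch: the lists stand in for the RDDs, the sample values come from the question's tables, and sorting by index is an extra step added here because a distributed join does not promise any output order.

```python
cluster = [2, 1, 2, 0, 0]
points = [(0, 1, 1), (0, 0, 1), (0, 8, 0), (0, 8, 0), (0, 1, 0)]

# Mimic zipWithIndex().map(swp): build {index: value} lookups.
cluster_by_index = {i: c for i, c in enumerate(cluster)}
points_by_index = {i: p for i, p in enumerate(points)}

# Mimic join(...).values(): pair up entries that share an index.
shared = sorted(cluster_by_index.keys() & points_by_index.keys())
joined = [(cluster_by_index[i], points_by_index[i]) for i in shared]

# Mimic flattening the point struct, ordered as x, y, z, cluster.
flat = [point + (c,) for c, point in joined]
for row in flat:
    print(row)
```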

Boolean function for True only if last True

Is there a Boolean function that is true only when the second value is true and the first is false?
| A | B | Result |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
Looks like ~A & B would suffice.
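A quick way to convince yourself is to enumerate the truth table. The sketch below does that in Python; the function name `only_second` is made up for illustration, and the inputs are assumed to be 0 or 1.

```python
def only_second(a, b):
    """True only when a is false and b is true (NOT a AND b)."""
    return (not a) and bool(b)

# Print the full truth table as 0/1 values.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(only_second(a, b)))
```

For inputs restricted to 0 and 1, the bitwise form `~a & b` gives the same values in Python, since the final `& b` keeps only the lowest bit.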