spark detect and extract a pattern in column values - scala

I have a df like this
import spark.implicits._
import org.apache.spark.sql.functions._
val latenies = Seq(
("start","304875","2021-10-25 21:26:23.486027"),
("start","304875","2021-10-25 21:26:23.486670"),
("end","304875","2021-10-25 21:26:23.487590"),
("start","304875","2021-10-25 21:26:23.509683"),
("end","304875","2021-10-25 21:26:23.509689"),
("end","304875","2021-10-25 21:26:23.510154"),
("start","201345","2021-10-25 21:26:23.510156"),
("end","201345","2021-10-25 21:26:23.510159"),
("start","201345","2021-10-25 21:26:23.510333"),
("start","201345","2021-10-25 21:26:23.510335"),
("end","201345","2021-10-25 21:26:23.513177"),
("start","201345","2021-10-25 21:26:23.513187")
)
val latenies_df = latenies.toDF("Msg_name","Id_num","TimeStamp")
.withColumn("TimeStamp", to_timestamp(col("TimeStamp")))
latenies_df.show(false)
it looks like this:
+--------+------+--------------------------+
|Msg_name|Id_num|TimeStamp |
+--------+------+--------------------------+
|start |304875|2021-10-25 21:26:23.486027|
|start |304875|2021-10-25 21:26:23.48667 |
|end |304875|2021-10-25 21:26:23.48759 |
|start |304875|2021-10-25 21:26:23.509683|
|end |304875|2021-10-25 21:26:23.509689|
|end |304875|2021-10-25 21:26:23.510154|
|start |201345|2021-10-25 21:26:23.510156|
|end |201345|2021-10-25 21:26:23.510159|
|start |201345|2021-10-25 21:26:23.510333|
|start |201345|2021-10-25 21:26:23.510335|
|end |201345|2021-10-25 21:26:23.513177|
|start |201345|2021-10-25 21:26:23.513187|
+--------+------+--------------------------+
Question: I want to extract a certain pattern from column Msg_name: a start whose immediately following row is an end, when partitioned by Id_num and ordered by TimeStamp.
Msg_name can have several starts or ends one after another; I only want a start directly followed by an end, with nothing in between.
From this pattern I'd like to build a df like this:
+-----------+--------------------------+--------------------------+------+
|patter_name|Timestamp_start           |Timestamp_end             |Id_num|
+-----------+--------------------------+--------------------------+------+
|pattern1   |2021-10-25 21:26:23.486670|2021-10-25 21:26:23.487590|304875|
|pattern1   |2021-10-25 21:26:23.509683|2021-10-25 21:26:23.509689|304875|
|pattern1   |2021-10-25 21:26:23.510156|2021-10-25 21:26:23.510159|201345|
|pattern1   |2021-10-25 21:26:23.510335|2021-10-25 21:26:23.513177|201345|
+-----------+--------------------------+--------------------------+------+
What I have done is shift the frame with lag, which does not give the correct answer due to the nature of the Msg_name column.
val window = org.apache.spark.sql.expressions.Window.partitionBy("Id_num").orderBy("TimeStamp")
val df_only_pattern = latenies_df
  .withColumn("TimeStamp_start",
    when($"Msg_name" =!= lag($"Msg_name", 1).over(window), lag("TimeStamp", 1).over(window)).otherwise(lit(null)))
  .withColumn("latency_time",
    when($"TimeStamp_start".isNotNull,
      round((col("TimeStamp").cast("double") - col("TimeStamp_start").cast("double")) * 1e3, 2)).otherwise(lit(null)))
  .withColumnRenamed("TimeStamp", "TimeStamp_end")
  .withColumn("patter_name", lit("pattern1"))
  .na.drop()
df_only_pattern.orderBy("TimeStamp_start").show(false)
What this gives:
+--------+------+--------------------------+--------------------------+------------+-----------+
|Msg_name|Id_num|TimeStamp_end |TimeStamp_start |latency_time|patter_name|
+--------+------+--------------------------+--------------------------+------------+-----------+
|end |304875|2021-10-25 21:26:23.48759 |2021-10-25 21:26:23.48667 |0.92 |pattern1 |
|start |304875|2021-10-25 21:26:23.509683|2021-10-25 21:26:23.48759 |22.09 |pattern1 |
|end |304875|2021-10-25 21:26:23.509689|2021-10-25 21:26:23.509683|0.01 |pattern1 |
|end |201345|2021-10-25 21:26:23.510159|2021-10-25 21:26:23.510156|0.0 |pattern1 |
|start |201345|2021-10-25 21:26:23.510333|2021-10-25 21:26:23.510159|0.17 |pattern1 |
|end |201345|2021-10-25 21:26:23.513177|2021-10-25 21:26:23.510335|2.84 |pattern1 |
|start |201345|2021-10-25 21:26:23.513187|2021-10-25 21:26:23.513177|0.01 |pattern1 |
+--------+------+--------------------------+--------------------------+------------+-----------+
I can achieve the wanted df with Python pandas, using groupby and looping inside the group, which does not seem possible in Spark.

Messages "end" can be taken, which has "start" in previous row:
import org.apache.spark.sql.types.TimestampType

// reuses the window defined in the question (partitionBy Id_num, orderBy TimeStamp)
latenies_df
  .withColumn("TimeStamp_start",
    when(lag($"Msg_name", 1).over(window) === lit("start"), lag($"TimeStamp", 1).over(window))
      .otherwise(lit(null).cast(TimestampType)))
  .where($"Msg_name" === lit("end"))
  .where($"TimeStamp_start".isNotNull)
  .select(
    lit("pattern1").alias("patter_name"),
    $"TimeStamp_start",
    $"TimeStamp".alias("Timestamp_end"),
    $"Id_num"
  )
Result:
+-----------+--------------------------+--------------------------+------+
|patter_name|TimeStamp_start |Timestamp_end |Id_num|
+-----------+--------------------------+--------------------------+------+
|pattern1 |2021-10-25 21:26:23.48667 |2021-10-25 21:26:23.48759 |304875|
|pattern1 |2021-10-25 21:26:23.509683|2021-10-25 21:26:23.509689|304875|
|pattern1 |2021-10-25 21:26:23.510156|2021-10-25 21:26:23.510159|201345|
|pattern1 |2021-10-25 21:26:23.510335|2021-10-25 21:26:23.513177|201345|
+-----------+--------------------------+--------------------------+------+
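If you also want the latency in milliseconds, as in your own attempt, it can be computed afterwards from the two timestamp columns. A minimal sketch, assuming the select above was assigned to a val named result_df (hypothetical name):

// result_df stands for the DataFrame produced by the select above (assumed name)
val with_latency = result_df.withColumn(
  "latency_time",
  round((col("Timestamp_end").cast("double") - col("TimeStamp_start").cast("double")) * 1e3, 2)
)
with_latency.show(false)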

Related

Pyspark join returning no data in output

While performing a simple join on 2 data frames, PySpark returns no output data.
from pyspark.sql import *
import pyspark.sql.functions as F
from pyspark.sql.functions import col
spark = SparkSession.builder.master("local").appName("test").getOrCreate()
file_path="C:\\bigdata\\pipesep_data\\Sales_ny.csv"
df=spark.read.format("csv").option('header','True').option('inferSchema', 'True').option("delimiter", '|').load(file_path)
addData=[(1,"1523 Main St","SFO","CA"),
(2,"3453 Orange St","SFO","NY"),
(3,"34 Warner St","Jersey","NJ"),
(4,"221 Cavalier St","Newark","DE"),
(5,"789 Walnut St","Sandiago","CA")
]
addColumns = ["emp_id","addline1","city","State"]
addDF = spark.createDataFrame(addData,addColumns)
addDF.show()
df.join(addDF,df["State"] == addDF["State"]).show()
(Screenshots of the Sales_ny schema and the Sales_ny.csv data not shown.)
Output: no data in the output, only the columns are joined.
I also tried with left, right, fullouter etc.
For me it is working fine:
>>> df=spark.read.format("csv").option('header','True').option('inferSchema', 'True').option("delimiter", '|').load("/Path to/sample1.csv")
>>> df.show()
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+
| OrderID| Product|Quantity| Price| OrderDate| StoreAddres| City|State|Month|Hour|
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+
|295665.0| Macbook Pro Laptop| 1.0|1700.0|2019-12-30 00:01:00|136 Church St, Ne|New York City| 123| 12.0| 0.0|
|295666.0| LG Washing Machine| 1.0| 600.0|2019-12-29 07:03:00| 562 2nd St, Ne|New York City| NY| 12.0| 7.0|
|295667.0|USB-C Charging Cable| 1.0| 11.95|2019-12-12 18:21:00| 277 Main St, New|New York City| NY| 12.0|18.0|
+--------+--------------------+--------+------+-------------------+-----------------+-------------+-----+-----+----+
>>> addDF.show()
+------+---------------+--------+-----+
|emp_id| addline1| city|State|
+------+---------------+--------+-----+
| 1| 1523 Main St| SFO| CA|
| 2| 3453 Orange St| SFO| NY|
| 3| 34 Warner St| Jersey| NJ|
| 4|221 Cavalier St| Newark| DE|
| 5| 789 Walnut St|Sandiago| CA|
+------+---------------+--------+-----+
>>> df.join(addDF,df["State"] == addDF["State"]).show()
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+
| OrderID| Product|Quantity|Price| OrderDate| StoreAddres| City|State|Month|Hour|emp_id| addline1|city|State|
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+
|295667.0|USB-C Charging Cable| 1.0|11.95|2019-12-12 18:21:00|277 Main St, New|New York City| NY| 12.0|18.0| 2|3453 Orange St| SFO| NY|
|295666.0| LG Washing Machine| 1.0|600.0|2019-12-29 07:03:00| 562 2nd St, Ne|New York City| NY| 12.0| 7.0| 2|3453 Orange St| SFO| NY|
+--------+--------------------+--------+-----+-------------------+----------------+-------------+-----+-----+----+------+--------------+----+-----+
I think your df.State values have spaces.
You can use the code below to remove the spaces and then perform the join:
>>> from pyspark.sql.functions import *
>>> df=df.withColumn('State',trim(df.State))

Fix query to resolve to_char and/or string comparison issue in Scala Databricks 2.4.3

I've processed a parquet file and created the following data frame in Scala Spark 2.4.3.
+-----------+------------+-----------+--------------+-----------+
| itemno|requestMonth|requestYear|totalRequested|requestDate|
+-----------+------------+-----------+--------------+-----------+
| 7512365| 2| 2014| 110.0| 2014-02-01|
| 7519278| 4| 2013| 96.0| 2013-04-01|
|5436134-070| 12| 2013| 8.0| 2013-12-01|
| 7547385| 1| 2014| 89.0| 2014-01-01|
| 0453978| 9| 2014| 18.0| 2014-09-01|
| 7558402| 10| 2014| 260.0| 2014-10-01|
|5437662-070| 7| 2013| 78.0| 2013-07-01|
| 3089858| 11| 2014| 5.0| 2014-11-01|
| 7181584| 2| 2017| 4.0| 2017-02-01|
| 7081417| 3| 2017| 15.0| 2017-03-01|
| 5814215| 4| 2017| 35.0| 2017-04-01|
| 7178940| 10| 2014| 5.0| 2014-10-01|
| 0450636| 1| 2015| 7.0| 2015-01-01|
| 5133406| 5| 2014| 46.0| 2014-05-01|
| 2204858| 12| 2015| 34.0| 2015-12-01|
| 1824299| 5| 2015| 1.0| 2015-05-01|
|5437474-620| 8| 2015| 4.0| 2015-08-01|
| 3086317| 9| 2014| 1.0| 2014-09-01|
| 2204331| 3| 2015| 2.0| 2015-03-01|
| 5334160| 1| 2018| 2.0| 2018-01-01|
+-----------+------------+-----------+--------------+-----------+
To derive a new feature, I am trying to apply logic and rearrange the data frame as follows:
itemno - as in the above-mentioned data frame
startDate - the start of the season
endDate - the end of the season
totalRequested - number of parts requested in that season
percentageOfRequests - totalRequested in the current season / total over this plus the 3 previous seasons (4 seasons in total)
//seasons date for reference
Spring: 1 March to 31 May.
Summer: 1 June to 31 August.
Autumn: 1 September to 30 November.
Winter: 1 December to 28 February.
What I did:
I tried the following two approaches.
case
when to_char(StartDate,'MMDD') between '0301' and '0531' then 'spring'
.....
.....
end as season
but it didn't work. The same to_char logic worked in an Oracle DB, but after looking around I found that Spark SQL doesn't have this function. Also, I tried
import org.apache.spark.sql.functions._
val dateDF1 = orvPartRequestsDF.withColumn("MMDD", concat_ws("-", month($"requestDate"), dayofmonth($"requestDate")))
%sql
select distinct requestDate, MMDD,
case
when MMDD between '3-1' and '5-31' then 'Spring'
when MMDD between '6-1' and '8-31' then 'Summer'
when MMDD between '9-1' and '11-30' then 'Autumn'
when MMDD between '12-1' and '2-28' then 'Winter'
end as season
from temporal
and it also didn't work. Could you please let me know what I am missing here (my guess is that I can't compare strings like this, but I am not sure) and how I can solve it?
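For context, the comparison in the second attempt fails because concat_ws(month, dayofmonth) is not zero-padded, so strings like '10-5' and '3-1' do not sort in date order; date_format with the pattern 'MMdd' yields fixed-width, zero-padded strings that compare correctly. A minimal sketch of that idea on the question's orvPartRequestsDF (the winter branch is collapsed into otherwise for brevity):

import org.apache.spark.sql.functions.{col, date_format, when}

// "MMdd" gives zero-padded strings such as "0301", which compare lexicographically in date order
val withSeason = orvPartRequestsDF
  .withColumn("MMDD", date_format(col("requestDate"), "MMdd"))
  .withColumn("season",
    when(col("MMDD").between("0301", "0531"), "Spring")
      .when(col("MMDD").between("0601", "0831"), "Summer")
      .when(col("MMDD").between("0901", "1130"), "Autumn")
      .otherwise("Winter"))
withSeason.select("requestDate", "MMDD", "season").show(false)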
Edit, after JXC's solution #1 with range between:
Since I was seeing some discrepancy, I am sharing the data frame again. The following is the dataframe seasonDF12:
+-------+-----------+--------------+------+----------+
| itemno|requestYear|totalRequested|season|seasonCalc|
+-------+-----------+--------------+------+----------+
|0450000| 2011| 0.0|Winter| 201075|
|0450000| 2011| 0.0|Winter| 201075|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Winter| 201175|
|0450000| 2012| 3.0|Winter| 201175|
|0450000| 2012| 1.0|Winter| 201175|
|0450000| 2012| 4.0|Spring| 201200|
|0450000| 2012| 0.0|Spring| 201200|
|0450000| 2012| 0.0|Spring| 201200|
|0450000| 2012| 2.0|Summer| 201225|
|0450000| 2012| 3.0|Summer| 201225|
|0450000| 2012| 2.0|Summer| 201225|
+-------+-----------+--------------+------+----------+
to which I'll apply
val seasonDF2 = seasonDF12.selectExpr("*", """
sum(totalRequested) OVER (
PARTITION BY itemno
ORDER BY seasonCalc
RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
) AS sum_totalRequested
""")
and I am seeing (screenshot of the result not shown):
Look at the first 40 in the sum_totalRequested column; all the entries above it are 0, so I am not sure why it's 40. I think I already shared it, but I need the above dataframe to be transformed into:
itemno, startDateOfSeason, endDateOfSeason, sum_totalRequestedBySeason (totalRequested in the current season / totalRequested over the last 3 + current seasons)
Final output will be like this:
itemno | startDateOfSeason | endDateOfSeason | season | sum_totalRequestedBySeason | current season / current + previous 3 seasons
123    | 12/01/2018        | 02/28/2019      | winter | 12                         | 12 / (12 + 36)  (36 from the previous three seasons)
123    | 03/01/2019        | 05/31/2019      | spring | 24                         | 24 / (24 + 45)  (45 from the previous three seasons)
Edit-2: adjusted to calculate the sum grouped by season first, and then the Window aggregate sum.
Edit-1: Based on the comments, the named season is not required. We can encode Spring, Summer, Autumn and Winter as 0, 25, 50 and 75 respectively and add year(requestDate)*100, so each season becomes an integer label and we can use rangeBetween (offset = -100 covers the current plus the previous 3 seasons) in Window aggregate functions. For example, autumn 2014 gets label 201450, and the range [201350, 201450] covers winter 2013 (201375), spring 2014 (201400), summer 2014 (201425) and autumn 2014 itself.
Note: below is PySpark code:
df.createOrReplaceTempView("df_table")
df1 = spark.sql("""
WITH t1 AS ( SELECT *
, year(requestDate) as YY
, date_format(requestDate, "MMdd") as MMDD
FROM df_table )
, t2 AS ( SELECT *,
CASE
WHEN MMDD BETWEEN '0301' AND '0531' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '03-01'))
, 'endDateOfSeason', date(concat_ws('-', YY, '05-31'))
, 'season', 'spring'
, 'label', int(YY)*100
)
WHEN MMDD BETWEEN '0601' AND '0831' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '06-01'))
, 'endDateOfSeason', date(concat_ws('-', YY, '08-31'))
, 'season', 'summer'
, 'label', int(YY)*100 + 25
)
WHEN MMDD BETWEEN '0901' AND '1130' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '09-01'))
, 'endDateOfSeason', date(concat_ws('-', YY, '11-30'))
, 'season', 'autumn'
, 'label', int(YY)*100 + 50
)
WHEN MMDD BETWEEN '1201' AND '1231' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '12-01'))
, 'endDateOfSeason', last_day(concat_ws('-', int(YY)+1, '02-28'))
, 'season', 'winter'
, 'label', int(YY)*100 + 75
)
WHEN MMDD BETWEEN '0101' AND '0229' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', int(YY)-1, '12-01'))
, 'endDateOfSeason', last_day(concat_ws('-', YY, '02-28'))
, 'season', 'winter'
, 'label', (int(YY)-1)*100 + 75
)
END AS seasons
FROM t1
)
SELECT itemno
, seasons.*
, sum(totalRequested) AS sum_totalRequestedBySeason
FROM t2
GROUP BY itemno, seasons
""")
This will get the following result:
df1.show()
+-----------+-----------------+---------------+------+------+--------------------------+
| itemno|startDateOfSeason|endDateOfSeason|season| label|sum_totalRequestedBySeason|
+-----------+-----------------+---------------+------+------+--------------------------+
|5436134-070| 2013-12-01| 2013-12-31|winter|201375| 8.0|
| 1824299| 2015-03-01| 2015-05-31|spring|201500| 1.0|
| 0453978| 2014-09-01| 2014-11-30|autumn|201450| 18.0|
| 7181584| 2017-01-01| 2017-02-28|winter|201675| 4.0|
| 7178940| 2014-09-01| 2014-11-30|autumn|201450| 5.0|
| 7547385| 2014-01-01| 2014-02-28|winter|201375| 89.0|
| 5814215| 2017-03-01| 2017-05-31|spring|201700| 35.0|
| 3086317| 2014-09-01| 2014-11-30|autumn|201450| 1.0|
| 0450636| 2015-01-01| 2015-02-28|winter|201475| 7.0|
| 2204331| 2015-03-01| 2015-05-31|spring|201500| 2.0|
|5437474-620| 2015-06-01| 2015-08-31|summer|201525| 4.0|
| 5133406| 2014-03-01| 2014-05-31|spring|201400| 46.0|
| 7081417| 2017-03-01| 2017-05-31|spring|201700| 15.0|
| 7519278| 2013-03-01| 2013-05-31|spring|201300| 96.0|
| 7558402| 2014-09-01| 2014-11-30|autumn|201450| 260.0|
| 2204858| 2015-12-01| 2015-12-31|winter|201575| 34.0|
|5437662-070| 2013-06-01| 2013-08-31|summer|201325| 78.0|
| 5334160| 2018-01-01| 2018-02-28|winter|201775| 2.0|
| 3089858| 2014-09-01| 2014-11-30|autumn|201450| 5.0|
| 7512365| 2014-01-01| 2014-02-28|winter|201375| 110.0|
+-----------+-----------------+---------------+------+------+--------------------------+
After we have the season totals, calculate the sum of the current plus the previous 3 seasons using a Window aggregate function, and then the ratio:
df1.selectExpr("*", """
round(sum_totalRequestedBySeason/sum(sum_totalRequestedBySeason) OVER (
PARTITION BY itemno
ORDER BY label
RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
),2) AS ratio_of_current_over_current_plus_past_3_seasons
""").show()

How to find quantiles inside agg() function after groupBy in Scala SPARK

I have a dataframe, in which I want to groupBy column A then find different stats like mean, min, max, std dev and quantiles.
I am able to find min, max and mean using the following code:
df.groupBy("A").agg(min("B"), max("B"), mean("B")).show(50, false)
But I am unable to find the quantiles (0.25, 0.5, 0.75). I tried approxQuantile and percentile, but I get the following error:
error: not found: value approxQuantile
If you have Hive on the classpath, you can use many UDAFs like percentile_approx and stddev_samp, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
You can call these functions using callUDF:
import ss.implicits._
import org.apache.spark.sql.functions.callUDF
val df = Seq(1.0,2.0,3.0).toDF("x")
df.groupBy()
.agg(
callUDF("percentile_approx",$"x",lit(0.5)).as("median"),
callUDF("stddev_samp",$"x").as("stdev")
)
.show()
Here is code that I have tested on Spark 3.1:
val simpleData = Seq(("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
df.show()
df.groupBy($"department")
.agg(
percentile_approx($"salary",lit(0.5), lit(10000))
)
.show(false)
Output
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
| James| Sales| NY| 90000| 34|10000|
| Michael| Sales| NY| 86000| 56|20000|
| Robert| Sales| CA| 81000| 30|23000|
| Maria| Finance| CA| 90000| 24|23000|
| Raman| Finance| CA| 99000| 40|24000|
| Scott| Finance| NY| 83000| 36|19000|
| Jen| Finance| NY| 79000| 53|15000|
| Jeff| Marketing| CA| 80000| 25|18000|
| Kumar| Marketing| NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
+----------+-------------------------------------+
|department|percentile_approx(salary, 0.5, 10000)|
+----------+-------------------------------------+
|Sales |86000 |
|Finance |83000 |
|Marketing |80000 |
+----------+-------------------------------------+
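If you want the three quantiles (0.25, 0.5, 0.75) together with the other stats in a single aggregation, percentile_approx also accepts an array of percentages (Spark 3.1+). A minimal sketch using the question's column names A and B, with an assumed accuracy of 10000:

import org.apache.spark.sql.functions._

df.groupBy("A")
  .agg(
    min("B"), max("B"), mean("B"), stddev("B"),
    // returns an array column [q0.25, q0.5, q0.75]
    percentile_approx(col("B"), array(lit(0.25), lit(0.5), lit(0.75)), lit(10000)).as("quantiles_B")
  )
  .show(50, false)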

How to use specific UDF to restore column values?

I have a dataframe which is the following:
+---------+--------+-------+
|date |id |typ_mvt|
+---------+--------+-------+
|date_1 |5697 |C |
|date_2 |5697 |M |
|date_3 |NULL |M |
|date_4 |NULL |S |
+---------+--------+-------+
I want to restore the id (NULL) values as below:
+---------+--------+-------+
|date |id |typ_mvt|
+---------+--------+-------+
|date_1 |5697 |C |
|date_2 |5697 |M |
|date_3 |5697 |M |
|date_4 |5697 |S |
+---------+--------+-------+
Is there a way to achieve this?
Thank you for your answers.
Hello Doc,
na.fill does the job just fine:
val rdd = sc.parallelize(Seq(
(201901, new Integer(5697), "C"),
(201902, new Integer(5697), "M"),
(201903, null.asInstanceOf[Integer], "M"),
(201904, null.asInstanceOf[Integer], "S")
))
val df = rdd.toDF("date", "id", "typ_mvt")
import org.apache.spark.sql.functions.{lag,lead}
val window = org.apache.spark.sql.expressions.Window.orderBy("date")
val sampleId = df.filter($"id".isNotNull).select($"id").first.getInt(0)
val newDf = df.na.fill(sampleId,Seq("id"))
Otherwise, I found the following very similar post with a much better solution:
Fill in null with previously known good value with pyspark
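For reference, that better solution boils down to a forward fill: take the last non-null id over a window ordered by date. A minimal sketch on the df built above (it assumes rows belonging to the same logical id are contiguous when ordered by date, as in the example; without a partitionBy the whole DataFrame goes through a single partition):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// carry the last non-null id forward, row by row, ordered by date
val w = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val filledDf = df.withColumn("id", last($"id", ignoreNulls = true).over(w))
filledDf.show(false)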

PySpark: how to juxtapose 2 columns?

I have two DataFrames with one column each (300 rows each):
df_realite.take(1)
[Row(realite=1.0)]
df_proba_classe_1.take(1)
[Row(probabilite=0.6196931600570679)]
I would like to make one DataFrame with the two columns.
I tried:
_ = spark.createDataFrame([df_realite.rdd, df_proba_classe_1.rdd],
                          schema=StructType([StructField('realite', FloatType()),
                                             StructField('probabilite', FloatType())]))
But
_.take(10)
gives me empty values:
[Row(realite=None, probabilite=None), Row(realite=None, probabilite=None)]
There may be a more concise way (or a way without a join), but you could always just give them both an id and join them like:
from pyspark.sql import functions
df1 = df_realite.withColumn('id', functions.monotonically_increasing_id())
df2 = df_proba_classe_1.withColumn('id', functions.monotonically_increasing_id())
df1.join(df2, on='id').select('realite', 'probabilite')
I think this is what you are looking for. I would only recommend this method if your data is very small, as it is in your case (300 rows), because collect() is not good practice on large amounts of data; otherwise go the join route with dummy columns and do a broadcast join so no shuffle occurs.
from pyspark.sql.functions import *
from pyspark.sql.types import *
df1 = spark.range(10).select(col("id").cast("float"))
df2 = spark.range(10).select(col("id").cast("float"))
l1 = df1.rdd.flatMap(lambda x: x).collect()
l2 = df2.rdd.flatMap(lambda x: x).collect()
list_df = zip(l1, l2)
schema=StructType([ StructField('realite', FloatType() ) ,
StructField('probabilite' , FloatType() ) ])
df = spark.createDataFrame(list_df, schema=schema)
df.show()
+-------+-----------+
|realite|probabilite|
+-------+-----------+
| 0.0| 0.0|
| 1.0| 1.0|
| 2.0| 2.0|
| 3.0| 3.0|
| 4.0| 4.0|
| 5.0| 5.0|
| 6.0| 6.0|
| 7.0| 7.0|
| 8.0| 8.0|
| 9.0| 9.0|
+-------+-----------+