Convert matrix to PySpark DataFrame - pyspark

I have a matrix of size 1000×10000 that I want to convert into a PySpark DataFrame.
Can someone please tell me how to do it? This post has an example, but my number of columns is large, so assigning column names manually will be difficult.
Thanks!

To create a PySpark DataFrame, you can use the function createDataFrame():
matrix = [[11, 12, 13, 14, 15], [21, 22, 23, 24, 25], [31, 32, 33, 34, 35], [41, 42, 43, 44, 45]]
df = spark.createDataFrame(matrix)
df.show()
+---+---+---+---+---+
| _1| _2| _3| _4| _5|
+---+---+---+---+---+
| 11| 12| 13| 14| 15|
| 21| 22| 23| 24| 25|
| 31| 32| 33| 34| 35|
| 41| 42| 43| 44| 45|
+---+---+---+---+---+
As you can see above, the columns are named automatically as _1, _2, and so on.
You can also pass your own column names to the createDataFrame() function:
columns = ['mycol_' + str(i) for i in range(5)]
df = spark.createDataFrame(matrix, schema=columns)
df.show()
+-------+-------+-------+-------+-------+
|mycol_0|mycol_1|mycol_2|mycol_3|mycol_4|
+-------+-------+-------+-------+-------+
| 11| 12| 13| 14| 15|
| 21| 22| 23| 24| 25|
| 31| 32| 33| 34| 35|
| 41| 42| 43| 44| 45|
+-------+-------+-------+-------+-------+
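For the 1000×10000 matrix in the question, the same column-name trick scales. Assuming the matrix is a NumPy array (my assumption; the post doesn't say), a minimal sketch would be:
import numpy as np

# Assumed: the matrix is a NumPy array; a random one stands in for the real data.
matrix = np.random.rand(1000, 10000)

# Generate the column names from the matrix width instead of writing them by hand.
columns = ['mycol_' + str(i) for i in range(matrix.shape[1])]

# tolist() turns the array into a list of rows that createDataFrame() accepts.
df = spark.createDataFrame(matrix.tolist(), schema=columns)
For very wide data it may be faster to go through pandas, e.g. spark.createDataFrame(pd.DataFrame(matrix, columns=columns)), but both approaches produce the same DataFrame.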

Related

How to update Iceberg table storing time series data

I'm trying to apply some updates to an Iceberg table using pyspark. The original data in the table is:
+-------------------+---+---+
| time| A| B|
+-------------------+---+---+
|2022-12-01 00:00:00| 1| 6|
|2022-12-02 00:00:00| 2| 7|
|2022-12-03 00:00:00| 3| 8|
|2022-12-04 00:00:00| 4| 9|
|2022-12-05 00:00:00| 5| 10|
+-------------------+---+---+
And the update (stored as a temporary view) is:
+-------------------+---+---+
| time| A| C|
+-------------------+---+---+
|2022-12-04 00:00:00| 40| 90|
|2022-12-05 00:00:00| 50|100|
+-------------------+---+---+
I'd like to end up with:
+-------------------+----+---+----+
| time| A| B| C|
+-------------------+----+---+----+
|2022-12-01 00:00:00| 1| 6| NaN|
|2022-12-02 00:00:00| 2| 7| NaN|
|2022-12-03 00:00:00| 3| 8| NaN|
|2022-12-04 00:00:00| 40| 9| 90|
|2022-12-05 00:00:00| 50| 10| 100|
+-------------------+----+---+----+
As per the docs, I've tried the query:
spark.sql("MERGE INTO db.data d USING update u ON d.time = u.time"
" WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *")
but it fails because the update doesn't contain column B. Also, even if the update did contain column B, column C wouldn't get added in the result, because it isn't in the original table. Is there anything I can do to get the behaviour I'm after?
Thanks for any help.
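One possible direction (a hedged, untested sketch rather than a confirmed solution): Iceberg's Spark SQL extensions support ALTER TABLE ... ADD COLUMNS, so the new column could be added to the table first, and the MERGE could then list its assignments explicitly instead of relying on UPDATE SET * / INSERT *:
# Sketch only: assumes db.data currently has columns (time, A, B) and the
# update view has (time, A, C), as in the tables above.
spark.sql("ALTER TABLE db.data ADD COLUMNS (C double)")

spark.sql("""
    MERGE INTO db.data d USING update u ON d.time = u.time
    WHEN MATCHED THEN UPDATE SET d.A = u.A, d.C = u.C
""")
Rows without a match keep their existing values, with null (rather than NaN) in the new C column.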

Generate numbers in a range within a PySpark data frame

Input dataset:
Output dataset:
Basically, I want to add one more column "new_month" whose value falls between "dvpt_month" and "lead_month", with all other columns' values repeated for each new_month generated between those months.
I want to do it with PySpark.
You can do it by creating an array with sequence between the 2 columns and then exploding that array to get a row for each value:
from pyspark.sql import functions as F

daf = spark.createDataFrame([(12, 24), (24, 36), (36, 48)], "col1 int, col2 int")
daf.withColumn("arr", F.sequence(F.col("col1"), F.col("col2") - 1)).select("col1", "col2", F.explode("arr").alias("col3")).show()
#output
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 12| 24| 12|
| 12| 24| 13|
| 12| 24| 14|
| 12| 24| 15|
| 12| 24| 16|
| 12| 24| 17|
| 12| 24| 18|
| 12| 24| 19|
| 12| 24| 20|
| 12| 24| 21|
| 12| 24| 22|
| 12| 24| 23|
| 24| 36| 24|
| 24| 36| 25|
| 24| 36| 26|
| 24| 36| 27|
| 24| 36| 28|
| 24| 36| 29|
| 24| 36| 30|
| 24| 36| 31|
+----+----+----+
only showing top 20 rows
Edit: sequence is available in Spark version >= 2.4.0. In earlier versions you can try range or a UDF/map to generate a similar array, as in the sketch below.
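For example, a minimal fallback sketch (my own example, not from the original answer) that builds the same array with a Python UDF and range() for Spark < 2.4:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

# Build the array [start, stop) with a plain Python range(), then explode as before.
make_range = F.udf(lambda start, stop: list(range(start, stop)), ArrayType(IntegerType()))

daf.withColumn("arr", make_range(F.col("col1"), F.col("col2"))) \
   .select("col1", "col2", F.explode("arr").alias("col3")) \
   .show()
This produces the same rows as the sequence version, at the cost of running a Python UDF.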

How to split a dataframe in two dataframes based on the total number of rows in the original dataframe

Hello, I am new to Spark and Scala, and I would like to split the following dataframe:
df:
+----------+-----+------+----------+--------+
| Ts| Temp| Wind| Precipit|Humidity|
+----------+-----+------+----------+--------+
|1579647600| 10| 22| 10| 50|
|1579734000| 11| 21| 10| 55|
|1579820400| 10| 18| 15| 60|
|1579906800| 9| 23| 20| 60|
|1579993200| 8| 24| 25| 50|
|1580079600| 10| 18| 27| 60|
|1580166000| 11| 20| 30| 50|
|1580252400| 12| 17| 15| 50|
|1580338800| 10| 14| 21| 50|
|1580425200| 9| 16| 25| 60|
+----------+-----+------+----------+--------+
The resulting dataframes should be as follows:
df1:
+----------+-----+------+----------+--------+
| Ts| Temp| Wind| Precipit|Humidity|
+----------+-----+------+----------+--------+
|1579647600| 10| 22| 10| 50|
|1579734000| 11| 21| 10| 55|
|1579820400| 10| 18| 15| 60|
|1579906800| 9| 23| 20| 60|
|1579993200| 8| 24| 25| 50|
|1580079600| 10| 18| 27| 60|
|1580166000| 11| 20| 30| 50|
|1580252400| 12| 17| 15| 50|
+----------+-----+------+----------+--------+
df2:
+----------+-----+------+----------+--------+
| Ts| Temp| Wind| Precipit|Humidity|
+----------+-----+------+----------+--------+
|1580338800| 10| 14| 21| 50|
|1580425200| 9| 16| 25| 60|
+----------+-----+------+----------+--------+
where df1 has the top 80% of the rows of df and df2 the remaining 20%.
Try the monotonically_increasing_id() function together with the window function percent_rank(), as this preserves the order.
Example:
val df=sc.parallelize(Seq((1579647600,10,22,10,50),
(1579734000,11,21,10,55),
(1579820400,10,18,15,60),
(1579906800, 9,23,20,60),
(1579993200, 8,24,25,50),
(1580079600,10,18,27,60),
(1580166000,11,20,30,50),
(1580252400,12,17,15,50),
(1580338800,10,14,21,50),
(1580425200, 9,16,25,60)),10).toDF("Ts","Temp","Wind","Precipit","Humidity")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

// Window ordered by the generated id so percent_rank() follows the original row order.
val w = Window.orderBy("mid")

val df1 = df.withColumn("mid", monotonically_increasing_id())
val df_above_80 = df1.withColumn("pr", percent_rank().over(w)).filter(col("pr") >= 0.8).drop(Seq("mid", "pr"): _*)
val df_below_80 = df1.withColumn("pr", percent_rank().over(w)).filter(col("pr") < 0.8).drop(Seq("mid", "pr"): _*)
df_below_80.show()
/*
+----------+----+----+--------+--------+
| Ts|Temp|Wind|Precipit|Humidity|
+----------+----+----+--------+--------+
|1579647600| 10| 22| 10| 50|
|1579734000| 11| 21| 10| 55|
|1579820400| 10| 18| 15| 60|
|1579906800| 9| 23| 20| 60|
|1579993200| 8| 24| 25| 50|
|1580079600| 10| 18| 27| 60|
|1580166000| 11| 20| 30| 50|
|1580252400| 12| 17| 15| 50|
+----------+----+----+--------+--------+
*/
df_above_80.show()
/*
+----------+----+----+--------+--------+
| Ts|Temp|Wind|Precipit|Humidity|
+----------+----+----+--------+--------+
|1580338800| 10| 14| 21| 50|
|1580425200| 9| 16| 25| 60|
+----------+----+----+--------+--------+
*/
Assuming the data are randomly split:
val Array(df1, df2) = df.randomSplit(Array(0.8, 0.2))
If, however, by "top rows" you mean ordered by the 'Ts' column in your example dataframe, then you could do this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, percent_rank}

// Rank rows by Ts descending: the latest timestamps get the smallest rank.
val window = Window.partitionBy().orderBy(col("Ts").desc)

val df1 = df.withColumn("rank", percent_rank().over(window))
  .filter(col("rank") >= 0.2)
  .drop("rank")

val df2 = df.withColumn("rank", percent_rank().over(window))
  .filter(col("rank") < 0.2)
  .drop("rank")

Apache Spark visualization

I'm new to Apache Spark and trying to learn visualization in Apache Spark/Databricks at the moment. If I have the following csv datasets:
Patient.csv
+---+---------+------+---+-----------------+-----------+------------+-------------+
| Id|Post_Code|Height|Age|Health_Cover_Type|Temperature|Disease_Type|Infected_Date|
+---+---------+------+---+-----------------+-----------+------------+-------------+
| 1| 2096| 131| 22| 5| 37| 4| 891717742|
| 2| 2090| 136| 18| 5| 36| 1| 881250949|
| 3| 2004| 120| 9| 2| 36| 2| 878887136|
| 4| 2185| 155| 41| 1| 36| 1| 896029926|
| 5| 2195| 145| 25| 5| 37| 1| 887100886|
| 6| 2079| 172| 52| 2| 37| 5| 871205766|
| 7| 2006| 176| 27| 1| 37| 3| 879487476|
| 8| 2605| 129| 15| 5| 36| 1| 876343336|
| 9| 2017| 145| 19| 5| 37| 4| 897281846|
| 10| 2112| 171| 47| 5| 38| 6| 882539696|
| 11| 2112| 102| 8| 5| 36| 5| 873648586|
| 12| 2086| 151| 11| 1| 35| 1| 894724066|
| 13| 2142| 148| 22| 2| 37| 1| 889446276|
| 14| 2009| 158| 57| 5| 38| 2| 887072826|
| 15| 2103| 167| 34| 1| 37| 3| 892094506|
| 16| 2095| 168| 37| 5| 36| 1| 893400966|
| 17| 2010| 156| 20| 3| 38| 5| 897313586|
| 18| 2117| 143| 17| 5| 36| 2| 875238076|
| 19| 2204| 155| 24| 4| 38| 6| 884159506|
| 20| 2103| 138| 15| 5| 37| 4| 886765356|
+---+---------+------+---+-----------------+-----------+------------+-------------+
And coverType.csv
+--------------+-----------------+
|cover_type_key| cover_type_label|
+--------------+-----------------+
| 1| Single|
| 2| Couple|
| 3| Family|
| 4| Concession|
| 5| Disable|
+--------------+-----------------+
which I've managed to load as DataFrames (PatientDF and coverTypeDF):
val PatientDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/Patient.csv")
.load()
val coverTypeDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/covertype.csv")
.load()
1. How do I generate a bar chart visualization to show the distribution of the different Disease_Type values in my dataset?
2. How do I generate a bar chart visualization to show the average Post_Code of each cover type, with string labels for the cover type?
3. How do I extract the year (YYYY) from Infected_Date (represented as unix seconds since 1/1/1970 UTC), ordering the result in descending order of the year and average age?
To display charts natively with Databricks you need to use the display function on a dataframe. For number one, we can accomplish what you'd like by aggregating the dataframe on disease type.
display(PatientDF.groupBy("Disease_Type").count())
Then you can use the charting options to build a bar chart. You can do the same for your 2nd question, but instead of .count() use .avg("Post_Code"); to get the string labels you also need a join with coverTypeDF, as in the sketch below.
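A rough sketch for that (column names taken from the CSVs above; not a verified snippet from the original answer):
# Join on the cover-type key so the chart can use the readable label,
# then average the Post_Code per label.
display(
    PatientDF.join(coverTypeDF, PatientDF["Health_Cover_Type"] == coverTypeDF["cover_type_key"])
             .groupBy("cover_type_label")
             .avg("Post_Code")
)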
For the third question, you need to use the year function after casting the unix seconds to a timestamp, plus an orderBy.
from pyspark.sql.functions import *
display(PatientDF.select(year(to_timestamp("Infected_Date")).alias("year")).orderBy("year"))
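If you also want the average age per year and the descending order the third question asks for, a sketch along the same lines (my extension, untested):
from pyspark.sql.functions import avg, desc, to_timestamp, year

# Cast the unix seconds to a timestamp, pull out the year, then aggregate.
display(
    PatientDF.withColumn("year", year(to_timestamp("Infected_Date")))
             .groupBy("year")
             .agg(avg("Age").alias("avg_age"))
             .orderBy(desc("year"))
)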

How do I replace null values of multiple columns with values from multiple different columns

I have a data frame like below
data = [
(1, None,7,10,11,19),
(1, 4,None,10,43,58),
(None, 4,7,67,88,91),
(1, None,7,78,96,32)
]
df = spark.createDataFrame(data, ["A_min", "B_min","C_min","A_max", "B_max","C_max"])
df.show()
and I would want the null values in the 'min' columns to be replaced by the value from the equivalent max column.
For example, null values in the A_min column should be replaced by the A_max value.
The result should be like the data frame below.
+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
| 1| 11| 7| 10| 11| 19|
| 1| 4| 58| 10| 43| 58|
| 67| 4| 7| 67| 88| 91|
| 1| 96| 7| 78| 96| 32|
+-----+-----+-----+-----+-----+-----+
I have tried the code below by defining the columns, but clearly this does not work. I'd really appreciate any help.
min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]
for i in min_cols
df = df.withColumn(i,when(f.col(i)=='',max_cols.otherwise(col(i))))
display(df)
Assuming you have the same number of max and min columns, you can use coalesce along with a Python list comprehension to obtain your solution:
from pyspark.sql.functions import coalesce
min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]
df.select(
    *[coalesce(df[val], df[max_cols[pos]]).alias(val) for pos, val in enumerate(min_cols)],
    *max_cols
).show()
Output:
+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
| 1| 11| 7| 10| 11| 19|
| 1| 4| 58| 10| 43| 58|
| 67| 4| 7| 67| 88| 91|
| 1| 96| 7| 78| 96| 32|
+-----+-----+-----+-----+-----+-----+
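An equivalent sketch written as a loop, closer to the attempt in the question (same coalesce idea, just applied column by column):
from pyspark.sql.functions import coalesce, col

# Pair each min column with its max counterpart and fill nulls from the max column.
for min_c, max_c in zip(min_cols, max_cols):
    df = df.withColumn(min_c, coalesce(col(min_c), col(max_c)))
df.show()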