Generate the numbers in a range between two columns in a PySpark DataFrame

Input dataset:
Output dataset:
Basically, I want to add one more column, "new_month", whose values are the numbers between "dvpt_month" and "lead_month". All other column values should stay the same on each row generated for a new_month value.
I want to do this with PySpark.

You can do this by building an array with sequence() that spans the two columns and then exploding that array, which produces one row per value. Subtracting 1 from the second column excludes the upper bound:
from pyspark.sql import functions as F
daf=spark.createDataFrame([(12,24),(24,36),(36,48)],"col1 int,col2 int")
daf.withColumn("arr",F.sequence(F.col("col1"),F.col("col2")-1)).select("col1","col2",F.explode("arr").alias("col3")).show()
#output
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 12| 24| 12|
| 12| 24| 13|
| 12| 24| 14|
| 12| 24| 15|
| 12| 24| 16|
| 12| 24| 17|
| 12| 24| 18|
| 12| 24| 19|
| 12| 24| 20|
| 12| 24| 21|
| 12| 24| 22|
| 12| 24| 23|
| 24| 36| 24|
| 24| 36| 25|
| 24| 36| 26|
| 24| 36| 27|
| 24| 36| 28|
| 24| 36| 29|
| 24| 36| 30|
| 24| 36| 31|
+----+----+----+
only showing top 20 rows
Edit: sequence is available in Spark 2.4.0 and later. In earlier versions, you can try building a similar array with range or map (for example inside a UDF).
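To make the row multiplication explicit, here is a minimal sketch in plain Python (no Spark needed) of what sequence followed by explode does to each input row. The function name explode_range is hypothetical and only used for illustration; it is not part of the PySpark API.

```python
def explode_range(rows):
    """Yield (col1, col2, col3) with col3 running from col1 up to col2 - 1."""
    for col1, col2 in rows:
        # range() excludes its upper bound, mirroring sequence(col1, col2 - 1)
        for col3 in range(col1, col2):
            yield (col1, col2, col3)


rows = [(12, 24), (24, 36), (36, 48)]
result = list(explode_range(rows))

print(len(result))   # 36 rows: 12 generated values for each of the 3 input rows
print(result[0])     # (12, 24, 12)
print(result[-1])    # (36, 48, 47)
```

The same shape is what the exploded DataFrame contains: every original column value is repeated once per generated number.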

Related

Unable to get the result from the window function

+---------------+--------+
|YearsExperience| Salary|
+---------------+--------+
| 1.1| 39343.0|
| 1.3| 46205.0|
| 1.5| 37731.0|
| 2.0| 43525.0|
| 2.2| 39891.0|
| 2.9| 56642.0|
| 3.0| 60150.0|
| 3.2| 54445.0|
| 3.2| 64445.0|
| 3.7| 57189.0|
| 3.9| 63218.0|
| 4.0| 55794.0|
| 4.0| 56957.0|
| 4.1| 57081.0|
| 4.5| 61111.0|
| 4.9| 67938.0|
| 5.1| 66029.0|
| 5.3| 83088.0|
| 5.9| 81363.0|
| 6.0| 93940.0|
| 6.8| 91738.0|
| 7.1| 98273.0|
| 7.9|101302.0|
| 8.2|113812.0|
| 8.7|109431.0|
| 9.0|105582.0|
| 9.5|116969.0|
| 9.6|112635.0|
| 10.3|122391.0|
| 10.5|121872.0|
+---------------+--------+
I want to find the highest salary in the above data, which is 122391.0.
My Code
val top= Window.partitionBy("id").orderBy(col("Salary").desc)
val res= df1.withColumn("top", rank().over(top))
Result
+---------------+--------+---+---+
|YearsExperience| Salary| id|top|
+---------------+--------+---+---+
| 1.1| 39343.0| 0| 1|
| 1.3| 46205.0| 1| 1|
| 1.5| 37731.0| 2| 1|
| 2.0| 43525.0| 3| 1|
| 2.2| 39891.0| 4| 1|
| 2.9| 56642.0| 5| 1|
| 3.0| 60150.0| 6| 1|
| 3.2| 54445.0| 7| 1|
| 3.2| 64445.0| 8| 1|
| 3.7| 57189.0| 9| 1|
| 3.9| 63218.0| 10| 1|
| 4.0| 55794.0| 11| 1|
| 4.0| 56957.0| 12| 1|
| 4.1| 57081.0| 13| 1|
| 4.5| 61111.0| 14| 1|
| 4.9| 67938.0| 15| 1|
| 5.1| 66029.0| 16| 1|
| 5.3| 83088.0| 17| 1|
| 5.9| 81363.0| 18| 1|
| 6.0| 93940.0| 19| 1|
| 6.8| 91738.0| 20| 1|
| 7.1| 98273.0| 21| 1|
| 7.9|101302.0| 22| 1|
| 8.2|113812.0| 23| 1|
| 8.7|109431.0| 24| 1|
| 9.0|105582.0| 25| 1|
| 9.5|116969.0| 26| 1|
| 9.6|112635.0| 27| 1|
| 10.3|122391.0| 28| 1|
| 10.5|121872.0| 29| 1|
+---------------+--------+---+---+
I also tried partitioning by Salary and ordering by id, but the result was the same.
As you can see, 122391 appears near the bottom of the result, but it should come in first position given the ordering I applied.
Can anybody help me find what is wrong?
Are you sure you need a window function here? The window you defined partitions the data by id, which I assume is unique, so each group produced by the window will only have one row. It looks like you want a window over the entire dataframe, which means you don't actually need one. If you just want to add a column with the max, you can get the max using an aggregation on your original dataframe and cross join with it:
import org.apache.spark.sql.functions.max
val maxDF = df1.agg(max("salary").as("top"))
val res = df1.crossJoin(maxDF)
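The reasoning above can be checked with a small plain-Python model of rank() over a window. This is only a sketch of the semantics, not Spark code; rank_desc is a hypothetical helper, and passing None as the partition key stands in for a window over the whole DataFrame.

```python
from collections import defaultdict


def rank_desc(rows, partition_key, order_key):
    """Rank rows inside each partition by order_key, largest first (rank 1)."""
    partitions = defaultdict(list)
    for row in rows:
        key = row[partition_key] if partition_key is not None else None
        partitions[key].append(row)

    ranked = []
    for part in partitions.values():
        for row in sorted(part, key=lambda r: r[order_key], reverse=True):
            # rank = 1 + number of rows in the same partition with a larger value
            rank = 1 + sum(1 for other in part if other[order_key] > row[order_key])
            ranked.append({**row, "top": rank})
    return ranked


rows = [
    {"id": 27, "Salary": 112635.0},
    {"id": 28, "Salary": 122391.0},
    {"id": 29, "Salary": 121872.0},
]

# Partitioning by a unique id: every partition holds one row, so every rank is 1.
by_id = rank_desc(rows, "id", "Salary")
print([r["top"] for r in by_id])

# One partition for all rows: the highest salary gets rank 1.
overall = rank_desc(rows, None, "Salary")
print([(r["Salary"], r["top"]) for r in overall])

# If only the maximum is needed, a plain aggregation is enough.
top_salary = max(r["Salary"] for r in rows)
print(top_salary)
```

This mirrors why the original window returned 1 on every row, and why a single aggregation is the simpler route when only the maximum is wanted.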

How to split a dataframe in two dataframes based on the total number of rows in the original dataframe

Hello I am new to spark and scala and I would like to split the following dataframe:
df:
+----------+-----+------+----------+--------+
| Ts| Temp| Wind| Precipit|Humidity|
+----------+-----+------+----------+--------+
|1579647600| 10| 22| 10| 50|
|1579734000| 11| 21| 10| 55|
|1579820400| 10| 18| 15| 60|
|1579906800| 9| 23| 20| 60|
|1579993200| 8| 24| 25| 50|
|1580079600| 10| 18| 27| 60|
|1580166000| 11| 20| 30| 50|
|1580252400| 12| 17| 15| 50|
|1580338800| 10| 14| 21| 50|
|1580425200| 9| 16| 25| 60|
+----------+-----+------+----------+--------+
The resulting dataframes should be as follows:
df1:
+----------+-----+------+----------+--------+
| Ts| Temp| Wind| Precipit|Humidity|
+----------+-----+------+----------+--------+
|1579647600| 10| 22| 10| 50|
|1579734000| 11| 21| 10| 55|
|1579820400| 10| 18| 15| 60|
|1579906800| 9| 23| 20| 60|
|1579993200| 8| 24| 25| 50|
|1580079600| 10| 18| 27| 60|
|1580166000| 11| 20| 30| 50|
|1580252400| 12| 17| 15| 50|
+----------+-----+------+----------+--------+
df2:
+----------+-----+------+----------+--------+
| Ts| Temp| Wind| Precipit|Humidity|
+----------+-----+------+----------+--------+
|1580338800| 10| 14| 21| 50|
|1580425200| 9| 16| 25| 60|
+----------+-----+------+----------+--------+
where df1 contains the first 80% of the rows of df and df2 contains the remaining 20%.
Try the monotonically_increasing_id() function together with percent_rank() over a window ordered by that id, since this preserves the original row order.
Example:
val df=sc.parallelize(Seq((1579647600,10,22,10,50),
(1579734000,11,21,10,55),
(1579820400,10,18,15,60),
(1579906800, 9,23,20,60),
(1579993200, 8,24,25,50),
(1580079600,10,18,27,60),
(1580166000,11,20,30,50),
(1580252400,12,17,15,50),
(1580338800,10,14,21,50),
(1580425200, 9,16,25,60)),10).toDF("Ts","Temp","Wind","Precipit","Humidity")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val df1=df.withColumn("mid",monotonically_increasing_id)
// window ordered by the generated id, so percent_rank follows the original row order
val w=Window.orderBy("mid")
val df_above_80=df1.withColumn("pr",percent_rank().over(w)).filter(col("pr") >= 0.8).drop(Seq("mid","pr"):_*)
val df_below_80=df1.withColumn("pr",percent_rank().over(w)).filter(col("pr") < 0.8).drop(Seq("mid","pr"):_*)
df_below_80.show()
/*
+----------+----+----+--------+--------+
| Ts|Temp|Wind|Precipit|Humidity|
+----------+----+----+--------+--------+
|1579647600| 10| 22| 10| 50|
|1579734000| 11| 21| 10| 55|
|1579820400| 10| 18| 15| 60|
|1579906800| 9| 23| 20| 60|
|1579993200| 8| 24| 25| 50|
|1580079600| 10| 18| 27| 60|
|1580166000| 11| 20| 30| 50|
|1580252400| 12| 17| 15| 50|
+----------+----+----+--------+--------+
*/
df_above_80.show()
/*
+----------+----+----+--------+--------+
| Ts|Temp|Wind|Precipit|Humidity|
+----------+----+----+--------+--------+
|1580338800| 10| 14| 21| 50|
|1580425200| 9| 16| 25| 60|
+----------+----+----+--------+--------+
*/
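To see why a 0.8 threshold on percent_rank yields an 8/2 split for ten rows, here is a plain-Python sketch of the calculation. It assumes no ties in the ordering column, so percent_rank reduces to position / (n - 1); percent_rank_split is a hypothetical name used only for this illustration.

```python
def percent_rank_split(rows, threshold=0.8):
    """Split already ordered rows into (below, above) by their percent rank."""
    n = len(rows)
    below, above = [], []
    for position, row in enumerate(rows):
        # percent_rank = (rank - 1) / (n - 1); without ties, rank - 1 == position
        pr = position / (n - 1) if n > 1 else 0.0
        if pr < threshold:
            below.append(row)
        else:
            above.append(row)
    return below, above


ts = [1579647600, 1579734000, 1579820400, 1579906800, 1579993200,
      1580079600, 1580166000, 1580252400, 1580338800, 1580425200]

below, above = percent_rank_split(ts)
print(len(below), len(above))  # 8 2
print(above)                   # [1580338800, 1580425200]
```

The eighth row has percent rank 7/9 (about 0.78) and stays below the threshold, while the ninth has 8/9 (about 0.89), which is where the split falls.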
If a random split is acceptable (randomSplit does not preserve row order, and the 80/20 proportions are only approximate):
val Array(df1, df2) = df.randomSplit(Array(0.8, 0.2))
If, however, by "top rows" you mean ordering by the Ts column in your example DataFrame, you could do this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col,percent_rank}
// newest Ts gets percent rank 0, oldest gets 1
val window = Window.orderBy(col("Ts").desc)
val df1 = df.select(col("*"), percent_rank().over(window).alias("rank"))
.filter(col("rank") >= 0.2)
val df2 = df.select(col("*"), percent_rank().over(window).alias("rank"))
.filter(col("rank") < 0.2)
df1.show()
df2.show()

Apache Spark visualization

I'm new to Apache Spark and am currently trying to learn visualization in Apache Spark/Databricks. Suppose I have the following CSV datasets:
Patient.csv
+---+---------+------+---+-----------------+-----------+------------+-------------+
| Id|Post_Code|Height|Age|Health_Cover_Type|Temperature|Disease_Type|Infected_Date|
+---+---------+------+---+-----------------+-----------+------------+-------------+
| 1| 2096| 131| 22| 5| 37| 4| 891717742|
| 2| 2090| 136| 18| 5| 36| 1| 881250949|
| 3| 2004| 120| 9| 2| 36| 2| 878887136|
| 4| 2185| 155| 41| 1| 36| 1| 896029926|
| 5| 2195| 145| 25| 5| 37| 1| 887100886|
| 6| 2079| 172| 52| 2| 37| 5| 871205766|
| 7| 2006| 176| 27| 1| 37| 3| 879487476|
| 8| 2605| 129| 15| 5| 36| 1| 876343336|
| 9| 2017| 145| 19| 5| 37| 4| 897281846|
| 10| 2112| 171| 47| 5| 38| 6| 882539696|
| 11| 2112| 102| 8| 5| 36| 5| 873648586|
| 12| 2086| 151| 11| 1| 35| 1| 894724066|
| 13| 2142| 148| 22| 2| 37| 1| 889446276|
| 14| 2009| 158| 57| 5| 38| 2| 887072826|
| 15| 2103| 167| 34| 1| 37| 3| 892094506|
| 16| 2095| 168| 37| 5| 36| 1| 893400966|
| 17| 2010| 156| 20| 3| 38| 5| 897313586|
| 18| 2117| 143| 17| 5| 36| 2| 875238076|
| 19| 2204| 155| 24| 4| 38| 6| 884159506|
| 20| 2103| 138| 15| 5| 37| 4| 886765356|
+---+---------+------+---+-----------------+-----------+------------+-------------+
And coverType.csv
+--------------+-----------------+
|cover_type_key| cover_type_label|
+--------------+-----------------+
| 1| Single|
| 2| Couple|
| 3| Family|
| 4| Concession|
| 5| Disable|
+--------------+-----------------+
I've managed to load these as DataFrames (Patient and coverType):
val PatientDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/Patient.csv")
.load()
val coverTypeDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/covertype.csv")
.load()
1. How do I generate a bar chart visualization to show the distribution of the different Disease_Type values in my dataset?
2. How do I generate a bar chart visualization to show the average Post_Code of each cover type, with string labels for the cover type?
3. How do I extract the year (YYYY) from Infected_Date (stored as Unix seconds since 1/1/1970 UTC), ordering the result in descending order of the year and average age?
To display charts natively in Databricks, call the display function on a DataFrame. For the first question, you can get what you want by aggregating the DataFrame on disease type:
display(PatientDF.groupBy("Disease_Type").count())
Then use the chart options under the result to build a bar chart. You can do the same for the second question: group by Health_Cover_Type and use .avg("Post_Code") instead of .count(). To show string labels, join with coverTypeDF on Health_Cover_Type = cover_type_key first and group by cover_type_label.
For the third question, convert the Unix seconds to a timestamp, apply the year function, and add an orderBy. In PySpark:
from pyspark.sql.functions import year, to_timestamp
display(PatientDF.select(year(to_timestamp("Infected_Date")).alias("year")).orderBy("year"))
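For the third question, the conversion and ordering can be sanity-checked outside Spark. The following plain-Python sketch takes three (Age, Infected_Date) pairs from the Patient table above, extracts the UTC year from the Unix seconds, and averages the age per year in descending year order. The helper name year_from_unix is an assumption made for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def year_from_unix(seconds):
    """Return the UTC calendar year for a Unix timestamp given in seconds."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).year


# (Age, Infected_Date) for patients 1, 2 and 6 in the table above
patients = [(22, 891717742), (18, 881250949), (52, 871205766)]

ages_by_year = defaultdict(list)
for age, infected in patients:
    ages_by_year[year_from_unix(infected)].append(age)

# Average age per year, newest year first
summary = sorted(
    ((yr, sum(ages) / len(ages)) for yr, ages in ages_by_year.items()),
    reverse=True,
)
print(summary)  # [(1998, 22.0), (1997, 35.0)]
```

In Spark the equivalent would be a groupBy on the derived year column with an average of Age, followed by a descending orderBy.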

Convert matrix to Pyspark Dataframe

I have a matrix of size 1000 × 10000 that I want to convert into a PySpark DataFrame.
Can someone please tell me how to do it? This post has an example, but my number of columns is large, so assigning column names manually would be difficult.
Thanks!
To create a PySpark DataFrame, you can use the createDataFrame() function:
matrix=([11,12,13,14,15],[21,22,23,24,25],[31,32,33,34,35],[41,42,43,44,45])
df=spark.createDataFrame(matrix)
df.show()
+---+---+---+---+---+
| _1| _2| _3| _4| _5|
+---+---+---+---+---+
| 11| 12| 13| 14| 15|
| 21| 22| 23| 24| 25|
| 31| 32| 33| 34| 35|
| 41| 42| 43| 44| 45|
+---+---+---+---+---+
As you can see above, the columns will be named automatically with numbers.
You can also pass your own column names to the createDataFrame() function:
columns=[ 'mycol_'+str(col) for col in range(5) ]
df=spark.createDataFrame(matrix,schema=columns)
df.show()
+-------+-------+-------+-------+-------+
|mycol_0|mycol_1|mycol_2|mycol_3|mycol_4|
+-------+-------+-------+-------+-------+
| 11| 12| 13| 14| 15|
| 21| 22| 23| 24| 25|
| 31| 32| 33| 34| 35|
| 41| 42| 43| 44| 45|
+-------+-------+-------+-------+-------+
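For a wide matrix, the column names can be derived from the matrix itself instead of a hard-coded count. The sketch below is plain Python and only builds the name list; the prefix mycol_ follows the example above, and the createDataFrame call is shown as a comment because it assumes an existing SparkSession named spark.

```python
# Small stand-in for the real matrix: 4 rows, 5 columns
matrix = [[row * 10 + col for col in range(1, 6)] for row in range(1, 5)]

# Derive the number of columns from the first row rather than typing it in
n_cols = len(matrix[0])
columns = [f"mycol_{i}" for i in range(n_cols)]

print(matrix[0])  # [11, 12, 13, 14, 15]
print(columns)    # ['mycol_0', 'mycol_1', 'mycol_2', 'mycol_3', 'mycol_4']

# With a SparkSession available as `spark` (assumed), the names would be used as:
# df = spark.createDataFrame(matrix, schema=columns)
```

The same two lines scale to 10000 columns without change. If the matrix is a NumPy array, converting it to nested Python lists first (for example with .tolist()) may be needed, depending on the PySpark version.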

New column receives the value Null

I have the following DataFrame df
+-----------+-----------+-----------+
|CommunityId|nodes_count|edges_count|
+-----------+-----------+-----------+
| 26| 3| 11|
| 964| 16| 18|
| 1806| 9| 31|
| 2040| 13| 12|
| 2214| 8| 8|
| 2927| 7| 7|
Then I add the column Rate as follows:
df
.withColumn("Rate",when(col("nodes_count") =!= 0, (lit("edges_count")/lit("nodes_count")).as[Double]).otherwise(0.0))
This is what I get:
+-----------+-----------+-----------+-----------------------+
|CommunityId|nodes_count|edges_count| Rate|
+-----------+-----------+-----------+-----------------------+
| 26| 3| 11| null|
| 964| 16| 18| null|
| 1806| 9| 31| null|
| 2040| 13| 12| null|
| 2214| 8| 8| null|
| 2927| 7| 7| null|
For some reason Rate is always equal to null.
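That happens because you use lit. lit("edges_count") creates a literal string column containing the text "edges_count", not a reference to the column of that name, so the division operates on two non-numeric strings and yields null. You should use col instead: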
df
.withColumn(
"Rate" ,when(col("nodes_count") =!= 0,
(col("edges_count") / col("nodes_count")).as[Double]).otherwise(0.0))
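However, both when and as[Double] are unnecessary here. The / operator already returns a double, and with ANSI mode disabled a division by zero yields null rather than an error, so unless you specifically want 0.0 in that case, a simple division is sufficient: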
df.withColumn("Rate", col("edges_count") / col("nodes_count"))
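The difference between lit and col can be pictured with a toy model in plain Python. This is only an analogy for the null-propagating behaviour described above, not how Spark is implemented; to_double and divide are hypothetical helpers.

```python
def to_double(value):
    """Lenient cast to float: return None when the value is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def divide(a, b):
    """Null-propagating division: None if an operand is None or the divisor is 0."""
    x, y = to_double(a), to_double(b)
    if x is None or y is None or y == 0:
        return None
    return x / y


row = {"CommunityId": 26, "nodes_count": 3, "edges_count": 11}

# Like lit("edges_count") / lit("nodes_count"): the operands are the names themselves
as_literals = divide("edges_count", "nodes_count")
print(as_literals)  # None

# Like col("edges_count") / col("nodes_count"): the operands are the row's values
as_columns = divide(row["edges_count"], row["nodes_count"])
print(as_columns)
```

Dividing the literal names gives None on every row, which corresponds to the all-null Rate column in the question.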