New column receives the value Null

New column receives the value Null - scala

I have the following DataFrame df
+-----------+-----------+-----------+
|CommunityId|nodes_count|edges_count|
+-----------+-----------+-----------+
| 26| 3| 11|
| 964| 16| 18|
| 1806| 9| 31|
| 2040| 13| 12|
| 2214| 8| 8|
| 2927| 7| 7|
Then I add the column Rate as follows:
df
.withColumn("Rate",when(col("nodes_count") =!= 0, (lit("edges_count")/lit("nodes_count")).as[Double]).otherwise(0.0))
This is what I get:
+-----------+-----------+-----------+-----------------------+
|CommunityId|nodes_count|edges_count| Rate|
+-----------+-----------+-----------+-----------------------+
| 26| 3| 11| null|
| 964| 16| 18| null|
| 1806| 9| 31| null|
| 2040| 13| 12| null|
| 2214| 8| 8| null|
| 2927| 7| 7| null|
For some reason Rate is always equal to null.

That happens because you use lit. You should use col instead:
df
.withColumn(
"Rate" ,when(col("nodes_count") =!= 0,
(col("edges_count") / col("nodes_count")).as[Double]).otherwise(0.0))
although both when and as Double are useless here, and simple division would be more than sufficient:
df.withColumn("Rate", col("edges_count") / col("nodes_count"))

Related

Spark SQL find min value in column and get whole row

I have a table like below and I want to get row where distance in min in spark sql
I tried this
result.select($"sourceBorder", $"targetBorder", $"min(distance))").show()
which gives error, and result.agg(min("distance")) only gives the distance column not others.
+------------+------------+--------+
|sourceBorder|targetBorder|distance|
+------------+------------+--------+
| 3| 12| 20|
| 4| 12| 28|
| 2| 12| 16|
| 3| 6| 15|
| 4| 6| 19|
| 2| 6| 7|
| 3| 7| 15|
| 4| 7| 23|
| 2| 7| 11|
+------------+------------+--------+
so at the end want this row
| 2| 6| 7|

Add a column of minimum distance, and filter the rows where distance = minimum distance:
result.withColumn(
"mindistance",
min($"distance").over(Window.orderBy("distance"))
).filter($"distance" === $"mindistance")

Apache Spark visualization

I'm new to Apache Spark and trying to learn visualization in Apache Spark/Databricks at the moment. If I have the following csv datasets;
Patient.csv
+---+---------+------+---+-----------------+-----------+------------+-------------+
| Id|Post_Code|Height|Age|Health_Cover_Type|Temperature|Disease_Type|Infected_Date|
+---+---------+------+---+-----------------+-----------+------------+-------------+
| 1| 2096| 131| 22| 5| 37| 4| 891717742|
| 2| 2090| 136| 18| 5| 36| 1| 881250949|
| 3| 2004| 120| 9| 2| 36| 2| 878887136|
| 4| 2185| 155| 41| 1| 36| 1| 896029926|
| 5| 2195| 145| 25| 5| 37| 1| 887100886|
| 6| 2079| 172| 52| 2| 37| 5| 871205766|
| 7| 2006| 176| 27| 1| 37| 3| 879487476|
| 8| 2605| 129| 15| 5| 36| 1| 876343336|
| 9| 2017| 145| 19| 5| 37| 4| 897281846|
| 10| 2112| 171| 47| 5| 38| 6| 882539696|
| 11| 2112| 102| 8| 5| 36| 5| 873648586|
| 12| 2086| 151| 11| 1| 35| 1| 894724066|
| 13| 2142| 148| 22| 2| 37| 1| 889446276|
| 14| 2009| 158| 57| 5| 38| 2| 887072826|
| 15| 2103| 167| 34| 1| 37| 3| 892094506|
| 16| 2095| 168| 37| 5| 36| 1| 893400966|
| 17| 2010| 156| 20| 3| 38| 5| 897313586|
| 18| 2117| 143| 17| 5| 36| 2| 875238076|
| 19| 2204| 155| 24| 4| 38| 6| 884159506|
| 20| 2103| 138| 15| 5| 37| 4| 886765356|
+---+---------+------+---+-----------------+-----------+------------+-------------+
And coverType.csv
+--------------+-----------------+
|cover_type_key| cover_type_label|
+--------------+-----------------+
| 1| Single|
| 2| Couple|
| 3| Family|
| 4| Concession|
| 5| Disable|
+--------------+-----------------+
Which I've managed to load as DataFrames (Patient and coverType);
val PatientDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/Patient.csv")
.load()
val coverTypeDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/covertype.csv")
.load()
How do I generate a bar chart visualization to show the distribution of different Disease_Type in my dataset.
How do I generate a bar chart visualization to show the average Post_Code of each cover type with string labels for cover type.
How do I extract the year (YYYY) from the Infected_Date (represented in date (unix seconds since 1/1/1970 UTC)) ordering the result in decending order of the year and average age.

To display charts natively with Databricks you need to use the display function on a dataframe. For number one, we can accomplish what you'd like by aggregating the dataframe on disease type.
display(PatientDF.groupBy(Disease_Type).count())
Then you can use the charting options to build a bar chart, you can do the same for your 2nd question, but instead of .count() use .avg("Post_Code")
For the third question you need to use the year function after casting the timestamp to a date and an orderBy.
from pyspark.sql.functions import *
display(PatientDF.select(year(to_timestamp("Infected_Date")).alias("year")).orderBy("year"))

Spark Scala Window extend result until the end

I will expose my problem based on the initial dataframe and the one I want to achieve:
val df_997 = Seq [(Int, Int, Int, Int)]((1,1,7,10),(1,10,4,300),(1,3,14,50),(1,20,24,70),(1,30,12,90),(2,10,4,900),(2,25,30,40),(2,15,21,60),(2,5,10,80)).toDF("policyId","FECMVTO","aux","IND_DEF").orderBy(asc("policyId"), asc("FECMVTO"))
df_997.show
+--------+-------+---+-------+
|policyId|FECMVTO|aux|IND_DEF|
+--------+-------+---+-------+
| 1| 1| 7| 10|
| 1| 3| 14| 50|
| 1| 10| 4| 300|
| 1| 20| 24| 70|
| 1| 30| 12| 90|
| 2| 5| 10| 80|
| 2| 10| 4| 900|
| 2| 15| 21| 60|
| 2| 25| 30| 40|
+--------+-------+---+-------+
Imagine I have partitioned this DF by the column policyId and created the column row_num based on it to better see the Windows:
val win = Window.partitionBy("policyId").orderBy("FECMVTO")
val df_998 = df_997.withColumn("row_num",row_number().over(win))
df_998.show
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 10| 4| 300| 3|
| 1| 20| 24| 70| 4|
| 1| 30| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 10| 4| 900| 2|
| 2| 15| 21| 60| 3|
| 2| 25| 30| 40| 4|
+--------+-------+---+-------+-------+
Now, for each window, if the value of aux is 4, I want to set the value of IND_DEF column for that register to the column FEC_MVTO for this register on until the end of the window.
The resulting DF would be:
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 300| 4| 300| 3|
| 1| 300| 24| 70| 4|
| 1| 300| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 900| 4| 900| 2|
| 2| 900| 21| 60| 3|
| 2| 900| 30| 40| 4|
+--------+-------+---+-------+-------+
Thanks for your suggestions as I am very stuck in here...

Here's one approach: First left-join the DataFrame with its aux == 4 filtered version, followed by applying Window function first to backfill nulls with the wanted IND_DEF values per partition, and finally conditionally recreate column FECMVTO:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1,1,7,10), (1,10,4,300), (1,3,14,50), (1,20,24,70), (1,30,12,90),
(2,10,4,900), (2,25,30,40), (2,15,21,60), (2,5,10,80)
).toDF("policyId","FECMVTO","aux","IND_DEF")
val win = Window.partitionBy("policyId").orderBy("FECMVTO").
rowsBetween(Window.unboundedPreceding, 0)
val df2 = df.
select($"policyId", $"aux", $"IND_DEF".as("IND_DEF2")).
where($"aux" === 4)
df.join(df2, Seq("policyId", "aux"), "left_outer").
withColumn("IND_DEF3", first($"IND_DEF2", ignoreNulls=true).over(win)).
withColumn("FECMVTO", coalesce($"IND_DEF3", $"FECMVTO")).
show
// +--------+---+-------+-------+--------+--------+
// |policyId|aux|FECMVTO|IND_DEF|IND_DEF2|IND_DEF3|
// +--------+---+-------+-------+--------+--------+
// | 1| 7| 1| 10| null| null|
// | 1| 14| 3| 50| null| null|
// | 1| 4| 300| 300| 300| 300|
// | 1| 24| 300| 70| null| 300|
// | 1| 12| 300| 90| null| 300|
// | 2| 10| 5| 80| null| null|
// | 2| 4| 900| 900| 900| 900|
// | 2| 21| 900| 60| null| 900|
// | 2| 30| 900| 40| null| 900|
// +--------+---+-------+-------+--------+--------+
Columns IND_DEF2, IND_DEF3 are kept only for illustration (and can certainly be dropped).

#I believe below can be solution for your issue
Considering input_df is your input dataframe
//Step#1 - Filter rows with IND_DEF = 4 from input_df
val only_FECMVTO_4_df1 = input_df.filter($"IND_DEF" === 4)
//Step#2 - Filling FECMVTO value from IND_DEF for the above result
val only_FECMVTO_4_df2 = only_FECMVTO_4_df1.withColumn("FECMVTO_NEW",$"IND_DEF").drop($"FECMVTO").withColumnRenamed("FECMVTO",$"FECMVTO_NEW")
//Step#3 - removing all the records from step#1 from input_df
val input_df_without_FECMVTO_4 = input_df.except(only_FECMVTO_4_df1)
//combining Step#2 output with output of Step#3
val final_df = input_df_without_FECMVTO_4.union(only_FECMVTO_4_df2)

PySpark : Dataframe : Numeric + Null column values resulting in NULL instead of numeric value

I am facing a problem in PySpark Dataframe loaded from a CSV file , where my numeric column do have empty values Like below
+-------------+------------+-----------+-----------+
| Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
| Aaron, V R| 9| 9| |
| Abid Ali, S| 29| 5| |
|Adhikari, H R| 21| | |
| Agarkar, A B| 26| 191| 4|
+-------------+------------+-----------+-----------+
Casted those columns to integer and all those empty become null
df_data_csv_casted = df_data_csv.select(df_data_csv['Country'],df_data_csv['Player_Name'], df_data_csv['Test_Matches'].cast(IntegerType()).alias("Test_Matches"), df_data_csv['ODI_Matches'].cast(IntegerType()).alias("ODI_Matches"), df_data_csv['T20_Matches'].cast(IntegerType()).alias("T20_Matches"))
+-------------+------------+-----------+-----------+
| Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
| Aaron, V R| 9| 9| null|
| Abid Ali, S| 29| 5| null|
|Adhikari, H R| 21| null| null|
| Agarkar, A B| 26| 191| 4|
+-------------+------------+-----------+-----------+
Then I am taking a total , but if one of them is null , result is also coming as null. How to solve it ?
df_data_csv_withTotalCol=df_data_csv_casted.withColumn('Total_Matches',(df_data_csv_casted['Test_Matches']+df_data_csv_casted['ODI_Matches']+df_data_csv_casted['T20_Matches']))
+-------------+------------+-----------+-----------+-------------+
|Player_Name |Test_Matches|ODI_Matches|T20_Matches|Total_Matches|
+-------------+------------+-----------+-----------+-------------+
| Aaron, V R | 9| 9| null| null|
|Abid Ali, S | 29| 5| null| null|
|Adhikari, H R| 21| null| null| null|
|Agarkar, A B | 26| 191| 4| 221|
+-------------+------------+-----------+-----------+-------------+

You can fix this by using coalesce function . for example , lets create some sample data
from pyspark.sql.functions import coalesce,lit
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()
+----+----+
| a| b|
+----+----+
|null|null|
| 1|null|
|null| 2|
+----+----+
When I do simple sum as you did -
cDf.withColumn('Total',cDf.a+cDf.b).show()
I get total as null , same as you described-
+----+----+-----+
| a| b|Total|
+----+----+-----+
|null|null| null|
| 1|null| null|
|null| 2| null|
+----+----+-----+
to fix, use coalesce along with lit function , which replaces null values by zeroes.
cDf.withColumn('Total',coalesce(cDf.a,lit(0)) +coalesce(cDf.b,lit(0))).show()
this gives me correct results-
| a| b|Total|
+----+----+-----+
|null|null| 0|
| 1|null| 1|
|null| 2| 2|
+----+----+-----+

How to find the previous occurence of a value 'a' before some value 'b'

I join two data frames and have the resulting data frame as below.Now I want to
+---------+-----------+-----------+-------------------+---------+-------------------+
|a |b | c | d | e | f |
+---------+-----------+-----------+-------------------+---------+-------------------+
| 7| 2| 1|2015-04-12 23:59:01| null| null |
| 15| 2| 2|2015-04-12 23:59:02| | |
| 11| 2| 4|2015-04-12 23:59:03| null| null|
| 3| 2| 4|2015-04-12 23:59:04| null| null|
| 8| 2| 3|2015-04-12 23:59:05| {NORMAL} 2015-04-12 23:59:05|
| 16| 2| 3|2017-03-12 23:59:06| null| null|
| 5| 2| 3|2015-04-12 23:59:07| null| null|
| 18| 2| 3|2015-03-12 23:59:08| null| null|
| 17| 2| 1|2015-03-12 23:59:09| null| null|
| 6| 2| 1|2015-04-12 23:59:10| null| null|
| 19| 2| 3|2015-03-12 23:59:11| null| null|
| 9| 2| 3|2015-04-12 23:59:12| null| null|
| 1| 2| 2|2015-04-12 23:59:13| null| null|
| 1| 2| 2|2015-04-12 23:59:14| null| null|
| 1| 2| 2|2015-04-12 23:59:15| null| null|
| 10| 3| 2|2015-04-12 23:59:16| null| null|
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
| 12| 3| 1|2015-04-12 23:59:18| null| null|
| 13| 3| 1|2015-04-12 23:59:19| null| null|
| 14| 2| 1|2015-04-12 23:59:20| null| null|
+---------+-----------+-----------+-------------------+---------+-------------------+
Now I have to find the first occuring 1 before each 3 in column c .For example
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
Before this record I want to know the first occured 1 in column c which is
| 17| 2| 1|2015-03-12 23:59:09| null| null|
Any help is appreciated

You can use Spark window function lag import org.apache.spark.sql.expressions.Window
In first step you filter your data on the column "c" based on value as either 1 or 3. You will get data similar to
dft.show()
+---+---+---+---+
| id| a| b| c|
+---+---+---+---+
| 1| 7| 2| 1|
| 2| 15| 2| 3|
| 3| 11| 2| 3|
| 4| 3| 2| 1|
| 5| 8| 2| 3|
+---+---+---+---+
Next, define the window
val w = Window.orderBy("id")
Once this is done, create a new column and put previous value in it
dft.withColumn("prev", lag("c",1).over(w)).show()
+---+---+---+---+----+
| id| a| b| c|prev|
+---+---+---+---+----+
| 1| 7| 2| 1|null|
| 2| 15| 2| 3| 1|
| 3| 11| 2| 3| 3|
| 4| 3| 2| 1| 3|
| 5| 8| 2| 3| 1|
+---+---+---+---+----+
Finally filter on the values of column "c" and "prev"
Note: Do combine the steps when you are writing final code, so as to apply filter directly.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

New column receives the value Null - scala

Related

Spark SQL find min value in column and get whole row

Apache Spark visualization

Spark Scala Window extend result until the end

PySpark : Dataframe : Numeric + Null column values resulting in NULL instead of numeric value

How to find the previous occurence of a value 'a' before some value 'b'

Categories

Resources