Find max values in each row of a Spark data frame in Scala

I have an input Spark dataframe named df:
+---------------+---+---+---+---+
| CustomerID| P1| P2| P3| P4|
+---------------+---+---+---+---+
| 725153| 5| 6| 7| 8|
| 873008| 7| 8| 1| 2|
| 725116| 5| 6| 3| 2|
| 725110| 0| 1| 2| 5|
+---------------+---+---+---+---+
Among P1, P2, P3 and P4, I need to find the 2 maximum values for each CustomerID, get the corresponding column names, and put them in the dataframe, so that my resultant dataframe should be:
+---------------+----+----+
| CustomerID|col1|col2|
+---------------+----+----+
| 725153| P4| P3|
| 873008| P2| P1|
| 725116| P2| P1|
| 725110| P4| P3|
+---------------+----+----+
Here, for the first row, 8 and 7 are the maximum values, and their corresponding column names are P4 and P3, so for that particular CustomerID the row should contain the values P4 and P3. This can be achieved in pyspark by using a pandas dataframe:
import numpy as np
import pandas as pd

nlargest = 2
order = np.argsort(-df.values, axis=1)[:, :nlargest]
result = pd.DataFrame(df.columns[order], columns=['top{}'.format(i) for i in range(1, nlargest+1)], index=df.index)
But how can I achieve this in Scala?

You can use a UDF to get your desired result. In the UDF you can zip all the column names with their respective values, sort the array by value, and finally return the top two column names. Below is the code for the same.
// imports needed for udf, array, col and the $ syntax
import org.apache.spark.sql.functions._
import spark.implicits._

// get all the columns you want, i.e. everything except the first column (CustomerID)
val requiredCol = df.columns.zipWithIndex.filter(_._2 != 0).map(_._1)

// define a UDF which sorts by value in descending order and returns the top two column names
val topTwoColumns = udf((seq: Seq[Int]) =>
  seq.zip(requiredCol)
    .sortBy(_._1)(Ordering[Int].reverse)
    .take(2)
    .map(_._2))
Now you can use withColumn and pass your column values as an array to the previously defined UDF.
df.withColumn("col", topTwoColumns(array(requiredCol.map(col(_)):_*))).
select($"CustomerID",
$"col".getItem(0).as("col1"),
$"col".getItem(1).as("col2")).show
//output
//+----------+----+----+
//|CustomerID|col1|col2|
//+----------+----+----+
//| 725153| P4| P3|
//| 873008| P2| P1|
//| 725116| P2| P1|
//| 725110| P4| P3|
//+----------+----+----+

Related

Calculate number of columns with missing values per each row in PySpark

Let's say we have the following data set:
columns = ['id', 'dogs', 'cats']
values = [(1, 2, 0),(2, None, None),(3, None,9)]
df = spark.createDataFrame(values,columns)
df.show()
+----+----+----+
| id|dogs|cats|
+----+----+----+
| 1| 2| 0|
| 2|null|null|
| 3|null| 9|
+----+----+----+
I would like to calculate the number ("miss_nb") and percentage ("miss_pt") of columns with missing values per row and get the following table:
+----+-------+-------+
| id|miss_nb|miss_pt|
+----+-------+-------+
| 1| 0| 0.00|
| 2| 2| 0.67|
| 3| 1| 0.33|
+----+-------+-------+
The number of columns should not be fixed (it can be any list of columns).
How to do it?
Thanks!
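One way to approach this (a minimal sketch, assuming the percentage is the null count divided by the number of checked columns, rounded to two decimals) is to sum a per-column isNull flag for every row:
import pyspark.sql.functions as F

# columns to check for missing values (everything except the id column)
check_cols = [c for c in df.columns if c != 'id']

# the sum of 1/0 flags gives the number of nulls in each row
miss_nb = sum(F.col(c).isNull().cast('int') for c in check_cols)

(df.withColumn('miss_nb', miss_nb)
   .withColumn('miss_pt', F.round(F.col('miss_nb') / F.lit(len(check_cols)), 2))
   .select('id', 'miss_nb', 'miss_pt')
   .show())
Since check_cols is built from df.columns at runtime, the same code works for any number of columns.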

Show all pyspark columns after group and agg

I wish to group by a column and then find the max of another column. Lastly, I want to show all the columns based on this condition. However, when I use my code, it only shows 2 columns and not all of them.
# Normal way of creating dataframe in pyspark
sdataframe_temp = spark.createDataFrame([
    (2, 2, '0-2'),
    (2, 23, '22-24')],
    ['a', 'b', 'c']
)
sdataframe_temp2 = spark.createDataFrame([
    (4, 6, '4-6'),
    (5, 7, '6-8')],
    ['a', 'b', 'c']
)
# Concat two different pyspark dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
sdataframe_union_1_2_g = sdataframe_union_1_2.groupby('a').agg({'b':'max'})
sdataframe_union_1_2_g.show()
output:
+---+------+
| a|max(b)|
+---+------+
| 5| 7|
| 2| 23|
| 4| 6|
+---+------+
Expected output:
+---+------+-----+
|  a|max(b)|    c|
+---+------+-----+
|  5|     7|  6-8|
|  2|    23|22-24|
|  4|     6|  4-6|
+---+------+-----+
You can use a Window function to make it work:
Method 1: Using Window function
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("a").orderBy(F.desc("b"))

(sdataframe_union_1_2
    .withColumn('max_val', F.row_number().over(w) == 1)
    .where("max_val == True")
    .drop("max_val")
    .show())
+---+---+-----+
| a| b| c|
+---+---+-----+
| 5| 7| 6-8|
| 2| 23|22-24|
| 4| 6| 4-6|
+---+---+-----+
Explanation
Window functions are useful when we want to attach a new column to the existing set of columns.
In this case, I tell the Window function to partition by the 'a' column with partitionBy('a') and to sort column b in descending order with F.desc('b'). This makes the first value of b in each group its max value.
Then we use F.row_number() to filter the max values where row number equals 1.
Finally, we drop the new column since it is not being used after filtering the data frame.
Method 2: Using groupby + inner join
f = sdataframe_union_1_2.groupby('a').agg(F.max('b').alias('b'))
sdataframe_union_1_2.join(f, on=['a','b'], how='inner').show()
+---+---+-----+
| a| b| c|
+---+---+-----+
| 2| 23|22-24|
| 5| 7| 6-8|
| 4| 6| 4-6|
+---+---+-----+

pyspark: drop columns that have same values in all rows

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?
So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows while keeping other columns intact.
However, the answers in the above question are only for pandas. Is there a solution for a pyspark dataframe?
Thanks
You can apply the countDistinct() aggregation function on each column to get the count of distinct values per column. A column with count=1 has only one value in all rows.
from pyspark.sql.functions import col, countDistinct

# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
# select the cols with count=1 in a list
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]
# drop the selected columns
df.drop(*cols_to_drop).show()
You can use the approx_count_distinct function (link) to count the number of distinct elements in a column. In case there is just one distinct value, remove the corresponding column.
Creating the DataFrame
from pyspark.sql.functions import approx_count_distinct
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 1| 2| 2| 0|
| 2| 2| 2| 0|
| 3| 2| 2| 0|
| 4| 2| 2| 0|
| 3| 1| 2| 0|
+------+------+------+------+
Counting the number of distinct elements and converting it into a dictionary.
count_distinct_df=df.select([approx_count_distinct(x).alias("{0}".format(x)) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 4| 2| 1| 1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
{'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}
# Storing those keys that have just 1 distinct value in a list.
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
['value3', 'value4']
Drop the columns having only one distinct value
df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 3| 1|
+------+------+

pyspark MlLib: exclude a column value in a row

I am trying to create an RDD of LabeledPoint from a data frame, so I can later use it for MlLib.
The code below works fine if my_target column is the first column in sparkDF. However, if my_target column is not the first column, how do I modify the code below to exclude my_target to create a correct LabeledPoint?
import pyspark.mllib.classification as clf
labeledData = sparkDF.rdd.map(lambda row: clf.LabeledPoint(row['my_target'],row[1:]))
logRegr = clf.LogisticRegressionWithSGD.train(labeledData)
That is, row[1:] excludes the value in the first column; if I want to exclude the value in column N of row, how do I do this? Thanks!
>>> a = [(1,21,31,41),(2,22,32,42),(3,23,33,43),(4,24,34,44),(5,25,35,45)]
>>> df = spark.createDataFrame(a,["foo","bar","baz","bat"])
>>> df.show()
+---+---+---+---+
|foo|bar|baz|bat|
+---+---+---+---+
| 1| 21| 31| 41|
| 2| 22| 32| 42|
| 3| 23| 33| 43|
| 4| 24| 34| 44|
| 5| 25| 35| 45|
+---+---+---+---+
>>> N = 2
# N is the column that you want to exclude (in this example the third, indexing starts at 0)
>>> from pyspark.mllib.regression import LabeledPoint
>>> labeledData = df.rdd.map(lambda row: LabeledPoint(row['foo'],row[:N]+row[N+1:]))
# it is just a concatenation with N that is excluded both in row[:N] and row[N+1:]
>>> labeledData.collect()
[LabeledPoint(1.0, [1.0,21.0,41.0]), LabeledPoint(2.0, [2.0,22.0,42.0]), LabeledPoint(3.0, [3.0,23.0,43.0]), LabeledPoint(4.0, [4.0,24.0,44.0]), LabeledPoint(5.0, [5.0,25.0,45.0])]

Spark Scala DF. add a new Column to DF based in processing of some rows of the same column

Dears,
I'm new to Spark Scala, and I have a DF of two columns, "UG" and "Counts", and I would like to obtain the third one, as shown in this list.
DF: UG, Counts, CUG (the columns)
of     12   4
of     23   4
the   134   3
love   68   2
pain    3   1
the    18   3
love  100   2
of     23   4
the    12   3
of     11   4
I need to add a new column called "CUG", the third one shown above, where CUG(i) is the number of times that the string UG(i) appears in the whole column.
I tried the following scheme:
Having the DF like the previous table in df, I wrote a SQL UDF function to count the number of times that the string appears in the column "UG", that is:
val NW1 = (w1: String) => {
  df.filter($"UG".like(w1.substring(1, (w1.length - 1)))).count()
}: Long
val sqlfunc = udf(NW1)
val df2 = df.withColumn("CUG", sqlfunc(col("UG")))
But when I tried it, it didn't work. I obtained a NullPointerException. The UDF scheme worked in isolation but not within the DF.
What can I do in order to obtain the desired results using the DF?
Thanks in advance.
jm3
So what you can do is first count the number of rows grouped by the UG column, which gives the third column you need, and then join that with the original data frame. You can rename the column if you want with the withColumnRenamed function.
scala> import org.apache.spark.sql.functions._
scala> myDf.show()
+----+------+
| UG|Counts|
+----+------+
| of| 12|
| of| 23|
| the| 134|
|love| 68|
|pain| 3|
| the| 18|
|love| 100|
| of| 23|
| the| 12|
| of| 11|
+----+------+
scala> myDf.join(myDf.groupBy("UG").count().withColumnRenamed("count", "CUG"), "UG").show()
+----+------+---+
| UG|Counts|CUG|
+----+------+---+
| of| 12| 4|
| of| 23| 4|
| the| 134| 3|
|love| 68| 2|
|pain| 3| 1|
| the| 18| 3|
|love| 100| 2|
| of| 23| 4|
| the| 12| 3|
| of| 11| 4|
+----+------+---+