How to convert a number into a percentage - pyspark

For a pyspark data frame, do you know how to convert numbers with decimals into a percentage format? Ideally I would also be able to choose the number of decimal places to keep.

You can multiply by 100:
from pyspark.sql.functions import col, concat, lit

df.withColumn("rate", (col("rate") * 100).cast("int")).show()
+---+---+----+
| id|row|rate|
+---+---+----+
| A| 1| 1|
+---+---+----+
df.withColumn("rate",concat((col("rate") * 100).cast("int"),lit('%'))).show()
+---+---+----+
| id|row|rate|
+---+---+----+
| A| 1| 1%|
+---+---+----+
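If you also want to keep a chosen number of decimal places instead of casting to an integer, a minimal sketch using round could look like this (the scale of 2 is an arbitrary choice, not part of the answer above):
from pyspark.sql.functions import col, concat, lit, round as spark_round

# keep two decimal places in the percentage; change the scale argument as needed
df.withColumn("rate", concat(spark_round(col("rate") * 100, 2), lit('%'))).show()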

Related

Scala Spark use Window function to find max value

I have a data set that looks like this:
+-------------------+-----+
|          timestamp| zone|
+-------------------+-----+
|2019-01-01 00:05:00|    A|
|2019-01-01 00:05:00|    A|
|2019-01-01 00:05:00|    B|
|2019-01-01 01:05:00|    C|
|2019-01-01 02:05:00|    B|
|2019-01-01 02:05:00|    B|
+-------------------+-----+
For each hour I need to count which zone had the most rows and end up with a table that looks like this:
+-----+-----+-----+
| hour| zone|  max|
+-----+-----+-----+
|    0|    A|    2|
|    1|    C|    1|
|    2|    B|    2|
+-----+-----+-----+
My instructions say that I need to use the Window function along with "group by" to find my max count.
I've tried a few things but I'm not sure if I'm close. Any help would be appreciated.
You can use two consecutive window functions to get your result:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df
  .withColumn("hour", hour($"timestamp"))
  .withColumn("cnt", count("*").over(Window.partitionBy($"hour", $"zone")))
  .withColumn("rnb", row_number().over(Window.partitionBy($"hour").orderBy($"cnt".desc)))
  .where($"rnb" === 1)
  .select($"hour", $"zone", $"cnt".as("max"))
You can use window functions together with group by on DataFrames.
In your case you could use the rank() over (partition by) window function.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// first, group by hour and zone
val df_group = data_tms
  .select(hour(col("timestamp")).as("hour"), col("zone"))
  .groupBy(col("hour"), col("zone"))
  .agg(count("zone").as("max"))

// second, rank by hour ordered by max in descending order
val df_rank = df_group
  .select(col("hour"),
          col("zone"),
          col("max"),
          rank().over(Window.partitionBy(col("hour")).orderBy(col("max").desc)).as("rank"))

// finally, filter on rank = 1
df_rank
  .select(col("hour"),
          col("zone"),
          col("max"))
  .where(col("rank") === 1)
  .orderBy(col("hour"))
  .show()
/*
+----+----+---+
|hour|zone|max|
+----+----+---+
|   0|   A|  2|
|   1|   C|  1|
|   2|   B|  2|
+----+----+---+
*/

How do I filter bad or corrupted rows from a Spark data frame after casting

df1
+-------+-------+-----+
| ID | Score| hits|
+-------+-------+-----+
| 01| 100| Null|
| 02| Null| 80|
| 03| spark| 1|
| 04| 300| 1|
+-------+-------+-----+
After casting Score to int and hits to float, I get the dataframe below:
df2
+-------+-------+-----+
| ID | Score| hits|
+-------+-------+-----+
| 01| 100| Null|
| 02| Null| 80.0|
| 03| Null| 1.0|
| 04| 300| 1.0|
+-------+-------+-----+
Now I want to extract only the bad records. Bad records are those where the cast produced a null that was not null before casting.
I want to do the operations only on the existing dataframe. Please help me out if there is any built-in way to get the bad records after casting.
Please also consider that this is a sample dataframe; the solution should work for any number of columns and any scenario.
I tried separating the null records from both dataframes and comparing them. I have also thought of adding another column with the number of nulls and then comparing the two dataframes: if the number of nulls is greater in df2 than in df1, those are the bad ones. But I think these solutions are pretty old school.
I would like to know a better way to resolve this.
You can use a custom function/UDF to convert the string to an integer and map non-integer values to a specific sentinel number, e.g. -999999999.
Later you can filter on -999999999 to identify the originally non-integer records.
def udfInt(value):
    if value is None:
        return None
    elif value.isdigit():
        return int(value)
    else:
        return -999999999

spark.udf.register('udfInt', udfInt)

df.selectExpr("*", "udfInt(Score) AS new_Score").show()
#+---+-----+----+----------+
#| ID|Score|hits| new_Score|
#+---+-----+----+----------+
#| 01| 100|null| 100|
#| 02| null| 80| null|
#| 03|spark| 1|-999999999|
#| 04| 300| 1| 300|
#+---+-----+----+----------+
Filter on -999999999 to identify the non-integer (bad) records:
df.selectExpr("*","udfInt(Score) AS new_Score").filter("new_score == -999999999").show()
#+---+-----+----+----------+
#| ID|Score|hits| new_Score|
#+---+-----+----+----------+
#| 03|spark| 1|-999999999|
#+---+-----+----+----------+
In the same way you can have a customized UDF for the float conversion.
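For illustration, a minimal sketch of such a float UDF, following the same sentinel approach (the udfFloat name, the explicit 'double' return type, and the reuse of the sentinel value are assumptions, not part of the original answer):
def udfFloat(value):
    # None stays None; values that parse as floats are kept; anything else gets the sentinel
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return -999999999.0

spark.udf.register('udfFloat', udfFloat, 'double')

df.selectExpr("*", "udfFloat(hits) AS new_hits").filter("new_hits == -999999999.0").show()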

How to get the percentage of totals for each count after a groupBy in pyspark?

Given the following DataFrame:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame([['a',1],['b', 2],['a', 3]], ['category', 'value'])
df.show()
+--------+-----+
|category|value|
+--------+-----+
| a| 1|
| b| 2|
| a| 3|
+--------+-----+
I want to count the number of items in each category and provide a percentage of total for each count, like so
+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
| b| 1| 0.333|
| a| 2| 0.667|
+--------+-----+----------+
You can obtain the count and percentage/ratio of totals with the following
import pyspark.sql.functions as f
from pyspark.sql.window import Window
df.groupBy('category').count()\
    .withColumn('percentage', f.round(f.col('count') / f.sum('count')\
    .over(Window.partitionBy()), 3)).show()
+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
| b| 1| 0.333|
| a| 2| 0.667|
+--------+-----+----------+
The previous statement can be divided into steps. df.groupBy('category').count() produces the count:
+--------+-----+
|category|count|
+--------+-----+
| b| 1|
| a| 2|
+--------+-----+
then by applying window functions we can obtain the total count on each row:
df.groupBy('category').count().withColumn('total', f.sum('count').over(Window.partitionBy())).show()
+--------+-----+-----+
|category|count|total|
+--------+-----+-----+
| b| 1| 3|
| a| 2| 3|
+--------+-----+-----+
where the total column is calculated by adding together all the counts in the partition (a single partition that includes all rows).
Once we have count and total for each row we can calculate the ratio:
df.groupBy('category')\
    .count()\
    .withColumn('total', f.sum('count').over(Window.partitionBy()))\
    .withColumn('percentage', f.col('count') / f.col('total'))\
    .show()
+--------+-----+-----+------------------+
|category|count|total| percentage|
+--------+-----+-----+------------------+
| b| 1| 3|0.3333333333333333|
| a| 2| 3|0.6666666666666666|
+--------+-----+-----+------------------+
You can groupby and aggregate with agg:
import pyspark.sql.functions as F
df.groupby('category').agg(F.count('value') / df.count()).show()
Output:
+--------+------------------+
|category|(count(value) / 3)|
+--------+------------------+
| b|0.3333333333333333|
| a|0.6666666666666666|
+--------+------------------+
To make it nicer you can use:
df.groupby('category').agg(
    (
        F.round(F.count('value') / df.count(), 2)
    ).alias('ratio')
).show()
Output:
+--------+-----+
|category|ratio|
+--------+-----+
| b| 0.33|
| a| 0.67|
+--------+-----+
You can also use SQL:
df.createOrReplaceTempView('df')
spark.sql(
"""
SELECT category, COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
FROM df
GROUP BY category
"""
).show()
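If you prefer to avoid the scalar subquery, the same ratio can also be expressed with a window over the grouped result; a sketch (the ROUND to 2 decimals is only there to mirror the earlier output):
spark.sql(
    """
    SELECT category,
           ROUND(COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS ratio
    FROM df
    GROUP BY category
    """
).show()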

PySpark: how to get the maximum absolute value of a column in a data frame?

Suppose I have
+----+---+
| v1| v2|
+----+---+
|-1.0| 0|
| 0.0| 1|
| 1.0| 2|
|-2.0| 3|
+----+---+
I want to get the max absolute value of column v1, which is 2.0. Thanks!
Use agg with max and abs from pyspark.sql.functions:
import pyspark.sql.functions as F
df.agg(F.max(F.abs(df.v1))).first()[0]
# 2.0
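The same aggregation can also be spelled as a SQL expression with selectExpr, if you prefer that style (just an equivalent one-liner, not a different method):
df.selectExpr("max(abs(v1)) AS max_abs_v1").first()[0]
# 2.0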

pyspark MLlib: exclude a column value in a row

I am trying to create an RDD of LabeledPoint from a data frame, so I can later use it with MLlib.
The code below works fine if my_target column is the first column in sparkDF. However, if my_target column is not the first column, how do I modify the code below to exclude my_target to create a correct LabeledPoint?
import pyspark.mllib.classification as clf
labeledData = sparkDF.rdd.map(lambda row: clf.LabeledPoint(row['my_target'],row[1:]))
logRegr = clf.LogisticRegressionWithSGD.train(labeledData)
That is, row[1:] excludes the value in the first column; if I want to exclude the value in column N of row, how do I do this? Thanks!
>>> from pyspark.mllib.regression import LabeledPoint
>>> a = [(1,21,31,41),(2,22,32,42),(3,23,33,43),(4,24,34,44),(5,25,35,45)]
>>> df = spark.createDataFrame(a,["foo","bar","baz","bat"])
>>> df.show()
+---+---+---+---+
|foo|bar|baz|bat|
+---+---+---+---+
| 1| 21| 31| 41|
| 2| 22| 32| 42|
| 3| 23| 33| 43|
| 4| 24| 34| 44|
| 5| 25| 35| 45|
+---+---+---+---+
>>> N = 2
# N is the column that you want to exclude (in this example the third, indexing starts at 0)
>>> labeledData = df.rdd.map(lambda row: LabeledPoint(row['foo'],row[:N]+row[N+1:]))
# it is just a concatenation of row[:N] and row[N+1:], which skips column N
>>> labeledData.collect()
[LabeledPoint(1.0, [1.0,21.0,41.0]), LabeledPoint(2.0, [2.0,22.0,42.0]), LabeledPoint(3.0, [3.0,23.0,43.0]), LabeledPoint(4.0, [4.0,24.0,44.0]), LabeledPoint(5.0, [5.0,25.0,45.0])]
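Going back to the question's own sparkDF and clf names, if you would rather exclude the target by name than by position, a small variation of the same idea (the feature_cols helper is only for illustration):
target = 'my_target'
feature_cols = [c for c in sparkDF.columns if c != target]

# build each LabeledPoint from the target column and all remaining columns
labeledData = sparkDF.rdd.map(
    lambda row: clf.LabeledPoint(row[target], [row[c] for c in feature_cols]))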