pyspark UDF with null values check and if statement

This works provided no null values exist in an array passed to a pyspark UDF.
concat_udf = udf(
    lambda con_str, arr: [x + con_str for x in arr], ArrayType(StringType())
)
I do not see how to adapt this with a null / None check using an if. How can the following, which does not work, be corrected:
concat_udf = udf(lambda con_str, arr: [ if x is None: 'XXX' else: x + con_str for x in arr ], ArrayType(StringType()))
I can find no such example; I had no success with if inside transform either.
+----------+--------------+--------------------+
|      name|knownLanguages|          properties|
+----------+--------------+--------------------+
|     James| [Java, Scala]|[eye -> brown, ha...|
|   Michael|[Spark, Java,]|[eye ->, hair -> ...|
|    Robert|    [CSharp, ]|[eye -> , hair ->...|
|Washington|          null|                null|
| Jefferson|        [1, 2]|                  []|
+----------+--------------+--------------------+
should become
+----------+-----------------------+--------------------+
|      name|         knownLanguages|          properties|
+----------+-----------------------+--------------------+
|     James|    [JavaXXX, ScalaXXX]|[eye -> brown, ha...|
|   Michael|[SparkXXX, JavaXXX,XXX]|[eye ->, hair -> ...|
|    Robert|       [CSharpXXX, XXX]|[eye -> , hair ->...|
|Washington|                    XXX|                null|
| Jefferson|           [1XXX, 2XXX]|                  []|
+----------+-----------------------+--------------------+

Using a conditional (ternary) expression, I would do something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

concat_udf = udf(
    lambda con_str, arr: [x + con_str if x is not None else "XXX" for x in arr]
    if arr is not None
    else ["XXX"],
    ArrayType(StringType()),
)
# OR
concat_udf = udf(
    lambda con_str, arr: [
        x + con_str if x is not None else "XXX" for x in arr or [None]
    ],
    ArrayType(StringType()),
)
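To apply it to the sample DataFrame (a rough sketch, assuming the frame is named df and the string to append is the literal "XXX"):
from pyspark.sql.functions import col, lit

# replace the array column with its concatenated / null-filled version
df.withColumn("knownLanguages", concat_udf(lit("XXX"), col("knownLanguages"))).show(truncate=False)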

Related

spark - stack multiple when conditions from an Array of column expressions

I have the below spark dataframe:
val df = Seq(("US",10),("IND",20),("NZ",30),("CAN",40)).toDF("a","b")
df.show(false)
+---+---+
|a  |b  |
+---+---+
|US |10 |
|IND|20 |
|NZ |30 |
|CAN|40 |
+---+---+
and I'm applying the when() condition as follows:
df.withColumn("x", when(col("a").isin(us_list:_*),"u").when(col("a").isin(i_list:_*),"i").when(col("a").isin(n_list:_*),"n").otherwise("-")).show(false)
+---+---+---+
|a  |b  |x  |
+---+---+---+
|US |10 |u  |
|IND|20 |i  |
|NZ |30 |n  |
|CAN|40 |-  |
+---+---+---+
Now to minimize the code, I'm trying the below:
val us_list = Array("U","US")
val i_list = Array("I","IND")
val n_list = Array("N","NZ")
val ar1 = Array((us_list,"u"),(i_list,"i"),(n_list,"n"))
val ap = ar1.map( x => when(col("a").isInCollection(x._1),x._2) )
This results in
ap: Array[org.apache.spark.sql.Column] = Array(CASE WHEN (a IN (U, US)) THEN u END, CASE WHEN (a IN (I, IND)) THEN i END, CASE WHEN (a IN (N, NZ)) THEN n END)
but when I try
val ap = ar1.map( x => when(col("a").isInCollection(x._1),x._2) ).reduce( (x,y) => x.y )
I get an error. How to fix this?
You can use foldLeft on the ar1 array:
val x = ar1.foldLeft(lit("-")) { case (acc, (list, value)) =>
when(col("a").isin(list: _*), value).otherwise(acc)
}
// x: org.apache.spark.sql.Column = CASE WHEN (a IN (N, NZ)) THEN n ELSE CASE WHEN (a IN (I, IND)) THEN i ELSE CASE WHEN (a IN (U, US)) THEN u ELSE - END END END
There is usually no need to combine when statements using reduce/fold; coalesce is enough, because the when statements are evaluated in sequence and a when without an otherwise yields null when its condition is false. It also saves you from specifying otherwise, because you can just append one more column to the list of arguments passed to coalesce.
val ar1 = Array((us_list,"u"),(i_list,"i"),(n_list,"n"))
val ap = ar1.map( x => when(col("a").isInCollection(x._1),x._2) )
val combined = coalesce(ap :+ lit("-"): _*)
df.withColumn("x", combined).show
+---+---+---+
|  a|  b|  x|
+---+---+---+
| US| 10|  u|
|IND| 20|  i|
| NZ| 30|  n|
|CAN| 40|  -|
+---+---+---+
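For reference, a rough PySpark sketch of the same coalesce idea (assuming an equivalent DataFrame df and the same lists):
from pyspark.sql import functions as F

ar1 = [(["U", "US"], "u"), (["I", "IND"], "i"), (["N", "NZ"], "n")]
# each when() without otherwise() yields null when its condition is false
ap = [F.when(F.col("a").isin(lst), value) for lst, value in ar1]
df.withColumn("x", F.coalesce(*ap, F.lit("-"))).show()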

How to collect a map after group by in Pyspark dataframe?

I have a pyspark dataframe like this:
| id | time | cat |
-------------------
| 1  | t1   | a   |
| 1  | t2   | b   |
| 2  | t3   | b   |
| 2  | t4   | c   |
| 2  | t5   | b   |
| 3  | t6   | a   |
| 3  | t7   | a   |
| 3  | t8   | a   |
Now, I want to group them by "id" and aggregate them into a Map like this:
| id | cat            |
-----------------------
| 1  | a -> 1, b -> 1 |
| 2  | b -> 2, c -> 1 |
| 3  | a -> 3         |
I guess we can use pyspark.sql.functions.collect_list to collect them as a list and then apply some UDF to turn the list into a dict. But is there any other (shorter or more efficient) way to do this?
You can use the function pyspark.sql.functions.map_from_entries.
If your dataframe is df, you can do this:
import pyspark.sql.functions as F
df1 = df.groupby("id", "cat").count()
df2 = df1.groupby("id")\
.agg(F.map_from_entries(F.collect_list(F.struct("cat","count"))).alias("cat"))
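A self-contained sketch of that approach on the sample data (map_from_entries is available from Spark 2.4 onwards; the expected contents in the comment are inferred from the question):
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(1, "t1", "a"), (1, "t2", "b"), (2, "t3", "b"), (2, "t4", "c"),
     (2, "t5", "b"), (3, "t6", "a"), (3, "t7", "a"), (3, "t8", "a")],
    ("id", "time", "cat"),
)
counted = df.groupby("id", "cat").count()
result = counted.groupby("id").agg(
    F.map_from_entries(F.collect_list(F.struct("cat", "count"))).alias("cat")
)
# id 1 -> {a: 1, b: 1}, id 2 -> {b: 2, c: 1}, id 3 -> {a: 3}
result.show(truncate=False)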
Similar to yasi's answer:
import pyspark.sql.functions as F
df1 = df.groupby("id", "cat").count()
df2 = df1.groupby("id")\
    .agg(F.map_from_arrays(F.collect_list("cat"), F.collect_list("count")).alias("cat"))
Here is how I did it.
Code
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
df = spark.createDataFrame([(1,'t1','a'),(1,'t2','b'),(2,'t3','b'),(2,'t4','c'),(2,'t5','b'),\
(3,'t6','a'),(3,'t7','a'),(3,'t8','a')],\
('id','time','cat'))
(df.groupBy(['id', 'cat'])
.agg(F.count(F.col('cat')).cast(StringType()).alias('counted'))
.select(['id', F.concat_ws('->', F.col('cat'), F.col('counted')).alias('arrowed')])
.groupBy('id')
.agg(F.collect_list('arrowed'))
.show()
)
Output
+---+---------------------+
| id|collect_list(arrowed)|
+---+---------------------+
|  1|     [a -> 1, b -> 1]|
|  3|             [a -> 3]|
|  2|     [b -> 2, c -> 1]|
+---+---------------------+
Edit
(df.groupBy(['id', 'cat'])
.count()
.select(['id',F.create_map('cat', 'count').alias('map')])
.groupBy('id')
.agg(F.collect_list('map').alias('cat'))
.show()
)
#+---+--------------------+
#| id| cat|
#+---+--------------------+
#| 1|[[a -> 1], [b -> 1]]|
#| 3| [[a -> 3]]|
#| 2|[[b -> 2], [c -> 1]]|
#+---+--------------------+

I want to calculate using three columns and produce a single column showing all three values

I am loading a file into a dataframe in Spark on Databricks
spark.sql("""select A,X,Y,Z from fruits""")
A    X      Y      Z
1E5  1.000  0.000  0.000
1U2  2.000  5.000  0.000
5G6  3.000  0.000  10.000
I need output as
A    D
1E5  X 1
1U2  X 2, Y 5
5G6  X 3, Z 10
I was able to find a solution.
Each column name can be joined with its value, and then all the values can be joined into one column, separated by commas:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

// data
val df = Seq(
  ("1E5", 1.000, 0.000, 0.000),
  ("1U2", 2.000, 5.000, 0.000),
  ("5G6", 3.000, 0.000, 10.000))
  .toDF("A", "X", "Y", "Z")

// action
val columnsToConcat = List("X", "Y", "Z")
val columnNameValueList = columnsToConcat.map(c =>
  when(col(c) =!= 0, concat(lit(c), lit(" "), col(c).cast(IntegerType)))
    .otherwise("")
)
val valuesJoinedByComaColumn = columnNameValueList.reduce((a, b) =>
  when(length(a) =!= 0 && length(b) =!= 0, concat(a, lit(", "), b))
    .otherwise(concat(a, b))
)
val result = df.withColumn("D", valuesJoinedByComaColumn)
  .drop(columnsToConcat: _*)
Output:
+---+---------+
|A  |D        |
+---+---------+
|1E5|X 1      |
|1U2|X 2, Y 5 |
|5G6|X 3, Z 10|
+---+---------+
This solution is similar to the one proposed by stack0114106, but looks more explicit.
Check this out:
scala> val df = Seq(("1E5",1.000,0.000,0.000),("1U2",2.000,5.000,0.000),("5G6",3.000,0.000,10.000)).toDF("A","X","Y","Z")
df: org.apache.spark.sql.DataFrame = [A: string, X: double ... 2 more fields]
scala> df.show()
+---+---+---+----+
|  A|  X|  Y|   Z|
+---+---+---+----+
|1E5|1.0|0.0| 0.0|
|1U2|2.0|5.0| 0.0|
|5G6|3.0|0.0|10.0|
+---+---+---+----+
scala> val newcol = df.columns.drop(1).map( x=> when(col(x)===0,lit("")).otherwise(concat(lit(x),lit(" "),col(x).cast("int").cast("string"))) ).reduce( (x,y) => concat(x,lit(", "),y) )
newcol: org.apache.spark.sql.Column = concat(concat(CASE WHEN (X = 0) THEN ELSE concat(X, , CAST(CAST(X AS INT) AS STRING)) END, , , CASE WHEN (Y = 0) THEN ELSE concat(Y, , CAST(CAST(Y AS INT) AS STRING)) END), , , CASE WHEN (Z = 0) THEN ELSE concat(Z, , CAST(CAST(Z AS INT) AS STRING)) END)
scala> df.withColumn("D",newcol).withColumn("D",regexp_replace(regexp_replace('D,", ,",","),", $", "")).drop("X","Y","Z").show(false)
+---+---------+
|A  |D        |
+---+---------+
|1E5|X 1      |
|1U2|X 2, Y 5 |
|5G6|X 3, Z 10|
+---+---------+
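For completeness, a rough PySpark sketch of the same idea (assuming the sample dataframe is df with columns A, X, Y, Z); concat_ws skips nulls, so the unmatched when branches drop out without any otherwise or regexp cleanup:
from pyspark.sql import functions as F

cols = ["X", "Y", "Z"]
# build a "X 1"-style fragment per column, or null when the value is 0
parts = [
    F.when(F.col(c) != 0, F.concat(F.lit(c + " "), F.col(c).cast("int").cast("string")))
    for c in cols
]
df.withColumn("D", F.concat_ws(", ", *parts)).drop(*cols).show()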

Spark 'join' DataFrame with List and return String

I have the following DataFrame:
DF1:
+------+-----+
|key1  |Value|
+------+-----+
|[k, l]|1    |
|[m, n]|2    |
|[o]   |3    |
+------+-----+
that needs to be 'joined' with another dataframe
DF2:
+----+
|key2|
+----+
|k   |
|l   |
|m   |
|n   |
|o   |
+----+
so that the output looks like this:
DF3:
+-------------------+-----+
|key3               |Value|
+-------------------+-----+
|k:1 l:1 m:0 n:0 o:0|1    |
|k:0 l:0 m:1 n:1 o:0|2    |
|k:0 l:0 m:0 n:0 o:1|3    |
+-------------------+-----+
In other words, the output dataframe should have a column that is a string of all rows in DF2, and each element should be followed by a 1 or 0 indicating whether that element was present in the list in the column key1 of DF1.
I am not sure how to go about it. Is there a simple UDF I can write to accomplish what I want?
For an operation like this to be possible, DF2 has to be small enough to be collected to the driver, so you can just use a udf:
import spark.implicits._
import org.apache.spark.sql.functions._
val df1 = Seq(
(Seq("k", "l"), 1), (Seq("m", "n"), 2), (Seq("o"), 3)
).toDF("key1", "value")
val df2 = Seq("k", "l", "m", "n", "o").toDF("key2")
val keys = df2.as[String].collect.map((_, 0)).toMap
val toKeyMap = udf((xs: Seq[String]) =>
xs.foldLeft(keys)((acc, x) => acc + (x -> 1)))
df1.select(toKeyMap($"key1").alias("key3"), $"value").show(false)
// +-------------------------------------------+-----+
// |key3                                       |value|
// +-------------------------------------------+-----+
// |Map(n -> 0, m -> 0, l -> 1, k -> 1, o -> 0)|1 |
// |Map(n -> 1, m -> 1, l -> 0, k -> 0, o -> 0)|2 |
// |Map(n -> 0, m -> 0, l -> 0, k -> 0, o -> 1)|3 |
// +-------------------------------------------+-----+
If you want just a string:
val toKeyMapString = udf((xs: Seq[String]) =>
xs.foldLeft(keys)((acc, x) => acc + (x -> 1))
.map { case (k, v) => s"$k: $v" }
.mkString(" ")
)
df1.select(toKeyMapString($"key1").alias("key3"), $"value").show(false)
// +------------------------+-----+
// |key3                    |value|
// +------------------------+-----+
// |n: 0 m: 0 l: 1 k: 1 o: 0|1 |
// |n: 1 m: 1 l: 0 k: 0 o: 0|2 |
// |n: 0 m: 0 l: 0 k: 0 o: 1|3 |
// +------------------------+-----+
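A rough PySpark equivalent of the same approach (a sketch, assuming the frames are df1 and df2 as above and that DF2 is small enough to collect):
from pyspark.sql import functions as F

# collect the (small) key2 values to the driver
keys = [r["key2"] for r in df2.select("key2").collect()]

to_key_string = F.udf(
    lambda xs: " ".join("{}:{}".format(k, 1 if k in set(xs or []) else 0) for k in keys)
)

df1.select(to_key_string("key1").alias("key3"), "value").show(truncate=False)
Because the collected keys keep DF2's row order, this should yield strings like k:1 l:1 m:0 n:0 o:0, matching the requested format.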

Spark getting the average length of all the words by alphabet

I am trying to find the average length of the words that begin with each letter of the alphabet, except z.
So far, I have
// words only
val words1 = words.map(_.toLowerCase).filter(x => x.length>0).filter(x => x(0).isLetter)
val allWords = words1.filter(x=> !x.startsWith("z"))// avoiding the z
var mapAllWords= allWords.map(x=> ((x), (x.length)))//mapped it by length.
Now, what I am trying to do is to get something like ((a, (2, 3, 4, ...)), (b, (2, 4, 5, ..., 9)), ...)
and then compute the mean length for each letter.
I am new to Scala programming.
Let's say this is your data:
val words = sc.textFile("README.md").flatMap(_.split("\\s+"))
Convert to dataset:
val ds = spark.createDataset(words)
Filter and aggregate
ds
// Get first letter and length
.select(
lower(substring($"value", 0, 1)) as "letter", length($"value") as "length")
// Remove non-letters and z
.where($"letter".rlike("^[a-y]"))
// Compute average length
.groupBy("letter")
.avg()
.show
// +------+------------------+
// |letter|       avg(length)|
// +------+------------------+
// |     l| 7.333333333333333|
// |     g|13.846153846153847|
// |     m|               9.0|
// |     f|3.8181818181818183|
// |     n|               3.0|
// |     v|              25.4|
// |     e|               7.6|
// |     o|3.3461538461538463|
// |     h|            6.1875|
// |     p|               9.0|
// |     d|              9.55|
// |     y|               3.3|
// |     w|               4.0|
// |     c|              6.56|
// |     u| 4.416666666666667|
// |     i| 4.774193548387097|
// |     j|               5.0|
// |     b| 5.352941176470588|
// |     a|3.5526315789473686|
// |     r|               4.6|
// +------+------------------+
// only showing top 20 rows
In plain Scala (no Spark), here are some hints for you:
val l=List("mario","monica", "renzo","sabrina","sonia","nikola", "enrica","paola")
val couples = l.map(w => (w.charAt(0), w.length))
couples.groupBy(_._1)
.map(x=> ( x._1, (x._2, x._2.size)))
you get:
l: List[String] = List(mario, monica, renzo, sabrina, sonia, nikola, enrica, paola)
couples: List[(Char, Int)] = List((m,5), (m,6), (r,5), (s,7), (s,5), (n,6), (e,6), (p,5))
res0: scala.collection.immutable.Map[Char,(List[(Char, Int)], Int)] = Map(e -> (List((e,6)),1), s -> (List((s,7), (s,5)),2), n -> (List((n,6)),1), m -> (List((m,5), (m,6)),2), p -> (List((p,5)),1), r -> (List((r,5)),1))
Here's a Scala example of getting the average size of all words starting with the same letter that I think you could adapt easily enough to your use case.
val sentences = Array("Lester is nice", "Lester is cool", "cool Lester is an awesome dude", "awesome awesome awesome Les")
val sentRDD = sc.parallelize(sentences)
val gbRDD = sentRDD.flatMap(line => line.split(' ')).map(word => (word(0), word.length)).groupByKey(2)
gbRDD.map(wordKVP => (wordKVP._1, wordKVP._2.sum/wordKVP._2.size.toDouble)).collect()
It returns the following...
Array((d,4.0), (L,5.25), (n,4.0), (a,6.0), (i,2.0), (c,4.0))
Here it is with PySpark if you prefer...
sentences = ['Lester is nice', 'Lester is cool', 'cool Lester is an awesome dude', 'awesome awesome awesome Les']
sentRDD = sc.parallelize(sentences)
gbRDD = sentRDD.flatMap(lambda line: line.split(' ')).map(lambda word: (word[0], len(word))).groupByKey(2)
gbRDD.map(lambda wordKVP: (wordKVP[0], sum(wordKVP[1])/len(wordKVP[1]))).collect()
Same results...
[('L', 5.25), ('i', 2.0), ('c', 4.0), ('d', 4.0), ('n', 4.0), ('a', 6.0)]