Spark getting the average length of all the words by alphabet - scala

I am trying to find the average length of the words that begin with each letter of the alphabet except z.
So far, I have
// words only
val words1 = words.map(_.toLowerCase).filter(x => x.length > 0).filter(x => x(0).isLetter)
val allWords = words1.filter(x => !x.startsWith("z")) // drop words starting with z
var mapAllWords = allWords.map(x => (x, x.length))    // pair each word with its length
Now, what I am trying to do is build something like ((a, (2, 3, 4, ...)), (b, (2, 4, 5, ..., 9)), ...)
and get the mean length for each letter.
I am new to Scala programming.
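In other words, I think the intermediate shape I need looks roughly like this (a rough sketch, assuming allWords is an RDD[String]; computing the mean is the part I am missing):
// Key each word by its first letter, keeping only its length, then gather the
// lengths per letter, e.g. (a, [2, 3, 4, ...]), (b, [2, 4, 5, ...]), ...
val lengthsByLetter = allWords.map(x => (x(0), x.length)).groupByKey()
// From here I want the mean of the lengths for each letter.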

Let's say this is your data:
val words = sc.textFile("README.md").flatMap(_.split("\\s+"))
Convert to dataset:
val ds = spark.createDataset(words)
Filter and aggregate:
ds
  // Get the first letter and the word length
  .select(
    lower(substring($"value", 0, 1)) as "letter",
    length($"value") as "length")
  // Remove non-letters and z
  .where($"letter".rlike("^[a-y]"))
  // Compute the average length per letter
  .groupBy("letter")
  .avg()
  .show
// +------+------------------+
// |letter| avg(length)|
// +------+------------------+
// | l| 7.333333333333333|
// | g|13.846153846153847|
// | m| 9.0|
// | f|3.8181818181818183|
// | n| 3.0|
// | v| 25.4|
// | e| 7.6|
// | o|3.3461538461538463|
// | h| 6.1875|
// | p| 9.0|
// | d| 9.55|
// | y| 3.3|
// | w| 4.0|
// | c| 6.56|
// | u| 4.416666666666667|
// | i| 4.774193548387097|
// | j| 5.0|
// | b| 5.352941176470588|
// | a|3.5526315789473686|
// | r| 4.6|
// +------+------------------+
// only showing top 20 rows

In plain Scala (no Spark), here are some hints for you:
val l = List("mario", "monica", "renzo", "sabrina", "sonia", "nikola", "enrica", "paola")
val couples = l.map(w => (w.charAt(0), w.length))
couples.groupBy(_._1)
  .map(x => (x._1, (x._2, x._2.size)))
you get:
l: List[String] = List(mario, monica, renzo, sabrina, sonia, nikola, enrica, paola)
couples: List[(Char, Int)] = List((m,5), (m,6), (r,5), (s,7), (s,5), (n,6), (e,6), (p,5))
res0: scala.collection.immutable.Map[Char,(List[(Char, Int)], Int)] = Map(e -> (List((e,6)),1), s -> (List((s,7), (s,5)),2), n -> (List((n,6)),1), m -> (List((m,5), (m,6)),2), p -> (List((p,5)),1), r -> (List((r,5)),1))
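From there, the averages per letter are one more step (a sketch building on the same couples list):
// Mean word length per initial letter
val avgByLetter = couples
  .groupBy(_._1)
  .map { case (letter, xs) => letter -> xs.map(_._2).sum.toDouble / xs.size }
// e.g. Map(m -> 5.5, s -> 6.0, r -> 5.0, n -> 6.0, e -> 6.0, p -> 5.0) (ordering may vary)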

Here's a Spark (Scala) example of getting the average length of all words starting with the same letter, which I think you could adapt easily enough to your use case.
val sentences = Array("Lester is nice", "Lester is cool", "cool Lester is an awesome dude", "awesome awesome awesome Les")
val sentRDD = sc.parallelize(sentences)
val gbRDD = sentRDD.flatMap(line => line.split(' ')).map(word => (word(0), word.length)).groupByKey(2)
gbRDD.map(wordKVP => (wordKVP._1, wordKVP._2.sum/wordKVP._2.size.toDouble)).collect()
It returns the following...
Array((d,4.0), (L,5.25), (n,4.0), (a,6.0), (i,2.0), (c,4.0))
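If the groups per letter can get large, a variant with aggregateByKey keeps only a running (sum, count) per key instead of materializing every length; here is a sketch on the same sentRDD:
val avgRDD = sentRDD
  .flatMap(line => line.split(' '))
  .map(word => (word(0), word.length))
  .aggregateByKey((0, 0))(
    (acc, len) => (acc._1 + len, acc._2 + 1), // fold one word's length into (sum, count)
    (a, b) => (a._1 + b._1, a._2 + b._2)      // merge partial (sum, count) pairs
  )
  .mapValues { case (sum, count) => sum.toDouble / count }
avgRDD.collect() // same averages as the groupByKey version above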
Here it is with PySpark if you prefer...
sentences = ['Lester is nice', 'Lester is cool', 'cool Lester is an awesome dude', 'awesome awesome awesome Les']
sentRDD = sc.parallelize(sentences)
gbRDD = sentRDD.flatMap(lambda line: line.split(' ')).map(lambda word: (word[0], len(word))).groupByKey(2)
gbRDD.map(lambda wordKVP: (wordKVP[0], sum(wordKVP[1])/len(wordKVP[1]))).collect()
Same results...
[('L', 5.25), ('i', 2.0), ('c', 4.0), ('d', 4.0), ('n', 4.0), ('a', 6.0)]

Related

spark - stack multiple when conditions from an Array of column expressions

I have the below Spark DataFrame:
val df = Seq(("US",10),("IND",20),("NZ",30),("CAN",40)).toDF("a","b")
df.show(false)
+---+---+
|a |b |
+---+---+
|US |10 |
|IND|20 |
|NZ |30 |
|CAN|40 |
+---+---+
and I'm applying the when() condition as follows:
df.withColumn("x", when(col("a").isin(us_list:_*),"u").when(col("a").isin(i_list:_*),"i").when(col("a").isin(n_list:_*),"n").otherwise("-")).show(false)
+---+---+---+
|a |b |x |
+---+---+---+
|US |10 |u |
|IND|20 |i |
|NZ |30 |n |
|CAN|40 |- |
+---+---+---+
Now to minimize the code, I'm trying the below:
val us_list = Array("U","US")
val i_list = Array("I","IND")
val n_list = Array("N","NZ")
val ar1 = Array((us_list,"u"),(i_list,"i"),(n_list,"n"))
val ap = ar1.map( x => when(col("a").isInCollection(x._1),x._2) )
This results in
ap: Array[org.apache.spark.sql.Column] = Array(CASE WHEN (a IN (U, US)) THEN u END, CASE WHEN (a IN (I, IND)) THEN i END, CASE WHEN (a IN (N, NZ)) THEN n END)
but when I try
val ap = ar1.map( x => when(col("a").isInCollection(x._1),x._2) ).reduce( (x,y) => x.y )
I get an error. How can I fix this?
You can use foldLeft on the ar1 array:
val x = ar1.foldLeft(lit("-")) { case (acc, (list, value)) =>
when(col("a").isin(list: _*), value).otherwise(acc)
}
// x: org.apache.spark.sql.Column = CASE WHEN (a IN (N, NZ)) THEN n ELSE CASE WHEN (a IN (I, IND)) THEN i ELSE CASE WHEN (a IN (U, US)) THEN u ELSE - END END END
There is usually no need to combine when statements using reduce/fold. coalesce is enough, because the when statements are evaluated in order and a when whose condition is not met yields null, which coalesce skips. It also saves you from specifying otherwise, because you can just append one more column (the default) to the list of arguments to coalesce.
val ar1 = Array((us_list,"u"),(i_list,"i"),(n_list,"n"))
val ap = ar1.map( x => when(col("a").isInCollection(x._1),x._2) )
val combined = coalesce(ap :+ lit("-"): _*)
df.withColumn("x", combined).show
+---+---+---+
| a| b| x|
+---+---+---+
| US| 10| u|
|IND| 20| i|
| NZ| 30| n|
|CAN| 40| -|
+---+---+---+

How to create a map column to count occurrences without udaf

I would like to create a Map column which counts the number of occurrences.
For instance:
+---+----+
| b| a|
+---+----+
| 1| b|
| 2|null|
| 1| a|
| 1| a|
+---+----+
would result in
+---+--------------------+
| b| res|
+---+--------------------+
| 1|[a -> 2.0, b -> 1.0]|
| 2| []|
+---+--------------------+
For the moment, in Spark 2.4.6, I was able to do it using a UDAF.
While bumping to Spark 3, I was wondering if I could get rid of this UDAF (I tried using the new aggregate method without success).
Is there an efficient way to do it?
(For the efficiency part, I am able to test easily)
Here is a Spark 3 solution:
import org.apache.spark.sql.functions._

df.groupBy($"b", $"a").count()
  .groupBy($"b")
  .agg(
    map_from_entries(
      collect_list(
        when($"a".isNotNull, struct($"a", $"count"))
      )
    ).as("res")
  )
  .show()
gives:
+---+----------------+
| b| res|
+---+----------------+
| 1|[b -> 1, a -> 2]|
| 2| []|
+---+----------------+
Here is a solution using an Aggregator:
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Encoder

val countOcc = new Aggregator[String, Map[String, Int], Map[String, Int]] with Serializable {
  def zero: Map[String, Int] = Map.empty
  // getOrElse is used instead of a default value, since defaults do not
  // survive a round trip through the buffer encoder
  def reduce(b: Map[String, Int], a: String) =
    if (a != null) b + (a -> (b.getOrElse(a, 0) + 1)) else b
  def merge(b1: Map[String, Int], b2: Map[String, Int]) = {
    val keys = b1.keys.toSet.union(b2.keys.toSet)
    keys.map { k => k -> (b1.getOrElse(k, 0) + b2.getOrElse(k, 0)) }.toMap
  }
  def finish(b: Map[String, Int]) = b
  def bufferEncoder: Encoder[Map[String, Int]] = implicitly(ExpressionEncoder[Map[String, Int]])
  def outputEncoder: Encoder[Map[String, Int]] = implicitly(ExpressionEncoder[Map[String, Int]])
}
val countOccUDAF = udaf(countOcc)

df
  .groupBy($"b")
  .agg(countOccUDAF($"a").as("res"))
  .show()
gives:
+---+----------------+
| b| res|
+---+----------------+
| 1|[b -> 1, a -> 2]|
| 2| []|
+---+----------------+
You could always use collect_list with a UDF, but only if your groupings are not too large:
val udf_histo = udf((x:Seq[String]) => x.groupBy(identity).mapValues(_.size))
df.groupBy($"b")
.agg(
collect_list($"a").as("as")
)
.select($"b",udf_histo($"as").as("res"))
.show()
gives:
+---+----------------+
| b| res|
+---+----------------+
| 1|[b -> 1, a -> 2]|
| 2| []|
+---+----------------+
This should be faster than a UDAF: see Spark custom aggregation : collect_list+UDF vs UDAF.
We can achieve this in Spark 2.4:
// Get the counts
val groupedCountDf = originalDf.groupBy("b", "a").count

// Create a map for every count (an empty map for the null key),
// then aggregate them into an array
val dfWithArrayOfMaps = groupedCountDf
  .withColumn("newMap", when($"a".isNotNull, map($"a", $"count")).otherwise(map()))
  .groupBy("b").agg(collect_list($"newMap") as "multimap")

// Expression to convert array<map> -> map
val mapConcatExpr = expr("aggregate(multimap, map(), (k, v) -> map_concat(k, v))")

val finalDf = dfWithArrayOfMaps.select($"b", mapConcatExpr.as("merged_data"))
Here is a solution with a single groupBy and a slightly more complex SQL expression. This solution works for Spark 2.4+.
df.groupBy("b")
.agg(expr("sort_array(collect_set(a)) as set"),
expr("sort_array(collect_list(a)) as list"))
.withColumn("res",
expr("map_from_arrays(set,transform(set, x -> size(filter(list, y -> y=x))))"))
.show()
Output:
+---+------+---------+----------------+
| b| set| list| res|
+---+------+---------+----------------+
| 1|[a, b]|[a, a, b]|[a -> 2, b -> 1]|
| 2| []| []| []|
+---+------+---------+----------------+
The idea is to collect the data from column a twice: once into a set and once into a list. Then, with the help of transform, the number of occurrences in the list is counted for each element of the set. Finally, the set and the occurrence counts are combined with map_from_arrays.
However, I cannot say whether this approach is really faster than a UDAF.

How to map values in column(multiple columns also) of one dataset to other dataset

I am working with GraphFrames, where I need the edges/links for d3.js to use the indexed values of the vertices/nodes as source and destination.
Now I have VertexDF as
+--------------------+-----------+
| id| rowID|
+--------------------+-----------+
| Raashul Tandon| 3|
| Helen Jones| 5|
----------------------------------
EdgesDF
+-------------------+--------------------+
| src| dst|
+-------------------+--------------------+
| Raashul Tandon| Helen Jones |
------------------------------------------
Now I need to transform this EdgesDF as below
+-------------------+--------------------+
| src| dst|
+-------------------+--------------------+
| 3 | 5 |
------------------------------------------
All the column values should be replaced with the index of the corresponding name taken from VertexDF. I am hoping for a solution using higher-order functions.
My approach is to convert VertexDF to a map, then iterate over EdgesDF and replace every occurrence.
What I have tried:
I made a map of names to ids:
val Actmap = VertxDF.collect().map { f =>
  val name = f.getString(0)
  val id = f.getLong(1)
  (name, id)
}.toMap
Then I used that map with EdgesDF:
EdgesDF.collect().map { f =>
  val src = f.getString(0)
  val dst = f.getString(1) // dst comes from the second column
  val src_id = Actmap.get(src)
  val dst_id = Actmap.get(dst)
  (src_id, dst_id)
}
Your approach of collect-ing the vertex and edge dataframes would work only if they're small. I would suggest left-joining the edge and vertex dataframes to get what you need:
import org.apache.spark.sql.functions._
import spark.implicits._
val VertxDF = Seq(
  ("Raashul Tandon", 3),
  ("Helen Jones", 5),
  ("John Doe", 6),
  ("Rachel Smith", 7)
).toDF("id", "rowID")

val EdgesDF = Seq(
  ("Raashul Tandon", "Helen Jones"),
  ("Helen Jones", "John Doe"),
  ("Unknown", "Raashul Tandon"),
  ("John Doe", "Rachel Smith")
).toDF("src", "dst")

EdgesDF.as("e").
  join(VertxDF.as("v1"), $"e.src" === $"v1.id", "left_outer").
  join(VertxDF.as("v2"), $"e.dst" === $"v2.id", "left_outer").
  select($"v1.rowID".as("src"), $"v2.rowID".as("dst")).
  show
// +----+---+
// | src|dst|
// +----+---+
// | 3| 5|
// | 5| 6|
// |null| 3|
// | 6| 7|
// +----+---+

How to assign keys to items in a column in Scala?

I have the following RDD:
Col1 Col2
"abc" "123a"
"def" "783b"
"abc "674b"
"xyz" "123a"
"abc" "783b"
I need the following output where each item in each column is converted into a unique key.
For example: abc -> 1, def -> 2, xyz -> 3
Col1 Col2
1 1
2 2
1 3
3 1
1 2
Any help would be appreciated. Thanks!
In this case, you can rely on the hashCode of the string. The hash code will be the same whenever the input and the data type are the same. Try this:
scala> "abc".hashCode
res23: Int = 96354
scala> "xyz".hashCode
res24: Int = 119193
scala> val df = Seq(("abc","123a"),
| ("def","783b"),
| ("abc","674b"),
| ("xyz","123a"),
| ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> def hashc(x:String):Int =
| return x.hashCode
hashc: (x: String)Int
scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))
scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
| 96354| 1509487|
| 99333| 1694000|
| 96354| 1663279|
| 119193| 1509487|
| 96354| 1694000|
+---------+---------+
scala>
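If any deterministic numeric key will do (rather than String.hashCode specifically), Spark's built-in hash function gives the same effect without a UDF; note the values are Murmur3 hashes, so they differ from the ones above. A sketch:
import org.apache.spark.sql.functions.hash

// Equal strings always get equal hash values, so the keys stay consistent
df.select(hash($"col1").as("col1"), hash($"col2").as("col2")).show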
If you must map your columns to natural numbers starting from 1, one approach would be to apply zipWithIndex to the individual columns, add 1 to the index (as zipWithIndex always starts from 0), convert the individual RDDs to DataFrames, and finally join the converted DataFrames back to the original columns:
val rdd = sc.parallelize(Seq(
  ("abc", "123a"),
  ("def", "783b"),
  ("abc", "674b"),
  ("xyz", "123a"),
  ("abc", "783b")
))

val df1 = rdd.map(_._1).distinct.zipWithIndex.
  map(r => (r._1, r._2 + 1)).
  toDF("col1", "c1key")

val df2 = rdd.map(_._2).distinct.zipWithIndex.
  map(r => (r._1, r._2 + 1)).
  toDF("col2", "c2key")

val dfJoined = rdd.toDF("col1", "col2").
  join(df1, Seq("col1")).
  join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc| 2| 1|
// |783b| def| 3| 1|
// |123a| xyz| 1| 2|
// |123a| abc| 2| 2|
// |674b| abc| 2| 3|
// +----+----+-----+-----+
dfJoined.
  select($"c1key".as("col1"), $"c2key".as("col2")).
  show
// +----+----+
// |col1|col2|
// +----+----+
// | 2| 1|
// | 3| 1|
// | 1| 2|
// | 2| 2|
// | 2| 3|
// +----+----+
Note that if you're okay with having the keys start from 0, the step of map(r => (r._1, r._2 + 1)) can be skipped in generating df1 and df2.

Spark code to group integers in buckets

I'd like a function that takes a column and a list of bucket ranges as arguments and returns the appropriate bucket. I want to solve this with the Spark API and don't want to use a UDF.
Let's say we start with this DataFrame (df):
+--------+
|some_num|
+--------+
| 3|
| 24|
| 45|
| null|
+--------+
Here's the desired behavior of the function:
df.withColumn(
  "bucket",
  bucketFinder(
    col("some_num"),
    Array(
      (0, 10),
      (10, 20),
      (20, 30),
      (30, 70)
    )
  )
).show()
+--------+------+
|some_num|bucket|
+--------+------+
| 3| 0-10|
| 24| 20-30|
| 45| 30-70|
| null| null|
+--------+------+
Here is the code I tried that does not work:
def bucketFinder(col: Column, buckets: Array[(Any, Any)]): Column = {
  buckets.foreach { res: (Any, Any) =>
    when(col.between(res._1, res._2), lit(s"$res._1 - $res._2"))
  }
}
It's pretty easy to write this code with a UDF, but hard when constrained to only the Spark API.
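For reference, the kind of UDF I would rather avoid looks roughly like this (a sketch; bucket bounds are assumed to be Ints, and java.lang.Integer is used so a null input passes through as null):
import org.apache.spark.sql.functions.{col, udf}

def bucketFinderUdf(buckets: Array[(Int, Int)]) = udf { num: java.lang.Integer =>
  if (num == null) null
  else buckets
    .find { case (lo, hi) => num >= lo && num <= hi } // first bucket containing the value
    .map { case (lo, hi) => s"$lo-$hi" }
    .orNull
}
// usage: df.withColumn("bucket", bucketFinderUdf(Array((0, 10), (10, 20), (20, 30), (30, 70)))(col("some_num")))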
You can divide the column by the bucket size and take the floor: scaling that back up gives the lower bound of the bucket, and adding the bucket size gives the upper bound:
val bucket_size = 10
val floor_col = floor(df("some_num") / bucket_size) * bucket_size
df.withColumn("bucket", concat_ws("-", floor_col, floor_col + bucket_size)).show
+--------+------+
|some_num|bucket|
+--------+------+
| 3| 0-10|
| 24| 20-30|
For a bucket size of 5:
val bucket_size1 = 5
val floor_col = floor(df("some_num") / bucket_size1) * bucket_size1
df.withColumn("bucket", concat_ws("-", floor_col, floor_col + bucket_size1)).show
+--------+------+
|some_num|bucket|
+--------+------+
| 3| 0-5|
| 24| 20-25|
Here is a pure Spark solution:
def bucketFinder(col: Column, buckets: Array[(Any, Any)]): Column = {
  val b = buckets.map { res: (Any, Any) =>
    when(col.between(res._1, res._2), lit(s"${res._1}-${res._2}"))
  }
  coalesce(b: _*)
}
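For example, with the DataFrame and bucket bounds from the question, this should reproduce the desired output above; the null row stays null because between on a null yields null, so every when branch is null and so is their coalesce:
df.withColumn(
  "bucket",
  bucketFinder(col("some_num"), Array((0, 10), (10, 20), (20, 30), (30, 70)))
).show()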
I'll leave this question open for a bit to see if someone else has a more elegant solution.