Distinct word count per line - Scala

This is a little different from the common word count program. I am trying to get the distinct word count per line.
Input:
Line number one has six words
Line number two has two words
Expected output:
line1 => (Line,1),(number,1),(one,1),(has,1),(six,1),(words,1)
line2 => (Line,1),(number,1),(two,2),(has,1),(words,1)
Can anyone please guide me.

By using the DataFrame built-in functions explode, split, collect_set, and groupBy.
import org.apache.spark.sql.functions._
import spark.implicits._

//input data
val df = Seq("Line number one has six words", "Line number two has has two words").toDF("input")

df.withColumn("words", explode(split($"input", "\\s+"))) //split on whitespace and explode into one row per word
  .groupBy("input", "words")                             //group by both columns to get a count per (line, word)
  .count()
  .withColumn("line_word_count", struct($"words", $"count")) //create a (word, count) struct
  .groupBy("input")                                          //group back to one row per input line
  .agg(collect_set("line_word_count").alias("line_word_count"))
  .show(false)
Result:
+---------------------------------+------------------------------------------------------------------+
|input |line_word_count |
+---------------------------------+------------------------------------------------------------------+
|Line number one has six words |[[one, 1], [has, 1], [six, 1], [number, 1], [words, 1], [Line, 1]]|
|Line number two has has two words|[[has, 2], [two, 2], [words, 1], [number, 1], [Line, 1]] |
+---------------------------------+------------------------------------------------------------------+
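If you would rather end up with a map from word to count than an array of structs, map_from_entries can fold the collected structs into a map; a minimal sketch, assuming Spark 2.4+:

df.withColumn("words", explode(split($"input", "\\s+")))
  .groupBy("input", "words")
  .count()
  .groupBy("input")
  .agg(map_from_entries(collect_list(struct($"words", $"count"))).alias("word_counts"))
  .show(false)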
If you are expecting line numbers, then use the concat and monotonically_increasing_id functions:
df.withColumn("line",concat(lit("line"),monotonically_increasing_id()+1))
.withColumn("words",explode(split($"input","\\s+")))
.groupBy("input","words","line")
.count()
.withColumn("line_word_count",struct($"words",$"count"))
.groupBy("line")
.agg(collect_set("line_word_count").alias("line_word_count"))
.show(false)
Result:
+-----+------------------------------------------------------------------+
|line |line_word_count |
+-----+------------------------------------------------------------------+
|line1|[[one, 1], [has, 1], [six, 1], [words, 1], [number, 1], [Line, 1]]|
|line2|[[has, 2], [two, 2], [number, 1], [words, 1], [Line, 1]] |
+-----+------------------------------------------------------------------+
Note: on larger (multi-partition) datasets monotonically_increasing_id does not generate consecutive values, so to keep the line numbers consecutive we need to do .repartition(1) (or assign them with a window function, as in the sketch below).
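A minimal sketch of the window-function alternative, assuming ordering by the generated id is acceptable (note that a window without partitioning pulls all rows into a single partition):

import org.apache.spark.sql.expressions.Window

val w = Window.orderBy(monotonically_increasing_id())
val withLine = df.withColumn("line", concat(lit("line"), row_number().over(w)))
// from here the same explode/groupBy/collect_set chain as above applies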

Here is another way using RDD API:
val rdd = df.withColumn("output", split($"input", " ")).rdd.map(l => (
  l.getAs[String](0),
  // count each word; .map(identity) turns the lazy view returned by mapValues into a concrete Map
  l.getAs[Seq[String]](1).groupBy(identity).mapValues(_.size).map(identity)
))
val dfCount = spark.createDataFrame(rdd).toDF("input", "output")
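To inspect the result:

dfCount.show(false)
// the map column displays much like the UDF output shown below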
Not a big fan of using UDFs, but it can also be done like this:
import org.apache.spark.sql.functions.udf
val mapCount: Seq[String] => Map[String, Integer] = _.groupBy(identity).mapValues(_.size)
val countWordsUdf = udf(mapCount)
df.withColumn("output", countWordsUdf(split($"input", " "))).show(false)
Gives:
+---------------------------------+------------------------------------------------------------------+
|input |output |
+---------------------------------+------------------------------------------------------------------+
|Line number one has six words |[number -> 1, Line -> 1, has -> 1, six -> 1, words -> 1, one -> 1]|
|Line number two has has two words|[number -> 1, two -> 2, Line -> 1, has -> 2, words -> 1] |
+---------------------------------+------------------------------------------------------------------+

Related

Concatenate elements of two arrays with all possible combinations

I have a Hive table with:

| row                      | column                                          |
| ------------------------ | ----------------------------------------------- |
| null                     | ["black", "blue", "orange"]                     |
| ["mom", "dad", "sister"] | ["amazon", "fiipkart", "meesho", "jiomart", ""] |
Using Spark SQL, I would like to create a new column with an array of all possible combinations:

| row        | column          | output                                 |
| ---------- | --------------- | -------------------------------------- |
| null       | ["b", "s", "m"] | ["b", "s", "m"]                        |
| ["1", "2"] | ["a", "b", ""]  | ["1_a", "1_b", "1", "2_a", "2_b", "2"] |
Two ways to implement this:
The first way uses array transformations:
df
  // first turn the null value into an array holding an empty string
  .withColumn("row", when(col("row").isNull, array(lit(""))).otherwise(col("row")))
  // create an array of 1s whose length is size(row) * size(column)
  .withColumn("repeated", array_repeat(lit(1), size(col("row")) * size(col("column"))))
  // derive an index pair into row and column for each position
  // (note: (i % size(row), i % size(column)) covers every combination only when the two sizes are coprime, as they are here)
  .withColumn("indexes", expr("transform(repeated, (x, i) -> array(i % size(row), i % size(column)))"))
  // concatenate the indexed elements with an underscore
  .withColumn("concat", expr("transform(indexes, (x, i) -> concat_ws('_', row[x[0]], column[x[1]]))"))
  // remove a leading or trailing underscore (if found)
  .withColumn("output", expr("transform(concat, x -> regexp_replace(x, '(_$)|(^_)', ''))"))
Output:
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
|row |column |repeated |indexes |concat |output |
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
|[] |[b, s, m]|[1, 1, 1] |[[0, 0], [0, 1], [0, 2]] |[_b, _s, _m] |[b, s, m] |
|[1, 2]|[a, b, ] |[1, 1, 1, 1, 1, 1]|[[0, 0], [1, 1], [0, 2], [1, 0], [0, 1], [1, 2]]|[1_a, 2_b, 1_, 2_a, 1_b, 2_]|[1_a, 2_b, 1, 2_a, 1_b, 2]|
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
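Since Spark 2.4 the same cross product can also be written with nested transform calls and flatten, which avoids the index bookkeeping; a minimal sketch under that assumption:

df
  .withColumn("row", when(col("row").isNull, array(lit(""))).otherwise(col("row")))
  .withColumn("output", expr(
    "transform(flatten(transform(row, r -> transform(column, c -> concat_ws('_', r, c)))), " +
    "x -> regexp_replace(x, '(_$)|(^_)', ''))"))
  .show(false)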
The second way uses explode and other functions:
df
  // first turn the null value into an array holding an empty string
  .withColumn("row", when(col("row").isNull, array(lit(""))).otherwise(col("row")))
  // create a unique ID to group by and collect on later
  .withColumn("id", monotonically_increasing_id())
  // explode the columns and the rows
  .withColumn("column", explode(col("column")))
  .withColumn("row", explode(col("row")))
  // combine the pieces with an underscore as separator
  .withColumn("output", concat_ws("_", col("row"), col("column")))
  // group by the ID again and collect sets
  .groupBy("id").agg(
    collect_set("row").as("row"),
    collect_set("column").as("column"),
    collect_set("output").as("output")
  )
  .drop("id")
  // strip a leading or trailing underscore (e.g. _1 or 1_)
  .withColumn("output", expr("transform(output, x -> regexp_replace(x, '(_$)|(^_)', ''))"))
Final output:
+------+---------+--------------------------+
|row |column |output |
+------+---------+--------------------------+
|[1, 2]|[b, a, ] |[2, 1_a, 2_a, 1, 1_b, 2_b]|
|[] |[s, m, b]|[s, m, b] |
+------+---------+--------------------------+
I left the other columns in, in case you want to see what is happening. Good luck!
One straightforward and easy solution is to use a custom UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

def combinations(a, b):
    c = []
    for x in a:
        for y in b:
            if not x:
                c.append(y)
            elif not y:
                c.append(x)
            else:
                c.append(f"{x}_{y}")
    return c

udf_combination = udf(combinations, ArrayType(StringType()))

df = spark.createDataFrame([
    [["1", "2"], ["a", "b", ""]]
], ["row", "column"])

df.withColumn("res", udf_combination(col("row"), col("column"))).show(truncate=False)
# +------+--------+--------------------------+
# |row |column |res |
# +------+--------+--------------------------+
# |[1, 2]|[a, b, ]|[1_a, 1_b, 1, 2_a, 2_b, 2]|
# +------+--------+--------------------------+

Concat column values in a dataframe

I have a CSV file that looks like the following:
name, state, a, b, c, d, ..., x
Jon, NY, 1, 4, 6, 2, 6
Eric, CA, 5, 3, 1, 5, 6
Chris,LA, 4, 4, 3, 1, 5
and I want the following result (one column with the concatenated fields, including the header names):
concate-fields
"name=Jon, state=NY, a=1, b=4, c=6, d= 2, ... x=6"
"name=Eric, state=CA, a=5, b=3, c=1, d= 5, ... x=6"
"name=Chris, state=LA, a=4, b=4, c=3, d= 1, ... x=5"
There can be many headers from a to x, so these should be appended in a generic way.
I now have
import org.apache.spark.sql.functions.{concat, lit}
val lp = sample.select(concat(lit("name="), $"name", lit(",state="), $"state"))
display(lp)
But I have trouble adding the same for column a->x (as this needs to be done in a generic way)
You can dynamically create a SQL expression to concat the columns by calling the map method on df.columns, as shown below.
import org.apache.spark.sql.functions._

val df = // Read CSV
df.withColumn("concate-fields",
    expr(s"concat(${df.columns.map(col => s"'$col=', nvl($col,'null'),','").mkString("").dropRight(4)})"))
  .withColumn("concate-fields", concat(lit("\""), col("concate-fields"), lit("\"")))

Reformatting Dataframe Containing Array to RowMatrix

I have this dataframe in the following format:
+----------+
|features  |
+----------+
|[1,4,7,10]|
|[2,5,8,11]|
|[3,6,9,12]|
+----------+
Script to create sample dataframe:
rows2 = sc.parallelize([ IndexedRow(0, [1, 4, 7, 10 ]),
IndexedRow(1, [2, 5, 8, 1]),
IndexedRow(1, [3, 6, 9, 12]),
])
rows_df = rows2.toDF()
row_vec= rows_df.drop("index")
row_vec.show()
The feature column contains 4 features, and there are 3 row ids. I want to convert this data to a RowMatrix, where the columns and rows will be in the following mat format:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
Basically, I want to transpose the dataframe into the new format so that I can run the columnSimilarities() function. I have a much larger dataframe that contains 50 features, and 39000 rows.
Is this what you are trying to do? I hate using collect(), but I don't think it can be avoided here, since you want to reshape/convert a structured object to a matrix ... right?
import numpy as np

X = np.array(row_vec.select("_2").collect()).reshape(-1, 3)
X = sc.parallelize(X)
for i in X.collect(): print(i)
[1 4 7]
[10 2 5]
[8 1 3]
[ 6 9 12]
I figured it out, I used the following:
from pyspark.mllib.linalg.distributed import RowMatrix
features_rdd = row_vec.select("features").rdd.map(lambda row: row[0])
features_mat = RowMatrix(features_rdd)

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
coordmatrix_features = CoordinateMatrix(
    features_mat.rows.zipWithIndex().flatMap(
        lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
    )
)
transposed_rowmatrix_features = coordmatrix_features.transpose().toRowMatrix()

How to find duplicates in a list in Scala?

I have a list of unsorted integers and I want to find the elements which are duplicated.
val dup = List(1|1|1|2|3|4|5|5|6|100|101|101|102)
I have to find the list of unique elements and also how many times each element is repeated.
I know I can find it with the code below:
val ans2 = dup.groupBy(identity).map(t => (t._1, t._2.size))
But I am not able to split the above list on "|". I tried converting it to a String and then using split, but I got the result below:
L
i
s
t
(
1
0
3
)
I am not sure why I am getting this result.
Reference: How to find duplicates in a list?
The symbol | is a method in Scala. You can check the API here:
|(x: Int): Int
Returns the bitwise OR of this value and x.
So you don't have a List; you have a single Int (103), which is the result of applying | to all the integers in your intended List.
Your code is fine; if you want to make a proper List, you should separate its elements with commas:
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
If you want to convert your given String into a List first, you can do:
"1|1|1|2|3|4|5|5|6|100|101|101|102".split("\\|").toList
Even easier: convert the list with duplicates into a set; a set is a data structure that, by definition, does not contain duplicates.
scala> val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
dup: List[Int] = List(1, 1, 1, 2, 3, 4, 5, 5, 6, 100, 101, 101, 102)
scala> val noDup = dup.toSet
noDup: scala.collection.immutable.Set[Int] = Set(101, 5, 1, 6, 102, 2, 3, 4, 100)
To count the elements, just call the method size on the resulting set:
scala> noDup.size
res3: Int = 9
Another way to solve the problem:
"1|1|1|2|3|4|5|5|6|100|101|101|102".split("\\|").groupBy(x => x).mapValues(_.size)
res0: scala.collection.immutable.Map[String,Int] = Map(100 -> 1, 4 -> 1, 5 -> 2, 6 -> 1, 1 -> 3, 102 -> 1, 2 -> 1, 101 -> 2, 3 -> 1)
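If you only want the elements that are actually duplicated (count > 1), you can filter the counts; a minimal sketch:

val dup = List(1, 1, 1, 2, 3, 4, 5, 5, 6, 100, 101, 101, 102)
val duplicatesOnly = dup.groupBy(identity)
  .map { case (k, v) => k -> v.size }
  .filter { case (_, n) => n > 1 }
// Map(1 -> 3, 5 -> 2, 101 -> 2)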

How to remove duplicates from a particular column in Scala by reading a text file

I am new to Scala. I am reading a text file from the local filesystem, and I want to find the duplicates in a particular column, as in the example below.
Input File:
1,2,3
2,3,4
1,3,4
2,4,5
3,4,5
I need output like this, selecting the first column:
1->2
2->3
3->1
My program is:
val file=scala.io.Source.fromFile("D:/Files/test.txt").getLines().mkString("\n")
val d=file.groupBy(identity).mapValues(_.size)
println(d)
But I am getting output like this:
Map(-> 5, 4 -> 1, 9 -> 1, 5 -> 3, , -> 12, 1 -> 3, 0 -> 1, 2 -> 5, 3 -> 4)
It is counting all the characters, but I want to count duplicates in a particular column only.
The issue here is that once mkString is called, the split of the file into separate lines is 'lost' (groupBy(identity) then groups individual characters). Another approach could be to use the toArray call instead.
val file = scala.io.Source.fromFile("D:/Files/test.txt")
val lines = file.getLines().toArray
In the above example, lines would be an array of strings:
Array("1,2,3", "2,3,4", "1,3,4", "2,4,5", "3,4,5")
Then, to extract the first column before grouping, you could use something like the slice method on each string (slice(0,1) takes the first character, which is enough here because the first column is a single digit; see the split-based sketch after the full example for longer values):
lines.map(_.slice(0,1)).groupBy(identity).mapValues(_.size)
Also, remember to close the file :)
Full example:
val file = scala.io.Source.fromFile("D:/Files/test.txt")
val lines = file.getLines().toArray
val grouping = lines.map(_.slice(0,1)).groupBy(identity).mapValues(_.size)
file.close
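The split-based variant mentioned above, reusing lines from the full example, for the case where the first column can be longer than one character:

val grouping = lines.map(_.split(",")(0)).groupBy(identity).mapValues(_.size)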
If I understand your question correctly, shouldn't the duplicate counts of the 1st column be (1->2, 2->2, 3->1)?
Here's one approach to get the counts:
// Create a list of split-column arrays
val list = scala.io.Source.
fromFile("/Users/leo/projects/scala/files/testfile.txt").
getLines.
map(_.split(",")).
toList
list: List[Array[String]] = List(Array(1, 2, 3), Array(2, 3, 4), Array(1, 3, 4), Array(2, 4, 5), Array(3, 4, 5))
// Count duplicates of the 1st split-column
val d = list.
groupBy(_(0)).
mapValues(_.size)
d: scala.collection.immutable.Map[String,Int] = Map(2 -> 2, 1 -> 2, 3 -> 1)
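If only the values that actually repeat are wanted, the counts can be filtered afterwards; a minimal sketch:

val duplicatesOnly = d.filter { case (_, count) => count > 1 }
// Map(2 -> 2, 1 -> 2)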