How to remove duplicates from particular column in Scala by reading textfile - scala

I am new to Scala. I am reading a text file from local disk, and I want to count the duplicate values in a particular column, as in the example below.
Input File:
1,2,3
2,3,4
1,3,4
2,4,5
3,4,5
I need output like this:
Select first column
1->2
2->3
3->1
program is:
val file=scala.io.Source.fromFile("D:/Files/test.txt").getLines().mkString("\n")
val d=file.groupBy(identity).mapValues(_.size)
println(d)
But I am getting output like this:
Map(-> 5, 4 -> 1, 9 -> 1, 5 -> 3, , -> 12, 1 -> 3, 0 -> 1, 2 -> 5, 3 -> 4)
It's counting all the data, but I want to count duplicates in a particular column only.

The issue is that mkString joins all the lines into one String, so the line structure is lost and groupBy(identity) groups the individual characters (including commas and newlines). Another approach is to use toArray instead.
val file = scala.io.Source.fromFile("D:/Files/test.txt")
val lines = file.getLines().toArray
In the above example, lines would be an array of strings:
Array(1,2,3, 2,3,4, 1,3,4, 2,4,5, 3,4,5)
Then, to extract the first column before grouping, you could use something like the slice method on each string (this takes only the first character, so it assumes single-character values in that column):
lines.map(_.slice(0,1)).groupBy(identity).mapValues(_.size)
Also, remember to close the file :)
Full example:
val file = scala.io.Source.fromFile("D:/Files/test.txt")
val lines = file.getLines().toArray
val grouping = lines.map(_.slice(0,1)).groupBy(identity).mapValues(_.size)
file.close()
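If the column values can be longer than one character, splitting on the comma is more robust than slice. Here is a minimal sketch of that idea; the names ColumnCounts, countColumn and countColumnInFile are made up for illustration, and skipping lines that have too few fields is an assumption, not something the question specifies.

```scala
object ColumnCounts {
  // Counts how often each value occurs in the given zero-based column.
  // Assumption: lines with too few fields are silently skipped.
  def countColumn(lines: Iterator[String], col: Int): Map[String, Int] =
    lines
      .map(_.split(",").map(_.trim))
      .collect { case fields if fields.length > col => fields(col) }
      .toList
      .groupBy(identity)
      .map { case (value, occurrences) => value -> occurrences.size }

  // Reads the file and closes it even if counting throws.
  def countColumnInFile(path: String, col: Int): Map[String, Int] = {
    val source = scala.io.Source.fromFile(path)
    try countColumn(source.getLines(), col)
    finally source.close()
  }
}
```

For the sample input, ColumnCounts.countColumn(lines, 0) should give the counts for the first column, and passing 1 or 2 selects another column.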

If I understand your question correctly, shouldn't the duplicate counts of the 1st column be (1->2, 2->2, 3->1)?
Here's one approach to get the counts:
// Create a list of split-column arrays
val list = scala.io.Source.
fromFile("/Users/leo/projects/scala/files/testfile.txt").
getLines.
map(_.split(",")).
toList
list: List[Array[String]] = List(Array(1, 2, 3), Array(2, 3, 4), Array(1, 3, 4), Array(2, 4, 5), Array(3, 4, 5))
// Count duplicates of the 1st split-column
val d = list.
groupBy(_(0)).
mapValues(_.size)
d: scala.collection.immutable.Map[String,Int] = Map(2 -> 2, 1 -> 2, 3 -> 1)

Related

Efficiently find common values in a map of lists - scala

I asked a similar question already here. However, I misjudged the scale of my specific case. In my example I gave, there were only 4 keys in the map. I am actually dealing with over 10,000 keys and they are mapped to lists of different sizes. So the solution given was correct, but I am now looking for a way that will do this in a more efficient manner.
Say I have:
val myMap: Map[Int, List[Int]] = Map(
1 -> List(1, 10, 12, 76, 105), 2 -> List(2, 5, 10), 3 -> List(10, 12, 76, 5), 4 -> List(2, 4, 5, 10),
... -> List(...)
)
Imagine the (...) go on for over 10,000 keys. I want to return a List of Lists containing a pair of keys and their shared values if the size of the intersection of their respective lists is >= 3.
For example:
res0: List[(Int, Int, List[Int])] = List(
(1, 3, List(10, 12, 76)),
(2, 4, List(2, 5, 10)),
(...),
(...),
)
I've been pretty stuck on this for a couple of days, so any help is genuinely appreciated. Thank you in advance!
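If space is not a concern, the problem can be solved in roughly O(N) time, where N is the total number of elements across all lists, provided each value appears in only a few lists (the work per value grows with the square of the number of lists containing it).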
Algorithm:
1. Build a reverse lookup map from the input map. It maps each list element to the keys (ids) whose lists contain it.
2. For each key of the input map:
   a. Create a temporary counter map.
   b. Iterate over the key's list, look up each value in the reverse lookup, and increment the count of every id found there.
   c. Every other id that was counted 3 or more times forms a desired pair with the current key.
Code
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer
object Application extends App {
val inputMap = Map(
1 -> List(1, 2, 3, 4),
2 -> List(2, 3, 4, 5),
3 -> List(3, 5, 6, 7),
4 -> List(1, 2, 3, 6, 7))
/*
Expected pairs
| pair | common elements |
---------------------------
(1, 2) -> 2, 3, 4
(1, 4) -> 1, 2, 3
(2, 1) -> 2, 3, 4
(3, 4) -> 3, 6, 7
(4, 1) -> 1, 2, 3
(4, 3) -> 3, 6, 7
*/
val reverseMap = mutable.Map[Int, ArrayBuffer[Int]]()
inputMap.foreach {
case (id, list) => list.foreach(
o => if (reverseMap.contains(o)) reverseMap(o).append(id) else reverseMap.put(o, ArrayBuffer(id)))
}
val result = inputMap.map {
case (id, list) =>
val m = mutable.Map[Int, Int]()
list.foreach(o =>
reverseMap(o).foreach(k => if (m.contains(k)) m.update(k, m(k)+1) else m.put(k, 1)))
val res = m.toList.filter(o => o._2 >= 3 && o._1 != id).map(o => (id, o._1))
res
}.flatten
println(result)
}
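The code above prints only the key pairs, and each pair appears in both orders. The question asks for the shared values as well, so here is a hedged, immutable sketch of the same reverse-lookup idea that returns (keyA, keyB, sharedValues) once per pair. The names CommonValues and pairsWithShared are hypothetical, and, like the approach above, it is only fast when each value occurs in a small number of lists.

```scala
object CommonValues {
  // Returns (keyA, keyB, sharedValues) for every pair keyA < keyB whose lists
  // share at least `minShared` distinct values.
  def pairsWithShared(input: Map[Int, List[Int]], minShared: Int): List[(Int, Int, List[Int])] = {
    // Reverse lookup: value -> keys whose list contains that value
    val keysByValue: Map[Int, List[Int]] =
      input.toList
        .flatMap { case (key, values) => values.distinct.map(v => v -> key) }
        .groupBy(_._1)
        .map { case (value, pairs) => value -> pairs.map(_._2) }

    // Every value shared by keys a and b contributes one ((a, b), value) entry
    val sharedByPair = for {
      (value, keys) <- keysByValue.toList
      a <- keys
      b <- keys
      if a < b
    } yield ((a, b), value)

    sharedByPair
      .groupBy(_._1)
      .toList
      .collect { case ((a, b), hits) if hits.size >= minShared => (a, b, hits.map(_._2).sorted) }
      .sortBy { case (a, b, _) => (a, b) }
  }
}
```

Calling CommonValues.pairsWithShared(myMap, 3) on the four keys shown in the question should yield the two triples the asker lists.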

Compute the maximum length assigned to each element using scala

For example, this is the content in a file:
20,1,helloworld,alaaa
2,3,world,neww
1,223,ala,12341234
Desired output:
0-> 2
1-> 3
2-> 10
3-> 8
I want to find max-length assigned to each element.
It's possible to extend this to any number of columns. First read the file as a dataframe:
import org.apache.spark.sql.functions.{array, expr}
import spark.implicits._
val df = spark.read.csv("path")
Then create an SQL expression for each column and evaluate it with expr:
val cols = df.columns.map(c => s"max(length(cast($c as String)))").map(expr(_))
Select the new columns as an array and convert to a Map:
df.select(array(cols:_*)).as[Seq[Int]].collect()
.head
.zipWithIndex.map(_.swap)
.toMap
This should give you the desired Map.
Map(0 -> 2, 1 -> 3, 2 -> 10, 3 -> 8)
Update:
The OP's example suggests that all rows have the same number of columns.
Using Spark-SQL and max(length()) on the DF columns is the idea that is being suggested in this answer.
You can do:
val xx = Seq(
("20","1","helloworld","alaaa"),
("2","3","world","neww"),
("1","223","ala","12341234")
).toDF("a", "b", "c", "d")
xx.createOrReplaceTempView("yy") // registerTempTable is deprecated since Spark 2.0
spark.sql("select max(length(a)), max(length(b)), max(length(c)), max(length(d)) from yy")
I would recommend using RDD's aggregate method:
val rdd = sc.textFile("/path/to/textfile").
map(_.split(","))
// res1: Array[Array[String]] = Array(
// Array(20, 1, helloworld, alaaa), Array(2, 3, world, neww), Array(1, 223, ala, 12341234)
// )
val seqOp = (m: Array[Int], r: Array[String]) =>
(r zip m).map( t => Seq(t._1.length, t._2).max )
val combOp = (m1: Array[Int], m2: Array[Int]) =>
(m1 zip m2).map( t => Seq(t._1, t._2).max )
val size = rdd.first().size // first() avoids collecting the whole RDD just to read one row
rdd.
aggregate( Array.fill[Int](size)(0) )( seqOp, combOp ).
zipWithIndex.map(_.swap).
toMap
// res2: scala.collection.immutable.Map[Int,Int] = Map(0 -> 2, 1 -> 3, 2 -> 10, 3 -> 8)
Note that aggregate takes:
an array of 0's (of size equal to rdd's row size) as the initial value,
a function seqOp for calculating maximum string lengths within a partition, and,
another function combOp to combine results across partitions for the final maximum values.
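To see how the three pieces fit together without a Spark cluster, here is a rough plain-Scala sketch that mimics aggregate with two nested folds. Partitions are modelled as a Seq of Seq[String], which is purely an illustration and not how Spark represents them; the names MaxColumnLengths and maxLengths are made up, and every row is assumed to have exactly `columns` fields.

```scala
object MaxColumnLengths {
  // seqOp: fold one row into the running per-column maxima of a partition
  val seqOp: (Vector[Int], Seq[String]) => Vector[Int] =
    (maxima, row) => maxima.zip(row).map { case (m, field) => math.max(m, field.length) }

  // combOp: merge the maxima computed for two partitions
  val combOp: (Vector[Int], Vector[Int]) => Vector[Int] =
    (left, right) => left.zip(right).map { case (l, r) => math.max(l, r) }

  // Assumption: every row has exactly `columns` comma-separated fields.
  def maxLengths(partitions: Seq[Seq[String]], columns: Int): Map[Int, Int] = {
    val zero = Vector.fill(columns)(0)
    partitions
      .map(_.map(_.split(",").toSeq).foldLeft(zero)(seqOp)) // within each partition
      .foldLeft(zero)(combOp)                               // across partitions
      .zipWithIndex
      .map(_.swap)
      .toMap
  }
}
```

The zero value has to be neutral for both functions, which is why an all-zero vector works here: a string length is never negative.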

How to find duplicates in a list in Scala?

I have a list of unsorted integers and I want to find the elements which are duplicated.
val dup = List(1|1|1|2|3|4|5|5|6|100|101|101|102)
I have to find the list of unique elements and also how many times each element is repeated.
I know I can find it with below code :
val ans2 = dup.groupBy(identity).map(t => (t._1, t._2.size))
But I am not able to split the above list on "|" . I tried converting to a String then using split but I got the result below:
L
i
s
t
(
1
0
3
)
I am not sure why I am getting this result.
Reference: How to find duplicates in a list?
The symbol | is a method on Int in Scala. You can check the API here:
|(x: Int): Int
Returns the bitwise OR of this value and x.
So you don't have a List of many elements; you have a List containing a single Int (103), which is the result of applying | to all the integers in your intended list. That is also why converting it to a String and splitting gave you the characters of "List(103)".
Your counting code is fine. To build a proper List, separate its elements with commas:
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
If your input is a String, you can split it and convert the result to a List:
"1|1|1|2|3|4|5|5|6|100|101|101|102".split("\\|").toList
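A small sketch to make the difference concrete. The object name PipeDemo and its members are made up for illustration; it uses the Char overload of split, which avoids regex escaping altogether.

```scala
object PipeDemo {
  // `|` between Int literals is bitwise OR, so this list has exactly one element.
  val collapsed: List[Int] = List(1 | 1 | 1 | 2 | 3 | 4 | 5 | 5 | 6 | 100 | 101 | 101 | 102)

  // Parsing the same text as a String keeps every element.
  def parse(text: String): List[Int] =
    text.split('|').toList.map(_.trim.toInt)

  // The counting expression from the question, applied to a real list
  def counts(values: List[Int]): Map[Int, Int] =
    values.groupBy(identity).map(t => (t._1, t._2.size))
}
```

PipeDemo.collapsed is List(103), while PipeDemo.parse on the pipe-separated text gives all 13 elements, which can then be passed to PipeDemo.counts.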
Even easier, convert the list of duplicates into a set - a set is a data structure that by default does not have any duplicates.
scala> val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
dup: List[Int] = List(1, 1, 1, 2, 3, 4, 5, 5, 6, 100, 101, 101, 102)
scala> val noDup = dup.toSet
noDup: scala.collection.immutable.Set[Int] = Set(101, 5, 1, 6, 102, 2, 3, 4, 100)
To count the unique elements, just call the method size on the resulting set (note that a set does not tell you how many times each element was repeated):
scala> noDup.size
res3: Int = 9
Another way to solve the problem:
"1|1|1|2|3|4|5|5|6|100|101|101|102".split("\\|").groupBy(x => x).mapValues(_.size)
res0: scala.collection.immutable.Map[String,Int] = Map(100 -> 1, 4 -> 1, 5 -> 2, 6 -> 1, 1 -> 3, 102 -> 1, 2 -> 1, 101 -> 2, 3 -> 1)

Zip two Arrays, always 3 elements of the first array then 2 elements of the second

I've manually built a method that takes 2 arrays and combines them to 1 like this:
a0,a1,a2,b0,b1,a3,a4,a5,b2,b3,a6,...
So I always take 3 elements of the first array, then 2 of the second one.
As I said, I built that function manually.
Now I guess I could make this a one-liner instead with the help of zip. The problem is, that zip alone is not enough as zip builds tuples like (a0, b0).
Of course I can flatMap this, but still not what I want:
val zippedArray: List[Float] = data1.zip(data2).toList.flatMap(t => List(t._1, t._2))
That way I'd get a List(a0, b0, a1, b1,...), still not what I want.
(I'd then use toArray for the list... it's more convenient to work with a List right now)
I thought about using take and drop but they return new data-structures instead of modifying the old one, so not really what I want.
As you can imagine, I'm not really into functional programming (yet). I do use it and I see huge benefits, but some things are so different to what I'm used to.
Consider grouping array a by 3, and array b by 2, namely
val a = Array(1,2,3,4,5,6)
val b = Array(11,22,33,44)
val g = (a.grouped(3) zip b.grouped(2)).toArray
Array((Array(1, 2, 3),Array(11, 22)), (Array(4, 5, 6),Array(33, 44)))
Then
g.flatMap { case (x,y) => x ++ y }
Array(1, 2, 3, 11, 22, 4, 5, 6, 33, 44)
This is very similar to @elm's answer, but I wanted to show that you can use a lazier approach (iterators) to avoid creating temporary structures:
scala> val a = List(1,2,3,4,5,6)
a: List[Int] = List(1, 2, 3, 4, 5, 6)
scala> val b = List(11,22,33,44)
b: List[Int] = List(11, 22, 33, 44)
scala> val groupped = a.sliding(3, 3) zip b.sliding(2, 2)
groupped: Iterator[(List[Int], List[Int])] = non-empty iterator
scala> val result = groupped.flatMap { case (a, b) => a ::: b }
result: Iterator[Int] = non-empty iterator
scala> result.toList
res0: List[Int] = List(1, 2, 3, 11, 22, 4, 5, 6, 33, 44)
Note that it stays an iterator all the way until we materialize it with toList
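Both answers stop as soon as the shorter side runs out of groups, because zip truncates. If leftover elements should be kept, zipAll with empty padding is one option. This is a sketch under that assumption (the question does not say what should happen to leftovers), and the names Interleave and interleave are hypothetical.

```scala
object Interleave {
  // Takes `n` elements from `a`, then `m` elements from `b`, repeatedly.
  // Assumption: once one input is exhausted, the rest of the other is appended.
  def interleave[A](a: Seq[A], b: Seq[A], n: Int, m: Int): List[A] = {
    val groupsA = a.grouped(n)
    val groupsB = b.grouped(m)
    groupsA
      .zipAll(groupsB, Seq.empty[A], Seq.empty[A]) // pad the shorter side with empty groups
      .flatMap { case (x, y) => x ++ y }
      .toList
  }
}
```

With the arrays from the question this would be called as Interleave.interleave(data1.toSeq, data2.toSeq, 3, 2).toArray.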

toMap when keys are repeated with different values

I have a list
val data = List(2, 4, 3, 2, 1, 1, 1,7)
from which I want to create a map whose keys are the values in the list above and whose values are their indices. I tried:
scala> data.zipWithIndex.toMap
res5: scala.collection.immutable.Map[Int,Int] = Map(1 -> 6, 2 -> 3, 7 -> 7, 3 -> 2, 4 -> 1)
But, strangely, res5(1) is 6, whereas I want it to be 4 (the index of the first occurrence).
I could solve it by
data.zipWithIndex groupBy (_._1) mapValues (w=>w.map(tuple=>tuple._2) min)
But is there a way to pass a function f to toMap so that it builds the map the way I want?
toMap is going to add each pair to the map in the order of the zipped list, and when you add a mapping k -> v to a map that already contains a k, the old value is simply replaced.
An easy fix is just to reverse the list after zipping the indices and before converting to a map:
data.zipWithIndex.reverse.toMap
Now the mappings 1 -> 6 and 1 -> 5 will be added before 1 -> 4, which means 1 -> 4 is the one you'll see in the result.
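As far as I know, toMap does not accept a merge function, so if you prefer to state the "keep the first index" rule explicitly rather than rely on insertion order, a fold is one way to sketch it. The names FirstIndex and firstIndexOf are made up for illustration.

```scala
object FirstIndex {
  // Maps each distinct value to the index of its first occurrence.
  def firstIndexOf[A](data: Seq[A]): Map[A, Int] =
    data.zipWithIndex.foldLeft(Map.empty[A, Int]) {
      case (acc, (value, index)) =>
        // Keep the existing entry if the value was already seen
        if (acc.contains(value)) acc else acc + (value -> index)
    }
}
```

For the list in the question this should produce the same map as data.zipWithIndex.reverse.toMap, just without the intermediate reversed list.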