Map one value to all values with a common relation in Scala

Given this set of data:
{sentenceA1}{\t}{sentenceB1}
{sentenceA1}{\t}{sentenceB2}
{sentenceA2}{\t}{sentenceB1}
{sentenceA3}{\t}{sentenceB1}
{sentenceA4}{\t}{sentenceB2}
I want to map each sentenceA to all the other sentenceA values that share a sentenceB with it, in Scala, so the result will be something like this:
{sentenceA1}->{sentenceA2,sentenceA3,sentenceA4} or
{sentenceA2}->{sentenceA1, sentenceA3}

val lines = List(
"sentenceA1\tsentenceB1",
"sentenceA1\tsentenceB2",
"sentenceA2\tsentenceB1",
"sentenceA3\tsentenceB1",
"sentenceA4\tsentenceB2"
)
val afterSplit = lines.map(_.split("\t"))
val ba = afterSplit
.groupBy(_(1))
.mapValues(_.map(_(0)))
val ab = afterSplit
.groupBy(_(0))
.mapValues(_.map(_(1)))
val result = ab.map { case (a, b) =>
a -> b.foldLeft(Set[String]())(_ ++ ba(_)).diff(Set(a))
}
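A more compact variant of the same idea can be sketched as follows, assuming Scala 2.13 or later (for groupMap). The names CommonRelation and related are hypothetical, introduced only for illustration. Lines that do not split into exactly two tab-separated fields are silently skipped, which is an assumption and not something the question specifies.

```scala
object CommonRelation {
  // Maps each A to every other A that shares at least one B with it.
  def related(lines: Seq[String]): Map[String, Set[String]] = {
    // Keep only well-formed "A<TAB>B" lines as (a, b) pairs.
    val pairs = lines.map(_.split("\t")).collect { case Array(a, b) => (a, b) }
    val aToBs = pairs.groupMap(_._1)(_._2) // A -> all of its Bs
    val bToAs = pairs.groupMap(_._2)(_._1) // B -> all of its As
    aToBs.map { case (a, bs) =>
      // Collect every A reachable through a shared B, then drop A itself.
      a -> (bs.flatMap(b => bToAs(b)).toSet - a)
    }
  }
}
```

Calling CommonRelation.related(lines) on the sample data should give the same mapping as result above, with Set values instead of going through foldLeft.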

Related

Transform a list of object to lists of its field

I have a List[MyObject], with MyObject containing the fields field1, field2 and field3.
I'm looking for an efficient way of doing :
Tuple3(_.map(_.field1), _.map(_.field2), _.map(_.field3))
In Java I would do something like:
List<Field1Type> f1 = new ArrayList<Field1Type>();
List<Field2Type> f2 = new ArrayList<Field2Type>();
List<Field3Type> f3 = new ArrayList<Field3Type>();
for(MyObject mo : myObjects) {
f1.add(mo.getField1());
f2.add(mo.getField2());
f3.add(mo.getField3());
}
I would like something more functional since I'm in scala but I can't put my finger on it.
Get 2 or 3 sub-groups with unzip / unzip3
Assuming the starting point:
val objects: Seq[MyObject] = ???
You can unzip to get all 3 sub-groups:
val (firsts, seconds, thirds) =
objects
.unzip3((o: MyObject) => (o.f1, o.f2, o.f3))
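As a self-contained sketch of the unzip3 approach: the question does not give the field types, so the MyObject definition below (String, Int, Double) and the Unzip3Demo wrapper are assumptions made purely for illustration.

```scala
// Hypothetical stand-in for the question's MyObject; field types are assumed.
final case class MyObject(f1: String, f2: Int, f3: Double)

object Unzip3Demo {
  // Projects each object to a triple, then splits the triples into three sequences.
  def split(objects: Seq[MyObject]): (Seq[String], Seq[Int], Seq[Double]) =
    objects.map(o => (o.f1, o.f2, o.f3)).unzip3
}
```

Element order is preserved in each of the three resulting sequences.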
What if I have more than 3 relevant sub-groups?
If you really need more sub-groups, you have to implement your own unzipN. However, instead of working with Tuple22, I would personally use an adapter:
case class MyObjectsProjection(private val objs: Seq[MyObject]) {
  lazy val f1s: Seq[String] =
    objs.map(_.f1)
  lazy val f2s: Seq[String] =
    objs.map(_.f2)
  ...
  lazy val f22s: Seq[String] =
    objs.map(_.f22)
}
val objects: Seq[MyObject] = ???
val objsProjection = MyObjectsProjection(objects)
objsProjection.f1s
objsProjection.f2s
...
objsProjection.f22s
Notes:
Change MyObjectsProjection according to your needs.
This is from a Scala 2.12/2.11 vanilla perspective.
The following will decompose your objects into three lists:
case class MyObject[T,S,R](f1: T, f2: S, f3: R)
val myObjects: Seq[MyObject[Int, Double, String]] = ???
val (l1, l2, l3) = myObjects.foldRight((List.empty[Int], List.empty[Double], List.empty[String]))((nxt, acc) => {
  // foldRight with prepend keeps the original element order (foldLeft would reverse it)
  (nxt.f1 :: acc._1, nxt.f2 :: acc._2, nxt.f3 :: acc._3)
})

Reading CSV into Map[String, Array[String]] in Scala

Given a CSV in the format below, what is the best way to load it into Scala as a Map[String, Array[String]], with each key being a unique value of Col2 and the value an Array[String] of all co-occurring values of Col1?
a,1,
b,2,m
c,2,
d,1,
e,3,m
f,4,
g,2,
h,3,
I,1,
j,2,n
k,2,n
l,1,
m,5,
n,2,
I have tried to use the function below, but am getting errors trying to add to the Option type:
+= is not a member of Option[Array[String]]
In addition, I get overloaded method value ++ with alternatives:
with regards to the line case None => mapping ++ (linesplit(2) -> Array(linesplit(1)))
def parseCSV() : Map[String, Array[String]] = {
var mapping = Map[String, Array[String]]()
val lines = Source.fromFile("test.csv")
for (line <- lines.getLines) {
val linesplit = line.split(",")
mapping.get(linesplit(2)) match {
case Some(_) => mapping.get(linesplit(2)) += linesplit(1)
case None => mapping ++ (linesplit(2) -> Array(linesplit(1)))
}
}
mapping
}
}
I am hoping for a Map[String, Array[String]] like the following:
(2 -> Array["b","c","g","j", "k", "n"])
(3 -> Array["e","h"])
(4 -> Array["f"])
(5 -> Array["m"])
You can do the following:
First - read the file to List[List[String]]:
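// `using` is a loan-pattern helper that is not part of the standard library;
// scala.util.Using.resource (Scala 2.13+) does the same job.
val rows: List[List[String]] = scala.util.Using.resource(io.Source.fromFile("test.csv")) { source =>
  source.getLines().toList map { line =>
    line.split(",").map(_.trim).toList
  }
}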
Then, because the input has only 2 values per row, filter the rows (rows with only one value should be ignored):
val filteredRows = rows.filter(row => row.size > 1)
The last step is to groupBy the first value (which is the second column - the index column is not returned from Source.fromFile):
filteredRows.groupBy(row => row.head).mapValues(_.map(_.last))
This isn't complete, but it should give you an outline of how it might be done.
io.Source
.fromFile("so.txt") //open file
.getLines() //line by line
.map(_.split(",")) //split on commas
.toArray //load into memory
.groupMap(_(1))(_(0)) //Scala 2.13
//res0: Map[String,Array[String]] = Map(4 -> Array(f), 5 -> Array(m), 1 -> Array(a, d, I, l), 2 -> Array(b, c, g, j, k, n), 3 -> Array(e, h))
You'll notice that the file resource isn't closed, and it doesn't handle malformed input. I leave that for the diligent reader.
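One way the two gaps mentioned above could be closed is sketched below, assuming Scala 2.13 or later. The names CsvGrouping, groupLines and parseCsv are hypothetical, and treating "fewer than two columns" as malformed is an assumption about what should count as bad input.

```scala
import scala.io.Source
import scala.util.Using

object CsvGrouping {
  // Groups Col1 values by Col2; rows with fewer than two columns are skipped.
  def groupLines(lines: Iterator[String]): Map[String, Array[String]] =
    lines
      .map(_.split(",").map(_.trim))
      .collect { case cols if cols.length >= 2 => (cols(1), cols(0)) }
      .toArray                // load into memory before the source is closed
      .groupMap(_._1)(_._2)   // key = Col2, values = Col1

  // Using.resource closes the Source even if parsing throws.
  def parseCsv(path: String): Map[String, Array[String]] =
    Using.resource(Source.fromFile(path))(src => groupLines(src.getLines()))
}
```

Keeping the grouping logic in groupLines, separate from the file handling, makes it possible to exercise it with an in-memory Iterator.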
For the code in the question, a mutable Map and ArrayBuffer should be used, because they can be updated in place as the lines are read:
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

def parseCSV(): Map[String, Array[String]] = {
  val mapping = mutable.Map[String, ArrayBuffer[String]]()
  val source = Source.fromFile("test.csv")
  try {
    for (line <- source.getLines()) {
      val linesplit = line.split(",")
      if (linesplit.length >= 2) {
        val key = linesplit(1) // Col2 is the grouping key
        // append Col1 to the buffer for this key, creating the buffer if needed
        mapping.getOrElseUpdate(key, ArrayBuffer[String]()) += linesplit(0)
      }
    }
  } finally source.close()
  // convert to the immutable result type asked for in the question
  mapping.map { case (k, v) => (k, v.toArray) }.toMap
}

How to convert a Seq of tuples into sets of individual elements in Scala

We have a sequence of tuples, depTitleSeq, of the form Seq((department, title)). We would like to extract a Set of departments and a Set of titles. The best we could come up with so far is:
val depTitleSeq = getDepTitleTupleSeq()
var departmentSeq = ArrayBuffer[String]()
var titleSeq = ArrayBuffer[String]()
for (depTitle <- depTitleSeq) yield {
departmentSeq += depTitle._1
titleSeq += depTitle._2
}
val depSet = departmentSeq.toSet
val titleSet = titleSeq.toSet
We are fairly new to Scala, and I'm sure there are better and more efficient ways to achieve this. If you could point us in the right direction, it would be of great help.
If you have two Seqs of data that you want combined into a Seq of tuples, you can zip them together.
If you have a Seq of tuples and you want the elements separated, then you can unzip them.
val (departmentSeq, titleSeq) = getDepTitleTupleSeq().unzip
val depSet :Set[String] = departmentSeq.toSet
val titleSet :Set[String] = titleSeq.toSet
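The zip/unzip relationship described above can be sketched as a round trip. The sample values and the ZipUnzipDemo object are made up for illustration and are not taken from the question.

```scala
object ZipUnzipDemo {
  // Made-up sample data, for illustration only.
  val departments = Seq("eng", "ops", "eng")
  val titles      = Seq("dev", "sre", "lead")

  // zip pairs the two sequences element by element...
  val depTitleSeq: Seq[(String, String)] = departments.zip(titles)

  // ...and unzip splits the pairs back into two sequences.
  val (deps, ts) = depTitleSeq.unzip

  // toSet removes the duplicate "eng".
  val depSet: Set[String]   = deps.toSet
  val titleSet: Set[String] = ts.toSet
}
```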
val depTitleSeq = Seq(("x","a"),("y","b"))
val depSet = depTitleSeq.map(_._1).toSet
val titleSet = depTitleSeq.map(_._2).toSet
In Scala REPL:
scala> val depTitleSeq = Seq(("x","a"),("y","b"))
depTitleSeq: Seq[(String, String)] = List((x,a), (y,b))
scala> val depSet = depTitleSeq.map(_._1).toSet
depSet: scala.collection.immutable.Set[String] = Set(x, y)
scala> val titleSet = depTitleSeq.map(_._2).toSet
titleSet: scala.collection.immutable.Set[String] = Set(a, b)
val result:(Set[String], Set[String]) = depTitleSeq.foldLeft((Set[String](), Set[String]())){(a, b) => (a._1 + b._1, a._2 + b._2) }
You can use foldLeft, as above, to build both sets in a single pass.

How to join two datasets by key in Scala Spark

I have two datasets, and each record in a dataset has two elements.
Below are examples.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Results expected: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets on the first column, 'name'. I have tried to do this for a couple of hours, but I couldn't figure it out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
The above code does not work!
join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)].
The first step needed is to transform the input data into the right type.
We first need to transform the original data of type String into pairs of (Key, Value):
val parse:String => (String, String) = s => {
val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
s match {
case regex(k,v) => (k,v)
case _ => ("","")
}
}
(Note that we can't use a simple split(",") expression because the key contains commas)
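As a variation, the parser can return an Option so that malformed lines are dropped instead of being mapped to ("", ""). This is a sketch in plain Scala, without Spark; the names ParseDemo and Line are hypothetical, and the pattern uses \s* between the fields, which is an assumption that only whitespace separates them.

```scala
object ParseDemo {
  // Matches lines of the form ('key', 'value'); the key may contain commas.
  private val Line = "^\\('([^']+)',\\s*'([^']+)'\\)$".r

  // Returns Some((key, value)) for well-formed lines, None otherwise.
  def parse(s: String): Option[(String, String)] = s match {
    case Line(k, v) => Some((k, v))
    case _          => None
  }
}
```

With this shape, flatMap over a collection of lines keeps only the pairs that parsed successfully.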
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sparkContext.parallelize(s1)
val rdd2 = sparkContext.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs
val joined = kvRdd1.join(kvRdd2)
// Let's check out results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
You have to create pair RDDs for your datasets first, and then apply the join transformation. The sample data in your question does not look well-formed.
Please consider the example below.
**Dataset1**
a 1
b 2
c 3
**Dataset2**
a 8
b 4
Your code should look like this in Scala:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from the Scala shell:
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))
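To see why key c is missing from the result above, here is a sketch of inner-join semantics on plain Scala collections. It is not Spark code and says nothing about how Spark implements join; JoinSketch and innerJoin are hypothetical names used only for illustration.

```scala
object JoinSketch {
  // Inner join on key over in-memory pairs: only keys present on both sides survive.
  def innerJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] = {
    // Index the right side by key so each left element can find its matches.
    val rightByKey: Map[K, Seq[W]] =
      right.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    for {
      (k, v) <- left
      w      <- rightByKey.getOrElse(k, Seq.empty) // no match means no output row
    } yield (k, (v, w))
  }
}
```

Joining Dataset1 with Dataset2 this way yields pairs for a and b only, which matches the shape of the shell output.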

Obtain a specific value from an RDD according to another RDD

I want to map an RDD by looking up values in another RDD, using this code:
val product = numOfT.map{case((a,b),c)=>
val h = keyValueRecords.lookup(b).take(1).mkString.toInt
(a,(h*c))
}
a and b are Strings and c is an Integer. keyValueRecords is an RDD[(String, String)].
I get a type mismatch error. How can I fix it?
What is my mistake?
This is a sample of data:
userId,movieId,rating,timestamp
1,16,4.0,1217897793
1,24,1.5,1217895807
1,32,4.0,1217896246
2,3,2.0,859046959
3,7,3.0,8414840873
I'm trying with this code:
val lines = sc.textFile("ratings.txt").map(s => {
val substrings = s.split(",")
(substrings(0), (substrings(1),substrings(1)))
})
val shoppingList = lines.groupByKey()
val coOccurence = shoppingList.flatMap{case(k,v) =>
val arry1 = v.toArray
val arry2 = v.toArray
val pairs = for (pair1 <- arry1; pair2 <- arry2 ) yield ((pair1,pair2),1)
pairs.iterator
}
val numOfT = coOccurence.reduceByKey((a,b)=>(a+b)) // (((item,rate),(item,rate)),coccurence)
// produce recommendations for a specific user
val keyValueRecords = sc.textFile("ratings.txt").map(s => {
val substrings = s.split(",")
(substrings(0), (substrings(1),substrings(2)))
}).filter{case(k,v)=> k=="1"}.groupByKey().flatMap{case(k,v) =>
val arry1 = v.toArray
val arry2 = v.toArray
val pairs = for (pair1 <- arry1; pair2 <- arry2 ) yield ((pair1,pair2),1)
pairs.iterator
}
val numOfTForaUser = keyValueRecords.reduceByKey((a,b)=>(a+b))
val joined = numOfT.join(numOfTForaUser).map{case(k,v)=>(k._1._1,(k._2._2.toFloat*v._1.toFloat))}.collect.foreach(println)
The last RDD is not produced. Is something wrong?