Calculate cosine similarity in Scala

I have a file (tags.csv) that contains UserId, MovieId, and tags. I want to use a domain-based method to calculate the cosine similarity between tags. I want to show only the tags relevant to comedy and measure the similarity of each of those tags to the comedy tag.
My code is:
val rows = sc.textFile("/usr/local/comedy")
val vecData = rows.map(line => Vectors.dense(line.split(", ").map(_.toDouble)))
val mat = new RowMatrix(vecData)
val exact = mat.columnSimilarities()
val approx = mat.columnSimilarities(0.07)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
val MAE = exactEntries.leftOuterJoin(approxEntries).values.map {
case (u, Some(v)) =>
math.abs(u - v)
case (u, None) =>
math.abs(u)
}.mean()
but this error appears:
java.lang.NumberFormatException: For input string: "[1,898,"black comedy"]"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
What's wrong?

The error message is full of pertinent info.
NumberFormatException: For input string: "[1,898,"black comedy"]"
It looks like the input String isn't being split into separate column data. So .split(", ") isn't doing its job, and it's easy to see why: there are no comma-space sequences to split on.
We could take out the space and split on just the comma, but that would still leave a non-digit [ in the 1st column data, and the 3rd column data has no digit characters at all.
There are a few different ways to attack this. I'd be tempted to use a regex parser.
val twoNums = "(\\d+),(\\d+),".r.unanchored
val vecData = rows.collect{ case twoNums(a, b) =>
Vectors.dense(Array(a.toDouble, b.toDouble))
}
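To see what the unanchored regex does with the line from the error message, here is a minimal sketch in plain Scala with no Spark involved. The object name TagLineParser and the parse helper are made up for illustration; only the pattern itself comes from the answer above.

```scala
import scala.util.matching.Regex

object TagLineParser {
  // Same pattern as in the answer: two digit runs, each followed by a comma.
  // unanchored lets the match start anywhere in the line, so the leading "[" is skipped.
  val twoNums: Regex = "(\\d+),(\\d+),".r.unanchored

  // Returns the two numeric columns, or None when the line does not match.
  def parse(line: String): Option[(Double, Double)] = line match {
    case twoNums(a, b) => Some((a.toDouble, b.toDouble))
    case _             => None
  }
}
```

Lines that do not contain two comma-terminated numbers come back as None, which mirrors how rows.collect silently drops them in the Spark version.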

Related

Spark - calculate max occurrence per day-event

I have the following RDD[String]:
1:AAAAABAAAAABAAAAABAAABBB
2:BBAAAAAAAAAABBAAAAAAAAAA
3:BBBBBBBBAAAABBAAAAAAAAAA
The first number is supposed to be days and the following characters are events.
I have to calculate the day where each event has the maximum occurrence.
The expected result for this dataset should be:
{ "A" -> Day2 , "B" -> Day3 }
(A occurs 20 times on day 2 and B occurs 10 times on day 3)
I am splitting the original dataset
val foo = rdd.map(_.split(":")).map(x => (x(0), x(1).split("")) )
What could be the best implementation for count and aggregation?
Any help is appreciated.
This should do the trick:
import org.apache.spark.rdd.RDD
val rdd = sqlContext.sparkContext.makeRDD(Seq(
"1:AAAAABAAAAABAAAAABAAABBB",
"2:BBAAAAAAAAAABBAAAAAAAAAA",
"3:BBBBBBBBAAAABBAAAAAAAAAA"
))
val keys = Seq("A", "B")
val seqOfMaps: RDD[(String, Map[String, Int])] = rdd.map{str =>
val split = str.split(":")
(s"Day${split.head}", split(1).groupBy(a => a.toString).mapValues(_.length))
}
keys.map{key => {
key -> seqOfMaps.mapValues(_.getOrElse(key, 0)).sortBy(a => -a._2).first._1
}}.toMap
The processing that needs to be done consists of transforming the data into an RDD that is easy to apply functions to, such as finding the maximum of a list.
I will try to explain step by step
I've used dummy data for "A" and "B" chars.
The foo RDD is the first step; it gives you an RDD[(String, Array[String])].
Let's count the occurrences of each char in the Array[String]:
val res3 = foo.map{case (d,s)=> (d, s.toList.groupBy(c => c).map{case (x, xs) => (x, xs.size)}.toList)}
(1,List((A,18), (B,6)))
(2,List((A,20), (B,4)))
(3,List((A,14), (B,10)))
Next we will flatMap over values to expand our rdd by char
res3.flatMapValues(list => list)
(3,(A,14))
(3,(B,10))
(1,(A,18))
(2,(A,20))
(2,(B,4))
(1,(B,6))
Rearrange the RDD so that the char comes first
res5.map{case (d, (s, c)) => (s, c, d)}
(A,20,2)
(B,4,2)
(A,18,1)
(B,6,1)
(A,14,3)
(B,10,3)
Now we group by char
res7.groupBy(_._1)
(A,CompactBuffer((A,18,1), (A,20,2), (A,14,3)))
(B,CompactBuffer((B,6,1), (B,4,2), (B,10,3)))
Finally we take the maximum count for each char
res9.map{case (s, list) => (s, list.maxBy(_._2))}
(B,(B,10,3))
(A,(A,20,2))
Hope this helps
Previous answers are good, but I prefer this solution:
val data = Seq(
"1:AAAAABAAAAABAAAAABAAABBB",
"2:BBAAAAAAAAAABBAAAAAAAAAA",
"3:BBBBBBBBAAAABBAAAAAAAAAA"
)
val initialRDD = sparkContext.parallelize(data)
// to tuples like (1,'A',18)
val charCountRDD = initialRDD.flatMap(s => {
val parts = s.split(":")
val charCount = parts(1).groupBy(i => i).mapValues(_.length)
charCount.map(i => (parts(0), i._1, i._2))
})
// group by character, and take max value from grouped collection
val result = charCountRDD.groupBy(i => i._2).map(k => k._2.maxBy(z => z._3))
result.foreach(println(_))
Result is:
(3,B,10)
(2,A,20)
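For checking the logic without a cluster, the same count-then-take-the-maximum idea can be sketched with plain Scala collections. This is only an illustration under that assumption: the names MaxEventDay, counts and maxDayPerEvent are invented here, and on a real RDD the final groupBy step would more likely be a reduceByKey that keeps the larger count.

```scala
object MaxEventDay {
  val data: Seq[String] = Seq(
    "1:AAAAABAAAAABAAAAABAAABBB",
    "2:BBAAAAAAAAAABBAAAAAAAAAA",
    "3:BBBBBBBBAAAABBAAAAAAAAAA"
  )

  // One (event, (day, count)) entry for every event seen on every day.
  def counts(lines: Seq[String]): Seq[(Char, (String, Int))] =
    lines.flatMap { line =>
      val Array(day, events) = line.split(":", 2)
      events.groupBy(identity).map { case (c, cs) => (c, (s"Day$day", cs.length)) }
    }

  // For each event, keep the day with the highest count.
  def maxDayPerEvent(lines: Seq[String]): Map[Char, String] =
    counts(lines)
      .groupBy(_._1)
      .map { case (c, xs) => (c, xs.map(_._2).maxBy(_._2)._1) }
}
```

With the three sample lines this gives Day2 for A (20 occurrences) and Day3 for B (10 occurrences), matching the results shown above.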

Scala: write a MapReduce program to find a word that follows a word [Homework]

I have a homework assignment where I must write a MapReduce program in Scala to find, for each word in the file, which word follows it most often.
For example, for the word "basketball", the word "is" comes next 5 times, "has" 2 times, and "court" 1 time.
In a text file this might show up as:
basketball is..... (this sequence happens 5 times)
basketball has..... (this sequence happens 2 times)
basketball court.... (this sequence happens 1 time)
I am having a hard time conceptually figuring out how to do this.
The idea I have had but have not been able to successfully implement is
Iterate through each word, if the word is basketball, take the next word and add it to a map. Reduce by key, and sort from highest to lowest.
Unfortunately I do not know how to take the next word in a list of words.
For example, I would like to do something like this
val lines = spark.textFile("basketball_words_only.txt") // process lines in file
// split into individual words
val words = lines.flatMap(line => line.split(" "))
var listBuff = new ListBuffer[String]() // a list Buffer to hold each following word
val it = Iterator(words)
while (it.hasNext) {
listBuff += it.next().next() // <-- this is what I would like to do
}
val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue as I cannot reduceByKey with a listBuffer
val sort = count.sortBy(_._2,false,1)
val result2 = sort.collect()
for (i <- 0 to result2.length - 1) {
printf("%s follows %d times\n", result1(2)._1, result2(i)._2);
}
Any help would be appreciated. If I am over thinking this I am open to different ideas and suggestions.
Here's one way to do it using MLlib's sliding function:
import org.apache.spark.mllib.rdd.RDDFunctions._
val resRDD = textFile.
flatMap(_.split("""[\s,.;:!?]+""")).
sliding(2).
map{ case Array(x, y) => ((x, y), 1) }.
reduceByKey(_ + _).
map{ case ((x, y), c) => (x, y, c) }.
sortBy( z => (z._1, z._3, z._2), false )
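The pipeline above counts every (word, next word) pair but stops short of picking the top follower for each word. A minimal local sketch of that last step, using plain Scala collections instead of an RDD, could look like the following. The object and method names are hypothetical, and as in the answer the sliding window runs across sentence boundaries.

```scala
object Followers {
  // For each word, the word that follows it most often and how many times.
  def mostFrequentFollower(text: String): Map[String, (String, Int)] = {
    val words = text.split("""[\s,.;:!?]+""").filter(_.nonEmpty).toList

    // Count each (word, next word) pair; collect skips a trailing window of size 1.
    val pairCounts: List[((String, String), Int)] =
      words.sliding(2).collect { case List(w, next) => (w, next) }.toList
        .groupBy(identity).toList
        .map { case (pair, occ) => (pair, occ.size) }

    // Group by the first word and keep the follower with the highest count.
    pairCounts.groupBy(_._1._1).map { case (w, xs) =>
      val ((_, next), n) = xs.maxBy(_._2)
      (w, (next, n))
    }
  }
}
```

On an RDD the same shape would be a reduceByKey over the pairs followed by a second reduceByKey keyed on the first word, keeping the entry with the larger count.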

How to copy matrix to column array

I'm trying to copy a column of a matrix into an array, also I want to make this matrix public.
Here's my code:
val years = Array.ofDim[String](1000, 1)
val bufferedSource = io.Source.fromFile("Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val i=0;
//println("THEME, TITLE, ARTIST, YEAR, SPOTIFY_URL")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
years(i)=cols(3)(i)
}
I want cols to be a global matrix and to copy column 3 into years. Because of the way I obtain cols, I don't know how to define it.
There're three different problems in your attempt:
Your regexp will fail for this dataset. I suggest you change it to:
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
This will capture the blocks wrapped in double quotes but containing commas (courtesy of Luke Sheppard on regexr)
This val i=0; is not very scala-ish / functional. We can replace it by a zipWithIndex in the for comprehension:
for ((line, count) <- bufferedSource.getLines.zipWithIndex)
You can create the "global matrix" by extracting elements from each line (val Array (...)) and returning them as the value of the for-comprehension block (yield):
It looks like that:
for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
val Array(theme,title,artist,year,spotify_url) = line....
...
(theme,title,artist,year,spotify_url)
}
And here is the complete solution:
val bufferedSource = io.Source.fromFile("/tmp/Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val years = Array.ofDim[String](1000, 1)
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
val iteratorMatrix = for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
val Array(theme,title,artist,year,spotify_url) = line.split(regex, -1).map(_.trim)
years(count) = Array(year)
(theme,title,artist,year,spotify_url)
}
// will actually consume the iterator and fill in globalMatrix AND years
val globalMatrix = iteratorMatrix.toList
Here's a function that will get the col column from the CSV. There is no error handling here for any empty row or other conditions. This is a proof of concept so add your own error handling as you see fit.
val years = (fileName: String, col: Int) => scala.io.Source.fromFile(fileName)
.getLines()
.map(_.split(",")(col).trim())
Here's a suggestion if you are looking to keep the contents of the file in a map. Again there's no error handling just proof of concept.
val yearColumn = 3
val fileName = "Top_1_000_Songs_To_Hear_Before_You_Die.csv"
def lines(file: String) = scala.io.Source.fromFile(file).getLines()
val mapRow = (row: String) => row.split(",").zipWithIndex.foldLeft(Map[Int, String]()){
case (acc, (v, idx)) => acc.updated(idx,v.trim())}
def mapColumns = (values: Iterator[String]) =>
values.zipWithIndex.foldLeft(Map[Int, Map[Int, String]]()){
case (acc, (v, idx)) => acc.updated(idx, mapRow(v))}
val parser = lines _ andThen mapColumns
val matrix = parser(fileName)
val years = matrix.flatMap(_.swap._1.get(yearColumn))
This will build a Map[Int,Map[Int, String]] which you can use elsewhere. The first index of the map is the row number and the index of the inner map is the column number. years is an Iterable[String] that contains the year values.
Consider filling a collection as it is created, rather than allocating space first and then updating it; for instance like this:
val rawSongsInfo = io.Source.fromFile("Top_Songs.csv").getLines
val cols = for (rsi <- rawSongsInfo) yield rsi.split(",").map(_.trim)
val years = cols.map(_(3))
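The column extraction and the quote-aware separator from the earlier answer can be combined and tried on in-memory lines, without touching the file. This is a sketch only: the YearColumn object, the column helper and the sample rows in the usage note are invented for illustration, and rows that are too short are simply skipped.

```scala
object YearColumn {
  // Quote-aware separator taken from the answer above:
  // split on commas that are not inside a double-quoted field.
  private val sep = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"

  // Extract column `col` from every line that has enough fields.
  def column(lines: Iterator[String], col: Int): Vector[String] =
    lines
      .map(_.split(sep, -1).map(_.trim))
      .collect { case cols if cols.length > col => cols(col) }
      .toVector
}
```

For example, YearColumn.column(Iterator("Love,\"Hello, Goodbye\",The Beatles,1967,url"), 3) yields Vector("1967"), whereas a plain split(",") would return the artist because of the comma inside the quoted title.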

Spark: Efficient mass lookup in pair RDD's

In Apache Spark I have two RDD's. The first data : RDD[(K,V)] containing data in key-value form. The second pairs : RDD[(K,K)] contains a set of interesting key-pairs of this data.
How can I efficiently construct an RDD pairsWithData : RDD[((K,K),(V,V))], such that it contains all the elements from pairs as the key-tuple and their corresponding values (from data) as the value-tuple?
Some properties of the data:
The keys in data are unique
All entries in pairs are unique
For all pairs (k1,k2) in pairs it is guaranteed that k1 <= k2
The size of 'pairs' is only a constant factor times the size of data: |pairs| = O(|data|)
Current data sizes (expected to grow): |data| ~ 10^8, |pairs| ~ 10^10
Current attempts
Here is some example code in Scala:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
// This kind of shows the idea, but fails at runtime.
def massPairLookup1(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
keyPairs map {case (k1,k2) =>
val v1 : String = data lookup k1 head;
val v2 : String = data lookup k2 head;
((k1, k2), (v1,v2))
}
}
// Works but is O(|data|^2)
def massPairLookup2(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
// Construct all possible pairs of values
val cartesianData = data cartesian data map {case((k1,v1),(k2,v2)) => ((k1,k2),(v1,v2))}
// Select only the values whose keys are in keyPairs
keyPairs map {(_,0)} join cartesianData mapValues {_._2}
}
// Example function that find pairs of keys
// Runs in O(|data|) in real life, but cannot maintain the values
def relevantPairs(data : RDD[(Int, String)]) = {
val keys = data map (_._1)
keys cartesian keys filter {case (x,y) => x*y == 12 && x < y}
}
// Example run
val data = sc parallelize(1 to 12) map (x => (x, "Number " + x))
val pairs = relevantPairs(data)
val pairsWithData = massPairLookup2(pairs, data)
// Print:
// ((1,12),(Number 1,Number 12))
// ((2,6),(Number 2,Number 6))
// ((3,4),(Number 3,Number 4))
pairsWithData.foreach(println)
Attempt 1
First I tried just using the lookup function on data, but that throws a runtime error when executed. It seems like self is null in the PairRDDFunctions trait.
In addition, I am not sure about the performance of lookup. The documentation says: "This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to." This sounds like n lookups take O(n*|partition|) time at best, which I suspect could be optimized.
Attempt 2
This attempt works, but I create |data|^2 pairs which will kill performance. I do not expect Spark to be able to optimize that away.
Your lookup 1 doesn't work because you cannot perform RDD transformations inside workers (inside another transformation).
In the lookup 2, I don't think it's necessary to perform full cartesian...
You can do it like this:
val firstjoin = pairs.map({case (k1,k2) => (k1, (k1,k2))})
.join(data)
.map({case (_, ((k1, k2), v1)) => ((k1, k2), v1)})
val result = firstjoin.map({case ((k1,k2),v1) => (k2, ((k1,k2),v1))})
.join(data)
.map({case(_, (((k1,k2), v1), v2))=>((k1, k2), (v1, v2))})
Or in a more dense form:
val firstjoin = pairs.map(x => (x._1, x)).join(data).map(_._2)
val result = firstjoin.map({case (x,y) => (x._2, (x,y))})
.join(data).map({case(x, (y, z))=>(y._1, (y._2, z))})
I don't think you can do it more efficiently, but I might be wrong...
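To follow how the re-keying works, here is a local sketch of the same two joins on plain Scala sequences. The join helper is only a stand-in written for this example; it is not Spark's RDD.join, and the object and method names are assumptions.

```scala
object TwoJoins {
  // Local stand-in for an inner join of two keyed sequences.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] = {
    val index = right.groupBy(_._1)
    for {
      (k, a) <- left
      (_, b) <- index.getOrElse(k, Seq.empty)
    } yield (k, (a, b))
  }

  def pairsWithData[K, V](pairs: Seq[(K, K)], data: Seq[(K, V)]): Seq[((K, K), (V, V))] = {
    // First join: key each pair by k1 to pick up v1, then drop the join key.
    val first = join(pairs.map(p => (p._1, p)), data).map(_._2)
    // Second join: re-key by k2 to pick up v2, then restore the pair as the key.
    join(first.map { case (p, v1) => (p._2, (p, v1)) }, data)
      .map { case (_, ((p, v1), v2)) => (p, (v1, v2)) }
  }
}
```

Each pair is touched once per join, so the work grows with |pairs| plus |data| rather than with |data|^2 as in the cartesian attempt.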

Scala: groupBy (identity) of List Elements

I develop an application that builds pairs of words in (tokenised) text and produces the number of times each pair occurs (even when same-word pairs occur multiple times, it's OK as it'll be evened out later in the algorithm).
When I use
elements groupBy()
I want to group by the elements' content itself, so I wrote the following:
def self(x: (String, String)) = x
/**
* Maps a collection of words to a map where key is a pair of words and the
* value is number of
* times this pair
* occurs in the passed array
*/
def producePairs(words: Array[String]): Map[(String,String), Double] = {
var table = List[(String, String)]()
words.foreach(w1 =>
words.foreach(w2 =>
table = table ::: List((w1, w2))))
val grouppedPairs = table.groupBy(self)
val size = int2double(grouppedPairs.size)
return grouppedPairs.mapValues(_.length / size)
}
Now, I fully realise that this self() trick is a dirty hack. So I thought a little and came up with:
grouppedPairs = table groupBy (x => x)
This way it produced what I want. However, I still feel that I clearly miss something and there should be easier way of doing it. Any ideas at all, dear all?
Also, if you'd help me to improve the pairs extraction part, it'll also help a lot – it looks very imperative, C++ - ish right now. Many thanks in advance!
I'd suggest this:
def producePairs(words: Array[String]): Map[(String,String), Double] = {
val table = for(w1 <- words; w2 <- words) yield (w1,w2)
val grouppedPairs = table.groupBy(identity)
val size = grouppedPairs.size.toDouble
grouppedPairs.mapValues(_.length / size)
}
The for comprehension is much easier to read, and there is already a predefined function identity, which is a generalized version of your self.
You are creating a list of pairs of all words against all words by iterating over words twice, where I guess you just want the neighbouring pairs. The easiest way is to use a sliding view instead.
def producePairs(words: Array[String]): Map[(String, String), Int] = {
val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
val grouped = pairs.groupBy(t => t)
grouped.mapValues(_.size)
}
Another approach would be to fold over the list of pairs, summing the counts as you go. I'm not sure, though, that this is more efficient:
def producePairs(words: Array[String]): Map[(String, String), Int] = {
val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1))
pairs.foldLeft(Map.empty[(String, String), Int]) { (m, p) =>
m + (p -> (m.getOrElse(p, 0) + 1))
}
}
I see you are returning a relative number (Double). For simplicity I have just counted the occurrences, so you need to do the final division. I think you want to divide by the total number of pairs (words.size - 1) and not by the number of unique pairs (grouped.size), so that the relative frequencies sum up to 1.0.
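A minimal sketch of that final division, assuming neighbouring pairs are what is wanted, is shown below. The object name PairFrequencies is made up, and the denominator is the number of pairs actually produced, which equals words.size - 1 whenever there are at least two words.

```scala
object PairFrequencies {
  // Relative frequency of each neighbouring word pair; the values sum to 1.0.
  def producePairs(words: Array[String]): Map[(String, String), Double] = {
    // collect drops the single short window that sliding yields for fewer than two words
    val pairs = words.sliding(2).collect { case Array(a, b) => (a, b) }.toList
    val total = pairs.size.toDouble
    pairs.groupBy(identity).map { case (p, occ) => (p, occ.size / total) }
  }
}
```

For the words a, b, a, b, c this gives 0.5 for (a,b) and 0.25 each for (b,a) and (b,c).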
Alternative approach which is not of order O(num_words * num_words) but of order O(num_unique_words * num_unique_words) (or something like that):
def producePairs[T <% Traversable[String]](words: T): Map[(String,String), Double] = {
val counts = words.groupBy(identity).map{case (w, ws) => (w -> ws.size)}
val size = (counts.size * counts.size).toDouble
for(w1 <- counts; w2 <- counts) yield {
((w1._1, w2._1) -> ((w1._2 * w2._2) / size))
}
}