How can I create a TF-IDF for Text Classification using Spark? - scala

I have a CSV file with the following format :
product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]
The product_idX is a integer and the product_titleX is a String, example :
453478692, Apple iPhone 4 8Go
I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.
I am using Spark for Scala so far and using the tutorials I have found on the official page and the Berkley AmpCamp 3 and 4.
So I'm reading the file :
val file = sc.textFile("offers.csv")
Then I'm mapping it in tuples RDD[Array[String]]
val tuples = file.map(line => line.split(",")).cache
and after I'm transforming the tuples into pairs RDD[(Int, String)]
val pairs = tuples.(line => (line(0),line(1)))
But I'm stuck here and I don't know how to create the Vector from it to turn it into TFIDF.
Thanks

To do this myself (using pyspark), I first started by creating two data structures out of the corpus. The first is a key, value structure of
document_id, [token_ids]
The second is an inverted index like
token_id, [document_ids]
I'll call those corpus and inv_index respectively.
To get tf we need to count the number of occurrences of each token in each document. So
from collections import Counter
def wc_per_row(row):
cnt = Counter()
for word in row:
cnt[word] += 1
return cnt.items()
tf = corpus.map(lambda (x, y): (x, wc_per_row(y)))
The df is simply the length of each term's inverted index. From that we can calculate the idf.
df = inv_index.map(lambda (x, y): (x, len(y)))
num_documnents = tf.count()
# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
import math.log10
idf = df.map(lambda (k, v): (k, 1. + log10(num_documents/v))).collect()
Now we just have to do a join on the term_id:
def calc_tfidf(tf_tuples, idf_tuples):
return [(k1, v1 * v2) for (k1, v1) in tf_tuples for
(k2, v2) in idf_tuples if k1 == k2]
tfidf = tf.map(lambda (k, v): (k, calc_tfidf(v, idf)))
This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.
And of course, it requires first tokenizing and creating a mapping from each uniq token in the vocabulary to some token_id.
If anyone can improve on this, I'm very interested.

Related

Vector vs Vectors in spark [duplicate]

This question already has an answer here:
Difference between spark Vectors and scala immutable Vector?
(1 answer)
Closed 4 years ago.
i am a newbie in Spark.
I am trying to read a text file that has data like:
timestamp id counter value
00:01 1 c1 0.5
00:02 5 c3 0.3
00:03 1 c2 0.1
00:04 2 c2 0.13
and transform them to:
(id, array_of_counters):
(1, [ c1 c2 ])
[ 0.5 0.1]
So, for every id, i create an 2d array, which will have every counter and every value for that specific id in the text file.
I tried to do it with Vectors but i think that what is stored in them, must be double and that i cannot add two vectors, except the case they are breeze Vectors.
Then, i found out there is a data structure called just Vector but i can't find any details about it.
So, my question is what are the main differences between Vector and Vectors in mllib?
Code:
val inputRdd = sc.textFile(inputFile).map(x => x.split(","))
val data = inputRdd.map(y => (y(1), Vector(y(2), y(3)))).reduceByKey(_++_)
I don't think a Vector is necessary or appropriate for what it appears you are trying to do here (I could be wrong, we need more specifics on what you want to accomplish). The only way it makes sense is if there is a fixed number of counters (c1, c2, etc...) for each id. If you simply want a set of every id, with it's corresponding list of counters and values, try this (I'm assuming counters are unique to each id):
val data = inputRdd
.map(y => (y(1).toLong, y(2), y(3).toDouble))
.toDF("id", "counter", "value")
.groupBy("id")
.agg(collect_list(map($"counter", $"value")))
.as[(Long, Seq[Map[String, Double]])]
.map(r => (r._1, r._2.reduce(_++_)))
//this results in a Dataset[(Long, Map[String, Double])]
A spark ml.linalg.Vector is basically an Array[Double], and would require a fixed number of counter for every record. You could tranform from the data above into a vector by ordering the Map[String, Double] by it's ._1 and creating a Vector from it's .values.
ml.linalg.Vectors is just a helper object with functions for creating Vector objects.
Factory methods for org.apache.spark.ml.linalg.Vector. We don't use the name Vector because Scala imports scala.collection.immutable.Vector by default.
It's also worth noting that mllib is intended for the older RDD API while ml is intended for the newer Dataframe/Dataset API.
Edit: RDD[(Long, Seq[(String, Double)])]
val data = inputRdd
.map(y => (y(1).toLong, Seq[(String, Double)]((y(2), y(3).toDouble))))
.reduceByKey(_++_)

Comparing Subsets of an RDD

I’m looking for a way to compare subsets of an RDD intelligently.
Lets say I had an RDD with key/value pairs of type (Int->T). I eventually need to say “compare all values of key 1 with all values of key 2 and compare values of key 3 to the values of key 5 and key 7”, how would I go about doing this efficiently?
The way I’m currently thinking of doing it is by creating a List of filtered RDDs and then using RDD.cartesian()
def filterSubset[T] = (b:Int, r:RDD[(Int, T)]) => r.filter{case(name, _) => name == b}
Val keyPairs:(Int, Int) // all key pairs
Val rddPairs = keyPairs.map{
case (a, b) =>
filterSubset(a,r).cartesian(filterSubset(b,r))
}
rddPairs.map{whatever I want to compare…}
I would then iterate the list and perform a map on each of the RDDs of pairs to gather the relational data that I need.
What I can’t tell about this idea is whether it would be extremely inefficient to set up possibly of hundreds of map jobs and then iterate through them. In this case, would the lazy valuation in spark optimize the data shuffling between all of the maps? If not, can someone please recommend a possibly more efficient way to approach this issue?
Thank you for your help
One way you can approach this problem is to replicate and partition your data to reflect key pairs you want to compare. Lets start with creating two maps from the actual keys to the temporary keys we'll use for replication and joins:
def genMap(keys: Seq[Int]) = keys
.zipWithIndex.groupBy(_._1)
.map{case (k, vs) => (k -> vs.map(_._2))}
val left = genMap(keyPairs.map(_._1))
val right = genMap(keyPairs.map(_._2))
Next we can transform data by replicating with new keys:
def mapAndReplicate[T: ClassTag](rdd: RDD[(Int, T)], map: Map[Int, Seq[Int]]) = {
rdd.flatMap{case (k, v) => map.getOrElse(k, Seq()).map(x => (x, (k, v)))}
}
val leftRDD = mapAndReplicate(rddPairs, left)
val rightRDD = mapAndReplicate(rddPairs, right)
Finally we can cogroup:
val cogrouped = leftRDD.cogroup(rightRDD)
And compare / filter pairs:
cogrouped.values.flatMap{case (xs, ys) => for {
(kx, vx) <- xs
(ky, vy) <- ys
if cosineSimilarity(vx, vy) <= threshold
} yield ((kx, vx), (ky, vy)) }
Obviously in the current form this approach is limited. It assumes that values for arbitrary pair of keys can fit into memory and require a significant amount of network traffic. Still it should give you some idea how to proceed.
Another possible approach is to store data in the external system (for example database) and fetch required key-value pairs on demand.
Since you're trying to find similarity between elements I would also consider completely different approach. Instead of naively comparing key-by-key I would try to partition data using custom partitioner which reflects expected similarity between documents. It is far from trivial in general but should give much better results.
Using Dataframe you can easily do the cartesian operation using join:
dataframe1.join(dataframe2, dataframe1("key")===dataframe2("key"))
It will probably do exactly what you want, but efficiently.
If you don't know how to create an Dataframe, please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes

Spark: Efficient mass lookup in pair RDD's

In Apache Spark I have two RDD's. The first data : RDD[(K,V)] containing data in key-value form. The second pairs : RDD[(K,K)] contains a set of interesting key-pairs of this data.
How can I efficiently construct an RDD pairsWithData : RDD[((K,K)),(V,V))], such that it contains all the elements from pairs as the key-tuple and their corresponding values (from data) as the value-tuple?
Some properties of the data:
The keys in data are unique
All entries in pairs are unique
For all pairs (k1,k2) in pairs it is guaranteed that k1 <= k2
The size of 'pairs' is only a constant the size of data |pairs| = O(|data|)
Current data sizes (expected to grow): |data| ~ 10^8, |pairs| ~ 10^10
Current attempts
Here is some example code in Scala:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
// This kind of show the idea, but fails at runtime.
def massPairLookup1(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
keyPairs map {case (k1,k2) =>
val v1 : String = data lookup k1 head;
val v2 : String = data lookup k2 head;
((k1, k2), (v1,v2))
}
}
// Works but is O(|data|^2)
def massPairLookup2(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
// Construct all possible pairs of values
val cartesianData = data cartesian data map {case((k1,v1),(k2,v2)) => ((k1,k2),(v1,v2))}
// Select only the values who's keys are in keyPairs
keyPairs map {(_,0)} join cartesianData mapValues {_._2}
}
// Example function that find pairs of keys
// Runs in O(|data|) in real life, but cannot maintain the values
def relevantPairs(data : RDD[(Int, String)]) = {
val keys = data map (_._1)
keys cartesian keys filter {case (x,y) => x*y == 12 && x < y}
}
// Example run
val data = sc parallelize(1 to 12) map (x => (x, "Number " + x))
val pairs = relevantPairs(data)
val pairsWithData = massPairLookup2(pairs, data)
// Print:
// ((1,12),(Number1,Number12))
// ((2,6),(Number2,Number6))
// ((3,4),(Number3,Number4))
pairsWithData.foreach(println)
Attempt 1
First I tried just using the lookup function on data, but that throws an runtime error when executed. It seems like self is null in the PairRDDFunctions trait.
In addition I am not sure about the performance of lookup. The documentation says This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to. This sounds like n lookups takes O(n*|partition|) time at best, which I suspect could be optimized.
Attempt 2
This attempt works, but I create |data|^2 pairs which will kill performance. I do not expect Spark to be able to optimize that away.
Your lookup 1 doesn't work because you cannot perform RDD transformations inside workers (inside another transformation).
In the lookup 2, I don't think it's necessary to perform full cartesian...
You can do it like this:
val firstjoin = pairs.map({case (k1,k2) => (k1, (k1,k2))})
.join(data)
.map({case (_, ((k1, k2), v1)) => ((k1, k2), v1)})
val result = firstjoin.map({case ((k1,k2),v1) => (k2, ((k1,k2),v1))})
.join(data)
.map({case(_, (((k1,k2), v1), v2))=>((k1, k2), (v1, v2))})
Or in a more dense form:
val firstjoin = pairs.map(x => (x._1, x)).join(data).map(_._2)
val result = firstjoin.map({case (x,y) => (x._2, (x,y))})
.join(data).map({case(x, (y, z))=>(y._1, (y._2, z))})
I don't think you can do it more efficiently, but I might be wrong...

How to transform Array[(Double, Double)] into Array[Double] in Scala?

I'm using MLlib of Spark (v1.1.0) and Scala to do k-means clustering applied to a file with points (longitude and latitude).
My file contains 4 fields separated by comma (the last two are the longitude and latitude).
Here, it's an example of k-means clustering using Spark:
https://spark.apache.org/docs/1.1.0/mllib-clustering.html
What I want to do is to read the last two fields of my files that are in a specific directory in HDFS, transform them into an RDD<Vector> o use this method in KMeans class:
train(RDD<Vector> data, int k, int maxIterations)
This is my code:
val data = sc.textFile("/user/test/location/*")
val parsedData = data.map(s => Vectors.dense(s.split(',').map(fields => (fields(2).toDouble,fields(3).toDouble))))
But when I run it in spark-shell I get the following error:
error: overloaded method value dense with alternatives: (values:
Array[Double])org.apache.spark.mllib.linalg.Vector (firstValue:
Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[(Double, Double)])
So, I don't know how to transform my Array[(Double, Double)] into Array[Double]. Maybe there is another way to read the two fields and convert them into RDD<Vector>, any suggestion?
Previous suggestion using flatMap was based on the assumption that you wanted to map over the elements of the array given by the .split(",") - and offered to satisfy the types, by using Array instead of Tuple2.
The argument received by the .map/.flatMap functions is an element of the original collection, so should be named 'field' (singluar) for clarity. Calling fields(2) selects the 3rd character of each of the elements of the split - hence the source of confusion.
If what you're after is the 3rd and 4th elements of the .split(",") array, converted to Double:
s.split(",").drop(2).take(2).map(_.toDouble)
or if you want all BUT the first to fields converted to Double (if there may be more than 2):
s.split(",").drop(2).map(_.toDouble)
There're two 'factory' methods for dense Vectors:
def dense(values: Array[Double]): Vector
def dense(firstValue: Double, otherValues: Double*): Vector
While the provided type above is Array[Tuple2[Double,Double]] and hence does not type-match:
(Extracting the logic above:)
val parseLineToTuple: String => Array[(Double,Double)] = s => s=> s.split(',').map(fields => (fields(2).toDouble,fields(3).toDouble))
What is needed here is to create a new Array out of the input String, like this: (again focusing only on the specific parsing logic)
val parseLineToArray: String => Array[Double] = s=> s.split(",").flatMap(fields => Array(fields(2).toDouble,fields(3).toDouble)))
Integrating that in the original code should solve the issue:
val data = sc.textFile("/user/test/location/*")
val vectors = data.map(s => Vectors.dense(parseLineToArray(s))
(You can of course inline that code, I separated it here to focus on the issue at hand)
val parsedData = data.map(s => Vectors.dense(s.split(',').flatMap(fields => Array(fields(2).toDouble,fields(3).toDouble))))

How to create a map from a RDD[String] using scala?

My file is,
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
Here there are 7 rows & 5 columns(0,1,2,3,4)
I want the output as,
Map(0 -> Set("sunny","overcast","rainy"))
Map(1 -> Set("hot","mild","cool"))
Map(2 -> Set("high","normal"))
Map(3 -> Set("false","true"))
Map(4 -> Set("yes","no"))
The output must be the type of [Map[Int,Set[String]]]
EDIT: Rewritten to present the map-reduce version first, as it's more suited to Spark
Since this is Spark, we're probably interested in parallelism/distribution. So we need to take care to enable that.
Splitting each string into words can be done in partitions. Getting the set of values used in each column is a bit more tricky - the naive approach of initialising a set then adding every value from every row is inherently serial/local, since there's only one set (per column) we're adding the value from each row to.
However, if we have the set for some part of the rows and the set for the rest, the answer is just the union of these sets. This suggests a reduce operation where we merge sets for some subset of the rows, then merge those and so on until we have a single set.
So, the algorithm:
Split each row into an array of strings, then change this into an
array of sets of the single string value for each column - this can
all be done with one map, and distributed.
Now reduce this using an
operation that merges the set for each column in turn. This also can
be distributed
turn the single row that results into a Map
It's no coincidence that we do a map, then a reduce, which should remind you of something :)
Here's a one-liner that produces the single row:
val data = List(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val row = data.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
Converting it to a Map as the question asks:
val theMap = row.zipWithIndex.map(_.swap).toMap
Zip the list with the index, since that's what we need as the key of
the map.
The elements of each tuple are unfortunately in the wrong
order for .toMap, so swap them.
Then we have a list of (key, value)
pairs which .toMap will turn into the desired result.
These don't need to change AT ALL to work with Spark. We just need to use a RDD, instead of the List. Let's convert data into an RDD just to demo this:
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc= new SparkContext(conf)
val rdd = sc.makeRDD(data)
val row = rdd.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
(This can be converted into a Map as before)
An earlier oneliner works neatly (transpose is exactly what's needed here) but is very difficult to distribute (transpose inherently needs to visit every row)
data.map(_.split("\\W+")).transpose.map(_.toSet)
(Omitting the conversion to Map for clarity)
Split each string into words.
Transpose the result, so we have a list that has a list of the first words, then a list of the second words, etc.
Convert each of those to a set.
Maybe this do the trick:
val a = Array(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val b = new Array[Map[String, Set[String]]](5)
for (i <- 0 to 4)
b(i) = Map(i.toString -> (Set() ++ (for (s <- a) yield s.split(",")(i))) )
println(b.mkString("\n"))