PySpark labeled points aggregation - pyspark

I am performing a binary classification using LabeledPoint. I then attempt to sum() the number of labeled points with label 1.0 to verify the classification.
I have labelled an RDD as follows:
lp_RDD = RDD.map(lambda x: LabeledPoint(1 if (flag in x[0]) else 0, x[1]))
I thought perhaps I could get a count of how many have been labelled with 1 using:
cnt = lp_RDD.map(lambda x: x[0]).sum()
But I get the following error:
'LabeledPoint' object does not support indexing
I have verified the labeled RDD is correct by printing the entire RDD and searching for the string "LabeledPoint(1.0". I was simply wondering if there is a shortcut to get this count with a sum?

LabeledPoint has a label member which can be used to find the count or sum. Please try:
cnt = lp_RDD.map(lambda x: x.label).sum()

Related

How to convert a type Any List to a type Double (Scala)

I am new to Scala and I would like to understand some basic stuff.
First of all, I need to calculate the average of a certain column of a DataFrame and use the result as a double type variable.
After some Internet research I was able to calculate the average and at the same time store it in a List of type Any by using the following command:
val avgX_List = mainDataFrame.groupBy().agg(mean("_c1")).collect().map(_(0)).toList
where "_c1" is the second column of my dataframe. This line of code returns a List with type List[Any].
To pass the result into a variable I used the following command:
var avgX = avgX_List(0)
hoping that the var avgX would automatically be of type Double, but that obviously didn't happen.
So now let the questions begin:
What does map(_(0)) do? I know the basic definition of the map() transformation, but I can't find an explanation for this exact argument.
I know that by using the .toList method at the end of the command my result will be a List of type Any. Is there a way to change this into a List containing Double elements, or even to convert this one?
Do you think that it would be much more appropriate to pass the column of my DataFrame into a List[Double] and then calculate the average of its elements?
Is the solution I showed above correct in any respect for my problem? I know that "it is working" is different from "correct solution".
Summing up, I need to calculate the average of a certain column of a Dataframe and have the result as a double type variable.
Note that: I am Greek and I find it hard sometimes to understand some English coding "slang".
map(_(0)) is a shortcut for map( (r: Row) => r(0) ), which is in turn a shortcut for map( (r: Row) => r.apply(0) ). The apply method returns Any, and so you are losing the right type. Try using map(_.getAs[Double](0)) or map(_.getDouble(0)) instead.
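For intuition, here is a tiny illustrative sketch (not from the original question; it assumes mainDataFrame and the mean import are in scope) showing the element type you end up with for each variant:

import org.apache.spark.sql.functions.mean

val rows = mainDataFrame.groupBy().agg(mean("_c1")).collect()    // Array[Row] with a single Row
val asAny: Array[Any] = rows.map(_(0))                           // apply(0) erases the type to Any
val asDouble: Array[Double] = rows.map(_.getDouble(0))           // keeps the Double type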
Collecting all entries of the column and then computing the average would be highly counterproductive, because you'd have to send huge amounts of data to the master node, and then do all the calculations on this single central node. That would be the exact opposite of what Spark is good for.
You also don't need collect(...).toList, because you can access the 0-th entry directly (it doesn't matter whether you get it from an Array or from a List). Since you are collapsing everything into a single Row anyway, you could get rid of the map step entirely by reordering the methods a little bit:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).collect()(0).getDouble(0)
It can be written even shorter using the first method:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).first().getDouble(0)
An Any value in Scala can't be directly converted to Double. Use toString and then toDouble on the final captured result. For example:
scala> x
res22: Any = 1.0
scala> x.toString.toDouble
res23: Double = 1.0
Note: instead of using map(...).toList, use (0)(0) directly to get the final value from your result set.
Test sample (Scala):
val wa = Array("one","two","two")
val wrdd = sc.parallelize(wa,3).map(x=>(x,1))
val wdf = wrdd.toDF("col1","col2")
val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
Output:
scala> val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
x: Double = 1.0

PySpark filtering gives inconsistent behavior

So I have a data set where I do some transformations and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
    df = getattr(self, name)
    df = df.where(df[frequency_col] >= threshold)
    setattr(self, name, df)
    return self
The problem is a very strange behavior where if I put a rather high threshold like 10, it works fine, filtering out all the rows below 10. But if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that before this step I perform some other transformations, and I've noticed weird behavior in Spark in the past: sometimes doing very simple things like this after a join or a union gives very strange results, and eventually the only solution is to write out the data, read it back in, and do the operation in a completely separate script. I hope there is a better solution than this!

How to use RowMatrix.columnSimilarities (similarity search)

TL;DR: I am trying to train off of an existing data set (Seq[Words] with corresponding categories), and use that trained data set to filter another data set using category similarity.
I am trying to train a corpus of data and then use it for text analysis*. I've tried using NaiveBayes, but that seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
So, I am now trying to use TFIDF and passing that output into a RowMatrix and computing the similarities. But, I'm not sure how to run my query (one word for now). Here's what I've tried:
val rddOfTfidfFromCorpus : RDD[Vector]
val query = "word"
val tf = new HashingTF().transform(List(query))
val tfIDF = new IDF().fit(sc.makeRDD(List(tf))).transform(tf)
val mergedVectors = rddOfTfidfFromCorpus.union(sc.makeRDD(List(tfIDF)))
val similarities = new RowMatrix(mergedVectors).columnSimilarities(1.0)
Here is where I'm stuck (if I've even done everything right up to here). I tried filtering the similarities i and j down to the parts of my query's TFIDF and ended up with an empty collection.
The gist is that I want to train on a corpus of data and find what category the query falls in. The above code is at least trying to get it down to one category and to check if I can get a prediction from that.
*Note that this is a toy example, so I only need something that works well enough
*I am using Spark 1.4.0
Using columnSimilarities doesn't make sense here. Since each column in your matrix represents a set of terms, you'll get a matrix of similarities between tokens, not documents. You could transpose the matrix and then use columnSimilarities, but as far as I understand, what you want is the similarity between the query and the corpus. You can express that using matrix multiplication as follows:
For starters you'll need an IDFModel you've trained on the corpus. Let's assume it is called idf:
import org.apache.spark.mllib.feature.IDFModel
val idf: IDFModel = ??? // Trained using corpus data
and a small helper:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.rdd.RDD

def toBlockMatrix(rdd: RDD[Vector]) = new IndexedRowMatrix(
  rdd.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }
).toCoordinateMatrix.toBlockMatrix
First let's convert the query to an RDD and compute TF:
val query: Seq[String] = ???
val queryTf = new HashingTF().transform(sc.parallelize(Seq(query))) // one document per RDD element
Next we can apply IDF model and convert result to matrix:
val queryTfidf = idf.transform(queryTf)
val queryMatrix = toBlockMatrix(queryTfidf)
We'll need a corpus matrix as well:
val corpusMatrix = toBlockMatrix(rddOfTfidfFromCorpus)
If you multiply both, you get a matrix with the number of rows equal to the number of docs in the query and the number of columns equal to the number of documents in the corpus.
val dotProducts = queryMatrix.multiply(corpusMatrix.transpose)
To get a proper cosine similarity you have to divide each entry by the product of the corresponding magnitudes, but you can handle that yourself; a rough sketch follows below.
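A rough sketch of that normalization step (my addition, not part of the original answer; it collects the row norms to the driver, which is only reasonable for a toy-sized corpus like this one):

import org.apache.spark.mllib.linalg.Vectors

// L2 norms of the query rows and corpus rows, keyed by the same row indices used in toBlockMatrix
val queryNorms = queryTfidf.map(v => Vectors.norm(v, 2.0)).zipWithIndex.map(_.swap).collectAsMap()
val corpusNorms = rddOfTfidfFromCorpus.map(v => Vectors.norm(v, 2.0)).zipWithIndex.map(_.swap).collectAsMap()

// Divide each dot product by the product of the corresponding magnitudes
val cosineSimilarities = dotProducts.toCoordinateMatrix.entries.map { e =>
  (e.i, e.j, e.value / (queryNorms(e.i) * corpusNorms(e.j)))
}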
There are two problems here. First of all, it is rather expensive. Moreover, I am not sure if it is really useful. To reduce the cost you can apply some dimensionality reduction algorithm first, but let's leave that for now.
Judging from the following statement
NaiveBayes (...) seems to only work with the data you have, so it's predict algorithm will always return something, even if it doesn't match anything.
I guess you want some kind of unsupervised learning method. The simplest thing you can try is K-means:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
val numClusters: Int = ???
val numIterations = 20
val model = KMeans.train(rddOfTfidfFromCorpus, numClusters, numIterations)
val predictions = model.predict(queryTfidf)
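As a follow-up sketch of my own (not part of the answer above): predictions is an RDD[Int] of cluster indices, so one crude way to use it is to keep the corpus documents that land in the same cluster(s) as the query:

val corpusClusters = model.predict(rddOfTfidfFromCorpus)   // RDD[Int], one cluster id per corpus document
val queryClusters = predictions.collect().toSet            // cluster id(s) assigned to the query
val candidateDocs = rddOfTfidfFromCorpus.zip(corpusClusters)
  .filter { case (_, cluster) => queryClusters.contains(cluster) }
  .keys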

Imputing the dataset with mean of class label causing crash in filter operation

I have a csv file that contains numeric values.
val row = withoutHeader.map { line =>
  val arr = line.split(',')
  for (h <- 0 until arr.length) {
    if (arr(h).trim == "") {
      val abc = avgrdd.filter { case ((x, y), z) => x == h && y == arr(dependent_col_index).toDouble } // crashing here
      arr(h) = // imputing with the value above
    }
  }
  arr.mkString(",")
}
This is a snippet of the code where I am trying to impute the missing values with the mean of class labels.
avgrdd contains the averages for key-value pairs, where the key is (column index, class label value). This avgrdd is calculated using combiners, which I can see are producing the correct results.
dependent_col_index is the column containing the class labels.
The line with the filter is crashing with a NullPointerException.
On removing this line, the original array is the output (comma separated).
I am confused about why the filter operation is causing a crash.
Please suggest how to fix this issue.
Example
col1,dependent_col_index
4,1
8,0
,1
21,1
21,0
,1
25,1
,0
34,1
The mean for class 1 is 84/4 = 21 and for class 0 it is 29/2 = 14.5.
Required Output
4,1
8,0
21,1
21,1
21,0
21,1
25,1
14.5,0
34,1
Thanks !!
You are trying to execute an RDD transformation inside another RDD transformation. Remember that you cannot use an RDD inside another RDD transformation; this causes an error.
The way to proceed is the following:
1. Transform the source RDD withoutHeader into an RDD of pairs <Class, Value> of the correct type (Long in your case). Cache it.
2. Calculate avgrdd on top of withoutHeader. This should be an RDD of pairs <Class, AvgValue>.
3. Join the withoutHeader RDD and avgrdd together - this way for each row you have the structure <Class, <Value, AvgValue>>.
4. Execute a map on top of the result to replace each missing Value with AvgValue (see the sketch below).
Another option might be to split the RDD into two parts at step 3 (one part - the RDD with missing values, the second - the RDD with non-missing values), join avgrdd only with the RDD containing missing values, and after that make a union between these two parts. This would be faster if you have a small fraction of missing values.
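Here is a minimal sketch of steps 1-4 for a single value column, assuming (as in the example above) withoutHeader is an RDD[String] of "value,label" lines; all names here are illustrative, not from the original code:

import org.apache.spark.rdd.RDD

// Step 1: parse into (classLabel, Option(value)) pairs and cache; the class label is kept as a String key for simplicity
val pairs: RDD[(String, Option[Double])] = withoutHeader.map { line =>
  val arr = line.split(",", -1)
  val value = if (arr(0).trim.isEmpty) None else Some(arr(0).toDouble)
  (arr(1).trim, value)
}.cache()

// Step 2: per-class average of the non-missing values
val avgByClass: RDD[(String, Double)] = pairs
  .flatMap { case (label, value) => value.map(v => (label, (v, 1L))) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }

// Steps 3-4: join on the class label and fill each missing value with its class average
val imputed: RDD[String] = pairs.join(avgByClass).map {
  case (label, (value, avg)) => s"${value.getOrElse(avg)},$label"
}

Note that a plain join does not preserve the original row order; if that matters, attach an index with zipWithIndex before the join and sort by it at the end.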

Sum of elements based on grouping an element using Scala?

I have the following Scala list:
List((192.168.1.1,8590298237), (192.168.1.1,8590122837), (192.168.1.1,4016236988),
(192.168.1.1,1018539117), (192.168.1.1,2733649135), (192.168.1.2,16755417009),
(192.168.1.1,3315423529), (192.168.1.2,1523080027), (192.168.1.1,1982762904),
(192.168.1.2,6148851261), (192.168.1.1,1070935897), (192.168.1.2,276531515092),
(192.168.1.1,17180030107), (192.168.1.1,8352532280), (192.168.1.3,8590120563),
(192.168.1.3,24651063), (192.168.1.3,4431959144), (192.168.1.3,8232349877),
(192.168.1.2,17493253102), (192.168.1.2,4073818556), (192.168.1.2,42951186251))
I want the following output:
List((192.168.1.1, sum of all values of 192.168.1.1),
(192.168.1.2, sum of all values of 192.168.1.2),
(192.168.1.3, sum of all values of 192.168.1.3))
How do I get the sum of the second elements from the list by grouping on the first element, using Scala?
Here you can use the groupBy function in Scala. You do, however, have some issues in your input data: the IP addresses (or whatever they are) must be Strings and the numbers Longs. Here is an example of the groupBy function:
val data = ??? // Your list
val sumList = data.groupBy(_._1).map(x => (x._1, x._2.map(_._2).sum)).toList
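For example, with the input typed as suggested (a quick illustrative run, not from the original answer):

val data: List[(String, Long)] = List(
  ("192.168.1.1", 8590298237L),
  ("192.168.1.1", 4016236988L),
  ("192.168.1.2", 16755417009L)
)
val sumList = data.groupBy(_._1).map(x => (x._1, x._2.map(_._2).sum)).toList
// e.g. List((192.168.1.1,12606535225), (192.168.1.2,16755417009)) - order is not guaranteed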
If the answer is correct, accept it, or comment and I'll explain some more.