VectorAssembler behavior and aggregating sparse data with dense - pyspark

May someone explain behavior of VectorAssembler?
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=['CategoryID', 'CountryID', 'CityID', 'tf'],
    outputCol="features")
output = assembler.transform(tf)
output.select("features").show(truncate=False)
The show method returns
(262147,[0,1,2,57344,61006,80641,126469,142099,190228,219556,221426,231784],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
When I use the same output variable with take, I get a different-looking return:
output.select('features').take(1)
[Row(features=SparseVector(262147, {0: 2.0, 1: 1.0, 2: 1.0, 57344: 1.0, 61006: 1.0, 80641: 1.0, 126469: 1.0, 142099: 1.0, 190228: 1.0, 219556: 1.0, 221426: 1.0, 231784: 1.0}))]
By the way, consider this case: there is a sparse vector output from TF-IDF, and I also have additional data (metadata) available. I need to somehow aggregate the sparse vectors in PySpark DataFrames together with the metadata for an LSH algorithm. I've tried VectorAssembler, as you can see, but it also returns a dense vector. Maybe there are some tricks to combine the data and still have sparse data as output?

Only the format of the two returns differs; in both cases, you actually get the same sparse vector.
In the first case, you get a sparse vector with three elements: the dimension (262147) and two lists, containing respectively the indices and values of the nonzero elements. You can easily verify that these lists have the same length, as they should:
len([0,1,2,57344,61006,80641,126469,142099,190228,219556,221426,231784])
# 12
len([2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
# 12
In the second case you again get a sparse vector with the same first element, but here the two lists are combined into a dictionary of the form {index: value}, which again has the same length as the lists of the previous representation:
len({0: 2.0, 1: 1.0, 2: 1.0, 57344: 1.0, 61006: 1.0, 80641: 1.0, 126469: 1.0, 142099: 1.0, 190228: 1.0, 219556: 1.0, 221426: 1.0, 231784: 1.0} )
# 12
Since assembler.transform() returns a Spark DataFrame, the difference is simply due to the different display formats used by the DataFrame methods show and take, respectively.
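For completeness, the two printed forms correspond to two equivalent ways of constructing a sparse vector. Here is a minimal sketch using Spark's Scala API (pyspark's Vectors.sparse accepts analogous forms, including a dict):

import org.apache.spark.ml.linalg.Vectors

// (size, indices, values) matches the rendering printed by show;
// a sequence of (index, value) pairs matches the dict printed by take.
val fromLists = Vectors.sparse(5, Array(0, 3), Array(2.0, 1.0))
val fromPairs = Vectors.sparse(5, Seq((0, 2.0), (3, 1.0)))
println(fromLists == fromPairs) // true: the same sparse vector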
By the way, consider case [...]
It is not at all clear what exactly you are asking here; in any case, I suggest you open a new question about it with a reproducible example, since it sounds like a different subject.

Related

Microsoft SEAL : Required negative values as a result after subtraction of two PolyCRT composed ciphertexts

Suppose I have two vectors x = [1,2,3,4] and y = [5,1,2,6].
I composed and encrypted the two arrays using PolyCRTBuilder (Ciphertextx and Ciphertexty).
If I subtract the two ciphertexts (Ciphertextx minus Ciphertexty), I should get Result = [-4, 1, 1, -2], but after the homomorphic subtraction I get ResultDecrypted = [40957, 1, 1, 40959].
I understand that because the plaintext is only defined modulo plain_modulus, we get that result. But I want the resulting negative values to be usable in the next computation. How can I assign the resulting negative values to a vector and use them for further computations?
You are using a pretty old version of SEAL if it still has PolyCRTBuilder; in newer versions of the library this has been renamed to BatchEncoder, and it supports encoding to/from std::vector<std::int64_t>, which, I believe, is what you want: with the signed encoding, residues in the upper half of [0, plain_modulus) decode directly to negative numbers.
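To see what is happening with the numbers above: from -4 ≡ 40957 and -2 ≡ 40959 it follows that plain_modulus is 40961, and a signed value is recovered from an unsigned residue by a centered reduction. A minimal sketch of that mapping (illustration only, not SEAL code):

// Map an unsigned residue in [0, t) to the centered range [-t/2, t/2).
def centered(x: Long, t: Long): Long = if (x >= t / 2) x - t else x

val t = 40961L // plain_modulus implied by the question's output
println(Seq(40957L, 1L, 1L, 40959L).map(centered(_, t))) // List(-4, 1, 1, -2)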

How to do a subtraction with BroadcastedRows in breeze

I want to do a subtraction between two dense vectors; both of them are derived from a dataset and a function. Here is an example.
1. The first dense vector is dataset(*, 2):
BroadcastedRows(DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0))
2. The second dense vector is the return value of predict_coef_sgd(dataset, coef):
DenseVector(0.2987569855650975, 0.14595105593031163, 0.08533326519733725,
0.21973731424800344, 0.24705900008926596, 0.9547021347460022,
0.8620341905282771, 0.9717729050420985, 0.9992954520878627,
0.9054893228110497)
I get an error when I subtract them:
dataset(*, 2) - predict_coef_sgd(dataset, coef)
Name: Compile Error
Message: <console>:36: error: could not find implicit value for parameter op: breeze.linalg.operators.OpSub.Impl2[breeze.linalg.BroadcastedRows[breeze.linalg.DenseVector[Double],breeze.linalg.DenseVector[Double]],breeze.linalg.DenseVector[Double],That]
dataset(*, 2) - predict_coef_sgd(dataset, coef)
^
StackTrace:
Please comment on how to convert the BroadcastedRows(DenseVector(...)) to a dense vector. Thank you.
Following John's comment (referring to "row-broadcasting and transposed vectors?"), the following solved it; however, I don't understand why, so please feel free to comment if you can explain it in detail:
(dataset.t(2,::) - predict_coef_sgd(dataset, coef).t).t
DenseVector(-0.2987569855650975, -0.14595105593031163, -0.08533326519733725,
-0.21973731424800344, -0.24705900008926596, 0.045297865253997815,
0.13796580947172288, 0.028227094957901544, 7.045479121372544E-4,
0.09451067718895034)
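For the record, the compile error happens because dataset(*, 2) is a BroadcastedRows wrapper rather than a vector, and Breeze defines no subtraction between that wrapper and a DenseVector. Slicing the column directly yields a plain DenseVector and avoids the double transpose; a minimal sketch with stand-in data (the original dataset and predict_coef_sgd are not shown in the question):

import breeze.linalg.{DenseMatrix, DenseVector}

val dataset = DenseMatrix(
  (1.0, 2.0, 0.0),
  (3.0, 4.0, 1.0))
// dataset(::, 2) slices column 2 as a plain DenseVector, so ordinary
// vector subtraction applies with no transposing.
val column2: DenseVector[Double] = dataset(::, 2)
val predictions = DenseVector(0.3, 0.9) // stand-in for predict_coef_sgd(dataset, coef)
println(column2 - predictions) // ≈ DenseVector(-0.3, 0.1)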

MALLET - Which weighting schema?

I am using MALLET for text classification (with Naive Bayes), and I understand there is this FeatureSequence2FeatureVector() method for creating feature vectors that can be used as part of the Pipe. My question is which weighting schema is implemented when we use FeatureSequence2FeatureVector() with no arguments, and which with FeatureSequence2FeatureVector(boolean x). With the second one, x = true should result in Bernoulli Naive Bayes, I suppose. But what about the no-argument and x = false versions?
By default the FeatureSequence2FeatureVector will set feature values to raw feature counts. For example, the string "dog cat dog" will map to
{ "dog": 2.0, "cat": 1.0 }
Passing true as an argument will result in
{ "dog" 1.0, "cat": 1.0 }

filling a matrix with Scala library breeze

I'm new to Scala and I'm having a mental block on a seemingly easy problem. I'm using the Scala library Breeze and need to take a mutable ArrayBuffer and put its contents into a matrix. This should be simple, but Scala is so strictly typed that Breeze seems really picky about which data types it will accept when making a DenseVector. This is just some prototype code, but can anyone help me come up with a solution?
Right now I have something like...
// 9 elements that need to go into a 3x3 matrix (1-3 as top row, 4-6 as middle row, etc.)
val numbersForMatrix: ArrayBuffer[Double] = ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)
// the empty 3x3 matrix
var M: breeze.linalg.DenseMatrix[Double] = DenseMatrix.zeros[Double](3, 3)
In breeze you can do stuff like
M(0, 0) = 100, setting the first element to 100 this way.
You can also do stuff like:
M(0, 0 to 2) := DenseVector(1, 2, 3)
which sets the first row to 1, 2, 3
But I cannot get it to do something like...
var dummyList: List[Double] = List(1, 2, 3) //this works
var dummyVec = DenseVector[Double](dummyList) //this works
M(0, 0 to 2) := dummyVec //this does not work
and have it successfully set the first row to 1, 2, 3.
And that's with a List, not even an ArrayBuffer.
I am willing to change data types from ArrayBuffer, but I'm just not sure how to approach this at all... I could try updating the matrix values one by one, but that seems like it would be very hacky to code up.
Note: I'm a Python programmer who is used to using numpy and just giving it arrays. The breeze documentation doesn't provide enough examples with other datatypes for me to have been able to figure this out yet.
Thanks!
Breeze is, in addition to pickiness over types, pretty picky about vector shape: DenseVectors are column vectors, but you are trying to assign to a subset of a row, which expects a transposed DenseVector:
M(0, 0 to 2) := dummyVec.t
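Putting it together, here is a minimal sketch of both the row assignment and one way to fill the whole matrix from the buffer. Note that Breeze's DenseMatrix(rows, cols, data) constructor reads the array column-major, hence the final transpose to get 1-3 as the top row (the intended layout is taken from the comment in the question):

import breeze.linalg.{DenseMatrix, DenseVector}
import scala.collection.mutable.ArrayBuffer

val numbersForMatrix = ArrayBuffer[Double](1, 2, 3, 4, 5, 6, 7, 8, 9)

// Row assignment: a DenseVector is a column vector, so transpose it first.
val M = DenseMatrix.zeros[Double](3, 3)
M(0, 0 to 2) := DenseVector(1.0, 2.0, 3.0).t

// Whole matrix at once: the constructor fills column by column,
// so build the transpose and flip it.
val N = new DenseMatrix(3, 3, numbersForMatrix.toArray).t
// N: 1.0  2.0  3.0
//    4.0  5.0  6.0
//    7.0  8.0  9.0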

Scala Breeze Dirichlet distribution parameter estimation

I am trying to estimate the parameters of a Dirichlet distribution for a data set using Scala's Breeze library. I already have working Python (pandas/DataFrames) and R code for it, but I was curious how to do it in Scala. Also, I am new to Scala.
I can't seem to get it to work; I guess I don't have the syntax right, or something.
The code I trying to use is here: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/distributions/Dirichlet.scala#L111
According to the code above, ExpFam[T, I] takes two type parameters, T and I. I don't know what T and I are. Can T be a DenseMatrix?
What I am doing is:
// Creating a matrix. The values are counts in my case.
val mat = DenseMatrix((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))
// Then try to get the sufficient statistics and the MLE. I think this is where I'm doing something wrong.
val diri = new ExpFam[DenseMatrix[Double], Int](mat)
println(diri.sufficientStatisticFor(mat))
Also, if one has a data matrix like DenseMatrix((1.0, 2.0, 3.0), (4.0, 5.0, 6.0)), how does one estimate the Dirichlet parameters in Scala?
I am not really very familiar with this aspect of breeze, but this works for me:
import breeze.linalg.DenseVector
import breeze.stats.distributions.Dirichlet

val data = Seq(
  DenseVector(0.1, 0.1, 0.8),
  DenseVector(0.2, 0.3, 0.5),
  DenseVector(0.5, 0.1, 0.4),
  DenseVector(0.3, 0.3, 0.4)
)
val expFam = new Dirichlet.ExpFam(DenseVector.zeros[Double](3))
val suffStat = data.foldLeft(expFam.emptySufficientStatistic) { (a, x) =>
  a + expFam.sufficientStatisticFor(x)
}
val alphaHat = expFam.mle(suffStat)
// DenseVector(2.9803000577558274, 2.325871404559782, 5.850530402841005)
The result is very close to, but not exactly the same as, what I get with my own code for maximum-likelihood estimation of Dirichlets. The difference probably just comes down to the optimizer being used (I'm using the fixed-point iteration (9) in section 1 of this paper by T. Minka) and the stopping criteria.
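For reference, that fixed-point update is, for each component k (transcribing from memory of the paper, so treat this as a sketch):

\psi(\alpha_k^{\text{new}}) = \psi\Big(\sum_j \alpha_j\Big) + \frac{1}{N}\sum_{i=1}^{N} \log p_{ik}

where \psi is the digamma function and the p_i are the observed probability vectors; each step inverts \psi to obtain \alpha_k^{\text{new}}.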
Maybe there's a better way of doing this using the Breeze API; if so, hopefully @dlwh or someone else more familiar with Breeze will chime in.
T should be DenseVector and I should be Int. ExpFams aren't vectorized right now.