MALLET - Which weighting schema? - classification

I am using MALLET for text classification (with Naive Bayes) and I understand there is this FeatureSequence2FeatureVector() method for creating feature vectors that can be used as part of the Pipe. My question is which weighting schema is implemented when we use FeatureSequence2FeatureVector() with no arguments and FeatureSequence2FeatureVector(boolean x). With the second one, x=TRUE should result in Bernoulli Naive Bayes, I suppose. But what about the no argument and also x=FALSE versions?

By default, FeatureSequence2FeatureVector will set feature values to raw feature counts. For example, the string "dog cat dog" will map to
{ "dog": 2.0, "cat": 1.0 }
Passing true as the argument will instead produce binary (presence/absence) values:
{ "dog": 1.0, "cat": 1.0 }
Passing false behaves the same as the no-argument version, i.e. raw counts.

Related

VectorAssembler behavior and aggregating sparse data with dense

Can someone explain the behavior of VectorAssembler?
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=['CategoryID', 'CountryID', 'CityID', 'tf'],
    outputCol="features")
output = assembler.transform(tf)
output.select("features").show(truncate=False)
The code, via the show method, returns:
(262147,[0,1,2,57344,61006,80641,126469,142099,190228,219556,221426,231784],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
When I use the same variable "output" with take, I get a different return:
output.select('features').take(1)
[Row(features=SparseVector(262147, {0: 2.0, 1: 1.0, 2: 1.0, 57344: 1.0, 61006: 1.0, 80641: 1.0, 126469: 1.0, 142099: 1.0, 190228: 1.0, 219556: 1.0, 221426: 1.0, 231784: 1.0}))]
By the way, consider this case: there is a sparse array output from "tfidf", and I also have additional data (metadata) available. I need to somehow aggregate sparse arrays in PySpark DataFrames with metadata for an LSH algorithm. I've tried VectorAssembler, as you can see, but it also returns a dense vector. Maybe there are tricks to combine the data and still have sparse data as output.
Only the format of the two returns is different; in both cases, you actually get the same sparse vector.
In the first case, you get a sparse vector with 3 elements: the dimension (262147) and two lists, containing respectively the indices and values of the nonzero elements. You can easily verify that the lengths of these two lists are the same, as they should be:
len([0,1,2,57344,61006,80641,126469,142099,190228,219556,221426,231784])
# 12
len([2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
# 12
In the second case you get again a sparse vector with the same first element, but here the two lists are combined into a dictionary of the form {index: value}, which again has the same length as the lists of the previous representation:
len({0: 2.0, 1: 1.0, 2: 1.0, 57344: 1.0, 61006: 1.0, 80641: 1.0, 126469: 1.0, 142099: 1.0, 190228: 1.0, 219556: 1.0, 221426: 1.0, 231784: 1.0} )
# 12
Since assembler.transform() returns a Spark DataFrame, the difference is simply due to the different display formats used by the DataFrame methods show and take, respectively.
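As a quick local check (using made-up, shortened vectors rather than the original pipeline), pyspark.ml.linalg.Vectors accepts both notations and builds the same SparseVector:
from pyspark.ml.linalg import Vectors

# same vector expressed with index & value lists vs. an {index: value} dict
v1 = Vectors.sparse(262147, [0, 1, 2], [2.0, 1.0, 1.0])
v2 = Vectors.sparse(262147, {0: 2.0, 1: 1.0, 2: 1.0})
v1 == v2  # True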
By the way, consider case [...]
It is not at all clear what exactly you are asking here, and in any case I suggest you open a new question on this with a reproducible example, since it sounds like a different subject...

Dynamic Json Keys in Scala

I'm new to Scala (coming from Python) and I'm trying to create a JSON object that has dynamic keys. I would like to use some starting number as the top-level key and then combinations involving that number as second-level keys.
From reading the play-json docs/examples, I've seen how to build these nested structures. While that would work for the top-level keys (there are only 17 of them), this is a combinatorial problem and the power set contains ~130k combinations that would become the second-level keys, so it isn't feasible to write that structure out. I also saw the use of a case class for structures; however, the parameter name becomes the key in those instances, which is not what I'm looking for.
Currently, I'm considering using HashMaps with the MultiMap trait, so that I can map multiple combinations to the same original starting number, with the second-level keys being the combinations themselves.
I have Python code that does this, but it takes 3-4 days to work through up-to-9-number combinations for all 17 starting numbers. The ideal final format would look something like below.
Perhaps it isn't possible in Scala, given the goal of using immutable structures. I suppose using regex on a string of the output might be an option as well. I'm open to any solutions regarding data structures to hold the info and how to approach the problem. Thanks!
{
  "2": {
    "(2, 3, 4, 5, 6)": {
      "best_permutation": "(2, 4, 3, 5, 6)",
      "amount": 26.0
    },
    "(2, 4, 5, 6)": {
      "best_permutation": "(2, 5, 4, 6)",
      "amount": 21.0
    }
  },
  "3": {
    "(3, 2, 4, 5, 6)": {
      "best_permutation": "(3, 4, 2, 5, 6)",
      "amount": 26.0
    },
    "(3, 4, 5, 6)": {
      "best_permutation": "(3, 5, 4, 6)",
      "amount": 21.0
    }
  }
}
EDIT:
There is no real data source other than the matrix I'm using as my lookup table. I've posted the links to the lookup table I'm using and the program if it might help, but essentially, I'm generating the content myself within the code.
For a given combination, I have a function that takes the first value of the combination (which is to be the starting point) and then uses the tail of that combination to generate the permutations.
After that, I prepend the starting location to the front of each permutation and use sliding(2) to work my way through it. For each sliding pair, I look up the amount in a breeze.linalg.DenseMatrix by using the two values to index the matrix provided below (subtracting 1 from each value to account for the 0-based indexing) and sum the amounts gathered along the way.
At this point, it is just a matter of gathering the information (starting_location, combination, best_permutation and the amount) and constructing the nested HashMap. I'm using Scala 2.11.8, if it makes any difference.
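For reference, the per-permutation amount described above boils down to something like the following Python sketch (the names are illustrative, and this mirrors the Python version the question mentions rather than the Scala one):
def route_amount(start, permutation, matrix):
    # Prepend the starting location, then walk the route in sliding pairs,
    # indexing the lookup matrix (0-based, hence the -1) and summing amounts.
    route = [start] + list(permutation)
    return sum(matrix[a - 1][b - 1] for a, b in zip(route, route[1:]))

# e.g. with a made-up 3x3 lookup table of amounts between locations 1..3:
table = [[0, 5, 9],
         [5, 0, 2],
         [9, 2, 0]]
route_amount(1, [3, 2], table)  # 9 + 2 = 11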
MATRIX: see here.
PROGRAM: see here.

Tinkerpop3 - degree centrality

I'm looking to find the most liked nodes, so basically the degree centrality recipe. This query kind of works, but I'd like to return the full vertex (including properties) rather than just the ids.
(I am using TinkerPop 3.0.1-incubating.)
g.V()
.where( inE("likes") )
.group()
.by()
.by( inE("likes").count() )
Result
{
  "8240": [2],
  "8280": [1],
  "12376": [1],
  "24704": [1],
  "40976": [1]
}
You're probably looking for the order step, using an anonymous traversal passed to the by() modulator:
g.V().order().by(inE('likes').count(), decr)
Note: this will require iterating over all vertices in Titan v1.0.0, and the query cannot be optimized; it will only work well over smaller graphs in OLTP.
To get the 10 most liked:
g.V().order().by(inE('likes').count(), decr).limit(10)
If you want to get the full properties, simply chain .valueMap() or .valueMap(true) (for id and label) on the query.
See also:
http://tinkerpop.apache.org/docs/3.0.1-incubating/#order-step
https://groups.google.com/d/topic/gremlin-users/rt3qRKyAqts/discussion
GraphSON, as it is JSON based, does not support the conversion of complex objects to keys. A key in JSON must be string based and, as in this case, cannot be a map. To deal with this JSON limitation, GraphSON converts complex objects that are to be keys via Java's toString(), or by other methods for certain objects like graph elements (which return a string representation of the element identifier, explaining why you received the output that you did).
If you want to return properties of elements while using GraphSON, you will have to figure out a workaround. In this specific case, you might do:
gremlin> graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().group().
......1> by(id).
......2> by(union(__(), outE('knows').count()).fold())
==>[1:[v[1],2],2:[0,v[2]],3:[v[3],0],4:[0,v[4]],5:[v[5],0],6:[0,v[6]]]
In this way, you get the vertex identifier as the map key, and in the value you get the full vertex plus the count. TinkerPop is working on improving this, but I don't expect a fast fix.

Database language organization, by word or by language?

I seek to add a dictionary for sentences (most of which are single words), but am undecided whether I should organize them by word, such as:
{word:"bacon", en:"bacon", ro:"sunca", fr:"jambon"}
or by language:
{
  en: { bacon: bacon },
  ro: { bacon: sunca },
  fr: { bacon: jambon }
}
I realize both have pros and cons and are equally valid, but I am seeking the wisdom of those who have met this problem before, made a choice, and are either happy with it or regret it, and of course, why.
Thank you.
The representation below is simple and elegant, but the document representation in MongoDB (or most NoSQL databases, for that matter) is heavily influenced by the usage pattern of the data.
{word:"bacon", en:"bacon", ro:"sunca", fr:"jambon"}
This representation has the merits below, assuming you want to look up the translation into the other languages by passing in the word:
Intuitive
No duplication
You can have an index on the word (see the sketch after this list)
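As an illustration of that last point, here is a minimal pymongo sketch (the database and collection names are hypothetical):
from pymongo import MongoClient

words = MongoClient().dictionary_db.words      # hypothetical db/collection names
words.create_index("word", unique=True)        # fast lookups keyed on the word
words.insert_one({"word": "bacon", "en": "bacon", "ro": "sunca", "fr": "jambon"})
doc = words.find_one({"word": "bacon"})        # served by the index
print(doc["fr"])                               # jambon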
Of the two options that you provide, the first one is better. But since there will only ever be so many languages, in your case I would actually go for arrays:
LANG = { en: 0, ro: 1, fr: 2 }
DICT = [
  [:ham, :sunca, :jambon],
  [:egg, :ou, :œuf]
]
For translation, write e.g.:
def translate( word, from: nil, to: nil )
  from = LANG[from]
  to = LANG[to]
  i = DICT.index { |entry| entry[from] == word }
  return DICT[i][to]
end
translate( :egg, from: :en, to: :fr )
#=> :œuf
Please note that although effort has been made to minimize the size of the dictionary, as for speed, faster matching algorithms are available, using e.g. suffix trees.

PyBrain how to interpret the results from net.activate?

I've trained a network with PyBrain for the purpose of classification and am ready to fire away with specific input. However, when I do
classes = ['apple', 'orange', 'peach', 'banana']
data = ClassificationDataSet(len(input), 1, nb_classes=len(classes), class_labels=classes)
data._convertToOneOfMany( ) # recommended by PyBrain
fnn = buildNetwork( data.indim, 5, data.outdim, outclass=SoftmaxLayer )
trainer = BackpropTrainer( fnn, dataset=data, momentum=m, verbose=True, weightdecay=wd)
trainer.trainUntilConvergence(maxEpochs=80)
# stop training and start using my trained network here
output = fnn.activate(input)
As expected, I get a numeric value for "output", but is there a way to determine the predicted class label directly? Even if there's not one, how can I map the value of "output" to my class label? Thank you for your help.
When you say you get a numeric value for "output", do you mean a scalar (that is, not an array)? From my understanding, you should have gotten an array of four values (i.e., as many values as you have output classes). The position of the biggest value in that array corresponds to the index of the predicted class. I don't know if PyBrain provides a utility function to extract that, but you can do it like this:
class_index = max(xrange(len(output)), key=output.__getitem__)  # argmax over the activations
class_name = classes[class_index]
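Assuming output is a NumPy array, which is what PyBrain's activate() normally returns, the same thing can be written more directly:
class_index = int(output.argmax())  # index of the largest activation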
Incidentally, you omitted the step in which you actually fill the data in the dataset.
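For reference, that missing step looks roughly like this (the feature vectors and labels here are made up for illustration; samples must be added before _convertToOneOfMany() is called):
# hypothetical training samples: each is (feature vector, class label)
samples = [([0.1, 0.9, 0.2], 'apple'), ([1.5, 0.3, 0.8], 'banana')]
for features, label in samples:
    # the target is the integer index of the class in `classes`
    data.addSample(features, [classes.index(label)])
data._convertToOneOfMany()  # then expand targets to one-of-many encoding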