I am trying to add up the values for the same key.
val final= d1.join(d2).flatMap(line => Seq(line.swap._1)).reduceByKey((x, y) =>(x+y))
d1 and d2 are data streams. After the flatMap I get key-value pairs.
However, the reduceByKey((x, y) =>(x+y)) step produces an Infinity value.
For example, if the pairs are (k1,1.0) (k1,1.0), reduceByKey((x, y) =>(x+y)) results in (k1,Infinity).
Any suggestions?
The above code snippet works. As @maasg rightly hinted, the problem was elsewhere: the error was caused by a division by zero in earlier code that I didn't post here. Thanks!
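For anyone hitting the same symptom, here is a minimal sketch in plain Python (not Spark) of why the sum looked broken: once a single infinite value enters the data, adding anything to it stays infinite. The reduce_by_key helper is hypothetical and only mimics what reduceByKey((x, y) => x + y) does. math.inf stands in for the result of a floating-point division by zero on the JVM (Python itself raises ZeroDivisionError for 1.0 / 0.0).

```python
import math
from collections import defaultdict


def reduce_by_key(pairs):
    """Hypothetical stand-in for reduceByKey with addition: sum values per key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Clean input: the reduction itself is fine.
clean = reduce_by_key([("k1", 1.0), ("k1", 1.0)])

# One upstream value is already infinite (assumed here to come from x / 0.0
# earlier in the pipeline), so the sum for that key is infinite too.
tainted = reduce_by_key([("k1", 1.0), ("k1", math.inf)])

print(clean)
print(tainted)
```

So if the reduce step reports Infinity, it is worth checking the values going into it before suspecting the addition.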
How do I pass 2 variables to a lambda function, where x is a number and y is a symbol?
I have written this, but it won't run:
{[x;y]
// some calculation with x and y
}
each ((til 5) ,\:/: `a`b`c`d`f)
It seems to be complaining that I am missing another argument.
Here's an example that I think does what you're looking for:
q){string[x],string y}./: raze (til 5) ,\:/: `a`b`c`d`f
The issue with your example is that you need to raze the output of ((til 5) ,\:/: `a`b`c`d`f) to get your list of 2 inputs.
Passing a list of variables into a function is accomplished using "." (dot apply) http://code.kx.com/q/ref/unclassified/#apply
e.g.
q){x+y} . 10 2
12
In my example, I've then used an "each right" to apply the function to each pair. http://code.kx.com/q/ref/adverbs/#each-right
Alternatively, you could use each instead if you wrapped the function in another lambda
q){{string[x],string y} . x} each raze (til 5) ,\:/: `a`b`c`d`f
Instead of generating a list of arguments using cross or ",/:\:" and passing each of these into your function, modify your function with each left each right ("/:\:") to give you all combinations. This should take the form:
x f/:\: y
Where x and y are both lists. Reusing the example {string[x],string y};
til[5] {string[x], string y}/:\:`a`b`c`d
This will give you a matrix of all combinations of x and y. If you want to flatten that matrix into a list, add a 'raze'.
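For readers who don't know q, the two ideas above (apply a two-argument function to a pair, and build all combinations of two lists) can be sketched in Python. This is only an analogy, not a translation of the q adverbs: the name join is hypothetical, argument unpacking with * plays the role of dot apply, the nested comprehension mirrors the matrix from x f/:\: y, and itertools.product gives the flattened version.

```python
from itertools import product


def join(x, y):
    """Counterpart of {string[x],string y}: concatenate both arguments as text."""
    return f"{x}{y}"


numbers = range(5)
symbols = ["a", "b", "c", "d", "f"]

# Matrix of all combinations: one row per number, one column per symbol.
matrix = [[join(x, y) for y in symbols] for x in numbers]

# Flattened list: unpack each (number, symbol) pair into the two parameters.
flat = [join(*pair) for pair in product(numbers, symbols)]

print(matrix[0])
print(len(flat))
```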
I have a list of (key,value) pairs of the form:
x=[(('cat','dog'),('a','b')),(('cat','dog'),('a','b')),(('mouse','rat'),('e','f'))]
I want to count the number of times each value tuple appears with the key tuple.
Desired output:
[(('cat','dog'),('a','b',2)),(('mouse','rat'),('e','f',1))]
A working solution is:
xs=sc.parallelize(x)
xs=xs.groupByKey()
xs=xs.map(lambda (x,y):(x,Counter(y)))
However, for large datasets, this method fills up the disk space (~600GB). I was trying to implement a similar solution using reduceByKey:
xs=xs.reduceByKey(Counter).collect()
but I get the following error:
TypeError: __init__() takes at most 2 arguments (3 given)
Here is how I usually do it:
xs=sc.parallelize(x)
a = xs.map(lambda x: (x, 1)).reduceByKey(lambda a,b: a+b)
a.collect() yields:
[((('mouse', 'rat'), ('e', 'f')), 1), ((('cat', 'dog'), ('a', 'b')), 2)]
I'm going to assume that you want the counts (here, 1 and 2) inside the second key in the (key1, key2) pair.
To achieve that, try this:
a.map(lambda x: (x[0][0], x[0][1] + (x[1],))).collect()
The last step basically remaps each element so that you get the first key pair (like ('mouse','rat')), then takes the second key pair (like ('e','f')) and appends the count x[1], wrapped in a one-element tuple, to that second key pair.
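The same count-then-reshape logic can be checked without a Spark cluster. The following is a minimal sketch in plain Python, assuming the small example list from the question; collections.Counter stands in for the map(lambda x: (x, 1)).reduceByKey(lambda a,b: a+b) step, and the names counts and result are only illustrative.

```python
from collections import Counter

x = [
    (("cat", "dog"), ("a", "b")),
    (("cat", "dog"), ("a", "b")),
    (("mouse", "rat"), ("e", "f")),
]

# Count how often each full (key, value) pair occurs, as reduceByKey does
# after every pair has been mapped to (pair, 1).
counts = Counter(x)

# Reshape ((key, value), n) into (key, value + (n,)), as in the final map.
result = [(key, value + (n,)) for (key, value), n in counts.items()]

print(result)
```

Because the whole (key, value) pair is used as the reduction key, only one integer per distinct pair has to be kept, which is why this scales better than grouping all values first.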
I'm trying to create a map which goes through all the ngrams in a document and counts how often they appear. Ngrams are sets of n consecutive words in a sentence (so in the last sentence, (Ngrams, are) is a 2-gram, (are, sets) is the next 2-gram, and so on). I already have code that creates a document from a file and parses it into sentences. I also have a function to count the ngrams in a sentence, ngramsInSentence, which returns Seq[Ngram].
I'm getting stuck syntactically on how to create my counts map. I am iterating through all the ngrams in the document in the for loop, but don't know how to map the ngrams to the count of how often they occur. I'm fairly new to Scala and the syntax is evading me, although I'm clear conceptually on what I need!
def getNGramCounts(document: Document, n: Int): Counts = {
for (sentence <- document.sentences; ngram <- nGramsInSentence(sentence,n))
//I need code here to map ngram -> count how many times ngram appears in document
}
The type Counts above, as well as Ngram, are defined as:
type Counts = Map[NGram, Double]
type NGram = Seq[String]
Does anyone know the syntax to map the ngrams from the for loop to a count of how often they occur? Please let me know if you'd like more details on the problem.
If I'm correctly interpreting your code, this is a fairly common task.
def getNGramCounts(document: Document, n: Int): Counts = {
val allNGrams: Seq[NGram] = for {
sentence <- document.sentences
ngram <- nGramsInSentence(sentence, n)
} yield ngram
allNGrams.groupBy(identity).mapValues(_.size.toDouble)
}
The allNGrams variable collects a list of all the NGrams appearing in the document.
You should eventually turn to Streams if the document is big and you can't hold the whole sequence in memory.
The following groupBy creates a Map[NGram, List[NGram]] which groups your values by their identity (the argument to the method defines the criteria for "aggregate identification") and collects the corresponding values in a list.
You then only need to map the values (the List[NGram]) to its size to get how many recurring values there were of each NGram.
I took for granted that:
NGram has the expected correct implementation of equals + hashcode
document.sentences returns a Seq[...]. If not you should expect allNGrams to be of the corresponding collection type.
UPDATED based on the comments
I wrongly assumed that the groupBy(_) would shortcut the input value. Use the identity function instead.
I converted the count to a Double
Appreciate the help - I have the correct code now using the suggestions above. The following returns the desired result:
def getNGramCounts(document: Document, n: Int): Counts = {
val allNGrams: Seq[NGram] = (for(sentence <- document.sentences;
ngram <- ngramsInSentence(sentence,n))
yield ngram)
allNGrams.groupBy(l => l).map(t => (t._1, t._2.length.toDouble))
}
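For comparison, the group-by-identity-then-take-the-size idea can be sketched in Python. Everything here is an assumption made for illustration: a document is modelled as a list of sentences, a sentence as a list of words, and ngrams_in_sentence is a hypothetical stand-in for the asker's ngramsInSentence. Tuples are used for n-grams so that they are hashable, which plays the role of the equals + hashcode requirement mentioned above.

```python
from collections import Counter


def ngrams_in_sentence(sentence, n):
    """Hypothetical helper: all runs of n consecutive words, as tuples."""
    return [tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]


def get_ngram_counts(sentences, n):
    """Map each n-gram to how often it occurs in the document, as a float."""
    all_ngrams = [
        ngram
        for sentence in sentences
        for ngram in ngrams_in_sentence(sentence, n)
    ]
    # Counter groups equal n-grams and counts each group in a single pass.
    return {ngram: float(count) for ngram, count in Counter(all_ngrams).items()}


document = [["a", "b", "a", "b"], ["b", "a"]]
counts = get_ngram_counts(document, 2)
print(counts)
```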
I have a predicate that takes two arguments and checks whether both are natural numbers in unary (successor) form and whether the first argument is bigger than the second.
Here is the code I've written, but every time it answers "no".
nat(0).
nat(s(X)) :- nat(X).
sum(X,0,X) :- nat(X).
sum(X,s(Y),s(Z)) :- sum(X,Y,Z).
gr(X,Y) :- nat(s(X)), nat(s(Y)), X>Y.
What goes wrong? Everything is in Prolog; the predicate in question is gr().
First, you probably want this for sum instead:
sum(0, Y, Y) :-
nat(Y).
sum(s(X), Y, s(Z)) :-
sum(X, Y, Z).
This is so that Prolog can recognize that the two clauses are exclusive by only looking at the first argument.
Now to your greater than:
% gr(X, Y) is true if X is greater than Y
gr(X, Y) :- sm(Y, X).
% sm(X, Y) is true if X is smaller than Y
sm(0, s(Y)) :-
nat(Y).
sm(s(X), s(Y)) :-
sm(X, Y).
To answer your actual question: what goes wrong is that the operator > works on integers (like 1 or 0 or -19), not on compound terms. The operator #> will work (see the documentation of the implementation you are using), but I have the feeling you might actually want to be explicit about it.
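To make the structural comparison concrete outside Prolog, here is a rough Python sketch of the sm/gr clauses above. The encoding is an assumption chosen for illustration: zero is the tuple ("0",) and s(N) is ("s", N). The loop peels one successor off both terms per step, which is what the recursive clause sm(s(X), s(Y)) :- sm(X, Y) does.

```python
ZERO = ("0",)


def s(n):
    """Successor of n, mirroring the Prolog term s(N)."""
    return ("s", n)


def sm(x, y):
    """True if x is smaller than y, comparing term structure only."""
    while True:
        if y == ZERO:
            return False  # nothing is smaller than zero
        if x == ZERO:
            return True   # sm(0, s(Y))
        x, y = x[1], y[1]  # sm(s(X), s(Y)) :- sm(X, Y)


def gr(x, y):
    """True if x is greater than y, i.e. y is smaller than x."""
    return sm(y, x)


two = s(s(ZERO))
one = s(ZERO)
print(gr(two, one), gr(one, two), gr(ZERO, ZERO))
```

No arithmetic operator is involved at any point, which is the key difference from the original gr/2 that used > on compound terms.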
I have a char array A which basically contains a list of files names (each row one file)
(char, 526x26)
val =
0815_5275_UBA_A_1971.txt
0815_5275_UBA_A_1972.txt
0823_6275_UBA_A_1971.txt
0823_6275_UBA_A_1972.txt
0823_6275_UBA_A_1973.txt
...
I also have a variable
B = '0815_5275'
I'd like to select all rows (filenames) that start with B and save them in a new array C.
This should be simple, but somehow I can't make it work.
I've got this:
C = A(A(:,1:9) == B);
but I get the error message:
Error using ==
Matrix dimensions must agree.
I do not know in advance how many rows will match, so I can not pre-define an empty array.
thanks, any help is appreciated!
Try C = ismember(A(:, 1:numel(B)), B, 'rows') instead to get a logical vector that indexes only the rows you want,
and then
A(C,:) to extract the rows.
You get a dimension mismatch error because A(:,1:9) has many rows but B has only one, and Matlab does not automatically broadcast like Octave or Python. You could do it using either repmat or bsxfun, but in this case ismember is the right function to choose.
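The mask-then-select pattern is the same in other languages. As a rough illustration only (plain Python, with the file names copied from the question and a list of strings assumed in place of the char array), the logical vector and the row extraction look like this:

```python
A = [
    "0815_5275_UBA_A_1971.txt",
    "0815_5275_UBA_A_1972.txt",
    "0823_6275_UBA_A_1971.txt",
    "0823_6275_UBA_A_1972.txt",
    "0823_6275_UBA_A_1973.txt",
]
B = "0815_5275"

# Logical vector: compare the first len(B) characters of every row with B.
mask = [name[:len(B)] == B for name in A]

# Keep only the rows where the mask is true; the number of matches does not
# need to be known in advance.
C = [name for name, keep in zip(A, mask) if keep]

print(C)
```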