PySpark: counting the occurrence of values with keys

I have a list of (key,value) pairs of the form:
x=[(('cat','dog'),('a','b')),(('cat','dog'),('a','b')),(('mouse','rat'),('e','f'))]
I want to count the number of times each value tuple appears with the key tuple.
Desired output:
[(('cat','dog'),('a','b',2)),(('mouse','rat'),('e','f',1))]
A working solution is:
from collections import Counter
xs=sc.parallelize(x)
xs=xs.groupByKey()
xs=xs.map(lambda (x,y): (x, Counter(y)))
However, for large datasets this method fills up the disk (~600 GB). I was trying to implement a similar solution using reduceByKey:
xs=xs.reduceByKey(Counter).collect()
but I get the following error:
TypeError: __init__() takes at most 2 arguments (3 given)

Here is how I usually do it:
xs=sc.parallelize(x)
a = xs.map(lambda x: (x, 1)).reduceByKey(lambda a,b: a+b)
a.collect() yields:
[((('mouse', 'rat'), ('e', 'f')), 1), ((('cat', 'dog'), ('a', 'b')), 2)]
I'm going to assume that you want the counts (here, 1 and 2) inside the second key in the (key1, key2) pair.
To achieve that, try this:
a.map(lambda x: (x[0][0], x[0][1] + (x[1],))).collect()
The last step remaps each element so that the first key pair (e.g. ('mouse','rat')) becomes the key, takes the second key pair (e.g. ('e','f')), and appends the count x[1] to it as a one-element tuple.
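For reference, here is the whole pipeline as one self-contained sketch (assuming an existing SparkContext named sc, as in the question), plus a Counter-based variant. The TypeError in the question arises because reduceByKey calls the supplied function with two values, i.e. Counter(v1, v2), while Counter accepts at most one positional iterable or mapping; adding Counters pairwise avoids that.
from collections import Counter

x = [(('cat', 'dog'), ('a', 'b')),
     (('cat', 'dog'), ('a', 'b')),
     (('mouse', 'rat'), ('e', 'f'))]
xs = sc.parallelize(x)

# Count each ((key tuple), (value tuple)) combination, then fold the count
# back into the value tuple.
counts = (xs.map(lambda kv: (kv, 1))
            .reduceByKey(lambda a, b: a + b)
            .map(lambda kv_n: (kv_n[0][0], kv_n[0][1] + (kv_n[1],))))
print(counts.collect())
# [(('cat', 'dog'), ('a', 'b', 2)), (('mouse', 'rat'), ('e', 'f', 1))]  (order may vary)

# Counter-based variant: wrap each value tuple in a Counter, then add Counters
# pairwise. Unlike groupByKey, partial Counters are combined before the shuffle.
per_key = xs.mapValues(lambda v: Counter([v])).reduceByKey(lambda a, b: a + b)
print(per_key.collect())
# [(('cat', 'dog'), Counter({('a', 'b'): 2})), (('mouse', 'rat'), Counter({('e', 'f'): 1}))]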

Related

Print the hash elements by grouping their values in Raku

I keep a record of how many times each letter occurs in a word, e.g. 'embeddedss':
my %x := {e => 3, m => 1, b => 1, d => 3, s => 2};
I'd like to print the elements by grouping their values like this:
# e and d 3 times
# m and b 1 times
# s 2 times
How can I do this practically, i.e. without constructing loops (if that is possible)?
Optionally, before printing the hash, I'd like to convert and assign it to a temporary data structure such as ( <3 e d>, <1 m b>, <2 s> ) and then print that. What would be the most practical data structure and way to print it?
Using .categorize as suggested in the comments, you can group them together based on the value.
%x.categorize(*.value)
This produces a Hash, with the keys being the value used for categorization, and the values being Pairs from your original Hash (%x). This can be looped over using for or .map. The letters you originally counted are the keys of those Pairs, which can be neatly extracted using .map.
for %x.categorize(*.value) {
say "{$_.value.map(*.key).join(' and ')} {$_.key} times";
}
Optionally, you can also sort the List by occurrences using .sort. This will sort the lowest number first, but adding a simple .reverse will make the highest value come first.
for %x.categorize(*.value).sort.reverse {
say "{$_.value.map(*.key).join(' and ')} {$_.key} times";
}

Scala - Not enough arguments for method count

I am fairly new to Scala and Spark RDD programming. The dataset I am working with is a CSV file containing a list of movies (one row per movie) and their associated user ratings (a comma-delimited list of ratings). Each column in the CSV represents a distinct user and the rating he/she gave the movie. Thus, user 1's ratings for each movie are in the 2nd column from the left:
Sample Input:
Spiderman,1,2,,3,3
Dr.Sleep, 4,4,,,1
I am getting the following error:
Task4.scala:18: error: not enough arguments for method count: (p: ((Int, Int)) => Boolean)Int.
Unspecified value parameter p.
var moviePairCounts = movieRatings.reduce((movieRating1, movieRating2) => (movieRating1, movieRating2, movieRating1._2.intersect(movieRating2._2).count()
when I execute the few lines below. For the program below, the second line of code splits all values delimited by "," and produces this:
( Spiderman, [[1,0],[2,1],[-1,2],[3,3],[3,4]] )
( Dr.Sleep, [[4,0],[4,1],[-1,2],[-1,3],[1,4]] )
On the third line, taking the count() throws an error. For each movie (row), I am trying to get the number of common elements. In the above example, [-1, 2] is clearly a common element shared by both Spiderman and Dr.Sleep.
val textFile = sc.textFile(args(0))
var movieRatings = textFile.map(line => line.split(","))
.map(movingRatingList => (movingRatingList(0), movingRatingList.drop(1)
.map(ranking => if (ranking.isEmpty) -1 else ranking.toInt).zipWithIndex));
var moviePairCounts = movieRatings.reduce((movieRating1, movieRating2) => (movieRating1, movieRating2, movieRating1._2.intersect(movieRating2._2).count() )).saveAsTextFile(args(1));
My target output of line 3 is as follows:
( Spiderman, Dr.Sleep, 1 ) --> Between these 2 movies, there is 1 common entry.
Can somebody please advise ?
To get the number of elements in a collection, use length or size. count(p) returns the number of elements that satisfy the predicate p.
Or you could avoid building the complete intersection by using count to count the elements of the first collection which the second contains:
movieRating1._2.count(movieRating2._2.contains(_))
The error message seems pretty clear: count takes one argument, but in your call, you are passing an empty argument list, i.e. zero arguments. You need to pass one argument to count.

scala - Find a pair in a list only with first element value

Suppose that we have a list like: val list = List((1,'o'), (3,'t'), (10, 't'), (7, 's')).
Then I want to find a pair whose first element is 10, ignoring what the second element is.
How can I find the pair or the index of the pair?
I tried list.indexOf((10,_)), list.indexOf((10,???)) and so on. However, as you might expect, these attempts don't work.
Any suggestions are welcome :)
Use indexWhere to find the index:
list.indexWhere(_._1 == 10)
If you want the pair you can use find:
list.find(_._1 == 10)
Note that find returns an Option because it may not find any matching element. If you want a default value you can use getOrElse; otherwise you need to handle the not-found case:
list.find(_._1 == 10).getOrElse(/* default value */)

reduceByKey results in Infinity value

I am trying to add values for the same key.
val final= d1.join(d2).flatMap(line => Seq(line.swap._1)).reduceByKey((x, y) =>(x+y))
d1 and d2 are data streams; after the flatMap I get key-value pairs.
However, this results in an Infinity value at reduceByKey((x, y) => (x + y)).
For example, if the pairs are (k1,1.0) and (k1,1.0), reduceByKey((x, y) => (x + y)) yields (k1,Infinity).
Any suggestion?
The above code snippet is working. As @maasg rightly hinted, the problem was elsewhere: the error was caused by a division by zero in earlier code which I didn't post here. Thanks!
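For context on why that produces Infinity rather than an exception: on doubles, 1.0/0.0 evaluates to Double.PositiveInfinity in Scala, and adding anything to Infinity stays Infinity, so reduceByKey simply propagates the bad value. A minimal PySpark sketch of the same propagation (hypothetical key k1, assuming a SparkContext sc):
pairs = sc.parallelize([("k1", 1.0), ("k1", float("inf")), ("k1", 1.0)])
# A single Infinity among the values is enough to make the reduced sum Infinity.
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('k1', inf)]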

Give value, return field name in matlab structure

I have a Matlab structure like this:
Columns.T21=6;
Columns.ws21=9;
Columns.wd21=10;
Columns.u21=11;
Is there some elegant way I can give the value and get back the field name? For instance, if I give 6, it would return 'T21'. I know that fieldnames() returns all the field names, but I want the field name for a specific value. Many thanks!
Assuming that the structure contains fields with scalar numeric values, you can use this struct2array based approach -
search_num = 6; %// Edit this for a different search number
fns=fieldnames(Columns) %// Get field names
out = fns(struct2array(Columns)==search_num) %// Logically index into names to find
%// the one that matches our search
Goal:
Construct two vectors from your struct, one for the names of the fields and the other for their respective values. This is analogous to a dict in Python or a map in C++, where unique keys map to possibly non-unique values.
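Since the struct is effectively being used as a dictionary here, a minimal Python sketch of the same reverse lookup (with a hypothetical dict mirroring the Columns struct) may make the idea concrete before the MATLAB versions below:
# Hypothetical dict mirroring the Columns struct from the question.
columns = {"T21": 6, "ws21": 9, "wd21": 10, "u21": 11}

# Reverse lookup: collect every field name whose value equals the target.
def fields_for_value(mapping, target):
    return [name for name, value in mapping.items() if value == target]

print(fields_for_value(columns, 6))  # ['T21']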
Simple Solution:
You can do this very simply using the various functions defined for struct in MATLAB, namely struct2cell() and cell2mat().
For the particular element of interest, say element 1 of your struct Columns, get the names of all fields as a cell array using the fieldnames() function:
fields = fieldnames( Columns(1) )
Similarly, get the values of all the fields of that element of Columns, in the form of a matrix
vals = cell2mat( struct2cell( Columns(1) ) )
Next, find the field with the corresponding value, say 6 here, using the find function, and convert the resulting 1x1 cell into a char using the cell2mat() function:
cell2mat( fields( find( vals == 6 ) ) )
which will yield:
T21
Now, you can define a function that does this for you, e.g.:
function fieldname = getFieldForValue( myStruct, value)
Advanced Solution using Map Container Data Abstraction:
You can also choose to define an object of the containers.Map class using the field names of your struct as the keySet and the values as the valueSet.
myMap = containers.Map( fieldnames( Columns(1) ), struct2cell( Columns(1) ) );
This allows you to get keys and values using corresponding built-in functions:
myMapKeys = keys(myMap);
myMapValues = values(myMap);
Now, you can find all the keys corresponding to a particular value, say 6 in this case:
cell2mat( myMapKeys( find( cell2mat(myMapValues) == 6 ) )' )
which again yields:
T21
Caution: this method (and, for that matter, any method of doing this) will only work if all the fields hold values of the same type, because the matrix we convert vals to must have a uniform type for all its elements. From your example I assume this is always the case.
Customized function/ logic:
A struct consists of elements that contain fields, which in turn hold values. A field name thus acts as a key and the field's content as its value. The essence of a lookup is to find values (which may be non-unique) for specific keys (which are unique), and MATLAB has built-in ways of doing that. What you want is the other way around, i.e. to find keys for specific values; since that is not a typical use case, you need to write your own logic or function for it.
Suppose your structure is called S. First extract all the field names into an array:
fNames=fieldnames(S);
Now define a following anonymous function in your code:
myfun=@(yourArray,desiredValue) yourArray==desiredValue;
Then you can get the desired field name as:
desiredFieldIndex=myfun(structfun(@(x) x,S),3) %desired value is 3 (say)
desiredFieldName=fNames(desiredFieldIndex)
Alternative using containers.Map
Assuming each field in the structure contains one scalar value, as in the question (not an array), the aim is to create a Map object with the field values as keys and the field names as values:
myMap = containers.Map(struct2cell(Columns),fieldnames(Columns))
Now, to get the field name for a value, index into myMap with the value:
myMap(6)
ans =
T21
This has the advantage that, as long as the structure doesn't change, you can reuse myMap to look up other value-to-field-name pairs.