compute dispersion of categorical variable - pyspark

I'm working with pyspark and I have a column called "batches" which contains lists of strings.
batches:
[batch1, batch1, batch1, batch2, batch15, ....]
[batch10, batch1, batch12, batch2, batch1, ....]
...
...
I want to compute the "dispersion" of the data for each value in this column. I don't know if dispersion is the correct term, so I'll explain with an example. I want a function that, given a list such as [batch1, batch1, batch1, batch1, batch1, ... , batch2], which contains a huge number of batch1 entries and only one batch2, returns a value close to zero, implying that essentially only batch1 is used. On the other hand, for a list in which all names are different, I want the returned value to be 1, because the dispersion is higher. I found that entropy could be useful in my case, but I'm not sure about it and would like a second opinion.
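One way to check whether entropy behaves as described is to try it on small lists. The sketch below is plain Python, not PySpark, and the function name normalized_entropy is hypothetical. It computes the Shannon entropy of the value frequencies and divides by the log of the list length. That normalization is an assumption, chosen so that a list of all-different names scores exactly 1. Dividing by the log of the number of distinct values is another common choice.

```python
from collections import Counter
from math import log


def normalized_entropy(batches):
    """Shannon entropy of the value frequencies, scaled to [0, 1].

    Assumption: we normalize by log(len(batches)), so a list whose
    entries are all different scores 1 and a constant list scores 0.
    """
    n = len(batches)
    if n <= 1:
        return 0.0  # a single entry (or none) has no dispersion
    counts = Counter(batches)
    # H = sum over distinct values of p * log(1 / p), with p = count / n
    entropy = sum((c / n) * log(n / c) for c in counts.values())
    return entropy / log(n)


mostly_one = ["batch1"] * 999 + ["batch2"]
all_different = ["batch1", "batch2", "batch3", "batch4"]
print(normalized_entropy(mostly_one))     # small value, close to 0
print(normalized_entropy(all_different))  # close to 1
```

If this measure fits, it could be applied to the column by wrapping the function in a PySpark UDF.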

Related

PySpark - Split the combined key

I have an RDD that is structured in this format:
(MAC_address, dst_ip_address, 1)
Here, 1 means the machine with the MAC_address has accessed the dst_ip_address once. I need to count how many times a specific machine with MAC_address has reached a specific dst_ip_address.
I created an RDD with the combined MAC_address and dst_ip_address as the key, and applied reduceByKey to count the occurrences.
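def processJson(data):
    # data is a (MAC_address, dst_ip_address, 1) tuple
    MAC_address, dst_ip_address, count = data
    return ((MAC_address, dst_ip_address), count)

def countreducer(a, b):
    return a + b

tt = df.map(processJson).reduceByKey(countreducer)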
I am able to get an RDD of the form ((MAC_address, dst_ip_address), 52).
I need to write the RDD in a JSON format like this:
MAC_address_1:
[dst_ip_1: 52],
[dst_ip_2: 38]
MAC_address_2:
[dst_ip_1: 12]
My intuition is to split the combined key first, but there is no function to flatten a combined key. Is the above approach on the right track?
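The combined key does not need a special function to be split, because it is an ordinary tuple that can be unpacked. Below is a minimal sketch in plain Python of the regrouping step. The helper name group_by_mac is hypothetical, and the input is assumed to be the collected ((MAC_address, dst_ip_address), count) pairs. A comment shows how the same re-keying might look on the RDD itself with map, groupByKey and mapValues.

```python
import json
from collections import defaultdict


def group_by_mac(counted_pairs):
    """Turn ((mac, dst_ip), count) pairs into {mac: {dst_ip: count}}."""
    grouped = defaultdict(dict)
    for (mac, dst_ip), count in counted_pairs:
        grouped[mac][dst_ip] = count
    return dict(grouped)


# On the RDD itself, the equivalent re-keying could be written as:
#   tt.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))) \
#     .groupByKey() \
#     .mapValues(dict)
# (assumes each MAC address has few enough destinations to fit in memory)

# Illustrative data only, mirroring the shape in the question.
pairs = [
    (("MAC_address_1", "dst_ip_1"), 52),
    (("MAC_address_1", "dst_ip_2"), 38),
    (("MAC_address_2", "dst_ip_1"), 12),
]
print(json.dumps(group_by_mac(pairs), indent=2))
```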

scala return matrix of average pixels

Here's the thing: I want to modify (and then return) a matrix of integers that is passed as a parameter to the function. The function average (of the class MatrixMotionBlur) gives the average of the pixel itself and its upper, lower, and left neighbors. It follows this formula:
result(x, y) = (M1(x, y)+M1(x-1, y)+M1(x, y-1)+M1(x, y+1)) / 4
This is the code I've implemented so far:
MatrixMotionBlur - Average function
MotionBlurSingleThread - run
The objective here is to apply the "average" method to alter the matrix values and return that matrix. The problem is that the program gives me an error when I try to insert the value into the matrix.
Any ideas how to do this?

The functional way
val updatedData = data.zipWithIndex.map { case (row, i) =>
  row.zipWithIndex.map { case (_, j) =>
    mx.average(i, j)
  }
}
Note that Seq is an immutable collection type, so you cannot modify it in place; you can only create a new, modified collection.
By the way, why do you iterate starting from 1 rather than 0? Are you sure that is what you want?
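The "build a new matrix instead of mutating the old one" idea can be illustrated independently of Scala. Here is a minimal Python sketch of the formula from the question. The name motion_blur is hypothetical. Leaving cells unchanged when they lack one of the required neighbors, and using integer division, are both assumptions, since the question does not say how borders or rounding should be handled.

```python
def motion_blur(m1):
    """Return a new matrix where each cell that has all required neighbours
    becomes (M1(x,y) + M1(x-1,y) + M1(x,y-1) + M1(x,y+1)) / 4.

    Assumptions: cells missing a neighbour are copied unchanged, and the
    average uses integer division because the matrix holds integers.
    """
    rows = len(m1)
    cols = len(m1[0]) if rows else 0
    result = [row[:] for row in m1]  # copy, so the input is never mutated
    for x in range(1, rows):          # needs x - 1
        for y in range(1, cols - 1):  # needs y - 1 and y + 1
            total = m1[x][y] + m1[x - 1][y] + m1[x][y - 1] + m1[x][y + 1]
            result[x][y] = total // 4
    return result


pixels = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(motion_blur(pixels))  # the original `pixels` list is left untouched
```

Every average is read from the original matrix, so earlier updates never leak into later cells.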

C++AMP:How does the iterator work

example:
parallel_for_each(mas.extent, [=](index<1> idx) restrict(amp)
{
    ...
});
How does the index<1> idx work?
Suppose mas is an array_view holding a set of numbers.

Perhaps don't think of the index as an iterator because that implies a sequential iteration through the container, but this is the parallel world so we're not starting at the beginning and going to the end.
index<1> idx
In your example, idx is an index object of rank (dimension) 1, so it indexes into a one-dimensional array_view container, i.e. mas should also be one-dimensional.
array_view<const int, 1> mas;
Inside the parallel_for_each body, idx can be used if necessary to index into mas.
int i = idx[0];
int value = mas[i];
However, in cases where the absolute position in the array_view is not important, you don't need to make use of the index. Every item in mas will be processed.
If the array_view were, for example, a 2D matrix of values, you would have an index of rank 2 instead, and then idx[0] would reference the rows and idx[1] the columns.
I've found this free online book invaluable in understanding these concepts: Parallel Programming with Microsoft Visual C++

Selecting random values in a set in mathematica

I have a set that contains {0} and 8 other elements, 9 elements in total. I want to choose 3 random values from this set and create a 3x1 column matrix, and repeat this for all possible choices from the set. How can I do that?

As @Picket said in a comment,
The way RandomSample works will ensure it will not output the same choice twice in a single call
If your list is small, you can generate all subsets and sample it.
Example
RandomSample[Subsets[{a, b, c, d, e, f}, {3}], 7]
will generate all (20) subsets with 3 (distinct) elements and then pick 7 different ones uniformly (there are options to weight each member differently, choose the random generator, etc.).
RandomSample[Flatten[Permutations /@ Subsets[{a, b, c, d, e, f}, {3}], 1], 13]
will generate all (120) possible ordered selections of 3 distinct elements from a set of 6 elements and give a sample of 13 distinct elements of this list.
If what you want is a random ordering of all possible subsets of size 3, or of all ordered selections of size 3 without duplicates, just ask the same way but with the exact number of such sets.
myset = { foo, foo2, foo3, foo5 };
RandomSample[Subsets[myset, {3}], Binomial[Length[myset],3 ]]
RandomSample[Flatten[Permutations /@ Subsets[myset, {3}], 1], 3!*Binomial[Length[myset],3 ] ]
(if you ask for more than the exact number of possibilities, RandomSample will complain)
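For readers who want to check the counts outside Mathematica, the same enumerate-then-sample idea can be sketched in Python with the standard itertools and random modules. The helper names are hypothetical and the element labels are placeholders.

```python
import itertools
import random
from math import comb, factorial


def sample_subsets(items, k, n):
    """Pick n distinct k-element subsets (order ignored) uniformly at random."""
    return random.sample(list(itertools.combinations(items, k)), n)


def sample_ordered(items, k, n):
    """Pick n distinct ordered selections of k different elements."""
    return random.sample(list(itertools.permutations(items, k)), n)


items = ["a", "b", "c", "d", "e", "f"]
print(comb(len(items), 3))                 # number of 3-element subsets
print(factorial(3) * comb(len(items), 3))  # number of ordered selections
print(sample_subsets(items, 3, 7))
print(sample_ordered(items, 3, 13))
```

As with RandomSample, random.sample raises an error (a ValueError) if more items are requested than exist.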
Now, if your initial set is so large that generating the set of subsets is impractical in time and memory, take advantage of representing a selection by a number, even if this is not perfect in terms of uniform distribution. Say that your initial set has 20 distinct elements. A three-digit number in base 20 can represent any selection of 3. You need to filter out the few numbers in which a digit appears more than once; the ratio of all three-digit numbers to the valid ones is
20^3/(3!*Binomial[20, 3]) // N
1.16959
You are probably safe generating 25% more numbers than you need and filtering out the ones with repetition:
Cases[IntegerDigits[RandomSample[0 ;; 20^3-1, Ceiling[31*(1 + 1/4)] ], 20, 3], _?(Length[Union[#]] == 3 &), 1, 31]
This generates a random sample of 39 distinct 3-digit numbers in base 20 and selects the first 31 with no repeated digits, in the form of a list of 3-coordinate vectors.
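The oversample-and-filter approach can also be sketched in Python, which makes each step explicit. The function names are hypothetical, and the 25% oversampling factor is taken from the answer. Like the Mathematica version, this may return fewer selections than requested if too many drawn numbers have repeated digits.

```python
import math
import random


def to_digits(code, base, width):
    """Return the `width` digits of `code` in `base`, most significant first."""
    digits = []
    for _ in range(width):
        code, digit = divmod(code, base)
        digits.append(digit)
    return digits[::-1]


def sample_distinct_selections(base, width, wanted, oversample=1.25, seed=None):
    """Draw distinct numbers below base**width, keep those with no repeated digit."""
    rng = random.Random(seed)
    draw = min(base ** width, math.ceil(wanted * oversample))
    codes = rng.sample(range(base ** width), draw)  # distinct numbers
    kept = []
    for code in codes:
        digits = to_digits(code, base, width)
        if len(set(digits)) == width:  # reject repeated digits
            kept.append(digits)
            if len(kept) == wanted:
                break
    return kept


selections = sample_distinct_selections(base=20, width=3, wanted=31)
print(len(selections), selections[:3])
```

Each digit is an index into the original 20-element set, so a kept triple maps directly to an ordered selection of 3 distinct elements.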

Finding correlation in an enum type data

I have the following dataset containing information about countries
5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,0,0,0,0,1,0,0,1,0,0,
3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,
4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,
...
The sixth column in each row indicates the main religion of the country: 0 is catholic, 1 is other christian, 2 is muslim, etc. Some of the other data describes whether different colors are present in the flag of the country, which symbols it contains, and so on.
The description of the data can be found here. I have removed the string data columns, though, so it doesn't exactly match the information shown there.
My problem is that I want to use covariance matrices and Pearson correlation to see whether, for example, the fact that a flag contains the color red says anything about which religion the country is more likely to have. But since the religion is enumerated, I am a bit lost on how to proceed with this problem.

Your problem is that, although your data is encoded as ordered integers, this order is arbitrary. The "distance" between "muslim" (enum val=2) and "hindu" (enum val=4) is not 2.
The most straightforward way of tackling this issue is to convert enum values to binary indicator vectors:
Suppose you have
enum {
Catholic = 0,
Protestant,
Muslim,
Jewish,
Hindu,
...
NumOfRel };
You replace the single entry of enum val with a binary vector of length NumOfRel with zeros everywhere except for a single 1 at the appropriate place:
For a Protestant entry, you'll have the following binary vector:
[ 0 1 0 0 ... ]
For a Jewish entry:
[ 0 0 0 1 0 ... ]
And so on...
This way, the "distance" between any two different religions is always the same.
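The indicator-vector conversion, followed by a Pearson correlation against one binary flag feature, can be sketched in plain Python as below. The function names, the has_red column and the toy values are all hypothetical and only illustrate the mechanics; they are not taken from the dataset in the question.

```python
from math import sqrt


def one_hot(codes, num_categories):
    """Replace each enum value with a binary indicator vector."""
    return [[1 if code == k else 0 for k in range(num_categories)]
            for code in codes]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Toy values for illustration only (not real data).
religion = [2, 0, 2, 1, 0, 2]  # enum codes, one per country
has_red = [1, 0, 1, 0, 1, 1]   # hypothetical binary flag feature
indicators = one_hot(religion, num_categories=3)

# Correlate each religion's indicator column with the flag feature.
for k in range(3):
    column = [row[k] for row in indicators]
    print(k, pearson(column, has_red))
```

Each religion gets its own correlation value, so no artificial ordering between the categories enters the result.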
