I have a single trained classifier tested on 2 related multiclass classification tasks. Since each trial of one classification task is related to a trial of the other, the 2 sets of predictions constitute paired data. I would like to run a paired permutation test to find out whether the difference in classification accuracy between the 2 prediction sets is significant.
So my data consists of 2 lists of predicted classes, where each prediction is paired with the prediction at the same index in the other test set.
Example:
actual_classes = [1, 3, 6, 1, 22, 1, 11, 12, 9, 2]
predictions1 = [1, 3, 6, 1, 22, 1, 11, 12, 9, 10] # 90% acc.
predictions2 = [1, 3, 7, 10, 22, 1, 7, 12, 2, 10] # 50% acc.
H0: There is no difference in classification accuracy between the two prediction sets.
How do I go about running a paired permutation test to check whether the difference in classification accuracy is significant?
I have been thinking about this, so I'm going to post a proposed solution and see whether someone approves or explains why I'm wrong.
import random

actual_classes = [1, 3, 6, 1, 22, 1, 11, 12, 9, 2]
predictions1 = [1, 3, 6, 1, 22, 1, 11, 12, 9, 10]   # 90% acc.
predictions2 = [1, 3, 7, 10, 22, 1, 7, 12, 2, 10]   # 50% acc.

paired_predictions = list(zip(predictions1, predictions2))
# [(1, 1), (3, 3), (6, 7), (1, 10), (22, 22), (1, 1), (11, 7), (12, 12), (9, 2), (10, 10)]

def accuracy(predictions, actual):
    # proportion of predictions that equal the actual class
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)

actual_test_statistic = accuracy(predictions1, actual_classes) - accuracy(predictions2, actual_classes)
# 0.9 - 0.5 = 0.4

number_of_iterations = 10000
all_simulations = []
for _ in range(number_of_iterations):
    random.shuffle(paired_predictions)  # only shuffle between pairs, not within
    simulated_predictions1 = [pair[0] for pair in paired_predictions]
    simulated_predictions2 = [pair[1] for pair in paired_predictions]
    simulated_accuracy1 = accuracy(simulated_predictions1, actual_classes)
    simulated_accuracy2 = accuracy(simulated_predictions2, actual_classes)
    all_simulations.append(simulated_accuracy1 - simulated_accuracy2)  # store the simulated difference

p = sum(abs(s) > abs(actual_test_statistic) for s in all_simulations) / number_of_iterations
If you have any thoughts, let me know in the comments. Or better still, provide your own corrected version in your own answer. Thank you!
I need to generate all dates between two given dates.
My predicate date_between(DateLow, DateHigh, X) works correctly:
?- date_between(date(2020,2,15), date(2020,2,25), X).
X = date(2020, 2, 15) ;
X = date(2020, 2, 16) ;
....
X = date(2020, 2, 25) .
But I think the predicate is too clumsy. Is there a more elegant approach that does the same thing?
Do I really have to translate back and forth between dates and seconds (stamps), and compare dates through that conversion?
Here is my code:
date_between(DateLow, DateHigh, DateLow) :-
    datestd_stamp(DateLow, StampLow),
    datestd_stamp(DateHigh, StampHigh),
    StampLow =< StampHigh.
date_between(DateLow, DateHigh, X) :-
    datestd_stamp(DateLow, StampLow),
    datestd_stamp(DateHigh, StampHigh),
    StampLow < StampHigh,
    DateLow = date(Y,M,D),
    Dnxt is D + 1,
    date_time_stamp(date(Y,M,Dnxt,0,0,0,0,-,-), StampNext),
    stamp_date_time(StampNext, Dat, 0),
    date_time_value(date, Dat, DateNxt),
    date_between(DateNxt, DateHigh, X).

datestd_stamp(Data, Stamp) :-
    Data = date(Y,M,D),
    date_time_stamp(date(Y,M,D,0,0,0,0,-,-), StampTmp),
    Stamp is round(StampTmp).
I tried to improve the predicate. It has become simpler, and the execution time has definitely been reduced.
Old version:
?- time((bagof(X, (date_between(date(2020,1,1), date(2100,12,31), X)), Ls))).
% 680,466 inferences, 0.149 CPU in 0.149 seconds (100% CPU, 4563901 Lips)
Ls = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3),
New version:
?- time((bagof(X, (date_between2(date(2020,1,1), date(2100,12,31), X)), Ls))).
% 207,106 inferences, 0.066 CPU in 0.066 seconds (100% CPU, 3157900 Lips)
Ls = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3),
Here is the new version of the predicate:
date_between2(DateLow, DateHigh, DateLow) :-
    DateLow @=< DateHigh.
date_between2(DateLow, DateHigh, X) :-
    DateLow @< DateHigh,
    DateLow = date(Y,M,D),
    Dnxt is D + 1,
    date_time_stamp(date(Y,M,Dnxt,0,0,0,0,-,-), StampNext),
    stamp_date_time(StampNext, Dat, 0),
    date_time_value(date, Dat, DateNxt),
    date_between2(DateNxt, DateHigh, X).
I have a dataset that resembles the following:
site_id, species
1, spp1
2, spp1
2, spp2
2, spp3
3, spp2
3, spp3
4, spp1
4, spp2
I want to create a table like this:
site_id, spp1, spp2, spp3, spp4
1, 1, 0, 0, 0
2, 1, 1, 1, 0
3, 0, 1, 1, 0
4, 1, 1, 0, 0
This question was asked here; however, my list of species is significantly longer, so manually writing a massive query that lists each species would take a significant amount of time. I would therefore like a solution that does not require this and could instead read from the existing species list.
In addition, when playing with that query, the count() function kept adding up, so I ended up with values greater than 1 where the same species appeared multiple times in a site_id. Ideally I want a binary 1 or 0 output.
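(As a sketch of the transformation itself, in pandas rather than SQL: a crosstab clipped to 1 produces exactly this kind of presence/absence table. The dataframe below just mirrors the example rows above and is only an illustration, not a database solution.)

import pandas as pd

# hypothetical dataframe mirroring the example data in the question
df = pd.DataFrame({
    "site_id": [1, 2, 2, 2, 3, 3, 4, 4],
    "species": ["spp1", "spp1", "spp2", "spp3", "spp2", "spp3", "spp1", "spp2"],
})

# crosstab counts (site_id, species) occurrences; clip(upper=1) forces binary 0/1,
# so duplicate records of a species within a site still yield 1
presence = pd.crosstab(df["site_id"], df["species"]).clip(upper=1)
# species never observed anywhere (e.g. spp4) would need a reindex against the full species list
print(presence)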
I want to try the ReedSolomonDecoder from the ZXing library on the example given on page 10 of this paper.
Basically, it encodes the message
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
using the generator polynomial
x^4 + 15x^3 + 3x^2 + x + 12
which results in
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 3, 3, 12, 12
I want to decode this in the following manner:
int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 3, 3, 12, 12};
GenericGF field = new GenericGF(?, 16, 1); // what integer should I use for primitive here?
ReedSolomonDecoder decoder = new ReedSolomonDecoder(field);
decoder.decode(data, 4);
I don't know how to create a GenericGF object from the given generator polynomial. I know that it expects a binary integer representation of the polynomial, but for that I would need the polynomial to be in irreducible form, i.e. with all coefficients either 0 or 1. How can I get that from the given generator polynomial?
I'm pretty new to this as well, but I think you would want to use
public static GenericGF AZTEC_PARAM = new GenericGF(0x13, 16, 1);
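To see why that constant has the right shape: 0x13 is the binary encoding (10011) of the irreducible polynomial x^4 + x + 1 that defines the field GF(16); it is not an encoding of the Reed-Solomon generator polynomial from the paper. Here is a quick Python sanity check (my own sketch, not ZXing code) that x generates all 15 nonzero elements of GF(16) modulo that polynomial:

PRIMITIVE = 0x13  # binary 10011 -> x^4 + x + 1, the GF(16) field polynomial

elem, seen = 1, set()
for _ in range(15):
    seen.add(elem)
    elem <<= 1            # multiply by x
    if elem & 0x10:       # degree reached 4: reduce modulo x^4 + x + 1
        elem ^= PRIMITIVE

# x cycles through every nonzero element, so the polynomial is primitive
assert seen == set(range(1, 16))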
I've got a list of some integers, e.g. [1, 2, 3, 4, 5, 10]
And I've another integer (N). For example, N = 19.
I want to check whether my integer can be represented as a sum of any number of elements from my list:
19 = 10 + 5 + 4
or
19 = 10 + 4 + 3 + 2
Every number from the list can be used at most once. N can be up to 2 thousand or more, and the size of the list can reach 200 integers.
Is there a good way to solve this problem?
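(A note on those limits: because N stays in the low thousands, subset sum is pseudo-polynomial here, and a dynamic-programming bitset handles these sizes directly. A minimal sketch of that standard technique, separate from the answers that follow; `can_sum` is just an illustrative helper name:)

def can_sum(numbers, n):
    # bit k of `reachable` is set iff some subset of the numbers seen so far sums to k;
    # each number is used at most once because the shift reads the pre-update mask
    reachable = 1  # only the empty sum 0 is reachable initially
    for x in numbers:
        reachable |= reachable << x
    return bool((reachable >> n) & 1)

assert can_sum([1, 2, 3, 4, 5, 10], 19)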
Four and a half years later, this question was answered by Jonathan.
I want to post two implementations (brute force and Jonathan's) in Python, together with their performance comparison.
def check_sum_bruteforce(numbers, n):
    # This bruteforce approach can be improved (for some cases) by
    # returning True as soon as the needed sum is found
    sums = []
    for number in numbers:
        for sum_ in sums[:]:
            sums.append(sum_ + number)
        sums.append(number)
    return n in sums
def check_sum_optimized(numbers, n):
    sums1, sums2 = [], []
    numbers1 = numbers[:len(numbers) // 2]
    numbers2 = numbers[len(numbers) // 2:]
    for sums, numbers_ in ((sums1, numbers1), (sums2, numbers2)):
        for number in numbers_:
            for sum_ in sums[:]:
                sums.append(sum_ + number)
            sums.append(number)
    # n may be formed entirely from one half
    if n in sums1 or n in sums2:
        return True
    for sum_ in sums1:
        if n - sum_ in sums2:
            return True
    return False
assert check_sum_bruteforce([1, 2, 3, 4, 5, 10], 19)
assert check_sum_optimized([1, 2, 3, 4, 5, 10], 19)
import timeit

print(
    "Bruteforce approach (10000 times):",
    timeit.timeit(
        'check_sum_bruteforce([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 200)',
        number=10000,
        globals=globals()
    )
)
print(
    "Optimized approach by Jonathan (10000 times):",
    timeit.timeit(
        'check_sum_optimized([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 200)',
        number=10000,
        globals=globals()
    )
)
Output (the float numbers are seconds):
Bruteforce approach (10000 times): 1.830944365834205
Optimized approach by Jonathan (10000 times): 0.34162875449254027
The brute force approach requires generating 2^(array_size) - 1 subsets to be summed and compared against the target N.
The run time can be dramatically improved by simply splitting the problem in two. Store, in sets, all of the possible sums for one half of the array and for the other half separately. Whether N is reachable can then be determined by checking, for every number n in one set, if the complement N - n exists in the other set.
This optimization brings the work down to approximately 2^(array_size/2) - 1 + 2^(array_size/2) - 1 = 2^(array_size/2 + 1) - 2 generated sums.
The exponent is halved, so for 200 elements that is roughly 2 * 2^100 sums instead of 2^200.
Here is a C++ implementation using this idea.
#include <bits/stdc++.h>
using namespace std;

bool sum_search(const vector<int>& myarray, int N) {
    // split point for the two halves of the array
    int middle = myarray.size() / 2;
    set<int> all_possible_sums1, all_possible_sums2;
    // seed both sets with 0 (the empty subset) so sums drawn from only one half are found too
    all_possible_sums1.insert(0);
    all_possible_sums2.insert(0);
    // build every possible subset sum of the first half of the array
    for (int i = 0; i < middle; i++) {
        // buffer set that will hold the new possible sums
        set<int> buffer_set;
        // every value currently in the set is extended with the new element
        for (int s : all_possible_sums1)
            buffer_set.insert(myarray[i] + s);
        // transfer the buffer into the main set
        all_possible_sums1.insert(buffer_set.begin(), buffer_set.end());
    }
    // same for the second half of the array
    for (int i = middle; i < (int)myarray.size(); i++) {
        set<int> buffer_set;
        for (int s : all_possible_sums2)
            buffer_set.insert(myarray[i] + s);
        all_possible_sums2.insert(buffer_set.begin(), buffer_set.end());
    }
    // for every sum in the first set, check if the second set has the complement that makes N
    for (int s : all_possible_sums1)
        if (all_possible_sums2.count(N - s))
            return true;
    return false;
}

int main() {
    cout << boolalpha << sum_search({1, 2, 3, 4, 5, 10}, 19) << endl;  // true
}
Ugly and brute force approach:
a = [1, 2, 3, 4, 5, 10]
b = []
1.upto(a.size) do |c|
  b << a.combination(c).select { |d| d.reduce(&:+) == 19 }
end
puts b.flatten(1).inspect