Combining conditional classifier probabilities - classification

I have several document classifiers, each trying to predict the correct document type for a given file. For each file, every classifier outputs a list of probabilities, one per document type. I'm trying to combine the predictions of these different classifiers into a single probability list.
Since I want to be able to decide manually how trusted / impactful each classifier is, I started by using a weighted average to combine the predictions.
Consider an example with three classifiers (One, Two, Three) and three document types (T1, T2, T3). With the weighted-average method and the coefficients in the table below, I would compute, for example,
P_result(T1) = (1 * P_One(T1) + 2 * P_Two(T1) + 2 * P_Three(T1)) / (1 + 2 + 2) = 0.5
╔════════════╦═════════════╦═════════╦═════════╦═════════╗
║ Classifier ║ Coefficient ║  P(T1)  ║  P(T2)  ║  P(T3)  ║
╠════════════╬═════════════╬═════════╬═════════╬═════════╣
║ One        ║      1      ║   0.7   ║   0.1   ║   0.2   ║
╠════════════╬═════════════╬═════════╬═════════╬═════════╣
║ Two        ║      2      ║   0.8   ║   0.0   ║   0.2   ║
╠════════════╬═════════════╬═════════╬═════════╬═════════╣
║ Three      ║      2      ║   0.1   ║   0.2   ║   0.7   ║
╚════════════╩═════════════╩═════════╩═════════╩═════════╝
╔════════════╦═════════════╦═════════╦═════════╦═════════╗
║ Results    ║      /      ║   0.5   ║   0.1   ║   0.4   ║
╚════════════╩═════════════╩═════════╩═════════╩═════════╝
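For what it's worth, here is a minimal sketch of that weighted average in Python, using the coefficients and probabilities from the table above (nothing here is specific to my real classifiers):

weights = {"One": 1, "Two": 2, "Three": 2}
preds = {
    "One":   {"T1": 0.7, "T2": 0.1, "T3": 0.2},
    "Two":   {"T1": 0.8, "T2": 0.0, "T3": 0.2},
    "Three": {"T1": 0.1, "T2": 0.2, "T3": 0.7},
}

total_weight = sum(weights.values())
combined = {
    t: sum(weights[c] * preds[c][t] for c in weights) / total_weight
    for t in ["T1", "T2", "T3"]
}
print(combined)  # roughly {'T1': 0.5, 'T2': 0.1, 'T3': 0.4}, matching the Results row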
This simple approach seems to work, but things get more complicated.
Actually, some of my classifiers are specialized: they do not apply to the entire input domain (the full list of document types), but only to a subdomain. For example, I might have a classifier that, given an IRS form, can determine the respective probabilities of the document being a W-2, a W-3 or a 1040 form. In that case, the output probabilities are conditional probabilities.
Let's say that classifiers Two and Three are specialized. Classifier Two only applies to types T1 and T2, and classifier Three only applies to types T2 and T3. My new table could look something like this:
╔════════════╦═════════════╦═════════╦═════════╦═════════╗
║ Classifier ║ Coefficient ║  P(T1)  ║  P(T2)  ║  P(T3)  ║
╠════════════╬═════════════╬═════════╬═════════╬═════════╣
║ One        ║      1      ║   0.5   ║   0.3   ║   0.2   ║
╠════════════╬═════════════╬═════════╬═════════╬═════════╣
║ Two        ║      2      ║   0.2   ║   0.8   ║   N/A   ║
╠════════════╬═════════════╬═════════╬═════════╬═════════╣
║ Three      ║      2      ║   N/A   ║   0.4   ║   0.6   ║
╚════════════╩═════════════╩═════════╩═════════╩═════════╝
╔════════════╦═════════════╦═════════╦═════════╦═════════╗
║ Results    ║      /      ║    ?    ║    ?    ║    ?    ║
╚════════════╩═════════════╩═════════╩═════════╩═════════╝
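To state the specialized outputs a bit more formally (this is just a restatement of the setup above, writing T for the true type of the document): classifier Two outputs P_Two(Ti) = P(T = Ti | T in {T1, T2}) and classifier Three outputs P_Three(Ti) = P(T = Ti | T in {T2, T3}), whereas classifier One outputs unconditional probabilities over all three types.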
However, in that case it does not really make sense to use a weighted average to compute the final probabilities, because we would be adding probabilities defined on different domains. I've tried drafting several approaches but could not find anything suitable. Do you have any ideas or pointers to existing methods for combining these different predictions into one?
Thank you for reading ;)
PS: I'm sorry for the lack of mathematical formalism in the question; I wasn't sure how to write it down correctly.

Related

Querying last row of sorted column where value is less than specific amount from parquet file

I have a large parquet file where the data in one of the columns is sorted. A very simplified example is below.
    X       Y
0   1     Red
1   5    Blue
2   8   Green
3  12  Purple
4  15    Blue
5  17  Purple
I am interested in querying the last value of column Y given that X is less than some amount, in the most efficient way possible using Python.
I am guaranteed that column X is sorted in ascending order.
As an example, given that X is less than 11, I would expect a Y value of "Green".
I have tried the following:
import pandas as pd

columns = ['Y']
filters = [('X', '<', 11)]
pd.read_parquet('my_data.parquet', filters=filters, columns=columns).tail(1)
The code above "works" but I am hoping optimizations are possible as this query is run 1M+ times per day.
The parquet file is too large to be read into memory.
I cannot put a starting value for column "X", as there is no guarantee of the size of the gap between values of X. For example, if I were to require "X > 10 and X < 11" I would not get a value for Y returned.
I was hoping given the fact the data is sorted there is a way to optimize this.
I am open to using DuckDB or some other library to do this.
I think that's what .search_sorted() is for.
You can also use .scan_parquet() to lazy load the data instead of .read_parquet()
You may need to use when/then to handle the case of the first row being a match (using index 0 instead of row - 1), or the case of there being no match (if that's possible); one rough way to guard the no-match case is sketched after the output below.
import polars as pl

(pl.scan_parquet("search.parquet")
   .select(
       pl.col("Y")
       .take(pl.col("X").search_sorted(11, side="left") - 1)
   )
   .collect())
shape: (1, 1)
┌───────┐
│ Y     │
│ ---   │
│ str   │
╞═══════╡
│ Green │
└───────┘
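Not sure this is the cleanest way, but here is a rough sketch of the no-match guard mentioned above, done with a plain Python branch instead of when/then: look up the insertion point first, then only take the preceding row if one exists (the file name, column names and threshold 11 are the ones from the example):

import polars as pl

lf = pl.scan_parquet("search.parquet")

# index of the first row with X >= 11
idx = lf.select(pl.col("X").search_sorted(11, side="left")).collect().item()

if idx == 0:
    y = None  # no row satisfies X < 11
else:
    y = lf.select(pl.col("Y").take(idx - 1)).collect().item()

print(y)  # "Green" for the sample data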

tm to tidytext conversion

I am trying to learn tidytext. I can follow the examples on the tidytext website as long as I use the built-in packages (e.g. janeaustenr). However, most of my data are text files in a corpus. I can reproduce the tm-to-tidytext conversion example for sentiment analysis (ap_sentiments) on the tidytext website. I am having trouble, however, understanding how the tidytext data are structured. For example, the Austen novels are stored by "book" in the janeaustenr package. For my tm data, however, what is the equivalent of the vector for book? Here is the specific example for my data:
cname <- file.path(".", "greencomments", "all")
I can then use tidytext successfully after running the tm preprocessing:
practice <- tidy(tdm)
practice

partysentiments <- practice %>%
  inner_join(get_sentiments("bing"), by = c(term = "word"))
partysentiments
# A tibble: 170 x 4
   term    document count sentiment
   <chr>   <chr>    <dbl> <chr>
 1 benefit 1         1.00 positive
 2 best    1         2.00 positive
 3 better  1         7.00 positive
 4 cheaper 1         1.00 positive
 5 clean   1         24.0 positive
 7 clear   1         1.00 positive
 8 concern 1         2.00 negative
 9 cure    1         1.00 positive
10 destroy 1         3.00 negative
But, I can't reproduce the simple ggplots of word frequencies in tidytext. Since my data/corpus are not arranged with a column for "book" in the dataframe, the code (and therefore much of the tidytext functionality) won't work.
Here is an example of the issue. This works fine:
practice %>%
  count(term, sort = TRUE)
# A tibble: 989 x 2
   term        n
   <chr>   <int>
 1 activ       3
 2 air         3
 3 altern      3
But how do I arrange the tm corpus to match the structure of the books in the janeaustenr package? Is "document" the equivalent of "book"? I have text files in folders for the corpus. I have tried replacing this in the code, and it doesn't work. Maybe I need to rename something? Apologies in advance - I am not a programmer.

remove a lesser duplicate

In KDB, I have the following table:
q)tab:flip `items`sales`prices!(`nut`bolt`cam`cog`bolt`screw;6 8 0 3 0n 0n;10 20 15 20 0n 0n)
q)tab
items sales prices
------------------
nut   6     10
bolt  8     20
cam   0     15
cog   3     20
bolt
screw
In this table, there are 2 duplicate items (bolt). However, since the first 'bolt' contains more information, I would like to remove the 'lesser' bolt.
FINAL RESULT:
items sales prices
------------------
nut   6     10
bolt  8     20
cam   0     15
cog   3     20
screw
As far as I understand, if I use the 'distinct' function, it's not deterministic?
One way to do it is to fill forward by item; the later 'bolt' then inherits the previous row's values.
q)update fills sales,fills prices by items from tab
items sales prices
------------------
nut   6     10
bolt  8     20
cam   0     15
cog   3     20
bolt  8     20
screw
This can also be done in functional form where you can pass the table and by columns:
{![x;();(!). 2#enlist(),y;{x!fills,/:x}cols[x]except y]}[tab;`items]
If "more information" means "least nulls" then you could count the number of nulls in each row and only return those rows by item that contain the fewest:
q)select from #[tab;`n;:;sum each null tab] where n=(min;n)fby items
items sales prices n
--------------------
nut   6     10     0
bolt  8     20     0
cam   0     15     0
cog   3     20     0
screw              2
Although I would not recommend this approach, as it requires working with rows rather than columns.
Because those two rows contain different data, they are considered distinct.
It depends on how you define "more information". You would probably need to provide more examples, but some possibilities:
Delete rows with null sales value
q)delete from tab where null sales
items sales prices
------------------
nut   6     10
bolt  8     20
cam   0     15
cog   3     20
Retrieve the row with the maximum sales*prices value for each item
q)select from tab where (sales*prices) = (max;sales*prices) fby items
items sales prices
------------------
nut   6     10
bolt  8     20
cam   0     15
cog   3     20

How do I identify the correct clustering algorithm for the available data?

I have sample data on flight routes: the number of searches for each route, the gross profit for the route, and the number of transactions for the route. I want to bucket flight routes that show similar characteristics based on the above-mentioned variables. What are the steps to settle on a particular clustering algorithm?
Below is sample data which I would like to cluster.
Route Clicks Impressions CPC Share of Voice Gross-Profit Number of Transactions Conversions
AAE-ALG 2 25 0.22 $4.00 2 1
AAE-CGK 5 40 0.21 $6.00 1 1
AAE-FCO 1 25 0.25 $13.00 4 1
AAE-IST 8 58 0.30 $18.00 3 2
AAE-MOW 22 100 0.11 $1.00 6 5
AAE-ORN 11 70 0.21 $22.00 3 2
AAE-ORY 8 40 0.18 $3.00 4 4
To me this looks like an N-dimensional clustering problem, where N is the number of features (Route, Clicks, Impressions, CPC, Share of Voice, Gross-Profit, Number of Transactions, Conversions).
I think that if you preprocess the feature values so that a distance between them is meaningful, you can apply k-means to cluster your data.
E.g. a route can be represented by the distance between its two airports, dA; the difference between two routes' distances then gives a distance between the routes: d = ABS(dA - dA').
Don't forget to scale your features.
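As a rough illustration of that recipe (scale the numeric features, then run k-means), here is a short Python sketch with scikit-learn; routes.csv is a hypothetical file holding the sample table above, and n_clusters=3 is an arbitrary choice:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Load the sample table; keep the route label aside and cluster only the numeric columns.
# Note: any currency-formatted columns (e.g. "$4.00") would need the "$" stripped and a
# cast to float before they show up here as numeric.
df = pd.read_csv("routes.csv")
numeric = df.select_dtypes("number")

X = StandardScaler().fit_transform(numeric)  # put all features on a comparable scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
df["cluster"] = labels
print(df[["Route", "cluster"]])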

printing FFT at multiple points in tabulated form (matrix form)

I've simulated the voltage at multiple points in the time domain using C++. The output is printed in tabulated form (i.e. time in the first column and the voltage at each point in the following columns).
I'm new to MATLAB, but I'm using it to FFT that file; I need to get a table with frequency in the first column followed by the frequency-domain values for each point (in the same file).
I've tried several ways to produce such a table, but the data always end up displayed as a single-column matrix.
I need it to be in the form:
╔════╦═══════╦═══════╦═══════╦═══════╗
║ f  ║ V1(f) ║ V2(f) ║  ...  ║ Vn(f) ║
╠════╬═══════╬═══════╬═══════╬═══════╣
║ f1 ║  ..   ║  ..   ║  ..   ║  ..   ║
║ f2 ║  ..   ║  ..   ║  ..   ║  ..   ║
║ f3 ║  ..   ║  ..   ║  ..   ║  ..   ║
╚════╩═══════╩═══════╩═══════╩═══════╝
Also, if I'm able to create such a matrix, how can I get its transpose (to FFT it once more with respect to space)?
The code is as follows:
itr=importdata('filename.itr');
L=length(itr);
T=itr(L,1);
dt=itr(2,1);
t=(0:dt:T-dt);
fs=1/dt;
FR_length=L;
[M,N]=size(itr(1:end,1:end));
f=-FR_length/2:FR_length/2-1;
f=f.*(fs/FR_length);
for n=2:N
FR=fft(itr(:,n),FR_length);
end
can anyone help me with this?
Many Thanks :)
I changed some of your code based on what I assume is the structure of input itr:
[L,N]=size(itr);                  % rows = time samples, columns = time + voltages
dt=itr(2,1);                      % time step, taken from the time column
fs=1/dt;                          % sampling frequency
f=-L/2:L/2-1;
f=f*(fs/L);                       % frequency axis, centred on zero
FR=fftshift(fft(itr(:,2:end)));   % FFT of all voltage columns at once
disp([f' FR])                     % frequency column followed by the spectra
The last line will display your data as a table on the command window.
Note that I removed the loop since matlab allows vectorized notation. Also added fftshift to get your frequencies and amplitudes to line up correctly.
You can use save or fprintf to write to file, for instance as follows (you'll need to change the format string to match the number of data columns):
fid = fopen('test.dat','w');
fprintf(fid,'%f %f %f \n',[f ; real(FR)' ; imag(FR)']);
fclose(fid);