Issue when trying to select an option from a list for scraping - Python - select

I am trying to scrape the table contained in the following page: https://predictioncenter.org/casp14/results.cgi?view=tables&target=T1024&model=1&groups_id=
At the top of the table, I want to change the model from "1" to "- All -".
I wrote the following lines of code:
link = f"https://predictioncenter.org/casp14/results.cgi?view=tables&target=T1024&model=- All -&groups_id="
browser.get(link)
but this doesn't work.
When I replace model=- All - with model=1 the code works, so I suspect the problem is with my - All - option, but I can't figure out what it is.
Full code below, with the loop through all target and model options (the version above was simplified):
from bs4 import BeautifulSoup,NavigableString, Tag
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import csv
import os
import numpy as np
os.chdir('THE DIRECTORY WHERE YOUR CHROMEDRIVER IS')
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(executable_path='THE DIRECTORY WHERE YOUR CHROMEDRIVER IS/chromedriver')
browser.get("https://predictioncenter.org/") #open page in browser
df = pd.DataFrame()
x = browser.find_elements(By.XPATH, "//a[contains(@id, 'ygtvlabelel6')]")[0].click()
x = browser.find_elements(By.XPATH, "//a[contains(@href, 'results.cgi')]")[0].click()
x = browser.find_elements(By.XPATH, "//a[contains(@id, 'a_T1024')]")[0].click()
content = browser.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
#Get all possible options
options = soup.find("select",{"name":"target"}).findAll("option")
list_prot = []
for i in options:
    name = i.text
    list_prot.append(name)
type_model = soup.find("select",{"name":"model"}).findAll("option")
model_t=[]
for i in type_model:
    name = i.text
    model_t.append(name)
mod=model_t[0]
i=0
final=pd.DataFrame()
for target in list_prot:
    print(i)
    link = f"https://predictioncenter.org/casp14/results.cgi?view=tables&target={target}&model={mod}&groups_id="
    browser.get(link)

Here is one way of getting that table containing -All- results:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
headers = {
'Origin': 'https://predictioncenter.org',
'Referer': 'https://predictioncenter.org/casp14/results.cgi',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}
payload = {
'target': 'T1024',
'groups_id': '',
'model': '',
'submit': 'Submit',
'order': '',
'field': '',
'view': 'results',
'lga_4_view': 'brief',
'lga_5_view': 'brief',
'dsc_view': 'brief',
'dali_view': 'brief',
'molprb_view': 'brief'
}
url = 'https://predictioncenter.org/casp14/results.cgi'
r = requests.post(url, headers=headers, data=payload)
results_table = bs(r.text, 'html.parser').select_one('table[class="table_results"]')
df = pd.read_html(str(results_table))[0]
print(df)
Result in terminal:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
0 General General General General General LGA Sequence Dependent (4Å) Full LGA Sequence Dependent (4Å) Full LGA Sequence Dependent (4Å) Full LGA Sequence Independent (4Å) Full LGA Sequence Independent (4Å) Full MAMMOTH Dali Full Molprobity Full lDDT SphGr CAD CAD RPF QCS QCS SOV CE CoDM DFM Handed. TM TM FlexE ASE
1 # Model GR# GR Name Charts GDT_TS NP_P Z-M1-GDT AL0_P AL4_P Z-score Z-Score MP-Score Global score SG AA SS RPF QCS contS SOV CE CoDM DFM Handed. TMscore TMalign FlexE ASE
2 1. T1024TS427_3 427 AlphaFold2 A D I G 79.22 100.00 NaN 82.61 96.16 13.16 58.6 1.05 0.89 99.62 0.82 0.69 0.90 96.35 96.86 82.30 7.84 0.99 0.04 0.97 0.93 0.93 0.35 90.02
3 2. T1024TS427_5 427 AlphaFold2 A D I G 71.67 100.00 NaN 69.82 86.70 10.97 54.8 1.10 0.88 99.62 0.82 0.68 0.88 94.42 96.61 79.70 7.74 0.98 0.06 0.95 0.88 0.88 0.35 83.46
4 3. T1024TS226_5 226 s Zhang-TBM A D I G 65.73 100.00 NaN 71.10 85.42 11.57 41.8 1.94 0.64 84.27 0.65 0.39 0.71 82.88 81.55 83.80 7.64 0.92 0.33 0.92 0.87 0.87 3.40 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
509 508. T1024TS170_4 170 s BhageerathH-Plus A D I G 9.46 100.00 NaN 1.02 6.65 0.66 2.2 0.97 0.20 12.40 0.40 0.07 0.24 25.99 41.23 40.90 4.74 0.21 2.02 0.53 0.23 0.42 201.20 NaN
510 509. T1024TS063_5 063 s ACOMPMOD A D I G 9.02 100.00 NaN 0.26 0.51 -0.07 0 4.98 0.06 10.10 0.30 0.05 0.16 21.49 29.85 37.60 3.70 0.08 2.17 0.50 0.17 0.27 746.50 92.81
511 510. T1024TS305_1 305 s CAO-SERVER A D I G 8.76 100.00 -3.36 0.00 0.00 -0.80 0 4.74 0.11 18.67 0.41 0.07 0.23 26.36 48.22 45.30 3.50 0.09 2.32 0.48 0.22 0.30 151.85 18.11
512 511. T1024TS342_2 342 CUTSP A D I G 8.76 100.00 NaN 0.00 0.00 0.75 4.4 2.98 0.20 12.66 0.42 0.08 0.26 24.47 43.00 42.50 5.33 0.15 2.23 0.49 0.15 0.28 410.17 NaN
513 512. T1024TS217_5 217 CAO-QA1 A D I G 8.44 100.00 NaN 1.02 5.37 -1.15 0 NaN 0.08 36.92 0.42 0.00 0.10 23.51 48.36 56.20 3.70 0.21 2.00 0.50 0.20 0.32 167.41 17.24
514 rows × 29 columns
You may want to select another row for the table headers; see the header parameter in the pandas documentation for read_html.
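For what it's worth, the payload above sends an empty string for model, which suggests that the "- All -" entry in the dropdown carries an empty value rather than its visible label. Below is a minimal sketch, using only the standard library, of how the query string from the question could be built so that special characters are escaped and the empty value is kept. The helper name build_results_url is made up for illustration, and whether the server treats a GET request with model= the same way as the POST above is an assumption I have not verified.

```python
from urllib.parse import urlencode

BASE_URL = "https://predictioncenter.org/casp14/results.cgi"

def build_results_url(target, model=""):
    """Hypothetical helper: build the results URL for one target.

    An empty model is ASSUMED to correspond to the '- All -' entry.
    """
    params = {
        "view": "tables",
        "target": target,
        "model": model,   # "" is assumed to mean '- All -'
        "groups_id": "",
    }
    # urlencode escapes spaces and other unsafe characters and keeps empty values
    return f"{BASE_URL}?{urlencode(params)}"

all_models_url = build_results_url("T1024")
label_url = build_results_url("T1024", "- All -")
print(all_models_url)
print(label_url)
```

Passing the visible label "- All -" still produces a valid URL (the spaces are escaped), but the server may not recognise it as an option value, which would be consistent with the behaviour described in the question.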

You can select '- All -' from the model dropdown with Selenium: click on the dropdown, then pick the desired entry with the select_by_index() method, and it should work as expected.
Full working code:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
URL = 'https://predictioncenter.org/casp14/results.cgi?view=tables&target=T1024&model=1&groups_id='
driver.get(URL)
driver.maximize_window()
time.sleep(5)
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.table > tbody tr:first-child > td:nth-child(3)'))).click()
time.sleep(2)
dropdown=Select(driver.find_element(By.CSS_SELECTOR,'select#model'))
time.sleep(2)
dropdown.select_by_index(0)
time.sleep(2)
soup = BeautifulSoup(driver.page_source, "html.parser")
table= soup.select_one('.table_results')
df = pd.read_html(str(table))[0]
print(df)
driver.quit() # close browser
Output:
0 1 2 3 4 ... 24 25 26 27 28
0 General General General General General ... Handed. TM TM FlexE ASE
1 # Model GR# GR Name Charts ... Handed. TMscore TMalign FlexE ASE
2 # NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
3 NaN Model NaN NaN NaN ... NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ... ... ...
592 508. T1024TS170_4 170 s BhageerathH-Plus A D I G ... 0.53 0.23 0.42 201.20 NaN
593 509. T1024TS063_5 063 s ACOMPMOD A D I G ... 0.50 0.17 0.27 746.50 92.81
594 510. T1024TS305_1 305 s CAO-SERVER A D I G ... 0.48 0.22 0.30 151.85 18.11
595 511. T1024TS342_2 342 CUTSP A D I G ... 0.49 0.15 0.28 410.17 NaN
596 512. T1024TS217_5 217 CAO-QA1 A D I G ... 0.50 0.20 0.32 167.41 17.24
[597 rows x 29 columns]

Related

vectorise for loop with a variable that is incremented in each iteration

I am trying to optimise the running time of my code by getting rid of some for loops. However, I have a variable that is incremented in each iteration in which sometimes the index is repeated. I provide here a minimal example:
a = [1 4 2 2 1 3 4 2 3 1]
b = [0.5 0.2 0.3 0.4 0.1 0.05 0.7 0.3 0.55 0.8]
c = [3 5 7 9]
for i = 1:10
c(a(i)) = c(a(i)) + b(i)
end
Ideally, I would like to compute it by writing:
c(a) = c(a) + b
but obviously this would not give the same result, since the value for the same index has to be updated several times, so this way of vectorising does not work.
Also, I am working in Matlab or Octave, in case that is important.
Thank you very much for any help; I am not sure whether it is possible to vectorise this.
Edit: thank you very much for your answers so far. I have discovered accumarray, which I did not know before, and I also understood why the for loop was giving me such different times in Matlab and Octave. I also understand my problem better now. The example I gave was too simple; I thought I could extend it, but what if b is a matrix?
(Let's forget about c at the moment):
a = [1 4 2 2 1 3 4 2 3 1]
b =[0.69 -0.41 -0.13 -0.13 -0.42 -0.14 -0.23 -0.17 0.22 -0.24;
0.34 -0.39 -0.36 0.68 -0.66 -0.19 -0.58 0.78 -0.23 0.25;
-0.68 -0.54 0.76 -0.58 0.24 -0.23 -0.44 0.09 0.69 -0.41;
0.11 -0.14 0.32 0.65 0.26 0.82 0.32 0.29 -0.21 -0.13;
-0.94 -0.15 -0.41 -0.56 0.15 0.09 0.38 0.58 0.72 0.45;
0.22 -0.59 -0.11 -0.17 0.52 0.13 -0.51 0.28 0.15 0.19;
0.18 -0.15 0.38 -0.29 -0.87 0.14 -0.13 0.23 -0.92 -0.21;
0.79 -0.35 0.45 -0.28 -0.13 0.95 -0.45 0.35 -0.25 -0.61;
-0.42 0.76 0.15 0.99 -0.84 -0.03 0.27 0.09 0.57 0.64;
0.59 0.82 -0.39 0.13 -0.15 -0.71 -0.84 -0.43 0.93 -0.74]
I now understand that what I am doing is a row sum per group, and since I am using Octave I cannot use "splitapply". I tried to generalise your answers, but I could not get accumarray to work for matrices, and I also could not generalise @rahnema1's solution. The desired output would be:
[0.34 0.26 -0.93 -0.56 -0.42 -0.76 -0.69 -0.02 1.87 -0.53;
0.22 -1.03 1.53 -0.21 0.37 1.54 -0.57 0.73 0.23 -1.15;
-0.20 0.17 0.04 0.82 -0.32 0.10 -0.24 0.37 0.72 0.83;
0.52 -0.54 0.02 0.39 -1.53 -0.05 -0.71 1.01 -1.15 0.04]
that is "equivalent" to
[sum(b([1 5 10],:))
sum(b([3 4 8],:))
sum(b([6 9],:))
sum(b([2 7],:))]
Thank you very much. If you think I should post this as a separate question instead of adding the edit, I will do so.
Original question
It can be done with accumarray:
a = [1 4 2 2 1 3 4 2 3 1];
b = [0.5 0.2 0.3 0.4 0.1 0.05 0.7 0.3 0.55 0.8];
c = [3 5 7 9];
c(:) = c(:) + accumarray(a(:), b(:));
This sums the values from b in groups defined by a, and adds that to the original c.
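For readers who come to this from Python (the language of the first question on this page), the same pitfall and fix exist in NumPy. The sketch below is not MATLAB or Octave code and is only meant as an analogy: np.add.at and np.bincount play the role of accumarray here, and the indices are shifted to be 0-based.

```python
import numpy as np

a = np.array([1, 4, 2, 2, 1, 3, 4, 2, 3, 1]) - 1   # 0-based group index
b = np.array([0.5, 0.2, 0.3, 0.4, 0.1, 0.05, 0.7, 0.3, 0.55, 0.8])
c = np.array([3.0, 5.0, 7.0, 9.0])

# Naive fancy-index update: for a repeated index only one addition survives
naive = c.copy()
naive[a] += b

# Unbuffered accumulation: every occurrence of an index contributes
accumulated = c.copy()
np.add.at(accumulated, a, b)

# Equivalent grouped sum, closest in spirit to accumarray
grouped = c + np.bincount(a, weights=b, minlength=c.size)

print(naive)
print(accumulated)
print(grouped)
```

The last two results agree with the explicit loop from the question, while the naive update does not, which is the same reason c(a) = c(a) + b fails.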
Edited question
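If b is a matrix, you can use
full(sparse(repmat(a, 1, size(b,2)), repelem(1:size(b,2), size(b,1)), b))
or
accumarray([repmat(a, 1, size(b,2)).' repelem(1:size(b,2), size(b,1)).'], b(:))
Here a is tiled once per column of b, so that every element of b gets its group as row index and its own column as column index.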
Matrix multiplication and implicit expansion can be used (Octave):
nc = numel(c);
c += b * (1:nc == a.');
For large inputs it may be more memory-efficient to use a sparse matrix:
nc = numel(c);
nb = numel(b);
c += b * sparse(1:nb, a, 1, nb, nc);
Edit: When b is a matrix you can extend this solution as:
nc = numel(c);
na = numel(a);
out = sparse(a, 1:na, 1, nc, na) * b;
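The idea behind the matrix product, a 0/1 indicator matrix that maps each row of b to its group, can be sketched in NumPy as well. This is an illustrative analogy rather than Octave code; the data in b is made up, the indices are 0-based, and a dense indicator is used instead of a sparse one for brevity.

```python
import numpy as np

a = np.array([1, 4, 2, 2, 1, 3, 4, 2, 3, 1]) - 1   # 0-based group of each row of b
rng = np.random.default_rng(0)
b = rng.standard_normal((a.size, 3))               # made-up data: one row per entry of a
n_groups = 4

# indicator[g, i] is 1 when row i of b belongs to group g
indicator = (np.arange(n_groups)[:, None] == a[None, :]).astype(float)
out = indicator @ b                                # shape (n_groups, 3): row sums per group

# Reference result with an explicit loop, for comparison
expected = np.zeros((n_groups, b.shape[1]))
for row, group in zip(b, a):
    expected[group] += row

print(np.allclose(out, expected))
```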

Matrix columns correlations, excluding self-correlation, Matlab

I have a couple of matrices (1800 x 27) that represent subjects and their recordings (3 minutes equivalent for each of 27 subjects). Each column represents a subject.
I need to do intercorrelation between subjects, let's say to correlate F to G, G to H, and H to F for all 27 subjects.
I use the corr command, corr(B), where B is a matrix, and it returns something like the following:
1 0.07 -0.05 0.10 0.04 0.12
0.07 1 -0.02 -0.08 0.17 0.03
-0.05 -0.02 1 0.04 0.16 0.13
0.10 -0.08 0.04 1 -0.04 0.34
0.04 0.18 0.16 -0.04 1 0.13
How can I adjust the code to exclude self-correlation (eg F to F) so I won't get "1" numerals?
(it's present in each row/column)
I have to perform some transformations afterwards, like the Fisher Z-transformation, which returns inf for each "1", and as a result I can't carry out further calculations.
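One possible approach, sketched here in NumPy rather than MATLAB and with random placeholder data instead of real recordings, is to keep corr as it is and mask the diagonal afterwards, or to keep only the off-diagonal pairs. The same idea should carry over to MATLAB, for example by overwriting the diagonal through a logical mask, but that translation is an assumption and is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((1800, 27))      # placeholder data: 1800 samples x 27 subjects

R = np.corrcoef(B, rowvar=False)         # 27 x 27 correlation matrix, columns are subjects
off_diagonal = ~np.eye(R.shape[0], dtype=bool)

# Option 1: keep the matrix shape but blank out the self-correlations
R_masked = R.copy()
R_masked[~off_diagonal] = np.nan
Z = np.arctanh(R_masked)                 # Fisher z; NaN on the diagonal instead of inf

# Option 2: keep each subject pair once (upper triangle, diagonal excluded)
rows, cols = np.triu_indices_from(R, k=1)
z_pairs = np.arctanh(R[rows, cols])      # 27*26/2 = 351 values

print(np.nanmean(Z), z_pairs.size)
```

Option 1 is convenient when later steps expect a square matrix and can ignore NaN; option 2 avoids counting every pair twice.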

Visualize matrix with colormap in grid

I have a matrix that looks like this:
0.06 -0.22 -0.10 0.68 NaN -0.33
0.04 -0.07 0.12 0.23 NaN -0.47
NaN NaN NaN NaN NaN 0.28
0.37 0.36 0.14 0.58 -0.14 -0.15
NaN 0.11 0.24 0.71 -0.13 NaN
0.57 0.53 0.41 0.65 -0.43 0.03
I want to color in each value based on a colormap. In Python, I know I can use imshow to assign a color to each box. How can I do it in MATLAB?
You could use imshow as well, but each matrix element would then be drawn as a single screen pixel. So you may prefer imagesc.
A = [...
0.06 -0.22 -0.10 0.68 NaN -0.33;
0.04 -0.07 0.12 0.23 NaN -0.47;
NaN NaN NaN NaN NaN 0.28;
0.37 0.36 0.14 0.58 -0.14 -0.15;
NaN 0.11 0.24 0.71 -0.13 NaN;
0.57 0.53 0.41 0.65 -0.43 0.03 ]
imagesc(A)
Then you can apply any colormap you want, or create your own.
colormap(jet)
colorbar
If you don't like how imagesc handles your NaNs, consider using pcolor:
pcolor(A)
colormap(jet)
colorbar
With shading flat you can get rid of the grid lines.

Selecting elements from a matrix in matlab

I have a matrix which is 36 x 2, but I want to separate the elements to give me 18 2 x 2 matrices, from top to bottom. E.g. if I have a matrix:
1 2
3 4
5 6
7 8
9 10
11 12
13 14
... ...
I want to split it into separate matrices:
M1 = 1 2
3 4
M2 = 5 6
7 8
M3 = 9 10
11 12
..etc.
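Maybe the following sample code is useful. Note that MATLAB stores matrices column by column, so a plain reshape(a,2,2,18) would not keep consecutive rows together; transposing first and permuting afterwards does:
a=rand(36,2);
b=permute(reshape(a.',2,2,[]),[2 1 3])
Then with the 3rd index of b you can access your 18 2x2 matrices, e.g. b(:,:,2) gives the second 2x2 matrix (rows 3 and 4 of a).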
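As a cross-check of the intended grouping, here is the same split sketched in NumPy. This is an analogy only: NumPy's default row-major order differs from MATLAB's column-major order, so the reshape arguments do not translate one-to-one between the two.

```python
import numpy as np

x = np.arange(1, 73).reshape(36, 2)   # rows [1 2], [3 4], ..., [71 72]

# NumPy is row-major by default, so consecutive rows stay together
blocks = x.reshape(18, 2, 2)          # blocks[k] is the (k+1)-th 2x2 matrix

print(blocks[0])
print(blocks[1])
```

Keeping all blocks in one 3D array means a single index selects a block, with no need for separately named variables.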
I think that the direct answer to your question is:
sampledata = [...
0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18; ...
0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 1.19 1.20 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.30 1.31 1.32 1.33 1.34 1.35 1.36];
for ix = 1:(size(sampledata,2)/2)
assignin('base',['M' sprintf('%02d',ix)], sampledata(:,ix*2+[-1 0]))
end
This creates 18 variables, named 'M01' through 'M18', with pieces of the sampledata matrix.
However, please don't use dynamic variable names like this. It will complicate every other piece of code that it touches. Use a cell array, a 3D array (as suggested by @Johannes_Endres, +1 BTW), or a structure. Anything that removes the need for you to write something like this later on:
%PLEASE DO NOT USE THIS
%ALSO DO NOT BACK YOURSELF INTO A CORNER WHERE YOU HAVE TO DO IT IN THE FUTURE
varNames = who('M*');
for ix = 1:length(varNames )
str = ['result(' num2str(ix) ') = some_function(' varNames {ix} ');'];
eval(str);
end
I've seen code like this, and it is slow and extremely cumbersome to maintain, not to mention the headache and pain to your internal beauty-meter.
x = reshape(1:36*2,[2 36])'
k = 1
for i = 1:2:35
eval(sprintf('M%d = x(%d:%d,:);',k,i,i+1));
k = k+1;
end

matlab: sorting and random

I need to extract a few small matrices from one huge raw matrix, according to the value in the 1st column (the 1st column contains either 1, 2, or 3):
If the 1st column is 1, then randomly save 75% of those rows in file A1 and the remaining 25% in file A2.
If the 1st column is 2, then randomly save 75% of those rows in file B1 and the remaining 25% in file B2.
If the 1st column is 3, then randomly save 75% of those rows in file C1 and the remaining 25% in file C2.
How should I write the code?
Example:
a raw matrix has 15 rows x 6 columns:
7 rows are 1 in 1st column, 5 rows are 2 in 1st column, and 3 rows are 3 in 1st column.
1 -0.05 -0.01 0.03 0.07 0.11
1 -0.4 -0.36 -0.32 -0.28 -0.24
1 0.3 0.34 0.38 0.42 0.46
1 0.75 0.79 0.83 0.87 0.91
1 0.45 0.49 0.53 0.57 0.61
1 0.8 0.84 0.88 0.92 0.96
1 0.05 0.09 0.13 0.17 0.21
2 0.5 0.54 0.58 0.62 0.66
2 0.4 0.44 0.48 0.52 0.56
2 0.9 0.94 0.98 1.02 1.06
2 0.85 0.89 0.93 0.97 1.01
2 0.75 0.79 0.83 0.87 0.91
3 0.36 0.4 0.44 0.48 0.52
3 0.6 0.64 0.68 0.72 0.76
3 0.4 0.44 0.48 0.52 0.56
7 rows have 1 in the 1st column: randomly take 75% of the 7 rows (7*0.75 = 5.25, i.e. 5 rows) as a new matrix (5 rows x 6 columns); the remaining 25% become another new matrix.
5 rows have 2 in the 1st column: randomly take 75% of the 5 rows (5*0.75 = 3.75, i.e. 4 rows) as a new matrix (4 rows x 6 columns); the remaining 25% become another new matrix.
3 rows have 3 in the 1st column: randomly take 75% of the 3 rows (3*0.75 = 2.25, i.e. 2 rows) as a new matrix (2 rows x 6 columns); the remaining 25% become another new matrix.
Result:
A1=
1 -0.4 -0.36 -0.32 -0.28 -0.24
1 0.3 0.34 0.38 0.42 0.46
1 0.75 0.79 0.83 0.87 0.91
1 0.8 0.84 0.88 0.92 0.96
1 -0.05 -0.01 0.03 0.07 0.11
B1=
2 0.9 0.94 0.98 1.02 1.06
2 0.85 0.89 0.93 0.97 1.01
2 0.5 0.54 0.58 0.62 0.66
2 0.75 0.79 0.83 0.87 0.91
C1=
3 0.36 0.4 0.44 0.48 0.52
3 0.4 0.44 0.48 0.52 0.56
Here is one possible solution to your problem, using the function randperm:
% Create matrices
firstcol=ones(15,1);
firstcol(8:12)=2;
firstcol(13:15)=3;
mat=[firstcol rand(15,5)];
% Sort according to first column
A=mat(mat(:,1)==1,:);
B=mat(mat(:,1)==2,:);
C=mat(mat(:,1)==3,:);
% Randomly rearrange lines
A=A(randperm(size(A,1)),:);
B=B(randperm(size(B,1)),:);
C=C(randperm(size(C,1)),:);
% Select first 75% lines (rounding)
A1=A(1:round(0.75*size(A,1)),:);
A2=A(round(0.75*size(A,1))+1:end,:);
B1=B(1:round(0.75*size(B,1)),:);
B2=B(round(0.75*size(B,1))+1:end,:);
C1=C(1:round(0.75*size(C,1)),:);
C2=C(round(0.75*size(C,1))+1:end,:);
Hope it helps.
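The same shuffle-then-cut idea can be sketched in plain Python with the standard library only. This is an illustration and not a translation of the MATLAB code: the helper name split_by_group and the sample rows are made up, and the cut point is rounded half up to mimic MATLAB's round (Python's built-in round uses a different tie rule).

```python
import random
from collections import defaultdict

def split_by_group(rows, fraction=0.75, seed=None):
    """Hypothetical helper: shuffle the rows of each group and split them in two."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)                 # first column is the group label
    first, second = {}, {}
    for label, members in groups.items():
        rng.shuffle(members)                       # random order within the group
        cut = int(fraction * len(members) + 0.5)   # round half up
        first[label] = members[:cut]
        second[label] = members[cut:]
    return first, second

# Made-up rows: 7 with label 1, 5 with label 2, 3 with label 3
rows = [[1, i] for i in range(7)] + [[2, i] for i in range(5)] + [[3, i] for i in range(3)]
part1, part2 = split_by_group(rows, seed=0)
print({k: len(v) for k, v in part1.items()})
print({k: len(v) for k, v in part2.items()})
```

Returning dictionaries keyed by the group label avoids one pair of variables per group, so the code does not change if a fourth label appears.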