Summarise data using dplyr, missing column - select

I am a total newbie with R, so I will be visiting this site a lot over the coming months. To get familiar with R I am trying to load some (old) datasets, and I have already run into my first problem.
I have a dataset (microarray data) with the following columns: probe_name, gene_name, systemic_name, log_fc, ave_expr and p_value. I have multiple read-outs per gene_name, so I want to summarise them. However, I do need to keep my systemic_name. So far I have created this:
df_expression <- microarray_data |>
  group_by(gene_name) |>
  summarise(
    mean_log_fc = mean(log_fc, na.rm = TRUE),
    mean_ave_exp = mean(ave_expr, na.rm = TRUE),
    mean_p_val = mean(p_value, na.rm = TRUE),
    mean_adj_p_val = mean(adj_p_val, na.rm = TRUE)
  )
This gives the mean values for those columns, but I lose the systemic_name column. I have tried several things, without success. It should be simple, but I cannot figure it out.
I tried adding the select() function, but then the code stops at summarise().
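A minimal sketch of one common fix, assuming each gene_name maps to exactly one systemic_name (if it does not, first() simply keeps the first value per group):

library(dplyr)

df_expression <- microarray_data |>
  group_by(gene_name) |>
  summarise(
    systemic_name = first(systemic_name),  # carry the label through the summary
    mean_log_fc = mean(log_fc, na.rm = TRUE),
    mean_ave_exp = mean(ave_expr, na.rm = TRUE),
    mean_p_val = mean(p_value, na.rm = TRUE),
    mean_adj_p_val = mean(adj_p_val, na.rm = TRUE)
  )

An equivalent alternative is group_by(gene_name, systemic_name): the column then survives because it is part of the grouping.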

Related

gtsummary::tbl_regression() - Obtain Random Effects from GLMM Zero-Inflated Model

When trying to create a table with the conditional random effects in R using the gtsummary function tbl_regression on a glmmTMB mixed-effects negative-binomial zero-inflated model, I get duplicate random-effects rows.
Example (using Mollie Brooks' Zero-Inflated GLMMs on Salamanders Dataset):
library(glmmTMB)
data(Salamanders)
head(Salamanders)
zinbm2 = glmmTMB(count ~ spp + mined + (1|site), zi = ~ spp + mined + (1|site),
                 data = Salamanders, family = nbinom2)
zinbm2_table_cond <- tbl_regression(
  zinbm2,
  tidy_fun = function(...) broom.mixed::tidy(..., component = "cond"),
  exponentiate = TRUE,
  estimate_fun = purrr::partial(style_ratio, digits = 3),
  pvalue_fun = purrr::partial(style_sigfig, digits = 3))
zinbm2_table_cond
Output:
[Image: Random Effects Output (cond)]
When extracting the random effects from the zero-inflated part of the model, I get the same problem.
Example:
zinbm2_table_zi <- tbl_regression(
  zinbm2,
  tidy_fun = function(...) broom.mixed::tidy(..., component = "zi"),
  exponentiate = TRUE,
  estimate_fun = purrr::partial(style_ratio, digits = 3),
  pvalue_fun = purrr::partial(style_sigfig, digits = 3))
zinbm2_table_zi
Output:
[Image: Random Effects Output (zi)]
The problem persists if I specify the effects argument in broom.mixed.
tidy_fun = function(...) broom.mixed::tidy(..., effects = "ran_pars", component = "cond"),
Looking at the confidence intervals in both outputs, it seems that somehow the random effects are being extracted from both parts of the model, and the estimate of the zero-inflated random effect (in the 1st image; the opposite in the 2nd image) is being changed to match the conditional-part estimate while keeping its own CI.
I am not knowledgeable enough to understand why this is happening. Since both rows have the same label, I am having difficulty removing the wrong one.
Any tips on how to avoid this problem or a workaround to remove the undesired rows?
If you need more info, let me know.
Thank you in advance.
PS: Output images were changed to links due to insufficient reputation.
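A hedged workaround sketch rather than a fix for the underlying behaviour: gtsummary exposes the table's underlying data frame through modify_table_body(), so the duplicated random-effects row can be dropped there. The distinct() call below assumes that keeping the first row per variable/label pair is the right choice; check which duplicate carries the correct estimate before relying on it.

library(gtsummary)
library(dplyr)

zinbm2_table_zi_clean <- zinbm2_table_zi |>
  modify_table_body(
    # keep only the first row per variable/label pair (assumption: the
    # first occurrence is the correct zero-inflated random effect)
    ~ .x |> distinct(variable, label, .keep_all = TRUE)
  )
zinbm2_table_zi_clean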

Spark, Scala, Databricks, combine and add columns

I am using Spark/Scala to attempt a "simple" query. I have a file which, after line 1 of the code below runs, looks like this:
EmpReg,EmpOT,RegPay,OTPay
Alice,Alice,400,20
Bob,Bob,300,0
Carol,Carol,450,120
Dan,Dan,400,200
Ellen,Ellen,360,40
The first and third columns (EmpReg, RegPay) come from one source and the second and fourth columns (EmpOT, OTPay) come from a second source. My objective is output that looks like this:
Emp,Pay
Alice,420
Bob,300
Carol,570
Dan,600
Ellen,400
Here is the code that I have been trying, or at least what I have saved of it:
var q2 = q.join(q1, q("EmpReg") === q1("EmpOT"), "fullouter")
//q2 = q2.select("EmpReg", ($"RegPay" + $"OTPay"))
//q2 = q2.groupBy($"EmpReg".sum($"RegPay" + $"OTPay"))
var add = q2.select(($"RegPay" + $"OTPay"))
//q2 = q2.sum("RegPay", "OTPay")
//q2 = q2.groupBy("EmpReg", "EmpOT")
//var q2 = q.join(q1).where("EmpReg") === "EmpOT"))
//q2 = q2.select("EmpReg").sum("RegPay", "OTPay")
//q2.show
add.show
[q] is the first file, which represents regular pay. [q1] is the second file, which represents overtime pay. [q2] is the combination shown in the first example above. The primary keys are [EmpReg] and [EmpOT]. I don't really need to combine [EmpReg] and [EmpOT] since they are the same, and it doesn't make any difference which I use.
I really need to add [RegPay] and [OTPay] to get [Pay], but for the life of me I can't get it to work. The lines commented out return various errors. I can add the two pay columns, and I can select an appropriate employee column, but I can't seem to do both in one query. I am constrained to use Scala on Databricks. Otherwise, I might do something like this:
select q.EmpReg as Emp, (q.RegPay + q1.OTPay) as Pay
from q join q1 on q.EmpReg = q1.EmpOT
(Why can't things ever be simple?)
You can use an approach similar to your SQL query:
val q2 = q.join(q1, q("EmpReg") === q1("EmpOT"), "fullouter")
val add = q2.select(q("EmpReg").as("Emp"), (q("RegPay") + q1("OTPay")).as("Pay"))
Your code has this line:
q2.select("EmpReg", ($"RegPay" + $"OTPay"))
which would work if you added $ before "EmpReg": you can't mix strings and Column objects in the same select statement. That works in Python (PySpark) but not in Scala.
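One caveat worth adding as a hedged sketch, not as part of the answer above: with a "fullouter" join, an employee present in only one source has nulls on the other side, so Emp and Pay can both come out null. coalesce() guards against that (this assumes the usual org.apache.spark.sql.functions import is available, as on Databricks):

import org.apache.spark.sql.functions.{coalesce, lit}

val q2 = q.join(q1, q("EmpReg") === q1("EmpOT"), "fullouter")
val add = q2.select(
  coalesce(q("EmpReg"), q1("EmpOT")).as("Emp"),  // take whichever key is non-null
  (coalesce(q("RegPay"), lit(0)) + coalesce(q1("OTPay"), lit(0))).as("Pay")
)
add.show()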

How do I exclude some elements of a list from further calculations

So I have a list of stars and their respective distances. My assignment is to find which stars are within a certain distance (±10 parsec). I want to exclude the others from further calculations in the program. The thing is, I don't want to remove them completely, so remove, pop, etc. aren't helping me: I still want those stars to be present in my output CSV. I just want something that says, for the stars that don't satisfy the if statement, "don't use them in this calculation". So I guess the output would be blank for those.
I suppose it takes an if or for statement to mark those bad stars as False, and then further down the line a calculation that excludes those faulty stars.
I'm a physics student and this is my first Python program ever! Please be cool about my ignorance...
Edit: forgive me if I include useless stuff; I don't really know what's important. I also use the uncertainties library, if that's of any use.
column_names = ['id', 'pi', 's_pi', 'v_r', 's_v', 'dis', 'X',
                'ra_h', 'ra_m', 'ra_s', 'dec_d', 'dec_m',
                'dec_s', 'ma', 's_ma', 'md', 's_md']
data = pd.read_csv("hyades_data.dat", skiprows=2, sep='\s+',
                   names=column_names)
# calculations with all stars
v_r = unumpy.uarray(data['v_r'], data['s_v'])
ma = unumpy.uarray(data['ma'], data['s_ma'])
md = unumpy.uarray(data['md'], data['s_md'])
mi = unumpy.sqrt(ma**2+md**2)
r_m = v_r*unumpy.tan(th)/(4.74*mi/1000)
diff = np.abs(r_pc - r_m)
'''
if np.abs(dist - 46.43) <= 10:
    r_m = True
else:
    r_m = False
at this point I want to make the distinction
'''
mean_diff = diff.mean()
print("Mean : ")
print(mean_diff)
print(a_ref,d_ref)
df_va=pd.DataFrame(v_r)
df_mi = pd.DataFrame(mi)
df_rm = pd.DataFrame(r_m)
df_rpc = pd.DataFrame(r_pc)
df_diff = pd.DataFrame(diff)
#df_mean_diff = pd.DataFrame(mean_diff)
ve = v_r*np.tan(th)
output = pd.concat([data['id'], ra, dec, th_d, df_mi, df_rm, df_rpc,
                    df_diff, df_va], axis=1)
output.columns = ['id', 'ra', 'dec', 'th_d', 'mi', 'r_m', 'r_pc',
                  'dist_diff', 'va']
output.to_csv('results.csv', index=False)
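A minimal sketch of the masking idea, assuming the measured distance is data['dis'] and 46.43 pc is the reference distance (both taken from the code and pseudocode above). A boolean mask keeps every star in the output but blanks the excluded ones with NaN and leaves them out of the mean:

import numpy as np
import pandas as pd

# True for stars within +-10 pc of the reference distance, False otherwise
keep = (np.abs(data['dis'] - 46.43) <= 10).to_numpy()

# use only the "good" stars in the summary statistic
mean_diff = diff[keep].mean()

# keep every star in the output, but blank the excluded rows
df_diff = pd.Series(diff).where(keep)  # NaN where keep is False

The same where() trick can be applied to any other per-star column (df_rm, df_va, ...) that should be blank for the excluded stars.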

PySpark filtering gives inconsistent behavior

So I have a dataset where I do some transformations, and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
    df = getattr(self, name)
    df = df.where(df[frequency_col] >= threshold)
    setattr(self, name, df)
    return self
The problem is a very strange behavior: if I use a rather high threshold like 10, it works fine, filtering out all the rows below 10. But if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that before this step I perform some other transformations, and I've noticed weird behavior in Spark in the past: sometimes doing a very simple thing like this after a join or a union gives very strange results, and eventually the only solution is to write out the data, read it back in, and do the operation in a completely separate script. I hope there is a better solution than this!
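A hedged sketch of two things worth checking, both assumptions rather than a confirmed diagnosis: first, that frequency was inferred as a string somewhere upstream, so the comparison is not the numeric one you expect; second, that re-materializing the DataFrame in-process breaks whatever stale plan the join/union left behind, without the round trip to disk.

from pyspark.sql import functions as F

# (1) force a numeric comparison regardless of the inferred column type
df = df.withColumn('frequency', F.col('frequency').cast('int'))
df = df.where(F.col('frequency') >= 1)

# (2) truncate the lineage in-process instead of writing out and reading back in
df = df.localCheckpoint()

If the cast alone fixes the threshold=1 case, the column type was the culprit; if only localCheckpoint() helps, the problem lies in the upstream plan.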

Filtering Data From Scraped Tweets Using rtweet Package

meta_mueller <- search_tweets("mueller", n = 250000, retryonratelimit = TRUE)
Within the dataframe is a column "geo_coords". Upon visual scan, a majority are c(NA, NA).
I have dplyr installed (other packages are fine, too), and I want to identify any rows that do not equal c(NA, NA).
filter(!is.na(meta_mueller(geo_coords))
This did not work.
Solution:
meta_mueller_location = select(meta_mueller, place_full_name)
meta_mueller_location_filter = filter(meta_mueller_location,
                                      place_full_name != "NA")
Instead of geo_coords, I used the command on the "place_full_name" column, which contains plain NA rather than c(NA, NA). This was a better solution for my needs.
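Two hedged notes on the solution above. First, if place_full_name contains real missing values rather than the literal string "NA", the test != "NA" will not drop them; is.na() is the safer check. Second, the original geo_coords idea can be expressed with purrr, assuming geo_coords is a list column of length-2 coordinate vectors as rtweet returns:

library(dplyr)
library(purrr)

# rows where place_full_name is genuinely non-missing
meta_mueller_location_filter <- meta_mueller %>%
  filter(!is.na(place_full_name))

# rows whose geo_coords entry is not entirely NA, i.e. not c(NA, NA)
meta_mueller_geo <- meta_mueller %>%
  filter(map_lgl(geo_coords, ~ !all(is.na(.x))))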