I have count matrices from the Rhapsody platform that I turn into a SingleCellExperiment object using the SingleCellExperiment() function.
I have multiple samples spread over 2 batches that I'm merging with the scMerge package (without batch correction).
When merging samples from the same batch, it only keeps the genes that are present in every single (non-merged) dataset, which makes me drop from 25k to 10k unique genes.
Is there a way to circumvent this issue? Or do you think it would not affect downstream analysis, since these genes will be dropped anyway after merging the two batches with Harmony?
The code I used for merging is the following:
sce_list_batch1 <- list(S1, S2, S3, S4, S5, S6)
sce_batch1 <- sce_cbind(sce_list_batch1, method = "intersect", exprs = c("counts"), colData_names = TRUE)
So I noticed it applies a per-batch gene filter by default. I have now added cut_off_batch = 0 and it now includes all the genes.
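For reference, a minimal sketch of the adjusted call (same objects as above; cut_off_batch is the per-batch expression cut-off of scMerge::sce_cbind, so 0 disables that filter):

library(scMerge)

# Same merge as above, but with the per-batch gene filter disabled,
# so all genes shared by the samples are kept.
sce_batch1 <- sce_cbind(sce_list_batch1,
                        method        = "intersect",
                        exprs         = c("counts"),
                        cut_off_batch = 0,
                        colData_names = TRUE)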
First time asking a question on here, and really hoping to get some help. I don't believe this question has been asked yet.
I have a dataframe of 7,000,000+ observations, each with 140 variables. I am trying to filter the data down to a smaller cohort using a set of multiple criteria, any of which would allow for inclusion in the smaller, filtered dataset.
I have tried two methods to search my data:
The first strategy uses filter_all() across all variables and searches for my inclusion criteria:
filteredData <- filter_all(rawData, any_vars(. %in% c(criteria1, criteria2, criteria3)))
The second strategy uses which() with a series of conditions, also trying to identify every row that contains one of my criteria:
filteredData <- rawData[which(rawData$criteria1 == "criteria" | rawData$criteria2 == "criteria" | rawData$criteria3 == "criteria"),]
Both approaches accurately pull one or two rows meeting these criteria; however, I don't believe all 7,000,000 rows are being searched. I added a row label to my rawData set and saw that the function successfully pulled row #60,192. I am expecting hundreds of rows in the final result and am very confused why only a couple of rows from early in the dataframe are identified.
My questions:
Do the filter_all() and which() functions have size limits after which they stop searching?
Does anyone have a suggestion on how to filter/search based on multiple criteria on a very large dataset?
Thank you!
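For reference, the same "match in any column" filter can also be written with dplyr's newer if_any() helper; a minimal sketch using the same hypothetical criteria objects as in the question (this is just an alternative spelling of the first strategy, not an explanation for the missing rows):

library(dplyr)

# keep a row if any column matches one of the criteria values
filteredData <- rawData %>%
  filter(if_any(everything(), ~ .x %in% c(criteria1, criteria2, criteria3)))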
I'm just getting started with Spark; I've previously used Python with pandas. One of the common things I do very regularly is compare datasets to see which columns have differences. In Python/pandas this looks something like this:
merged = df1.merge(df2,on="by_col")
for col in cols:
    diff = merged[col+"_x"] != merged[col+"_y"]
    if diff.sum() > 0:
        print(f"{col} has {diff.sum()} diffs")
I'm simplifying this a bit, but this is the gist of it, and of course after this I'd drill down and look at, for example:
col = "col_to_compare"
diff = merged[col+"_x"] != merged[col+"_y"]
print(merged[diff][[col+"_x",col+"_y"]])
Now in Spark/Scala this is turning out to be extremely inefficient. The same logic works, but this dataset is roughly 300 columns wide, and the following code takes about 45 minutes to run for a 20 MB dataset, because it submits 300 different Spark jobs in sequence, not in parallel, so I seem to be paying the startup cost of a Spark job 300 times. For reference, the pandas version takes something like 300 ms.
for (col <- cols) {
  val cnt = merged.filter(merged("dev_" + col) <=> merged("prod_" + col)).count
  if (cnt != merged.count) {
    println(col + " = " + cnt + "/ " + merged.count)
  }
}
What's the faster, more Spark-like way of doing this type of thing? My understanding is that I want this to be a single Spark job that creates one plan. I was looking at transposing to a very tall dataset, and while that could potentially work, it ends up being very complicated and the code is not straightforward at all. Also, although this example fits in memory, I'd like to be able to use this function across datasets, and we have a few that are multiple terabytes, so it needs to scale for large datasets as well, whereas with Python/pandas that would be a pain.
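One possible direction, sketched under the assumptions of the question (merged and cols exist, columns are prefixed "dev_"/"prod_"): build one aggregate expression per column and evaluate them all in a single agg, so Spark creates one plan and runs one job instead of 300 sequential count jobs. Note this counts mismatches directly rather than comparing match counts to the total.

import org.apache.spark.sql.functions.{sum, when}

// one aggregate expression per compared column: 1 for a mismatch, 0 otherwise
val diffExprs = cols.map(c =>
  sum(when(merged("dev_" + c) <=> merged("prod_" + c), 0).otherwise(1)).alias(c))

// a single action, so all comparisons run in one job
val counts = merged.agg(diffExprs.head, diffExprs.tail: _*).collect()(0)

cols.zipWithIndex.foreach { case (c, i) =>
  if (counts.getLong(i) > 0) println(s"$c has ${counts.getLong(i)} diffs")
}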
I have written a class which performs standard scaling over grouped data.
class Scaler:
.
.
.
.
    def __transformOne__(self, df_with_stats, newName, colName):
        return df_with_stats\
            .withColumn(newName,
                        (F.col(colName)-F.col(f'avg({colName})'))/(F.col(f'stddev_samp({colName})')+self.tol))\
            .drop(colName)\
            .withColumnRenamed(newName, colName)

    def transform(self, df):
        df_with_stats = df.join(....)  # calculate stats here by doing a groupby and then do a join
        return reduce(lambda df_with_stats, kv: self.__transformOne__(df_with_stats, *kv),
                      self.__tempNames__(), df_with_stats)[df.columns]
The idea is to save the means and variances in columns and simply do a column subtraction/division on the column I want to scale. This part is done in the function __transformOne__. So basically it's an arithmetic operation on one column.
If I want to scale multiple columns, I just call __transformOne__ multiple times, a bit more efficiently using functools.reduce (see the transform function). The class works fast enough for a single column, but when I have multiple columns it takes too much time.
I have no idea about the internals of Spark, so I'm a complete newbie. Is there a way I can improve this computation over multiple columns?
My solution makes a lot of calls to the withColumn function. Hence I changed the solution to use select instead of withColumn. There is a substantial difference in the physical plans of the two approaches. For my application I improved from 15 minutes to 2 minutes by using select. More information about this is in this SO post.
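For illustration, a rough sketch of what a select-based transform can look like (assumptions: columns_to_scale is a hypothetical stand-in for the column list that __tempNames__() drives, and the stats join is elided as in the question):

from pyspark.sql import functions as F

class Scaler:
    # ... same setup as above ...

    def transform(self, df):
        df_with_stats = df.join(...)  # same group-by + join that adds avg(col)/stddev_samp(col) columns
        exprs = [
            ((F.col(c) - F.col(f'avg({c})')) / (F.col(f'stddev_samp({c})') + self.tol)).alias(c)
            if c in self.columns_to_scale   # hypothetical attribute listing the columns to scale
            else F.col(c)
            for c in df.columns
        ]
        # one select builds the whole projection in a single step instead of
        # stacking one withColumn/drop/rename projection per column
        return df_with_stats.select(*exprs)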
I have a simple query which does a GROUP BY on two fields:
#facturas =
    SELECT a.CodFactura,
           Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
           SUM(a.Consumo) AS Consumo
    FROM #table_facturas AS a
    GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4,100 rows, but the query takes several minutes to finish. Looking at the graph explorer, I see it uses 2,500 vertices because I have 2,500 unique CodFactura+DateKey rows. I don't know if this is normal ADLA behaviour. Is there any way to reduce the number of vertices and execute this query faster?
First: I am not sure your query actually will compile. You would need the Convert expression in your GROUP BY or do it in a previous SELECT statement.
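For illustration, a sketch of the second option, computing DateKey in a previous SELECT and keeping the question's naming (#facturas_pre is a hypothetical intermediate rowset):

#facturas_pre =
    SELECT a.CodFactura,
           Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
           a.Consumo
    FROM #table_facturas AS a;

#facturas =
    SELECT CodFactura,
           DateKey,
           SUM(Consumo) AS Consumo
    FROM #facturas_pre
    GROUP BY CodFactura, DateKey;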
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL table, your table is over-partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns, and you either have a predicate that reduces the number of rows per partition and/or you do not have up-to-date statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a large number of individual files. This will "scale out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a file set, you may have too many small files in the input. See if you can merge them into fewer, larger files.
If the above does not help, you can also try to hint a small number of rows in the query that creates #table_facturas by adding OPTION(ROWCOUNT=4000).
I have written a small model in MATLAB. This model analyses several supply nodes to meet the required amount of demand in a demand node. Each supply node is specified as a vector that gives the available supply for each timestep.
To meet the demand, the supply nodes are checked one after another to see whether they can meet the (remaining) demand, and the fluxes from the supply nodes to the demand node are updated accordingly. This analysis currently uses a fixed order, which is hard-coded in the script. In pseudocode:
for timestep = 1:end
    if demand(timestep) > supply_1(timestep)
        supply_1_demand(timestep) = supply_1(timestep)
    else
        supply_1_demand(timestep) = demand(timestep)
    end
    if remaining_demand(timestep) > supply_2(timestep)
        supply_2_demand(timestep) = supply_2(timestep)
    else
        supply_2_demand(timestep) = remaining_demand(timestep)
    end
    % etcetera, etcetera
end
However, the order in which the supply nodes are analysed must be variable. I would like to read this order from a table, where the order of analysis is given by the order in which the nodes are listed in the table. The table can look like this:
1 supply_4
2 supply_1
3 supply_5
# etcetera
Is there a way of reading variable names from such a table? Preferably this would be without using eval, as I've heard it is very slow, and the model will be extended to quite a lot of nodes and fluxes.
Maybe you can use structures:
varNames={'supp_1','supp_2','supp_3'};
supply.(varNames{1}) = 3; %%% set a variable by name
display(supply.(varNames{1})) %%% get value by name
ans =
3
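Building on that, a hedged sketch of how the table-driven order could be used in the loop (hypothetical names: the supply vectors are stored as fields of one struct supply, the resulting fluxes in a struct flux, and order holds the node names as read from the table):

order = {'supply_4','supply_1','supply_5'};   % e.g. the second column of the table

for timestep = 1:numel(demand)
    remaining = demand(timestep);
    for k = 1:numel(order)
        name = order{k};
        taken = min(remaining, supply.(name)(timestep));  % dynamic field access, no eval
        flux.(name)(timestep) = taken;
        remaining = remaining - taken;
    end
end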