I see COCO 2017 has 80 classes, with 118k training and 5k validation images (~123k images in total).
I have a question here: does each class then have roughly the same number of images, i.e. about 123k / 80 ≈ 1,540 images per class?
The COCO dataset is not evenly distributed; the classes do not all have the same number of images. So let me show you a way to find the number of images in any class you wish.
I am using the COCO API (pycocotools) to work with the dataset. Let's find the number of images in the 'person' class. Here is a code gist to filter out any class from the COCO dataset:
from pycocotools.coco import COCO

# Load the instance annotations (the path shown is an example; point it at your copy)
coco = COCO('annotations/instances_train2017.json')

# Define the class (out of the 80 COCO classes)
filterClasses = ['person']
# Fetch the category IDs corresponding to filterClasses
catIds = coco.getCatIds(catNms=filterClasses)
# Get all images containing the above category IDs
imgIds = coco.getImgIds(catIds=catIds)
print("Number of images containing the class:", len(imgIds))
There, we get the number of images corresponding to 'person' in the dataset!
I have recently written an entire post on exploring and manipulating the COCO dataset. Do have a look to get more details and the entire code.
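As a side note, the same counting can be repeated for every category to see the imbalance directly. The sketch below uses a toy, hand-made annotation list (not real COCO data) just to illustrate the counting logic behind getImgIds:

```python
# Toy COCO-style annotations: each entry links one image to one category
annotations = [
    {"image_id": 1, "category_id": 1},   # e.g. 'person' in image 1
    {"image_id": 2, "category_id": 1},   # 'person' in image 2
    {"image_id": 2, "category_id": 18},  # 'dog' also appears in image 2
    {"image_id": 3, "category_id": 18},  # 'dog' in image 3
    {"image_id": 4, "category_id": 1},   # 'person' in image 4
]

# Count distinct images per category, as getImgIds(catIds=[...]) would
images_per_class = {}
for ann in annotations:
    images_per_class.setdefault(ann["category_id"], set()).add(ann["image_id"])

counts = {cat: len(imgs) for cat, imgs in images_per_class.items()}
print(counts)  # {1: 3, 18: 2} -- the classes are not evenly represented
```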
First time asking a question on here, and I'm really hoping to get some help. I don't believe this question has been asked yet.
I have a dataframe of 7,000,000+ observations, each with 140 variables. I am trying to filter the data down to a smaller cohort using a set of multiple criteria, any of which would qualify a row for inclusion in the smaller, filtered dataset.
I have tried two methods to search my data:
The first strategy uses filter_all() across all variables and searches for my inclusion criteria:
filteredData <- filter_all(rawData, any_vars(. %in% c(criteria1, criteria2, criteria3)))
The second strategy uses a series of which() calls, also trying to identify every row that contains one of my criteria:
filteredData <- rawData[which(rawData$criteria1 == "criteria" | rawData$criteria2 == "criteria" | rawData$criteria3 == "criteria"), ]
Both approaches accurately pull one or two rows meeting the criteria; however, I don't believe all 7,000,000 rows are being searched. I added a row label to my rawData set and saw that the function successfully pulled row #60,192. I am expecting hundreds of rows in the final result and am very confused about why only a couple from early in the dataframe are identified.
My questions:
Do the filter_all() and which() functions have size limits that they stop searching after?
Does anyone have a suggestion on how to filter/search based on multiple criteria on a very large dataset?
Thank you!
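This is an R question, but the any-column-matches logic both strategies aim for can be sketched in plain Python with made-up data, to make the expected behavior concrete: a row is kept if any of its values matches any criterion.

```python
criteria = {"criteria1", "criteria2"}

# Hypothetical rows standing in for the 7M-observation dataframe
rows = [
    {"v1": "criteria1", "v2": "x", "v3": "y"},  # kept: v1 matches
    {"v1": "x",         "v2": "x", "v3": "y"},  # dropped: no match
    {"v1": "x", "v2": "criteria2", "v3": "y"},  # kept: v2 matches
]

# Keep a row if ANY of its values is one of the criteria
filtered = [row for row in rows if any(v in criteria for v in row.values())]
print(len(filtered))  # 2
```

If a correct reference implementation like this disagrees with what filter_all() returns on the real data, the problem is likely in the criteria values (e.g. type or whitespace mismatches), not in a size limit.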
I have count matrices from the Rhapsody platform that I turn into a SingleCellExperiment object using the SingleCellExperiment function.
I have multiple samples running over 2 batches that I'm merging using the scMerge function (without correction).
When merging samples from the same dataset, it keeps only the genes present in every single (non-merged) dataset, which makes me drop from 25k to 10k unique genes.
Is there a way to circumvent this issue? Or do you think it would not affect downstream analysis, since these genes would be dropped anyway after merging the two batches with Harmony?
The code I used for merging is the following:
sce_list_batch1 <- list((S1), (S2), (S3), (S4), (S5), (S6))
sce_batch1<- sce_cbind(sce_list_batch1, method = "intersect", exprs = c("counts"), colData_names = TRUE)
So I noticed sce_cbind applies a per-batch gene filter by default. I have now added cut_off_batch = 0, and it now includes all the genes.
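The intersect-versus-union behavior behind that gene drop can be illustrated abstractly (toy gene sets, not real Rhapsody output): intersecting keeps only the genes shared by every sample, while a union keeps everything seen anywhere.

```python
# Toy gene sets for two samples (hypothetical)
genes_sample1 = {"CD3E", "CD8A", "MS4A1"}
genes_sample2 = {"CD3E", "CD8A", "NKG7"}

# method = "intersect": only genes present in every sample survive the merge
intersect = genes_sample1 & genes_sample2
# method = "union": every gene seen in any sample is kept
union = genes_sample1 | genes_sample2

print(sorted(intersect))  # ['CD3E', 'CD8A']
print(sorted(union))      # ['CD3E', 'CD8A', 'MS4A1', 'NKG7']
```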
I have applied a join between a file and an existing Cassandra table via joinWithCassandraTable. Now I want to apply a filter on the resulting joinCassandraRDD. Here is the code I have written for extracting the data:
val outrdd = sc.textFile("/usr/local/spark/bin/select_element/src/main/scala/file_small.txt")
  .map(Tuple1(_)) // each line is already a String; wrap it as a 1-tuple key for the join
  .joinWithCassandraTable(settings.keyspace, settings.table)
  .select("id", "listofitems")
Here "/usr/local/spark/bin/select_element/src/main/scala/file_small.txt" is a text file containing a list of ids. Now, given some elements in another list, say userlistofitems = ["jas", "yuk"], I need to check whether 'userlistofitems' is contained in the 'listofitems' column of the joinCassandraRDD.
We have around 2 million ids, and several user lists for which we have to extract data from Cassandra. We are using spark=2.4.4, scala=2.11.12, and spark-cassandra-connector-2.4.2-3-gda70746.jar.
Any help is highly appreciated.
References Used:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc
https://www.youtube.com/watch?v=UsenTP029tM
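The question's code is Scala, but the containment test itself (does 'listofitems' hold every element of 'userlistofitems'?) is easy to sketch in Python with toy rows; in Spark it would become the predicate of a .filter over the joined RDD.

```python
userlistofitems = ["jas", "yuk"]
wanted = set(userlistofitems)

# Toy stand-ins for (id, listofitems) pairs from the joined RDD
rows = [
    ("id1", ["jas", "yuk", "foo"]),  # has both -> keep
    ("id2", ["jas"]),                # missing "yuk" -> drop
    ("id3", ["yuk", "jas"]),         # order does not matter -> keep
]

# Keep ids whose listofitems contains every wanted element
matches = [rid for rid, items in rows if wanted.issubset(items)]
print(matches)  # ['id1', 'id3']
```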
I have created a cube on two dimensions in Spark using Scala. The data comes from two different dataframes, named "borrowersTable" and "loansTable". They were registered with createOrReplaceTempView so that SQL queries can be run on them. The goal was to create a cube on two dimensions (gender and department), summing up the total number of book loans for a library. With the command
val cube = spark.sql("""
  select borrowersTable.department, borrowersTable.gender, count(loansTable.bibno)
  from borrowersTable, loansTable
  where borrowersTable.bid = loansTable.bid
  group by borrowersTable.gender, borrowersTable.department with cube
""")
I create the cube, which has this result:
Then using the command
cube.write.format("csv").save("file:///....../data/cube")
Spark creates a folder named cube containing 34 files named part*.csv, with columns for department, gender, and the sum of loans (one row per group-by combination).
The goal here is to create files named after the values of the first two columns (attributes), in this way: for GroupBy(Attr1, Attr2) the file should be named Attr1_Attr2.
E.g. for (Economics, M) the file should be named Economics_M; for (Mathematics, null) it should be Mathematics_null, and so on. Any help would be appreciated.
When you call df.write.format("...").save("..."), each Spark executor saves the partitions it holds into its own part* file. This is the mechanism for storing and loading big files, and you cannot change it. However, you can try one of the following alternatives, whichever works better in your case:
partitionBy:
cube
.write
.partitionBy("department", "gender")
.format("csv")
.save("file:///....../data/cube")
This will create subfolders with names like department=Physics/gender=M, still containing part* files inside. This structure can later be loaded back into Spark and used for efficient joins on the partitioned columns.
collect
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.Row

cube
  .collect()
  .foreach {
    case Row(department: String, gender: String, _) =>
      // just a simple way to write CSV; you can use any CSV lib here as well
      // note ${department}_ : without the braces, Scala would look for a variable named `department_`
      Files.write(
        Paths.get(s"${department}_$gender.csv"),
        s"$department,$gender".getBytes(StandardCharsets.UTF_8))
  }
If you call collect(), you receive your data frame's rows on the driver side as an Array[Row], and then you can do with them whatever you want. The important limitation of this approach is that your data frame must fit into the driver's memory.
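The naming rule from the question (Attr1_Attr2, with null spelled out for rolled-up dimensions) can be checked in isolation; a small Python sketch with toy cube rows:

```python
# Toy cube rows: (department, gender, loan_count); None marks a rolled-up dimension
rows = [
    ("Economics", "M", 42),
    ("Mathematics", None, 99),
]

# One file name per group-by combination, spelling rolled-up values as "null"
filenames = [
    f"{dept if dept is not None else 'null'}_{gender if gender is not None else 'null'}.csv"
    for dept, gender, _ in rows
]
print(filenames)  # ['Economics_M.csv', 'Mathematics_null.csv']
```

In the Scala version above, the same null handling would need an extra match case, since Row(department: String, ...) does not match null values.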
I would like to create some marks where the size information comes from one dataset and the color information comes from another dataset.
Is this possible?
Or can I update created marks (created with dataset 1) by using information from a second dataset?
Yes, you can do it.
You can use the lookup transform, provided there is a lookup key present in both datasets.
For example, a shared 'category' field can serve as the key that drives the lookup transform.
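A minimal sketch of what that could look like in a Vega spec (all dataset and field names here are made up for illustration): the primary dataset carries the size field, and a lookup transform copies the color field from the second dataset by matching on 'category'.

```json
{
  "data": [
    {"name": "colors", "values": [{"category": "a", "color": "steelblue"}]},
    {"name": "sizes", "values": [{"category": "a", "size": 10}]},
    {
      "name": "marksData",
      "source": "sizes",
      "transform": [
        {
          "type": "lookup",
          "from": "colors",
          "key": "category",
          "fields": ["category"],
          "values": ["color"],
          "as": ["color"]
        }
      ]
    }
  ]
}
```

The marks would then bind to the merged "marksData" dataset and read both fields, e.g. size from "size" and fill from "color".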