How to filter out a hemisphere of Allen Brain Atlas data with the BSB? - allen-sdk

I would like to reconstruct a single (i.e., left) hemisphere, so that it takes fewer resources to plot and less time to compile and eventually simulate. I have configured an AllenStructureLoader and I use it in my PlacementStrategy. Can I tell the PlacementStrategy to place cells in only 1 of the 2 hemispheres? Would such a filter be used in the connectivity as well?

The AllenStructureLoader loads entire Allen structures, and no filtering is available between voxels that belong to the same structure ID.
For now your best bet would be to subclass the AllenStructureLoader and to override its get_voxelset method. I'm not sure if the Allen Brain Atlas provides hemisphere metadata to do such a filter, but the brain is rather symmetrical, so you may just get away with filtering out the half-width of the total region:
class HemisphereLoader(AllenStructureLoader):
    def get_voxelset(self):
        vs = super().get_voxelset()
        # Take out the voxels of `vs` that you're interested in
        return vs
Alternatively you could use the AllenStructureLoader, or the Allen SDK in a script, to load both hemispheres, export them to NRRD, filter the NRRD using your favorite tools, save that file, and load your preprocessed NRRD file with an NrrdLoader (a small sketch of the filtering step follows the config below):
"partitions": {
"hemi": {
"type": "nrrd",
"source": "my_file.nrrd"
}
}
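If you take that preprocessing route, a minimal sketch of the filtering step with the pynrrd package could look like the following; the file names and the assumption that the last array axis is the medial-lateral one are mine, so check them against your exported volume:

import nrrd

# Load the volume exported for your structure (file name is an example).
data, header = nrrd.read("ccf_structure.nrrd")

# Assuming the last axis runs left-right: zero out one half to keep a single
# hemisphere. The annotation marks voxels by structure ID, so zeroed voxels
# simply drop out of the partition.
midline = data.shape[-1] // 2
data[..., midline:] = 0

# Save the filtered volume and point the NrrdLoader's "source" at it.
nrrd.write("my_file.nrrd", data, header)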

Related

Is it possible to merge raster bands from several folders using GDAL?

I have two folders, each containing about 15 000 .tif files. Each file in the first folder is a raster with 5 bands, named AA_"number", so the files look like
AA_1.tif,
AA_2.tif,
...,
AA_15000.tif.
Each file in the second folder is a raster with 2 bands, named BB_"number", and looks like
BB_1.tif,
BB_2.tif,
...,
BB_15000.tif.
My goal is to combine bands 1-3 from the first file in folder AA with band 1 from the first file in folder BB to create a 4-band raster, and to do this for all pairs, producing 15 000 4-band rasters. After doing some research and testing things out in QGIS, I believe the GDAL Merge tool could solve this task, but I have not been able to make it find the right files in the different folders. And as I have 2 x 15 000 files, it is not possible to do this selection manually. Does anyone know a smart solution to this, preferably using GDAL or QGIS?
There are many ways to do this, and it really depends on the exact use case, like the type of analysis/visualization that needs to be done on the result.
With this many files, it could for example be nice to merge them using a VRT. That avoids creating redundant data, but whether it's actually the best solution depends on your workflow. Just stacking them into a new TIFF file would of course also work.
Unfortunately, creating such a VRT with gdalbuildvrt / gdal.BuildVRT is not possible for multi-band inputs.
If your inputs are homogeneous in terms of properties, it should be fairly simple to set up a template where you fill in the file locations and write the VRT to disk. For inputs with heterogeneous properties it might still be possible, but you'll have to be careful to take everything into account.
Conceptually such a VRT would look something like:
<VRTDataset rasterXSize="..." rasterYSize="...">
  <SRS>...</SRS>
  <GeoTransform>....</GeoTransform>
  <VRTRasterBand dataType="..." band="1">
    <ComplexSource>
      <SourceFilename relativeToVRT="0">//some_drive/aa_folder/aa_file1.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      ...
    </ComplexSource>
  </VRTRasterBand>
  <VRTRasterBand dataType="..." band="2">
    <ComplexSource>
      <SourceFilename relativeToVRT="0">//some_drive/aa_folder/aa_file1.tif</SourceFilename>
      <SourceBand>2</SourceBand>
      ...
    </ComplexSource>
  </VRTRasterBand>
  <VRTRasterBand dataType="..." band="3">
    <ComplexSource>
      <SourceFilename relativeToVRT="0">//some_drive/aa_folder/aa_file1.tif</SourceFilename>
      <SourceBand>3</SourceBand>
      ...
    </ComplexSource>
  </VRTRasterBand>
  <VRTRasterBand dataType="..." band="4">
    <ComplexSource>
      <SourceFilename relativeToVRT="0">//some_drive/bb_folder/bb_file1.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      ...
    </ComplexSource>
  </VRTRasterBand>
</VRTDataset>
You can first run gdalbuildvrt on a few of your files to find the properties that need to be filled in, like the projection, pixel dimensions, etc. gdalbuildvrt will only be able to take the first band from each input, but if all bands have homogeneous properties (like the nodata value), that should be fine as a reference.
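To illustrate the template approach, a rough Python sketch along these lines could write one 4-band VRT per AA/BB pair. The raster size, SRS, geotransform and data type below are placeholders you would copy from a reference VRT (for example one produced by gdalbuildvrt), and the folder names are assumptions based on the naming scheme in the question:

import os

# Placeholder properties: copy the real values from a reference VRT.
XSIZE, YSIZE = 1000, 1000
SRS = "EPSG:32633"
GEOTRANSFORM = "500000.0, 10.0, 0.0, 6000000.0, 0.0, -10.0"

BAND_TEMPLATE = """  <VRTRasterBand dataType="Float32" band="{band}">
    <ComplexSource>
      <SourceFilename relativeToVRT="0">{path}</SourceFilename>
      <SourceBand>{src_band}</SourceBand>
    </ComplexSource>
  </VRTRasterBand>"""

aa_dir, bb_dir, out_dir = "aa_folder", "bb_folder", "vrt_out"
os.makedirs(out_dir, exist_ok=True)

for i in range(1, 15001):
    aa = os.path.join(aa_dir, "AA_%d.tif" % i)
    bb = os.path.join(bb_dir, "BB_%d.tif" % i)
    # Bands 1-3 come from the AA file, band 4 from band 1 of the BB file.
    bands = [BAND_TEMPLATE.format(band=b, path=aa, src_band=b) for b in (1, 2, 3)]
    bands.append(BAND_TEMPLATE.format(band=4, path=bb, src_band=1))
    vrt = ('<VRTDataset rasterXSize="%d" rasterYSize="%d">\n' % (XSIZE, YSIZE)
           + "  <SRS>%s</SRS>\n" % SRS
           + "  <GeoTransform>%s</GeoTransform>\n" % GEOTRANSFORM
           + "\n".join(bands)
           + "\n</VRTDataset>\n")
    with open(os.path.join(out_dir, "stack_%d.vrt" % i), "w") as f:
        f.write(vrt)

Each resulting .vrt can be used directly in QGIS or GDAL, or converted to a real 4-band GeoTIFF with gdal_translate if you prefer materialised files.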

Open Street Map tag to grab all emerged land

I'm looking to download the geometries of all emerged land (everything within the coastline) in Python using OSMNX, but I can't seem to find a general tag that would do it.
Right now, I'm using:
t = {'landuse':['commercial', 'industrial', 'residential', 'farmland', 'construction', 'education', 'retail', 'cemetery', 'grass', 'garages', 'depot', 'port', 'railway', 'recreation_ground', 'religious', 'yes', '*'], 'leisure':['park']}
land = ox.geometries_from_polygon(bbox, tags=t)
But I still have many holes...
So, in short, is there an OSM tag to grab all emerged land?
The additive approach, i.e. combining all sorts of landuses, won't get you all the way to the result you want. As you've noticed, you'll end up with white spots. You could get closer by considering even more tags, such as some values of the natural=* key, but ultimately there simply is land that is not covered by any such polygon in OSM.
Instead, you should look at OSM coastline data. As this can be tricky to process, you might want to get pre-processed data from osmdata.openstreetmap.de, such as their land polygons.
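If it helps, here is a small sketch of how the pre-processed land polygons could be combined with your area of interest; it assumes you have already downloaded and unpacked the land-polygons archive from osmdata.openstreetmap.de (the shapefile path is illustrative) and that bbox is the shapely polygon (in EPSG:4326) you already pass to OSMNX:

import geopandas as gpd

# Path to the unpacked land-polygons download (illustrative name).
land = gpd.read_file("land-polygons-split-4326/land_polygons.shp")

# Clip the global land polygons to the area of interest.
emerged_land = gpd.clip(land, bbox)

emerged_land.to_file("emerged_land.gpkg", driver="GPKG")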

Why does a fastText test of a model return only 1 example when my test file contains 135?

I'm trying to test the model (model.bin) I've made with fastText on a test file (test.txt). The test file contains 135 labelled examples. I'm expecting fastText to test my model on that number of examples, but instead it only tests it on 1. Where does this problem come from?
I've already done the same thing with another model and another test file, and it all worked nicely.
This is how I test my model: model_baby.bin is the model, and test.data.txt is my test file.
./fasttext test model_baby.bin test.data.txt
N 1
P@1 1
R@1 0.0164
Number of examples: 1
And here is an extract from my testing file
__label__4.0 I love the fact you can hide your stuff. Only down is that the straps to hold it at midpoint and bottom could be better designed for your car. It's got plenty of room which is great. __label__5.0 This hid our ipad wonderfully. Especially for those quick stops where we all had jump out and use the restroom. It zipped, folded and held all our stuff for the kids in the back seat. __label__3.0
As I have more than 1 labelled example in my test file, I expect the output "Number of examples:" to be more than 1, but the actual value is 1.
From the official documentation (https://fasttext.cc/docs/en/supervised-tutorial.html): Each line of the text file contains a list of labels, followed by the corresponding document. All the labels start by the __label__ prefix, which is how fastText recognize what is a label or what is a word.
I don't quite understand your extract. I think it should be like this:
__label__4.0 I love the fact you can hide your stuff. Only down is that the straps to hold it at midpoint and bottom could be better designed for your car. It's got plenty of room which is great.
__label__5.0 This hid our ipad wonderfully. Especially for those quick stops where we all had jump out and use the restroom. It zipped, folded and held all our stuff for the kids in the back seat.
__label__3.0 ...
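If your test file really does have several examples run together on a single line, as the extract suggests, a small hedged sketch like the following (using your file names, and assuming every example starts with a __label__ token) would put each example on its own line before re-running ./fasttext test:

import re

# Read the original test file and start a new line before every __label__
# token (except the first), so that each example sits on its own line.
with open("test.data.txt") as f:
    text = f.read().strip()

fixed = re.sub(r"\s+(?=__label__)", "\n", text)

with open("test.fixed.txt", "w") as f:
    f.write(fixed + "\n")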

Get more of the metadata from the Neurotransmitter study using Allen SDK

I am downloading all the images from the Neurotransmitter study of the Allen Brain Atlas using this script:
from allensdk.api.queries.image_download_api import ImageDownloadApi
from allensdk.config.manifest import Manifest
import pandas as pd
import os

# an instance of the Image API for querying and downloading
image_api = ImageDownloadApi()

# getting the neurotransmitter study
# product id from http://api.brain-map.org/api/v2/data/query.json?criteria=model::Product
nt_datasets = image_api.get_section_data_sets_by_product([27])
df_nt = pd.DataFrame(nt_datasets)

downsample = 4  # downsampling factor for the downloaded images (pick your own)

for index, row in df_nt.iterrows():
    # get section dataset id
    section_dataset_id = row['id']
    # each section dataset id has multiple image sections
    section_images = pd.DataFrame(
        image_api.section_image_query(
            section_data_set_id=section_dataset_id)
    )
    for section_image_id in section_images['id'].tolist():
        file_path = os.path.join('/path/to/save/dir/',
                                 str(section_image_id) + '.jpg')
        Manifest.safe_make_parent_dirs(file_path)
        image_api.download_section_image(section_image_id,
                                         file_path=file_path,
                                         downsample=downsample)
This script presumably downloads all the available ISH experiments. However, I am wondering what the best way would be to get more of the metadata, as follows:
1) The type of ISH experiment, known as "gene" (for example, whether an image is MBP-stained, Nissl-stained, etc.). Shown in the red circle below.
2) The structure and correspondence to the atlas image (annotations, for example to see which part of the brain a section belongs to). I think this could be acquired with tree_search, but I'm not sure how. Shown in the red circles below, from two different webpages on the Allen website.
3) The scale of the image, for example how big one pixel is in the downloaded image (e.g., 0.001 x 0.001 mm). I would require this for image analysis with respect to MRI, for example. Shown below in the red circle.
All of the above information is available somewhere on the website; my question is whether you could help me retrieve it programmatically via the SDK.
EDIT:
It would also be great to download the "Nissl" stains programmatically, as they do not show up with the above loop. The picture is shown below.
To access this information, you'll need to formulate a somewhat complex API query.
from allensdk.api.queries.rma_api import RmaApi

api = RmaApi()

data_set_id = 146586401

data = api.model_query('SectionDataSet',
                       criteria='[id$eq%d]' % data_set_id,
                       include='section_images,treatments,probes(gene),specimen(structure)')

print("gene symbol: %s" % data[0]['probes'][0]['gene']['acronym'])
print("treatment name: %s" % data[0]['treatments'][0]['name'])
print("specimen location: %s" % data[0]['specimen']['structure']['name'])
print("section xy resolution: %f um" % data[0]['section_images'][0]['resolution'])
gene symbol: MBP
treatment name: ISH
specimen location: Cingulate Cortex
section xy resolution: 1.008000 um
Without doing a deep dive on the API data model, SectionDataSets have constituent SectionImages, Treatments, Probes, and source Specimens. Probes target Genes, and Specimens can be associated with a Structure. The query is downloading all of that information for a single SectionDataSet into a nested dictionary.
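If you want those fields for every experiment your download loop touches, a hedged sketch along these lines could collect them into a table; it reuses the query above and assumes nt_datasets is the list returned by get_section_data_sets_by_product([27]) in your script (some data sets, e.g. Nissl reference sections, have no probe, hence the guards):

import pandas as pd
from allensdk.api.queries.rma_api import RmaApi

api = RmaApi()

rows = []
for ds in nt_datasets:
    rec = api.model_query(
        'SectionDataSet',
        criteria='[id$eq%d]' % ds['id'],
        include='section_images,treatments,probes(gene),specimen(structure)')[0]
    rows.append({
        'data_set_id': ds['id'],
        'gene': rec['probes'][0]['gene']['acronym'] if rec.get('probes') else None,
        'treatment': rec['treatments'][0]['name'] if rec.get('treatments') else None,
        'structure': ((rec.get('specimen') or {}).get('structure') or {}).get('name'),
        'resolution_um': rec['section_images'][0]['resolution'],
    })

metadata = pd.DataFrame(rows)
print(metadata.head())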
I don't remember how to find the specimen block extent. I'll update the answer if I find it.

Find all paths starting from source node with Perl

First I would like to clarify that I have very little experience with Graph Theory and the proper algorithms to parse a directed graph, and that I've searched here on SO but didn't quite find what I was looking for. Hopefully you guys can help me :)
I have a large directed graph (around 3000 nodes) that has several subgraphs made out of connected nodes, and the subgraphs are not connected to each other. Here is a small representative graph of the data I have here:
I am writing a Perl script to find all possible paths starting from each source node to the sink nodes and store them in an array of arrays. So, for this graph, the possible paths would be:
1,2,3,4,5,6
1,2,3,4,5,7
1,8,9,10
11,12,13
11,14
15,16,17
The way I've done this search in my script was to use the Graph module in the following steps:
Find all source nodes in the graph and store them in an array
Find all sink nodes in the graph and store them in an array
Find all-pairs shortest paths with the Floyd-Warshall algorithm
Search the APSP Floyd-Warshall graph object to see whether a path exists between a source node and a sink node. If there is a path, store it in the array of arrays. If there isn't, do nothing.
Here is the part of my script that does it:
# Getting all source and sink nodes in the graph:
my @source_nodes = $dot_graph->source_vertices();
my @sink_nodes = $dot_graph->sink_vertices();

# Getting all possible paths from source to sink nodes in the graph:
print "Calculating all possible overlaps in graph\n";
my $all_possible_paths = $dot_graph->APSP_Floyd_Warshall();
print "Done\n";

# print "Extending overlapping contigs\n";
my @all_paths;
foreach my $source (@source_nodes) {
    foreach my $sink (@sink_nodes) {
        my @path_vertices = $all_possible_paths->path_vertices($source, $sink);
        my $path_length = $all_possible_paths->path_length($source, $sink);
        # Saving only the paths that actually exist:
        push(@all_paths, \@path_vertices) unless (!@path_vertices);
    }
}
The problem is that this works fine for small graphs, but now that I have 3000 nodes it would take a very, very long time to finish (assuming each path takes 1 ms to be found, it would take 312.5 days to go through all of them). I know that using the Floyd-Warshall algorithm to find all possible paths in the graph, only to then extract the paths between sources and sinks, is not efficient, but when I wrote the script I needed the results as soon as possible and my graphs were a lot smaller.
My question is: how can I find all the paths starting from each source in the graph that end in a sink node, without finding all possible paths first? Is that what is called a breadth-first or a depth-first search? How can I implement that in Perl (and, if possible, with the Graph module)? Any help would be awesome!
P.S.: In order to make the program run faster, I started trying to break the initial big graph into its subgraphs and run the original script on each, forking the main loop that searches for the paths between sources and sinks with Parallel::ForkManager. What do you guys think of that approach?
You're not interested in finding the shortest path, so forget about all those shortest path algorithms.
You're interested in finding all paths. This is called tree traversal, and it can be performed depth-first or breadth-first. Since you're traversing the entire tree, it doesn't really matter which approach is taken. The following performs a depth-first search using recursion:
sub build_paths {
    my ($graph) = @_;
    my @paths;

    # Recursive depth-first helper: @_ holds the path walked so far,
    # with the current vertex last.
    local *_helper = sub {
        my $v = $_[-1];
        my @successors = $graph->successors($v);
        if (@successors) {
            _helper(@_, $_)
                for @successors;
        } else {
            # Reached a sink: save a copy of the full path.
            push @paths, [ @_ ];
        }
    };

    _helper($_)
        for $graph->source_vertices();

    return \@paths;
}
die if $graph->has_a_cycle;
my $paths = build_paths($graph);
Yes, it would be possible to parallelize the work, but I'm not writing that for you.
What concerns me the most is memory. Depending on the number of branches in the paths, you could easily end up running out of memory. Note that storing the paths as strings (of comma-separated values) would take less memory than storing them as arrays.