Where can I find data that used to be returned by allensdk.CellTypesCache.get_cells()?

Prior to allensdk version 0.14.5, the CellTypesCache.get_cells() function returned a large, nested structure containing information about cell morphology, ephys features, location, anatomical structure, tissue donors, etc. In version 0.14.5, the structure returned is flat and much smaller.
I see that some of this information is available through get_ephys_features() and get_morphology_features(), but I'm not sure where to find the rest. Where can I go to find out how to migrate my code to the new allensdk version?

Great question. We simplified the returned dictionary from CellTypesCache.get_cells for a few reasons:
1. There were a large number of fields that were variously unexplained, not useful, distracting, or redundant with data returned from other functions.
2. The way brain structures were handled made it very difficult to filter cells by cortical layer across species.
3. The query involved a large number of joins and was fairly slow.
(2) was probably the most urgent issue we needed to address. The new dictionary structure is explained in a bit more detail here:
https://github.com/AllenInstitute/AllenSDK/wiki/Release-Notes-(0.14.5)
You are correct that you should look for ephys and morphology features via CellTypesCache.get_ephys_features and CellTypesCache.get_morphology_features (or just CellTypesCache.get_all_features).
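For example, here is a minimal sketch of the migration path (the join keys, cell['id'] and 'specimen_id', are my reading of the current docs; double-check them against your version):

from allensdk.core.cell_types_cache import CellTypesCache

ctc = CellTypesCache(manifest_file='cell_types/manifest.json')

# Flat per-cell records: location, structure, donor info, etc.
cells = ctc.get_cells()

# Ephys and morphology features now live in their own tables,
# keyed by specimen id.
ephys = {f['specimen_id']: f for f in ctc.get_ephys_features()}
morph = {f['specimen_id']: f for f in ctc.get_morphology_features()}

for cell in cells:
    cell_ephys = ephys.get(cell['id'])
    cell_morph = morph.get(cell['id'])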
If there are any fields you were using in the old dictionary structure that are no longer available in the current one, let me know and we can find them again.

Related

Firestore GeoPoints in an Array

I am working with an interesting scenario that I am not sure can work, or will work well. In my current project I am trying to find an efficient way of working with geopoints in Firestore. The straightforward approach, where a document contains a geopoint field, is pretty self-explanatory and easy to query. However, I have to work with a varying number of geopoints for a single document (Article), because a specific piece of content may need to be available in more than one geographic area.
For example, an article may need to be available only in NYC, Denver, and Seattle. Using a geopoint for each location and searching by radius is, in general, a pretty standard task if I only want the article to be available in Seattle, but now it needs to be available in two more places.
The solution as I see it currently is to use an array and fill it with geopoints. The structure would look something like this:
articleText (String)
sortTime (Timestamp)
tags (Array)
- ['tagA', 'tagB', 'tagC', 'tagD']
availableLocations (Array)
- [(GeoPoint), (GeoPoint), (GeoPoint), (GeoPoint)]
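In code, creating such a document might look like this (a sketch using the Python client; the coordinates are just illustrative):

from google.cloud import firestore

db = firestore.Client()

db.collection('articles').add({
    'articleText': '...',
    'sortTime': firestore.SERVER_TIMESTAMP,
    'tags': ['tagA', 'tagB', 'tagC', 'tagD'],
    'availableLocations': [
        firestore.GeoPoint(40.7128, -74.0060),    # NYC
        firestore.GeoPoint(39.7392, -104.9903),   # Denver
        firestore.GeoPoint(47.6062, -122.3321),   # Seattle
    ],
})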
I would then perform a query to get all content within 10 miles of a specific GeoPoint, starting at a specific sortTime.
What I don't know is whether putting the geopoints in an array works well or should be avoided in favor of another data structure.
I have considered replicating an article document for each geopoint, but that does not scale very well if more than a handful of locations need defining. I've also considered creating a "reference" collection where each point is a document that contains the document ID of an article, but this leads to reading each reference document and then reading the actual document: essentially two document reads for one piece of content, which can get expensive under the Firestore pricing model and may slow things down unnecessarily.
Am I approaching this in an acceptable way? And are there other methods that can work more efficiently?

NoSQL wishlist models - struggle between references and embedded documents

I have a question about modeling wishlists using MongoDB and Mongoose. The idea is that a user needs to be able to have many different wishlists, each containing many wishes, with each wish referencing a single article.
Thinking about it, and because a wishlist belongs to a single user, I thought about using an embedded document for that.
Same for each wish being embedded in a wishlist.
So I have something like this:
// Embedded schemas must be defined before the schemas that embed them.
var WishSchema = new Schema({
...
})
var WishlistSchema = new Schema({
...
wishes: [WishSchema]
...
})
var UserSchema = new Schema({
...
wishlists: [WishlistSchema]
...
})
But my question is: what should I do with the article? Should I use a reference, or should I copy the article's data into an embedded document?
If I use an embedded document, I have an update problem: when the article's price changes, updating every wish embedding this article becomes a struggle. But accessing a wish's article is a piece of cake.
If I use a reference, the update is no longer a problem, but I have a problem when I filter wishes by their article's criteria (when I filter the wishes by price, category, etc.).
I think the second way is probably the best, but I don't know whether it's possible to build a query that filters wishes by the article's fields. I tried a lot of things using population, but nothing works very well when you need to populate depending on a nested object's field (for example, getting wishes whose article meets certain conditions).
Is this kind of query doable?
Sorry for the long question and for my bad English, but any advice would be great!
In my experience with NoSQL databases (Mongo, mainly), when designing a collection, do not think in terms of relations. Instead, think of how you would display, page, and retrieve the documents.
I would prefer embedding, and updating multiple documents when there's a change, as opposed to using a ref, for several reasons:
Gets would be fast and easy, and filtering is not a problem (as you've said, and as sketched below).
Reads usually happen a lot more often than updates, and with proper indexing you won't really have to worry about performance.
It leverages NoSQL's schema-less nature: you'll be less prone to restructuring due to requirement changes (new sorting, new filters, etc.).
Paging would be a lot less of a hassle, and the UI would not be restricted in its design by paging and limits.
Joining could become expensive. Redundant data might be a hassle to update, but it's always better than not being able to display data in a particular way because your schema is normalized and joining is difficult.
I'd say the rule of thumb is: only split documents when you do not need to display them together. It's not impossible to join them back if you do, but it's definitely more troublesome.
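For instance, with the article embedded in each wish, the price/category filter is a single query. A pymongo sketch (the collection, database, and field names are illustrative, not your exact schema):

from pymongo import MongoClient

db = MongoClient()['shop']  # database name is illustrative

# Users with at least one wish whose embedded article matches
# both conditions on the same wish.
users = db.users.find({
    'wishlists.wishes': {
        '$elemMatch': {
            'article.category': 'books',
            'article.price': {'$lte': 20},
        }
    }
})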

How to handle application death and other mid-operation faults with MongoDB

Since Mongo doesn't have transactions that can be used to ensure that nothing is committed to the database unless it is consistent (non-corrupt) data, if my application dies between making a write to one document and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use write-heavy, de-normalized data structures and keep data in a single document which improves read speed, data locality and ensures consistency. Such a data model is also easier to scale, because a single read only hits a single server, instead of having to collect data from multiple sources.
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits, or choose a completely different concurrency approach such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMSs; it exposes a completely different kind of concurrency to a higher level of the application, if not the user.
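For reference, here is the classic two-phase-commit pattern from the MongoDB docs, condensed into a pymongo sketch (the account-transfer example and field names follow that doc, not any particular schema):

from pymongo import MongoClient

db = MongoClient()['bank']  # database name is illustrative

# 1. Record the intent in a transaction document.
txn_id = db.transactions.insert_one({
    'state': 'initial', 'source': 'A', 'dest': 'B', 'amount': 100,
}).inserted_id

# 2. Mark it pending, then apply each write, tagging the touched
#    documents so a recovery job can tell whether this transaction
#    has already been applied to them.
db.transactions.update_one({'_id': txn_id}, {'$set': {'state': 'pending'}})
db.accounts.update_one(
    {'_id': 'A', 'pendingTransactions': {'$ne': txn_id}},
    {'$inc': {'balance': -100}, '$push': {'pendingTransactions': txn_id}})
db.accounts.update_one(
    {'_id': 'B', 'pendingTransactions': {'$ne': txn_id}},
    {'$inc': {'balance': 100}, '$push': {'pendingTransactions': txn_id}})

# 3. Commit and clean up; a crash at any point leaves enough state
#    behind to finish or roll back the transaction later.
db.transactions.update_one({'_id': txn_id}, {'$set': {'state': 'committed'}})
db.accounts.update_many({}, {'$pull': {'pendingTransactions': txn_id}})
db.transactions.update_one({'_id': txn_id}, {'$set': {'state': 'done'}})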
Thinking about this myself, I want to identify some categories of effects:
1. Your operation has only one database save (saving data into one document)
2. Your operation has two database saves (updates, inserts, or deletions), A and B:
2.1. They are independent
2.2. B is required for A to be valid
2.3. They are interdependent (A is required for B to be valid, and B is required for A to be valid)
3. Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem: one database save is atomic. In case 2.1, same thing; if they're independent, they might as well be two separate operations.
For case 2.2, if you do B first and then A, at worst you will have some extra data (B data) that takes up space in your system but is otherwise harmless (see the sketch after the examples below). In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'.
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'.
2.2. You create a new engine document (B) and then add its ID to the car document (A).
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. You hitch car document A to another car document B: A gets the "towedBy" property set to B's ID, and B gets the "towing" property set to A's ID.
3.0. I'll leave examples of this to your imagination.
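For case 2.2, the safe ordering in code (a pymongo sketch; db, car_id, and the field names are illustrative):

# Create the engine document (B) first; a crash before the second
# write leaves only a harmless orphan engine.
engine_id = db.engines.insert_one({'type': 'gasoline'}).inserted_id
db.cars.update_one({'_id': car_id}, {'$set': {'engineId': engine_id}})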
In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and engine are separate documents (let's ignore the possibility of putting the engine inside the car document for this example). It's invalid both to have a diesel engine with non-diesel gas and to have a non-diesel engine with diesel gas, so they both have to change. But it may be valid to have no engine at all and have diesel gas. So you could add a step that keeps the whole thing valid at every point: first remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
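In code, that reordering might look like this (a pymongo sketch continuing the names from the earlier example; each step leaves every document in a valid state on its own):

db.cars.update_one({'_id': car_id}, {'$unset': {'engineId': ''}})        # no engine + old gas: valid
db.cars.update_one({'_id': car_id}, {'$set': {'gasType': 'diesel'}})     # no engine + diesel gas: valid
db.engines.update_one({'_id': engine_id}, {'$set': {'type': 'diesel'}})  # detached diesel engine: valid
db.cars.update_one({'_id': car_id}, {'$set': {'engineId': engine_id}})   # diesel engine + diesel gas: valid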
If you will get corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure: find all documents that have "towing" where the document with the ID in that property doesn't have its "towedBy" set to the right ID. The choices there would be to delete the "towing" property or to set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application.
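A recovery pass for that check might look like this (a pymongo sketch; it repairs by clearing 'towing', but setting the missing 'towedBy' instead would be just as valid):

# Find cars that claim to be towing something whose target
# doesn't point back, then repair one side.
for car in db.cars.find({'towing': {'$exists': True}}):
    target = db.cars.find_one({'_id': car['towing']})
    if target is None or target.get('towedBy') != car['_id']:
        db.cars.update_one({'_id': car['_id']}, {'$unset': {'towing': ''}})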
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are worse than others, particularly the kind that will cause errors in your application rather than simply incorrect display data.
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.

Thoughts on data model

For my app I thought of two different data models, but I cannot tell which one would be best in both performance and file size. In my app I have to store Recipes, each consisting of an array of ingredients, an array of instructions, an array of tips, and some properties used to select recipes (e.g. a rating, type of dish).
The first model would be to convert the arrays to NSData and store them all in the Core Data model. As the arrays are localized, that means there will be multiple arrays of the same kind in there (e.g. instructionsEN, instructionsFR, instructionsNL). As it is not necessary to query the arrays, I'm happy with the fact that I have to convert them to NSData.
The other model would be a Core Data store that contains only the properties needed to filter a recipe, plus an identifier pointing to a .plist file stored in the main bundle or the documents directory (some of these files will be created by us, and some by the user). This .plist file would contain all the instructions, ingredients, etc. Again, there are multiple arrays of the same kind for different localizations.
I hope you can help me decide which of these options would be best in terms of performance and disk space. I would also appreciate it if you could think of a different solution.
If you're going to use Core Data, you should generally go all the way. In that case, you would have an NSManagedObject subclass Ingredient. I would probably put a method on Ingredient like stringValueForLocale: that would take care of returning the best value. This means that a given ingredient can be translated once and reused for all recipes.
You would then have a Component entity that would have an Ingredient, a quantity value and a unit. A Recipe would have a 1:M property components that would point to these. Component should likely have an englishDescription as well, which would return a printable value like "1/4c sugar" while frenchDescription might print "50g de sucre" (note the volume/mass conversion there; Component is probably where you'd manage this.)
Instructions are a bit different, since they are less likely to be reusable. I guess you might get lucky and "Beat eggs to hard peaks." might show up in several recipes, but unless you're going to actively look for those kinds of reuse, it's probably more trouble than it's worth. Instructions are also the natural place to address cultural differences. In France, eggs are often stored at room temperature. In America, they are always refrigerated. To correctly translate a French recipe to American English, you sometimes have to include an extra step like "bring eggs to room temperature." (But it depends on the recipe, since it doesn't always matter.) It generally makes sense to do this in the instructions rather than in the Ingredients.
I'd probably create an Instructions entity with stringValuesForLocale: (which would return an array of strings). Then you could do some profiling and decide whether to break this up into separate LocalizedInstructions entities so that you don't have to fault all of the localizations. The advantage of this design is that you can change your mind later about the internal database layout without impacting the higher levels. In either case, however, I'd probably store the actual instructions as an NSData encoding an NSArray. It's probably not worth the trouble and cost of creating a bunch of individual LocalizedInstruction entities.

What's more efficient: a Core Data fetch, or manipulating/creating arrays?

I have a Core Data application and I would like to get results from the db based on certain parameters, for example grabbing only the events that occurred in the last week, or the events that occurred in the last month. Is it better to do one fetch for the whole entity and then create arrays out of that result array for each situation, or is it better to use predicates and make multiple fetches?
The answer depends on a lot of factors. I'd recommend perusing the documentation's description of the various store types. If you use the SQLite store type, for example, it's far more efficient to make proper use of date range predicates and fetch only those in the given range.
Conversely, say you search on a non-standard attribute, like looking for a substring in an encrypted string: you'll have to pull everything in, decrypt the strings, do your search, and note the matches.
On the far end of the spectrum, you have the binary store type, which means the whole thing will always be pulled into memory regardless of what kind of fetches you might do.
You'll need to describe your managed object model and the types of fetches you plan to do in order to get a more specific answer.