How might Union/Find data structures be applied to Kruskal's algorithm?


http://en.wikipedia.org/wiki/Disjoint_sets
http://en.wikipedia.org/wiki/Kruskal's_algorithm
Union/Find data structure being used for disjoint sets...

As the Wikipedia entry for Kruskal's algorithm states, you can use the union/find structure to test (via FIND) whether an edge connects two different trees or would form a cycle if added.
If the edge does not form a cycle and is added to the spanning tree, the same structure is then updated (via UNION).
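
The FIND/UNION steps described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name DisjointSet, the function name kruskal, and the (weight, u, v) edge format are assumptions chosen for the example, and vertices are assumed to be numbered 0..n-1.

```python
class DisjointSet:
    """Union/Find over the integers 0..n-1 (path halving + union by size)."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Walk up to the root, shortening the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree below the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def kruskal(num_vertices, edges):
    """edges: iterable of (weight, u, v). Returns (total_weight, tree_edges)."""
    ds = DisjointSet(num_vertices)
    tree, total = [], 0
    for weight, u, v in sorted(edges):
        # FIND: different roots means the edge joins two different trees.
        if ds.find(u) != ds.find(v):
            # UNION: record that the two trees are now one.
            ds.union(u, v)
            tree.append((u, v, weight))
            total += weight
    return total, tree


edges = [(1, 0, 1), (3, 0, 2), (2, 1, 2), (4, 2, 3), (5, 1, 3)]
print(kruskal(4, edges))  # (7, [(0, 1, 1), (1, 2, 2), (2, 3, 4)])
```

An edge whose endpoints already share a root would close a cycle, so it is skipped; that is the only check the algorithm needs.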

Related

Term for diff/delta on multiple files or data structures

I would like to know whether there is a proper term to describe "diffing" of / obtaining the delta between multiple files or data structures, such that the resulting "diff" contains first a description of the parts common to all files/structures, then descriptions of how this "base" file/structure must be modified to obtain the individual ones, ideally in a hierarchical fashion if some files/structures are more similar to each other than others.
There are some questions and answers about how to do this with certain tools (e.g. DIFF utility works for 2 files. How to compare more than 2 files at a time?), but as I want to do this for a specific type of data structure (namely JSON), I'm at a loss as to what I should even search for.
This type of problem seems to me like it should be common enough to have a name such as "hierarchical diff" (which however seems to be reserved for 2-way diffs on hierarchical data structures), "commonality finding", or something like that.
I guess a related concept about hierarchical ordering of commonalities and differences is formal concept analysis, but this operates on sets of properties rather than hierarchical data structures and won't help me much.

There are several valid terms:
Data comparison (or Sequence comparison)
Delta encoding
Delta compression (or Differential compression)
Algorithms:
An O(ND) Difference Algorithm and Its Variations (Eugene Myers)
A technique for isolating differences between files (Paul Heckel)
The String-to-String Correction Problem with Block Moves (Walter Tichy)
Good Wikipedia links
Longest common subsequence problem
Comparison of file comparison tools
Diff Unix Utility
Some implementations
diff-match-patch (Neil Fraser - Google)
jsdifflib
jsondiffpatch
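
To make the multi-way case from the question concrete, here is a minimal sketch that extracts the part common to several JSON-like dictionaries and then the per-document additions. The function names common_base and delta are hypothetical, not taken from any of the libraries listed above. The sketch only handles nested dicts (lists are compared as whole values) and does not attempt the hierarchical grouping of similar documents.

```python
def common_base(docs):
    """Return the part shared by every dict in docs (same key, equal value)."""
    first, rest = docs[0], docs[1:]
    base = {}
    for key, value in first.items():
        if not all(key in d for d in rest):
            continue
        others = [d[key] for d in rest]
        if isinstance(value, dict) and all(isinstance(o, dict) for o in others):
            # Recurse so that partially shared sub-objects contribute too.
            nested = common_base([value] + others)
            if nested:
                base[key] = nested
        elif all(o == value for o in others):
            base[key] = value
    return base


def delta(base, doc):
    """Return what has to be merged into base to obtain doc."""
    out = {}
    for key, value in doc.items():
        if key not in base:
            out[key] = value
        elif isinstance(value, dict) and isinstance(base[key], dict):
            nested = delta(base[key], value)
            if nested:
                out[key] = nested
        elif base[key] != value:
            out[key] = value
    return out


docs = [
    {"a": 1, "b": {"x": 1, "y": 2}},
    {"a": 1, "b": {"x": 1, "y": 3}, "c": 5},
    {"a": 1, "b": {"x": 1}},
]
base = common_base(docs)
print(base)                            # {'a': 1, 'b': {'x': 1}}
print([delta(base, d) for d in docs])  # [{'b': {'y': 2}}, {'b': {'y': 3}, 'c': 5}, {}]
```

Because the base contains only what every document shares, each delta consists purely of additions, so no deletion markers are needed in this simplified form.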

Where can I find data that used to be returned by allensdk.CellTypesCache.get_cells()

Prior to allensdk version 0.14.5, the CellTypesCache.get_cells() function returned a large, nested structure containing information about cell morphology, ephys features, location, anatomical structure, tissue donors, etc. In version 0.14.5, the structure returned is flat and much smaller.
I see that some of this information is available through get_ephys_features() and get_morphology_features(), but I'm not sure where to find the rest. Where can I go to find out how to migrate my code to the new allensdk version?

Great question. We simplified the returned dictionary from CellTypesCache.get_cells for a few reasons:
1. There were a large number of fields that were variously: unexplained, not useful, distracting, and/or redundant with data returned from other functions.
2. The way brain structures were handled made it very difficult to filter cells by cortical layer across species.
3. The query involved a large number of joins and was fairly slow.
(2) was probably the most urgent issue we needed to address. The new dictionary structure is explained in a bit more detail here:
https://github.com/AllenInstitute/AllenSDK/wiki/Release-Notes-(0.14.5)
You are correct that you should look for ephys and morphology features in CellTypesCache.get_ephys_features and CellTypesCache.get_morphology_features (or just CellTypesCache.get_all_features).
If there are any fields you were using in the old dictionary structure that are not now available in the current dictionary, let me know and we can find them again.

Bipartite graph distributed processing with dynamic programming <?>

I am trying to figure out an efficient algorithm for processing Documents in a distributed (FaaS, to be more precise) environment.
The brute-force approach would be O(D * F * R), where:
D is the number of Documents to process
F is the number of Filters
R is the highest number of Rules in a single Filter
I can assume that:
a single Filter has no more than 10 Rules
some Filters may share Rules (so it's an N-to-N relation)
Rules are boolean functions (predicates), so I can try to take advantage of early cutting: if I have f() && g() && h() and f() evaluates to false, then I do not have to evaluate g() and h() at all and can return false immediately.
the number of Fields in a single Document is always the same (about 5-10)
Filters, Rules and Documents are already in the database
every Filter has at least one Rule
Using sharing (the second assumption), I had the idea to first evaluate every Rule against the Document and then, once that is finished, compute the result of every Filter from the already computed Rules. This way a shared Rule is computed only once. However, this doesn't take advantage of early cutting (the third assumption).
The second idea is to use early cutting as a slightly optimized brute force, but then it won't use Rule sharing.
Rule sharing looks like subproblem sharing, so memoization and dynamic programming will probably be helpful.
I have noticed that the Filter-Rule relation is a bipartite graph. I'm not quite sure whether that can help me, though. I have also noticed that I could use reverse sets and store the corresponding set of Filters in every Rule. This would, however, create a circular dependency and may cause desynchronization problems in the database.
The default idea is that Documents are streamed, and every one of them is an event that creates a FaaS instance to process it. However, this would probably force every FaaS instance to query for all Filters, which leaves me at O(F * D) queries because of the Shared-Nothing architecture.
Sample Filter:
{
  'normalForm': 'CONJUNCTIVE',
  'rules':
  [
    {
      'isNegated': true,
      'field': 'X',
      'relation': 'STARTS_WITH',
      'value': 'G',
    },
    {
      'isNegated': false,
      'field': 'Y',
      'relation': 'CONTAINS',
      'value': 'KEY',
    },
  ],
}
or in a more condensed form:
document -> !document.x.startsWith("G") && document.y.contains("KEY")
for the Document:
{
'x': 'CAR',
'y': 'KEYBOARD',
'z': 'PAPER',
}
evaluates to true.
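
Early cutting and Rule sharing do not have to exclude each other: if Rules are evaluated lazily and their results memoized per Document, a shared Rule is computed at most once and Rules behind a false conjunct are never computed. The following is a minimal sketch under several assumptions that are not in the original data model: the names RELATIONS, rule_key, eval_rule and eval_filter are hypothetical, Rules with identical content are treated as the same shared Rule, the Rule field 'X' is assumed to map to the Document key 'x', and any normalForm other than 'CONJUNCTIVE' is assumed to mean a disjunction.

```python
RELATIONS = {
    "STARTS_WITH": lambda field_value, value: field_value.startswith(value),
    "CONTAINS": lambda field_value, value: value in field_value,
}


def rule_key(rule):
    # Assumption: Rules with identical content are the same shared Rule.
    return (rule["isNegated"], rule["field"], rule["relation"], rule["value"])


def eval_rule(rule, document, cache):
    """Evaluate one Rule, reusing the per-Document cache when possible."""
    key = rule_key(rule)
    if key not in cache:
        # Assumption: field 'X' in a Rule refers to key 'x' in the Document.
        field_value = document[rule["field"].lower()]
        result = RELATIONS[rule["relation"]](field_value, rule["value"])
        cache[key] = (not result) if rule["isNegated"] else result
    return cache[key]


def eval_filter(flt, document, cache):
    """Evaluate a Filter lazily; all()/any() stop at the first decisive Rule."""
    results = (eval_rule(rule, document, cache) for rule in flt["rules"])
    if flt["normalForm"] == "CONJUNCTIVE":
        return all(results)
    return any(results)  # assumption: every other value means disjunctive


sample_filter = {
    "normalForm": "CONJUNCTIVE",
    "rules": [
        {"isNegated": True, "field": "X", "relation": "STARTS_WITH", "value": "G"},
        {"isNegated": False, "field": "Y", "relation": "CONTAINS", "value": "KEY"},
    ],
}
document = {"x": "CAR", "y": "KEYBOARD", "z": "PAPER"}

cache = {}  # one cache per Document, shared by all Filters
print(eval_filter(sample_filter, document, cache))  # True
```

The cache only helps while it lives in the memory of one process, so this pays off when all Filters that share Rules are evaluated for a Document in the same FaaS instance.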
I can slightly change the data model, stream something else instead of Documents (e.g. Filters), and use any NoSQL database and tools that help. Apache Flink (event processing) and MongoDB (a single query to retrieve a Filter with its Rules), or maybe Neo4j (as the model looks like a bipartite graph), look like they could help, but I'm not sure about it.
Can it be processed efficiently (with regard to, probably, database queries)? What tools would be appropriate?
I have also been wondering whether I am trying to solve a special case of some more general (math) problem that may have useful theorems and algorithms.
EDIT: My newest idea: gather all Documents in a cache like Redis. Then a single event starts up and publishes N functions (as in Function as a Service), and every function selects F/N Filters (the number of Filters divided by the number of instances, so the Filters are distributed evenly across instances). This way every Filter is fetched from the database only once.
Now every instance streams all Documents from the cache (one Document should be less than 1 MB, and I should have 1-10k of them, so they should fit in the cache). This way every Document is selected from the database only once (into the cache).
I have reduced database read operations (some Rules are still selected multiple times), but I am still not taking advantage of Rule sharing across Filters. I could intentionally ignore it by using a document database: by selecting a Filter I would also get its Rules. Still, I would have to recalculate their values.
I guess that's what I get for using a scalable Shared-Nothing architecture?

I realized that although my graph is indeed bipartite (in theory), in practice it's going to be a set of disjoint bipartite graphs (as not all Rules are going to be shared). This means that I can process those disjoint parts independently on different FaaS instances without recalculating the same Rules.
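
Finding those disjoint parts amounts to computing the connected components of the Filter-Rule graph, which a union/find structure handles directly. The sketch below is only an illustration: the function name split_into_components and the input format (a dict mapping each Filter id to the ids of its Rules) are assumptions, not part of the original data model.

```python
from collections import defaultdict


def split_into_components(filter_rules):
    """filter_rules: dict mapping filter_id -> iterable of rule_ids.

    Returns a list of (filter_ids, rule_ids) set pairs, one per connected
    component of the bipartite Filter-Rule graph.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    # Tag ids so that a Filter and a Rule with the same id never collide.
    for filter_id, rule_ids in filter_rules.items():
        for rule_id in rule_ids:
            union(("F", filter_id), ("R", rule_id))

    groups = defaultdict(lambda: (set(), set()))
    for node in list(parent):
        kind, ident = node
        filters, rules = groups[find(node)]
        (filters if kind == "F" else rules).add(ident)
    return list(groups.values())


components = split_into_components({
    "f1": ["r1", "r2"],
    "f2": ["r2", "r3"],
    "f3": ["r4"],
})
for filters, rules in components:
    print(sorted(filters), sorted(rules))
```

Here f1 and f2 end up in one component because they share r2, while f3 forms a component of its own; each component could then be handed to a separate instance.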
This reduces my problem to processing a single connected bipartite graph. Now, I can use the benefits of dynamic programming and share the results of Rule computations only if memory is shared, so I cannot divide (and distribute) this problem further without sacrificing this benefit. So I thought of it this way: if I have already decided that I will have to recompute some Rules, then let their number be low compared to the size of the disjoint parts that I will get.
This is actually the minimum cut problem, which (fortunately) has known polynomial-time algorithms.
However, this may not be ideal in my case, because I don't want to cut off just any part of the graph. I would like to cut the graph ideally in half (a divide and conquer strategy that could be reapplied recursively until the graph is small enough to be processed within seconds in a FaaS instance, which has a time bound).
This means that I am looking for a cut that creates two disjoint bipartite graphs with the same (or at least a similar) number of vertices each.
This is the sparsest cut problem, which is NP-hard but has an O(sqrt(log N))-approximation algorithm that also favors fewer cut edges.
Currently this does look like a solution to my problem; however, I would love to hear any suggestions, improvements and other answers.
Maybe it can be done better with another data model or algorithm? Maybe I can reduce it further with some theorem? Maybe I could transform it into another (simpler) problem, or at least one that is easier to divide and distribute across nodes?
This idea and analysis strongly suggest using a graph database.

Group multiple Simulink Bus Objects into structures

Short version
I am considering using BusObjects to implement hard interface control in a (large industrial) application using Simulink, and I would like to store the BusObjects (hundreds of them) in a MATLAB structure so that the entire application interface specification is well organized. However, it seems that BusObjects can't be contained in structures, nor can they reside in workspaces other than the MATLAB base workspace. Any idea on how to handle this?
Long version
I would like the interface specification to be hierarchical and centralized in some way. I mean, I would like to specify the external interface of my application, then the internal interfaces, then the internal interfaces of the internal interfaces, and so on. And I would like this information to be stored in one object that resembles the hierarchy. I was thinking of using a structure with BusObjects as elements.
Unfortunately, it seems that for a bus object to work, it must be declared in the MATLAB workspace as an independent variable of class BusObject. It can't be a BusObject stored as an element of a structure, an element of a cell array whose elements are BusObjects, or an element of a BusObject vector.
Any suggestions on how to handle this? Take into account that if you have a model with dozens and dozens of blocks and more than 3 hierarchy levels, you end up with hundreds of bus objects in the MATLAB workspace without any particular structure... I think that is too messy to leave as it is...

Bus objects are always stored in the global workspace.
Send a request to Mathworks if you want to change this.

Biojava secondary structure prediction

Is there any method in BioJava to predict secondary structure from a given sequence?
If not, does anyone know how I can implement it? Any source code? Any executable to recommend?

I think it is not available in BioJava, but what you can do is run a BLAST search using the BioJava libraries (http://biojava.org/wiki/BioJava:CookBook3:NCBIQBlastService) or the RCSB web page, and from the BLAST results find a protein with a 3D structure. Afterwards you can use the suggestDomains(Structure s) method of the LocalProteinDomainParser class in the org.biojava.bio.structure.domain package. Domains might give you some ideas about secondary structures.

I don't think there is secondary structure prediction from sequence in BioJava. However, there is secondary structure assignment for a given structure, based on the DSSP algorithm; see https://github.com/biojava/biojava-tutorial/blob/master/structure/secstruc.md
