How do I implement object persistence without loading everything into memory? - perl

I have a Graph object (this is in Perl) for which I compute the transitive closure (i.e. for solving the all-pairs shortest paths problem).
From this object, I am interested in computing:
Shortest path from any vertex u to any vertex v.
Distance matrix for all vertices.
General reachability questions.
General graph features (density, etc).
The graph has about 2000 vertices, so computing the transitive closure (using the Floyd-Warshall algorithm) takes a couple of hours. Currently I am simply caching the serialized object (using Storable, so it's pretty efficient already).
My problem is, deserializing this object still takes a fair amount of time (a minute or so), and consumes about 4GB of RAM. This is unacceptable for my application.
Therefore I've been thinking about how to design a database schema to hold this object in 'unfolded' form. In other words, precompute the all-pairs shortest paths, and store those in an appropriate manner. Then, perhaps use stored procedures to retrieve the necessary information.
My other problem is, I have no experience with database design, and have no clue about implementing the above, hence my post. I'd also like to hear about other solutions that I may be disregarding. Thanks!

To start with, it sounds like you need two entities, vertex and edge, and perhaps a couple of tables for results. I would suggest a table that stores node-to-node information: if X is reachable from Y, the relationship gets the reachable attribute. So here goes:
Vertex:
  any coordinates (x,y,...)
  name: string
  any attributes of a vertex*
Association:
  association_id: ID
  association_type: string
VertexInAssociation:
  vertex_id: (constrained to Vertex)
  association_id: (constrained to Association)
AssociationAttributes:
  association_id: ID (constrained to Association)
  attribute_name: string
  attribute_value: variable -- possibly string
* You might also want to store vertex attributes in a table as well, depending on how complex they are.
The reason I'm adding the complexity of Association is that an edge is not treated as directional, and it simplifies queries to consider both vertices simply as members of a set of vertices "connected-by-edge-x".
Thus an edge is simply an association of edge type, which would have an attribute of distance. A path is an association of path type, and it might have an attribute of hops.
There might be other, more optimized schemas, but this one is conceptually pure--even if it doesn't make "edge" a first-class entity.
To create a minimal edge you would need to do something like this:
begin transaction
-- allocate the next association id (a sketch; in practice use an identity column or sequence)
select associd = max(association_id) + 1 from Association
insert into Association ( association_id, association_type )
values( associd, 'edge' )
insert into VertexInAssociation( association_id, vertex_id )
select associd, ? -- $vertex->[0]->{id}
UNION select associd, ? -- $vertex->[1]->{id}
insert into AssociationAttributes ( association_id, attribute_name, attribute_value )
select associd, 'length', 1
UNION select associd, 'distance', ? -- $edge->{distance}
commit
You might also want to make association types classes of sorts, so that the "edge" association automatically gets counted as a "reachable" association. Otherwise, you might want to insert UNION select associd, 'reachable', 'true' in there as well.
You could then query the union of the reachable associations of both vertices, insert them as reachable associations to the other vertex where they do not already exist, and write the existing length attribute value + 1 into the new length attribute.
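A hedged sketch of the kind of lookup this schema then supports (column names follow the insert example above), for instance "is vertex u reachable from vertex v?":
select a.association_id
from Association a
join VertexInAssociation m1 on m1.association_id = a.association_id
join VertexInAssociation m2 on m2.association_id = a.association_id
where a.association_type = 'reachable'
and m1.vertex_id = ? -- $u->{id}
and m2.vertex_id = ? -- $v->{id}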
You'd probably want an ORM for all of that, though, and just manipulate it from inside Perl.
my $v1 = Vertex->new( 'V', x => 23, y => 89, red => 'hike!' );
my $v2 = Vertex->new( 'W', x => 42, y => 13 );
my $e = Edge->new( $v1, $v2 ); # perhaps Edge knows how to calculate distance.

Related

Merge neo4j relationships into one while returning the result if a certain condition is satisfied

My use case is:
I have to return the whole graph in the result, but the condition is:
If there is more than one relationship between two particular nodes in the same direction, then I have to merge them into just one relationship. For example: say there are two nodes 'm' and 'n' with three relationships between them, r1, r2, r3 (all in the same direction); then the result of my Cypher query should contain only one relationship between 'n' and 'm'.
I need to perform some operations on top of this: the resultant relationship obtained by merging should keep the properties, and their values, that I want to retain. In practice I will retain all the properties of one of the merged relationships, chosen according to a timestamp field that is one of the properties on every relationship.
Note: all my relationships have the same set of properties (the number and names of the properties are the same across all relationships; the values may of course differ).
Any help would be appreciated. Thanks in advance.
You mean something like this?
Delete all except the first
MATCH (a)-[r]->(b)
WITH a,b,type(r) as type, collect(r) as rels
FOREACH (r in rels[1..] | DELETE r)
Ordering by timestamp first
MATCH (a)-[r]->(b)
WITH a,r,b
ORDER BY r.timestamp DESC
WITH a,b,type(r) as type, collect(r) as rels
FOREACH (r in rels[1..] | DELETE r)
If you want to do all of those operations virtually, just on the query results, you'd do them in your programming language of choice.

FileMaker database design with calculated fields and filtering

I am trying out FileMaker Pro 12 right now with no previous FM experience, although I have other basic DB experience. The issue I'm having is doing filtered queries for a report that span one-to-many relationships. Here is an example:
The 2 tables:
Sample_Replicate
  PK
  Sample FK
  other fields
Weights
  Sample_Replicate_FK (linked to PK of Sample_Replicate)
  Weight
  Measurement type (tare, gross, dry, ash)
  Wash type (null or from list of lab assays)
I want to create a report that displays: (gross-tare), (dry-tare)/(gross-tare), (ash-tare)/(gross-tare), and (dry-tare)/(gross-tare) for all dry weights with non null wash types.
It seems that FM wants me to create columns for each of these values (which is doable, as the list of lab assays changes minimally and updating the database would be acceptable, though not preferred). I have tried to add a gross wt, tare wt, etc. to the Sample_Replicate table, but it only returns the first related record (tare wt) when I use a calculated field with this method:
tare wt field = Case ( Weights::Measurement type = "Tare"; Weights::Weights )
gross wt field = Case ( Weights::Measurement type = "Gross"; Weights::Weights )
etc...
It also seems to be failing when I add the criteria:
and Is Empty(Weights::Wash type )
Could someone point me in the right direction on this issue? Thanks.
EDIT:
I came across this: http://www.filemakertoday.com/com/showthread.php/14084-Calculation-based-on-1-to-many-relationship
It seems that I can create ~15 calculated fields for each combination of measurement and wash type on the weights table, then do a sum of these columns in the sample_replicate after adding these 15 columns to the table. This seems absolutely asinine. Isn't there a better way to filter results of a one-to-many relationship in FM?
What about the following structure:
Replicate
  ID
Wash Weight
  Replicate ID
  Type (null or from list of lab assays)
  Tare
  Gross
  Dry
  Ash
  + calculated fields
I assume you only calculate weight ratios of the same wash type. The weight types (tare, gross, etc.) are not just labels here; since you use them in formulas in specific places, they are more like roles, so I think they deserve their own fields.
Add a tare wt field, etc. in the Weights table, but then add a calc field in your Sample_Replicate table to get the sum of all related values.
Ex: add a field "total tare wt" defined as Sum ( Weights::tare wt )

ETL process: when and how to add in foreign keys (T-SQL, SSIS)

I am in the early stages of creating a Data Warehouse based loosely on the Kimball methodology.
I am currently investigating my source data. I understand that adding a surrogate primary key (rather than using a natural key) is what will allow me to make the connections between the facts and dimensions.
It sounds like a silly question, but how exactly is this done? Are there any good articles that run through this process?
I would imagine we bring in all of the dimensions first. And when the fact data is brought over, a lookup is performed that "pushes" the foreign key into the fact table? At what point is this done? Within SSIS, what is the "best practice" method? Is this all done in one package, for example?
Is that roughly how it happens?
In this case, do we have to be particularly careful about the order in which we load our data, or could we end up loading facts for which there is no corresponding dimension?
"I would imagine we bring in all of the dimensions first. And when the fact data is brought over, a lookup is performed that 'pushes' the foreign key into the fact table? At what point is this done? Within SSIS, what is the 'best practice' method? Is this all done in one package, for example?"
It would depend on your schema and table design.
Assuming it's a star schema and the FK is based on the data value itself:
DIM1 <- FACT1 -> DIM2
  ^
  |
FACT2 -> DIM3
you'll first fill DIM1 and DIM2 before inserting into FACT1 as you would need the FK.
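For instance, the lookup that "pushes" the dimension key into the fact table can be sketched roughly like this in T-SQL (StageSales and the natural-key column names are hypothetical; in SSIS this is typically done with a Lookup transformation):
-- Resolve each incoming fact row's surrogate keys from the already-loaded
-- dimensions, then insert into the fact table.
INSERT INTO FACT1 (dim1_key, dim2_key, amount)
SELECT d1.dim1_key, d2.dim2_key, s.Amount
FROM StageSales AS s
JOIN DIM1 AS d1 ON d1.natural_key = s.dim1_natural_key
JOIN DIM2 AS d2 ON d2.natural_key = s.dim2_natural_key;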
Assuming it's a snowflake schema:
DIM1_1
  ^
  |
DIM1 <- FACT1 -> DIM2
you'll first fill DIM1_1 then DIM1 and DIM2 before inserting into FACT1.
Assuming the FK relation is based on something else (usually a number) instead of the data value itself (kind of an optimization when dealing with huge amounts of data and/or strings as dimension values), you won't need to wait until you have inserted the data into the DIM table. I'm sure that sounds very confusing :), so I'll try to explain briefly. The steps involved would be something like this (assume a simple star schema with 2 tables, FACT1 and DIMENSION1):
Extract FACT and DIMENSION values from the data set you are processing.
Generate a unique number based on the DIMENSION's value (which, say, is a string), using a reproducible algorithm (e.g. a hash such as SHA1: given the same string, it always produces the same number).
Insert into FACT1 table, the number and FACT values.
Insert into DIMENSION1 table, the number and DIMENSION values.
Steps 3 & 4 can be done in parallel, as long as there is NO constraint in place. A join on a numeric column is also more efficient than one on a string.
And there is no need to store the mapping for #2 because it's reproducible (just ensure you pick the right algo).
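A rough T-SQL sketch of steps 2-4 (all table and column names here, such as StageSales, CustomerName and Amount, are hypothetical):
-- Step 2: derive a reproducible surrogate key from the dimension value.
-- HASHBYTES returns the same digest for the same string; the leading
-- 8 bytes are reinterpreted as a BIGINT key.
-- Step 3: load the fact rows using the derived key.
INSERT INTO FACT1 (dim_key, amount)
SELECT CAST(CONVERT(BINARY(8), HASHBYTES('SHA1', CustomerName)) AS BIGINT),
       Amount
FROM StageSales;
-- Step 4: load the dimension with the same derived key; with no FK
-- constraint enforced yet, this can run in parallel with step 3.
INSERT INTO DIMENSION1 (dim_key, customer_name)
SELECT DISTINCT CAST(CONVERT(BINARY(8), HASHBYTES('SHA1', CustomerName)) AS BIGINT),
       CustomerName
FROM StageSales;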
Obviously this can be extended for snowflake schema and/or multiple dimensions.
HTH

What's the utility of the array type?

I'm a total newbie with PostgreSQL, but I have good experience with MySQL. I was reading the documentation and discovered that PostgreSQL has an array type. I'm quite confused, since I can't understand in which context this type can be useful within an RDBMS. Why would I choose this type instead of using a classical one-to-many relationship?
Thanks in advance.
I've used them to make working with trees (such as comment threads) easier. You can store the path from the tree's root to a single node in an array; each number in the array is the branch number for that node. Then, you can do things like this:
SELECT id, content
FROM nodes
WHERE tree = X
ORDER BY path -- The array is here.
PostgreSQL will compare arrays element by element in the natural fashion so ORDER BY path will dump the tree in a sensible linear display order; then, you check the length of path to figure out a node's depth and that gives you the indentation to get the rendering right.
The above approach gets you from the database to the rendered page with one pass through the data.
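A small sketch of what that can look like (table and column names are illustrative):
-- Each node stores the branch numbers from the root down to itself.
CREATE TABLE nodes (
    id      serial PRIMARY KEY,
    tree    integer   NOT NULL,  -- which comment thread this node belongs to
    path    integer[] NOT NULL,  -- e.g. {1}, {1,1}, {1,2}, {2}
    content text
);
-- Element-by-element array comparison gives depth-first display order,
-- and the array length gives the indentation depth.
SELECT id,
       array_length(path, 1) AS depth,
       content
FROM nodes
WHERE tree = 1
ORDER BY path;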
PostgreSQL also has geometric types, simple key/value types, and supports the construction of various other composite types.
Usually it is better to use traditional association tables but there's nothing wrong with having more tools in your toolbox.
One SO user is using it for what appears to be machine-aided translation. The comments to a follow-up question might be helpful in understanding his approach.
I've been using them successfully to aggregate recursive tree references using triggers.
For instance, suppose you have a tree of categories, and you want to find products in any of the categories (1,2,3) or any of their subcategories.
One way to do it is to use an ugly WITH RECURSIVE statement. Doing so will output a plan stuffed with merge/hash joins on entire tables and an occasional materialize.
with recursive categories as (
select id
from categories
where id in (1,2,3)
union all
...
)
select products.*
from products
join product2category on...
join categories on ...
group by products.id, ...
order by ... limit 10;
Another is to pre-aggregate the needed data:
categories (
id int,
parents int[] -- (array_agg(parent_id) from parents) || id
)
products (
id int,
categories int[] -- array_agg(category_id) from product2category
)
index on categories using gin (parents)
index on products using gin (categories)
select products.*
from products
where categories && array(
select id from categories where parents && array[1,2,3]
)
order by ... limit 10;
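The triggers mentioned earlier keep these arrays up to date. A hypothetical PL/pgSQL sketch for the product side, assuming product2category has product_id and category_id columns:
create or replace function refresh_product_categories() returns trigger as $$
declare
  pid int;
begin
  -- which product row is affected by this change?
  if tg_op = 'DELETE' then
    pid := old.product_id;
  else
    pid := new.product_id;
  end if;
  -- rebuild the aggregated category array for that product
  update products
     set categories = array(select category_id
                            from product2category
                            where product_id = pid)
   where id = pid;
  return null;  -- return value is ignored for AFTER triggers
end;
$$ language plpgsql;

create trigger product2category_sync
after insert or update or delete on product2category
for each row execute procedure refresh_product_categories();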
One issue with the above approach is that row estimates for the && operator are junk. (The selectivity is a stub function that has yet to be written, and results in something like 1/200 rows irrespective of the values in your aggregates.) Put another way, you may very well end up with an index scan where a seq scan would be correct.
To work around it, I increased the statistics target on the GIN-indexed column, and I periodically look into pg_stats to extract more appropriate stats. When a cursory look at those stats reveals that using && for the specified values will yield an incorrect plan, I rewrite the applicable occurrences of && with arrayoverlap() (the latter has a stub selectivity of 1/3), e.g.:
select products.*
from products
where arrayoverlap(categories, array(
select id from categories where arrayoverlap(parents, array[1,2,3])
))
order by ... limit 10;
(The same goes for the <@ operator...)
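For reference, the statistics-target bump mentioned above looks roughly like this (using the hypothetical names from the earlier sketch):
-- Raise the per-column statistics target so ANALYZE samples more values,
-- then inspect pg_stats for the GIN-indexed array column.
ALTER TABLE products ALTER COLUMN categories SET STATISTICS 1000;
ANALYZE products;
SELECT most_common_elems, most_common_elem_freqs
FROM pg_stats
WHERE tablename = 'products' AND attname = 'categories';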

Relations With No Attributes

Aheo asks if it is ok to have a table with just one column. How about one with no columns, or, given that this seems difficult to do in most modern "relational" DBMSes, a relation with no attributes?
There are exactly two relations with no attributes, one with an empty tuple, and one without. In The Third Manifesto, Date and Darwen (somewhat) humorously name them TABLE_DEE and TABLE_DUM (respectively).
They are useful to the extent that they are the identity of a variety of relational operators, playing roles equivalent to 1 and 0 in ordinary algebra.
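Concretely, for any relation R:
R JOIN TABLE_DEE = R (DEE is the identity for natural join, much as 1 is the identity for multiplication)
R JOIN TABLE_DUM = the empty relation with R's heading (much as multiplying by 0 gives 0)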
A table with a single column is a set -- as long as you don't care about ordering the values, or associating any other info with them, it seems fine. You can check for membership in it, and basically that's all you can do. (If you don't have a UNIQUE constraint on the single column I guess you could also count number of occurrences... a multiset).
But what in blazes would a table with no columns (or a relation with no attributes) mean -- or, how would it be any good?!
DEE and cartesian product form a monoid. In practice, if you have Date's relational summarize operator, you'd use DEE as your grouping relation to obtain grand-totals. There are many other examples where DEE is practically useful, e.g. in a functional setting with a binary join operator you'd get n-ary join = foldr join dee
"There are exactly two relations with no attributes, one with an empty tuple, and one without. In The Third Manifesto, Date and Darwen (somewhat) humorously name them TABLE_DEE and TABLE_DUM (respectively).
They are useful to the extent that they are the identity of a variety of relational operators, playing roles equivalent to 1 and 0 in ordinary algebra."
And of course they also play the role of "TRUE" and "FALSE" in boolean algebra. Meaning that they are useful when propositions such as "The shop is open" and "The alarm is set" are to be represented in a database.
A consequence of this is that they can also be usefully employed in any expression of the relational algebra for their properties of "acting as an IF/ELSE" : joining to TABLE_DUM means retaining no tuples at all from the other argument, joining to TABLE_DEE means retaining them all. So joining R to a relvar S which can be equal to either TABLE_DEE or TABLE_DUM, is the RA equivalent of "if S then R else FI", with FI standing for the empty relation.
Hm. So the lack of "real-world examples" got to me, and I tried my best. Perhaps surprisingly, I got half way there!
cjs=> CREATE TABLE D ();
CREATE TABLE
cjs=> SELECT COUNT (*) FROM D;
 count
-------
     0
(1 row)
cjs=> INSERT INTO D () VALUES ();
ERROR: syntax error at or near ")"
LINE 1: INSERT INTO D () VALUES ();
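For what it's worth, PostgreSQL's DEFAULT VALUES form appears to get the rest of the way there (a sketch; behaviour may vary by version):
INSERT INTO D DEFAULT VALUES;  -- inserts a tuple with no columns
SELECT COUNT (*) FROM D;       -- should now return 1: TABLE_DEE rather than TABLE_DUM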
A table with a single column would make sense as a simple lookup. Let's say you have a list of strings you want to filter user-entered text against. That table would store the words you want to filter out.
It is difficult to see the utility of TABLE_DEE and TABLE_DUM from an SQL database perspective. After all, it is not guaranteed that your favorite DB vendor will even let you create one or the other.
It is also difficult to see their utility in classic relational algebra; one has to look beyond that. To get a flavor of how these constants come alive, consider relational algebra put into proper mathematical shape, that is, as close as possible to Boolean algebra. D&D's Algebra A is a step in this direction. Then one can express the classic relational algebra operations via more fundamental ones, and those two constants become really handy.