Please help me 'grok' triplestores vs. RDBMS

I'm an RDBMS person from way back. I'm trying to grok triplestores. I think my "confusion" may be addressed with the answer to the following question:
How is this...
Table (Subjects):
ID Subject Details
1 Barney …
2 Fred …
3 Picture …
4 …
Table2 (Predicates):
ID Predicate Details
1 friendOf …
2 marriedTo …
3 hasTimeStamp …
4 hasGeoCoord …
5 hasEventName …
6 belongsTo …
7 containsPerson …
8 …
Table3 (Objects) - These may be Subjects as well:
ID Object SubjectID Details
1 Fred 2 …
2 Wilma NULL …
3 January 1, 2010 1530 NULL …
4 46°12′N NULL …
5 6°09′E NULL …
6 Wedding NULL …
7 Ski Trip NULL …
8 Barney 1 …
9 …
Table4 (Triplestores)
ID SubjectID PredicateID ObjectID Details
1 1 1 2 …
2 2 2 3 …
3 3 3 3 …
4 3 4 4 …
5 3 4 5 …
6 3 5 6 …
7 3 5 7 …
8 3 7 8 …
9 3 7 2 …
10 3 7 1 …
11 …
So #10 in the Triplestores table is: Picture containsPerson Fred
Is this a triplestore or not?
If it is, please comment on why this implementation (on an RDBMS) is inefficient, etc.
Thanks in advance!!

It's possible, and to some degree easy, to implement a triple store on top of an RDBMS. There are several systems currently available that do this with varying degrees of success. However, they tend not to perform all that well because of the transitive self-joins their design usually requires. This is why serious vendors who provide a triple store backed by a relational database, such as Oracle, add customized handling to improve efficiency in these situations.
In my experience, native triple stores, those designed specifically for storing and querying RDF, always outperform solutions shoehorned on top of a relational system. So while they're very much databases and have a lot in common with a traditional RDBMS, there are design choices in their implementation that make them better suited to answering SPARQL queries.
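To make the join problem concrete, here is a rough sketch, reusing the column names from the question and calling the four tables Subjects, Predicates, Objects, and Table4, of a two-hop query ("who are the friends of Barney's friends?"). Each additional hop means another self-join back onto the triples table, and when the number of hops isn't known in advance (a transitive friendOf chain, for example) the query has to become recursive:
-- Two hops = two passes over Table4, plus lookups into the side tables.
SELECT o2.Object AS friend_of_friend
FROM   Subjects   s1
JOIN   Table4     t1 ON t1.SubjectID = s1.ID
JOIN   Predicates p1 ON p1.ID = t1.PredicateID AND p1.Predicate = 'friendOf'
JOIN   Objects    o1 ON o1.ID = t1.ObjectID          -- hop 1: a friend of Barney
JOIN   Table4     t2 ON t2.SubjectID = o1.SubjectID  -- self-join back onto the triples table
JOIN   Predicates p2 ON p2.ID = t2.PredicateID AND p2.Predicate = 'friendOf'
JOIN   Objects    o2 ON o2.ID = t2.ObjectID          -- hop 2: that friend's friend
WHERE  s1.Subject = 'Barney';
In SPARQL the same question is just a two-triple graph pattern, and a native store answers it from indexes built over subject/predicate/object permutations rather than through a general-purpose join planner.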

Related

How would I configure analyze threshold for a table where the data is categorically different every couple of months?

We host data for an auditing service. Every few months, a new audit comes out with similar questions to previous audits of the same category. Since questions can change verbiage and number, we store each question in each audit separately (we do link them through a "related_questions" table).
audits
id name passing_score
1 audit_1 100
2 audit_2 150
questions
id audit_id text
1 1 q1
2 1 q2
3 2 q1
4 2 q2
We then have a surveys and responses table. Surveys are the overall response to an audit, while responses store the individual responses to each question.
surveys
id audit_id overall_score pass
1 1 120 true
2 1 95 false
3 2 200 true
4 2 100 false
responses
id survey_id question_id score
1 1 1 60
2 1 2 60
3 2 1 60
4 2 2 35
5 3 3 100
6 3 4 100
7 4 3 50
8 4 4 50
The analyze threshold is base threshold + scale factor * number of tuples. The problem with this is that once an audit has finished (after a few months), we'll never receive new surveys or responses for that category of data. The new data that comes in is conceptually all that needs to be analyzed. All data is queried, but the new data has the most traffic.
If 10% is the ideal scale factor for today and analyze autoruns once every week, a couple years from now analyze may autorun once every 4 months due to the number of tuples. This is problematic when the past 3 months of data is for questions that the analyzer has never seen and so there are no helpful stats for the query planner on this data.
We could set the scale factor extremely low for this table, but that seems like a cheap solution that could cause issues in the future.
If you have a constant data modification rate, setting autovacuum_analyze_scale_factor to 0 for that table and relying only on autovacuum_analyze_threshold is a good idea.
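For scale: with the PostgreSQL defaults (threshold 50, scale factor 0.1), a table that has grown to 10 million rows is only re-analyzed after roughly 1,000,050 changed tuples, whereas zeroing the scale factor keeps the trigger point fixed no matter how much historical data accumulates. A minimal sketch of the per-table override (the 10,000-tuple threshold is an assumed value; pick one that matches your typical weekly insert volume):
-- Stop the analyze trigger from scaling with total table size; rely on a fixed threshold.
ALTER TABLE responses SET (
    autovacuum_analyze_scale_factor = 0,
    autovacuum_analyze_threshold    = 10000   -- assumed value, tune to your insert rate
);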

Group by specific column in PostgreSQL

I am trying to get a count of people depending on the org_id.
Let's say I have 10 people in my person table, coming in from different organizations.
I want to count the number of people working in each organization, with each record listed.
I have set up an SQL Fiddle showing exactly what I am trying to do.
Postgres version 9.4
Below are my table records:
id person_name emp_id org_id
1 John Mackenzie TTT104 1
2 Raven Raid TTT105 1
3 Albert Pinto TTT106 2
4 Albert Pinto1 TTT119 2
5 Ram Raman TTT108 2
6 Huge Jackman TTT109 2
7 Peter Pan TTT107 2
8 Albert Pinto2 TTT106 2
RESULT EXPECTED:
id person_name emp_id count(org_id)
1 John Mackenzie TTT104 2
2 Raven Raid TTT105 2
3 Albert Pinto TTT106 6
4 Albert Pinto1 TTT119 6
5 Ram Raman TTT108 6
6 Huge Jackman TTT109 6
7 Peter Pan TTT107 6
8 Albert Pinto2 TTT106 6
The expected result above is how I want my records to look in my Velocity template.
The solution you are looking for is the following:
SELECT org_id, count(*)
FROM person
GROUP BY org_id;
Basically with this query you are collecting the number of people working in each distinct org_id.
The result of the query is then:
org_id | count
-------------
1 | 2
2 | 6
Execute the query below to solve the issue:
SELECT p.person_name,
p.emp_id,
count(p.org_id) OVER w as org
FROM person p WINDOW w AS (PARTITION BY org_id);
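If you also want the id column shown in the expected result, a variant along these lines (same table and column names as in the question) should work on 9.4:
SELECT p.id,
       p.person_name,
       p.emp_id,
       count(*) OVER (PARTITION BY p.org_id) AS org_count  -- people in the same organization
FROM   person p
ORDER  BY p.id;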

SPSS - merging files with duplicate cases of ID variable and new cases/variables

I have an administrative dataset for store visits from multiple years that I am trying to merge into one under the ID variable.
Each dataset has duplicates of an ID that occur during different store visits, annotated by Date. Some of the more recent data files also have new variables (Y) not contained in the old data files. Datasets from different years will also contain different cases indicated by different ID. Also, some variables may be the same for each case but at different dates. I want the merged file to retain these duplicates.
Example data files:
File 1
ID Date X
1 3 4
1 5 3
2 1 4
File 2
ID Date X Y
1 6 4 2
1 7 1 5
2 8 4 7
3 7 2 3
I want the merged file to continue listing ALL duplicate cases, as such:
ID Date X Y
1 3 4 .
1 5 3 .
1 6 4 2
1 7 1 5
2 1 4 .
2 8 4 7
3 7 2 3
I then plan to restructure (CASESTOVARS /AUTOFIX=0) the merged file so that it looks like this:
ID Date.1 Date.2 Date.3 Date.4 X.1 X.2 X.3 X.4 Y.1 Y.2 Y.3 Y.4
1 3 5 6 7 4 3 4 1 . . 2 5
2 1 8 . . 4 4 . . . 7 . .
3 7 . . . 2 . . . 3 . . .
I am having trouble with the initial merging process, however. I have tried looking up the safest way to merge files when they both have duplicate cases in order to make sure no data are lost in the process. It seems that the "Add Variables" method results in lost values for duplicate variables.
Thanks!
EDIT: If I used the "Add Variables" function and used both the ID and Date variables as the key variables, would that help avoid deletion of duplicate cases?
Why not try Add Cases instead of Add Variables? If there are no occurrences of the same ID with the same date, it should work fine with CASESTOVARS.
If there are such cases, you'll need to decide what to do with them before you can proceed with CASESTOVARS.
One way would be to aggregate by ID and DATE and decide whether you want to, e.g., add up the data variables for such a case.
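A rough sketch of that approach in syntax form, assuming the two yearly files are saved as file1.sav and file2.sav (the file names are placeholders; the AUTOFIX setting is copied from the question):
* Stack the two files case-wise; Y will be system-missing for File 1 cases.
ADD FILES FILE='file1.sav'
  /FILE='file2.sav'.
EXECUTE.
* CASESTOVARS expects the data sorted by the ID variable first.
SORT CASES BY ID Date.
CASESTOVARS
  /ID=ID
  /AUTOFIX=0.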

Drools decision table "not" statement

I have a feeling there may not be an easy answer to this question.
Let's assume this is my decision table, which operates on an object instance called "input".
CONDITION CONDITION ACTION
a == $param b != $param input.setC($param)
1 5 11
1 6 11
My case is that if a is 1 and b is not in (5, 6), then set c to 11.
However, if b is 6, the first rule will still fire since b is not 5, thus setting c to 11.
I would like to keep the organization of the columns without having to put multiple values in a column.
QUESTION: Is there some sort of header I can use which basically turns the decision table into a single rule, where b will not be in any of the rows where a is 1? Or some alternative method?
I am tempted to go with the negation of the rule:
CONDITION CONDITION ACTION
a == $param b == $param input.setC($param)
1 1 11
1 2 11
1 3 11
1 4 11
1 7 11
1 8 11
There are many more rows like this in the real table, and that makes it more difficult to maintain.
If you are using an XLS decision table, then something similar to this should work.
If you are familiar with the Drools or jBPM workbench, I can also provide a solution based on Guided Decision Tables.
Hope this helps and let me know.
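For reference, a rough sketch of the single rule being described, written directly in DRL rather than generated from a spreadsheet (the Input fact type and its fields are assumptions based on the question; Drools supports a "not in" restriction on a field):
rule "Set c when a is 1 and b is outside 5 and 6"
when
    // Input is an assumed fact type with fields a and b and a setter setC
    input : Input( a == 1, b not in (5, 6) )
then
    input.setC( 11 );
end
If you do stay in the spreadsheet, one option is to make the condition column template read b not in ($param) and put the comma-separated values in a single cell, though that is exactly the multiple-values-per-column layout you said you wanted to avoid.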

Aggregate path counts using HierarchyID

Business problem - understand process fallout using analytics data.
Here is what we have done so far:
Build a dictionary table with every possible process step
Find each process "start"
Find the last step for each start
Join dictionary table to last step to find path to final step
In the final report output we end up with a list of paths for each start to each final step:
User Fallout Step HierarchyID.ToString()
A 1/1/1
B 1/1/1/1/1
C 1/1/1/1
D 1/1/1
E 1/1
What this means is that five users (A-E) started the process. Assume only User B finished, the other four did not. Since this is a simple example (without branching) we want the output to look as follows:
Step Unique Users
1 5
2 5
3 4
4 2
5 1
The easiest solution I could think of is to take each hierarchyID.ToString(), parse that out into a set of subpaths, JOIN back to the dictionary table, and output using GROUP BY.
Given the volume of data, I'd like to use the built-in HierarchyID functions, e.g. IsAncestorOf.
Any ideas or thoughts how I could write this? Maybe a recursive CTE?
Restructuring the data may help with this. For example, structuring the data like this:
User Step Process#
---- ---- --------
A 1 1
A 2 1
A 3 1
B 1 2
B 2 2
B 3 2
B 4 2
B 5 2
E 1 3
E 2 3
E 1 4
E 2 4
E 3 4
This allows you to run the following query:
select step,
       count(distinct process#) as process_iterations,
       count(distinct [user]) as unique_users   -- USER is a reserved word in T-SQL, so delimit it
from stepdata
group by step
order by step;
which returns:
Step Process_Iterations Unique_Users
---- ------------------ ------------
1 4 3
2 4 3
3 3 3
4 1 1
5 1 1
I'm not familiar with hierarchyid, but splitting out that data into chunks for analysis looks like the sort of problem numbers tables are very good for. Join a numbers table against the individual substrings in the fallout and it shouldn't be too hard to treat the whole thing as a table and analyse it on the fly, without any non-set operations.
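For the simple, non-branching case in the question, a hedged sketch that combines the numbers-table idea with the built-in GetLevel() function might look like this (Fallout, [User], and FalloutStep are assumed names; a user counts toward step n whenever their final path is at least n levels deep):
-- Expand each user's final path into one row per level reached, then aggregate.
WITH numbers AS (
    SELECT TOP (50) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects
)
SELECT n.n                      AS Step,
       COUNT(DISTINCT f.[User]) AS UniqueUsers
FROM Fallout AS f
JOIN numbers AS n
  ON n.n <= f.FalloutStep.GetLevel()   -- every step up to and including the fallout step
GROUP BY n.n
ORDER BY n.n;
With branching, the same expansion can be driven by GetAncestor(n) instead of the raw level number, joining each ancestor node back to the dictionary table to recover the step it represents.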