How to aggregate synonym data with SPARQL

I'm using a Sesame repository containing triple data like this:
<http://example.org/doc2> a qb:Observation ;
    foaf:Organization "Inst. of Technol." ;
    ps:sumPaper 3 .
<http://example.org/doc3> a qb:Observation ;
    foaf:Organization "Institute of Technology" ;
    ps:sumPaper 5 .
<http://example.org/doc4> a qb:Observation ;
    foaf:Organization "Dong C Univ." ;
    ps:sumPaper 4 .
<http://example.org/doc5> a qb:Observation ;
    foaf:Organization "University of Dong C" ;
    ps:sumPaper 2 .
doc2 and doc3 actually refer to the same organization, and so do doc4 and doc5; the names are synonyms.
I want to aggregate the data with SPARQL and get a result like this:
Organization              sumPaper
----------------------------------
Institute of Technology   8
University of Dong C      6
So I added a synonym ontology to the repository to describe this:
:org2 a foaf:Organization ;
    ps:organizationName "Inst. of Technol." ;
    owl:sameAs :org3 .
:org3 a foaf:Organization ;
    ps:organizationName "Institute of Technology" .
:org4 a foaf:Organization ;
    ps:organizationName "Dong C Univ." .
:org5 a foaf:Organization ;
    ps:organizationName "University of Dong C" ;
    owl:sameAs :org4 .
Please help; I'm confused about how to write a SPARQL statement that gets the result I expect.

You're overcomplicating things with owl:sameAs. Try this instead:
:org1 a foaf:Organization ;
    ps:organizationName "Inst. of Technol.", "Institute of Technology" .
:org2 a foaf:Organization ;
    ps:organizationName "Dong C Univ.", "University of Dong C" .
You can then do the following:
select ?org (SUM(?sumP) as ?sum)
{
  ?ob a qb:Observation ;
      ps:sumPaper ?sumP ;
      foaf:Organization ?orgName .
  # Look up the org based on its synonyms
  ?org ps:organizationName ?orgName .
}
group by ?org
That will give you org identifiers, though. If that bothers you:
select (SAMPLE(?orgName) as ?name) (SUM(?sumP) as ?sum)
...
or even add an rdfs:label or skos:prefLabel to each org in your synonym file.
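The lookup-then-aggregate step that query performs can be sketched in plain Python for intuition (the org ids are hypothetical, mirroring the sample data above; this is an illustration, not part of the SPARQL solution):

```python
# Observations: (organization name as written, paper count).
observations = [
    ("Inst. of Technol.", 3),
    ("Institute of Technology", 5),
    ("Dong C Univ.", 4),
    ("University of Dong C", 2),
]

# One org id per group of synonyms, like ps:organizationName with two values.
synonyms = {
    "Inst. of Technol.": "org1",
    "Institute of Technology": "org1",
    "Dong C Univ.": "org2",
    "University of Dong C": "org2",
}

totals = {}
for name, papers in observations:
    org = synonyms[name]  # the "?org ps:organizationName ?orgName" lookup
    totals[org] = totals.get(org, 0) + papers

print(totals)  # {'org1': 8, 'org2': 6}
```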

Related

direct answer to sparql select query of equivalent class for graphdb?

I have an "EquivalentTo" definition in Protege of a class EquivClass as (hasObjProp some ClassA) and (has_data_prop exactly 1 rdfs:Literal)
Is there a form of SPARQL query for GraphDB 9.4 to get the "direct" answer to a select query of an equivalent class without having to collect and traverse all the constituent blank nodes explicitly? Basically, I'm looking for a short cut. I'm not looking to get instances of the equivalent class, just the class definition itself in one go. I've tried to search for answers, but I'm not really clear on what possibly related answers are saying.
I'd like to get something akin to
(hasObjProp some ClassA) and (has_data_prop exactly 1 rdfs:Literal)
as an answer to the SELECT query on EquivClass. If the answer is "not possible", that's enough. I can write the blank node traversal with the necessary properties myself.
Thanks!!
Files are -
Ontology imported into GraphDB: tester.owl - https://pastebin.com/92K7dKRZ
SELECT of all triples from GraphDB *excluding* inferred triples: tester-graphdb-sparql-select-all-excl-inferred.tsv - https://pastebin.com/fYdG37v5
SELECT of all triples from GraphDB *including* inferred triples: tester-graphdb-sparql-select-all-incl-inferred.tsv - https://pastebin.com/vvqPH1FZ
Added a sample query in response to #UninformedUser. I use "select *" as an example, but really I'm interested in the "end results", i.e., ?fp, ?fo, ?rop, ?roo. Essentially, I'm looking for something simpler and more succinct than what I have below. The example I posted only has a single intersection ("and" clause); in my real-world set, there are multiple equivalent classes with different numbers of "and" clauses.
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX : <http://www.semanticweb.org/ontologies/2020/9/tester#>
select * where {
  :EquivClass owl:equivalentClass ?bneq .
  ?bneq ?p ?bnhead .
  ?bnhead rdf:first ?first .
  ?first ?fp ?fo .
  ?bn3 rdf:rest ?rest .
  ?rest ?rp ?ro .
  ?ro ?rop ?roo .
  filter(?bn3 != owl:Class && ?ro != rdf:nil)
}
You can unroll the list using a property path:
prefix : <http://www.semanticweb.org/ontologies/2020/9/tester#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select * {
  :EquivClass owl:equivalentClass/owl:intersectionOf/rdf:rest*/rdf:first ?restr .
  ?restr a owl:Restriction .
  optional { ?restr owl:onProperty ?prop }
  optional { ?restr owl:cardinality ?cardinality }
  optional { ?restr owl:someValuesFrom ?class }
}
This returns:
| | restr | prop | cardinality | class |
| 1 | _:node3 | :hasObjProp | | :ClassA |
| 2 | _:node5 | :has_data_prop | "1"^^xsd:nonNegativeInteger | |
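For intuition, the property path rdf:rest*/rdf:first is just a walk along the cons cells of the RDF list behind owl:intersectionOf. A minimal Python sketch of that traversal (the blank-node names are hypothetical):

```python
# Each list node has rdf:first (the member) and rdf:rest (next node or rdf:nil).
NIL = "rdf:nil"
graph = {
    "_:l1": ("_:node3", "_:l2"),  # restriction on :hasObjProp
    "_:l2": ("_:node5", NIL),     # restriction on :has_data_prop
}

def members(head):
    # Collect every rdf:first reachable via zero or more rdf:rest hops.
    out = []
    while head != NIL:
        first, rest = graph[head]
        out.append(first)
        head = rest
    return out

print(members("_:l1"))  # ['_:node3', '_:node5']
```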

uniq by only a part of the line

I am trying to consolidate an email list, but I want to uniq (or uniq -i -u) by the email address rather than by the entire line, so that we don't end up with duplicates.
list 1:
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
list 2:
firstname lastname <firstname#gmail.com>
Fake Person <companyb#companyb.com>
Joe lastnanme <joe#gmail.com>
the current output is
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
firstname lastname <firstname#gmail.com>
Fake Person <companyb#companyb.com>
Joe lastnanme <joe#gmail.com>
the desired output would be
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
firstname lastname <firstname#gmail.com>
Joe lastnanme <joe#gmail.com>
(as companyb#companyb.com is listed in both)
How can I do that?
given your file format
$ awk -F'[<>]' '!a[$2]++' files
will print only the first instance for each address between the angle brackets. Or, if there is no content after the email, you don't even need to unwrap the angle brackets:
$ awk '!a[$NF]++' files
Same can be done with sort as well
$ sort -t'<' -k2,2 -u files
A side effect is that the output will be sorted, which may or may not be desired.
N.B. Both alternatives assume that angle brackets appear nowhere other than around the email addresses.
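The same keep-the-first-occurrence-per-address logic, sketched in Python for comparison (with the same assumption that angle brackets only wrap the address):

```python
lines = [
    "Company A <companya#companya.com>",
    "Company B <companyb#companyb.com>",
    "Fake Person <companyb#companyb.com>",
    "Joe lastnanme <joe#gmail.com>",
]

seen = set()
unique = []
for line in lines:
    # Key each line by the text inside the angle brackets.
    email = line.split("<")[1].rstrip(">")  # assumes exactly one <...> per line
    if email not in seen:
        seen.add(email)
        unique.append(line)

# Keeps Company A, Company B, and Joe; drops the duplicate Fake Person line.
print(unique)
```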
Here is one in awk:
$ awk '
match($0,/[a-z0-9.]+#[a-z.]+/) { # look for emailish string *
a[substr($0,RSTART,RLENGTH)]=$0 # and hash the record using the address as key
}
END { # after all are processed
for(i in a) # output them in no particular order
print a[i]
}' file2 file1 # switch order to see how it affects output
Output
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
Joe lastnanme <joe#gmail.com>
firstname lastname <firstname#gmail.com>
The script looks for a very simple email-ish string (* see the regex in the script and tune it to your liking) and uses it to hash the whole records; the last instance wins, as earlier ones are overwritten.
uniq has an -f option to ignore a number of blank-delimited fields, so we can sort on the third field and then ignore the first two:
$ sort -k 3,3 infile | uniq -f 2
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
firstname lastname <firstname#gmail.com>
Joe lastnanme <joe#gmail.com>
However, this isn't very robust: it breaks as soon as there aren't exactly two fields before the email address, since sort will key on the wrong field and uniq will compare the wrong fields.
Check karakfa's answer to see how uniq isn't even required here.
Alternatively, just checking for uniqueness of the last field:
awk '!e[$NF] {print; ++e[$NF]}' infile
or even shorter, stealing from karakfa, awk '!e[$NF]++' infile
Could you please try the following.
awk '
{
match($0,/<.*>/)
val=substr($0,RSTART,RLENGTH)
}
FNR==NR{
a[val]=$0
print
next
}
!(val in a)
' list1 list2
Explanation: an annotated version of the code above.
awk ' ##Starting awk program here.
{ ##Starting BLOCK which will be executed for both of the Input_files.
match($0,/<.*>/) ##Using match function of awk where giving regex to match everything from < to till >
val=substr($0,RSTART,RLENGTH) ##Creating variable named val whose value is substring of current line starting from RSTART to value of RLENGTH, basically matched string.
} ##Closing above BLOCK here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when 1st Input_file named list1 will be read.
a[val]=$0 ##Creating an array named a whose index is val and value is current line.
print $0 ##Printing current line here.
next ##next will skip all further statements from here.
}
!(val in a) ##Checking condition if variable val is NOT present in array a if it is NOT present then do printing of current line.
' list1 list2 ##Mentioning Input_file names here.
Output will be as follows.
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
firstname lastname <firstname#gmail.com>
Joe lastnanme <joe#gmail.com>
Perhaps I don't understand the question, but you can try this awk:
awk 'NR!=FNR && $3 in a{next}{a[$3]}1' list1 list2

SAS hash merge -- smaller dataset as hash object

I'm using the %HASHMERGE macro found at http://www.sascommunity.org/mwiki/images/2/22/Hashmerge.sas and the following example datasets:
data working;
length IID TYPE $12;
input IID $ TYPE $;
datalines;
B 0
B 0
A 1
A 1
A 1
C 2
D 3
;
run;
data master;
length IID FIRST_NAME MIDDLE_NAME LAST_NAME SUFFIX_NAME $12;
input IID $ FIRST_NAME $ MIDDLE_NAME $ LAST_NAME $ SUFFIX_NAME;
datalines;
X John James Smith Sr
Z Sarah Marie Jones .
Y Tim William Miller Jr
C Nancy Lynn Brown .
B Carol Elizabeth Collins .
A Wayne Mark Rooney .
;
run;
On the working dataset, I'm trying to attach the _NAME variables from the master dataset using this hash merge. The output looks fine and IS the desired output. However, in my real-life scenario the master dataset is too large to fit into a hash object, and the macro keeps placing it as the hash object. I'd ultimately like to flip these two datasets so that the working dataset is the hash object, but I cannot get the desired output when I flip the code. Below is the part of the macro that produces the desired output and needs to be adjusted, but I am unsure how to set this up:
data OUTPUT;
if 0 then set MASTER (keep=IID FIRST_NAME MIDDLE_NAME LAST_NAME SUFFIX_NAME)
WORKING (keep=IID);
declare hash h_merge(dataset:"MASTER"); /* I want WORKING to be the hash object since it's smaller! */
rc=h_merge.DefineKey("IID");
rc=h_merge.DefineData("FIRST_NAME","MIDDLE_NAME","LAST_NAME","SUFFIX_NAME");
rc=h_merge.DefineDone();
do while(not eof);
set WORKING (keep=IID) end=eof;
call missing(FIRST_NAME,MIDDLE_NAME,LAST_NAME,SUFFIX_NAME);
rc=h_merge.find();
output;
end;
drop rc;
stop;
run;
Desired output:
IID FIRST_NAME MIDDLE_NAME LAST_NAME SUFFIX_NAME
---------------------------------------------------
B Carol Elizabeth Collins
B Carol Elizabeth Collins
A Wayne Mark Rooney
A Wayne Mark Rooney
A Wayne Mark Rooney
C Nancy Lynn Brown
D
While it's feasible to do what you describe, I doubt you'll get that from a non-purpose-built macro, because it's not the normal way to do it: typically you keep the main dataset in its form and put the relational dataset in the hash table. Usually the sizes are reversed, of course; the relational table is usually smaller than the main table.
Personally, I would not use a hash for this particular case. I'd use a format (or three). A format is just as fast as a hash and has fewer size issues (since it doesn't have to fit in memory), though it would eventually slow down (but not break!) as it grows.
Format solution:
data working;
length IID TYPE $12;
input IID $ TYPE $;
datalines;
B 0
B 0
A 1
A 1
A 1
C 2
D 3
;
run;
data master;
length IID FIRST_NAME MIDDLE_NAME LAST_NAME SUFFIX_NAME $12;
input IID $ FIRST_NAME $ MIDDLE_NAME $ LAST_NAME $ SUFFIX_NAME;
datalines;
X John James Smith Sr
Z Sarah Marie Jones .
Y Tim William Miller Jr
C Nancy Lynn Brown .
B Carol Elizabeth Collins .
A Wayne Mark Rooney .
;
run;
data for_fmt;
set master;
retain type 'char';
length fmtname $32
label $255
start $255
;
start=iid;
*first;
label=first_name;
fmtname='$FIRSTNAMEF';
output;
*last;
label=last_name;
fmtname='$LASTNAMEF';
output;
*middle;
label=middle_name;
fmtname='$MIDNAMEF';
output;
*suffix;
label=suffix_name;
fmtname='$SUFFNAMEF';
output;
if _n_=1 then do;
start=' ';
label=' ';
hlo='o';
fmtname='$FIRSTNAMEF';
output;
fmtname='$LASTNAMEF';
output;
fmtname='$MIDNAMEF';
output;
fmtname='$SUFFNAMEF';
output;
end;
run;
proc sort data=for_fmt;
by fmtname start;
run;
proc format cntlin=for_fmt;
quit;
data want;
set working;
first_name = put(iid,$FIRSTNAMEF.);
last_name = put(iid,$LASTNAMEF.);
middle_name = put(iid,$MIDNAMEF.);
suffix_name = put(iid,$SUFFNAMEF.);
run;
That said...
If you do want to do this with WORKING in the hash table, what you'd need to do is, for each row in MASTER, do a FIND against the hash, then, if successful, a REPLACE, then FIND_NEXT and REPLACE until that fails.
The problem? You're doing at least one find per MASTER row, and you yourself noted that MASTER is very large. If WORKING is 100k rows and MASTER is 100M, then on average you're doing 1,000 finds for every match. That's very expensive, and probably means you're better off with some other solution.
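For intuition, both the hash object and the formats implement a plain keyed lookup from IID to the name variables. A Python sketch over the sample data (illustrative only; the SAS format approach additionally avoids holding everything in memory):

```python
# Lookup built from MASTER, keyed by IID.
master = {
    "C": ("Nancy", "Lynn", "Brown", ""),
    "B": ("Carol", "Elizabeth", "Collins", ""),
    "A": ("Wayne", "Mark", "Rooney", ""),
}
working = ["B", "B", "A", "A", "A", "C", "D"]

# Stream WORKING through the lookup; unmatched IIDs get blank names,
# like call missing() before a failed h_merge.find().
rows = [(iid,) + master.get(iid, ("", "", "", "")) for iid in working]
for row in rows:
    print(row)
```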

Return first and last words in a person name - postgres

I have a list of names and I want to extract the first and last words of each person's name.
I was trying to use the "trim" function, without success.
Can someone explain how I could do it?
table:
Names
Mary Johnson Angel Smith
Dinah Robertson Donald
Paul Blank Power Silver
Then I want to have as a result:
Names
Mary Smith
Dinah Donald
Paul Silver
Thanks,
You can do it simply with regular expressions, like:
substring(trim(name) FROM '^([^ ]+)') || ' ' || substring(trim(name) FROM '([^ ]+)$')
Of course, it only works if you are 100% sure that at least a first and a last name are always supplied. I'm not 100% sure that is the case for everybody in the world; for instance, would that work for names in Chinese? I avoid making any assumptions about people's names. The best approach is to simply ask the user for two fields: one for the "name" and another for "How would you like to be called?".
Another approach, which takes advantage of Postgres string processing built-in functions:
SELECT split_part(name, ' ', 1) as first_token,
split_part(name, ' ', array_length(regexp_split_to_array(name, ' '), 1)) as last_token
FROM mytable
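For comparison, the first-token/last-token operation this SQL performs, sketched in Python (assuming names are space-separated):

```python
def first_last(name):
    # First and last whitespace-separated words of a name.
    parts = name.split()
    return parts[0] + " " + parts[-1]

print(first_last("Mary Johnson Angel Smith"))  # Mary Smith
print(first_last("Dinah Robertson Donald"))    # Dinah Donald
print(first_last("Paul Blank Power Silver"))   # Paul Silver
```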
Here's how I extracted full names from emails with a dot in them, e.g. Jeremy.Thompson#abc.com:
SELECT split_part(email, '.', 1) || ' ' || replace(split_part(email, '.', 2), '#abc','')
FROM people
Result:
Jeremy | Thompson
You can easily replace the dot with a space:
SELECT split_part(email, ' ', 1) || ' ' || replace(split_part(email, ' ', 2), '#abc','')
FROM people

How can I fix Unicode issues in the dataset returned from my SPARQL query?

At the moment, I am getting rows with Unicode decode issues while using SPARQL on DBpedia (using Virtuoso servers). This is an example of what I am getting: Knut %C3%85ngstr%C3%B6m.
The right name is Knut Ångström. So how do I fix this? My crafted query is:
select distinct
  (strafter(str(?influencerString), str(dbpedia:)) as ?influencerString)
  (strafter(str(?influenceeString), str(dbpedia:)) as ?influenceeString)
where {
  {
    ?influencer a dbpedia-owl:Person .
    ?influencee a dbpedia-owl:Person .
    ?influencer dbpedia-owl:influenced ?influencee .
    bind( replace( str(?influencer), "_", " " ) as ?influencerString )
    bind( replace( str(?influencee), "_", " " ) as ?influenceeString )
  }
  UNION
  {
    ?influencee a dbpedia-owl:Person .
    ?influencer a dbpedia-owl:Person .
    ?influencee dbpedia-owl:influencedBy ?influencer .
    bind( replace( str(?influencee), "_", " " ) as ?influenceeString )
    bind( replace( str(?influencer), "_", " " ) as ?influencerString )
  }
}
The DBpedia wiki explains that the identifiers for resources in the English DBpedia dataset use URIs, not IRIs, which means that you'll end up with encoding issues like this.
3. Denoting or Naming “things”
Each thing in the DBpedia data set is denoted by a de-referenceable
IRI- or URI-based reference of the form
http://dbpedia.org/resource/Name, where Name is derived from the URL
of the source Wikipedia article, which has the form
http://en.wikipedia.org/wiki/Name. Thus, each DBpedia entity is tied
directly to a Wikipedia article. Every DBpedia entity name resolves to
a description-oriented Web document (or Web resource).
Until DBpedia release 3.6, we only used article names from the English
Wikipedia, but since DBpedia release 3.7, we also provide localized
datasets that contain IRIs like http://xx.dbpedia.org/resource/Name,
where xx is a Wikipedia language code and Name is taken from the
source URL, http://xx.wikipedia.org/wiki/Name.
Starting with DBpedia release 3.8, we use IRIs for most DBpedia entity
names. IRIs are more readable and generally preferable to URIs, but
for backwards compatibility, we still use URIs for DBpedia resources
extracted from the English Wikipedia and IRIs for all other languages.
Triples in Turtle files use IRIs for all languages, even for English.
There are several details on the encoding of URIs that should always
be taken into account.
In this particular case, it looks like you don't really need to break up the identifier so much as get a label for the entity.
## If things were guaranteed to have just one English label,
## we could simply take ?xLabel as the value that we want with
## `select ?xLabel { … }`, but since there might be more than
## one, we can group by `?x` and then take a sample from the
## set of labels for each `?x`.
select (sample(?xLabel) as ?label) {
  ?x dbpedia-owl:influenced dbpedia:August_Kundt ;
     rdfs:label ?xLabel .
  filter(langMatches(lang(?xLabel), "en"))
}
group by ?x
SPARQL results
Simplifying your query a bit, we can have this:
select
  (sample(?rLabel) as ?influencerName)
  (sample(?eLabel) as ?influenceeName)
where {
  ?influencer dbpedia-owl:influenced|^dbpedia-owl:influencedBy ?influencee .
  dbpedia-owl:Person ^a ?influencer, ?influencee .
  ?influencer rdfs:label ?rLabel .
  filter( langMatches(lang(?rLabel), "en") )
  ?influencee rdfs:label ?eLabel .
  filter( langMatches(lang(?eLabel), "en") )
}
group by ?influencer ?influencee
SPARQL results
If you don't want language tags on those results, then add a call to str():
select
  (str(sample(?rLabel)) as ?influencerName)
  (str(sample(?eLabel)) as ?influenceeName)
where {
  ?influencer dbpedia-owl:influenced|^dbpedia-owl:influencedBy ?influencee .
  dbpedia-owl:Person ^a ?influencer, ?influencee .
  ?influencer rdfs:label ?rLabel .
  filter( langMatches(lang(?rLabel), "en") )
  ?influencee rdfs:label ?eLabel .
  filter( langMatches(lang(?eLabel), "en") )
}
group by ?influencer ?influencee
SPARQL results
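As an aside: the %XX sequences are ordinary URL percent-encoding of UTF-8 bytes, so strings you have already retrieved can also be repaired client-side rather than in the query. A Python sketch:

```python
from urllib.parse import unquote

# Percent-decode the UTF-8 bytes back into a readable string.
name = unquote("Knut %C3%85ngstr%C3%B6m")
print(name)  # Knut Ångström
```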