SPARQL: slow processing of result of subquery - group-by

I'm currently learning SPARQL, and I can't wrap my head around why what seems to me like a very straightforward query takes such a long time. I'm trying to count the number of articles per author in a journal, using the OpenCitations project (SPARQL endpoint https://opencitations.net/sparql; I also downloaded a dump of an earlier version).
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX pro: <http://purl.org/spar/pro/>
PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX fabio: <http://purl.org/spar/fabio/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT $author (COUNT($author) as $cnt) WHERE {
# narrowing the journal down to its articles (via volumes and issues)
$jnl a fabio:Journal;
dcterms:title "Nature" .
$volm frbr:partOf $jnl .
$issue frbr:partOf $volm .
$artcl frbr:partOf $issue .
# selecting author
?artcl pro:isDocumentContextFor ?artcl_atrbts .
?artcl_atrbts pro:isHeldBy ?author.
# making sure that author is a person
$author foaf:familyName $y .
}
GROUP BY $author
ORDER BY DESC($cnt)
LIMIT 10
This works as expected, and takes around 3 seconds on the dump, and maybe 5 on the OpenCitations endpoint.
However now I also want to get the actual names of the authors, so my idea was to use the previous query as a subquery:
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX pro: <http://purl.org/spar/pro/>
PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX fabio: <http://purl.org/spar/fabio/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT $author $last_name $cnt WHERE {
$author foaf:familyName $last_name.
{
SELECT $author (COUNT($author) as $cnt) WHERE {
$jnl a fabio:Journal;
dcterms:title "Nature" .
$volm frbr:partOf $jnl .
$issue frbr:partOf $volm .
$artcl frbr:partOf $issue .
?artcl pro:isDocumentContextFor ?artcl_trbts .
?artcl_trbts pro:isHeldBy ?author.
$author foaf:familyName $y .
}
GROUP BY $author
ORDER BY DESC($cnt)
LIMIT 10
}
}
ORDER BY DESC($cnt)
This now takes around 15 seconds on the dump (and more than a minute on the online endpoint), even though it seems to me that all it is doing is looking up the familyName for each of the 10 authors. If I include the first name (foaf:givenName) as well, the query can take even longer. Furthermore, when I select names without grouping by author, it executes within a split second:
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX pro: <http://purl.org/spar/pro/>
PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX fabio: <http://purl.org/spar/fabio/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT $author $first_name $last_name WHERE {
$jnl a fabio:Journal;
dcterms:title "Nature" .
$volm frbr:partOf $jnl .
$issue frbr:partOf $volm .
$artcl frbr:partOf $issue .
?artcl pro:isDocumentContextFor ?artcl_trbts .
?artcl_trbts pro:isHeldBy ?author.
$author foaf:familyName $last_name .
$author foaf:givenName $first_name .
}
LIMIT 10
Can somebody tell me what I am doing wrong here? Thanks in advance!

Related

postgresql, My query is too slow. What's the problem? (Takes more than 1 minute)

Below is the query that I use.
I want to see the results within 2 or 3 seconds, but it takes more than a minute.
with meta as (
select
item_name
, item_sp
, grade
from
meta_info
where
item_sp = 'Bc_1'
and grade_no = (
select
max(grade_no)
from
meta_info
where
item_sp = 'Bc_1'
)
) select
*
from (
select
m.grade
, i.item_sp
, i.regist_date
, i.serial_key
, row_number() over(partition by i.serial_key order by m.grade) as serial_key_number
from
item_info i, meta m
where
i.item_sp = 'Bc_1'
and i.regist_date = '20210314'
and i.regist = true
and i.item_name = m.item_name
and i.item_sp = m.item_sp
) i
where
not exists (select
serial_key
from
item_info ii
where
ii.item_sp = 'Bc_1'
and ii.regist_date < '20210314'
and i.serial_key = ii.serial_key)
and i.serial_key_number = 1;
The query uses two tables, meta_info and item_info.
meta_info contains the basic information of each product, and item_info stores the grade and serial key of each product by date.
In the item_info table, the serial key is not a unique key, so it can be duplicated.
Here's the problem.
The query compares against all serial keys registered before a particular registration date to find the serial keys that have not been registered before. Because a serial key can appear more than once with different grades, it then keeps only the highest-rated row for each serial key.
But item_info has more than 10 million rows.
Below is the table structure.
1. meta_info
item_sp  grade  item_name  grade_no
ac_1     A      BOOK       2
ac_1     B      FOOD       2
bc_1     A      WATER      2
cc_1     C      MOUSE      2
...
2. item_info
item_no(key)  item_sp  item_name  serial_key    regist_date  regist
1             ac_1     BOOK       fgd5756ffdsf  20210314     true
2             ac_1     FOOD       bnffdhtj      20210314     true
3             bc_1     WATER      fdfh4fsdfsf   20210314     true
4             cc_1     MOUSE      htt55434      20210314     true
...
Almost all of the time is going to the last index scan in your plan. You should be able to greatly improve it by adding an index on item_info (serial_key, item_sp, regist_date)
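As a rough illustration of the suggested index, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for PostgreSQL. The column types and the index name idx_item_info_serial are assumptions made for the example, and SQLite's planner is not PostgreSQL's, so this only shows the shape of the index and of the lookup it is meant to serve. The CREATE INDEX statement itself is also valid PostgreSQL syntax.

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL table (column types are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE item_info (
        item_no     INTEGER PRIMARY KEY,
        item_sp     TEXT,
        item_name   TEXT,
        serial_key  TEXT,
        regist_date TEXT,
        regist      INTEGER
    )
    """
)

# Composite index from the answer: the two equality columns first,
# the range column (regist_date) last. The index name is hypothetical.
conn.execute(
    "CREATE INDEX idx_item_info_serial "
    "ON item_info (serial_key, item_sp, regist_date)"
)

# Same predicate shape as the NOT EXISTS subquery in the question.
plan = conn.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT serial_key
    FROM item_info
    WHERE serial_key = ?
      AND item_sp = ?
      AND regist_date < ?
    """,
    ("fgd5756ffdsf", "Bc_1", "20210314"),
).fetchall()

# The last column of each plan row is the human-readable description.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

On PostgreSQL, the equivalent check would be to run EXPLAIN (ANALYZE) on the original query before and after creating the index and compare the cost of the scan on item_info.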

Export data from SQL database to CSV file, some rows of data get split into more than one row

When I export data from my SQL database to CSV file, some rows of data (records) get split into more than one row, as if there is a CR. I know that one reason is the following: One of the columns of the data is "Notes" that contains text that sometimes does contain a CR; I understand why this causes a new row in the CSV, but I would like that not to happen, either. How can I strip the CR, but add a period+space to format the Note so it's readable even without the CR?
However, I also get the extra row even if there is no CR, meaning the CSV has a blank row after a record, or the Note is on an extra line. I've included a screenshot of a portion of the CSV file to illustrate this and also illustrate that not all records show the behavior.
Here is my code. I did not write this, I inherited it. Also, I am not very experienced writing code.
header('Content-Type: application/msexcel-tab');
header('Content-Disposition: attachment; filename="Invaders of Texas Data -- '.date("Y-m-d").'.xls"');
$whereclause = '';
$passclause = '';
$satellite = $_REQUEST['satellite'];
$collector = $_REQUEST['collector'];
$sn = $_REQUEST['sn'];
$cn = $_REQUEST['cn'];
if ($satellite){
$whereclause .= " AND `satellite_id` = ".$satellite." ";
$passclause .= "&satellite=".$satellite;
}
if ($collector){
$whereclause .= " AND `collector_id` = ".$collector." ";
$passclause .= "&collector=".$collector;
}
if ($sn){
$whereclause .= " AND `plant_id` LIKE '".$sn."' ";
$passclause .= "&sn=".$sn;
}
if ($cn){
$whereclause .= " AND `plant_id` LIKE '".$cn."' ";
$passclause .= "&cn=".$cn;
}
$count_sql = "
SELECT COUNT(*) AS `counttotal`
FROM `inv_sites`
WHERE 1
$whereclause
AND `valid` LIKE 'Yes'
;
";
//echo $count_sql;
$count_total = mysql_fetch_array(mysql_query($count_sql));
$sql = "
SELECT *
FROM `inv_sites`
WHERE 1
$whereclause
AND `valid` LIKE 'Yes'
ORDER BY `collection_date` ASC
;
";
$the_result = mysql_query($sql);
?>
Invaders of Texas
www.texasinvasives.org
Exported: <?= date("Y-m-d G:i"); ?>
Obs_ID Date USDA Species Time_Spent Satellite Collector Lat Long Location_Error Loc_Err_Units Disturbance Patch_Type Abundance Validated Valid_Name Valid_Date Notes
<?php
if ($this_row = mysql_fetch_array($the_result)){
do {
?>
<?=$this_row['site_id'];?> <?=$this_row['collection_date'];?> <?=$this_row['plant_id']?> <?=sn_from_usda($this_row['plant_id'])?> <?=$this_row['collection_time'];?> <?=satellite_from_id($this_row['satellite_id']);?> <?=$this_row['collector_id'];?> <?=$this_row['latitude'];?> <?=$this_row['longitude'];?> <?=$this_row['error'];?> <?=$this_row['error_unit'];?> <?=$this_row['disturbance'];?> <?=$this_row['patch_type'];?> <?=$this_row['abundance'];?> <?=$this_row['valid'];?> <?=$this_row['valid_name'];?> <?=$this_row['valid_date'];?> <?=$this_row['notes'];?>
<?php
} while ($this_row = mysql_fetch_array($the_result));
}
?>
I'd appreciate any help!! Thanks.
You could replace the newlines in the PHP or the SQL query.
You have the following line above.
<?=$this_row['site_id'];?> <?=$this_row['collection_date'];?> <?=$this_row['plant_id']?> <?=sn_from_usda($this_row['plant_id'])?> <?=$this_row['collection_time'];?> <?=satellite_from_id($this_row['satellite_id']);?> <?=$this_row['collector_id'];?> <?=$this_row['latitude'];?> <?=$this_row['longitude'];?> <?=$this_row['error'];?> <?=$this_row['error_unit'];?> <?=$this_row['disturbance'];?> <?=$this_row['patch_type'];?> <?=$this_row['abundance'];?> <?=$this_row['valid'];?> <?=$this_row['valid_name'];?> <?=$this_row['valid_date'];?> <?=$this_row['notes'];?>
Try replacing it with the below (the change is on the very end).
<?=$this_row['site_id'];?> <?=$this_row['collection_date'];?> <?=$this_row['plant_id']?> <?=sn_from_usda($this_row['plant_id'])?> <?=$this_row['collection_time'];?> <?=satellite_from_id($this_row['satellite_id']);?> <?=$this_row['collector_id'];?> <?=$this_row['latitude'];?> <?=$this_row['longitude'];?> <?=$this_row['error'];?> <?=$this_row['error_unit'];?> <?=$this_row['disturbance'];?> <?=$this_row['patch_type'];?> <?=$this_row['abundance'];?> <?=$this_row['valid'];?> <?=$this_row['valid_name'];?> <?=$this_row['valid_date'];?> <?=trim(preg_replace('/\s+/', ' ', $this_row['notes']));?>
The preg_replace call uses a regular expression to replace every run of whitespace (including CR and LF) with a single space, and trim removes any leading or trailing whitespace.
If this doesn't work, you may need to remove the newlines in the SQL query itself.
See this post
Pete
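The question also asked for a period and a space in place of each line break, which the replacement above does not add. A minimal sketch of that variant follows, written in Python for illustration only (the export itself is PHP, and the helper name flatten_note is hypothetical). It joins the lines of a note with ". " unless the previous line already ends in sentence punctuation, and collapses tabs and repeated spaces, since a tab inside a note would also break a tab-separated export.

```python
import re


def flatten_note(note: str) -> str:
    """Turn a multi-line note into a single line, joining lines with '. '."""
    # Split on any run of CR/LF characters and drop empty lines.
    lines = [ln.strip() for ln in re.split(r"[\r\n]+", note)]
    # Collapse tabs and repeated spaces inside each remaining line.
    lines = [re.sub(r"\s+", " ", ln) for ln in lines if ln]

    out = ""
    for ln in lines:
        if not out:
            out = ln
        elif out[-1] in ".!?":
            # Previous line already ends a sentence: a space is enough.
            out += " " + ln
        else:
            out += ". " + ln
    return out


print(flatten_note("Found near creek\r\nDense\tpatch\n"))
# prints: Found near creek. Dense patch
```

The same two steps (split on line breaks, then join) can be expressed in PHP with preg_split and implode.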

PhpOrient query returns negative Rid-s by default

I'm trying to retrieve a simple graph consisting of some Assignments that are linked to each other. However, after querying one set of those assignments, the RIDs that are returned are all negative and have nothing to do with the RIDs in the database, so I can't run other queries based on them. How should I work around this, or am I doing something wrong?
Here is the code snippet responsible:
$records = $this->client->queryAsync('select rID, value, schedule, priority, type from Assignment where type = 5');
foreach ($records as $record)
{
$id = $record->getRid();
$rid = $id->__toString();
$return[$rid] = $this->client->query('TRAVERSE out("Assignment") FROM ' . $rid . ' WHILE $depth <= 5');
}
and the error that I receive:
com.orientechnologies.orient.core.exception.ORecordNotFoundException: The record with id '#-2:0' was not found
However in the database the first id is: #18:0
Hi Pirate's Lost Pearl,
This is probably a problem with transactions: OrientDB assigns negative RIDs to records while they are temporary. After the commit, the RIDs are changed to positive ones; see the docs.
There are a couple of errors in your code:
First off you should change your __toString(); into _toString(); using a single underscore.
Then fix the $this->client->query by either switching quotation marks at the end such as " WHILE $depth <= 5" or concatenate the variable while keeping the same quotes ' WHILE ' . $depth . ' <= 5'.
OrientDB Docs | getRid()

Additional information in one API call

Given below, my code outputs the number of Posts, Likes, Comments, and Shares for my Facebook page. It considers only the past week's data. This is done using the FQL approach (I know FQL is deprecated, but my code works fine because the app I am using is an older app). I have a few questions and need help with the following:
1. I need help converting this code to the latest API approach so that it can work on newly created apps.
2. The code outputs the total number of likes, comments and shares just fine, but is it possible to get the usernames of those who liked, commented on and shared the posts of my page in the past week and update the values in a database table? If yes, how?
3. If point #2 is possible, can this be done in one API call?
Appreciate some assistance here.
<?php
require('config.php');
session_start();
?>
<?php
$facebook = new Facebook(array(
'appId' => '012myappid210',
'secret' => 'abc012myappsecret210cba',
'cookie' => true,
));
$token=$_SESSION['token'];
$pageid='88978302070'; //(my facebook page id)
$d=strtotime("now");
$d2=strtotime('now - 7 days');
$fqlAPIParams = array(
'method' => 'fql.query',
'query' =>'
SELECT post_id,comments,message,likes,created_time,share_count
FROM stream
WHERE actor_id = '.$pageid.' AND
source_id = '.$pageid.' AND created_time <= '.$d.' AND created_time >= '.$d2.'
LIMIT 250' ,
'access_token'=>$token
);
$result = $facebook->api($fqlAPIParams);
$postCount = 0;
$likescount=0;
$commentscount=0;
$sharescount=0;
foreach($result as $post)
{
$shares=$post['share_count'];
$likes=$post['likes']['count'];
$comments=$post['comments']['count'];
$likescount+=$likes;
$commentscount+=$comments;
$sharescount+=$shares;
$postCount++;
}
echo "Post " . $postCount . " Likes " . $likescount . " Comments " . $commentscount . " Shares " . $sharescount;
?>
Here is the output of the code you suggested:
Post: Likes: 25 Comments: 5 Shares: 30
Post: Likes: 0 Comments: 0 Shares:
Post: Likes: 25 Comments: 25 Shares: 54
PageID 88978302070 has 2 posts in the past 30 days (on 31st Oct and 21st Oct)
A few changes I made:
1. I changed 'until' to 'since' 88978302070/posts?fields=likes,comments,shares&since=-30 days.
2. I changed the inner loop because one more issue I found was... instead of $post['likes'] and $post['comments'], it should have been $post['likes']['data'] and $post['comments']['data']
The code is working, but now the problems are:
1. It lists three posts whereas the page has only 2. One additional post is shown with 0 likes, 0 shares, 0 comments. Not sure where it is coming from.
2. The like count is incorrect. It only displays and lists a maximum of 25 likes and comments. I tried to put a limit of 999999 but it displays a maximum of 1000. Is there any solution to this? The actual like count for the two posts is 24483.
3. The comment count is incorrect. It has to be 90 but the code lists 25+5=30.
4. It does not list the usernames of those who shared the posts.
Without using FQL, you can do the following:
$d2 = strtotime( '- 7 days' );
// 'since' restricts the results to posts created after $d2, i.e. the past 7 days
$result = $facebook->api( $page_id . '/posts?fields=likes,comments,shares&since=' . $d2 );
foreach ( $result['data'] as $post ) {
echo 'Post: ' . $post['id'] . ' Likes: ' . count( $post['likes']['data'] ) . ' Comments: ' . count( $post['comments']['data'] ) . ' Shares: ' . $post['shares']['count'] . '<br/>';
}
The above API result also includes the name and IDs for all the users who have liked / commented on each post. You can do another foreach loop in the code to cycle through the likes / comments and update your database accordingly.
Example:
// main loop to cycle through posts
foreach ( $result['data'] as $post ) {
// inner loop to cycle through likes (the list of users is under the 'data' key)
foreach ( $post['likes']['data'] as $like ) {
echo 'User: ' . $like['id'] . ' Name: ' . $like['name'];
// code to check / update db here
}
}
Running the above code for page 177526890164 (Narendra Modi) using limit=2 returns the following. If you are seeing a blank page or an error, you're clearly doing something wrong:
The best way to test would be to make the API call, and then add print_r( $result ) after to see if Facebook returns any data. If it's empty, there is something wrong with your API call.

How can I fix Unicode issues in the dataset returned from my SPARQL query?

At the moment, I am getting rows with Unicode decoding issues while using SPARQL on DBpedia (using Virtuoso servers). This is an example of what I am getting: Knut %C3%85ngstr%C3%B6m.
The right name is Knut Ångström. Cool, now how do I fix this? My crafted query is:
select distinct (strafter(str(?influencerString),str(dbpedia:)) as ?influencerString) (strafter(str(?influenceeString),str(dbpedia:)) as ?influenceeString) where {
{ ?influencer a dbpedia-owl:Person . ?influencee a dbpedia-owl:Person .
?influencer dbpedia-owl:influenced ?influencee .
bind( replace( str(?influencer), "_", " " ) as ?influencerString )
bind( replace( str(?influencee), "_", " " ) as ?influenceeString )
}
UNION
{ ?influencee a dbpedia-owl:Person . ?influencer a dbpedia-owl:Person .
?influencee dbpedia-owl:influencedBy ?influencer .
bind( replace( str(?influencee), "_", " " ) as ?influenceeString )
bind( replace( str(?influencer), "_", " " ) as ?influencerString )
}
}
The DBpedia wiki explains that the identifiers for resources in the English DBpedia dataset use URIs, not IRIs, which means that you'll end up with encoding issues like this.
3. Denoting or Naming “things”
Each thing in the DBpedia data set is denoted by a de-referenceable
IRI- or URI-based reference of the form
http://dbpedia.org/resource/Name, where Name is derived from the URL
of the source Wikipedia article, which has the form
http://en.wikipedia.org/wiki/Name. Thus, each DBpedia entity is tied
directly to a Wikipedia article. Every DBpedia entity name resolves to
a description-oriented Web document (or Web resource).
Until DBpedia release 3.6, we only used article names from the English
Wikipedia, but since DBpedia release 3.7, we also provide localized
datasets that contain IRIs like http://xx.dbpedia.org/resource/Name,
where xx is a Wikipedia language code and Name is taken from the
source URL, http://xx.wikipedia.org/wiki/Name.
Starting with DBpedia release 3.8, we use IRIs for most DBpedia entity
names. IRIs are more readable and generally preferable to URIs, but
for backwards compatibility, we still use URIs for DBpedia resources
extracted from the English Wikipedia and IRIs for all other languages.
Triples in Turtle files use IRIs for all languages, even for English.
There are several details on the encoding of URIs that should always
be taken into account.
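If the percent-encoded names do have to be decoded on the client side, that can be done after the results come back. A minimal Python sketch, assuming the values are plain http://dbpedia.org/resource/ URIs with UTF-8 percent-encoding (the helper name local_name is hypothetical):

```python
from urllib.parse import unquote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def local_name(resource_uri: str, namespace: str = DBPEDIA_RESOURCE) -> str:
    """Return a readable name for a DBpedia resource URI."""
    # Drop the namespace prefix if it is present.
    if resource_uri.startswith(namespace):
        resource_uri = resource_uri[len(namespace):]
    # Decode %XX escapes as UTF-8, then turn underscores into spaces.
    return unquote(resource_uri).replace("_", " ")


print(local_name("http://dbpedia.org/resource/Knut_%C3%85ngstr%C3%B6m"))
# prints: Knut Ångström
```

This only repairs the display form of the identifier; it does not change what the endpoint returns.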
In this particular case, it looks like you don't really need to break up the identifier so much as get a label for the entity.
## If things were guaranteed to have just one English label,
## we could simply take ?xLabel as the value that we want with
## `select ?xLabel { … }`, but since there might be more than
## one, we can group by `?x` and then take a sample from the
## set of labels for each `?x`.
select (sample(?xLabel) as ?label) {
?x dbpedia-owl:influenced dbpedia:August_Kundt ;
rdfs:label ?xLabel .
filter(langMatches(lang(?xLabel),"en"))
}
group by ?x
SPARQL results
Simplifying your query a bit, we can have this:
select
(sample(?rLabel) as ?influencerName)
(sample(?eLabel) as ?influenceeName)
where {
?influencer dbpedia-owl:influenced|^dbpedia-owl:influencedBy ?influencee .
dbpedia-owl:Person ^a ?influencer, ?influencee .
?influencer rdfs:label ?rLabel .
filter( langMatches(lang(?rLabel),"en") )
?influencee rdfs:label ?eLabel .
filter( langMatches(lang(?eLabel),"en") )
}
group by ?influencer ?influencee
SPARQL results
If you don't want language tags on those results, then add a call to str():
select
(str(sample(?rLabel)) as ?influencerName)
(str(sample(?eLabel)) as ?influenceeName)
where {
?influencer dbpedia-owl:influenced|^dbpedia-owl:influencedBy ?influencee .
dbpedia-owl:Person ^a ?influencer, ?influencee .
?influencer rdfs:label ?rLabel .
filter( langMatches(lang(?rLabel),"en") )
?influencee rdfs:label ?eLabel .
filter( langMatches(lang(?eLabel),"en") )
}
group by ?influencer ?influencee
SPARQL results