I am looking into developing an iPhone app that will use data from Wikipedia. I learned that you can query this data using a SPARQL endpoint. Does anyone know of websites that can be used to query such data? I am trying to use DBpedia, but sometimes I get timeout errors. I am looking for something more stable. Do you think it would be very slow if I am retrieving a large result set?
Thank you for all the responses.
Another SPARQL endpoint that can be used to query the DBpedia dataset is lod.openlinksw.com. It is backed by more servers, but lags behind in updating the dataset. Either way, you need to construct queries that retrieve small result sets to achieve better response times.
I presume you are querying against the official DBpedia SPARQL endpoint hosted in Virtuoso? If so, being a community-driven and -hosted endpoint, it has restrictions on use, particularly a limit of 2,000 rows per result set to protect it from abuse. Should you want larger result sets, it is recommended that you use the SPARQL LIMIT and OFFSET clauses to retrieve results in chunks of 2,000.
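For example, a minimal paging sketch assuming the Python SPARQLWrapper library (the query and page size here are illustrative, not part of the original question):

```python
# Page through DBpedia results in chunks of 2,000 rows using LIMIT/OFFSET.
from SPARQLWrapper import SPARQLWrapper, JSON

PAGE_SIZE = 2000  # the public endpoint's per-query row cap
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

offset = 0
while True:
    sparql.setQuery(f"""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?city ?population WHERE {{
            ?city a dbo:City ; dbo:populationTotal ?population .
        }}
        ORDER BY ?city          # a stable order keeps the pages consistent
        LIMIT {PAGE_SIZE} OFFSET {offset}
    """)
    rows = sparql.query().convert()["results"]["bindings"]
    if not rows:
        break
    for row in rows:
        print(row["city"]["value"], row["population"]["value"])
    offset += PAGE_SIZE
```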
If you really want a dedicated DBpedia SPARQL endpoint with no restrictions and under your control, OpenLink Software provides an option to instantiate and host your own DBpedia EC2 AMI in the cloud, which is an exact replica of the official DBpedia SPARQL endpoint, both hosted in Virtuoso.
Note that the LOD Cloud Cache, also hosted in Virtuoso, includes the DBpedia datasets from the official DBpedia SPARQL endpoint along with most of the major datasets in the LOD Cloud. It runs on a larger clustered Virtuoso server and allows a maximum of 100,000 rows per query.
I have a MongoDB Atlas cluster that serves many customers. Each customer has its own database on the cluster.
I would like to reduce my application's impact on MongoDB data transfer costs, which have been increasing for the last few days, but the billing info provided by Atlas does not break down prices per database. Therefore, I have no way of knowing which customers are costly and which queries are the most costly in terms of data transfer.
Moreover, even looking at the daily prices alongside a few queries, I cannot correlate insertion of resources in my application with costs. For example, say my resources are Cats: one day data transfer costs $5 with 5,000 Cats inserted in total across the databases, but the next day it costs $13 with only 1,500 Cats inserted.
Do you know of tools, or something in the Atlas dashboard I might have missed, that could help me better track costs per customer, or, say, a cost per Cat (in my example) so that I can build a pricing model for my customers?
Thank you
You are most likely going to need separate projects and deployments.
A MongoDB client instance is generally capable of using any database on the server (subject to authorization rules and the APIs provided in the language in question), so getting a breakdown of data transfer by database would require the server to track bytes transferred per operation and then aggregate those counts. As far as I know, this isn't a feature that currently exists.
The most practical way of tracking this today is probably writing a layer on top of the driver on the client side that measures the data actually received.
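For example, a rough client-side sketch assuming PyMongo: a CommandListener that re-encodes commands and replies to BSON to approximate bytes moved per database (these are estimates, not exact wire-protocol sizes):

```python
from collections import defaultdict

import bson
from pymongo import MongoClient, monitoring

class TransferEstimator(monitoring.CommandListener):
    """Approximate bytes sent/received per database by re-encoding payloads."""
    def __init__(self):
        self.bytes_by_db = defaultdict(int)
        self._db_for_request = {}

    def started(self, event):
        self._db_for_request[event.request_id] = event.database_name
        self.bytes_by_db[event.database_name] += len(bson.encode(event.command))

    def succeeded(self, event):
        db = self._db_for_request.pop(event.request_id, "unknown")
        self.bytes_by_db[db] += len(bson.encode(event.reply))

    def failed(self, event):
        self._db_for_request.pop(event.request_id, None)

estimator = TransferEstimator()
client = MongoClient("mongodb+srv://...", event_listeners=[estimator])
# ...run the workload, then inspect estimator.bytes_by_db per customer database.
```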
What's the best AWS database for the requirements below?
I need to store around 50,000 - 100,000 entries in the database.
Each entry would have a string as the key and a JSON array as the value.
I should be able to retrieve the JSON array using the key.
The size of the JSON data is around 20-30 KB.
I expect around 10,000 - 40,000 reads per hour.
Around 50,000 - 100,000 writes per week.
I have to consider the cost as well.
Ease of integration/development matters too.
I am a bit confused between MongoDB, DynamoDB, and PostgreSQL. Please share your thoughts on this.
DynamoDB:-
DynamoDB is a fully managed proprietary NoSQL database service that supports key-value and document data structures. For the typical use case you have described in the OP, it would serve the purpose.
DynamoDB can handle more than 10 trillion requests per day and support peaks of more than 20 million requests per second.
DynamoDB has a good AWS SDK for all operations. Read and write capacity units can be configured per table.
DynamoDB tables using on-demand capacity mode automatically adapt to your application’s traffic volume. On-demand capacity mode instantly accommodates up to double the previous peak traffic on a table. For example, if your application’s traffic pattern varies between 25,000 and 50,000 strongly consistent reads per second where 50,000 reads per second is the previous traffic peak, on-demand capacity mode instantly accommodates sustained traffic of up to 100,000 reads per second. If your application sustains traffic of 100,000 reads per second, that peak becomes your new previous peak, enabling subsequent traffic to reach up to 200,000 reads per second.
One point to note is that it doesn't allow querying the table on non-key attributes. This means that if you don't know the partition (hash) key, you may need to do a full table scan to get the data. However, there is a secondary index option you can explore to get around this. You should know all the query access patterns of your use case before you design the table, so you can make an informed decision.
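For the key-to-JSON-array use case in the question, a minimal sketch with boto3 (the table name, key name, and region are assumptions for illustration):

```python
import json

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Entries")  # assumed table with partition key "entry_key"

# Write: store the JSON array as a string attribute; 20-30 KB fits well
# under DynamoDB's 400 KB item limit.
table.put_item(Item={
    "entry_key": "customer-123",
    "payload": json.dumps([{"name": "cat", "age": 3}]),
})

# Read: a single GetItem by key, no scan or secondary index needed.
response = table.get_item(Key={"entry_key": "customer-123"})
payload = json.loads(response["Item"]["payload"])
```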
MongoDB:-
MongoDB is not a fully managed service on AWS. However, you can set up the database yourself using AWS services such as EC2, VPC, IAM, EBS, etc., which requires some AWS cloud experience. The other option is to use the MongoDB Atlas service.
MongoDB is more flexible in terms of querying, and it has a powerful aggregation framework. There are also lots of tools available to query the database directly and explore the data, much like SQL.
In terms of a Java API, Spring Data MongoDB can be used to perform typical database operations. There are also lots of open-source frameworks for MongoDB in other languages (for example, Mongoose for Node.js).
MongoDB has drivers for many programming languages, and the APIs are mature as well.
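The same key-to-JSON-array pattern sketched with PyMongo (database, collection, and field names are assumptions):

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://...")  # Atlas or self-hosted connection string
collection = client["appdb"]["entries"]
collection.create_index("entry_key", unique=True)  # makes the key lookup indexed

# Write: the JSON array is stored natively as a document field.
collection.replace_one(
    {"entry_key": "customer-123"},
    {"entry_key": "customer-123", "payload": [{"name": "cat", "age": 3}]},
    upsert=True,
)

# Read: an indexed lookup by key returns the array directly.
doc = collection.find_one({"entry_key": "customer-123"})
payload = doc["payload"]
```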
PostgreSQL:-
PostgreSQL is available as a fully managed database on AWS (via Amazon RDS).
PostgreSQL has become the preferred open source relational database for many enterprise developers and start-ups, powering leading geospatial and mobile applications. Amazon RDS makes it easy to set up, operate, and scale PostgreSQL deployments in the cloud.
I don't think I need to write much about this database and its API. It is a very mature database and has good APIs.
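For completeness, the same pattern sketched on PostgreSQL (e.g. on Amazon RDS) with a JSONB column, assuming psycopg2; the table and column names are illustrative:

```python
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("postgresql://user:password@host:5432/appdb")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS entries (
            entry_key TEXT PRIMARY KEY,
            payload   JSONB NOT NULL
        )
    """)
    # Write: upsert the JSON array under the key.
    cur.execute(
        "INSERT INTO entries (entry_key, payload) VALUES (%s, %s) "
        "ON CONFLICT (entry_key) DO UPDATE SET payload = EXCLUDED.payload",
        ("customer-123", Json([{"name": "cat", "age": 3}])),
    )
    # Read: a primary-key lookup; psycopg2 returns the JSONB as Python objects.
    cur.execute("SELECT payload FROM entries WHERE entry_key = %s", ("customer-123",))
    payload = cur.fetchone()[0]
```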
Points to consider:-
Query Access Pattern
Easy setup
Database maintenance
API and frameworks
Community support
As part of a CMS I'm developing, I've got MongoDB as the primary datastore, which feeds into Elasticsearch and Redis. All of this is configured declaratively.
I'm currently trying to develop a declarative API in JSON (a DSL of sorts) which, when implemented, will enable me to write uniform queries in JSON, while at the backend these datastores work in tandem to come up with the result. Federated search, if you will.
Now, while fleshing out the supported types of queries for this JSON API, I've come across a class of queries not (efficiently) supported by my current setup: graph-based queries, like friend-of-friend, RDF queries, etc. Something I'd like to support as well.
So I'm looking for the best-fitting way to introduce a graph database into this ecosystem. I should probably mention that the app layer sits in Node.js.
I've come across lots of articles comparing Neo4j (a popular graph database) with MongoDB, but not so many actual use cases or real-world scenarios in which the two complement each other.
Any pointers highly appreciated.
You might want to take a look at structr[1], which has a RESTful graph database backend that you can configure using Java beans. In future versions, there will be a configuration option using REST calls only, so that you can fire up a structr server and configure and use it as a standalone graph database backend.
Just contact us on twitter or via email.
(disclaimer: I'm one of the developers of structr, so this comment may not be 100% impartial :))
[1] http://structr.org
The databases are very much complementary.
Use MongoDB to store your raw data / system of record and load the raw data into Neo4j for additional insights/analysis. When you are dealing with unstructured data, you want to store the information in a datastore that is conducive to unstructured data; MongoDB fits the bill (as do other similar NoSQL databases). While Neo4j is considered a NoSQL database, it doesn't fit the bill for unstructured data: because you have to decide what is a relationship, what is a node, and what properties are stored for each, it's better suited to semi-structured data and some understanding of the type of analysis you want to do.
A great architecture is to store your unstructured data in MongoDB and use jobs to load it into Neo4j. This allows you to re-load your graph if you figure out there are new pieces of information you'd like to store in the graph for additional analysis.
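A hedged sketch of such a load job, assuming PyMongo and the official neo4j Python driver (the collection names, labels, and Cypher are illustrative, not a prescribed schema):

```python
from pymongo import MongoClient
from neo4j import GraphDatabase

mongo = MongoClient("mongodb://localhost:27017")
articles = mongo["cms"]["articles"]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_article(tx, doc):
    # Only the graph-relevant slice of each document becomes nodes/relationships;
    # MERGE keeps the job idempotent, so the graph can be re-loaded at any time.
    tx.run(
        """
        MERGE (a:Article {id: $id}) SET a.title = $title
        WITH a
        UNWIND $authors AS author
        MERGE (p:Person {name: author})
        MERGE (p)-[:WROTE]->(a)
        """,
        id=str(doc["_id"]),
        title=doc.get("title", ""),
        authors=doc.get("authors", []),
    )

with driver.session() as session:
    for doc in articles.find({}):
        session.execute_write(load_article, doc)
driver.close()
```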
They are definitely NOT replacements for each other. They fit very different use cases.
I'm reading about ArangoDB and it looks interesting, but I can't find in the documentation how ArangoDB scales. Does ArangoDB scale, and can it use sharding like MongoDB or CouchDB?
EDIT
ArangoDB has supported sharding since version 2.0.
Version 3.0 will bring VelocyPack, a binary JSON representation optimized for compactness, parsability, and composability. It supersedes the shape concept / shaped JSON.
/EDIT
I am the chief architect of ArangoDB.
monkegjinni is right: ArangoDB did not support sharding, only replication. Why?
Short version:
Offering support for fairly complex data models like graphs and documents conflicts with how sharding works. However, with the efficiency of modern SSDs and computers, we believe that almost all projects no longer need sharding. Today's computers will easily store all the data on a single node. What these projects need is replication for load distribution, which is supported by ArangoDB.
Long version:
There are actually two separate scaling issues.
The first issue is distributing the request over several servers to balance the request load.
ArangoDB will support this through synchronous replication of writes and distribution of the read requests.
Note that most database systems follow a very similar path: they either support distributing requests with restricted consistency guarantees, or they allow writes only on one node and distribute the read requests. They have this restriction because distributing write requests while supporting full consistency is impossible to do efficiently, and doing it inefficiently negates the gain we wanted to achieve through distribution.
The second issue is distributing the data over several servers to allow larger datasets.
ArangoDB does not support distributing the data over several servers.
We have made this decision, because distributing the data over several servers always comes at a price.
This price can be very explicit. For example, it can be that the data model is very limited. This is the route that key-value stores such as Dynamo or Riak have taken. Here the data model and the supported queries are so simple that it is always possible to direct a query to the server (or the small number of servers) on which the requested value lives.
Note that we do believe this approach is valid for some applications (e.g. Amazon's database). But we believe that the number of applications that truly need to store so much data that they must distribute it over a large number of servers, and must therefore restrict the access pattern to key-value, is very small.
Or the price can be hidden. This is for example the case if the data is distributed and the database system allows general queries. In that case the query must be distributed over all servers (because the data you are looking for may live on any of the servers). That makes the queries inefficient.
The ArangoDB approach is rather to squeeze the most out of one server (well, ArangoDB does support multiple servers, but for availability). For this it uses two main strategies.
One strategy is to make use of SSDs. Note that the capacity of SSDs is growing at an incredible rate (you can buy a terabyte of SSD for far less money than a second server would cost you). And endurance (the total amount of data that can be written to an SSD) goes up to petabytes now that vendors finally get the wear-leveling algorithms right, so reliability of SSDs is no longer an issue. And the performance of those SSDs is very good (closer to main memory than to ordinary disks).
The other strategy is to store the data efficiently. ArangoDB uses shapes to store documents: a shape is the information about which attributes and attribute types a document has, and all documents with the same shape share the representation of this information. This means that documents can be stored in less space than their JSON or BSON representation would require.
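A toy illustration of the shape idea (just the concept, not ArangoDB's actual storage code): documents with identical attribute names and types share a single shape record, so each stored document only needs a shape id plus its values.

```python
shapes = {}  # maps an attribute signature to a shape id

def shape_of(doc):
    signature = tuple(sorted((k, type(v).__name__) for k, v in doc.items()))
    return shapes.setdefault(signature, len(shapes))

def store(doc):
    shape_id = shape_of(doc)
    values = [doc[k] for k in sorted(doc)]  # values in the shape's attribute order
    return (shape_id, values)

docs = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},        # same attributes/types -> reuses shape 0
    {"name": "Carol", "city": "NYC"},  # different attributes -> new shape 1
]
stored = [store(d) for d in docs]
print(shapes)   # two shapes describe all three documents
print(stored)   # each document keeps only its shape id and values
```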
As I understand it, it does not allow sharding (prior to version 2.0), only replication.
From the link
AvocadoDB effortlessly permits replications. We like the “zero-admin principle”. Making replications with AvocadoDB is really easy: Insert IP address and go!
Following replication types are intended for version 2:
master-master synchronous,
master-master asynchronous,
master-slave synchronous,
master-slave asynchronous
What's currently the best choice to persist graph-like structures? Graph databases (e.g. Neo4j) or RDF triple stores (e.g. Virtuoso)?
For example, we have the following use case:
a weakly connected graph (similar to a citation graph of scholarly papers in a collection) with nearly 10M nodes;
quite rare updates;
critical operations: retrieving particular sub-graphs, updating nodes in a given sub-graph, and re-computing link analysis measures (e.g. HITS or PageRank) after updating some nodes (a sketch of this follows below).
Providing a standard API for third-party applications to query the data (a la Facebook's or Twitter's) is desired as well.
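To make the critical operations concrete, here is a hedged sketch of re-computing PageRank over one sub-graph, assuming Neo4j as the store and networkx for the link analysis (the labels, relationship types, and property names are illustrative):

```python
import networkx as nx
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def recompute_pagerank(session, collection_id):
    # Pull the citation edges of a single sub-graph (one collection) into memory.
    result = session.run(
        "MATCH (a:Paper {collection: $c})-[:CITES]->(b:Paper {collection: $c}) "
        "RETURN a.id AS src, b.id AS dst",
        c=collection_id,
    )
    graph = nx.DiGraph()
    graph.add_edges_from((record["src"], record["dst"]) for record in result)
    scores = nx.pagerank(graph)
    # Write the updated scores back onto the nodes.
    session.run(
        "UNWIND $rows AS row MATCH (p:Paper {id: row.id}) SET p.pagerank = row.score",
        rows=[{"id": node, "score": score} for node, score in scores.items()],
    )

with driver.session() as session:
    recompute_pagerank(session, "collection-42")
```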
With Virtuoso you have the following working for you:
-- SPARQL, SQL, SPASQL (SPARQL inside SQL), and SQL-inside-SPARQL support (e.g. for dealing with N-ary relations via magic/function predicates/properties).
-- works as a compact engine (e.g., as exploited by the KDE Desktop) or a massive DBMS, as demonstrated by the live 17-billion-triples-plus LOD Cloud Cache or the smaller DBpedia Live instance.
-- includes full-text indexing and text patterns in SPARQL (via bif:contains); it also includes XPath/XQuery (via xcontains). A sketch follows after this list.
-- ACID or non-ACID mode, ditto schema-last, when dealing with its property graph store.
-- via its transformation middleware it can pull data from 80+ kinds of data sources (REST APIs, SOAP services, hypermedia resources, ODBC- or JDBC-accessible relational data sources, etc.) and transform them into transient or persistent Linked Data graphs.
-- Linked Data publishing is automatic, i.e., after DBMS record creation you have built-in Linked Data pages that act as views into the DBMS. No messing around with URL-rewrite rules, 303 redirects, or anything like that. InterWeb-scale super keys just work!
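A hedged sketch of the full-text point above, assuming the Python SPARQLWrapper library against a local Virtuoso endpoint (the endpoint URL, vocabulary, and search expression are illustrative):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8890/sparql")  # a local Virtuoso instance
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?paper ?title WHERE {
        ?paper dcterms:title ?title .
        ?title bif:contains "'link analysis' AND pagerank" .
    }
    LIMIT 25
""")
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["paper"]["value"], row["title"]["value"])
```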
That's it for now :-)
For horizontal scale (thus small to medium-sized databases), graph databases like Neo4j will currently give better performance for graph traversals. Triple stores are catching up, though. The big advantage of a triple store compared to a graph database is that the data dumps and the query language are standardized, which makes it a lot easier to move to another product and prevents vendor lock-in.