OrientDB SQL vs Gremlin

Which query language in OrientDB gives the best performance: SQL or Gremlin? Gremlin is attractive because it is portable across graph databases, but does it require a costly translation step inside OrientDB, or none at all? In other words, how much latency does it add?

EDIT
As @decebal pointed out, this is not a good test case scenario.
Please discard the benchmarks below.
"the power of a graph database comes from relationships; those queries are obviously biased towards simple structures, which leads to the only conclusion that when you want simple data structures you are better off using documents rather than graphs... can't compare apples with pears"
========
I ran some tests and SQL is noticeably faster.
The code I used:
long startTime = System.currentTimeMillis();
// Gremlin version of the same two-hop traversal:
// Object ret = orientGraph.command(new OCommandGremlin("g.v('9:68128').both().both()")).execute();
// SQL version:
String oSqlCommand = "SELECT expand(out().out()) FROM V where user_id='9935'";
Object ret = orientGraph.command(new OCommandSQL(oSqlCommand)).execute();
Iterable<Vertex> vertices = (Iterable<Vertex>) ret;
long endTime = System.currentTimeMillis();
long operationTime = endTime - startTime;
System.out.println("Operation took " + operationTime + " ms");
I used the wikitalk dataset.
The Gremlin command took around 42541 seconds whereas the SQL command took just 1831 ms on average.
Tests were run on a 64-bit Debian Linux VM with 4 GB of RAM, a 1024 MB heap, and a 2048 MB disk cache.

Related

Spark Scala - Comparing Datasets Column by Column

I'm just getting started with Spark; I've previously used Python with pandas. One of the things I do very regularly is compare datasets to see which columns have differences. In Python/pandas this looks something like this:
merged = df1.merge(df2, on="by_col")
for col in cols:
    diff = merged[col + "_x"] != merged[col + "_y"]
    if diff.sum() > 0:
        print(f"{col} has {diff.sum()} diffs")
I'm simplifying this a bit, but that's the gist of it; after this I'd of course drill down and look at, for example:
col = "col_to_compare"
diff = merged[col+"_x"] != merged[col+"_y"]
print(merged[diff][[col+"_x",col+"_y"]])
Now in Spark/Scala this is turning out to be extremely inefficient. The same logic works, but this dataset is roughly 300 columns wide, and the following code takes about 45 minutes to run for a 20 MB dataset, because it submits 300 different Spark jobs in sequence rather than in parallel, so I seem to be paying Spark's job startup cost 300 times. For reference, the pandas version takes something like 300 ms.
for (col <- cols) {
  val cnt = merged.filter(merged("dev_" + col) <=> merged("prod_" + col)).count
  if (cnt != merged.count) {
    println(col + " = " + cnt + "/ " + merged.count)
  }
}
What's the faster, more Spark-idiomatic way of doing this type of thing? My understanding is that I want this to be a single Spark job that builds one plan. I was looking at transposing to a very tall dataset, and while that could potentially work, it ends up being very complicated and the code is not straightforward at all. Also, although this example fits in memory, I'd like to be able to use this function across datasets, and we have a few that are multiple terabytes, so it needs to scale to large datasets as well, which would be a pain with Python/pandas.
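One way to get a single job is to build one aggregate expression per column and evaluate them all in one pass. A minimal sketch, assuming merged and cols are defined as in the question and the columns carry the dev_/prod_ prefixes shown above:
import org.apache.spark.sql.functions.{sum, when}

// One expression per column: count rows where the dev and prod values differ
// (null-safe comparison via <=>). All of them run in a single aggregation pass.
val diffExprs = cols.map { c =>
  sum(when(merged("dev_" + c) <=> merged("prod_" + c), 0).otherwise(1)).alias(c)
}

val totals = merged.agg(diffExprs.head, diffExprs.tail: _*).collect()(0)

cols.zipWithIndex.foreach { case (c, i) =>
  val diffs = totals.getLong(i)
  if (diffs > 0) println(s"$c has $diffs diffs")
}
This produces one plan (a single scan plus one aggregation), so it avoids paying the per-job startup cost 300 times and scales the same way as any other wide aggregation.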

Batch insert in PostgreSQL extremely slow (F#)

The code is in F#, but it's generic enough that it'll make sense to anyone not familiar with the language.
I have the following schema:
CREATE TABLE IF NOT EXISTS trades_buffer (
    instrument varchar NOT NULL,
    ts timestamp without time zone NOT NULL,
    price decimal NOT NULL,
    volume decimal NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_instrument ON trades_buffer(instrument);
CREATE INDEX IF NOT EXISTS idx_ts ON trades_buffer(ts);
My batches contain 500 to 3,000 records each. To get an idea of the performance, I'm developing on a 2019 MBP (i7 CPU), running PostgreSQL in a Docker container.
Currently it takes between 20 and 80 seconds to insert a batch; the time is not quite linear in the batch size, but it roughly scales with it.
I'm using this lib https://github.com/Zaid-Ajaj/Npgsql.FSharp as a wrapper around Npgsql.
This is my insertion code:
let insertTradesAsync (trades: TradeData list) : Async<Result<int, Exception>> =
    async {
        try
            // Build one parameter set per trade.
            let data =
                trades
                |> List.map (fun t ->
                    [
                        "@instrument", Sql.text t.Instrument.Ticker
                        "@ts", Sql.timestamp t.Timestamp.DateTime
                        "@price", Sql.decimal t.Price
                        "@volume", Sql.decimal (t.Quantity * t.Price)
                    ]
                )
            // Run the INSERT once per parameter set, inside a single transaction.
            let! result =
                connectionString
                |> Sql.connect
                |> Sql.executeTransactionAsync [ "INSERT INTO trades_buffer (instrument, ts, price, volume) VALUES (@instrument, @ts, @price, @volume)", data ]
                |> Async.AwaitTask
            return Ok (List.sum result)
        with ex ->
            return Error ex
    }
I checked that the connection step is extremely fast (<1 ms).
pgAdmin seems to show that PostgreSQL is mostly idle.
I profiled the code and none of my own code seems to take any time.
It's as if the time were spent in the driver, between my code and the database itself.
Since I'm a newbie with PostgreSQL, I could also be doing something horribly wrong :D
Edit:
I have tried a few things:
- using the TimescaleDB extension, which is made for time series
- moving the data from a Docker volume to a local folder
- running the code on a ridiculously large PostgreSQL AWS instance
and the results are the same.
What I know now:
- no high CPU usage
- no high RAM usage
- no hotspot in the profile of my code
- pgAdmin shows the db is mostly idle
- having an index, or not, has no impact
- a local or remote database gives the same results
So the issue is either:
- how I interact with the DB, or
- the driver I'm using
Update 2:
The non-async version of the connector performs significantly better.
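If the per-statement overhead in the driver really is the bottleneck, PostgreSQL's COPY protocol is usually the fastest way to bulk-load rows. Below is a minimal sketch using Npgsql's binary COPY API directly rather than Npgsql.FSharp; this is an assumed workaround, not code from the question, and it reuses the TradeData type, connectionString and trades_buffer table from above:
open Npgsql
open NpgsqlTypes

let copyTrades (trades: TradeData list) =
    use conn = new NpgsqlConnection(connectionString)
    conn.Open()
    // One COPY stream for the whole batch instead of one INSERT per row.
    use writer =
        conn.BeginBinaryImport(
            "COPY trades_buffer (instrument, ts, price, volume) FROM STDIN (FORMAT BINARY)")
    for t in trades do
        writer.StartRow()
        writer.Write(t.Instrument.Ticker, NpgsqlDbType.Varchar)
        writer.Write(t.Timestamp.DateTime, NpgsqlDbType.Timestamp)
        writer.Write(t.Price, NpgsqlDbType.Numeric)
        writer.Write(t.Quantity * t.Price, NpgsqlDbType.Numeric)
    writer.Complete() |> ignore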

SQLite slower than expected when querying

I have a fairly large (3,000,000 rows) SQLite database. It consists of one table.
The table has an integer id column, a text-based tag column, a timestamp column saved as an int, and 15 double-precision number columns.
I have a unique index on the tag and timestamp columns, since I always look entries up using both.
I need to run through the database and do quite a few calculations, mainly by running a bunch of select statements.
The select statements themselves are really simple.
I am using the GRDB library.
Here is an example query.
do {
    try dbQueue.read { db in
        let request = try DataObject
            .filter(Columns.tag == tag)
            .filter(Columns.dateNoDash == date)
            .fetchOne(db)
    }
} catch { Log.msg("Was unable to query database. Error: \(error)") }
When I trace the queries my program generates (using EXPLAIN QUERY PLAN), I can see that the index is being used.
I have to loop over a lot of queries, so I benchmarked a segment of them: 600 queries take roughly 28 seconds. I am running the program on a 10-core iMac Pro. This seems slow; I was always under the impression that SQLite was faster.
The other code in the loop basically adds certain numbers together and possibly computes an average, so nothing complex or computationally expensive.
I tried to speed things up by adding the following configuration to the database connection.
var config = Configuration()
config.prepareDatabase { db in
try db.execute(sql: "PRAGMA journal_mode = MEMORY")
try db.execute(sql: "PRAGMA synchronous = OFF")
try db.execute(sql: "PRAGMA locking_mode = EXCLUSIVE")
try db.execute(sql: "PRAGMA temp_store = MEMORY")
try db.execute(sql: "PRAGMA cache_size = 2048000")
}
let dbQueue = try DatabaseQueue(path: path, configuration: config)
Is there anything I can do to speed things up? Is GRDB slowing things down? Am I doing anything wrong? Should I be using a different database like MySQL or something?
Thanks for any tips/input
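One thing that may be worth trying (a sketch, not from the question): run the whole benchmarked loop inside a single dbQueue.read block, so all 600 lookups share one read transaction instead of opening a new one per query. Here lookups is a hypothetical collection of the (tag, date) pairs being queried:
do {
    try dbQueue.read { db in
        for (tag, date) in lookups {
            // Same request as in the question, but all lookups share one read transaction.
            let record = try DataObject
                .filter(Columns.tag == tag)
                .filter(Columns.dateNoDash == date)
                .fetchOne(db)
            // ... add numbers / compute the averages with `record` here ...
        }
    }
} catch { Log.msg("Batched read failed. Error: \(error)") }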

Most efficient way to import data to Prisma / PostgreSQL service?

I have a Prisma (1.14.2) service running that is attached to a PostgreSQL database. I need to insert a lot of nodes with a one-to-many relation into the PostgreSQL database via the Prisma connector. Right now I am doing this in the following way (the strokes and samples arrays hold a lot of nodes):
for (let strokeIndex = 0; strokeIndex < painting.strokes.length; strokeIndex++) {
  const stroke = painting.strokes[strokeIndex];
  const samples = stroke.samples;
  const createdStroke = await PrismaServer.mutation.createStroke({
    data: {
      myId: stroke.id,
      myCreatedAt: new Date(stroke.createdAt),
      brushType: stroke.brushType,
      color: stroke.color,
      randomSeed: stroke.randomSeed,
      painting: { connect: { myId: jsonPainting.id } },
      samples: { create: samples }
    }
  });
}
In the case of 128 strokes with 128 samples each (i.e. 16,384 samples in total), this takes about 38 seconds.
I am wondering if there is a way to speed up the process, especially since the number of strokes and samples can get much higher. I could use prisma import ..., which showed a 6x speedup, but I want to avoid the required conversion to the Normalized Data Format (NDF).
I read about speeding up INSERTs in PostgreSQL in general, but I am not sure if and how I can apply that to the Prisma connector.
Round trips (querying your API, which queries your database and returns the values) take a lot of time; they certainly take more time than the actual SQL query.
To solve your problem, you should batch your queries. There are many ways to do it: you can use a GraphQL query-batching package, or simply do it like this:
mutation {
  q1: createStroke(color: "red") {
    id
  }
  q2: createStroke(color: "blue") {
    id
  }
}
Remember that Prisma will time out queries that take longer than 45 s, so you may want to limit your batch size.
With n queries per batch, this divides the number of round trips by n.
Since Prisma version 2.20.0, you should be able to use .createMany({}).
This answer probably won't be helpful since you asked three years ago...
https://www.prisma.io/docs/concepts/components/prisma-client/crud#create-multiple-records
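A minimal sketch of what that could look like with a Prisma 2+ client (the stroke model and field names here are assumptions based on the question, not a real schema; note that createMany does not support nested creates, so the related samples would need their own createMany call with a foreign key):
const created = await prisma.stroke.createMany({
  data: painting.strokes.map((stroke) => ({
    myId: stroke.id,
    myCreatedAt: new Date(stroke.createdAt),
    brushType: stroke.brushType,
    color: stroke.color,
    randomSeed: stroke.randomSeed,
    paintingId: jsonPainting.id,
  })),
  skipDuplicates: true,
});
// created.count holds the number of inserted strokes.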

Slow performance when using high offset

For example, I want to retrieve all the data from the CITIZEN table, which contains about 18K rows.
String sqlResult = "SELECT * FROM CITIZEN";
Query query = getEntityManager().createNativeQuery(sqlResult);
query.setFirstResult(searchFrom);
query.setMaxResults(searchCount); // searchCount is 20
List<Object[]> listStayCit = query.getResultList();
Everything was fine until the "searchFrom" offset got large (17K or so). For example, it took 3-4 minutes to get 20 rows (17,000 to 17,020). So is there any better way to make this faster, other than tuning the DB?
P.S.: Sorry for my bad English.
You could use batch queries.
A good article explaining a solution to your problem is available here:
http://java-persistence-performance.blogspot.in/2010/08/batch-fetching-optimizing-object-graph.html
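The large offset itself is usually the slow part: the database still has to walk past the first 17,000 rows before returning your 20. A common alternative, not covered by the linked article, is keyset ("seek") pagination: remember the last id of the previous page and filter on it instead of using an offset. A minimal sketch, assuming CITIZEN has an indexed ID column and lastSeenId is a hypothetical variable holding the highest id from the previous page (0 for the first page):
// Keyset pagination: filter on the last id seen instead of skipping rows.
String sql = "SELECT * FROM CITIZEN WHERE ID > ?1 ORDER BY ID";
Query query = getEntityManager().createNativeQuery(sql);
query.setParameter(1, lastSeenId);   // hypothetical: id of the last row of the previous page
query.setMaxResults(searchCount);    // e.g. 20
List<Object[]> page = query.getResultList();
// The ID of the last row in `page` becomes lastSeenId for the next request.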