With OrientDB and Java, I have a program that uses this query to compute the shortest path between all pairs of 300,000 nodes, but the program is very slow. Can someone help me?
String query = "select expand(shortestPath("+source.getId()+","+dest.getId()+",'OUT'))";
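All-pairs over 300,000 nodes means an enormous number of separate shortestPath queries, and building a fresh SQL string plus a round trip per pair adds avoidable overhead on top of that. Below is a minimal sketch of a parameterized version, assuming OrientDB 3.x and hypothetical connection details and record ids; it cuts the per-call parsing and string building, but the per-pair round trips remain the dominant cost:

import com.orientechnologies.orient.core.db.ODatabaseSession;
import com.orientechnologies.orient.core.db.OrientDB;
import com.orientechnologies.orient.core.db.OrientDBConfig;
import com.orientechnologies.orient.core.id.ORecordId;
import com.orientechnologies.orient.core.sql.executor.OResultSet;

public class ShortestPathExample {
    public static void main(String[] args) {
        // Hypothetical connection details.
        try (OrientDB orient = new OrientDB("remote:localhost", OrientDBConfig.defaultConfig());
             ODatabaseSession db = orient.open("mydb", "admin", "admin")) {

            ORecordId sourceRid = new ORecordId("#9:1");   // hypothetical source vertex rid
            ORecordId destRid   = new ORecordId("#9:200"); // hypothetical destination vertex rid

            // One parameterized statement, reused for every pair, instead of
            // concatenating a new SQL string per query.
            String sql = "select expand(shortestPath(?, ?, 'OUT'))";
            try (OResultSet rs = db.query(sql, sourceRid, destRid)) {
                rs.forEachRemaining(r -> System.out.println(r));
            }
        }
    }
}

For a graph of that size it is usually worth asking whether all pairs are really needed, or whether the traversal can be done in memory rather than as one server-side query per pair.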
Related
This is a general question, but I am hoping someone can answer it. I am comparing query execution times between MongoDB and Spark SQL. Specifically, I created a MongoDB collection of 1 million entries from a .csv file and ran a few queries on it using mongosh in Compass. Then, using the Spark shell and the MongoDB Spark connector, I loaded this collection from MongoDB into Spark as an RDD. After that I converted the RDD into a DataFrame and started running Spark SQL queries on it. I ran the same queries in Spark SQL as in MongoDB, measuring the query execution times in both cases. The result was that for fairly simple queries like
SELECT ... FROM ... WHERE ... ORDER BY ...
MongoDB was significantly faster than Spark SQL. In one such example the execution times were around 800 ms for MongoDB versus around 1800 ms for Spark SQL.
From my understanding, a Spark DataFrame automatically distributes the work and runs it in parallel, so the Spark SQL query should be faster than the MongoDB query. Can anyone explain?
I was right: Spark SQL should be faster on the DataFrame than the MongoDB queries. It appears that using the connector to import the data is what causes the execution-time problem. I tried loading the data from the .csv file straight into a DataFrame, and the query time was much faster than both the query on the data imported through the MongoDB connector and the query in MongoDB itself.
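One way to confirm where the time goes is to load the same data both ways, cache and materialize both DataFrames, and only then time the SQL. A rough sketch, assuming hypothetical file paths, database/collection names, a numeric price column, and the 10.x MongoDB Spark connector (older connector versions use format("mongo") and differently named options):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CompareSources {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // DataFrame read straight from the .csv file (hypothetical path).
        Dataset<Row> fromCsv = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/data/entries.csv");

        // DataFrame read through the MongoDB Spark connector (hypothetical URI;
        // exact option keys depend on the connector version).
        Dataset<Row> fromMongo = spark.read()
                .format("mongodb")
                .option("connection.uri", "mongodb://localhost:27017")
                .option("database", "test")
                .option("collection", "entries")
                .load();

        // Cache and materialize both before timing, so the first SQL query
        // does not also pay for the read from disk / from MongoDB.
        fromCsv.cache().count();
        fromMongo.cache().count();

        fromCsv.createOrReplaceTempView("csv_entries");
        fromMongo.createOrReplaceTempView("mongo_entries");

        long t0 = System.nanoTime();
        spark.sql("SELECT * FROM csv_entries WHERE price > 100 ORDER BY price").count();
        System.out.println("csv:   " + (System.nanoTime() - t0) / 1_000_000 + " ms");

        long t1 = System.nanoTime();
        spark.sql("SELECT * FROM mongo_entries WHERE price > 100 ORDER BY price").count();
        System.out.println("mongo: " + (System.nanoTime() - t1) / 1_000_000 + " ms");
    }
}

Without the cache()/count() step, the first query's measured time includes the connector read itself, which can easily dwarf the query.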
I have tried a single-node cluster and a 3-node cluster on my local machine to fetch 2.5 million entries from Cassandra using Spark, but in both scenarios it takes 30 seconds just for SELECT COUNT(*) FROM table. I need this count, and similar ones, for real-time analytics.
SparkSession.builder().getOrCreate().sql("SELECT COUNT(*) FROM data").show()
Cassandra isn't designed to iterate over the entire data set in a single expensive query like this. If there are, say, 10 petabytes of data, this query would require reading 10 petabytes off disk, bringing it into memory, and streaming it to the coordinator, which has to resolve tombstones and deduplicate (you can't just have each replica send a count or you will massively under- or over-count) while incrementing a counter. That is not going to finish within a 5-second timeout. You can use aggregation functions over smaller chunks of the data, but not in a single query.
If you really want to make this work, query the system.size_estimates table of each node, and split each range according to its size so that you get an approximate maximum of, say, 5k rows per read. Then issue a COUNT(*) with a TOKEN restriction for each of the split ranges and combine the values of all those queries. This is how the Spark connector does its full table scans for SELECT * RDDs, so you just have to replicate that.
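A rough sketch of that idea with the Java driver (4.x), assuming a hypothetical table ks.data with partition key pk; splitting large ranges further and handling the wrap-around range are left out:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class TokenRangeCount {
    public static void main(String[] args) {
        long total = 0;
        try (CqlSession session = CqlSession.builder().build()) {
            // size_estimates describes the ranges this node owns; in a real
            // implementation you would collect ranges from every node (or use
            // the driver's token metadata) and split big ranges further using
            // partitions_count / mean_partition_size.
            for (Row range : session.execute(
                    "SELECT range_start, range_end FROM system.size_estimates "
                  + "WHERE keyspace_name = 'ks' AND table_name = 'data'")) {
                long start = Long.parseLong(range.getString("range_start"));
                long end = Long.parseLong(range.getString("range_end"));

                // Small, token-restricted count for just this slice of the ring.
                Row count = session.execute(
                        "SELECT COUNT(*) FROM ks.data "
                      + "WHERE token(pk) > ? AND token(pk) <= ?",
                        start, end).one();
                total += count.getLong(0);
            }
        }
        System.out.println("total rows: " + total);
    }
}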
The easiest, and probably safer and more accurate (but less efficient), approach is to use Spark to read the entire data set and then count it, rather than using a CQL aggregation function.
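For comparison, the full-scan count through the Spark Cassandra connector is only a few lines (hypothetical keyspace and table names); the connector reads the table one token range per task, so the work is spread across executors:

import org.apache.spark.sql.SparkSession;

public class SparkCassandraCount {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // Distributed scan and count instead of a single expensive CQL query.
        long rows = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "ks")   // hypothetical keyspace
                .option("table", "data")    // hypothetical table
                .load()
                .count();

        System.out.println(rows);
    }
}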
How long does it take to run this query directly, without Spark? I think it is not possible to parallelize COUNT queries, so you won't benefit from using Spark for such queries.
I'm trying to run a GetMongo processor in Apache NiFi. I can get a base query to run just fine and output the records to my hard drive (just for initial testing; it will go to a Hadoop client eventually).
My problem is that I want to run the query every 10 minutes and return ONLY the records created in the last 10 minutes. The query I have tested on my local Mongo client is:
{"createdAt": {$gte: new Date(ISODate().getTime() - 1000 * 60 * 5)}}
At first, I thought it didn't like the dynamic part, so I tried putting in a static timestamp. But NiFi has told me every single query I have tried is invalid.
There are some decent guides out there, but they are very specific to the SQL processors in NiFi, and I'm wondering if anyone has experience creating a flow based on dynamic queries with Mongo in NiFi. Thanks so much in advance.
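One thing worth checking: GetMongo expects the Query property to be JSON that the MongoDB Java driver can parse, so mongosh helpers such as ISODate() and new Date() are rejected. A static-date version of the same filter in extended JSON would look something like the line below (a sketch only; depending on the driver version bundled with NiFi, the $date value may need to be epoch milliseconds instead of an ISO string, and whether the dynamic "last N minutes" value can be injected, e.g. via NiFi Expression Language on the Query property, depends on the NiFi version):

{"createdAt": {"$gte": {"$date": "2018-01-01T00:00:00Z"}}}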
I am trying out Apache Drill to execute a query over a MongoDB connection. Simple COUNT(1) queries are taking too long: on the order of 20 seconds per query. When I connect using any other Mongo connector and run the same query, it takes milliseconds. I have also seen people online mention their Mongo queries taking 2 seconds. I can live with 2 seconds, but 20 is too much.
Here is the query:
select count(*) from mongo.test.contacts
Here is the Query Profile for the query.
It seems that some optimizations should be applied for your case. It would be very helpful if you created a Jira ticket [1] with details:
the DDL for the MongoDB table, the MongoDB version, and info from the log files (because it is not clear what Drill did during all this time).
A simple reproduction of your case can help solve this issue more quickly.
Thanks.
[1] https://issues.apache.org/jira/projects/DRILL/issues/
I have a document structure like this
{
  id,
  companyid,
  fieldA1, valueA1,
  fieldA2, valueA2,
  .....
  fieldB15, valueB15,
  fieldF150, valueF150
}
My job is to multiply fieldA1*valueA1, fieldA2*valueA2, and so on, and sum them into a new field A_sum = sum(A fields * A values), and likewise B_sum = sum(B fields * B values), C_sum, etc.
Then, in the next step, I have to generate final_sum = (A_sum*A_val + B_sum*B_val .....).
I have modeled this with the aggregation framework, using 3 projections for the three steps of the calculation. At this point I get about 100 seconds for 750,000 docs; I have an index only on _id, which is a GUID. CPU is at 15%.
I tried grouping in order to force parallel operations and load the CPU more, but it seems to take even longer.
What else can I do to make it faster, i.e. to load more CPU and use more parallelism?
I don't need a $match stage, as I have to process all the docs.
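For reference, a minimal sketch of what one projection stage for the A group might look like in the shell (hypothetical collection name, and only the first two field/value pairs shown; the real stage would list all of them and repeat the pattern for B_sum, C_sum, and so on):

db.docs.aggregate([
  { $project: {
      companyid: 1,
      A_sum: { $add: [
        { $multiply: [ "$fieldA1", "$valueA1" ] },
        { $multiply: [ "$fieldA2", "$valueA2" ] }
      ] }
  } }
])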
You might get it done using sharding, as the scanning of the documents would be done in parallel.
Simply measure the time your aggregation needs now, and calculate the number of shards you need using
((t/100)+1)*s
where t is the time the aggregation took in seconds and s is the number of existing shards (1 if you have a standalone instance or a replica set), rounded up, of course. The 1 is added to make sure the overhead of doing an aggregation in a sharded environment is offset by the additional shard.
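For example, with the roughly 100-second aggregation quoted above on a standalone or replica set (s = 1), that works out to ((100/100)+1)*1 = 2 shards.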
My only solution was to split the collection into smaller collections (the same space overall) and run the computation per smaller collection (via a C# console app) using the parallel library, so I could raise CPU usage to 70%.
That reduces the time from approx. 395 s at 15% CPU (script via Robomongo, all docs) to 25-28 s at 65-70% CPU (C# console app with parallelism).
Using grouping did not help in my case.
Sharding is not an option right now.
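A sketch of that fan-out idea, shown here with the MongoDB Java driver (4.x) and a parallel stream rather than the original C# console app (hypothetical sub-collection names and an abbreviated pipeline):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import java.util.List;

public class ParallelAggregation {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("test");

            // Hypothetical sub-collections the original collection was split into.
            List<String> parts = List.of("docs_0", "docs_1", "docs_2", "docs_3",
                                         "docs_4", "docs_5", "docs_6", "docs_7");

            // Abbreviated pipeline: only the first two A field/value pairs.
            Document aSum = new Document("$add", List.of(
                    new Document("$multiply", List.of("$fieldA1", "$valueA1")),
                    new Document("$multiply", List.of("$fieldA2", "$valueA2"))));
            Document project = new Document("$project", new Document("A_sum", aSum));

            // Run the same pipeline against each sub-collection in parallel,
            // which is what pushed CPU usage up in the C# console app.
            parts.parallelStream().forEach(name ->
                    db.getCollection(name)
                      .aggregate(List.of(project))
                      .forEach(doc -> { /* collect or write the per-doc result */ }));
        }
    }
}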