PrestoDB Mongo query taking too much time

I am running a query in PrestoDB through a MongoDB connector. The query fetches data from a single collection in MongoDB. The query is something like:
SELECT studentId, classId,
       sum(date_diff('DAY', entryTime,
           CASE WHEN exitTime <= TIMESTAMP '2018-04-15 23:59:59 UTC' THEN exitTime ELSE TIMESTAMP '2018-04-15 23:59:59 UTC' END)) AS timeSpent
FROM mongodb.school.student
WHERE entryTime BETWEEN TIMESTAMP '2017-10-30 00:00:00 UTC' AND TIMESTAMP '2018-05-15 23:59:59 UTC'
  AND contains(classId, '1234')
  AND subject = 'Maths'
GROUP BY classId, studentId
ORDER BY timeSpent DESC;
I have about 8 million records in the collection and this query takes about 45 seconds to execute.
My PrestoDB setup is a single Ubuntu instance acting as both coordinator and worker, with 8 GB of total RAM. The jvm.config file looks like:
-server
-Xmx8G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+AggressiveOpts
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
The config.properties file has the following config:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
The heap was earlier -Xmx4G; I changed it to -Xmx8G to see if it helped, but performance was almost the same. So:
Am I using an instance with too little RAM (8 GB)?
Should I try running PrestoDB as a cluster? What configuration would be needed if that collection grows to about 60 million records with this query?
Or is it something in my current configuration itself?

Please run EXPLAIN ANALYZE for your query in Presto and show us the output.
It should be clear which part of the query takes most of the time.
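For reference, that just means prefixing the statement in the Presto CLI with EXPLAIN ANALYZE, for example:
EXPLAIN ANALYZE
SELECT studentId, classId,
       sum(date_diff('DAY', entryTime,
           CASE WHEN exitTime <= TIMESTAMP '2018-04-15 23:59:59 UTC' THEN exitTime ELSE TIMESTAMP '2018-04-15 23:59:59 UTC' END)) AS timeSpent
FROM mongodb.school.student
WHERE entryTime BETWEEN TIMESTAMP '2017-10-30 00:00:00 UTC' AND TIMESTAMP '2018-05-15 23:59:59 UTC'
  AND contains(classId, '1234')
  AND subject = 'Maths'
GROUP BY classId, studentId
ORDER BY timeSpent DESC;
The per-operator statistics it prints (CPU time, rows, time per stage) should show whether the bulk of the 45 seconds goes into scanning MongoDB or into the aggregation and sort.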

Related

Fetch "current datetime" from Postgres Database instead of PDI server/client time using Table Input step

Fetch "current datetime" from Postgres Database instead of PDI server/client time using Table Input step.
When I use "Table Input" step to get the Postgres Timestamp, I am getting PDI server/client time.
How to get Postgres DB timestamp instead PDI Server/Client time.
SELECT TO_CHAR(CURRENT_TIMESTAMP, 'YYYY/MM/DD HH24:MI:SS') p_to_date1
     , TO_CHAR(LOCALTIMESTAMP, 'YYYY/MM/DD HH24:MI:SS') p_to_date2
     , TO_CHAR(CAST(CURRENT_TIMESTAMP AS TIMESTAMP), 'YYYY/MM/DD HH24:MI:SS') p_to_date3;
p_to_date1 --> 2023/02/03 11:34:54
p_to_date2 --> 2023/02/03 11:34:54
p_to_date3 --> 2023/02/03 11:34:54

Solr 4.5 not saving time correctly

I have defined a date field in Solr, and I'm using DIH to populate its value from the DB. The InsertTs value in Solr is always stored with a time of either 4:00:00 or 5:00:00, but the date part is stored correctly.
Solr Value: 2013-11-07T05:00:00Z or 2015-05-13T04:00:00Z
DB Value: 07-11-13 02:29:53.00 PM or 07-11-13 12:00:00.00 AM
Schema.xml: INSERTTS is defined as type "date"
DIH: name="INSERTTS" column="INSERTTS"
DIH Query:
SELECT TO_DATE(TO_CHAR(INSERTTS, 'yyyy-mm-dd hh24:mi:ss'), 'yyyy-mm-dd hh24:mi:ss') AS INSERTTS FROM EMPLOYEE
InsertTs is defined as TIMESTAMP in the DB.
Solr is running on a Tomcat server on a Linux machine, which is in the EDT timezone.
The DB is Oracle 11g, in the UTC timezone.
The issue was with the JDBC driver: it was not fetching the time part from the date field.
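A workaround often suggested for this (a sketch, not something confirmed in the original post) is to skip the TO_DATE/TO_CHAR round-trip in the DIH query and hand the JDBC driver a TIMESTAMP directly, so the time-of-day portion is preserved:
-- return the column as a TIMESTAMP so the driver does not drop the time part
SELECT CAST(INSERTTS AS TIMESTAMP) AS INSERTTS FROM EMPLOYEE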

"Lost connection to MySQL server during query" in Google Cloud SQL

I am getting a weird, recurring but not constant error: "2013, 'Lost connection to MySQL server during query'". These are the premises:
a Python app runs for around 15-20 minutes every hour and then stops (scheduled hourly by cron)
the app is on a GCE n1-highcpu-2 instance; the db is on a D1 with a per-package pricing plan and the following MySQL flags
max_allowed_packet 1073741824
slow_query_log on
log_output TABLE
log_queries_not_using_indexes on
the database is accessed only by this app, so usage is consistent: around 20 consecutive minutes per hour and then nothing at all for the other 40 minutes
the first query it does is
SELECT users.user_id, users.access_token, users.access_token_secret, users.screen_name, metadata.last_id
FROM users
LEFT OUTER JOIN metadata ON users.user_id = metadata.user_id
WHERE users.enabled = 1
the above query joins two tables that are each around 700 rows and have no indexes
after this query (which takes 0.2 seconds when it runs without problems) the app starts without any issues
Looking at the logs I see that each time this error presents itself the interval between the start of the query and the error is 15 minutes.
I've also enabled the slow query log, and those queries are logged like this:
start_time: 2014-10-27 13:19:04
query_time: 00:00:00
lock_time: 00:00:00
rows_sent: 760
rows_examined: 1514
db: foobar
last_insert_id: 0
insert_id: 0
server_id: 1234567
sql_text: ...
Any ideas?
If your connection is idle for the 15-minute gap, then you are probably seeing GCE disconnect your idle TCP connection, as described at https://cloud.google.com/compute/docs/troubleshooting#communicatewithinternet. Try the workaround that page suggests:
sudo /sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
(You may need to put this configuration into /etc/sysctl.conf to make it permanent)
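For reference, the equivalent permanent entries in /etc/sysctl.conf (mirroring the values above) would be:
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 5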

mongo.input.query with $date not filtering input to hadoop

I have a sharded input collection that I want to filter on before sending it to my hadoop cluster for map reduce computations.
I have this parameter in my $ hadoop jar command:
mongo.input.query='{_id.uuid:"device-964693"}'
and it works: documents that do not satisfy this query are excluded from the map-reduce output.
This however does not work:
mongo.input.query='{_id.day:{\\$lt:{\\$date:1388620740000}}}'
no data is being produced as output.
1388620740000 represents the date Wed Jan 01 2014 23:59:00 GMT+0000 (GMT).
The setup is using hadoop 2.2, mongo 2.4.9, this connector version (2.2-1.2.0).
No error messages, just a standard hadoop success message.
Is my syntax incorrect or what did I miss?
Could you point me to some debugging tools/methods for this?
Debugging methods:
in mongo:
db.currentOp(true).inprog.forEach(
  function(d) {
    if (d.ns == "test.collection" && d.query.query["_id.day"])
      printjson(d);
  })
a terminal-friendly syntax:
$ hadoop jar... ...mongo.input.query='{"_id.day":{"$lt":{"$date":"2014-01-19T23:00:00Z"}}}'

Can I log query execution time in PostgreSQL 8.4?

I want to log the execution time of each query run during the day.
For example, like this:
2012-10-01 13:23:38 STATEMENT: SELECT * FROM pg_stat_database runtime:265 ms.
Please give me some guidelines.
If you set
log_min_duration_statement = 0
log_statement = all
in your postgresql.conf, then you will see all statements being logged into the Postgres logfile.
If you enable
log_duration
that will also print the time taken for each statement. This is off by default.
Using the log_statement parameter you can control which type of statement you want to log (DDL, DML, ...)
This will produce an output like this in the logfile:
2012-10-01 13:00:43 CEST postgres LOG: statement: select count(*) from pg_class;
2012-10-01 13:00:43 CEST postgres LOG: duration: 47.000 ms
More details in the manual:
http://www.postgresql.org/docs/8.4/static/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN
http://www.postgresql.org/docs/8.4/static/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHAT
If you want a daily list, you probably want to configure the logfile to rotate on a daily basis. Again this is described in the manual.
I believe OP was actually asking for execution duration, not the timestamp.
To include the duration in the log output, open pgsql/<version>/data/postgresql.conf, find the line that reads
#log_duration = off
and change it to
log_duration = on
If you can't find the given parameter, just add it in a new line in the file.
After saving the changes, restart the postgresql service, or just invoke
pg_ctl reload -D <path to the directory of postgresql.conf>
e.g.
pg_ctl reload -D /var/lib/pgsql/9.2/data/
to reload the configuration.
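Alternatively, assuming you are connected as a superuser, the configuration can be reloaded from a SQL session instead:
-- re-read postgresql.conf without restarting the server
SELECT pg_reload_conf();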
I think a better option is to enable the pg_stat_statements extension. It records the execution time of each query nicely in a view.
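A rough sketch of using it (this assumes a PostgreSQL version with CREATE EXTENSION support and pg_stat_statements added to shared_preload_libraries, which requires a server restart):
-- requires shared_preload_libraries = 'pg_stat_statements' in postgresql.conf
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- top 10 statements by total execution time
-- (the column is named total_exec_time in newer releases)
SELECT query, calls, total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;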