sys_log entries repeat themselves very often - TYPO3

I have a TYPO3 6.2 installation running. Today I took a look at the sys_log table and found many entries written within a very short time, like:
"10:34:25 praktikant#LIVE DB Aktualisieren Record 'Kontaktformular
Online-Demo' (tt_content:19697) was updated. (Online). (msg#1.2.10)"
This record repeats itself about ten times with the same timestamp.
Another example:
"10:29:24 praktikant#LIVE DB Aktualisieren Record '[Kein Titel]'
(tt_content:19696) was updated. (Online). (msg#1.2.10)"
And this record repeats itself twelve times, again with the same timestamp.
It is not tied to this particular user; others have this too.
Is that normal? It will blow up this table very fast.

Related

What is the best way to move millions of records from one Postgres database to another?

So we have a task at the moment where I need to move millions of records from one database to another.
To complicate things slightly I need to change an id on each record before inserting the data.
How it works is we have 100 stations in database a.
Each station contains 30+ sensors.
Each sensor contains readings for about the last 10 years.
These readings are recorded at anywhere from 15-minute to daily intervals.
So each station can have at least 5 million records.
database b has the same structure as database a.
The reading table contains the following fields
id: primary key
sensor_id: int4
value: numeric(12)
time: timestamp
What I have done so far for one station is:
Connect to database a and select all readings for station 1
Find all corresponding sensors in database b
Change the sensor_id from database a to its new sensor_id from database b
Chunk the updated sensor_id data into groups of about 5000 parameters
Loop over the chunks and do a mass insert
In theory, this should work.
However, I am getting errors saying duplicate key violates unique constraint.
If I query the database on those records that are failing, the data doesn't exist.
The weird thing about this is that if I run the script 4 or 5 times in a row all the data eventually gets in there. So I am at a loss as to why I would be receiving this error because it doesn't seem accurate.
Is there a way I can stop this error from happening?
Is there a more efficient way of doing this?
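For illustration, if database b can reach database a directly (e.g. via postgres_fdw), the remapping plus insert could in principle be a single set-based statement, which also sidesteps duplicate-key aborts. A rough sketch under those assumptions (readings_a/sensors_a are foreign tables imported from database a, readings/sensors are the local tables in database b, and code is a hypothetical natural key used to match sensors; adjust the names to the real schema):

    -- readings_a / sensors_a: foreign tables from database a (assumption);
    -- readings / sensors: the local tables in database b.
    INSERT INTO readings (sensor_id, value, time)
    SELECT sb.id, ra.value, ra.time
    FROM readings_a ra
    JOIN sensors_a sa ON sa.id = ra.sensor_id   -- original sensor metadata from database a
    JOIN sensors   sb ON sb.code = sa.code      -- matching sensor in database b
    ON CONFLICT DO NOTHING;                     -- silently skip rows that already exist

The same ON CONFLICT DO NOTHING clause (PostgreSQL 9.5+) can also be added to the existing chunked inserts, so rows already written by a previous run are skipped instead of aborting the whole batch.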

Keep table synced with another but with accumulated / grouped data

If I have large amounts of data in a table defined like
    CREATE TABLE sensor_values (
        ts        TIMESTAMPTZ NOT NULL,
        value     FLOAT8      DEFAULT 'NaN' NOT NULL,
        sensor_id INT4        NOT NULL
    );
Data comes in every minute for thousands of points. Quite often, though, I need to extract and work with daily values over years (on a web frontend). To aid this I would like a sensor_values_days table that only has the daily sums for each point, which I can then use for faster queries over longer timespans.
I don't want a trigger for every write to the DB, as I am afraid that would slow down writes, which are already the bottleneck.
Is there a way to trigger only after a certain number of rows have been inserted?
Or perhaps an index that maintains a sum of entries per day? I don't think that is possible.
What would be the best way to do this? It would not have to be very up to date; losing the last few hours or a day would not be an issue.
Thanks
"What would be the best way to do this?"
Install ClickHouse and use the AggregatingMergeTree table engine.
With postgres:
Create per-period aggregate table. You can have several with different granularity, like hours, days, and months.
Have a cron or scheduled task run at the end of each period plus a few minutes. First, select the latest timestamp in the per-period table, so you know at which period to start. Then, aggregate all rows in the main table for periods that came after the last available one. This process will also work if the per-period table is empty, or if it missed the last update then it will catch up.
In order to do only inserts and no updates, you have to run it at the end of each period, to make sure it got all the data. You can also store the first and last timestamp of the rows that were aggregated, so later if you check the table you see it did use all the data from the period.
After aggregation, the "hour" table should be 60x smaller than the "minute" table, that should help!
Then, repeat the same process for the "day" and "month" table.
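A minimal sketch of the per-day variant in Postgres, assuming the sensor_values table from the question (the aggregate table name and its columns are made up here):

    CREATE TABLE sensor_values_days (
        day       DATE   NOT NULL,
        sensor_id INT4   NOT NULL,
        value_sum FLOAT8 NOT NULL,
        PRIMARY KEY (day, sensor_id)
    );

    -- Run from cron shortly after midnight: aggregate every full day
    -- that is newer than what the aggregate table already contains.
    INSERT INTO sensor_values_days (day, sensor_id, value_sum)
    SELECT date_trunc('day', ts)::date, sensor_id, sum(value)
    FROM sensor_values
    WHERE ts >= COALESCE((SELECT max(day) + 1 FROM sensor_values_days), '-infinity')
      AND ts <  date_trunc('day', now())
    GROUP BY 1, 2;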
If you want up-to-date stats, you can UNION ALL the results of the "per day" table (for example) with the results of the live table, but only pull the current day out of the live table, since all previous days' worth of data have already been summarized into the "per day" table. Hopefully, the current day's data will be cached in RAM.
"It would not have to be very up to date. Losing the last few hours or a day would not be an issue."
Also if you want to partition your huge table, make sure you do it before its size becomes unmanageable...
Materialized Views and a Cron every 5 minutes can help you:
https://wiki.postgresql.org/wiki/Incremental_View_Maintenance
In PG14, we will have INCREMENTAL MATERIALIZED VIEW, but for the moment it is still in development.
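A minimal sketch of that approach with the plain materialized views available today (the view name is made up; REFRESH ... CONCURRENTLY needs a unique index on the view, and it recomputes the whole aggregate each time, which may or may not be acceptable at this data volume):

    CREATE MATERIALIZED VIEW sensor_values_days AS
    SELECT date_trunc('day', ts)::date AS day,
           sensor_id,
           sum(value) AS value_sum
    FROM sensor_values
    GROUP BY 1, 2;

    CREATE UNIQUE INDEX ON sensor_values_days (day, sensor_id);

    -- From cron, e.g. every 5 minutes:
    REFRESH MATERIALIZED VIEW CONCURRENTLY sensor_values_days;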

How to quickly insert ~300 GB / one billion records with relations into a PostgreSQL database?

I have been working on this for months but still have no solution; I hope I can get help from you...
The task is, I need to insert/import real time data records from an online data provider into our database.
Real-time data is provided in the form of files. Each file contains up to 200,000 JSON records, one record per line. Every day several to tens of files are provided online, going back to 2013.
I calculated the whole file store and got a total size of around 300 GB. I estimated the total number of records (I can get the file sizes via a REST API but not the line count of each file); it should be around one billion records or a little more.
Before I can import/insert a record into our database, I need to look up two entities (station, parameter) for the record and establish the relationship.
So the workflow is something like:
Find parameter in our database, if the parameter exists in our db, just return parameter.id; otherwise the parameter should be inserted into parameter table as a new entry and a new parameter.id will be created and returned;
Find station in our database, similarly, if that station exists already, just take its id, otherwise a new station will be created and a new station.id will be created;
Then I can insert this json record into our main data table with its identified parameter.id and station.id to make relationship.
Basically it is a simple database structure with three main tables (data, parameter, station). They are connected by parameter.id and station.id with primary key/foreign key relationships.
But the querying is very time consuming, and I cannot find a way to properly insert this amount of data into the database in a foreseeable time.
I did two trials:
Just use normal SQL queries with bulk insert.
The workflow is described above.
For each record a) get parameter.id b) get station.id c) insert this record
Even with bulk insert, I could only insert one million records in a week. The records are not that short and contain about 20 fields.
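For what it's worth, the lookup-or-create in a) and b) can be collapsed into a single statement per entity, which at least removes one round trip per record. A rough sketch for the station case, assuming a unique constraint on (longitude, latitude) and example coordinate values:

    -- Returns the station id whether the station is new or already exists.
    WITH ins AS (
        INSERT INTO station (longitude, latitude)
        VALUES (13.40, 52.52)
        ON CONFLICT (longitude, latitude) DO NOTHING
        RETURNING id
    )
    SELECT id FROM ins
    UNION ALL
    SELECT id FROM station WHERE longitude = 13.40 AND latitude = 52.52
    LIMIT 1;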
After that, I tried the following:
I don't check parameter and station in advance, but just use the COPY command to copy the records into an intermediate table with no relations. I calculated that this way all one billion records can be imported in around ten days.
But after the COPY, I need to manually find all distinct stations with a SQL query (select distinct or group by; there are only a few parameters, so I can ignore the parameter part), create these stations in the station table, and then UPDATE all the records with their corresponding station.id, and this UPDATE operation takes very, very long.
Here is an example:
I spent one and a half days to import 33,000,000 records into the intermediate table.
I queried with select longitude, latitude from records group by longitude, latitude and got 4,500 stations
I inserted these 4,500 stations into the station table
And for each station I do
update record set stationid = station.id where longitude=station.longitude and latitude=station.latitude
The job is still running but I estimate it will take two days
And this is only for 30,000,000 records, I have one billion.
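One thing that usually helps a lot here: instead of looping and issuing one UPDATE per station, a single set-based UPDATE that joins the intermediate table against the station table touches every row exactly once. A rough sketch with the table and column names from the example (an index on station (longitude, latitude) is assumed):

    UPDATE record r
    SET stationid = s.id
    FROM station s
    WHERE r.longitude = s.longitude
      AND r.latitude = s.latitude;

If the intermediate table is going to be rewritten anyway, an INSERT INTO ... SELECT that performs the same join into the final data table avoids updating a billion rows in place at all.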
So my question is, how can I insert this amount of data into the database quickly?
Thank you very much in advance!
2020-08-20 10:01
Thank you all very much for the comments!
#Brits:
Yes and no: all the COPYs took over 24 hours, one COPY command per JSON file. For each file I need to do the following:
Download the file
Flatten the JSON file into a CSV-like text file for the COPY command (no relation check for station, but with the parameter check; that is very easy and quick, since the parameters are effectively constants in the project)
Execute the "COPY" command via Python
Steps 1, 2 and 3 together take around one minute for a JSON file containing ~200,000 records if I am on the company network, and around 20 minutes if I work from home.
So just ignore the "one and a half days" statement. My estimate is that if I work on the company network, I will be able to import all 300 GB / one billion records into my intermediate table, without relation checks, in ten days.
I think my problem now is not moving the JSON content into a flat database table, but building the relationship between data and station.
After I have my JSON content in the flat table, I need to find all stations and update all records with:
select longitude, latitude, count(*) from records group by longitude, latitude
insert these longitude latitude combination into station table
for each entry in station table, do
update records set stationid = station.id where longitude=station.longitude and latitude=station.latitude
Step 3 above takes very long (step 1 also takes several minutes for only 34 million records; I have no idea how long step 1 will take for one billion records).
#Mike Organek #Laurenz Albe Thanks a lot. Your comments are still difficult for me to understand; I will study them and give feedback.
The total file count is 100,000+
I am thinking about parsing all the files first to get the individual stations, and then doing the COPYs with station.id and parameter.id already resolved. I will give feedback.
2020-09-04 09:14
I finally got what I wanted, even though I am not sure whether it is correct.
What I have done since my question:
Parse all the JSON files, extract the unique stations by coordinates and save them into the station table
Parse all the JSON files again and save all the fields into the record table, with the parameter id from constants and the station id from 1), using the COPY command
I did 1) and 2) in a pipeline; they ran in parallel. 1) took longer than 2), so I always needed to let 2) wait.
And after 10 days, I have my data in Postgres, with ca. 15,000 stations and 0.65 billion records in total; each record has its corresponding station id.
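For reference, the per-file load in 2) then reduces to a plain COPY of already-resolved rows; a rough sketch with a hypothetical column list and file path (a client-side \copy, or psycopg2's copy_expert from Python, does the same job when the file is not readable by the database server):

    -- The flattened CSV already contains the resolved ids.
    COPY record (stationid, parameterid, ts, value)
    FROM '/data/flattened/file_0001.csv'
    WITH (FORMAT csv);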
#steven-matison Thank you very much and could you please explain a little bit more?

Concurrent data insert client (golang) results in first 50 rows missing in database (postgres), but the rest of the 390 are okay

I am pulling down stock market data and inserting it into a PostgreSQL database. I have 500 stocks with 60 days of historical data. Each day has 390 trading minutes, and each minute is a row in the database table. The summary of the issue is that the first 20-50 minutes of each day are missing for each stock. Sometimes it's fewer than 50, but it is never more than 50. Every minute after that for each day is fine (EDIT: on further inspection there are missing minutes all over the place). The maximum matches the max number of concurrent goroutines (https://github.com/korovkin/limiter).
The hardware is set up in my home. I have a laptop that pulls the data, and a 8 year old gaming computer that has been repurposed as a postgres database running in ubuntu. They are connected through a netgear nighthawk x6 router and communicate over the LAN.
The laptop is running a Go program that pulls data down and performs concurrent inserts. I loop through the 60 days; for each day I loop through each stock, and for each stock I loop through each minute and insert it into the database via an INSERT statement. Inside the minute loop I used a library that limits the max number of goroutines.
I am working around it by grabbing the data again and inserting until the first time the Postgres server responds that the entry is a duplicate and violates the unique constraints on the table, then breaking out of the loop for that stock.
However, I'd like to know what happened, as I want to better understand how these problems can arise under load. Any ideas?
    limit := NewConcurrencyLimiter(50)
    for _, m := range ms {
        limit.Execute(func() {
            m.Insert()
        })
    }
    limit.Wait()
The issue is that the closure captures the loop variable m, which is reused on every iteration, so by the time the queued goroutines ran, many of them saw a later value of m rather than the one from their own iteration. I needed to copy the value I wanted inserted within the for loop, and change the method away from a receiver to one with an input parameter:
    for i := range ms {
        value := ms[i] // copy the element for this iteration so the closure captures its own value
        limit.Execute(func() {
            Insert(value)
        })
    }
    limit.Wait()

Redshift - How to update an aging column when doing an incremental update of the table

I have a table that updates incrementally once every day. The table has about 10 million records, but I update only the rows that were created or updated in the last 24 hours. This runs perfectly fine; however, I have a problem with one of the columns, which calculates the aging of each sale (record) based on the time the record was created compared against the latest data run time. I would like to know how I could have just this one column updated for all 10 million rows each time the table updates incrementally.
I am using Amazon Redshift as the DB.
Thanks.
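One option, sketched below with hypothetical table and column names (sales for the table, created_at for the creation time, aging_days for the aging column), is to recompute just that one column for every row after each incremental load. Note that Redshift implements UPDATE as delete-plus-insert, so touching all 10 million rows daily has a cost; computing the aging on the fly in a view at query time is often the cheaper alternative.

    -- Recompute the aging column for every row against the current run time.
    UPDATE sales
    SET aging_days = DATEDIFF(day, created_at, GETDATE());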