I am new to databases and need one for a project. My problem is as follows: I have three scripts that write to a Postgres database and another script that updates it. So far I haven't had any issues with that. However, now I also need to read that data at the same time; specifically, I need to read the last minute of data from that database while the writes are happening, and I have another script for that. But when I run this script, I can't see any of the writes from the scripts that are supposed to be writing. Any suggestions?
Chances are your other scripts haven't COMMITted their data yet, which means that their updates aren't yet visible to your queries.
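For example, here is a minimal sketch assuming the writer scripts use Python with psycopg2 and a hypothetical events table (both assumptions, since the question doesn't say what the scripts are written in); the point is only that a write becomes visible to other sessions once the writing transaction commits:

```python
# Minimal sketch: an insert is invisible to other sessions until commit.
# Connection details, table name, and column names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")

with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO events (payload, created_at) VALUES (%s, now())",
        ("some data",),
    )
# The reader script still sees nothing at this point ...
conn.commit()  # ... and sees the row only after this commit.

# Alternatively, enable autocommit so every statement commits immediately:
conn.autocommit = True
```

Once the writers commit (or run with autocommit), the reading script can pull the last minute of data with a filter like created_at > now() - interval '1 minute'.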
I need to process millions of records coming from MongoDB and build an ETL pipeline to insert that data into a PostgreSQL database. However, with every method I've tried, I keep getting an out-of-memory (heap space) exception. Here's what I've already tried:
Tried connecting to MongoDB using tMongoDBInput, adding a tMap to process the records, and outputting them over a PostgreSQL connection. tMap could not handle it.
Tried loading the data into a JSON file first and then reading from that file into PostgreSQL. The data got loaded into the JSON file, but from there on I got the same memory exception.
Tried increasing the RAM for the job in the settings and re-ran the two methods above; still no change.
I specifically wanted to know if there's any way to stream this data or process it in batches to counter the memory issue.
Also, I know that there are some components dealing with BulkDataLoad. Could anyone please confirm whether they would be helpful here, given that I want to process the records before inserting, and if so, point me to the right documentation to get that set up?
Thanks in advance!
Since you have already tried all of those options, the only way I can see to meet this requirement is to break the job down into multiple sub-jobs, or to go with an incremental load based on key columns or date columns, treating this as a one-time activity for now.
Please let me know if it helps.
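To the streaming/batching part of the question: outside of Talend the idea looks roughly like the sketch below, written in Python with pymongo and psycopg2 purely as an illustration (the collection, table, and column names are invented). Each batch is transformed and committed on its own, so memory use stays flat no matter how many records there are.

```python
import psycopg2
from psycopg2.extras import execute_values
from pymongo import MongoClient

BATCH_SIZE = 5000

mongo = MongoClient("mongodb://localhost:27017")   # placeholder connection details
pg = psycopg2.connect("dbname=target user=etl")
cur = pg.cursor()

def transform(doc):
    # Whatever per-record processing the tMap was doing would go here.
    return (str(doc["_id"]), doc.get("name"), doc.get("value"))

def flush(batch):
    execute_values(
        cur,
        "INSERT INTO target_records (mongo_id, name, value) VALUES %s",
        batch,
    )
    pg.commit()          # commit per batch so memory use stays flat

batch = []
# batch_size makes the MongoDB cursor stream documents instead of loading them all.
for doc in mongo["source"]["records"].find({}, batch_size=BATCH_SIZE):
    batch.append(transform(doc))
    if len(batch) >= BATCH_SIZE:
        flush(batch)
        batch.clear()
if batch:                # flush the final partial batch
    flush(batch)

cur.close()
pg.close()
```

Inside Talend the same idea usually translates to limiting the fetch/batch size on the MongoDB input and committing every N rows on the PostgreSQL output component instead of once at the end; check the component documentation for the exact option names.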
I often have to execute complex SQL scripts in a single transaction on a large PostgreSQL database, and I would like to verify everything that was changed during the transaction.
Verifying each single entry on each table "by hand" would take ages.
Dumping the database before and after the script to plain sql and using diff on the dumps isn't really an option since each dump would be about 50G of data.
Is there a way to show all the data that was added, deleted or modified during a single transaction?
What you are looking for is essentially change data capture, one of the most searched-for topics when it comes to databases; you could call it a kind of version control for your data.
But as far as I know, sadly there is no built-in approach for this in PostgreSQL or MySQL. You can work around it by adding triggers for the operations you use most.
You can create some audit schemas and tables to capture the rows that are updated, created, or deleted.
In this way you can achieve what you want. I know this process is fully manual, but it is really effective.
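As a rough illustration of that trigger idea (the audit schema, the audit.changes table, and the orders example table are all invented names; the SQL is run through psycopg2 here only to keep the sketches in one language):

```python
# Sketch of a trigger-based audit trail; all object names are placeholders.
import psycopg2

DDL = """
CREATE SCHEMA IF NOT EXISTS audit;

CREATE TABLE IF NOT EXISTS audit.changes (
    id         bigserial PRIMARY KEY,
    table_name text        NOT NULL,
    operation  text        NOT NULL,
    old_row    jsonb,
    new_row    jsonb,
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION audit.log_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO audit.changes (table_name, operation, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
        RETURN NEW;
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO audit.changes (table_name, operation, old_row, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD), to_jsonb(NEW));
        RETURN NEW;
    ELSE  -- DELETE
        INSERT INTO audit.changes (table_name, operation, old_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- One such trigger per table you want to watch.
-- EXECUTE FUNCTION needs PostgreSQL 11+; use EXECUTE PROCEDURE on older versions.
DROP TRIGGER IF EXISTS orders_audit ON orders;
CREATE TRIGGER orders_audit
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE FUNCTION audit.log_change();
"""

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection details
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```

Every row your script touches then shows up in audit.changes with a timestamp, so you can review exactly what a given transaction created, updated, or deleted.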
If you need to analyze the script's behaviour only sporadically, then the easiest approach would be to change the server configuration parameter log_min_duration_statement to 0 and then back to whatever value it had before the analysis. All of the script's activity will then be written to the server log.
This approach is not suitable if your storage is not prepared to accommodate this amount of data, or for systems in which you don't want sensitive client data to be written to a plain-text log file.
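If you do go this way, the toggle itself is only a couple of statements. A sketch, assuming you can connect as a superuser (ALTER SYSTEM requires superuser rights and cannot run inside a transaction, hence autocommit; connection details are placeholders):

```python
# Temporarily log every statement, then put the parameter back.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("ALTER SYSTEM SET log_min_duration_statement = 0")
cur.execute("SELECT pg_reload_conf()")

# ... run the script under analysis, then restore the previous setting:
cur.execute("ALTER SYSTEM RESET log_min_duration_statement")
cur.execute("SELECT pg_reload_conf()")

conn.close()
```

ALTER SYSTEM RESET just drops the override and falls back to whatever postgresql.conf specifies; if the parameter had previously been set with ALTER SYSTEM to a specific value, SET it back to that value instead.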
I have a scenario where I have a lot of entries in a CSV file that I need to do operations on. The script needs to be able to handle being stopped or failing: it should then continue from where it left off. In a database scenario this would be fairly simple: I would have an "updated" column and update it when the operation for a line has completed. I have looked into whether I could somehow update the CSV on the fly, but I don't think that is possible. I could start using multiple files, but that isn't very elegant. Can anyone recommend some kind of simple file-based, database-like framework where I could, from PowerShell, create a new database file (maybe JSON), read from it, and update it on the fly?
If your problem is really so complex that you actually need something like a local database solution, then consider going with SQLite, which was built for exactly such scenarios.
In your case, since you process the CSV row by row, I assume storing the info for the current row only (line number, status, etc.) will be enough.
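A sketch of that idea, written with Python's built-in sqlite3 just to keep the example short; from PowerShell the same pattern works through a SQLite module such as PSSQLite (or the JSON file you mentioned). The file names and the "last completed line" bookkeeping are assumptions:

```python
# Keep progress in a small SQLite file so a re-run picks up where the last run stopped.
import csv
import sqlite3

state = sqlite3.connect("progress.db")
state.execute("CREATE TABLE IF NOT EXISTS progress (last_line INTEGER NOT NULL)")
row = state.execute("SELECT last_line FROM progress").fetchone()
last_done = row[0] if row else 0

def do_operation(record):
    # The real per-line work goes here.
    print("processing", record)

with open("work.csv", newline="") as f:
    for line_no, record in enumerate(csv.DictReader(f), start=1):
        if line_no <= last_done:
            continue                      # already handled in an earlier run
        do_operation(record)
        # Record completion only after the operation succeeded.
        state.execute("DELETE FROM progress")
        state.execute("INSERT INTO progress (last_line) VALUES (?)", (line_no,))
        state.commit()

state.close()
```

On a restart the script re-reads progress.db and simply skips everything up to the last recorded line.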
I'm trying to find an example or a starting point for a project where I have to restore databases into a test environment. I have a list of 40+ SQL instances, databases, and backup locations, and I would like to use the Restore-SqlDatabase cmdlet but allow only 3 restores to run at a time. To minimize the impact on our network/storage I don't want to initiate all 40+ restores at once. The list of what needs to be restored is contained in a CSV, and when testing I can get the restores to go, but I'm not sure what options I'd have to run only 3 of them at a time.
I used the RunspaceFactory example and modified it to use a script-block to execute Restore-SqlDatabase. I'm sure there may be cleaner or simpler ways of doing this but so far it seems to work.
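The throttling itself is independent of RunspaceFactory; in that approach the limit is simply the maximum size you give the runspace pool when you create it. Just to illustrate the pattern (in Python rather than PowerShell, and with invented CSV column names), "only three at a time" looks like this:

```python
# Pattern sketch only: run the restore work list with at most three tasks in
# flight at once. Column names and the restore() body are placeholders; in the
# real job this is where Restore-SqlDatabase would be invoked.
import csv
from concurrent.futures import ThreadPoolExecutor

def restore(row):
    print(f"restoring {row['Database']} on {row['Instance']} from {row['BackupFile']}")

with open("restores.csv", newline="") as f:
    work = list(csv.DictReader(f))

with ThreadPoolExecutor(max_workers=3) as pool:   # max_workers=3 is the throttle
    list(pool.map(restore, work))
```

On PowerShell 7+ the same effect is available more directly with ForEach-Object -Parallel and -ThrottleLimit 3.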
I am working on a research platform that reads relevant Twitter feeds via the Twitter API and stores them in a PostgreSQL database for future analysis. The middleware is Perl, and the server is an HP ML310 with 8 GB of RAM running Debian Linux.
The problem is that the Twitter feed can be quite large (many entries per second), and I can't afford to wait for the insert to finish before returning to wait for the next tweet. So what I've done is to use fork() so that each tweet gets a new process to do the insert, while the listener returns quickly to grab the next tweet. However, because each of these processes effectively opens a new connection to the PostgreSQL backend, the system never catches up with its Twitter feed.
I am open to a connection-pooling suggestion and/or to upgrading hardware if necessary to make this work, but would appreciate any advice. Is this likely RAM-bound, or are there configuration or software approaches I can try to make the system sufficiently speedy?
If you open and close a new connection for each insert, that is going to hurt big time. You should use a connection pooler instead. Creating a new database connection is not a lightweight thing to do.
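Since the middleware here is Perl, take this only as an illustration of the pattern (in Python with psycopg2's built-in pool; a dedicated pooler such as pgBouncer in front of PostgreSQL achieves the same thing without touching client code): a small set of long-lived connections is borrowed and returned instead of being opened and closed per insert.

```python
# Pattern illustration only: reuse a few long-lived connections.
import psycopg2.pool

pool = psycopg2.pool.SimpleConnectionPool(
    minconn=1,
    maxconn=10,                                   # arbitrary sizes
    dsn="dbname=tweets user=collector",           # placeholder connection details
)

def insert_tweet(text):
    conn = pool.getconn()                         # borrow an existing connection
    try:
        with conn.cursor() as cur:
            cur.execute("INSERT INTO tweets (body) VALUES (%s)", (text,))
        conn.commit()
    finally:
        pool.putconn(conn)                        # return it instead of closing

insert_tweet("hello")
pool.closeall()
```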
Doing a fork() for each insert is probably not such a good idea either. Can't you create one process that simply takes care of the inserts and listens on a socket, or scans a directory or something like that, and another process signalling the insert process (a classic producer/consumer pattern)? Or use some kind of message queue (I don't know Perl, so I can't say what kind of tools are available there).
When doing bulk inserts do them in a single transaction, sending the commit at the end. Do not commit each insert. Another option is to write the rows into a text file and then use COPY to insert them into the database (it doesn't get faster than that).
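Putting those two suggestions together, a rough sketch (Python standing in for the Perl listener; queue, table, and batch size are arbitrary): the listener only enqueues tweets, while a single writer process owns the one database connection and commits every few hundred rows instead of once per insert.

```python
# Producer/consumer with batched commits: the listener only enqueues tweets;
# one writer process holds the single database connection. Names are placeholders.
import multiprocessing as mp
import psycopg2

BATCH_SIZE = 500

def writer(queue):
    conn = psycopg2.connect("dbname=tweets user=collector")
    cur = conn.cursor()
    batch = []
    while True:
        item = queue.get()
        if item is None:                      # sentinel: flush and stop
            break
        batch.append((item,))
        if len(batch) >= BATCH_SIZE:
            cur.executemany("INSERT INTO tweets (body) VALUES (%s)", batch)
            conn.commit()                     # one commit per batch, not per row
            batch.clear()
    if batch:
        cur.executemany("INSERT INTO tweets (body) VALUES (%s)", batch)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    queue = mp.Queue()
    p = mp.Process(target=writer, args=(queue,))
    p.start()

    # The listener (producer) just hands tweets off and goes back to the stream.
    for tweet in ["tweet one", "tweet two"]:   # stand-in for the Twitter stream
        queue.put(tweet)

    queue.put(None)
    p.join()
```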
You can also tune the PostgreSQL server a bit. If you can afford to lose some transactions in case of a system crash, you might want to turn synchronous_commit off.
If you can rebuild the table from scratch at any time (e.g. by re-inserting the tweets), you might also want to make that table an "unlogged" table. It is faster than a regular table for writes, but if Postgres is not shut down cleanly, you lose all the data in the table.
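Both of those tuning options are one-liners. A sketch, assuming a tweets table that can be rebuilt if lost (connection and table names are placeholders):

```python
# Two durability-for-speed trade-offs.
import psycopg2

conn = psycopg2.connect("dbname=tweets user=collector")
cur = conn.cursor()

# Per-session: commits return before the WAL hits disk; a crash can lose the
# last few transactions but never corrupts the database.
cur.execute("SET synchronous_commit TO off")

# One-time, as table owner: unlogged tables skip the WAL, so writes are much
# faster, but the table comes back empty after a crash.
# (ALTER TABLE ... SET UNLOGGED needs PostgreSQL 9.5+; on older versions create
# the table as UNLOGGED from the start.)
cur.execute("ALTER TABLE tweets SET UNLOGGED")
conn.commit()

conn.close()
```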
Use the COPY command.
One script reads Twitter and appends rows to a CSV file on disk.
Another script watches for the CSV file on disk, renames it, and runs a COPY command to load it from that file.
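A sketch of the loader side of that scheme (the file names, the rename convention, and the tweets table are made up):

```python
# CSV-spool idea: atomically rename the file the collector is appending to,
# then bulk-load the renamed file with COPY. All names are placeholders.
import os
import psycopg2

SPOOL = "tweets.csv"
LOADING = "tweets.loading.csv"

if os.path.exists(SPOOL):
    os.replace(SPOOL, LOADING)        # rename so the collector starts a fresh file

    conn = psycopg2.connect("dbname=tweets user=collector")
    with conn, conn.cursor() as cur, open(LOADING, "r", newline="") as f:
        cur.copy_expert(
            "COPY tweets (body, created_at) FROM STDIN WITH (FORMAT csv)",
            f,
        )
    conn.close()
    os.remove(LOADING)                # done; the next run picks up the new spool
```

For this to be safe, the collector has to close its current file and start a new one at the rename boundary, so no rows are appended while COPY is reading the renamed file.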