Multiple scripts accessing common data file in parallel - possible? - perl

I have some Perl scripts on a unix-based server, which access a common text file containing server IPs and login credentials, which are used to login and perform routine operations on those servers. Currently, these scripts are being run manually at different times.
I would like to know that if I cron these scripts to execute at the same time, will it cause any issues with accessing data from the text file (file locking?), since all scripts will essentially be accessing the data file at the same time?
Also, is there a better way to do it (without using a DB, since I can't due to some server restrictions)?

It depends on which kind of access.
There is no problem with reading the data file from multiple processes. If you want to update the data file while it may be read, it's better to do it atomically (e.g. write the new version under a different name, then rename it over the old one).
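The write-new-then-rename idea can be sketched as follows. This is Python for illustration (the same pattern applies in Perl with File::Temp and rename); the function name is illustrative:

```python
import os
import tempfile

def atomic_update(path, new_contents):
    """Replace the file at `path` atomically: concurrent readers see
    either the complete old file or the complete new one, never a
    partially written file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Write the full new version to a temp file in the same directory;
    # rename is only atomic within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(new_contents)
        os.replace(tmp_path, path)  # atomic on POSIX
    except Exception:
        os.unlink(tmp_path)
        raise
```

A reader that already has the old file open simply keeps reading the old version to the end; only subsequent opens see the new contents.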

Related

Working with PowerShell and file based DB operations

I have a scenario where a CSV file lists a lot of files I need to do operations on. The script needs to handle being stopped or failing, and then continue from where it stopped. In a database scenario this would be fairly simple: I would have an "updated" column and set it when the operation for that line completed. I have looked at whether I could somehow update the CSV on the fly, but I don't think that is possible. I could use multiple files, but that's not very elegant. Can anyone recommend some kind of simple file-based DB-like framework, where from PowerShell I could create a new database file (maybe JSON), read from it, and update it on the fly?
If your problem is really complex enough that you need a local database solution, consider going with SQLite, which was built for exactly such scenarios.
In your case, since you process the CSV row by row, I assume storing the info for the current row only (line number, status, etc.) will be enough.

Storing data in array vs text file

My database migration automation script used to require the user to copy the database names into a text file, then the script would read in that text file and know which databases to migrate.
I now have a form where the user selects which databases to migrate, then my script automatically inserts those database names into the text file, then reads in that text file later in the script.
Would it be better practice to move away from the text file all together and just store the data in an array or some other structure?
I'm also using PowerShell.
I'm no expert on this, but I would suggest keeping the text file even if you choose the array- or form-only approach. You can keep it as a sort of log file: you don't have to read from it, but you can write to it, so you can quickly determine which databases were being migrated if an error happens.
In a production environment you probably have more sophisticated logging tools, but I'd keep the file for emergencies when you have to debug.
When the migration finishes and the script determines everything is as it should be, you can clear the text file, or append the date and time and keep it as a quick reference should you later need to know which databases were migrated on a certain date.
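The append-a-timestamped-record idea is tiny in any language; here is a Python sketch (the question uses PowerShell, and the function and file names are illustrative):

```python
from datetime import datetime, timezone

def log_migrations(log_path, databases, status="started"):
    """Append a timestamped record of which databases are being
    migrated, giving a quick reference if something fails."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(log_path, "a") as f:           # append-only: history is kept
        for name in databases:
            f.write(f"{stamp}\t{status}\t{name}\n")
```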

Invoking remotely a script in Google Cloud

Initially I thought of this approach: creating a full web app that would allow uploading the files into the Data Store, then Data Flow would do the ETL by reading the files from the Data Store and putting the transformed data in Cloud SQL, and then have a different section that would allow passing the query to output the results in a CSV file into the Data Store.
However, I want to keep it as simple as possible, so my idea is to create a Perl script in the Cloud that does the ETL, and another script that takes a SQL query as an argument and outputs the results as a CSV file into the Data Store. This script would be invoked remotely. The idea is to execute the script without having to install the whole stack on each client (Google SQL proxy, etc.), just by running a local script that passes the arguments along to the remote script.
Can this be done? If so, how? And in addition to that, does this approach make sense?
Thanks!!

PostgreSQL blocking on too many inserts

I am working on a research platform that reads relevant Twitter feeds via the Twitter API and stores them in a PostgreSQL database for future analysis. Middleware is Perl, and the server is an HP ML310 with 8GB RAM running Debian linux.
The problem is that the Twitter feed can be quite large (many entries per second), and I can't afford to wait for the insert before returning to wait for the next tweet. So what I've done is to use a fork() so that each tweet gets a new process to insert it into the database, letting the listener return quickly to grab the next tweet. However, because each of these processes effectively opens a new connection to the PostgreSQL backend, the system never catches up with its Twitter feed.
I am open to a connection pooling suggestion and/or to upgrading hardware if necessary to make this work, but would appreciate any advice. Is this likely RAM bound, or are there configuration or software approaches I can try to make the system sufficiently speedy?
If you open and close a new connection for each insert, that is going to hurt big time. You should use a connection pooler instead. Creating a new database connection is not a lightweight thing to do.
Doing a fork() for each insert is probably not such a good idea either. Can't you create one process that simply takes care of the inserts, listening on a socket or scanning a directory or something like that, with another process signalling the insert process (a classic producer/consumer pattern)? Or use some kind of message queue (I don't know Perl, so I can't say what tools are available there).
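The producer/consumer shape described above can be sketched as follows. This is a Python illustration (the question is Perl), with threads and a queue standing in for separate processes, and a plain list standing in for the single long-lived database connection:

```python
import queue
import threading

def writer(q, inserted):
    """Single consumer: drains the queue and performs all inserts over
    one long-lived connection (simulated here by appending to a list)."""
    while True:
        tweet = q.get()
        if tweet is None:        # sentinel: producer is finished
            break
        inserted.append(tweet)   # real code: one INSERT per item

def run_pipeline(tweets):
    q = queue.Queue()
    inserted = []
    t = threading.Thread(target=writer, args=(q, inserted))
    t.start()
    for tw in tweets:            # producer: the Twitter listener
        q.put(tw)                # returns immediately; no fork, no connect
    q.put(None)
    t.join()
    return inserted
```

The key property is that the listener never blocks on the database: enqueueing is cheap, and only one process ever holds a connection.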
When doing bulk inserts, do them in a single transaction, sending the commit at the end. Do not commit each insert. Another option is to write the rows to a text file and then use COPY to load them into the database (it doesn't get faster than that).
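A minimal sketch of the single-commit principle, using Python's built-in sqlite3 so the example is self-contained; against PostgreSQL you would do the same over one connection (e.g. a batched insert, or COPY for the fastest path):

```python
import sqlite3

def batch_insert(rows):
    """Insert all rows in one transaction with a single commit at the
    end, rather than committing per row."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tweets (id INTEGER, body TEXT)")
    with db:  # one transaction; commits once when the block exits
        db.executemany("INSERT INTO tweets VALUES (?, ?)", rows)
    count = db.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
    db.close()
    return count
```

Per-row commits force the server to flush to disk for every row; a single commit amortizes that cost over the whole batch.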
You can also tune the PostgreSQL server a bit. If you can afford to lose some transactions in case of a system crash, you might want to turn synchronous_commit off.
If you can rebuild the table from scratch at any time (e.g. by re-inserting the tweets), you might also want to make it an "unlogged" table. It is faster to write to than a regular table, but if Postgres is not shut down cleanly, you lose all the data in it.
Use the COPY command.
One script reads Twitter and appends rows to a CSV file on disk.
Another script watches for the CSV file, renames it, and runs COPY from the renamed file.

Extract Active Directory into SQL database using VBScript

I have written a VBScript to extract data from Active Directory into a record set. I'm now wondering what the most efficient way is to transfer the data into a SQL database.
I'm torn between;
Writing it to an excel file then firing an SSIS package to import it or...
Within the VBScript, iterating through the dataset in memory and submitting 3000+ INSERT commands to the SQL database
Would the latter option result in 3000+ round trips communicating with the database and therefore be the slower of the two options?
Sending an insert row by row is always the slowest option. This is what is known as Row by Agonizing Row or RBAR. You should avoid that if possible and take advantage of set based operations.
Your other option, writing to an intermediate file, is a good one; I agree with @Remou in the comments that you should probably pick CSV rather than Excel if you go this route.
I would propose a third option. You already have the design in VB contained in your VBscript. You should be able to convert this easily to a script component in SSIS. Create an SSIS package, add a DataFlow task, add a Script Component (as a datasource {example here}) to the flow, write your fields out to the output buffer, and then add a sql destination and save yourself the step of writing to an intermediate file. This is also more secure, as you don't have your AD data on disk in plaintext anywhere during the process.
You don't mention how often this will run or if you have to run it within a certain time window, so it isn't clear that performance is even an issue here. "Slow" doesn't mean anything by itself: a process that runs for 30 minutes can be perfectly acceptable if the time window is one hour.
Just write the simplest, most maintainable code you can to get the job done and go from there. If it runs in an acceptable amount of time then you're done. If it doesn't, then at least you have a clean, functioning solution that you can profile and optimize.
If you already have it in a dataset and if it's SQL Server 2008+ create a user defined table type and send the whole dataset in as an atomic unit.
And if you go the SSIS route, I have a post covering Active Directory as an SSIS Data Source.