How can I use tar and tee in PowerShell to do a read once, write many, raw file copy?

I'm using a small laptop to copy video files on location to multiple memory sticks (~8GB).
The copy has to be done without supervision once it's started and has to be fast.
I've identified a serious bottleneck: when making several copies (e.g. 4 sticks from 2 cameras, i.e. 8 transfers of 8 GB each) the multiple reads use a lot of bandwidth, especially since the cameras have USB 2.0 interfaces (two ports) and limited capacity.
If I had Unix I could use something like tar -cf - ... | tee >(tar -xf - -C /stick1) | tar -xf - -C /stick2, etc.,
which means I'd only have to pull one copy (2 x 8 GB) from each camera, over its USB 2.0 interface.
The memory sticks are generally on a hub on the single USB 3.0 interface, which is driven on a different channel, so they write sufficiently fast.
For reasons, I'm stuck using the current Win10 PowerShell.
I'm currently writing the whole command into a string (concatenating the various sources and targets) and then using Invoke-Process to execute the copy while I'm entertaining and buying the rounds in the pub after the shoot (hence the necessity to be AFK).
I can tar cf - | tar xf a single file, but can't seem to get the tee functioning correctly.
I can also successfully use the microSD slot to copy a single camera's card, which is not as physically convenient but is fast for that one camera's recordings; I still have the bandwidth issue on the remaining camera(s). We may end up with 4-5 source cameras at the same time, which means the read once, write many problem remains.
Edit: I've just advanced to playing with Get-Content -raw | tee \stick1\f1 | tee \stick2\f1 | Out-Null. I haven't done timings or file verification yet.
Edit 2: It seems that Get-Content -raw works properly, but the behaviour of PowerShell pipelines violates two of the fundamental commandments of programming: a program shall do one thing and do it well, and thou shalt not mess with the data stream.
For some unknown reason the default (and only) PowerShell pipeline behaviour always modifies the data stream it is supposed to transfer from one command to the next. There doesn't seem to be a -raw option, nor a $session or $global setting I can use, to prevent the mangling.
How do PowerShell people transfer raw binary from one stream out, into the next process?
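(For what it's worth, the closest Windows PowerShell 5.1 seems to get to a raw read is Get-Content -Encoding Byte, which PowerShell 7 renames to -AsByteStream. A minimal sketch with invented paths; note it buffers the whole file in memory, so it only sidesteps the pipeline mangling, not the 8 GB memory cost:)
# Sketch only, paths are made up. -ReadCount 0 returns the whole file as one
# byte array instead of emitting one object per byte through the pipeline.
$bytes = Get-Content -Path 'D:\DCIM\clip.mp4' -Encoding Byte -ReadCount 0
[IO.File]::WriteAllBytes('E:\clip.mp4', $bytes)
[IO.File]::WriteAllBytes('F:\clip.mp4', $bytes)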

Maybe not quite what you want (if you insist on using built-in PowerShell commands), but if you care about speed, use streams and asynchronous Read/Write. PowerShell is a great tool here because it can use any .NET class seamlessly.
The script below can easily be extended to write to more than two destinations and can potentially handle arbitrary streams. You might want to add some error handling via try/catch. You may also experiment with buffered streams and various buffer sizes to optimize the code.
Some references:
FileStream.ReadAsync
FileStream.WriteAsync
CancellationToken
Task.GetAwaiter
-- 2021-12-09 update: Code is modified a little to reflect suggestions from comments.
# $InputPath, $Output1Path, $Output2Path are parameters
[Threading.CancellationTokenSource] $cancellationTokenSource = [Threading.CancellationTokenSource]::new()
[Threading.CancellationToken] $cancellationToken = $cancellationTokenSource.Token
[int] $bufferSize = 64 * 1024

$fileStreamIn   = [IO.FileStream]::new($InputPath, [IO.FileMode]::Open, [IO.FileAccess]::Read, [IO.FileShare]::None, $bufferSize, [IO.FileOptions]::SequentialScan)
$fileStreamOut1 = [IO.FileStream]::new($Output1Path, [IO.FileMode]::CreateNew, [IO.FileAccess]::Write, [IO.FileShare]::None, $bufferSize)
$fileStreamOut2 = [IO.FileStream]::new($Output2Path, [IO.FileMode]::CreateNew, [IO.FileAccess]::Write, [IO.FileShare]::None, $bufferSize)

try {
    # Two buffers: while one chunk is being written to both outputs, the next
    # chunk is already being read into the other buffer (double buffering).
    [Byte[]] $bufferToWriteFrom = [byte[]]::new($bufferSize)
    [Byte[]] $bufferToReadTo    = [byte[]]::new($bufferSize)

    $time = [System.Diagnostics.Stopwatch]::StartNew()
    $bytesRead = $fileStreamIn.Read($bufferToReadTo, 0, $bufferSize)

    while ($bytesRead -gt 0) {
        # Swap buffers: the chunk just read becomes the chunk to write out.
        $bufferToWriteFrom, $bufferToReadTo = $bufferToReadTo, $bufferToWriteFrom

        # Start both writes and the next read concurrently.
        $writeTask1 = $fileStreamOut1.WriteAsync($bufferToWriteFrom, 0, $bytesRead, $cancellationToken)
        $writeTask2 = $fileStreamOut2.WriteAsync($bufferToWriteFrom, 0, $bytesRead, $cancellationToken)
        $readTask   = $fileStreamIn.ReadAsync($bufferToReadTo, 0, $bufferSize, $cancellationToken)

        $writeTask1.Wait()
        $writeTask2.Wait()
        $bytesRead = $readTask.GetAwaiter().GetResult()
    }
    $time.Elapsed.TotalSeconds
}
catch {
    throw $_
}
finally {
    $fileStreamIn.Close()
    $fileStreamOut1.Close()
    $fileStreamOut2.Close()
}
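Not part of the original answer, but here is a rough sketch of how the same double-buffered pattern could be generalised to any number of targets. The parameter names, the 64 KB buffer and the Task.WaitAll usage are my own choices and untested:
# Sketch only: read once, write to N destinations with double buffering.
param([string]$InputPath, [string[]]$OutputPaths)

$bufferSize = 64KB
$in   = [IO.FileStream]::new($InputPath, [IO.FileMode]::Open, [IO.FileAccess]::Read, [IO.FileShare]::Read, $bufferSize, [IO.FileOptions]::SequentialScan)
$outs = foreach ($p in $OutputPaths) {
    [IO.FileStream]::new($p, [IO.FileMode]::CreateNew, [IO.FileAccess]::Write, [IO.FileShare]::None, $bufferSize)
}
try {
    $bufA = [byte[]]::new($bufferSize)
    $bufB = [byte[]]::new($bufferSize)
    $bytesRead = $in.Read($bufA, 0, $bufferSize)
    while ($bytesRead -gt 0) {
        # Write the chunk just read to every target while the next chunk is read.
        $writeTasks = foreach ($o in $outs) { $o.WriteAsync($bufA, 0, $bytesRead) }
        $readTask   = $in.ReadAsync($bufB, 0, $bufferSize)
        [Threading.Tasks.Task]::WaitAll([Threading.Tasks.Task[]]@($writeTasks))
        $bytesRead = $readTask.GetAwaiter().GetResult()
        $bufA, $bufB = $bufB, $bufA   # swap buffers for the next round
    }
}
finally {
    $in.Close()
    foreach ($o in $outs) { $o.Close() }
}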

Related

Is there any limit to the length of text content that a PowerShell variable can hold?

I am storing the content of a text file in a variable like this -
$fileContent=$(Get-Content file1.txt)
Right now file1.txt contains 200 lines only. But if one day the file contains 10 million lines, then will this approach work? Is there any limit to the length of content that a variable can hold in PowerShell?
Get-Content reads the file into memory.
With that said, you'd want to change your approach. PowerShell, being built on top of the .NET Framework, has access to all of its capabilities, so you can use classes such as StreamReader, which reads the file from disk one line at a time, as in the example below.
$file = [System.IO.StreamReader]::new('.\Desktop\adobe_export.reg') # instantiate an instance of StreamReader
while (-not $file.EndOfStream) # loop until end of file
{
    # save this to a variable if needed
    $file.ReadLine() # read/display the current line
    # more code
}
$file.Close()
$file.Dispose()
First of all, you need to understand that a PowerShell variable is a wrapper around a .NET type, so whatever that type can hold is the answer.
Regarding your actual case: you can check the Microsoft docs for whatever GetType() returns to see whether there is a limit for that type, but there is always a memory limit. If you read a lot of data into memory and then return some of it after filtering/transforming/completing/whatever, you are filling memory. Instead, you can avoid assigning anything to a variable and use the pipeline's one-at-a-time processing, so that only the items currently in the pipeline occupy memory. Of course you might need to do more than one complex thing with the same input, each needing its own pipeline; in that case you can either re-read the data or, if it can change between reads and you need a snapshot, copy it to a temporary place first.
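As a small illustration of that streaming approach (file names invented), nothing here is stored in a variable; each line is processed and released as it flows through the pipeline:
# Sketch only: stream line by line instead of loading the file into a variable.
Get-Content .\file1.txt |
    Where-Object { $_ -match 'ERROR' } |
    Set-Content .\errors.txt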

Reading part of a file into a Stream in Powershell

I have some files which are 'offsetted' zip files, in that they have 4 extra bytes at the beginning which must be ignored when extracting them.
I've been using ReadAllBytes/WriteAllBytes (with an offset of 4). That works, but obviously I have to read the file, write it back out, and read it again, which is slow.
I'd prefer to use System.IO.Compression.ZipArchive to read from a Stream loaded from the file (sans the first 4 bytes) - but I cannot figure-out the steps required to do that?
I tried 'Seek' but ZipArchive ignores position
I cannot seem to get Byte Arrays to pass into System.IO.Compression at all...
Ideas?
Finally!
After trying all manner of hoop-jumping, it seems the simplest answer was the right one
$bytes = [system.io.file]::ReadAllBytes("file.zip4")
$ms = New-Object System.IO.MemoryStream -Argumentlist $bytes,4,($bytes.length-4)
$arch = New-Object System.IO.Compression.ZipArchive($ms)
I can then process $arch.Entries and extract things just fine, reading the file once and processing it, instead of reading it, writing 'most' of it back to disc, and reading that file back again!
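As a short usage sketch (the assembly loading and the extraction call are additions for illustration, not part of the answer above):
# In Windows PowerShell 5.1 these assemblies may need loading before the
# ZipArchive above is created.
Add-Type -AssemblyName System.IO.Compression
Add-Type -AssemblyName System.IO.Compression.FileSystem

# List the entries in the archive.
foreach ($entry in $arch.Entries) {
    '{0}  ({1} bytes)' -f $entry.FullName, $entry.Length
}

# Extract the first entry to the current directory (flat file name assumed).
$first = $arch.Entries[0]
[System.IO.Compression.ZipFileExtensions]::ExtractToFile($first, $first.Name, $true)

$arch.Dispose()
$ms.Dispose()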

Benchmarking in BaseX: how to set up

Currently I am an intern at a research group that makes large sets of texts (corpora) searchable. Not only can one search for literal strings, but more importantly it is also possible to look for similar syntactical dependency structures as the given input, without the need of being proficient in any programming language or corpus annotation style. It may be clear that this tool is intended for linguists.
At the start of the project (before I was engaged in it) the tool was limited to rather small corpora (up to 9 million words). The goal is to make large sets of texts searchable as well; we are talking about roughly 500 million words. An attempt has been made that in theory ought to improve speed by reducing the search space (see this paper), but it has not been tested yet. The result of this attempt is a new file structure. Let's call this structure B, compared to a non-processed structure A. We expect B to provide faster results when queried with BaseX.
My question is: what is the best way to test and compare both approaches with a Perl script? Below you find my current script to query BaseX locally. It takes two arguments: a directory that stores files which each contain the XPaths I have selected to benchmark with, and a limit on the number of results to return. When the limit is set to zero, no limit is applied.
Because some parts of the dataset are so incredibly huge, we have divided them into different, equally sized files as well, called treebank parts. They are stored in <tb> tags inside treebankparts.lst.
#!/usr/bin/perl
use warnings;

$| = 1;    # flush every print

# Directory where XPaths are stored
my $directory = shift(@ARGV);
# Set limit. If set to zero all results will be returned
my $limit = shift(@ARGV);

# Create session, connect to BaseX
my $session = Session->new([INFORMATION WITHHELD]);

# List all files in directory
my @xpathfiles = <$directory/*.txt>;

# Read lines of treebank parts into variable
open( my $tfh, "treebankparts.lst" ) or die "cannot open file treebankparts.lst";
chomp( my @tlines = <$tfh> );
close $tfh;

# Loop through all XPaths in $directory
foreach my $xpathfile (@xpathfiles) {
    open( my $xfh, $xpathfile ) or die "cannot open file $xpathfile";
    chomp( my @xlines = <$xfh> );
    close $xfh;

    print STDOUT "File = $xpathfile\n";

    # Loop through lines from XPath file (= XPath query)
    foreach my $xline (@xlines) {
        # Loop through the lines of treebank file
        foreach my $tline (@tlines) {
            my ($treebank) = $tline =~ /<tb>(.+)<\/tb>/;
            QuerySonar( $xline, $treebank );
        }
    }
}
$session->close();

sub QuerySonar {
    my ( $xpath, $db ) = @_;
    print STDOUT "Querying $db for $xpath\n";
    print STDOUT "Limit = $limit\n";

    my $x_limit;
    my $x_resultsofxp = 'declare variable $results := db:open("' . $db . '")/treebank/alpino_ds'
        . $xpath . ';';
    my $x_open       = '<results>';
    my $x_totalcount = '<total>{count($results)}</total>';
    my $x_loopinit   = '{for $node at $limitresults in $results';

    # Spaces are important!
    if ( $limit > 0 ) {
        $x_limit = ' where $limitresults <= ' . $limit . ' ';
    }
    # Comment needed to prevent `Incomplete FLWOR expression`
    else { $x_limit = '(: No limit set :)'; }

    my $x_sentenceinfo = 'let $sentid := ($node/ancestor::alpino_ds/@id)
let $sentence := ($node/ancestor::alpino_ds/sentence)
let $begin := ($node//@begin)
let $idlist := ($node//@id)
let $beginlist := (distinct-values($begin))';

    # Separate sentence info by tab
    my $x_loopexit = 'return <match>{data($sentid)}
{string-join($idlist, "-")}
{string-join($beginlist, "-")}
{data($sentence)}</match>}';
    my $x_close = '</results>';

    # Concatenate all XQuery parts
    my $x_concatquery =
          $x_resultsofxp
        . $x_open
        . $x_totalcount
        . $x_loopinit
        . $x_limit
        . $x_sentenceinfo
        . $x_loopexit
        . $x_close;

    my $querysent   = $session->query($x_concatquery);
    my $basexoutput = $querysent->execute();
    print $basexoutput . "\n\n";
    $querysent->close();
}
(Note that this is a stripped down version and that it may not work as-is. This snippet does not use structure B!)
What happens is: loop through all XPath files, loop through each line in an XPath file, loop through all treebankparts and then execute the sub. The sub then queries BaseX. This comes down to sending a new XQuery to BaseX, and returning the total hits and the results (possibly limited by an argument in the Perl script). So I got that going, but now the question is: how can I improve this script so I can get some benchmarking results out of it.
First of all, I'd start with adding a profiler to this script. I guess that bit is obvious. However, I am not sure how I should start comparing structure A with B. Would I put both queries (to different databases) in separate scripts, then call a profiler on both, and run both scripts a number of times and get a mean value and compare? Or would I run each query by both databases in the same script, almost at the same time?
It is important to consider caching that is happening. Therefore I am not entirely sure what build-up for benchmarking of a database this huge is appropriate. First one script, then the other. Both at the same time. Alternating queries between the two. And so on. There are so many possibilities, but I wonder which would provide the best results. Also, I would repeat the process a couple of times. Would I repeat each query and then continue to the next, or finish all XPath files, and then repeat the whole process again?
(Reading the description of the benchmark-tag I am confident that this - albeit elaborate - post is suited for SO.)
One possible improvement: minimize the number of times you transfer control from Perl to the database -- just as you have minimized the number of database connections. (Or at least set yourself up to measure the cost of the transfer of control.) I suspect you will get significantly better results if you move your loop into XQuery rather than running the loop in Perl.
A single call to a database management system asking it to perform 1000 searches is likely to be somewhat faster than 1000 calls to the DBMS each requesting a single search. The first involves two context switches: one from your script or bash to the DBMS, and one back; the second involves 2000. The last time I measured something like this carefully, each context switch cost something like 500 ms; it mounted up fast. (That said, this was a long time ago, with a different database. But it was surprising [and sobering] to learn that the difference between the two query formulations I was trying to compare was dwarfed by the difference between running the test loop in a script or inside the DBMS.)
A second suggestion: From what you say, the size of the database and the result sets seem likely to ensure that caching between runs doesn't have a big effect on the results. But this seems to be a testable assertion, and one worth testing. So set up your A and B scripts, and then do a trial run: does for runcount in 1 2 3 4 5; do perl A.pl; perl B.pl; done produce results comparable to for runcount in 1 2 3 4 5; do perl A.pl; done; for runcount in 1 2 3 4 5; do perl B.pl; done? If they are comparable, then you have reason to believe it doesn't matter if you run A and B separately or in alternation. If they are not comparable, then you know it does matter, which would be very valuable information. Other things being equal, I would expect caching to produce lower times when running one query several times before moving on to the next query, and cache misses to produce higher times if running each query just once. Probably worth measuring.
In the same spirit, I would recommend that you run tests both with the loop in the Perl script and with the loop in an XQuery query.
A third suggestion: in practice, a query at the corpus query interface will involve several stages, each with measurable time: transmission of the query from the user's browser (assuming it's a Web interface) to the server, translation of the request into a form suitable for transmission to the back end dbms (here BaseX), context switch to BaseX, processing within BaseX, context switch back, handling by web server, transmission to user. It would be useful to have at least rough estimates of the times involved in each of these steps, or at least of the time taken for everything-but-BaseX.
So if it were me running the tests, I think I'd also prepare a set of vacuous XQuery tests, along the lines of
2 + 3
or just
42
to push the BaseX time as close to zero as possible; the measured time between user initiation of the query and display of response is the per-query overhead. (Interesting question: should one use many different trivial expressions to prevent caching of results, or should one use the same expression over and over, to encourage caching of the result? How can we try to ensure that BaseX will cache the result, but the Web server won't? ...)
A final suggestion: remember that other people who need to do benchmarking will often have the same questions as you do. This means that you can reformulate every question of the form "Should I do X or Y?" into the form "What measurable effect does the difference between X and Y have on the results of a benchmarking test?" Run some tests to try to measure that effect, and write them up. (I always find it makes it more interesting if I force myself to make a prediction after formulating the question but before measuring the difference.)
There are several things to separate here. The first issue is that BaseX performance should not be confused with that of your Perl script, which seems simply to construct an XQuery (not an XPath, as you suggested in your question and tags). So for testing I would suggest using some predefined XQueries suited to your real-world scenarios, since the cost of constructing the XQuery should be negligible. How you pass your query to BaseX, whether via the Perl API or any other means, should not be relevant. Even if your Perl performance is relevant, you should test it separately.
Hence your original question, whether you should put both scenarios in the same script or not, is no longer relevant. Instead, simply execute the two separate XQueries for scenarios A and B by themselves, without the Perl script.
You are partly correct to worry about caching; however, it is most likely the Java JIT compiler that matters here (BaseX is written in Java and the JIT does the caching), not BaseX itself. You should therefore use the client/server infrastructure, keep a long-running server, and warm it up before taking performance measurements.
Regarding performance: The BaseX GUI and also the command line already have an included measurement (using command line you can set -V to get run times for parsing, compiling, evaluating and printing). Also, using the -r parameter you can execute a query multiple times and it will give you the average execution times.
In general, if you want to improve the performance of your script you should take a look at the query plan and the optimized query and check whether the appropriate indexes are used. Also, our new Selective Indexing might be very useful to you. If the index isn't used, your query will definitely not perform well for 500 million words.
Full disclosure: I am with the BaseX team, and you might get better help on the BaseX mailing list, or might want to reference this question there, as our head architect isn't watching SO as regularly as the ML.

How to read a large number of LDAP entries in perl?

I already have an LDAP script to read LDAP user information one by one. My problem is that I am returning all users found in Active Directory, which will not work because our AD currently has around 100,000 users, causing the script to crash due to memory limitations.
What I was thinking of doing was to try to process users by batches of X amount of users and if possible, using threads in order to process some users in parallel. The only thing is that I have just started using Perl, so I was wondering if anyone could give me a general idea of how to do this.
If you can get the executable ldapsearch to work in your environment (and it does work in *nix and Windows, although the syntax is often different), you can try something like this:
my $LDAP_SEARCH = "ldapsearch -h $LDAP_SERVER -p $LDAP_PORT -b $BASE -D uid=$LDAP_USERNAME -w $LDAP_PASSWORD -LLL";
my @LDAP_FIELDS = qw(uid mail Manager telephoneNumber CostCenter NTLogin displayName);
open (LDAP, "-|:utf8", "$LDAP_SEARCH \"$FILTER\" " . join(" ", @LDAP_FIELDS));
while (<LDAP>) {
    # process each LDAP response
}
I use that to read nearly 100K LDAP entries without memory problems (although it still takes 30 minutes or more). You'll need to define $FILTER (or leave it blank) and of course all the LDAP server/username/password pieces.
If you want/need to do a more pure-Perl version, I've had better luck with Net::LDAP instead of Net::LDAP::Express, especially for large queries.

In Windows PowerShell, How Can You Set the Maximum CPU % for the Script to Use?

I am looking to limit the percentage of the CPU time used by a PowerShell process to a certain number -- for the sake of argument, let's imagine it to be 13%.
Other options that are not precisely what I need:
1) Setting priority.
2) Setting CPU affinity.
Basically, we have monitoring software which complains if the total CPU usage gets too high. We have a daily process that sets this off, mostly harmlessly, but with too many false positives in a monitoring system people become inured to warnings/errors, which we do not want.
The process also gets lsass.exe very excited as it runs, and other processes are affected as well.
I do not know PowerShell and am attempting to fix Somebody Else's PowerShell. Obviously, a ground-up rewrite would be nice at some future point, but for now, bells are ringing and annoying people.
What you're asking for isn't really possible. The Windows kernel is in charge of scheduling the CPU -- and rightfully so. (I for one don't want to return to real-mode DOS).
The best you can do is insert a long Sleep() in between each line of the script. But there's no guarantee that any particular Powershell cmdlet / function / etc will throttle itself the way you want. The Windows API calls that ultimately execute each statement's dirty work certainly won't.
Ironically, Raymond touched on this topic just a few days ago: http://blogs.msdn.com/oldnewthing/archive/2009/07/27/9849503.aspx
My real suggestion is to modify your script so it looks like:
try {
    Stop-CpuMonitor
    # ...the current script contents...
}
finally {
    Start-CpuMonitor
}
From my experience, there's no way to stipulate what percentage of the CPU PowerShell will get to use. I think the best course of action would be to set PowerShell's priority to Low to allow other tasks/programs to go first. The first post has a decent suggestion of using pauses (I'd upvote, but I'm below the 15 reputation points needed), and that's something you might look into, but I don't think it will give you the control you're looking for. Sorry my answer is more of a suggestion than a resolution.
/matt
I sincerely doubt there is a particularly simple way to accomplish what you are asking. If CPU concern is a big deal, I would combine a couple of tactics and let the scheduler take care of the rest - it is actually pretty good at managing the load.
The suggestion to use ThreadPriority is a bit problematic, as PowerShell will spawn each new command in a new thread; you could get around that by having everything encapsulated on a single line, in a cmdlet, or in a function of some sort. Better to send the whole PowerShell process to idle:
Get-Process -name powershell | foreach { $_.PriorityClass = "Idle" }
Note: that will send ALL powershell instances to idle, which may not be the desired effect. And certainly doesn't prevent a script from boosting its own priority either.
Also, as mentioned here - littering your code with sleep commands can be an efficient way to ensure other processes have ample CPU time to process their code. Even 50-100ms is almost an eternity to a processor.
[System.Threading.Thread]::Sleep(50)
Between these two tactics your script will run when it is available, and graciously bow to other CPU demands as they arise.
While I know of no way to actually limit the usage of any process to a particular amount of CPU consumption by percentage, I figured maybe one could alter the priority of the PowerShell thread. So I wrote a quick script and ran a test.
The result was that in both "Normal" and "Lowest", the time taken was about the same. Looking at the CPU meter widget, the CPU usage taken was approximately 27%, but I was running on a quad core box, so that's not surprising. It was taking all of one CPU in both cases. My machine wasn't doing much else at the time, so that was the other 2%, I guess.
Perhaps the results will vary on a busier machine.
Here's the script:
function Spin
{
    param($iterations)
    for ($i = 0; $i -lt $iterations; ++$i)
    {
        # Do nothing
    }
}

$thread = [System.Threading.Thread]::CurrentThread
Write-Host $thread.Priority

$start = [DateTime]::Now
Write-Host "[$start] Start"
Spin 10000000
$end = [DateTime]::Now
Write-Host "[$end] End"
$span = $end - $start
Write-Host "Time: $span"

$thread.Priority = "Lowest"
Write-Host $thread.Priority

$start = [DateTime]::Now
Write-Host "[$start] Start"
Spin 10000000
$end = [DateTime]::Now
Write-Host "[$end] End"
$span = $end - $start
Write-Host "Time: $span"

$thread.Priority = "Normal"
And here's the result:
Normal
[08/06/2009 08:12:38] Start
[08/06/2009 08:12:55] End
Time: 00:00:16.7760000
Lowest
[08/06/2009 08:12:55] Start
[08/06/2009 08:13:11] End
Time: 00:00:16.8570000
Notice also that in the documentation for Thread.Priority it states:
Operating systems are not required to honor the priority of a thread.
One possibility may be to use Process Lasso. I have had nothing but positive results with its automatic throttling of processes (those I haven't excluded) when CPU usage surpasses a certain percentage.
Keep in mind that there is a fairly limited free version you can use to test whether it works for you.
PS: I am not working for them; I'm just offering a program that has worked for me in the past.
Use Add-Type on a C# source literal which in turn uses P/Invoke to call the Win32 functions CreateJobObject, SetInformationJobObject (passing a JOBOBJECT_CPU_RATE_CONTROL_INFORMATION) and AssignProcessToJobObject. It will probably be easier to write a wrapper program in C that sets the limits and spawns the PowerShell script inside such a job.
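A rough, untested sketch of what that Add-Type approach might look like follows. It assumes Windows 8 / Server 2012 or later (needed for CPU rate control and nested job objects) and that the script caps its own process early on; the class and member names are my own, not from any library.
# Sketch only: hard-cap the current PowerShell process at 13% CPU via a job object.
$source = @'
using System;
using System.Runtime.InteropServices;

public static class CpuCap
{
    [StructLayout(LayoutKind.Sequential)]
    public struct JOBOBJECT_CPU_RATE_CONTROL_INFORMATION
    {
        public uint ControlFlags;
        public uint CpuRate;   // units of 1/100 of a percent of total CPU
    }

    const int  JobObjectCpuRateControlInformation  = 15;
    const uint JOB_OBJECT_CPU_RATE_CONTROL_ENABLE   = 0x1;
    const uint JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP = 0x4;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateJobObject(IntPtr attrs, string name);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool SetInformationJobObject(IntPtr job, int infoClass,
        ref JOBOBJECT_CPU_RATE_CONTROL_INFORMATION info, int size);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool AssignProcessToJobObject(IntPtr job, IntPtr process);

    // Cap the given process (and its children) at 'percent' of total CPU.
    // May fail if the process is already in a job that disallows nesting.
    public static void Limit(IntPtr processHandle, int percent)
    {
        IntPtr job = CreateJobObject(IntPtr.Zero, null);
        var info = new JOBOBJECT_CPU_RATE_CONTROL_INFORMATION
        {
            ControlFlags = JOB_OBJECT_CPU_RATE_CONTROL_ENABLE | JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP,
            CpuRate      = (uint)(percent * 100)
        };
        if (!SetInformationJobObject(job, JobObjectCpuRateControlInformation, ref info,
                Marshal.SizeOf(typeof(JOBOBJECT_CPU_RATE_CONTROL_INFORMATION))))
            throw new System.ComponentModel.Win32Exception();
        if (!AssignProcessToJobObject(job, processHandle))
            throw new System.ComponentModel.Win32Exception();
    }
}
'@
Add-Type -TypeDefinition $source

# Cap this PowerShell process at 13% of total CPU.
[CpuCap]::Limit((Get-Process -Id $PID).Handle, 13)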