PowerShell to change the sample rate of a CSV file from a data logger

I have CSV data from a data logger that collected info every second rather than every 15 minutes, so I made a script to export every 900th entry. The script works for smaller CSV files (up to 80 MB), but I have one file at 3.6 GB and it doesn't work.
I looked online and found methods that are supposed to be faster, but I don't have .NET experience and haven't been able to get StreamReader to work.
Here is the script:
$file = Import-Csv z:\csv\input_file.csv -Header A,B,C,D,E,F
$counter = 0
ForEach ($item in $file)
{
    $counter++
    If ($counter -lt 900)
    {
    }
    Else {
        Write-Output "$item" | Out-File "z:\csv\output_file.csv" -Append
        $counter = 0
    }
}
Any ideas/optimizations are greatly appreciated.
Thanks.

You can skip parsing it as CSV and just read it as text, then loop through in steps of 900 and output those lines.
$file = Get-Content z:\csv\input_file.csv -ReadCount 1000
For ($i = 0; $i -le $file.count; $i = $i + 900) {
    $file[$i] | Add-Content z:\csv\output_file.csv
}
I'm sure there are probably other optimizations that could be made, but that's a simple way to speed things up.
Edit: OK, so -ReadCount behaves a little differently than I had anticipated. When set to a number other than 0 or 1, it creates an array of arrays of strings, basically [array[string[]]]. At that point there are two options: either use -ReadCount 0 to read the entire file at once, or better yet, read in 900 lines at a time, output only the first line of each set, and pass that directly down the pipe to Set-Content.
Get-Content z:\csv\input_file.csv -ReadCount 900 | %{$_[0]} | Set-Content z:\csv\output_file.csv
So that will read the file into memory 900 lines at a time, and then pass only the first line from each series down the pipe, and output that to the output file.
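If the pipeline is still too slow for the 3.6 GB file, a lower-level approach with the .NET stream classes the question mentions may help. Here is a minimal sketch, assuming the paths from the question and the same every-900th-line behavior as the original script:
# Sketch: stream the file and copy every 900th line, without loading it all.
$reader = New-Object System.IO.StreamReader 'z:\csv\input_file.csv'
$writer = New-Object System.IO.StreamWriter 'z:\csv\output_file.csv'
try {
    $counter = 0
    while ($null -ne ($line = $reader.ReadLine())) {
        $counter++
        if ($counter -eq 900) {
            $writer.WriteLine($line)
            $counter = 0
        }
    }
}
finally {
    # Always close both streams, even if an error occurs.
    $reader.Dispose()
    $writer.Dispose()
}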

How to create a combined file from two text files in PowerShell?

How can I create a new file with interleaved lines from two different text files, like this:
First line from text file A
First line from text file B
Second line from text file A
Second line from text file B
...
For a solution that:
keeps memory use constant (doesn't load the files into memory in full up front), and
performs acceptably with larger files,
direct use of .NET APIs is needed:
# Input files, assumed to be in the current directory.
# Important: always use FULL paths when calling .NET methods.
$dir = $PWD.ProviderPath
$fileA = [System.IO.File]::ReadLines("$dir/fileA.txt").GetEnumerator()
$fileB = [System.IO.File]::ReadLines("$dir/fileB.txt").GetEnumerator()
# Create the output file.
$fileOut = [System.IO.File]::CreateText("$dir/merged.txt")
# Iterate over the files' lines in tandem, and write each pair
# to the output file.
while ($fileA.MoveNext(), $fileB.MoveNext() -contains $true) {
    if ($null -ne $fileA.Current) { $fileOut.WriteLine($fileA.Current) }
    if ($null -ne $fileB.Current) { $fileOut.WriteLine($fileB.Current) }
}
# Dispose of (close) the files.
$fileA.Dispose(); $fileB.Dispose(); $fileOut.Dispose()
Note: .NET APIs use UTF-8 by default, but you can pass the desired encoding, if needed.
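For instance, a sketch of the same approach with an explicit encoding (Windows-1252 is assumed here purely for illustration); the loop and disposal code above stay the same:
# Sketch: read and write with an explicit encoding instead of the UTF-8 default.
$enc = [System.Text.Encoding]::GetEncoding(1252)   # assumed encoding
$dir = $PWD.ProviderPath
$fileA = [System.IO.File]::ReadLines("$dir/fileA.txt", $enc).GetEnumerator()
$fileB = [System.IO.File]::ReadLines("$dir/fileB.txt", $enc).GetEnumerator()
# File.CreateText() has no encoding parameter, so use a StreamWriter instead.
$fileOut = [System.IO.StreamWriter]::new("$dir/merged.txt", $false, $enc)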
See also: The relevant .NET API help topics:
System.IO.File.ReadLines
System.IO.File.CreateText
A solution that uses only PowerShell features:
Note: Using PowerShell-only features, you can lazily enumerate only one file's lines at a time, so the other file must be read into memory in full.
(However, you could again use a lazy enumerable via the .NET API, i.e. [System.IO.File]::ReadLines() as shown above, or read both files into memory in full up front.)
The key to acceptable performance is to only have one Set-Content call (plus possibly one Add-Content call) which processes all output lines.
However, given that Get-Content (without -Raw) is quite slow itself, due to decorating each line read with additional properties, the solution based on .NET APIs will perform noticeably better.
# Read the 2nd file into an array of lines up front.
# Note: -ReadCount 0 greatly speeds up reading, by returning
# the lines directly as a single array.
$fileBLines = Get-Content fileB.txt -ReadCount 0
$i = 0 # Initialize the index into array $fileBLines.
# Lazily enumerate the lines of file A.
Get-Content fileA.txt | ForEach-Object {
    $_ # Output the line from file A.
    # If file B hasn't run out of lines yet, output the corresponding file B line.
    if ($i -lt $fileBLines.Count) { $fileBLines[$i++] }
} | Set-Content Merged.txt
# If file B still has lines left, append them now:
if ($i -lt $fileBLines.Count) {
    Add-Content Merged.txt -Value $fileBLines[$i..($fileBLines.Count-1)]
}
Note: Windows PowerShell's Set-Content cmdlet defaults to "ANSI" encoding, whereas PowerShell (Core) (v6+) uses BOM-less UTF-8; use the -Encoding parameter as needed.
$file1content = Get-Content -Path "IN_1.txt"
$file2content = Get-Content -Path "IN_2.txt"
$filesLength = @($file1content.Length, $file2content.Length)
for ($i = 0; $i -lt ($filesLength | Measure-Object -Maximum).Maximum; $i++) {
    Add-Content -Path "OUT.txt" $file1content[$i]
    Add-Content -Path "OUT.txt" $file2content[$i]
}

PowerShell to Break up a CSV by Number of Rows

So I am now tasked with handling recurring reports that are more than 1 million lines long.
My last question did not explain everything, so I'm trying to ask a better one here.
I'm getting a dozen or more daily reports that come in as CSV files. I don't know in advance what the headers are or anything like that.
They are huge; I can't open them in Excel.
I basically want to break each one up into smaller copies of the same report, each maybe 100,000 lines long.
The code I wrote below does not work as I keep getting a
Exception of type 'System.OutOfMemoryException' was thrown.
I am guessing I need a better way to do this.
I just need this file broken down to a more manageable size.
It does not matter how long it takes as I can run it over night.
I found this on the internet, and I tried to manipulate it, but I can't get it to work.
$PSScriptRoot
write-host $PSScriptRoot
$loc = $PSScriptRoot
$location = $loc
# how many rows per CSV?
$rowsMax = 10000;
# Get all CSV under current folder
$allCSVs = Get-ChildItem "$location\Split.csv"
# Read and split all of them
$allCSVs | ForEach-Object {
    Write-Host $_.Name;
    $content = Import-Csv "$location\Split.csv"
    $insertLocation = ($_.Name.Length - 4);
    for ($i = 1; $i -le $content.length; $i += $rowsMax) {
        $newName = $_.Name.Insert($insertLocation, "splitted_" + $i)
        $content | select -first $i | select -last $rowsMax | convertto-csv -NoTypeInformation | % { $_ -replace '"', "" } | out-file $location\$newName -fo -en ascii
    }
}
The key is not to read large files into memory in full, which is what you're doing by capturing the output from Import-Csv in a variable ($content = Import-Csv "$location\Split.csv").
That said, while using a single pipeline would solve your memory problem, performance will likely be poor, because you're converting from and back to CSV, which incurs a lot of overhead.
Even reading and writing the files as text with Get-Content and Set-Content is slow, however.
Therefore, I suggest a .NET-based approach for processing the files as text, which should substantially speed up processing.
The following code demonstrates this technique:
Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {
    $csvFile = $_.FullName
    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., "...\file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'
    # Set how many lines make up a chunk.
    $chunkLineCount = 10000
    # Read the file lazily and save every chunk of $chunkLineCount
    # lines to a new file.
    $i = 0; $chunkNdx = 0
    foreach ($line in [IO.File]::ReadLines($csvFile)) {
        if ($i -eq 0) { ++$i; $header = $line; continue } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
            # Close previous file, if any.
            if (++$chunkNdx -gt 1) { $fileWriter.Dispose() }
            # Construct the file path for the next chunk, by
            # instantiating the template with the next sequence number.
            $csvFileChunk = $csvFileChunkTemplate -f $chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"
            # Create the next chunk file and write the header.
            $fileWriter = [IO.File]::CreateText($csvFileChunk)
            $fileWriter.WriteLine($header)
        }
        # Write a data row to the current chunk file.
        $fileWriter.WriteLine($line)
    }
    $fileWriter.Dispose() # Close the last file.
}
Note that the above code creates BOM-less UTF-8 files; if your input contains ASCII-range characters only, these files will effectively be ASCII files.
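If you need a specific encoding for the chunk files, one possible variation (ASCII chosen only as an example) is to replace the [IO.File]::CreateText() call with a StreamWriter that takes an explicit encoding. A sketch, using the $csvFileChunk and $header variables from the loop above:
# Sketch: create a chunk file with an explicit encoding instead of
# CreateText()'s BOM-less UTF-8 default.
$enc = [System.Text.Encoding]::ASCII
$fileWriter = [System.IO.StreamWriter]::new($csvFileChunk, $false, $enc)
$fileWriter.WriteLine($header)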
Here's the equivalent single-pipeline solution, which is likely to be substantially slower.
Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {
    $csvFile = $_.FullName
    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., ".../file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'
    # Set how many lines make up a chunk.
    $chunkLineCount = 10000
    $i = 0; $chunkNdx = 0
    Get-Content -LiteralPath $csvFile | ForEach-Object {
        if ($i -eq 0) { ++$i; $header = $_; return } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
            # Construct the file path for the next chunk.
            $csvFileChunk = $csvFileChunkTemplate -f ++$chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"
            # Create the next chunk file and write the header.
            Set-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $header
        }
        # Write data row to the current chunk file.
        Add-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $_
    }
}
Another option, from the Linux world, is the split command. To get it on Windows, just install Git Bash; then you'll be able to use many Linux tools from CMD/PowerShell.
Below is the syntax to achieve your goal:
split -l 100000 --numeric-suffixes --suffix-length 3 --additional-suffix=.csv sourceFile.csv outputfile
It's very fast. If you want, you can wrap split.exe in a function of your own; see the sketch below.
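A minimal sketch of such a wrapper, assuming split.exe is on your PATH (the function name and parameters here are made up for illustration):
function Split-LargeCsv {
    param(
        [Parameter(Mandatory)] [string] $Path,   # source CSV
        [int]    $LineCount = 100000,            # lines per chunk file
        [string] $OutputPrefix = 'outputfile'    # prefix for the chunk files
    )
    # Delegates to GNU split, as installed by Git Bash.
    split -l $LineCount --numeric-suffixes --suffix-length 3 --additional-suffix=.csv $Path $OutputPrefix
}

# Example call (hypothetical file name):
# Split-LargeCsv -Path .\sourceFile.csv -OutputPrefix report_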

Powershell: Some simple numbering

So I wanted to get a simple "this is where we're up to" numbering to show in the host when running a powershell script. I created the below to mess around with this:
$musicList = Get-ChildItem "C:\Users\rlyons\Music" -Recurse -File | Select-Object FullName
$outFile = @{}
$num = 1
foreach ($sound in $musicList) {
    $file = $sound.FullName.ToString()
    $md5 = Get-FileHash $file -Algorithm MD5
    $outFile.$file += $md5.Hash
    Write-Output "Processed file $num"
    $num++
}
This works great! Except it gives me:
Processed file 1
Processed file 2
Processed file 3
Processed file 4
Processed file 5
and so on. What I wanted was for the screen to clear every time a file was processed, so that all I saw was the number at the end of the line changing.
I tried adding the good old Clear-Host after $num++ and also at the start of the foreach loop, but all I get is a flickering line, and sometimes something partially legible. Yes, this is down to how fast the files are being processed, but I was wondering if there is a trick to this: keeping the "Processed file" text continuously on screen while the number increments upward.
Even better would be if it didn't clear the screen at all, so you could still see all previously written output but have this one line refresh. Is that possible?
Note: I'm trying this in VS Code, Powershell 5.1
Per the comments, Write-Progress is a much better fit for this, because it gives you the output you want without affecting the actual output stream of the script (e.g. what is returned to any potential follow-on command you might write).
Here's how you might want to use Write-Progress that will also show the percentage of completion:
$musicList = Get-ChildItem "C:\Users\rlyons\Music" -Recurse -File | Select-Object FullName
$outFile = @{}
$num = 1
foreach ($sound in $musicList) {
    $file = $sound.FullName.ToString()
    $md5 = Get-FileHash $file -Algorithm MD5
    $outFile.$file += $md5.Hash
    Write-Progress -Activity "Processed file $num of $($musicList.count)" -PercentComplete ($num / $musicList.count * 100)
    $num++
}
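A possible refinement, not part of the original answer: keep -Activity constant, put the changing text in -Status, and clear the progress bar with -Completed once the loop ends (both are standard Write-Progress parameters). A sketch, reusing $musicList from above:
# Sketch: constant activity text, changing status, explicit completion.
$total = $musicList.Count
$num = 1
foreach ($sound in $musicList) {
    # ... hash the file as in the loop above ...
    Write-Progress -Activity "Hashing music files" -Status "Processed file $num of $total" -PercentComplete ($num / $total * 100)
    $num++
}
Write-Progress -Activity "Hashing music files" -Completed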

Powershell/Batch Files: Verify that a file contains at least one entry from a list of strings

Here is my current issue: I have a list of 1,800 customer numbers (e.g. 123456789). I need to determine which of these numbers show up in another, much larger (4 GB) file. The larger file is a fixed-width file of all customer information. I know how I would do this in SQL, but as I said, it's a flat file.
When searching for individual numbers, I was using a command I found elsewhere on this site which worked very well:
get-content CUSTOMERINFO.txt -ReadCount 1000 | foreach { $_ -match "123456789" }
However, I do not have the expertise to translate this into another command, or a batch file, which would load list.txt and search all lines in customerinfo.txt for the requisite strings.
Time is not a major constraint, as this is running on a test server and will be a once-off project.
Thank you very much for any help you can provide.
I appreciate everyone's help; everybody gave me helpful info that led me to my final solution. Special thanks to the person who asked whether this was a code-writing request, because it made me realize I needed to just write some code.
For anyone else who runs into the same problem, here is the code I ended up using:
# Note: $matches is an automatic variable in PowerShell, so use a different name for the list.
$searchList = Get-Content .\list.txt
foreach ($entry in $searchList) {
    $results = Get-Content FiletoSearch -ReadCount 1000 | foreach { $_ -match $entry }
    if ($null -eq $results) {
        $entry
    }
    else {
        "found"
    }
}
This gives a 'found' entry for everything that was found (which is information I don't need), and gives back the value searched for when it's not found (which is information I do need).
The -match operator can test for several values at once: separate them with a bar (|) character, which regex treats as alternation.
e.g.
get-content CUSTOMERINFO.txt -ReadCount 1000 | foreach { $_ -match "DEF|YZ" }
You can also read the contents of a file and join its lines with a character of your choice. So if list.txt is a list of values to search for, such as
DEF
XY
Then you can read it and convert it to a bar-separated list using the join operator:
(Get-Content list.txt) -join "|"
Put them together and you should have your solution:
$listSearch = (Get-Content list.txt) -join "|";
get-content CUSTOMERINFO.txt -ReadCount 1000 | foreach { $_ -match $listSearch}
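One caveat, added here rather than in the original answer: if any values in list.txt contain regex metacharacters, escape them before joining so they are matched literally. A sketch:
# Sketch: escape each value, then build the alternation pattern.
$listSearch = (Get-Content list.txt | ForEach-Object { [regex]::Escape($_) }) -join '|'
Get-Content CUSTOMERINFO.txt -ReadCount 1000 | ForEach-Object { $_ -match $listSearch }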

Get all lines containing a string in a huge text file - as fast as possible?

In PowerShell, how can I read a huge text file (about 200,000 lines / 30 MB) and get, as fast as possible, the last line (or all of the lines) that contains a specific string?
I'm using :
get-content myfile.txt | select-string -pattern "my_string" -encoding ASCII | select -last 1
But it's very slow (about 16-18 seconds).
I did tests without the last pipe (select -last 1), but it takes the same time.
Is there a faster way to get the last occurrence (or all occurrences) of a specific string in a huge file?
Perhaps this is simply the time it needs...
Or is there any possibility of reading the file from the end, since I want the last occurrence?
Thanks
Try this:
get-content myfile.txt -ReadCount 1000 |
foreach { $_ -match "my_string" }
That will read your file in chunks of 1000 records at a time, and find the matches in each chunk. This gives you better performance because you aren't wasting a lot of cpu time on memory management, since there's only 1000 lines at a time in the pipeline.
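To get only the last occurrence, as the question asks, the same approach can simply feed Select-Object -Last 1 (a small sketch):
# Sketch: same chunked scan, keeping only the last matching line.
Get-Content myfile.txt -ReadCount 1000 |
    foreach { $_ -match "my_string" } |
    Select-Object -Last 1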
Have you tried:
gc myfile.txt | % { if($_ -match "my_string") {write-host $_}}
Or, you can create a "grep"-like function:
function grep($f,$s) {
gc $f | % {if($_ -match $s){write-host $_}}
}
Then you can just issue: grep myfile.txt "my_string"
$reader = New-Object System.IO.StreamReader("myfile.txt")
$lines = @()
if ($reader -ne $null) {
    while (!$reader.EndOfStream) {
        $line = $reader.ReadLine()
        if ($line.Contains("my_string")) {
            $lines += $line
        }
    }
    $reader.Dispose() # Close the file when done.
}
$lines | Select-Object -Last 1
Have you tried using [System.IO.File]::ReadAllLines();? This method is more "raw" than the PowerShell-esque method, since we're plugging directly into the Microsoft .NET Framework types.
# Note: pass a full path; .NET methods resolve relative paths against the
# process working directory, not PowerShell's current location.
$Lines = [System.IO.File]::ReadAllLines((Convert-Path myfile.txt))
[Regex]::Matches(($Lines -join "`n"), 'my_string_pattern')
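Note that ReadAllLines loads the entire file into memory; for very large files, a lazy alternative (a sketch using [System.IO.File]::ReadLines, which streams the lines through the pipeline) is:
# Sketch: stream the lines lazily and keep only the last match.
[System.IO.File]::ReadLines((Convert-Path myfile.txt)) |
    Where-Object { $_ -match 'my_string' } |
    Select-Object -Last 1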
I wanted to extract the lines that contained "failed" and also write these lines to a new file; here is the full command for this:
get-content log.txt -ReadCount 1000 |
    foreach { $_ -match "failed" } | Out-File C:\failes.txt