Powershell .csv merge with column remove - powershell

Using the code below I am able to merge several .csv files in 5 seconds.
$getFirstLine = $true
get-childItem "C:\my\dir\*.csv" | foreach {
$filePath = $_
$lines = Get-Content $filePath
$linesToWrite = switch($getFirstLine) {
$true {$lines}
$false {$lines | Select -Skip 1}
}
$getFirstLine = $false
Add-Content "C:\my\dir\output_code2.csv" $linesToWrite
}
I would like to take this one step further, preferably using piping, and remove several of the columns with a command like:
select DateAndTime,DG1_KW,DG2_KW,WT_KW,HTR1_KW,POSS_Load_KW,INV1_KW,INV2_SOC|Export-csv output_test.csv -Notypeinformation
that being the variables in the header of each file.
How would I modify this code to make this work? The idea here is that I am going to be working with hundreds up to thousands of files.
I have other code which can do this, but it is nowhere near as fast.
For instance, with 10 .csv files of 450 KB each, the code below takes 20 seconds to produce a .csv file with 48 of the 56 columns removed, leaving only the variables I need. If I remove the part of the code that trims the columns, it still takes 12+ seconds.
# Directory containing csv files, include *.*
$directory = "C:\my\dir\*.*";
# Get the csv files
$csvFiles = Get-ChildItem -Path $directory -Filter *.csv;
#$content = $null;
$content = @();
# Process each file
foreach($csv in $csvFiles)
{
$content += Import-Csv $csv;
}
# Write a datetime stamped csv file
$datetime = Get-Date -Format "yyyyMMddhhmmss";
$content |Export-Csv -Path "C:\my\dir\output_code2_$datetime.csv" -NoTypeInformation;
The code I would like to modify runs those same 10 files in 5 seconds but does not remove the 48 columns.
Any Ideas guys?

Ok, you want an example... Let's say your CSVs always look like this:
Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8,Col9,Col10
data1,data2,data3,data4,data5,data6,data7,data8,data9,data10
dataA,dataB,dataC,dataD,dataE,dataF,dataG,dataH,dataI,dataJ
Now let's say you only want Col1, Col2, Col6, Col9, and Col10. You could do a RegEx replace something like:
$Files = get-childItem "C:\my\dir\*.csv" | Select -Expand FullName
ForEach($File in $Files){
If($SkipFirst){
Get-Content $File | Select -Skip 1 | ForEach{$_ -replace "^((?:.*?\,){2})(?:.*\,){3}(.*?\,)(?:(?:.*?\,){2})(.*?,.*?)$", '$1$2$3'} | Add-Content "C:\my\dir\output_code2.csv"
}Else{
Get-Content $File | ForEach{$_ -replace "^((?:.*?\,){2})(?:.*\,){3}(.*?\,)(?:(?:.*?\,){2})(.*?,.*?)$", '$1$2$3'} | Add-Content "C:\my\dir\output_code2.csv"
}
# After the first file, skip the header line of every later file
$SkipFirst = $true
}
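That would extract just the columns that I noted above, provided every line has exactly ten comma-separated fields and no field contains a quoted comma. See https://regex101.com/r/jY4oO6/1 for a detailed breakdown of the RegEx string. Effective output would be (skipping first line if so dictated):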
Col1,Col2,Col6,Col9,Col10
data1,data2,data6,data9,data10
dataA,dataB,dataF,dataI,dataJ
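If the columns are easier to name than to count, the same plain-text idea can be driven by the header row instead of a fixed regex. This is only a sketch: Select-CsvColumnText is a made-up helper name, the sample lines are invented, and it assumes no field contains a quoted comma or an embedded line break.

```powershell
# Hypothetical helper: keep only the named columns, treating each line as plain text.
# Assumption: fields never contain quoted commas or line breaks.
function Select-CsvColumnText {
    param(
        [string[]]$Lines,   # all lines of one file, header first
        [string[]]$Keep     # header names to keep, in output order
    )
    $header = $Lines[0] -split ','
    # Translate each wanted name into its position in the header
    $indexes = foreach ($name in $Keep) {
        $i = [array]::IndexOf($header, $name)
        if ($i -lt 0) { throw "Column '$name' not found in header" }
        $i
    }
    foreach ($line in $Lines) {
        $fields = $line -split ','
        ($fields[$indexes]) -join ','
    }
}

$sample = 'Col1,Col2,Col3', 'a,b,c', 'd,e,f'
$result = Select-CsvColumnText -Lines $sample -Keep 'Col1', 'Col3'
$result
```

For real files you would pass (Get-Content $File) as -Lines and drop the first output line for every file after the first one.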

Related

powershell: delete specific line from x to x

I'm new to powershell and I absolutely don't get it ...
Just want to delete line 7 to 2500 of a text file. First 6 lines should be untouched.
With linux bash everything is so easy, just:
sed -i '7,2500d' $file
Did not find any solution for mighty powershell :-(
Thank you.
Use Get-Content to read the contents of the file into a variable. The variable can be indexed like a regular PowerShell array. Get the parts of the array you need then pipe the variable into Set-Content to write back to the file.
$file = Get-Content test.log
$keep = $file[0..1] + $file[7..($file.Count - 1)]
$keep | Set-Content test.log
Using this as the contents of the file test.log:
One
Two
Three
Four
Five
Six
Seven
Eight
Nine
This script will output the following into test.log (overwriting the contents):
One
Two
Eight
Nine
In your case, you will want to use $file[0..5] + $file[2500..($file.Count - 1)].
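One caveat with the slice expression: it relies on the file having more lines than the end of the range. If the file is shorter, the range 2500..($file.Count - 1) counts downward and can pull in lines you did not intend. A possible guard, sketched with a made-up helper name (Remove-LineRange) and 1-based, inclusive line numbers:

```powershell
# Hypothetical helper: drop lines $First..$Last (1-based, inclusive) from an array of lines.
# Works the same whether or not the input is longer than $Last.
function Remove-LineRange {
    param([string[]]$Lines, [int]$First, [int]$Last)
    for ($i = 0; $i -lt $Lines.Count; $i++) {
        $lineNumber = $i + 1
        if ($lineNumber -lt $First -or $lineNumber -gt $Last) { $Lines[$i] }
    }
}

$sample = 'One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine'
$kept  = Remove-LineRange -Lines $sample -First 3 -Last 7
$short = @(Remove-LineRange -Lines $sample -First 7 -Last 2500)
$kept
```

With a real file this would be used as Remove-LineRange -Lines (Get-Content test.log) -First 7 -Last 2500 | Set-Content test.log.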
To remove a series of lines in a text file, you could do something like this:
$fileIn = 'D:\Test\File1.txt'
$fileOut = 'D:\Test\File2.txt'
$startRemove = 7
$endRemove = 2500
$currentLine = 1
# needs .NET 4
$newText = foreach ($line in [System.IO.File]::ReadLines($fileIn)) {
if ($currentLine -lt $startRemove -or $currentLine -gt $endRemove) { $line}
$currentLine++
}
$newText | Set-Content -Path $fileOut -Force
Or, if your version of .NET is below 4.0
$currentLine = 1   # reset the counter in case the previous snippet was run first
$reader = [System.IO.File]::OpenText($fileIn)
$newText = while($null -ne ($line = $reader.ReadLine())) {
if ($currentLine -lt $startRemove -or $currentLine -gt $endRemove) { $line }
$currentLine++
}
$reader.Dispose()
$newText | Set-Content -Path $fileOut -Force
Select-object -index takes an array, so:
1..10 > file
(get-content file) | select -index (0..5) | set-content file
get-content file
1
2
3
4
5
6
Or:
(cat file)[0..5] | set-content file

New columns into CSV file incredibly slow

I have a bunch of .csv files and I'm trying to add in some new column headers and their values (which are all blank anyway) and then output this to a new .csv file. My script currently runs and works fine but it takes about 5 minutes to complete the operation on a 60MB file with about 70,000 rows - I have about 100 files to do this on so it will take a while using this script.
My code is below, it's quite simple but clearly inefficient!
Import-Csv $strFilePath |
Select-Object *, @{Name='NewHeader';Expression={''}},
@{Name='NewHeader2';Expression={''}},
@{Name='NewHeader3';Expression={''}},
@{Name='NewHeader4';Expression={''}} |
Export-Csv $($strFilePath + ".new") -NoTypeInformation
As pointed out in the comments, I think it would be better to treat it as a simple text without the useless conversion.
$path = 'C:\test'
$newHeaders = 'NewHeader1','NewHeader2','NewHeader3','NewHeader4'
$files = Get-ChildItem -LiteralPath $path -Filter *.csv
$newHeadersString = @(''; $newHeaders | foreach { '"{0}"' -f $_ }) -join ','
$newColmunsString = ',""' * $newHeaders.Count
foreach ($file in $files) {
$sr = $file.OpenText()
$outfile = New-Item ($file.FullName + '.new') -Force
$sw = [IO.StreamWriter]::new($outfile.FullName)
$sw.WriteLine($sr.ReadLine() + $newHeadersString)
while(!$sr.EndOfStream) { $sw.WriteLine($sr.ReadLine() + $newColmunsString) }
$sr.Close()
$sw.Close()
}
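To see what the two helper strings in the script above end up containing, here is the same construction on a shortened header list. The names A and B and the sample rows are placeholders, not taken from the question.

```powershell
$newHeaders = 'A', 'B'

# Leading empty element makes the joined string start with a comma
$headerSuffix = @(''; $newHeaders | ForEach-Object { '"{0}"' -f $_ }) -join ','
# One empty quoted field per new header
$valueSuffix = ',""' * $newHeaders.Count

$headerLine = '"x","y"' + $headerSuffix   # "x","y","A","B"
$dataLine   = '"1","2"' + $valueSuffix    # "1","2","",""
$headerLine
$dataLine
```

Because each line is only extended with a fixed suffix, nothing is parsed into objects, which is where the speed-up over Import-Csv comes from.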

How can I count the number of CSV columns when the file has multiline data and no header

My CSV files have no headers and multi line entries like this:
11;"multi line
col12";13;foobar;foobar
21;22;23;24;25
And I'd like to count the number of columns. So 5 in this example. How do I do that?
What I tried:
Import-CSV doesn't work without the header parameter due to duplicate entries on the first line.
(Import-Csv .\bad.csv -Delimiter ";" | get-member -type NoteProperty).count
Adding a header parameter skews the count.
(Import-Csv .\bad.csv -Delimiter ";" -Header (1..99) | get-member -type NoteProperty).count
I had to abort reading the file manually via Get-Content because of all the parsing I would have to handle manually. Escaping characters and multi line entries...
My version of PowerShell is 3 and I have to port my script to version 2 later on.
If you are willing to accept the caveat that this could miscount the number of columns when a quoted string contains the delimiter, this could be good enough for you.
$path = "c:\temp\test.txt"
$delimiter = ";"
$numberOfColumns = Get-Content $path |
ForEach-Object{($_.split($delimiter)).Count} |
Measure-Object -Maximum |
Select-Object -ExpandProperty Maximum
Import-Csv $path -Header (1..$numberOfColumns) -Delimiter $delimiter
Read in the file with Get-Content and find the maximum number of columns by splitting each line on its delimiter, then use that value to import the CSV. If the file is large, you can read it once with Get-Content and then use ConvertFrom-Csv once you know your column count.
If every record has a line break inside a quoted field, the above logic would fail. Still, we could temporarily scrub the data by removing the line breaks that fall inside quoted fields in order to get an accurate count.
$delimiter = ";"
$fileData = (Get-Content $path | Out-String)
$numberOfColumns = ((($fileData -replace "(`"[^;]+?)`r`n",'$1') -split "`r`n" | Select -First 1).split($delimiter)).Count
$fileData | ConvertFrom-Csv -Header (1..$numberOfColumns) -Delimiter $delimiter
What this will do is find lines that end where there is a double quote followed by data that does not contain the delimiter. We also match the newline that follows but drop that same new line in the replacement. If that is done we know that the first line is proper. Use that same line to split and count just like before.
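To check the scrub without touching a file, the same expression can be run against an in-memory string. The literal below is just the sample from the question, written with CRLF line endings, which is an assumption about how the file is saved.

```powershell
$delimiter = ';'
# The question's sample: the first record has a line break inside a quoted field
$fileData = "11;`"multi line`r`ncol12`";13;foobar;foobar`r`n21;22;23;24;25`r`n"

# Remove the line break that follows an unterminated quoted field, then take the first line
$firstRecord = ($fileData -replace "(`"[^;]+?)`r`n", '$1') -split "`r`n" | Select-Object -First 1
$numberOfColumns = $firstRecord.Split($delimiter).Count
$numberOfColumns
```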
Since Excel knows, let's ask him :
$path = "path\to\bad.csv"
$excel = New-Object -ComObject Excel.Application
$workbook = $excel.Workbooks.Open($path)
$sheet = $workbook.ActiveSheet
$columnIndex = 1
while($sheet.Cells.Item(1, $columnIndex).Text -ne "") {
$columnIndex++
}
"There are $($columnIndex - 1) columns in CSV file $path"
Start-Sleep -Seconds 1
Get-Process excel | Stop-Process -Force
As pointed out by Ansgar Wiechers in comments, there is a much shorter solution :
$path = "path\to\bad.csv"
$excel = New-Object -ComObject Excel.Application
$workbook = $excel.Workbooks.Open($path)
$sheet = $workbook.ActiveSheet
$columnCount = $sheet.UsedRange.Columns.Count
"There are $columnCount columns in CSV file $path"
Start-Sleep -Seconds 1
Get-Process excel | Stop-Process -Force
(I know my way of killing Excel is dirty, but iirc it takes too much code to do so)
I know this is very old, but I came across a similar situation today (I did not have rows with varying numbers of columns) and found my own solution, so I thought I would share it for anyone else coming into this situation. My solution was to use Get-Content for the first row of the CSV and -split on the delimiter (,) to create an array and then return the count of the array. As mentioned in replies above, this will not account for delimiters existing within quotations.
((Get-Content $PathToCsv)[0] -split ",").count
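If quoted delimiters are a concern, one possible refinement is to split only on commas that are followed by an even number of double quotes, i.e. commas that sit outside a quoted field. This is a sketch on a made-up sample line; it still looks at a single line, so it does not help with quoted fields that span several lines.

```powershell
# Made-up sample: the second field contains a comma inside quotes
$firstLine = 'a,"b,c",d'

# Naive split counts the quoted comma as a delimiter
$naiveCount = ($firstLine -split ',').Count

# Split only where the rest of the line holds an even number of double quotes
$quoteAware = $firstLine -split ',(?=(?:[^"]*"[^"]*")*[^"]*$)'
$quoteAwareCount = $quoteAware.Count
"naive: $naiveCount, quote-aware: $quoteAwareCount"
```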
I had the same issue and went with AAgent's suggestion.
$CommaCount = ((Get-Content $PathToCsv)[0] -split ",").count
$SemicolonCount = ((Get-Content $PathToCsv)[0] -split ";").count
if ($CommaCount -gt $SemicolonCount){
$CMSlist = Import-Csv ($PathToCsv) -Delimiter ","
}
else{
$CMSlist = Import-Csv ($PathToCsv) -Delimiter ";"
}

Powershell - reading ahead and While

I have a text file in the following format:
.....
ENTRY,PartNumber1,,,
FIELD,IntCode,123456
...
FIELD,MFRPartNumber,ABC123,,,
...
FIELD,XPARTNUMBER,ABC123
...
FIELD,InternalPartNumber,3214567
...
ENTRY,PartNumber2,,,
...
...
the ... indicates there is other data between these fields. The ONLY thing I can be certain of is that the field starting with ENTRY is a new set of records. The rows starting with FIELD can be in any order, and not all of them may be present in each group of data.
I need to:
Read in a chunk of data.
Search for any field matching the string ABC123.
If ABC123 is found, search for the existence of the InternalPartNumber field & return that row of data.
I have not seen a way to use Get-Content that can read in a variable number of rows as a set & be able to search it.
Here is the code I currently have, which will read a file, searching for a string & replacing it with another. I hope this can be modified to be used in this case.
$ftype = "*.txt"
$fnames = gci -Path $filefolder1 -Filter $ftype -Recurse|% {$_.FullName}
$mfgPartlist = Import-Csv -Path "C:\test\mfrPartList.csv"
foreach ($file in $fnames) {
$contents = Get-Content -Path $file
foreach ($partnbr in $mfgPartlist) {
$oldString = $partnbr.OldValue
$newString = $partnbr.NewValue
if (Select-String -Path $file -SimpleMatch $oldString -Debug -Quiet) {
$stringData = $contents -imatch $oldString
$stringData = $stringData -replace "[\n\r]","|"
foreach ($dataline in $stringData) {
$file +"|"+$stringData+"|"+$oldString+"|"+$newString|Out-File "C:\test\Datachanges.txt" -Width 2000 -Append
}
$contents = $contents -replace $oldString, $newString
Set-Content -Path $file -Value $contents
}
}
}
Is there a way to read & search a text file in "chunks" using Powershell? Or to do a Read-ahead & determine what to search?
Assuming your file isn't too big to read into memory all at once:
$Text = Get-Content testfile.txt -Raw
($Text -split '(?ms)^(?=ENTRY)') |
foreach {
if ($_ -match '(?ms)^FIELD\S+ABC123')
{$_ -replace '(?ms).+(^Field\S+InternalPartNumber.+?$).+','$1'}
}
FIELD,InternalPartNumber,3214567
That reads the entire file in as a single multiline string, and then splits it at the beginning of any line that starts with 'ENTRY'. Then it tests each segment for a FIELD line that contains 'ABC123', and if it does, removes everything except the FIELD line for the InternalPartNumber.
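The same split-then-test idea can also be written so that each chunk is handled as an array of lines rather than with one large regex. A rough sketch on in-memory data: the second entry's values (XYZ999, 7654321) are invented for the example, and the text is joined with plain newlines so the result does not depend on the file's line endings.

```powershell
$text = @(
    'ENTRY,PartNumber1,,,'
    'FIELD,IntCode,123456'
    'FIELD,MFRPartNumber,ABC123,,,'
    'FIELD,InternalPartNumber,3214567'
    'ENTRY,PartNumber2,,,'
    'FIELD,MFRPartNumber,XYZ999,,,'
    'FIELD,InternalPartNumber,7654321'
) -join "`n"

# Split in front of every line that starts with ENTRY, dropping empty pieces
$blocks = $text -split '(?m)^(?=ENTRY,)' | Where-Object { $_ }

$hits = foreach ($block in $blocks) {
    $fields = $block -split "`n"
    # Only blocks with a FIELD row whose value is ABC123 are of interest
    if ($fields -match '^FIELD,[^,]+,ABC123(,|$)') {
        $fields -match '^FIELD,InternalPartNumber,'
    }
}
$hits
```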
This is not my best work as I have just got back from vacation. You could use a while loop reading the text and set an entry flag to gobble up the text in chunks. However if your files are not too big then you could just read up the text file at once and use regex to split up the chunks and then process accordingly.
$pattern = "ABC123"
$matchedRowToReturn = "InternalPartNumber"
$fileData = Get-Content "d:\temp\test.txt" | Where-Object{$_ -match '^(entry|field)'} | Out-String
$parts = $fileData | Select-String '(?smi)(^Entry).*?(?=^Entry|\Z)' -AllMatches | Select-Object -ExpandProperty Matches | Select-Object -ExpandProperty Value
$parts | Where-Object{$_ -match $pattern} | Select-String "$matchedRowToReturn.*$" | Select-Object -ExpandProperty Matches | Select-Object -ExpandProperty Value
This reads in the text file, drops any lines that are not entry or field related, joins the rest into one long string, and splits it up into chunks that start with lines beginning with the word "Entry".
Then we drop those "parts" that do not contain the $pattern. From the remaining parts that match, we extract the InternalPartNumber line and present it.
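For files too large to hold as one string, the loop-with-a-flag idea mentioned above might look roughly like this. Get-EntryBlock is a made-up name and the sample lines are illustrative; with a real file the lines could come from Get-Content or [System.IO.File]::ReadLines.

```powershell
# Hypothetical helper: group lines into blocks, each starting at an ENTRY line.
function Get-EntryBlock {
    param([string[]]$Lines)
    $current = $null
    foreach ($line in $Lines) {
        if ($line -like 'ENTRY,*') {
            # A new ENTRY closes the previous block, if there was one
            if ($null -ne $current) { ,$current.ToArray() }
            $current = New-Object 'System.Collections.Generic.List[string]'
        }
        # Lines before the first ENTRY are ignored
        if ($null -ne $current) { $current.Add($line) }
    }
    # Emit the last block
    if ($null -ne $current) { ,$current.ToArray() }
}

$sample = @(
    '.....'
    'ENTRY,PartNumber1,,,'
    'FIELD,IntCode,123456'
    'FIELD,MFRPartNumber,ABC123,,,'
    'FIELD,InternalPartNumber,3214567'
    'ENTRY,PartNumber2,,,'
    'FIELD,InternalPartNumber,7654321'
)

$found = foreach ($block in @(Get-EntryBlock -Lines $sample)) {
    if ($block -match 'ABC123') { $block -like 'FIELD,InternalPartNumber,*' }
}
$found
```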

Remove Top Line of Text File with PowerShell

I am trying to just remove the first line of about 5000 text files before importing them.
I am still very new to PowerShell so not sure what to search for or how to approach this. My current concept using pseudo-code:
set-content file (get-content unless line contains amount)
However, I can't seem to figure out how to do something like contains.
While I really admire the answer from @hoge both for a very concise technique and a wrapper function to generalize it and I encourage upvotes for it, I am compelled to comment on the other two answers that use temp files (it gnaws at me like fingernails on a chalkboard!).
Assuming the file is not huge, you can force the pipeline to operate in discrete sections--thereby obviating the need for a temp file--with judicious use of parentheses:
(Get-Content $file | Select-Object -Skip 1) | Set-Content $file
... or in short form:
(gc $file | select -Skip 1) | sc $file
It is not the most efficient in the world, but this should work:
get-content $file |
select -Skip 1 |
set-content "$file-temp"
move "$file-temp" $file -Force
Using variable notation, you can do it without a temporary file:
${C:\file.txt} = ${C:\file.txt} | select -skip 1
function Remove-Topline ( [string[]]$path, [int]$skip=1 ) {
if ( -not (Test-Path $path -PathType Leaf) ) {
throw "invalid filename"
}
ls $path |
% { iex "`${$($_.fullname)} = `${$($_.fullname)} | select -skip $skip" }
}
I just had to do the same task, and gc | select ... | sc took over 4 GB of RAM on my machine while reading a 1.6 GB file. It didn't finish for at least 20 minutes after reading the whole file in (as reported by Read Bytes in Process Explorer), at which point I had to kill it.
My solution was to use a more .NET approach: StreamReader + StreamWriter.
See this answer for a great answer discussing the perf: In Powershell, what's the most efficient way to split a large text file by record type?
Below is my solution. Yes, it uses a temporary file, but in my case, it didn't matter (it was a freaking huge SQL table creation and insert statements file):
PS> (measure-command{
$i = 0
$ins = New-Object System.IO.StreamReader "in/file/pa.th"
$outs = New-Object System.IO.StreamWriter "out/file/pa.th"
while( !$ins.EndOfStream ) {
$line = $ins.ReadLine();
if( $i -ne 0 ) {
$outs.WriteLine($line);
}
$i = $i+1;
}
$outs.Close();
$ins.Close();
}).TotalSeconds
It returned:
188.1224443
Inspired by AASoft's answer, I went out to improve it a bit more:
Avoid the loop variable $i and the comparison with 0 in every loop
Wrap the execution into a try..finally block to always close the files in use
Make the solution work for an arbitrary number of lines to remove from the beginning of the file
Use a variable $p to reference the current directory
These changes lead to the following code:
$p = (Get-Location).Path
(Measure-Command {
# Number of lines to skip
$skip = 1
$ins = New-Object System.IO.StreamReader ($p + "\test.log")
$outs = New-Object System.IO.StreamWriter ($p + "\test-1.log")
try {
# Skip the first N lines, but allow for fewer than N, as well
for( $s = 1; $s -le $skip -and !$ins.EndOfStream; $s++ ) {
$ins.ReadLine()
}
while( !$ins.EndOfStream ) {
$outs.WriteLine( $ins.ReadLine() )
}
}
finally {
$outs.Close()
$ins.Close()
}
}).TotalSeconds
The first change brought the processing time for my 60 MB file down from 5.3s to 4s. The rest of the changes are more cosmetic.
$x = get-content $file
$x[1..$x.count] | set-content $file
Just that much. Long boring explanation follows. Get-content returns an array. We can "index into" array variables, as demonstrated in this and other Scripting Guys posts.
For example, if we define an array variable like this,
$array = @("first item","second item","third item")
so $array returns
first item
second item
third item
then we can "index into" that array to retrieve only its 1st element
$array[0]
or only its 2nd
$array[1]
or a range of index values from the 2nd through the last.
$array[1..$array.count]
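Note that 1..$array.count asks for one index past the end of the array; as far as I know PowerShell quietly ignores it in a slice, but a range that stops at Count - 1, or Select-Object -Skip 1, avoids relying on that. A small comparison on in-memory data (the single-item array is an invented edge case):

```powershell
$array = @('first item', 'second item', 'third item')

# Slice that stays inside the array bounds
$viaIndex = $array[1..($array.Count - 1)]

# Same result through the pipeline
$viaSkip = $array | Select-Object -Skip 1

# With only one element, -Skip 1 simply leaves nothing
$single = @('only item')
$restOfSingle = @($single | Select-Object -Skip 1)

$viaIndex
```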
I just learned from a website:
Get-ChildItem *.txt | ForEach-Object { (get-Content $_) | Where-Object {(1) -notcontains $_.ReadCount } | Set-Content -path $_ }
Or you can use the aliases to make it short, like:
gci *.txt | % { (gc $_) | ? { (1) -notcontains $_.ReadCount } | sc -path $_ }
Another approach to remove the first line from a file is the multiple assignment technique:
$firstLine, $restOfDocument = Get-Content -Path $filename
$modifiedContent = $restOfDocument
$modifiedContent | Out-String | Set-Content $filename
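The multiple assignment itself is easy to try on an in-memory array first; the first variable receives the first element and the last variable receives everything that is left. The sample values here are made up.

```powershell
$lines = 'header', 'row 1', 'row 2'

# $firstLine gets 'header'; $restOfDocument gets the remaining elements
$firstLine, $restOfDocument = $lines

$firstLine
$restOfDocument
```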
-Skip didn't work for me, so my workaround is:
$LinesCount = $(get-content $file).Count
get-content $file |
select -Last $($LinesCount-1) |
set-content "$file-temp"
move "$file-temp" $file -Force
Following on from Michael Soren's answer: if you want to edit all .txt files in the current directory and remove the first line from each, you can use:
Get-ChildItem (Get-Location).Path -Filter *.txt |
Foreach-Object {
(Get-Content $_.FullName | Select-Object -Skip 1) | Set-Content $_.FullName
}
For smaller files you could use this:
& C:\windows\system32\more +1 oldfile.csv > newfile.csv | out-null
... but it's not very effective at processing my example file of 16MB. It doesn't seem to terminate and release the lock on newfile.csv.