So I’ve had a request to edit a CSV file by replacing column values with a set of unique numbers. Below is a sample of the original input file, with a header line followed by a couple of rows. Note that the rows have NO column headers.
HH ### SLH304 01100001 2 20200606 20200605 011100
DD|GZFD|IE|USD|20200605|EUR200717||
DD|GZFD|IE|USD|20200605|EUR200717||
What I’m looking to do is replace the values in column 3 with a unique number.
So far I have the following …
$i=0
$txtin = Get-Content "C:\Temp\InFile.txt" | ForEach {"$($_.split('|'))"-replace $_[2],$i++} |Out-File C:\Temp\csvout.txt
… but this isn’t working as it removes the delimiter and adds numbers in the wrong places …
HH0###0000000SLH30400110000100000002000000202006060202006050011100
1D1D1 1G1Z1F1D1 1I1E1 1U1S1D1 12101210101610151 1E1U1R1210101711171 1 1
2D2D2 2G2Z2F2D2 2I2E2 2U2S2D2 22202220202620252 2E2U2R2220202721272 2 2
Ideally I want it to look like this, whereby the values of 'IE' have been replaced by '01' and '02' in each row ...
HH ### SLH304 01100001 2 20200606 20200605 011100
DD|GZFD|01|USD|20200605|EUR200717||
DD|GZFD|02|USD|20200605|EUR200717||
Any ideas on how to resolve would be much appreciated.
I think spreading this out into multiline code will make it easier:
$txtin = Get-Content 'C:\Temp\InFile.txt'
# loop through the lines, skipping the first line
for ($i = 1; $i -lt $txtin.Count; $i++) {
    $parts = $txtin[$i].Split('|')    # or use regex: -split '\|'
    if ($parts.Count -ge 3) {
        $parts[2] = '{0:00}' -f $i    # update the 3rd column
        $txtin[$i] = $parts -join '|' # rejoin the parts with '|'
    }
}
$txtin | Out-File -FilePath 'C:\Temp\csvout.txt'
Output will be:
HH ### SLH304 01100001 2 20200606 20200605 011100
DD|GZFD|01|USD|20200605|EUR200717||
DD|GZFD|02|USD|20200605|EUR200717||
Updated to use the more robust check suggested by mklement0. This avoids errors when a line does not have at least three parts after the split.
My code captures a string of numbers from another file using the regex (.*). There should always be a minimum of four digits.
The output may be
1456 or 234567
But let’s say it’s
3667876
I want to add ‘km ’ before the last three digits and an ‘m’ after the last three digits, resulting in
3667km 876m
The line of code in the PowerShell script is
Get-Content -Tail 0 -Wait -Encoding "UTF8" $log |
Select-String "Run Distance: (.*)" |
% {"Total Distance `- " + $_.matches.groups[1].value} |
Write-SlowOutput -outputFile $output -waitFor $delay
So in this case the output would read
Total Distance - 3667km 876m
Can anyone help with the regex formula to use in place of the (.*) in this PowerShell script?
Thank you
here's yet another way to do the job. [grin] you can use a regex pattern with the -replace operator to replace the digits in the string & build a new string with the match groups. like this ...
'1234' -replace '(.+)(.{3})$', '$1km $2m'
output = 1km 234m
the glitch with this is that the number must have at least 4 digits to work correctly. if you may have fewer digits to work with, then a solution similar to those by Thomas or FoxDeploy is needed.
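As a hedged sketch of that caveat (my own addition, not from the thread): left-padding the input to at least four characters first keeps the same -replace pattern safe for shorter numbers.

```powershell
# pad to at least 4 characters so the '(.+)(.{3})$' pattern always matches;
# a short input like '12' becomes '0012' -> '0km 012m' instead of failing to match
foreach ($n in '12', '1234', '3667876') {
    $n.PadLeft(4, '0') -replace '(.+)(.{3})$', '$1km $2m'
}
# -> 0km 012m
# -> 1km 234m
# -> 3667km 876m
```

Whether '0km 012m' is acceptable output for very short inputs depends on the data; the padding just guarantees the pattern matches.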
I don't have the Write-SlowOutput cmdlet, but the output of the ForEach-Object cmdlet looks fine:
Get-Content -Tail 0 -Wait -Encoding "UTF8" $log |
Select-String "Run Distance: (\d+)(\d{3})$" |
% {"Total Distance `- $($_.matches.groups[1].value)km $($_.matches.groups[2].value)m"} |
Write-SlowOutput -outputFile $output -waitFor $delay
I implemented two matching groups in the regex to be able to process them individually.
If you'd like more readable code, you can also do this easily by casting your int into a string using ToString() and then using Substring() to slice it apart. The result is very easy to read.
ForEach ($n in $nums) {
    $splitIndex = $n.ToString().Length - 3
    $KMs = $n.ToString().Substring(0, $splitIndex)
    $Meters = $n.ToString().Substring($splitIndex, 3)
    "Total distance $KMs Kilos - $Meters meters"
}
Resulting in
Total distance 3667 Kilos - 876 meters
Total distance 33667 Kilos - 876 meters
Total distance 45454 Kilos - 131 meters
So I have a CSV file which is 25MB.
I only need to get the value stored in the 2nd line, first column, and use it later in a PowerShell script.
e.g data
File_name,INVNUM,ID,XXX....850 columns
ABCD,123,090,xxxx.....850 columns
ABCD,120,091,xxxx.....850 columns
xxxxxx5000+ rows
So my first-column data is always the same, and I just need to get this filename from the first column, 2nd row.
Should I try to use Get-Content or Import-Csv for this use case?
Thanks,
Mickey
TessellatingHeckler's helpful answer contains a pragmatic, easy-to-understand solution that is most likely fast enough in practice; the same goes for Robert Cotterman's helpful answer which is concise (and also faster).
If performance is really paramount, you can try the following, which uses the .NET framework directly to read the lines - but given that you only need to read 2 lines, it's probably not worth it:
$inputFile = "$PWD/some.csv" # be sure to specify a *full* path
$isFirstLine = $true
$fname = foreach ($line in [IO.File]::ReadLines($inputFile)) {
    if ($isFirstLine) { $isFirstLine = $false; continue } # skip header line
    $line -replace '^([^,]*),.*', '$1' # extract 1st field from 2nd line and exit
    break # exit
}
Note: A conceptually simpler way to extract the 1st field is to use ($line -split ',')[0], but with a large number of columns the above -replace-based approach is measurably faster.
Update: TessellatingHeckler offers 2 ways to speed up the above:
Use of $line.Substring(0, $line.IndexOf(',')) in lieu of $line -replace '^([^,]*),.*', '$1' in order to avoid relatively costly regex processing.
To lesser gain, use of a [System.IO.StreamReader] instance's .ReadLine() method twice in a row rather than [IO.File]::ReadLines() in a loop.
Here's a performance comparison of the approaches across all answers on this page (as of this writing).
To run it yourself, you must download functions New-CsvSampleData and Time-Command first.
For more representative results, the timings are averaged across 1,000 runs:
# Create a sample CSV file with 850 columns and 100 rows.
$testFileName = "test-$PID.csv"
New-CsvSampleData -Columns 850 -Count 100 | Set-Content $testFileName
# Compare the execution speed of the various approaches:
Time-Command -Count 1000 {
    # Import-Csv
    Import-Csv -LiteralPath $testFileName |
        Select-Object -Skip 1 -First 1 -ExpandProperty 'col1'
}, {
    # ReadLines(), -replace
    $inputFile = $PWD.ProviderPath + "/$testFileName"
    $isFirstLine = $true
    foreach ($line in [IO.File]::ReadLines($inputFile)) {
        if ($isFirstLine) { $isFirstLine = $false; continue } # skip header line
        $line -replace '^([^,]*),.*', '$1' # extract 1st field from 2nd line and exit
        break # exit
    }
}, {
    # ReadLines(), .Substring / IndexOf
    $inputFile = $PWD.ProviderPath + "/$testFileName"
    $isFirstLine = $true
    foreach ($line in [IO.File]::ReadLines($inputFile)) {
        if ($isFirstLine) { $isFirstLine = $false; continue } # skip header line
        $line.Substring(0, $line.IndexOf(',')) # extract 1st field from 2nd line and exit
        break # exit
    }
}, {
    # ReadLine() x 2, .Substring / IndexOf
    $inputFile = $PWD.ProviderPath + "/$testFileName"
    $f = [System.IO.StreamReader]::new($inputFile, $true)
    $null = $f.ReadLine(); $line = $f.ReadLine()
    $line.Substring(0, $line.IndexOf(','))
    $f.Close()
}, {
    # Get-Content -Head, .Split()
    ((Get-Content $testFileName -Head 2)[1]).split(',')[1]
} |
    Format-Table Factor, Timespan, Command
Remove-Item $testFileName
Sample output from a single-core Windows 10 VM running Windows PowerShell v5.1 / PowerShell Core 6.1.0-preview.4 on a recent-model MacBook Pro:
Windows PowerShell v5.1:
Factor TimeSpan Command
------ -------- -------
1.00 00:00:00.0001922 # ReadLine() x 2, .Substring / IndexOf...
1.04 00:00:00.0002004 # ReadLines(), .Substring / IndexOf...
1.57 00:00:00.0003024 # ReadLines(), -replace...
3.25 00:00:00.0006245 # Get-Content -Head, .Split()...
25.83 00:00:00.0049661 # Import-Csv...
PowerShell Core 6.1.0-preview.4:
Factor TimeSpan Command
------ -------- -------
1.00 00:00:00.0001858 # ReadLine() x 2, .Substring / IndexOf...
1.03 00:00:00.0001911 # ReadLines(), .Substring / IndexOf...
1.60 00:00:00.0002977 # ReadLines(), -replace...
3.30 00:00:00.0006132 # Get-Content -Head, .Split()...
27.54 00:00:00.0051174 # Import-Csv...
Conclusions:
Calling .ReadLine() twice is marginally faster than the ::ReadLines() loop.
Using -replace instead of Substring() / IndexOf() adds about 60% execution time.
Using Get-Content is more than 3 times slower.
Using Import-Csv | Select-Object is close to 30 times(!) slower, presumably due to the large number of columns; that said, in absolute terms we're still only talking about around 5 milliseconds.
As a side note: execution on macOS seems to be noticeably slower overall, with the regex solution and the cmdlet calls also being slower in relative terms.
Depends what you want to prioritize.
$data = Import-Csv -LiteralPath 'c:\temp\data.csv' |
Select-Object -Skip 1 -First 1 -ExpandProperty 'File_Name'
Is short and convenient. (2nd line meaning 2nd line of the file, or 2nd line of the data? Don't skip any if it's the first line of data).
Select-Object with something like -First 1 will break the whole pipeline when it's done, so it won't wait to read the rest of the 25MB in the background before returning.
You could likely speed it up, or reduce memory use, a minuscule amount if you opened the file, seeked past two newlines, then a comma, then read to another comma, or some other long detailed code, but I very much doubt it would be worth it.
Same with Get-Content: the way it adds NoteProperties to the output strings means it's likely no easier on memory and not usefully faster than Import-Csv.
You could really shorten it with
(gc c:\file.txt -head 2)[1]
This reads only 2 lines and then grabs index 1 (the second line).
You could then split it and grab index 1 of the split-up line:
((gc c:\file.txt -head 2)[1]).split(',')[1]
UPDATE: After seeing the new post with timings, I was inspired to do some tests myself (thanks, mklement0). This was the fastest I could get to work:
$check = 0
foreach ($i in [IO.File]::ReadLines("$filePath")) {
    if ($check -eq 2) { break }
    if ($check -eq 1) { $value = $i.split(',')[1] } # $value = your answer
    $check++
}
Just thought of this: remove the -eq 2 check and put a break after a semicolon once the -eq 1 check is performed. Five ticks faster; haven't tested.
Here were my results over 40,000 tests:
GC split avg was 1.11307622 milliseconds
GC split min was 0.3076 milliseconds
GC split max was 18.1514 milliseconds
ReadLines split avg was 0.3836625825 milliseconds
ReadLines split min was 0.2309 milliseconds
ReadLines split max was 31.7407 milliseconds
StreamReader avg was 0.4464924825 milliseconds
StreamReader min was 0.2703 milliseconds
StreamReader max was 31.4991 milliseconds
Import-Csv avg was 1.32440485 milliseconds
Import-Csv min was 0.2875 milliseconds
Import-Csv max was 103.1694 milliseconds
I was able to run 3,000 tests a second on the 2nd and 3rd, and 1,000 tests a second on the first and last. StreamReader was his fastest one, and Import-Csv wasn't bad; I wonder if mklement0's test CSV didn't have a column named "file_name"? Anyhow, I'd personally use the GC command because it's concise and easy to remember. But this is up to you, and I wish you luck on your scripting adventures.
I'm certain we could start multithreading this and get insane results, but when you're talking thousandths of a second, is it really a big deal? Especially to get one variable? :D
Here's the StreamReader code I used, for transparency:
$inputFile = "$filePath"
$f = [System.IO.StreamReader]::new($inputFile,$true);
$null = $f.ReadLine(); $line = $f.ReadLine()
$line.Substring(0, $line.IndexOf(','))
$f.Close()
I also noticed this pulls the 1st value of the second line, and I have no idea how to switch it to the 2nd value. It seems to measure the width from position 0 to the first comma and then cut that out; if you change the Substring start from 0 to, say, 5, it still measures the length from 0 to the comma, but starts grabbing at the 6th character.
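For what it's worth, here is a sketch (my own, not from the thread) of grabbing the 2nd field with the same Substring/IndexOf approach: locate the first comma, then the next one, and take the text between them.

```powershell
$line = 'ABCD,123,090,xxxx'   # sample row shaped like the question's data

$first  = $line.IndexOf(',')              # position of the 1st comma
$second = $line.IndexOf(',', $first + 1)  # position of the 2nd comma

# the 2nd field is everything between the two commas
$line.Substring($first + 1, $second - $first - 1)   # -> 123
```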
The Import-Csv I used was:
$data = Import-Csv -LiteralPath "$filePath" |
Select-Object -Skip 1 -First 1 -ExpandProperty 'FileName'
I tested these on a 90 MB CSV with 21 columns and 284k rows, where "FileName" was the second column.
Can anyone explain to me the difference between the following two statements?
gc -ReadCount 2 .\input.txt| % {"##" + $_}
(gc -ReadCount 2 .\input.txt)| % {"##" + $_}
I am using the file below as input for the commands above.
input.txt
1
2
Output
gc -ReadCount 2 .\input.txt| % {"##" + $_}
##1 2
(gc -ReadCount 2 .\input.txt)| % {"##" + $_}
##1
##2
If the input file contains more than 2 records, both give the same output.
I can modify my code to achieve what I want, but I am just wondering why these 2 give different outputs.
I googled for the information but didn't find any answer.
Edit 1
Isn't the output of command 2 wrong? When I specify "-ReadCount 2" it should pipe two lines at a time, which means the foreach loop should iterate only once (as the input contains only 2 lines) with $_[0]=1, $_[1]=2, so that when I print "##"+$_ it prints "##1 2" as command 1 did.
gc -ReadCount 2 .\input.txt| % {"##" + $_}
This sends each chunk (here, the whole two-line file) down the pipeline as a single array, so the foreach loop runs once; concatenating "##" with the array stringifies it, producing ##1 2.
(gc -ReadCount 2 .\input.txt)| % {"##" + $_}
The parentheses force Get-Content to run to completion first; its output is a single two-element chunk, which PowerShell unwraps, so the pipeline enumerates the lines individually and the foreach loop runs twice.
The -ReadCount parameter is used to split the data into chunks of lines, mostly for performance; for a seven-line input, -ReadCount 3 will show
##1 2 3
##4 5 6
##7
and -ReadCount 4 will show:
##1 2 3 4
##5 6 7
For everyone to see, here is the output of Get-Help Get-Content -Parameter ReadCount:
-ReadCount <Int64>
Specifies how many lines of content are sent through the pipeline at a time.
The default value is 1. A value of 0 (zero) sends all of the content at one time.
This is what breaks the lines into groups (I assumed it would instead limit the number of lines read in the file).
Still no clue about the less-than-three-lines behavior, though.
You can try this:
function setConfig($file) {
    $content = Get-Content $file
    $content -remove '$content[1..14]'
    Set-Content $file
}
I want to make a function through which I can pass through files so that it deletes specific lines or bunch of lines
I don't recall -remove being a PowerShell operator, and I would expect the error you are getting to be:
Unexpected token '-remove' in expression or statement.
Also, you are preventing PowerShell from expanding the code in single quotes, so it is being treated as the literal string "$content[1..14]".
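A quick illustration of that quoting point (my own sketch):

```powershell
$content = 'a', 'b', 'c'

'$content[1..14]'   # single quotes: the literal text $content[1..14]
"$content[1..14]"   # double quotes: $content expands, [1..14] stays literal -> a b c[1..14]
$content[1..2]      # no quotes: real indexing -> b, c
```

Indexing inside a double-quoted string would need a subexpression, e.g. "$($content[1])".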
I am going to take the liberty of assuming you are trying to remove the first 14-ish lines from a file while keeping the first line, yes?
I created a test file containing 30 lines using the following code:
1..30 | Set-Content C:\temp\30lines.txt
Then we use this updated version of your function:
function setConfig($file) {
    $content = Get-Content $file
    $content | Select-Object -Index (,0 + (14..$($content.Count))) | Set-Content $file
}
Using the -Index parameter of Select-Object, we get the first line (,0), then add the remaining lines after the 14th ((14..$($content.Count))). The comma is needed in front of the 0 since we want to combine two arrays of numbers. The updated file content would look like this; change the -Index values to suit your needs.
1
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
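The array-combination detail in the -Index expression above can be seen in isolation (my own sketch):

```powershell
# without the unary comma, 0 + (14..16) attempts arithmetic addition and fails;
# ,0 makes a one-element array, so + concatenates the two arrays instead
$idx = ,0 + (14..16)
$idx -join ' '   # -> 0 14 15 16
```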