So I have a CSV file with rows that I want to transpose (some of them) into columns using PowerShell.
The example is as follows:
ALPHA
CD
CL
CM
-5
0.1
-0.2
0.05
0
0.4
0.4
-0.08
5
0.5
0.8
-0.1
What I want is something like this:
Alpha CD CL CM
-5 0.1 -0.2 0.05
0 0.4 0.4 -0.08
5 0.5 0.8 -0.1
For reference, I got these values from a .dat data file output with over 400 rows of information. I reformatted it into a CSV file using Out-File, skipping all the rows I don't need.
The information was split into rows but not columns (ALPHA, CD, CL and CM were all in one cell with spaces in between), so I used the Split method as shown below to break them apart into rows.
$text.Split() | Where-Object { $_ }
Now I want to transpose SOME of them back into columns.
The problem is the amounts aren't fixed: it's not always four rows into four columns. Sometimes I get five rows that I want to turn into five columns, and THEN every four rows into four columns AFTER that.
Sorry if I'm rambling but it's something like this:
Row 1 > Column 1 Row 1
Row 2 > Column 2 Row 1
Row 3 > Column 3 Row 1
Row 4 > Column 4 Row 1
Row 5 > Column 5 Row 1
Row 6 > Column 1 Row 2
Row 7 > Column 2 Row 2
Row 8 > Column 3 Row 2
Row 9 > Column 4 Row 2
Row 10 > Column 5 Row 2
Row 11 > Column 1 Row 3
Row 12 > Column 2 Row 3
Row 13 > Column 3 Row 3
Row 14 > Column 4 Row 3
Row 15 > Column 1 Row 4
Please notice how it went from five columns to four columns now.
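For what it's worth, the mapping above is just chunking the flat list by a per-row width. A minimal sketch, assuming the first group holds five values and every group after it holds four (file name hypothetical):

```powershell
# hypothetical sketch: flat.txt holds the flat list of cells, one value per line
$values = Get-Content 'C:\Temp\flat.txt'
$widths = @(5) + @(4) * 100          # first row is 5 wide, every later row 4 wide
$pos = 0
foreach ($w in $widths) {
    if ($pos -ge $values.Count) { break }
    # clamp the range so a short final group doesn't index past the end
    $end = [Math]::Min($pos + $w - 1, $values.Count - 1)
    $values[$pos..$end] -join "`t"   # emit one tab-separated row
    $pos += $w
}
```

The `$widths` list is where you'd encode the row-number knowledge you mention later: one entry per output row.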
If it can be done more easily by methods other than PowerShell that PowerShell can run (e.g. a batch file that calls PowerShell), that would be fine by me, as I need to automate a very long process and this is one of the later steps.
PS: The data are NOT cleanly comma separated. The program used, DATCOM, outputs a data file that looks neat and structured as text, but exporting to CSV destroys it, so it has to be done using:
out-file name csv
PPS: There is no clear delimiter/cutoff point, and there are no repeating numbers or anything else that can be used as a hint. I have to do it by row number, which I know due to dealing with DATCOM before.
I explained more above, but I tried using split commands and it dropped them all into rows. So if there is a way to do a literal text-to-columns split on spaces (exactly like in Excel), that would be perfect, and even better than breaking them into rows and then transposing to columns. However, it has to be EXACTLY like Excel. The problem is there are 4-8 "spaces" between each value, so if I try
Import-Csv -Delimiter " "
on the file, I get something like Alpha H1 H2 H3 CD H4 H5 H6 H7 H8 CL and everything else gets destroyed, whereas if I open Excel, highlight the cells and use Text to Columns > Delimited > Space, the results are perfect.
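For reference, what Excel's delimited mode effectively does is treat any run of spaces as a single delimiter. In PowerShell a regex split (`-split '\s+'`) behaves the same way; a quick sketch (file path hypothetical):

```powershell
# collapse runs of 1..n spaces into single commas, like Excel's Text to Columns
Get-Content 'C:\Temp\raw.txt' | ForEach-Object {
    ($_.Trim() -split '\s+') -join ','
}
```

Trimming first avoids a leading empty field when a line starts with spaces.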
Here are the files: https://easyupload.io/m/6q70ei
for006.dat is the data file generated by DATCOM.
Output1 is what I want done as described above (row to column).
Output2 is what I hope I can do later, i.e. delete a column and a row to make it cleaner, this is my ideal final output.
Mmm... I am afraid your description is pretty confusing, so I ignored it and focused on your files instead...
The batch file below reads the for006.dat file and generates your "ideal final output" Output2.xlsx file in .csv form.
@echo off
setlocal EnableDelayedExpansion
set "skip="
set "lines="
for /F "delims=:" %%a in ('findstr /N /L /C:" ALPHA" for006.dat') do (
if not defined skip (
set /A "skip=%%a-1"
) else if not defined lines (
set /A "lines=%%a-skip-1"
)
)
< for006.dat (
for /L %%a in (1,1,%skip%) do set /P "="
for /L %%a in (1,1,%lines%) do (
set /P "line="
set "line=!line:~2!"
if defined line call :reformat
)
) > Output2.csv
goto :EOF
:reformat
set "newLine=%line:  = %"
if "%newLine%" == "%line%" goto continue
set "line=%newLine%"
goto reformat
:continue
if "%line:~0,1%" == " " set "line=%line:~1%"
if "%line:~-1%" == " " set "line=%line:~0,-1%"
echo "%line: =","%"
This is Output2.csv:
"ALPHA","CD","CL","CM","CN","CA","XCP","CLA","CMA","CYB","CNB","CLB"
"-6.0","0.013","-0.175","0.2807","-0.176","-0.006","-1.599","3.100E+00","-3.580E+00","-5.643E-02","-3.080E-03","-8.679E-02"
"-3.0","0.011","-0.011","0.0926","-0.012","0.010","-7.977","3.172E+00","-3.626E+00","-8.989E-02"
"0.0","0.013","0.157","-0.0990","0.157","0.013","-0.631","3.286E+00","-3.740E+00","-9.305E-02"
"3.0","0.019","0.333","-0.2991","0.334","0.001","-0.897","3.426E+00","-3.901E+00","-9.635E-02"
"6.0","0.029","0.516","-0.5075","0.516","-0.025","-0.984","3.529E+00","-4.084E+00","-9.979E-02"
"7.5","0.036","0.609","-0.6158","0.608","-0.044","-1.013","3.472E+00","-4.002E+00","-1.015E-01"
"9.0","0.043","0.698","-0.7171","0.696","-0.067","-1.031","3.218E+00","-3.679E+00","-1.032E-01"
"10.0","0.047","0.752","-0.7791","0.748","-0.084","-1.041","2.895E+00","-3.489E+00","-1.042E-01"
"11.0","0.051","0.799","-0.8388","0.794","-0.102","-1.057","2.572E+00","-3.345E+00","-1.051E-01"
"12.0","0.055","0.841","-0.8958","0.835","-0.121","-1.073","2.320E+00","-3.178E+00","-1.059E-01"
"13.0","0.059","0.880","-0.9498","0.870","-0.140","-1.091","2.041E+00","-2.983E+00","-1.066E-01"
"14.0","0.063","0.913","-0.9999","0.901","-0.160","-1.110","1.738E+00","-2.772E+00","-1.072E-01"
"15.0","0.066","0.940","-1.0465","0.925","-0.180","-1.131","1.356E+00","-2.567E+00","-1.077E-01"
"16.0","0.067","0.960","NA","0.941","-0.201","NA","1.798E-02","NA","-1.081E-01"
"18.0","0.055","0.883","NA","0.857","-0.220","NA","-4.434E+00","NA","-1.066E-01"
You can also generate the .csv output file with no quotes by simply removing the quotes from the last echo command.
Try the following:
$columns = 4
$data = @"
ALPHA
CD
CL
CM
-5
0.1
-0.2
0.05
0
0.4
0.4
-0.08
5
0.5
0.8
-0.1
"@
$data | Format-Table
$headers = [System.Collections.ArrayList]::new()
$table = [System.Collections.ArrayList]::new()
$rows = [System.IO.StringReader]::new($data)
for($i = 0; $i -lt $columns; $i++)
{
$null = $headers.Add($rows.ReadLine()) # Add() returns the new index; discard it
}
$rowCount = 0
Write-Host $headers
While(($line = $rows.ReadLine()) -ne $null)
{
if($rowCount % $columns -eq 0)
{
$newRow = New-Object -TypeName psobject
$null = $table.Add($newRow) # suppress the index returned by Add()
}
$newRow | Add-Member -NotePropertyName $headers[$rowCount % $columns] -NotePropertyValue $line
$rowCount++
}
$table | Format-Table
You can use PowerShell's Begin/Process/End lifecycle to "buffer" input data until you have enough for a "row", then output that and start collecting for the next row:
# define width of each row as well as the column separator
$columnCount = 5
$delimiter = "`t"
# read in the file contents, "cell-by-cell"
$rawCSVData = Get-Content path\to\input\file.txt |ForEach-Object -Begin {
# set up a buffer to hold 1 row at a time
$index = 0
$buffer = [psobject[]]::new($columnCount)
} -Process {
# add input to buffer, and optionally output
$buffer[$index++] = $_
if($index -eq $columnCount){
# output row, reset column index
$buffer -join $delimiter
$index = 0
}
} -End {
# Output any partial last row (only the cells filled so far)
if($index){
$buffer[0..($index - 1)] -join $delimiter
}
}
This will produce a list of strings that can either be written to disk or parsed using regular CSV-parsing tools in PowerShell:
$rawCSVData |Set-Content path\to\output.csv
# or
$rawCSVData |ConvertFrom-Csv -Delimiter $delimiter
Once you know how many rows form the headers and the data, you can convert the file into
an array of objects by using ConvertFrom-Csv.
When done, it is easy to create a new csv from this as below:
# in this example the first 4 lines are the columns, the rest is data
# other files may need a different number of columns
$columns = 4
$data = @(Get-Content -Path 'X:\Somewhere\data.txt')
$count = 0
$result = while ($count -lt ($data.Count - ($columns - 1))) {
$data[$count..($count + $columns - 1)] -join "`t" # join the lines with a TAB
$count += $columns
}
$result = $result | ConvertFrom-Csv -Delimiter "`t"
# output on screen
$result | Format-Table -AutoSize
# write to new csv file
$result | Export-Csv -Path 'X:\Somewhere\data_new.csv' -NoTypeInformation
Output on screen:
ALPHA CD CL CM
----- -- -- --
-5 0.1 -0.2 0.05
0 0.4 0.4 -0.08
5 0.5 0.8 -0.1
Two (custom) functions that might help you to scrape your data from the for006.dat:
SelectString -From -To (see: #15136 Add -From and -To parameters to Select-String)
ConvertFrom-SourceTable
Get-Content .\for006.dat |
SelectString -From '(?=0 ALPHA CD CL.*)' -To '^0.+' |
ForEach-Object { $_.SubString(1) } |
ConvertFrom-SourceTable |
ConvertTo-Csv
Results:
"ALPHA","CD","CL","CM","CN","CA","XCP","CLA","CMA","CYB","CNB","CLB"
"-6","0.013","-0.175","0.2807","-0.176","-0.006","-1.599","3.100E+00","-3.580E+00","-5.643E-02","-3.080E-03","-8.679E-02"
"-3","0.011","-0.011","0.0926","-0.012","0.01","-7.977","3.172E+00","-3.626E+00","","","-8.989E-02"
"0","0.013","0.157","-0.0990","0.157","0.013","-0.631","3.286E+00","-3.740E+00","","","-9.305E-02"
"3","0.019","0.333","-0.2991","0.334","0.001","-0.897","3.426E+00","-3.901E+00","","","-9.635E-02"
"6","0.029","0.516","-0.5075","0.516","-0.025","-0.984","3.529E+00","-4.084E+00","","","-9.979E-02"
"7.5","0.036","0.609","-0.6158","0.608","-0.044","-1.013","3.472E+00","-4.002E+00","","","-1.015E-01"
"9","0.043","0.698","-0.7171","0.696","-0.067","-1.031","3.218E+00","-3.679E+00","","","-1.032E-01"
"10","0.047","0.752","-0.7791","0.748","-0.084","-1.041","2.895E+00","-3.489E+00","","","-1.042E-01"
"11","0.051","0.799","-0.8388","0.794","-0.102","-1.057","2.572E+00","-3.345E+00","","","-1.051E-01"
"12","0.055","0.841","-0.8958","0.835","-0.121","-1.073","2.320E+00","-3.178E+00","","","-1.059E-01"
"13","0.059","0.88","-0.9498","0.87","-0.140","-1.091","2.041E+00","-2.983E+00","","","-1.066E-01"
"14","0.063","0.913","-0.9999","0.901","-0.160","-1.110","1.738E+00","-2.772E+00","","","-1.072E-01"
"15","0.066","0.94","-1.0465","0.925","-0.180","-1.131","1.356E+00","-2.567E+00","","","-1.077E-01"
"16","0.067","0.96","NA","0.941","-0.201","NA","1.798E-02","NA","","","-1.081E-01"
"18","0.055","0.883","NA","0.857","-0.220","NA","-4.434E+00","NA","","","-1.066E-01"
Related
I have an input file with below contents:
27/08/2020 02:47:37.365 (-0516) hostname12 ult_licesrv ULT 5 LiceSrv Main[108 00000 Session 'session1' (from 'vmpms1\app1#pmc21app20.pm.com') request for 1 additional licenses for module 'SA-XT' - 1 licenses have been allocated by concurrent usage category 'Unlimited' (session module usage now 1, session category usage now 1, total module concurrent usage now 1, total category usage now 1)
27/08/2020 02:47:37.600 (-0516) hostname13 ult_licesrv ULT 5 LiceSrv Main[108 00000 Session 'sssion2' (from 'vmpms2\app1#pmc21app20.pm.com') request for 1 additional licenses for module 'SA-XT-Read' - 1 licenses have been allocated by concurrent usage category 'Floating' (session module usage now 2, session category usage now 2, total module concurrent usage now 1, total category usage now 1)
27/08/2020 02:47:37.115 (-0516) hostname141 ult_licesrv CMN 5 Logging Housekee 00000 Deleting old log file 'C:\Program Files\PMCOM Global\License Server\diag_ult_licesrv_20200824_011130.log.gz' as it exceeds the purge threashold of 72 hours
27/08/2020 02:47:37.115 (-0516) hostname141 ult_licesrv CMN 5 Logging Housekee 00000 Deleting old log file 'C:\Program Files\PMCOM Global\License Server\diag_ult_licesrv_20200824_021310.log.gz' as it exceeds the purge threashold of 72 hours
27/08/2020 02:47:37.625 (-0516) hostname150 ult_licesrv ULT 5 LiceSrv Main[108 00000 Session 'session1' (from 'vmpms1\app1#pmc21app20.pm.com') request for 1 additional licenses for module 'SA-XT' - 1 licenses have been allocated by concurrent usage category 'Unlimited' (session module usage now 2, session category usage now 1, total module concurrent usage now 2, total category usage now 1)
I need to generate an output file like below:
Date,time,hostname,session_module_usage,session_category_usage,module_concurrent_usage,total_category_usage
27/08/2020,02:47:37.365 (-0516),hostname12,1,1,1,1
27/08/2020,02:47:37.600 (-0516),hostname13,2,2,1,1
27/08/2020,02:47:37.115 (-0516),hostname141,0,0,0,0
27/08/2020,02:47:37.115 (-0516),hostname141,0,0,0,0
27/08/2020,02:47:37.625 (-0516),hostname150,2,1,2,1
The output data order is: Date,time,hostname,session_module_usage,session_category_usage,module_concurrent_usage,total_category_usage.
Put 0,0,0,0 if there is no entry for session_module_usage, session_category_usage, module_concurrent_usage and total_category_usage.
I need to get content from the input file and write the output to another file.
Update
I have created a file input.txt in F drive and pasted the log details into it.
Then I form an array by splitting the file content on newlines, as below.
$myList = (Get-Content -Path F:\input.txt) -split '\n'
Now I have 5 items in my array $myList. Then I replace the multiple blank spaces with a single blank space and form a new array by splitting each element on spaces. Then I print array elements 0 to 3. Now I need to add the end values (session_module_usage, session_category_usage, module_concurrent_usage, total_category_usage).
PS C:\Users\user> $myList = (Get-Content -Path F:\input.txt) -split '\n'
PS C:\Users\user> $myList.Length
5
PS C:\Users\user> for ($i = 0; $i -le ($myList.length - 1); $i += 1) {
>> $newList = ($myList[$i] -replace '\s+', ' ') -split ' '
>> $newList[0]+','+$newList[1]+' '+$newList[2]+','+$newList[3]
>> }
27/08/2020,02:47:37.365 (-0516),hostname12
27/08/2020,02:47:37.600 (-0516),hostname13
27/08/2020,02:47:37.115 (-0516),hostname141
27/08/2020,02:47:37.115 (-0516),hostname141
27/08/2020,02:47:37.625 (-0516),hostname150
If you really need to filter on the granularity that you're looking for, then you may need to use regex to filter the lines.
This would assume that the rows have similarly labeled lines before the values you're looking for, so keep that in mind.
[System.Collections.ArrayList]$filteredRows = @()
$log = Get-Content -Path C:\logfile.log
# index the lines directly; $log.IndexOf($row) would return the first match for duplicate lines
for ($rowIndex = 0; $rowIndex -lt $log.Count; $rowIndex++) {
$date = ([regex]::Match($log[$rowIndex],'^\d+\/\d+\/\d+')).value
$time = ([regex]::Match($log[$rowIndex],'\d+:\d+:\d+\.\d+\s\(\S+\)')).value
$hostname = ([regex]::Match($log[$rowIndex],'(?<=\d\d\d\d\) )\w+')).value
$sessionModuleUsage = ([regex]::Match($log[$rowIndex],'(?<=session module usage now )\d')).value
if (!$sessionModuleUsage) {
$sessionModuleUsage = 0
}
$sessionCategoryUsage = ([regex]::Match($log[$rowIndex],'(?<=session category usage now )\d')).value
if (!$sessionCategoryUsage) {
$sessionCategoryUsage = 0
}
$moduleConcurrentUsage = ([regex]::Match($log[$rowIndex],'(?<=total module concurrent usage now )\d')).value
if (!$moduleConcurrentUsage) {
$moduleConcurrentUsage = 0
}
$totalCategoryUsage = ([regex]::Match($log[$rowIndex],'(?<=total category usage now )\d')).value
if (!$totalCategoryUsage) {
$totalCategoryUsage = 0
}
$hash = [ordered]@{
Date = $date
time = $time
hostname = $hostname
session_module_usage = $sessionModuleUsage
session_category_usage = $sessionCategoryUsage
module_concurrent_usage = $moduleConcurrentUsage
total_category_usage = $totalCategoryUsage
}
$rowData = New-Object -TypeName 'psobject' -Property $hash
$filteredRows.Add($rowData) > $null
}
$csv = $filteredRows | ConvertTo-Csv -NoTypeInformation -Delimiter "," | ForEach-Object { $_ -replace '"','' }
$csv | Out-File C:\results.csv
What essentially needs to happen is: Get-Content on the log returns an array with one item per line.
Once we have the rows, we grab the values via regex.
Since you want zeroes for some of the items when those values don't exist, the if statements assign '0' whenever a regex returns nothing.
Finally, on each iteration we put the filtered values into a PSObject and append that object to an array of objects.
Then export to a CSV.
You can probably pick apart the lines with a regex and substrings easily enough. Basically something like the following:
# Iterate over the lines of the input file
Get-Content F:\input.txt |
ForEach-Object {
# Extract the individual fields
$Date = $_.Substring(0, 10)
$Time = $_.Substring(12, $_.IndexOf(')') - 11)
$Hostname = $_.Substring(34, $_.IndexOf(' ', 34) - 34)
$session_module_usage = 0
$session_category_usage = 0
$module_concurrent_usage = 0
$total_category_usage = 0
if ($_ -match 'session module usage now (\d+), session category usage now (\d+), total module concurrent usage now (\d+), total category usage now (\d+)') {
$session_module_usage = $Matches[1]
$session_category_usage = $Matches[2]
$module_concurrent_usage = $Matches[3]
$total_category_usage = $Matches[4]
}
# Create custom object with those properties
New-Object PSObject -Property @{
Date = $Date
time = $Time
hostname = $Hostname
session_module_usage = $session_module_usage
session_category_usage = $session_category_usage
module_concurrent_usage = $module_concurrent_usage
total_category_usage = $total_category_usage
}
} |
# Ensure column order in output
Select-Object Date,time,hostname,session_module_usage,session_category_usage,module_concurrent_usage,total_category_usage |
# Write as CSV - without quotes
ConvertTo-Csv -NoTypeInformation |
ForEach-Object { $_ -replace '"' } |
Out-File F:\output.csv
Whether to pull the date, time, and host name from the line with substrings or regex is probably a matter of taste. Same goes for how strict the format must be matched, but that to me mostly depends on how rigid the format is. For more free-form things where different lines would match different regexes, or multiple lines makes up a single record, I also quite like switch -Regex to iterate over the lines.
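As a rough illustration of that last point, a switch -Regex loop dispatches each line to the first pattern that matches (the patterns below are hypothetical, loosely based on the sample log):

```powershell
switch -Regex (Get-Content 'F:\input.txt') {
    'session module usage now (\d+)' {
        # a license-request line carrying usage counters
        "usage line, module usage: $($Matches[1])"
    }
    'Deleting old log file' {
        # a housekeeping line with no counters
        'housekeeping line'
    }
    default { 'unclassified line' }
}
```

Inside a -Regex switch, $Matches holds the capture groups of the pattern that fired, which keeps per-line-type parsing logic in one place.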
So I’ve had a request to edit a csv file by replacing column values with a set of unique numbers. Below is a sample of the original input file, with a header line followed by a couple of rows. Note that the rows have NO column headers.
HH ### SLH304 01100001 2 20200606 20200605 011100
DD|GZFD|IE|USD|20200605|EUR200717||
DD|GZFD|IE|USD|20200605|EUR200717||
What I’m looking to do is replace the values in, say, column 3 with a unique number.
So far I have the following …
$i=0
$txtin = Get-Content "C:\Temp\InFile.txt" | ForEach {"$($_.split('|'))"-replace $_[2],$i++} |Out-File C:\Temp\csvout.txt
… but this isn’t working as it removes the delimiter and adds numbers in the wrong places …
HH0###0000000SLH30400110000100000002000000202006060202006050011100
1D1D1 1G1Z1F1D1 1I1E1 1U1S1D1 12101210101610151 1E1U1R1210101711171 1 1
2D2D2 2G2Z2F2D2 2I2E2 2U2S2D2 22202220202620252 2E2U2R2220202721272 2 2
Ideally I want it to look like this, whereby the values of 'IE' have been replaced by '01' and '02' in each row ...
HH ### SLH304 01100001 2 20200606 20200605 011100
DD|GZFD|01|USD|20200605|EUR200717||
DD|GZFD|02|USD|20200605|EUR200717||
Any ideas on how to resolve would be much appreciated.
I think spreading this out into multiline code will make it easier:
$txtin = Get-Content 'C:\Temp\InFile.txt'
# loop through the lines, skipping the first line
for ($i = 1; $i -lt $txtin.Count; $i++){
$parts = $txtin[$i].Split('|') # or use regex -split '\|'
if ($parts.Count -ge 3) {
$parts[2] = '{0:00}' -f $i # update the 3rd column
$txtin[$i] = $parts -join '|' # rejoin the parts with '|'
}
}
$txtin | Out-File -FilePath 'C:\Temp\csvout.txt'
Output will be:
HH ### SLH304 01100001 2 20200606 20200605 011100
DD|GZFD|01|USD|20200605|EUR200717||
DD|GZFD|02|USD|20200605|EUR200717||
Updated to use the more robust check suggested by mklement0. This avoids errors when the line does not have at least three parts in it after the split
So I have a csv file which is 25MB.
I only need to get the value stored in the first column of the 2nd line and use it later in the PowerShell script.
e.g data
File_name,INVNUM,ID,XXX....850 columns
ABCD,123,090,xxxx.....850 columns
ABCD,120,091,xxxx.....850 columns
xxxxxx5000+ rows
So my first column data is always the same, and I just need to get this filename from the first column, 2nd row.
Should I try to use Get-Content or Import-Csv for this use case?
Thanks,
Mickey
TessellatingHeckler's helpful answer contains a pragmatic, easy-to-understand solution that is most likely fast enough in practice; the same goes for Robert Cotterman's helpful answer which is concise (and also faster).
If performance is really paramount, you can try the following, which uses the .NET framework directly to read the lines - but given that you only need to read 2 lines, it's probably not worth it:
$inputFile = "$PWD/some.csv" # be sure to specify a *full* path
$isFirstLine=$true
$fname = foreach ($line in [IO.File]::ReadLines($inputFile)) {
if ($isFirstLine) { $isFirstLine = $false; continue } # skip header line
$line -replace '^([^,]*),.*', '$1' # extract 1st field from 2nd line and exit
break # exit
}
Note: A conceptually simpler way to extract the 1st field is to use ($line -split ',')[0], but with a large number of columns the above -replace-based approach is measurably faster.
Update: TessellatingHeckler offers 2 ways to speed up the above:
Use of $line.Substring(0, $line.IndexOf(',')) in lieu of $line -replace '^([^,]*),.*', '$1' in order to avoid relatively costly regex processing.
For a lesser gain, use of a [System.IO.StreamReader] instance's .ReadLine() method twice in a row rather than [IO.File]::ReadLines() in a loop.
Here's a performance comparison of the approaches across all answers on this page (as of this writing).
To run it yourself, you must download functions New-CsvSampleData and Time-Command first.
For more representative results, the timings are averaged across 1,000 runs:
# Create sample CSV file 'test.csv' with 850 columns and 100 rows.
$testFileName = "test-$PID.csv"
New-CsvSampleData -Columns 850 -Count 100 | Set-Content $testFileName
# Compare the execution speed of the various approaches:
Time-Command -Count 1000 {
# Import-Csv
Import-Csv -LiteralPath $testFileName |
Select-Object -Skip 1 -First 1 -ExpandProperty 'col1'
}, {
# ReadLines(), -replace
$inputFile = $PWD.ProviderPath + "/$testFileName"
$isFirstLine=$true
foreach ($line in [IO.File]::ReadLines($inputFile)) {
if ($isFirstLine) { $isFirstLine = $false; continue } # skip header line
$line -replace '^([^,]*),.*', '$1' # extract 1st field from 2nd line and exit
break # exit
}
}, {
# ReadLines(), .Substring / IndexOf
$inputFile = $PWD.ProviderPath + "/$testFileName"
$isFirstLine=$true
foreach ($line in [IO.File]::ReadLines($inputFile)) {
if ($isFirstLine) { $isFirstLine = $false; continue } # skip header line
$line.Substring(0, $line.IndexOf(',')) # extract 1st field from 2nd line and exit
break # exit
}
}, {
# ReadLine() x 2, .Substring / IndexOf
$inputFile = $PWD.ProviderPath + "/$testFileName"
$f = [System.IO.StreamReader]::new($inputFile,$true);
$null = $f.ReadLine(); $line = $f.ReadLine()
$line.Substring(0, $line.IndexOf(','))
$f.Close()
}, {
# Get-Content -Head, .Split()
((Get-Content $testFileName -Head 2)[1]).split(',')[1]
} |
Format-Table Factor, Timespan, Command
Remove-Item $testFileName
Sample output from a single-core Windows 10 VM running Windows PowerShell v5.1 / PowerShell Core 6.1.0-preview.4 on a recent-model MacBook Pro:
Windows PowerShell v5.1:
Factor TimeSpan Command
------ -------- -------
1.00 00:00:00.0001922 # ReadLine() x 2, .Substring / IndexOf...
1.04 00:00:00.0002004 # ReadLines(), .Substring / IndexOf...
1.57 00:00:00.0003024 # ReadLines(), -replace...
3.25 00:00:00.0006245 # Get-Content -Head, .Split()...
25.83 00:00:00.0049661 # Import-Csv...
PowerShell Core 6.1.0-preview.4:
Factor TimeSpan Command
------ -------- -------
1.00 00:00:00.0001858 # ReadLine() x 2, .Substring / IndexOf...
1.03 00:00:00.0001911 # ReadLines(), .Substring / IndexOf...
1.60 00:00:00.0002977 # ReadLines(), -replace...
3.30 00:00:00.0006132 # Get-Content -Head, .Split()...
27.54 00:00:00.0051174 # Import-Csv...
Conclusions:
Calling .ReadLine() twice is marginally faster than the ::ReadLines() loop.
Using -replace instead of Substring() / IndexOf() adds about 60% execution time.
Using Get-Content is more than 3 times slower.
Using Import-Csv | Select-Object is close to 30 times(!) slower, presumably due to the large number of columns; that said, in absolute terms we're still only talking about around 5 milliseconds.
As a side note: execution on macOS seems to be noticeably slower overall, with the regex solution and the cmdlet calls also being slower in relative terms.
Depends what you want to prioritize.
$data = Import-Csv -LiteralPath 'c:\temp\data.csv' |
Select-Object -Skip 1 -First 1 -ExpandProperty 'File_Name'
Is short and convenient. (2nd line meaning 2nd line of the file, or 2nd line of the data? Don't skip any if it's the first line of data).
Select-Object with something like -First 1 will break the whole pipeline when it's done, so it won't wait to read the rest of the 25MB in the background before returning.
You could likely speed it up, or reduce memory use by a minuscule amount, if you opened the file, sought past two newlines, then a comma, then read to another comma, or some other long detailed code, but I very much doubt it would be worth it.
Same with Get-Content: the way it adds NoteProperties to the output strings means it's likely no easier on memory and not usefully faster than Import-Csv.
You could really shorten it with
(gc c:\file.txt -head 2)[1]
Only reads 2 lines and then grabs index 1 (second line)
You could then split it and grab index 1 of the split-up line:
((gc c:\file.txt -head 2)[1]).split(',')[1]
UPDATE: After seeing the new post with timings, I was inspired to do some tests myself (thanks mklement0). This was the fastest I could get to work:
$check = 0
foreach ($i in [IO.FILE]::ReadLines("$filePath")){
if ($check -eq 2){break}
if ($check -eq 1){$value = $i.split(',')[1]} #$value = your answer
$check++
}
Just thought of this: remove the -eq 2 check and put the break after a semicolon once the -eq 1 branch has run. About 5 ticks faster; I haven't tested it.
here were my results over 40000 tests:
GC split avg was 1.11307622 Milliseconds
GC split Min was 0.3076 Milliseconds
GC split Max was 18.1514 Milliseconds
ReadLines split avg was 0.3836625825 Milliseconds
ReadLines split Min was 0.2309 Milliseconds
ReadLines split Max was 31.7407 Milliseconds
Stream Reader avg was 0.4464924825 Milliseconds
Stream Reader MIN was 0.2703 Milliseconds
Stream Reader Max was 31.4991 Milliseconds
Import-CSV avg was 1.32440485 Milliseconds
Import-CSV MIN was 0.2875 Milliseconds
Import-CSV Max was 103.1694 Milliseconds
I was able to run 3,000 tests a second on the 2nd and 3rd, and 1,000 tests a second on the first and last. StreamReader was his fastest one. And Import-Csv wasn't bad; I wonder if mklement0's test csv didn't have a column named "file_name"? Anyhow, I'd personally use the GC command because it's concise and easy to remember. But this is up to you, and I wish you luck on your scripting adventures.
I'm certain we could start hyperthreading this and get insane results, but when you're talking thousandths of a second, is it really a big deal? Especially to get one variable? :D
Here's the StreamReader code I used, for transparency:
$inputFile = "$filePath"
$f = [System.IO.StreamReader]::new($inputFile,$true);
$null = $f.ReadLine(); $line = $f.ReadLine()
$line.Substring(0, $line.IndexOf(','))
$f.Close()
I also noticed this pulls the 1st value of the second line, and I have no idea how to switch it to the 2nd value. It seems to measure the width from position 0 to the first comma and then cut that out; if you change the Substring start from 0 to, say, 5, it still measures the length from 0 to the comma, but starts grabbing at the 6th character.
The Import-Csv I used was:
$data = Import-Csv -LiteralPath "$filePath" |
Select-Object -Skip 1 -First 1 -ExpandProperty 'FileName'
I tested these on a 90 MB csv with 21 columns and 284k rows, where "FileName" was the second column.
I have two text files.
$File1 = "C:\Content1.txt"
$File2 = "C:\Content2.txt"
I'd like to compare these to see if they have the same number of lines and then I'd like to record the line number of each line that matches. I realize that sounds ridiculous but this is what I've been asked to do at my work.
I can compare them a lot of ways. I decided to do the following:
$File1Lines = Get-Content $File1 | Measure-Object -Line
$File2Lines = Get-Content $File2 | Measure-Object -Line
I'd like to test it with an if statement so that if they don't match, then I can start an earlier process over again.
if ($file1lines.lines -eq $file2lines.lines)
{ Get the Line #s that match and proceed to the next step}
else {Start Over}
I'm unsure how to record the line #s that match. Any thoughts on how to do this?
This is really pretty simple since Get-Content reads the file in as an array of strings, and you can index that array simply enough.
Do{
<stuff to generate files>
}While(($File1 = GC $PathToFile1).Count -ne ($File2 = GC $PathToFile2).count)
$MatchingLineNumbers = 0..($File1.count -1) | Where{$File1[$_] -eq $File2[$_]}
Since arrays in PowerShell use a 0 based index we want to start at 0 and go for however many lines the files have. Since .count starts at 1 not 0 we need to subtract 1 from the total count. So if your file has 27 lines $File1.count will equal 27. The index for those lines ranges from 0 (first line) to 26 (last line). The code ($File1.count - 1) would effectively come out to 26, so 0..26 starts at 0, and counts to 26.
Then each number goes to a Where statement that checks that specific line in each file to see if they are equal. If they are then it passes the number along, and that gets collected in $MatchingLineNumbers. If the lines don't match the number isn't passed along.
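To make the index comparison concrete, here is the same Where pattern on two tiny in-memory arrays (sample data, not the OP's files):

```powershell
# two "files" of equal length, differing only at index 1
$File1 = 'a','b','c','d'
$File2 = 'a','x','c','d'
# collect the indexes where both arrays hold the same line
$MatchingLineNumbers = 0..($File1.Count - 1) | Where-Object { $File1[$_] -eq $File2[$_] }
$MatchingLineNumbers   # -> 0, 2, 3
```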
You'll need to get an intersection first, then find the index.
file1.txt
Line1
Line2
Line3
Line11
Line21
Line31
Line12
Line22
Line32
file2.txt
Line1
Line11
Line21
Line31
Line12
Line222
Line323
Line214
Line315
Line12
Line22
Line32
test.ps1
$file1 = Get-Content file1.txt
$file2 = Get-Content file2.txt
$matchingLines = $file1 | ? { $file2 -contains $_ }
$file1Lines = $matchingLines | % { [array]::IndexOf($file1, $_) }
$file2Lines = $matchingLines | % { [array]::IndexOf($file2, $_) }
Output
$file1Lines
0
3
4
5
6
7
8
$file2Lines
0
1
2
3
4
10
11
Can anyone explain the difference between the following two statements?
gc -ReadCount 2 .\input.txt| % {"##" + $_}
(gc -ReadCount 2 .\input.txt)| % {"##" + $_}
I am using below file as input for above commands.
input.txt
1
2
Output
gc -ReadCount 2 .\input.txt| % {"##" + $_}
##1 2
(gc -ReadCount 2 .\input.txt)| % {"##" + $_}
##1
##2
If the input file contains more than 2 records, both give the same output.
I can modify my code to achieve what I want, but I am just wondering why these two give different outputs.
I googled for information but didn't find an answer.
Edit 1
Isn't the output of command 2 wrong? When I specify "-ReadCount 2" it should pipe two lines at a time, which means the foreach loop should iterate only once (as the input contains only 2 lines) with $_[0]=1, $_[1]=2, so that when I print "##"+$_ it should print "##1 2" as command 1 did.
gc -ReadCount 2 .\input.txt| % {"##" + $_}
Reads the content as a [String], which means it adds "##" and then the whole text after it (the foreach loop runs once).
(gc -ReadCount 2 .\input.txt)| % {"##" + $_}
Reads the content as an [Array] and evaluates each line of it, adding "##" before the content of each line (the foreach loop runs twice).
The -ReadCount parameter is used to split the data into chunks of lines, mostly for performance, so -ReadCount 3 will show:
##1 2 3
##4 5 6
##7
and -ReadCount 4 will show:
##1 2 3 4
##5 6 7
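You can see the chunking directly by inspecting what each pipeline iteration receives; a sketch, assuming a seven-line input.txt holding the numbers 1 through 7:

```powershell
Get-Content -ReadCount 3 .\input.txt | ForEach-Object {
    # each $_ is one chunk: a string array of up to 3 lines
    "$($_.Count) line(s): $($_ -join ' ')"
}
# with lines 1..7 this would print something like:
#   3 line(s): 1 2 3
#   3 line(s): 4 5 6
#   1 line(s): 7
```

Wrapping the Get-Content call in parentheses instead forces the whole result to be collected first, which is where the two commands in the question diverge for small files.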
For everyone to see, here is the output of Get-Help Get-Content -Parameter ReadCount:
-ReadCount <Int64>
Specifies how many lines of content are sent through the pipeline at a time.
The default value is 1. A value of 0 (zero) sends all of the content at one time.
This is what breaks the lines into groups (I had assumed it would instead limit the number of lines read from the file).
Still no clue about the fewer-than-three-lines behavior, though.