PowerShell: how to count number of rows in csv file? - command-line

How can I count the number of rows in a csv file using powershell? I tried something like
Get-Content -length "C:\Directory\file.csv"
or
(Get-Content).length "C:\Directory\file.csv"
but these result in an error.

Get-Content and Measure-Object are fine for small files, but both are super inefficient with memory. I had real problems with large files.
When counting rows in a 1GB file using either method, Powershell gobbled up all available memory on the server (8GB), then started paging to disk. I left it for over an hour, but it was still paging to disk, so I killed it.
The best method I found for large files is to use IO.StreamReader to load the file from disk and count each row using a variable. This keeps memory usage down to a very reasonable 25MB and is much, much quicker, taking around 30 seconds to count rows in a 1GB file or a couple of minutes for a 6GB file. It never eats up unreasonable amounts of RAM, no matter how large your file is:
[int]$LinesInFile = 0
$reader = New-Object IO.StreamReader 'c:\filename.csv'
while($reader.ReadLine() -ne $null){ $LinesInFile++ }
$reader.Close()  # release the file handle when done
The above snippet can be inserted wherever you would use get-content or measure-object; simply refer to the $LinesInFile variable to get the row count of the file.
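If you need this in several places, one way to package it up is a small helper function. This is only a sketch; the function name Get-LineCount is made up for illustration:
function Get-LineCount {
    param([Parameter(Mandatory)][string]$Path)
    [int]$lines = 0
    $reader = New-Object IO.StreamReader $Path
    try {
        # ReadLine returns $null at end of file
        while ($null -ne $reader.ReadLine()) { $lines++ }
    }
    finally {
        $reader.Dispose()
    }
    $lines
}
Get-LineCount 'C:\Directory\file.csv'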

Pipe it to the Measure-Object cmdlet
Import-Csv C:\Directory\file.csv | Measure-Object

Generally (csv or not)
@(Get-Content c:\file.csv).Length
You need the @ array-subexpression prefix: without it, if the file has only one line, Get-Content returns a single string and .Length gives the number of characters in that line instead of a line count of 1.
Get-Content c:\file.csv | Measure-Object -line
But both will fail if any record spans more than one line. In that case it is better to import the csv and measure:
Import-Csv c:\file.csv | Measure-Object | Select-Object -expand count

You can simply use a unix-like command in powershell.
If your file is test.csv,
then the command to get the row count is
gc test.csv | Measure-Object

(Import-Csv C:\Directory\file.csv).count is the only accurate one out of these.
I tried all of the other suggestions on a csv with 4781 rows, and all but this one returned 4803.
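That kind of discrepancy is typically caused by quoted fields that contain line breaks: line-based counts see the extra physical lines, while Import-Csv counts records. A quick way to see the difference for yourself, using a small hypothetical temp file:
$csv = @'
Name,Notes
Alice,"first line
second line"
Bob,ok
'@
$path = Join-Path $env:TEMP 'demo.csv'
Set-Content -Path $path -Value $csv

(Get-Content $path | Measure-Object -Line).Lines   # 4 physical lines
(Import-Csv $path).Count                           # 2 records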

You can try
(Import-Csv C:\Directory\file.csv).count
or
$a=Import-Csv C:\Directory\file.csv
$a.count

Related

Powershell - Return Line or Row number from input file

I found an answer to a previous question incredibly helpful, but I can't quite figure out how Get-Content is able to store the 'line number' from the input.
Basically I'm wondering if PSObjects store information such as the line number or row number. In the example below, it is as if Get-Content stores the line number as a variable you can use later. In the pipeline, the variable would be $_.psobject.Properties.value[5]
A bit of that seems redundant to me since $_ is an object (I think), but it is still very cool that .value[5] seems to be the line number or row number. The same is not true of Import-CSV, and while I'm looking for a similar option with Import-CSV, I'd also like to better understand why this works the way it does.
https://stackoverflow.com/a/23119235/15243610
Get-Content $colCnt | ?{$_} | Select -Skip 1 | %{if(!($_.split("|").Count -eq 210)){"Process stopped at line number $($_.psobject.Properties.value[5]), incorrect column count of: $($_.split("|").Count).";break}}
The answer in the other question works because Get-Content does indeed include the line number when it reads in the strings. When you run Get-Content, each line will have a $_.ReadCount property as the 6th property on the object, which in my old answer I referenced through the PSObject as $_.psobject.Properties.value[5] (it was 7 years ago and I didn't know better yet, sorry). Mind you, if you use the -ReadCount parameter it will send that many lines through at a time, so Get-Content $file -readcount 5 | Select -first 1 | ForEach-Object{ $_.ReadCount } will come out as 5. Also, -Raw sends everything through at once, so ReadCount won't help there.
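To see that property in action, a quick illustration (the path is the same placeholder used below; with the default -ReadCount of 1, ReadCount is effectively the line number):
Get-Content C:\Path\To\SomeFile.csv | ForEach-Object { "$($_.ReadCount): $_" }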
Honestly, this isn't that hard to adapt to Import-Csv; we just increment a variable defined in the ForEach-Object loop:
Import-Csv C:\Path\To\SomeFile.csv | ForEach-Object -Begin {$x=1} -Process {
    If($_.Something -eq $SomethingElse){
        Write-Warning "Somethin' bad happened on line $x!"
        break
    }else{$_}
    $x++
}

How does powershell lazily evaluate this statement?

I was searching for a way to read only the first few lines of a csv file and came across this answer. The accepted answer suggests using
Get-Content "C:\start.csv" | select -First 10 | Out-File "C:\stop.csv"
Another answer suggests using
Get-Content C:\Temp\Test.csv -TotalCount 3
Because my csv is fairly large I went with the second option. It worked fine. Out of curiosity I decided to try the first option assuming I could ctrl+c if it took forever. I was surprised to see that it returned just as quickly.
Is it safe to use the first approach when working with large files? How does powershell achieve this?
Yes, Select-Object -First n is "safe" for large files (provided you want to read only a small number of lines, so pipeline overhead will be insignificant, else Get-Content -TotalCount n will be more efficient).
It works like break in a loop, by exiting the pipeline early, when the given number of items have been processed. Internally it throws a special exception that the PowerShell pipeline machinery recognizes.
Here is a demonstration that "abuses" Select-Object to break from a ForEach-Object "loop", which is not possible using a normal break statement:
1..10 | ForEach-Object {
    Write-Host $_ # goes directly to console, so is ignored by Select-Object
    if( $_ -ge 3 ) { $true } # "break" by outputting one item
} | Select-Object -First 1 | Out-Null
Output:
1
2
3
As you can see, Select-Object -First n actually breaks the pipeline instead of first reading all input and then selecting only the specified number of items.
Another, more common use case is when you want to find only a single item in the output of a pipeline. Then it makes sense to exit from the pipeline as soon as you have found that item:
Get-ChildItem -Recurse | Where-Object { SomeCondition } | Select-Object -First 1
According to Microsoft, the Get-Content cmdlet has a parameter called -ReadCount. Their documentation states:
Specifies how many lines of content are sent through the pipeline at a time. The default value is 1. A value of 0 (zero) sends all of the content at one time.
This parameter does not change the content displayed, but it does affect the time it takes to display the content. As the value of ReadCount increases, the time it takes to return the first line increases, but the total time for the operation decreases. This can make a perceptible difference in large items.
Since -ReadCount defaults to 1 Get-Content effectively acts as a generator for reading a file line-by-line.
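As a side note, raising -ReadCount changes what travels down the pipeline: each pipeline object becomes an array of lines rather than a single line. A small sketch, reusing the Test.csv path from above:
# Each pipeline object is now an array of up to 100 lines,
# so Measure-Object counts batches, not lines.
(Get-Content C:\Temp\Test.csv -ReadCount 100 | Measure-Object).Count

# Summing the batch sizes still gives the total line count.
(Get-Content C:\Temp\Test.csv -ReadCount 100 | ForEach-Object { $_.Count } | Measure-Object -Sum).Sum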

Get-Content Measure-Object Command : Additional rows are added to the actual row count

This is my first post here - my apologies in advance if I didn't follow a certain etiquette for posting. I'm a newbie to powershell, but I'm hoping someone can help me figure something out.
I'm using the following powershell script to tell me the total count of rows in a CSV file, minus the header. This is generated into a text file.
$x = (Get-Content -Path "C:\mysql\out_data\18*.csv" | Measure-Object -Line).Lines
$logfile = "C:\temp\MyLog.txt"
$files = get-childitem "C:\mysql\out_data\18*.csv"
foreach($file in $files)
{
    $x--
    "File: $($file.name) Count: $x" | out-file $logfile -Append
}
I am doing this for 10 individual files. But there is just ONE file that keeps adding exactly 807 more rows to the actual count. For example, for the code above, the actual row count (minus the header) in the file is 25,083. But my script above generates 25,890 as the count. I've tried running this for different iterations of the same type of file (same data, different days), but it keeps adding exactly 807 to the row count.
Even when running only (Get-Content -Path "C:\mysql\out_data\18*.csv" | Measure-Object -Line).Lines, I still see the wrong record count in the powershell window.
I'm suspicious that there may be a problem specifically with the csv file itself? I'm coming to that conclusion since 9 out of 10 files generate the correct row count. Thank you in advance for your time.
To measure the items in a csv you should use Import-Csv rather than Get-Content. This way you don't have to worry about headers or empty lines.
(Import-Csv -Path $csvfile | Measure-Object).Count
It's definitely possible there's a problem with that csv file. Also, note that if the csv has cells that include line breaks, that will confuse Get-Content, so also try Import-CSV.
I'd start with this
$PathToQuestionableFile = "c:\somefile.csv"
$TestContents = Get-Content -Path $PathToQuestionableFile
Write-Host "`n-------`nUsing Get-Content:"
$TestContents.count
$TestContents[0..10]
$TestCsv = Import-CSV -Path $PathToQuestionableFile
Write-Host "`n-------`nUsing Import-CSV:"
$TestCsv.count
$TestCsv[0..10] | Format-Table
That will let you see what Get-Content is pulling so you can narrow down where the problem is.
If it is in the file itself, and using Import-CSV doesn't fix it, I'd try using Notepad++ to check both the encoding and the line endings:
Encoding is a drop-down menu; compare it to the other csv files.
Line endings can be seen with View > Show Symbol > Show All Characters. They should be consistent across the file, and should be one of these (a quick way to check them from PowerShell itself is sketched after this list):
CR (typically if it came from a mac)
LF (typically if it came from *nix or the internet)
CRLF (typically if it came from windows)
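Here is that sketch; the file name is a placeholder for the problem file, and it reads the whole file into memory, which is fine at ~25,000 rows:
$raw  = [System.IO.File]::ReadAllText('C:\mysql\out_data\problem-file.csv')
$crlf = [regex]::Matches($raw, "`r`n").Count        # Windows line endings
$lf   = [regex]::Matches($raw, "(?<!`r)`n").Count   # bare LF
$cr   = [regex]::Matches($raw, "`r(?!`n)").Count    # bare CR
"CRLF: $crlf  LF-only: $lf  CR-only: $cr"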

How can I remove duplicates in Powershell without running out of memory?

I'm currently using this command in Windows Powershell to remove duplicates from a simple 1 column CSV.
gc combine.csv | sort | get-unique > tags.cs
Whenever I run it on a 150mb CSV (20 million rows, guessing) the task manager shows Powershell eating up all available memory (32GB) and then using virtual memory. I also let the script run for about an hour and it didn't finish. I find that strange because in Excel it usually takes a few seconds to remove duplicates from my 1M row CSVs. Any suggestions on how to deal with this?
You could try:
Get-Content combine.csv -ReadCount 1000 |
foreach-object { $_ } |
Sort-Object -Unique |
Set-Content tags.cs
Or, roughly the same with aliases:
gc combine.csv -read 1kb | % { $_ } | sort -uniq | sc tags.cs
But I think you'll hit the same problems. If you want faster results, and they don't need to be sorted, just duplicate-free:
$Lines = [System.Collections.Generic.HashSet[string]]::new()
$Lines.UnionWith([string[]][System.IO.File]::ReadAllLines('c:\path\to\combine.csv'))
[System.IO.File]::WriteAllLines('c:\path\to\tags.cs', $Lines)
That ran on my test file of 20M random numbers in 23 seconds and ~1.5GB memory. If they do need to be sorted, use SortedSet instead of HashSet; that ran in 5 minutes and <2GB memory, while your original code is still running and currently past 15GB.
Edit: tiberriver256 comments that [System.IO.File]::ReadLines instead of ReadAllLines can be streamed before the file has finished being read; it returns an enumerator rather than a final array of all lines. In the HashSet case this knocks runtime down a little from 12.5s to 11.5s - it varies too much to be sure, but it does seem to help.
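A streaming variant along those lines might look like this (a sketch, using the same placeholder paths; the HashSet constructor that accepts an enumerable consumes the lines as they are read):
$Lines = [System.Collections.Generic.HashSet[string]]::new(
    [System.IO.File]::ReadLines('c:\path\to\combine.csv'))
[System.IO.File]::WriteAllLines('c:\path\to\tags.cs', $Lines)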
Excel is designed to deal with files that large efficiently (apparently? I'm actually a little surprised).
The major problem with your code is that you're sorting it. I know you're doing that because Get-Unique requires it, but the way that Sort-Object works is that it needs to collect every item being sent into it (in this case every line of the file) in memory in order to actually do the sort. And unlike your file, it's not just storing it as flat memory, it's storing it as N strings where N is the number of lines in your file, and all the overhead of those in-memory strings. As TessellatingHeckler points out, it seems to be tied much more to the sorting than the storing!
You probably want to be determining whether a given line is unique as you process it, so you can discard it right away.
For that, I'll recommend Sets. In particular a HashSet or, if you actually need it sorted, a SortedSet.
A simple conversion of your code:
Get-Content combine.csv |
    ForEach-Object -Begin {
        $h = [System.Collections.Generic.HashSet[String]]::new()
    } -Process {
        if ($h.Add($_)) {
            $_
        }
    } |
    Set-Content tags.cs
For me, testing this on a > 650 MB file with ~4M lines where only 26 were unique took just over a minute and didn't appreciably affect RAM.
The same file where about half the rows were unique took around 2 minutes, and used about 2 GB of RAM (with SortedSet it took a little over 2.5 mins and about 2.4 GB).
That same latter file, even with simplifying down from | sort | gu to | sort -Unique, used over 5 GB of RAM in ~10 seconds.
You can probably squeeze more performance out if you start using StreamReader.ReadLine and for loops, and some other things, but I'll leave that as an exercise for you.
It seems that in most implementations, in the best case, the amount of RAM used is going to be highly dependent on how many of the items are unique (with more unique items meaning more RAM).
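For reference, one possible shape of that StreamReader exercise is sketched below (a sketch only, reusing the placeholder paths from above):
$reader = New-Object System.IO.StreamReader 'c:\path\to\combine.csv'
$writer = New-Object System.IO.StreamWriter 'c:\path\to\tags.cs'
$seen   = [System.Collections.Generic.HashSet[string]]::new()
try {
    # Write each line the first time it is seen; HashSet.Add returns $false for duplicates.
    while ($null -ne ($line = $reader.ReadLine())) {
        if ($seen.Add($line)) { $writer.WriteLine($line) }
    }
}
finally {
    $reader.Dispose()
    $writer.Dispose()
}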
Get-Content and stdio > are both pretty slow. .Net will likely give you much better performance.
Try:
$stream = [System.IO.StreamWriter] "tags.csv"
[System.IO.File]::ReadLines("combine.csv") | get-unique | sort | % { $Stream.writeline($_) }
$Stream.close()
Testing on my own box with a 4 column 1,000,000 row csv I hit 650MB of memory utilization at 22 seconds. Running the same csv with get-content and > was 2GB of memory and 60 seconds.
With some additional trickiness taken from a similar question here (Sort very large text file in PowerShell), you can further reduce the time by casting the data to a hashset to get unique values, then to a list, and running its sort method, as this seems to be a bit faster than PowerShell's Sort-Object.
$stream = [System.IO.StreamWriter] "tags.csv"
$UniqueItems = [system.collections.generic.list[string]]([System.Collections.Generic.HashSet[string]]([System.IO.File]::ReadLines("combine.csv")))
$UniqueItems.sort()
$UniqueItems | % { $Stream.writeline($_) }
$Stream.close()
Using this on my same dataset I was able to do it in 1 second with 144MB of memory usage.

how to autofit columns of csv from powershell

I have a powershell script which connects to a database & exports the result to a csv file.
However, there is one date column whose width needs to be manually increased after opening the csv file.
Is there some command/property which will make the columns AutoFit?
export-csv $OutFile -NoTypeInformation
I can't export to Excel instead of CSV, because I don't have Excel installed on my machine.
This is what I have tried most recently:
$objTable | Select Store,RegNo,Date,@{L="Amount";E={($_.Amount).PadLeft(50," ")}},TranCount
$objTable | export-csv $OutFile -NoTypeInformation
But even after adding PadLeft(), the output is the same; the Date column is still too narrow (it shows ### and I need to widen it manually).
When you say you need to increase one of your column sizes, the comments are right: a CSV has no column widths, and everything is formatted based on the object content. If you really need the cells to be a certain length you need to change the data before it is exported. Using the string methods .PadLeft() and .PadRight(), I think you will get what you need.
Take this example using output from Get-ChildItem which uses a calculated property to pad the "column" so that all the data takes up at least 20 characters.
$path = "C:\temp"
$results = Get-ChildItem $path
$results | Select LastWriteTime,@{L="Name";E={($_.Name).PadLeft(20," ")}} | Export-CSV C:\temp\file.csv -NoTypeInformation
If that was then exported the output file would look like this (Notice the whitespace):
"LastWriteTime","Name"
"2/23/2015 7:33:55 PM"," folder1"
"2/23/2015 7:48:02 PM"," folder2"
"2/23/2015 7:48:02 PM"," Folder3"
"1/8/2015 10:37:45 PM"," logoutput"