How to process a large CSV file in PowerShell

I am trying to count the rows in a CSV file whose value is above a certain threshold. The code I have goes something like this:
$T6=Import-Csv $file | Where-Object {$_."Value" -ge 0.6 } | Measure-Object
This works well for smaller files, but for large CSV files (1 GB or more) it runs forever. Is there a better way to parse CSV files like this in PowerShell?

Import-Csv is the official cmdlet for this. One caveat, though: everything it imports is a string, so you should cast the Value property to the correct type. For instance:
$T6 = Import-Csv $file | Where-Object { [float]$_.Value -ge 0.6 } | Measure-Object

You can try to get rid of Import-Csv:
# Read the whole file and split on the ';' delimiter, dropping empty entries.
$values = ([System.IO.File]::ReadAllText('c:\pst\New Microsoft Office Excel Worksheet.csv')).Split(";") | where {$_ -ne ""}
$items = New-Object "System.Collections.Generic.List[decimal]"
foreach ($value in $values)
{
    [decimal]$out = 0
    # Keep only entries that parse as decimals and are at least 10.
    if ([System.Decimal]::TryParse($value, [ref] $out))
    {
        if ($out -ge 10) { $items.Add($out) }
    }
}
$items | Measure-Object

For speed when processing large files, consider using a StreamReader; Roman's answer here demonstrates its usage.
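
A minimal sketch of the StreamReader approach (this is not the code from the linked answer). It assumes a simple CSV: a header row, a single-character delimiter, and no quoted fields that contain the delimiter. The function name Measure-CsvRowsAbove and its parameters are hypothetical.

```powershell
# Hypothetical helper: counts rows whose column value is >= a threshold,
# reading one line at a time instead of loading the whole file.
# Pass a full path: .NET resolves relative paths against the process
# working directory, not the current PowerShell location.
function Measure-CsvRowsAbove {
    param(
        [string]$Path,
        [string]$Column = 'Value',
        [double]$Threshold = 0.6,
        [char]$Delimiter = ','
    )
    $culture = [System.Globalization.CultureInfo]::InvariantCulture
    $style   = [System.Globalization.NumberStyles]::Float
    $reader  = New-Object System.IO.StreamReader($Path)
    try {
        # Find the index of the wanted column in the header line.
        $header = $reader.ReadLine().Split($Delimiter)
        $index  = [array]::IndexOf($header, $Column)
        if ($index -lt 0) { throw "Column '$Column' not found." }

        $count = 0
        $value = 0.0
        while ($null -ne ($line = $reader.ReadLine())) {
            $fields = $line.Split($Delimiter)
            # Rows that are too short or not numeric are skipped.
            if ($fields.Count -gt $index -and
                [double]::TryParse($fields[$index], $style, $culture, [ref]$value) -and
                $value -ge $Threshold) {
                $count++
            }
        }
        $count
    }
    finally {
        $reader.Dispose()
    }
}
```

Usage would look like Measure-CsvRowsAbove -Path 'C:\data\big.csv' -Threshold 0.6, where the path is a placeholder. Because no objects are created per row, this tends to be much faster than Import-Csv on very large files, at the cost of not handling quoted fields.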

Related

Powershell: delete duplicate entry in arraylist

In my PowerShell script I read some data from a CSV file into an ArrayList.
In the second step I eliminate every line that does not contain the character (.
In the third step I want to eliminate all duplicate entries.
An example of my list:
Klein, Jürgen (Klein01); salesmanagement national
Klein, Jürgen (Klein01); salesmanagement national
Meyer, Gerlinde (Meyer02); accounting
Testuser
Admin1
Müller, Kai (Muell04); support international
I use the following script:
$Arrayusername = New-Object System.Collections.ArrayList
$NewArraylistuser = New-Object System.Collections.ArrayList
$Arrayusername = Get-Content -Path "C:\Temp\User\Userlist.csv"
for ($i=0; $i -le $Arrayusername.length; $i++)
{
    if ($Arrayusername[$i] -like "*(*")
    {
        $NewArraylistuser.Add($Arrayusername_ads[$i])
    }
    $Array_sorted = $NewArraylistuser | sort
    $Array_sorted | Get-Unique
}
But the variable $Array_sorted still has duplicate entries.
I can't find the mistake.

Some ideas for how you could change your code:
Use the existing Import-Csv command to import the .csv file with the delimiter ;.
Filter the output with Where-Object to only include names containing (.
Select only unique objects with Select-Object -Unique, or, if you also want to sort the objects, use Sort-Object with the -Unique parameter.
Something like this should work:
Import-csv -Delimiter ';' -Header "Name","Position" -Path "C:\Temp\User\Userlist.csv" | Where-Object {$_.Name -like "*(*"} | Sort-Object -Unique -Property Name,Position
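
To try the same pipeline without the file, here is a small sketch that feeds the sample lines from the question through ConvertFrom-Csv instead of Import-Csv. The variable names $lines and $unique are made up for illustration. Note that in the question's script the output of Get-Unique is never assigned back to a variable, so $Array_sorted keeps its duplicates, and $Arrayusername_ads is never defined; the sketch assigns the pipeline result explicitly.

```powershell
# Sample lines from the question, standing in for Userlist.csv.
$lines = @(
    'Klein, Jürgen (Klein01); salesmanagement national'
    'Klein, Jürgen (Klein01); salesmanagement national'
    'Meyer, Gerlinde (Meyer02); accounting'
    'Testuser'
    'Admin1'
    'Müller, Kai (Muell04); support international'
)

# Parse with ';' as delimiter, keep names containing '(',
# then sort and de-duplicate in one step. @() keeps the result an array.
$unique = @(
    $lines |
        ConvertFrom-Csv -Delimiter ';' -Header 'Name', 'Position' |
        Where-Object { $_.Name -like '*(*' } |
        Sort-Object -Unique -Property Name, Position
)

$unique | Format-Table -AutoSize
```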

powershell: Write specific rows from files to formatted csv

The following code gives me the correct output to console. But I would need it in a csv file:
$array = @{}
$files = Get-ChildItem "C:\Temp\Logs\*"
foreach($file in $files){
    foreach($row in (Get-Content $file | select -Last 2)){
        if($row -like "Total peak job memory used:*"){
            $sp_memory = $row.Split(" ")[5]
            $array.Add(($file.BaseName),([double]$sp_memory))
            break
        }
    }
}
$array.GetEnumerator() | sort Value -Descending |Format-Table -AutoSize
current output (console):
required output (csv):
In order to increase performance I would like to avoid the array and write output directly to csv (no append).
Thanks in advance!

Change your last line to this -
$array.GetEnumerator() | sort Value -Descending | select @{l='FileName'; e={$_.Name}}, @{l='Memory (MB)'; e={$_.Value }} | Export-Csv -path $env:USERPROFILE\Desktop\Output.csv -NoTypeInformation
This will give you a csv file named Output.csv on your desktop.
I am using Calculated properties to change the column headers to FileName and Memory (MB) and piping the output of $array to Export-Csv cmdlet.
Just to let you know, your variable $array is of type Hashtable which won't store duplicate keys. If you need to store duplicate key/value pairs, you can use arrays. Just suggesting! :)
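
If the goal is to skip the intermediate hashtable altogether, one possible sketch is to emit one object per file and pipe straight into Export-Csv. The function name Export-PeakMemoryCsv and its parameters are hypothetical; the column headers follow the answer above, and the line format is assumed to match the question (the number is the sixth space-separated token). Sorting still has to buffer all rows before writing, so this removes the hashtable but not the sort.

```powershell
# Hypothetical helper: writes FileName / Memory (MB) rows directly to a CSV.
function Export-PeakMemoryCsv {
    param([string]$LogDir, [string]$OutFile)

    Get-ChildItem -Path $LogDir -File | ForEach-Object {
        $file = $_
        # Only the last two lines are inspected, as in the question.
        $row = Get-Content -Path $file.FullName -Tail 2 |
            Where-Object { $_ -like 'Total peak job memory used:*' } |
            Select-Object -First 1
        if ($row) {
            # One output object per matching file; no lookup table needed.
            [pscustomobject]@{
                FileName      = $file.BaseName
                'Memory (MB)' = [double]$row.Split(' ')[5]
            }
        }
    } | Sort-Object -Property 'Memory (MB)' -Descending |
        Export-Csv -Path $OutFile -NoTypeInformation
}
```

A call might look like Export-PeakMemoryCsv -LogDir 'C:\Temp\Logs' -OutFile "$env:USERPROFILE\Desktop\Output.csv". Unlike the hashtable, this also keeps files that happen to share a BaseName.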

Parse line of text and match with parse of CSV

As a continuation of a script I'm running, working on the following.
I have a CSV file that has formatted information, example as follows:
File named Import.csv:
Name,email,x,y,z
\I\RS\T\Name1\c\x,email@jksjks,d,f
\I\RS\T\Name2\d\f,email@jsshjs,d,f
...
This file is large.
I also have another file called Note.txt.
Name1
Name2
Name3
...
I'm trying to read Import.csv and, for each line in Note.txt, check whether it matches any row in Import.csv. Every matching row should be appended to a new CSV, and this should be repeated for each row of the CSV.
I need to find the best way to do it without having it import the CSV multiple times, since it is large.
What I got does the opposite though, I think:
$Dir = PathToFile
$import = Import-Csv $Dir\import.csv
$NoteFile = "$Dir\Note.txt"
$Note = GC $NoteFile
$Name = (($Import.Name).Split("\"))[4]
foreach ($j in $import) {
    foreach ($i in $Note) {
        $j | where {$Name -eq "$i"} | Export-Csv "$Dir\Result.csv" -NoTypeInfo -Append
    }
}
This takes too long and I'm not getting the extraction I need.

"This takes too long and I'm not getting the extraction I need."
That's because you only assign $name once, outside of the outer foreach loop, so you're basically performing the same X comparisons for each line in the CSV.
I would rewrite the nested loops as a single Where-Object filter, using the -contains operator:
$Import |Where-Object {$Note -contains $_.Name.Split('\')[4]} |Export-Csv "$Dir\Result.csv" -NoTypeInformation -Append
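
For a large Note.txt, -contains scans the whole list for every CSV row. A possible variation, sketched here with made-up sample data, is to load the names into a HashSet once so each lookup is a hash probe. The variable names $wanted and $matched and the sample rows are hypothetical; the case-insensitive comparer is chosen to mimic the behaviour of -contains.

```powershell
# Sample data standing in for Import.csv and Note.txt (hypothetical values).
$import = @'
Name,email,x,y
\I\RS\T\Name1\c\x,a@example.com,d,f
\I\RS\T\Name2\d\f,b@example.com,d,f
\I\RS\T\Name9\d\f,c@example.com,d,f
'@ | ConvertFrom-Csv

$note = 'Name1', 'Name2', 'Name3'

# Build the lookup set once; OrdinalIgnoreCase mirrors -contains being case-insensitive.
$wanted = New-Object 'System.Collections.Generic.HashSet[string]' ([System.StringComparer]::OrdinalIgnoreCase)
foreach ($n in $note) { [void]$wanted.Add($n) }

# Keep rows whose fifth path segment is one of the wanted names.
$matched = @($import | Where-Object { $wanted.Contains($_.Name.Split('\')[4]) })

$matched | Format-Table -AutoSize
```

The filtered rows can then be piped to Export-Csv exactly as in the answer above.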

Group the imported data by your distinguishing feature, filter the groups by name, then expand the remaining groups and write the data to the output file:
Import-Csv "$Dir\import.csv" |
Group-Object { $_.Name.Split('\')[4] } |
Where-Object { $Note -contains $_.Name } |
Select-Object -Expand Group |
Export-Csv "$Dir\Result.csv" -NoType

Using Powershell to compare two files and then output only the different string names

I am a complete beginner at PowerShell, but I need to write a script that compares one file against another and tells me which strings in the first file differ from the second. I have had a go at this, but I am struggling with the output: my script currently only tells me on which line things differ, and it also seems to count empty lines.
For context: I would like to keep a static file of known-good Windows processes ($Authorized). My script should pull the list of currently running processes, keep only the process name column, match anything over 1 character, sort the names by unique value, and compare the result against $Authorized. Finally, it should output the differing process names found in $Processes, either to the ISE output pane or to a file.
I have spent today attempting the following in PowerShell ISE and Googling for solutions. I heard that 'fc' is a better choice than Compare-Object, but I could not get it to work. The script mostly works, but the final step seems to compare the two files line by line. That will always give me false positives, because the line position of a process name in the file changes. Furthermore, I only want to see the differing process names, not the line numbers it currently reports ("The process at line 34 is an outlier").
I hope this makes sense, and any help on this would be very much appreciated.
Get-Process | Format-Table -Wrap -Autosize -Property ProcessName | Out-File c:\users\me\Desktop\Processes.txt
$Processes = 'c:\Users\me\Desktop\Processes.txt'
$Output_file = 'c:\Users\me\Desktop\Extracted.txt'
$Sorted = 'c:\Users\me\Desktop\Sorted.txt'
$Authorized = 'c:\Users\me\Desktop\Authorized.txt'
$regex = '.{1,}'
select-string -Path $Processes -Pattern $regex |% { $_.Matches } |% { $_.Value } > $Output_file
Get-Content $Output_file | Sort-Object -Unique > $Sorted
$dif = Compare-Object -ReferenceObject $(Get-Content $Sorted) -DifferenceObject $(get-content $Authorized) -IncludeEqual
$lineNumber = 1
foreach ($difference in $dif)
{
    if ($difference.SideIndicator -ne "==")
    {
        Write-Output "The Process at Line $linenumber is an Outlier"
    }
    $lineNumber ++
}
Remove-Item c:\Users\me\Desktop\Processes.txt
Remove-Item c:\Users\me\Desktop\Extracted.txt
Write-Output "The Results are Stored in $Sorted"

From the length and complexity of your script, I feel like I'm missing something, but your description seems clear.
Running process names:
$ProcessNames = @(Get-Process | Select-Object -ExpandProperty Name)
.. which aren't blank:
$ProcessNames = $ProcessNames | Where-Object {$_ -ne ''}
List of authorised names from a file:
$AuthorizedNames = Get-Content 'c:\Users\me\Desktop\Authorized.txt'
Compare:
$UnAuthorizedNames = $ProcessNames | Where-Object { $_ -notin $AuthorizedNames }
optional output to file:
$UnAuthorizedNames | Set-Content out.txt
or in the shell:
@(gps).Name -ne '' |? { $_ -notin (gc authorized.txt) } | sc out.txt
1 2   3     4       5      6       7                      8
1. @() forces something to be an array, even if it only returns one thing
2. gps is a default alias of Get-Process
3. using .Property on an array takes that property value from every item in the array
4. using an operator on an array filters the array by whether the items pass the test
5. ? is an alias of Where-Object
6. -notin tests if one item is not in a collection
7. gc is an alias of Get-Content
8. sc is an alias of Set-Content
You should use Set-Content instead of Out-File and > because it handles character encoding nicely, and they don't. And because Get-Content/Set-Content sounds like a memorable matched pair, and Get-Content/Out-File doesn't.

Using powershell to transform CSV file

I have CSV files which have a lot of columns. I need to transform several columns; for example, some date columns contain the text "Missing", and I want to replace "Missing" with an empty string.
The following code may work, but it will be long since there are a lot of columns. Is there a better way to write it?
Import-Csv $file |
    select @(
        @{l="xxx"; e={ ....}},
        # repeat many times for each column....
    ) | Export-Csv

You could use an imperative style rather than a pipelined style:
$records = Import-Csv $file
foreach ($record in $records)
{
    if ($record.Date -eq 'Missing')
    {
        $record.Date = ''
    }
}
$records | Export-Csv $file
Edit: To use a pipelined style, you could do it like this:
import-csv $file |
    select -ExcludeProperty Name1,Name2 -Property *,@{n='Name1'; e={"..."}},@{n='Name2'; e={'...'}}
The * is a wildcard that matches all properties. I couldn't find a way to format this code in a nicer way, so it is kind of ugly looking.
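
When many columns need the same fix, the imperative style can loop over a list of column names instead of repeating the if block per column. The sketch below uses made-up sample rows and a hypothetical $dateColumns list; with a real file, $records would come from Import-Csv.

```powershell
# Sample rows standing in for Import-Csv output (hypothetical column names).
$records = @'
Id,StartDate,EndDate
1,2020-01-01,Missing
2,Missing,2020-03-01
'@ | ConvertFrom-Csv

# Hypothetical list of the columns to clean.
$dateColumns = 'StartDate', 'EndDate'

foreach ($record in $records) {
    foreach ($name in $dateColumns) {
        # $record.$name reads or writes the property whose name is stored in $name.
        if ($record.$name -eq 'Missing') { $record.$name = '' }
    }
}

$records | Format-Table -AutoSize
```

The cleaned $records can then be piped to Export-Csv as in the first example of this answer.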

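If all you want to do is a find-and-replace, you don't really need to read the file as a CSV.
You could do this instead (the parentheses make Get-Content read the whole file before Out-File opens it for writing; without them the file can end up empty or locked):
(Get-Content $file) | %{$_.ToString().Replace("Missing", "")} | Out-File $file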
