CSV splitting causes errors - PowerShell

Could you please help me with the issue described below?
I wrote a script in PowerShell which splits a large CSV file (30,000 rows / 6 MB) into smaller ones. The new files are named from a mix of the first and second columns' content. If a file already exists, the script only appends new lines.
Main CSV file example:
Site;OS.Type;Hostname;IP address
Amsterdam;Server;AMS_SRVDEV01;10.10.10.12
Warsaw;Workstation;WAR-L4D6;10.10.20.22
Ankara;Workstation;AN-D5G36;10.10.13.22
Warsaw;Workstation;WAR-SRVTST02;10.10.20.33
Amsterdam;Server;LON-SRV545;10.10.10.244
PowerShell Version: 5.1.17134.858
function Csv-Splitter {
    $fileName = Read-Host "Pass file name to process: "
    $FileToProcess = Import-Csv "$fileName.csv" -Delimiter ';'
    $MyList = New-Object System.Collections.Generic.List[string]

    foreach ($row in $FileToProcess) {
        if ("$($row.'OS.Type')-$($row.Site)" -notin $MyList) {
            $MyList.Add("$($row.'OS.Type')-$($row.Site)")
            $row | Export-Csv -Delimiter ";" -Append -NoTypeInformation "$($row.'OS.Type')-$($row.Site).csv"
        }
        else {
            $row | Export-Csv -Delimiter ";" -Append -NoTypeInformation "$($row.'OS.Type')-$($row.Site).csv"
        }
    }
}
Basically, the code works fine; however, it generates some errors from time to time as it processes the loop. This causes some rows to be missing from the new files - the number of missing rows equals the number of errors:
Export-Csv : The process cannot access the file 'C:\xxx\xxx\xxx.csv' because
it is being used by another process.

Export-Csv is synchronous - by the time it returns, the output file has already been closed - so the code in the question does not explain the problem.
As you've since confirmed in a comment, based on a suggestion by Lee_Dailey, the culprit was the AV (anti-virus) McAfee On-Access Scan module, which accessed each newly created file behind the scenes, thereby locking it temporarily and causing Export-Csv to fail intermittently.
The problem should go away if all output files can be fully created with a single Export-Csv call each, after the loop, as also suggested by Lee. This is preferable for performance anyway, but assumes that the entire CSV file fits into memory as a whole.
Here's a Group-Object-based solution that uses a single pipeline to implement write-each-output-file-in-full functionality:
function Csv-Splitter {
    $fileName = Read-Host "Pass file name to process: "
    Import-Csv "$fileName.csv" -Delimiter ';' |
        Group-Object { $_.'OS.Type' + '-' + $_.Site + '.csv' } |
        ForEach-Object { $_.Group | Export-Csv -Delimiter ';' -NoTypeInformation $_.Name }
}
Your own answer shows alternative solutions that eliminate interference from the AV software.

The source of the issue was the McAfee On-Access Scan module, which was scanning every newly created file. There are three ways to bypass the problem:
a) temporarily disable the whole AV / OAS module.
b) add powershell.exe to the OAS policies as a Low Risk process.
c) collect all data in memory and create all files with Export-Csv as a last step, as shown in the other answer.

Related

Get-Content Measure-Object Command : Additional rows are added to the actual row count

This is my first post here - my apologies in advance if I didn't follow a certain etiquette for posting. I'm a newbie to PowerShell, but I'm hoping someone can help me figure something out.
I'm using the following PowerShell script to tell me the total count of rows in a CSV file, minus the header. This is generated into a text file.
$x = (Get-Content -Path "C:\mysql\out_data\18*.csv" | Measure-Object -Line).Lines
$logfile = "C:\temp\MyLog.txt"
$files = Get-ChildItem "C:\mysql\out_data\18*.csv"
foreach ($file in $files)
{
    $x--
    "File: $($file.name) Count: $x" | Out-File $logfile -Append
}
I am doing this for 10 individual files. But there is just ONE file that keeps adding exactly 807 more rows to the actual count. For example, for the code above, the actual row count (minus the header) in the file is 25,083. But my script above generates 25,890 as the count. I've tried running this for different iterations of the same type of file (same data, different days), but it keeps adding exactly 807 to the row count.
Even when running only (Get-Content -Path "C:\mysql\out_data\18*.csv" | Measure-Object -Line).Lines, I still see the wrong record count in the PowerShell window.
I'm suspicious that there may be a problem with the CSV file itself; I'm coming to that conclusion since 9 out of 10 files generate the correct row count. Thank you in advance for your time.
To measure the items in a csv you should use Import-Csv rather than Get-Content. This way you don't have to worry about headers or empty lines.
(Import-Csv -Path $csvfile | Measure-Object).Count
It's definitely possible there's a problem with that CSV file. Also, note that if the CSV has cells that include line breaks, that will confuse Get-Content, so also try Import-Csv.
I'd start with this:
$PathToQuestionableFile = "c:\somefile.csv"
$TestContents = Get-Content -Path $PathToQuestionableFile
Write-Host "`n-------`nUsing Get-Content:"
$TestContents.count
$TestContents[0..10]
$TestCsv = Import-CSV -Path $PathToQuestionableFile
Write-Host "`n-------`nUsing Import-CSV:"
$TestCsv.count
$TestCsv[0..10] | Format-Table
That will let you see what Get-Content is pulling so you can narrow down where the problem is.
If it is in the file itself and using Import-Csv doesn't fix it, I'd try using Notepad++ to check both the encoding and the line endings:
The encoding is a drop-down menu; compare it to the other CSV files.
Line endings can be seen with View > Show Symbol > Show All Characters. They should be consistent across the file, and should be one of these:
CR (typically if it came from a mac)
LF (typically if it came from *nix or the internet)
CRLF (typically if it came from windows)
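If you'd rather check from PowerShell than Notepad++, here is a minimal sketch that counts each line-ending style, reusing the $PathToQuestionableFile placeholder from above:
# Read the raw file contents and count each line-ending style
$raw  = Get-Content -Path $PathToQuestionableFile -Raw
$crlf = [regex]::Matches($raw, "`r`n").Count
$lf   = [regex]::Matches($raw, "(?<!`r)`n").Count   # bare LF
$cr   = [regex]::Matches($raw, "`r(?!`n)").Count    # bare CR
Write-Host "CRLF: $crlf  LF-only: $lf  CR-only: $cr"
A clean file should report a non-zero count for exactly one of the three styles.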

How to process Custom log data using Powershell?

I have a log file which has data separated with the "|" symbol, like:
"Username|servername|access|password|group"
"Username|servername|access|password|group"
I need to validate the data, and if the group column is missing information or is empty, I need to write only that row into another file. Please help me. Thanks in advance.
If you're just checking for missing data, you can run a quick check using a regex of '(\S+\|){4}\S+'. Use Get-Content with the -ReadCount parameter and you can work in batches of a few thousand records at a time, minimizing disk I/O and memory usage without going through them one record at a time.
Get-Content $inputfile -ReadCount 2000 |
    ForEach-Object {
        $_ -notmatch '(\S+\|){4}\S+' | Add-Content $outputfile
    }
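Note that with -ReadCount each $_ is an array of lines, so -notmatch acts as a filter and returns the lines that fail the pattern - i.e. the incomplete records - and those are what get appended to $outputfile.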
You could use Import-Csv with -Delimiter '|'. If your file doesn't have a header line, you would also need to use -Header to define it. You could then use Where-Object to filter for the empty Group lines, and Export-Csv with -Delimiter again to create a new file of just those lines.
For example:
Import-Csv 'YourLog.log' -Delimiter '|' -Header 'Username','Servername','Access','Password','Group' |
    Where-Object { $_.Group -eq '' } |
    Export-Csv 'EmptyGroupLines.log' -Delimiter '|' -NoTypeInformation
If your group column is always in the same place, which it looks like it is, you could use the Split method. You can certainly neaten the code up; I have used the below as an example of how you could use Split.
The foreach statement iterates through each line in your file.
if (!$item.Split('|')[4]) checks whether that field is null or empty.
# The fifth field (index 4) is the group column:
$groupstring = 'Username|servername|access|password|group'
$groupstring.Split('|')[4]

# $collection holds the lines of your log file, e.g.:
# $collection = Get-Content 'YourLog.log'
foreach ($item in $collection)
{
    if (!$item.Split('|')[4])
    {
        Write-Host "variable is null"
    }
}
Hope this helps.
Thanks, Tim.

Using duplicate headers in Powershell .csv file

I have a .csv file and I want to import it into PowerShell, then iterate through the file changing certain values. I then want the output to append to the original .csv file, so that the values have been updated.
My issue is that the .csv file has headers which aren't unique, and they can't be changed, as then it won't work in another program. Originally I defined my own headers in PowerShell to get around this, but then the output file has these new headers when it needs to have the old ones.
I have also tried ConvertFrom-Csv, which means I can no longer access the columns I need, resulting in lots of runtime errors.
What would be ideal is to be able to use the defined column headers and then convert back to the original column headers. My current code is below:
$csvfile = Import-Csv C:\test.csv | Where-Object {$_.'3' -eq $classID} | ConvertFrom-Csv
foreach ($record in $csvfile) {
    *do something*
}
$csvfile | Export-Csv -Path C:\test.csv -NoTypeInformation -Append
I've searched the web now for some hours and tried everything I've come across, to no avail.
Thanks in advance.
This is a somewhat hackish implementation but should work:
1. Remove the header line and save it somewhere.
2. Parse the new result set (with the header removed).
3. Add the line back at the top when you are finished.
A CSV is a comma-delimited file; you don't have to treat it like structured data. Feel free to slice and dice as you want - a rough sketch follows below.
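For example, a minimal sketch of that approach (the file name is a placeholder):
$lines  = Get-Content 'C:\test.csv'
$header = $lines[0]                       # save the original header line
$body   = $lines[1..($lines.Count - 1)]   # the data rows, header removed
# ... manipulate $body as plain text here ...
Set-Content 'C:\test.csv' -Value (,$header + $body)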
Since you know beforehand how many columns are in the input CSV file, you can import without the header and process internally. Example:
$columns = 78
Import-Csv "inputfile.csv" -Header (0..($columns - 1)) | Select-Object -Skip 1 | ForEach-Object {
    $row = $_
    $outputObject = New-Object PSObject
    0..($columns - 1) | ForEach-Object {
        $outputObject | Add-Member NoteProperty "Col$_" $row.$_
    }
    $outputObject
} | Export-Csv "outputfile.csv" -NoTypeInformation
This example generates new PSObjects and then outputs a new CSV file with generic column names (Col0, Col1, etc.).
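Since your output needs to keep the original (duplicate) headers, one possible follow-up - a sketch, assuming the first line of the input file is the header row - is to swap the generated header line back out:
$originalHeader = Get-Content "inputfile.csv" -TotalCount 1        # the original duplicate headers
$newBody        = Get-Content "outputfile.csv" | Select-Object -Skip 1
Set-Content "outputfile.csv" -Value (,$originalHeader + $newBody)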

Replace lines with specific string and save with the same name

I'm working with an application that creates a log file. Due to an error in the software itself, it keeps producing three errors I'm not interested in. Each line has a unique identifier so I can't just replace the line since each one is different.
I have two main issues with this: I need to save the file with the same name, and while the script runs the file should remain available (in case the logger needs to write something).
I can't hard-code the original app to prevent it from writing that part of the log.
I have tried so far:
Get-Content log.log | Where-Object { $_ -notmatch 'ERROR1' -and $_ -notmatch 'ERROR2' -and $_ -notmatch 'ERROR3' } | Set-Content log_stripped.log
^ It only works if the output file has a different name.
Get-Content error.log | ForEach-Object { Where-Object { $_ -notmatch 'ERROR1' -and $_ -notmatch 'ERROR2' -and $_ -notmatch 'ERROR3' } } | Set-Content error.log
^ This one froze my PS session.
I also tried reading the file to a variable:
$logcontent = ${h:error.log}
but I got System.OutOfMemoryException.
Ideally, what I need is something that reads the log file, takes away all the lines I don't want, and then saves it with its original name.
Ideas? (Keep in mind that the log file is roughly 900 MB with the unnecessary data and 45 MB once I strip the data with the first method - but I need it to save the file with its original name.)
You can't save the file back to the same name while you're still reading from it, which means you'd have to read the whole 900 MB into memory before you start writing. Not a good idea.
Try this:
Remove-Item log_stripped.log
Get-Content log.log -ReadCount 1000 |
    ForEach-Object { $_ -notmatch 'ERROR1|ERROR2|ERROR3' | Add-Content log_stripped.log }
Remove-Item log.log
Rename-Item log_stripped.log log.log
I know you said you want to save to the same filename, but if the reason you want that is that you want the log to be continuously updated, then you could do the following:
Get-Content -Wait log.log |
    Where-Object { $_ -notmatch 'ERROR1|ERROR2|ERROR3' } |
    Out-File log_stripped.log
Note the -Wait on the Get-Content.
log_stripped.log will be continuously updated as log.log is updated.
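Keep in mind that Get-Content -Wait keeps the pipeline running indefinitely (it polls the file for new content about once a second until you stop it), so this is effectively a small filtering tail that you leave running alongside the application.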

I need to hash (obfuscate) a column of data in a CSV file. Script preferred

I have a pipe-delimited text file with a header row. (I said CSV in the question to make it a bit more immediately understandable; I imagine most solutions would be applicable to either format.)
The file looks like this:
COLUMN1|COLUMN2|COLUMN3|COLUMN4|...|
Field1|Field2|Field3|Field4|...|
...
I need to obscure the data in (for example) columns 3 and 9, without affecting any of the other entries in the file.
I want to do this using a hashing algorithm like SHA1 or MD5, so that the same strings will resolve to the same hash values anywhere they are encountered.
EDIT - Why I want to do this
I need to send some data to a third party, and certain columns contain sensitive information (e.g. customer names). I need the file to be complete, and where a string is replaced, I need it to be done in the same way every time it is encountered (so that any mapping or grouping remains). It does not need military-grade encryption, just to be difficult to reverse. As I need to do this intermittently, a scripted solution would be ideal.
/EDIT
What is the easiest way to achieve this using a command line tool or script?
By preference, I would like a batch script or PowerShell script, since that does not require any additional software to achieve...
Try
(Import-Csv .\my.csv -Delimiter '|') | ForEach-Object {
    $_.COLUMN3 = $_.COLUMN3.GetHashCode()
    $_.COLUMN9 = $_.COLUMN9.GetHashCode()
    $_
} | Export-Csv .\myobfuscated.csv -NoTypeInformation -Delimiter '|'
Note that GetHashCode() is not a cryptographic hash and is not guaranteed to produce the same value across .NET versions or processes, so for a repeatable mapping the MD5 approach below is safer.
# Reusable MD5 hasher and UTF-8 encoder
$md5  = New-Object -TypeName Security.Cryptography.MD5CryptoServiceProvider
$utf8 = New-Object -TypeName Text.UTF8Encoding

Import-Csv original.csv -Delimiter '|' |
    ForEach-Object {
        # Replace the sensitive columns with the hex MD5 of their contents
        $_.Column3 = [BitConverter]::ToString($md5.ComputeHash($utf8.GetBytes($_.Column3)))
        $_.Column9 = [BitConverter]::ToString($md5.ComputeHash($utf8.GetBytes($_.Column9)))
        $_
    } |
    Export-Csv encrypted.csv -Delimiter '|' -NoTypeInformation
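If you'd rather use SHA1 (which the question also mentions), the same pattern works - a sketch, just swapping the provider:
$sha1 = New-Object -TypeName Security.Cryptography.SHA1CryptoServiceProvider
# ...then call $sha1.ComputeHash(...) in place of $md5.ComputeHash(...)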