I have a large number of .txt files pulled from pdf and formatted with comma delimiters.
I'm trying to append these text files to one another with a new line between each. Earlier in the formatting process I took multi-line input and formatted it into one line with entries separated by commas.
Yet when I append one txt file to another into a csv, the original multi-line formatting returns. So my final output is valid csv, but each text file no longer maps to one line of csv entries. How can I ensure the transition from txt to csv retains the formatting of the txt files?
I've used Export-CSV, Add-Content, and the >> operator with similar outcomes.
To summarize, individual .txt files with the following format:
,927,Dance like Misty"," shine like Lupita"," slay like Serena. speak like Viola"," fight like Rosa! ,United States ,16 - 65+
Turn into the following when appended together in a csv file:
,927
,Dance like Misty"," shine like Lupita"," slay like Serena. speak like Viola"," fight like Rosa!
,United States
,16 - 65+
How the data was prepped:
Removing new lines
Foreach($f in $FILES){(Get-Content $f -Raw).Replace("`n","") | Set-Content $f -Force}
Adding one new line to the end of each txt file
foreach($f in $FILES){Add-Content -Path $f -Value "`n"}
Trying to convert to CSV, one text file per line, with comma delimiter:
cat $FILES | sc csv.csv
Or
foreach($f in $FILES){Import-Csv $f -Delimiter "," | Export-Csv $f}
Or
foreach($f in $FILES){ Export-Csv -InputObject $f -Append -Path "test.csv"}
All of these return a csv with each comma-separated value on a new line, instead of each txt file as one line.
This was resolved by realizing that even though Notepad was showing no newlines, there were still hidden carriage return characters. Loading the apparently one-line csv files into Notepad++ and toggling "show hidden characters" made this oversight evident.
By replacing both CR (`r) and LF (`n) characters before converting to CSV,
foreach($f in $FILES){(Get-Content $f -Raw).Replace("`r","").Replace("`n","") | Set-Content $f -Force}
The CSV conversion process worked as planned using the following:
cat $FILES | sc final.csv
Final verdict --
The text file that appeared to be a one line entry ready to become CSV
,927,Dance like Misty"," shine like Lupita"," slay like Serena. speak like Viola"," fight like Rosa! ,United States ,16 - 65+
Still had carriage return characters between each value. This was made evident by trying another text editor with the "show hidden characters" feature.
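If you want to check for lingering hidden characters from PowerShell itself, rather than a second text editor, here is a quick sketch (the file name is a placeholder):
# Tag each carriage return and line feed so they become visible in the console:
(Get-Content .\suspect.txt -Raw) -replace "`r", '[CR]' -replace "`n", "[LF]`n"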
Related
I found a nifty command here - http://www.stackoverflow.com/questions/27892957/merging-multiple-csv-files-into-one-using-powershell that I am using to merge CSV files -
Get-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName | Import-Csv | Export-Csv .\merged\merged.csv -NoTypeInformation -Append
Now this does what it says on the tin and works great for the most part. I have 2 issues with it, however, and I am wondering if there is a way they can be overcome:
Firstly, the merged csv file has CRLF line endings, and I am wondering how I can make the line endings just LF, as the file is being generated?
Also, it looks like there are some shenanigans with quote marks being added/moved around. As an example:
Sample row from initial CSV:
"2021-10-05"|"00:00"|"1212"|"160477"|"1.00"|"3.49"LF
Same row in the merged CSV:
"2021-10-05|""00:00""|""1212""|""160477""|""1.00""|""3.49"""CRLF
So you can see that the first value has lost its closing quote, the other fields have doubled quotes, and the end of the row has an additional quote. I'm not quite sure what is going on here, so any help would be much appreciated!
For dealing with the quotes, the cause of the “problem” is that your CSV does not use the default field delimiter that Import-CSV assumes - the C in CSV stands for comma, and you’re using the vertical bar. Add the parameter -Delimiter "|" to both the Import-CSV and Export-CSV cmdlets.
I don’t think you can do anything about the line-end characters (CRLF vs LF); that’s almost certainly operating-system dependent.
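Applied to the merge pipeline from the question, that fix would look something like this (a sketch, keeping the original paths):
Get-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName |
    Import-Csv -Delimiter '|' |
    Export-Csv .\merged\merged.csv -Delimiter '|' -NoTypeInformation -Append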
Jeff Zeitlin's helpful answer explains the quote-related part of your problem well.
As for your line-ending problem:
As of PowerShell 7.2, there are no PowerShell-native features that allow you to control the newline format of file-writing cmdlets such as Export-Csv.
However, if you use plain-text processing, you can use multi-line strings built with the newline format of interest and save / append them with Set-Content and its -NoNewLine switch, which writes the input strings as-is, without a (newline) separator.
In fact, to significantly speed up processing in your case, plain-text handling is preferable, since in essence your operation amounts to concatenating text files, the only twist being that the header lines of all but the first file should be skipped; using plain-text handling also bypasses your quote problem:
$tokenCount = 1
Get-ChildItem -Filter *.csv |
  Get-Content -Raw |
  ForEach-Object {
    # Get the file content and replace CRLF with LF.
    # Include the first line (the header) only for the first file.
    $content = ($_ -split '\r?\n', $tokenCount)[-1].Replace("`r`n", "`n")
    $tokenCount = 2 # Subsequent files should have their header ignored.
    # Make sure that each file's content ends in a LF.
    if (-not $content.EndsWith("`n")) { $content += "`n" }
    # Output the modified content.
    $content
  } |
  Set-Content -NoNewLine ./merged/merged.csv # add -Encoding as needed.
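To confirm that the merged file really ended up with LF-only newlines, you can count the remaining CR characters; this quick sketch should print 0 on success:
# Splitting on CR yields one more piece than there are CR characters:
((Get-Content ./merged/merged.csv -Raw) -split "`r").Count - 1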
Related
I have folder info for all user folders. It is dumped out to a CSV file as follows:
Servername, F:\Users\user, 9,355.7602 MB, 264, 3054, 03/15/2000 13:28:48, 12/10/2018 11:58:29
We are unable to work with the data as is due to the thousands separator in the 3rd column. I could run the report scripts again, but we have a lot of file servers and a large number of users on one in particular, so running it again would be very time consuming. The commas are there because the data was written as a string, not a number.
I can import and convert; the only problem is that any number over 1,000 gets split across columns, and all the data after it is one column off. I would like to remove any comma between two digits. It doesn't seem like it would be that hard to do with PowerShell, but I am not having any luck finding anything.
If you assume that columns of data are comma plus space separated and your numbers have no spaces, you can use the -replace operator for this.
$line = 'Servername, F:\Users\user, 9,355.7602 MB, 264, 3054, 03/15/2000 13:28:48, 12/10/2018 11:58:29'
$line -replace '(?<=\d),(?=\d)'
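This outputs the line with the thousands separator removed, while the field-separating commas (comma plus space) are left alone:
Servername, F:\Users\user, 9355.7602 MB, 264, 3054, 03/15/2000 13:28:48, 12/10/2018 11:58:29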
If you are reading the data from a file, you can read the data with Get-Content, replace your data, and update the file with Set-Content.
(Get-Content file.csv) -replace '(?<=\d),(?=\d)' | Set-Content file.csv
If the file is large, you can utilize the faster switch statement.
$data = switch -regex -file file.csv {
    '(?<=\d),(?=\d)' { $_ -replace '(?<=\d),(?=\d)' }
    default { $_ }
}
$data | Set-Content file.csv
Explanation:
(?<=\d) uses a positive lookbehind assertion (?<=) that matches a single digit \d.
(?=\d) uses a positive lookahead assertion (?=) that matches a single digit. You could replace this with (?=\d{3}) to match 3 consecutive digits after the comma, as shown in the sketch after this list.
Since you want to replace the target comma with empty string, you do not need a replacement string.
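As a sketch, the stricter variant suggested above only strips a comma that looks like a thousands separator:
# Only remove a comma that has a digit before it and 3 consecutive digits after it:
$line -replace '(?<=\d),(?=\d{3})'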
Typically, it would be best to stick with commands that work with CSV data or files. However, if your data contains commas and you aren't qualifying your text, it may be difficult to distinguish between data and delimiters. If you have a clear way of making that distinction, you are better off using ConvertFrom-Csv for already read data or Import-Csv for files. You will need to define headers either in the files or in the command.
EDIT
It was my oversight that the , in the dataset is not quoted, which causes this answer to not work as expected, since the comma is seen as a column separator when parsing the CSV. I'm going to leave it, as it does explain how to generally manipulate the data as you'd expect if the column data were escaped properly. However, @AdminOfThings' answer below should work for your specific case here, and will fix the erroneous column without relying on parsing the content as CSV first.
Import the data using Import-Csv, then remove any , in the third column. This assumes that you have no values where , is the decimal separator:
If you have headers in the CSV, you won't need to define header names or get fancy with writing the CSV back out:
# Read the file fully (note the parentheses) so it can safely be overwritten below.
(Import-Csv -Path \path\to\file.csv) | Foreach-Object {
    $_.ColumnName = $_.ColumnName -replace ','
    $_  # emit the modified row so it reaches Export-Csv
} | Export-Csv -NoTypeInformation -Path \path\to\file.csv
The way this works is that we import the CSV as an operable PSCustomObject, then for each line we take whatever the column name with the size is and remove the , from it. Finally, we export the modified PSCustomObject back out to the original CSV.
If you don't have headers, it gets a little trickier since we have to define temporary headers, but Export-Csv doesn't have an option to skip writing out headers:
(Import-Csv -Path \path\to\file.csv -Header Col1, Col2, Col3, Col4, Col5, Col6, Col7) |
    Foreach-Object {
        $_.Col3 = $_.Col3 -replace ','
        $_  # emit the modified row
    } | ConvertTo-Csv -NoTypeInformation | Select-Object -Skip 1 |
    Set-Content -Path \path\to\file.csv
This does the same thing as the first block of code, but since we don't want to export the temporary headers, we have to get creative. First, note we reference the target column with the temporary header name. Instead of piping the modified CSV object right to Export-Csv, first we want to convert the object to CSV using ConvertTo-Csv. We then use Select-Object to skip the first line of the converted CSV text, which is the header, so we just have the row data and column values. Finally, we use Set-Content to write the CSV text without the header back to the original file.
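As a quick illustration of why the Select-Object -Skip 1 works, ConvertTo-Csv -NoTypeInformation emits the header as its first output line:
PS> [pscustomobject]@{ Col3 = '9355.7602 MB' } | ConvertTo-Csv -NoTypeInformation
"Col3"
"9355.7602 MB"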
Related
I am new to PowerShell scripting and I am looking for a way to add 2 new rows at the top of an existing csv file.
Things I have tried include replacing the header and rows with the new rows.
I am looking for a way to add 2 new rows above the header in CSV.
You mention that you want to add the new lines above the header, which means that no CSV-specific processing is needed - it sounds like you're asking how to prepend lines to an existing text file (which happens to contain CSV - note that the resulting file will no longer be a valid CSV file).
E.g., assuming a target file named some.csv:
Note: Best to make a backup of the target file before trying these commands.
If the input file is small enough to fit into memory as a whole:
Reading the entire target file into memory as a single string with Get-Content -Raw allows for a convenient and concise solution:
Set-Content -LiteralPath some.csv -NoNewLine -Value (
@'
New line 1 above header
New line 2 above header

'@ + (Get-Content -Raw some.csv)
)
Note that Set-Content applies a default character encoding (the active ANSI code page in Windows PowerShell, UTF-8 without BOM in PowerShell Core), irrespective of the current encoding of some.csv, so you may have to use the -Encoding parameter to specify the encoding explicitly.
Also note that the single-quoted here-string (@'<newline>...<newline>'@) uses the same newline style (CRLF (Windows-style) vs. LF (Unix-style)) as the enclosing script, which may not match the style used in some.csv - though PowerShell itself has no problem processing files with mixed newline styles.
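If the prefix must match the newline style of some.csv exactly, you can build it with explicit escape sequences instead of a here-string; a sketch using LF (use `r`n for CRLF):
$prefix = "New line 1 above header`nNew line 2 above header`n"
Set-Content -LiteralPath some.csv -NoNewLine -Value ($prefix + (Get-Content -Raw some.csv))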
If the file is too large to fit into memory, use a streaming (line-by-line) approach:
$ErrorActionPreference = 'Stop'
# Create a temporary file and fill it with the 2 new lines.
$tempFile = [IO.Path]::GetTempFileName()
'New line 1 above header', 'New line 2 above header' | Set-Content $tempFile
# Then append the CSV file's lines one by one.
Get-Content some.csv | Add-Content $tempFile
# If that succeeded, replace the original file.
Move-Item -Force $tempFile some.csv
Note: Use of the Get-Content, Set-Content and Add-Content cmdlets is convenient, but slow; the next section shows a faster alternative.
If performance matters, use .NET types such as [IO.File] instead:
$ErrorActionPreference = 'Stop'
# Create a temporary file...
$tempFile = [IO.Path]::GetTempFileName()
# ... and fill it with the 2 new lines.
$streamWriter = [IO.File]::CreateText($tempFile)
foreach ($lineToPrepend in 'New line 1 above header', 'New line 2 above header') {
  $streamWriter.WriteLine($lineToPrepend)
}
# Then append the CSV file's lines one by one.
foreach ($csvLine in [IO.File]::ReadLines((Convert-Path some.csv))) {
  $streamWriter.WriteLine($csvLine)
}
$streamWriter.Dispose()
# If that succeeded, replace the original file.
Move-Item -Force $tempFile some.csv
Related
This question is similar to the earlier question "How can I replace every occurrence of a String in a file with PowerShell?", except my challenge is to replace the text in multiple files. I tried the solution from the earlier question and used a command similar to the one below.
(Get-Content .\*.txt).replace("old text", "new text") | Set-Content .\*.txt
It seemed to work, but each file's size increased drastically, to roughly the combined size of all the files in the directory. When I open any file, though, it looks normal.
Does anyone have ideas on how to fix this? My litmus test: if I revert the text changes, the file sizes shouldn't change at all.
You must process the files one at a time:
Get-Item *.txt |
  ForEach-Object {
    $f = $_.FullName
    (Get-Content $f).Replace("old text", "new text") | Set-Content $f
  }
Note that this will fail with completely empty (zero-byte) files.
Also, irrespective of what the encoding of the input files was, the output files will have Default encoding, according to the system's legacy code page (typically, a single-byte, extended-ASCII encoding).
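If the original encoding matters, pass -Encoding explicitly when writing back; a sketch that assumes UTF-8 output is acceptable:
Get-Item *.txt |
  ForEach-Object {
    $f = $_.FullName
    # Write back with an explicit encoding instead of the legacy default:
    (Get-Content $f).Replace("old text", "new text") | Set-Content $f -Encoding UTF8
  }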
As for what you tried:
(Get-Content .\*.txt) sends the lines from all *.txt files as a single array of lines through the pipeline.
Set-Content .\*.txt then sends that one array (with replacements made) as a whole to every *.txt file in the current directory.
Related
I have 3 .txt files that each need to be converted into .csv files. Each file has 12 columns and some of these columns have data with leading zeroes. These zeroes need to remain. Is there a way through PowerShell to write a loop that will export each of these to a .csv and maintain the leading zeros?
The closest thing I could do was to export them one at a time, but this doesn't maintain the leading zeros that I need.
Import-Csv C:\AcctsLog.txt -Delimiter ";" | Export-Csv C:\AcctsLog.csv
A sample line would be something like:
Joe Smith;1933 Test Lane;Apt 34;Los Angeles;CA;90003-3444;0000000023;0002;New Car;SmithJoe@yahoo.com;00934200034006700213;0000666666
See if this works with your data:
Import-Csv C:\AcctsLog.txt -Delimiter ';' -Header (1..12) |
ConvertTo-Csv -NoTypeInformation | select -Skip 1 |
Set-Content C:\AcctsLog.csv
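For the sample line above, that should produce one fully quoted CSV row per input line, with the leading zeros intact:
"Joe Smith","1933 Test Lane","Apt 34","Los Angeles","CA","90003-3444","0000000023","0002","New Car","SmithJoe@yahoo.com","00934200034006700213","0000666666"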
If you explicitly want Excel to keep the leading 0's, you would have to save the data as an Excel file (otherwise Excel strips leading zeros off values that it interprets as numbers when opening a CSV). You could paste the data into Excel after formatting the cells as Text, then save the files as Excel files. But if you want CSV files, then go with mjolinor's answer, since it produces CSV files with the leading zeros, exactly like you asked for.
To work with Excel you have to create an Excel ComObject. Then you can get the content of your file, replace the semicolons with tabs, pipe to Clip, and paste right into Excel (after creating a workbook and formatting the 12 columns that you need). Should be pretty simple:
$Excel = New-Object -ComObject Excel.Application
$Excel.Visible = $true
$FileList = @("C:\Temp\AcctsLog.txt","C:\Temp\SecondFile.txt","C:\Temp\ThirdFile.txt")
ForEach($File in $FileList){
    [void]$Excel.Workbooks.Add()
    # '@' is Excel's Text format; it keeps leading zeros intact.
    $Excel.ActiveSheet.Range("A:L").NumberFormat = '@'
    (Get-Content $File) -replace ';', "`t" | Clip
    $Excel.ActiveSheet.Paste()
    $Excel.ActiveWorkbook.SaveAs(($File -replace "txt$","xlsx"))
    $Excel.ActiveWorkbook.Close($false)
}
$Excel.Quit()
There is a simple way to maintain the leading zeroes in Excel.
Simply type an apostrophe in the cell before whatever value you need, and the zeroes will be retained.
For example: if I want 0000000023, I type '0000000023 into the cell.
The ' symbol retains the zeroes as long as you type it before the value.