Count characters in string then insert delimiter using PowerShell - powershell

I have a Linux server that generates several files throughout the day that need to be inserted into a database; using PuTTY I can sftp them to a server running SQL Server 2008. The problem is the structure of the file itself: each line is a string of text whose parts belong in different columns, but BULK INSERT in SQL puts it all into one column instead of six. PowerShell may not be the best method, but I have seen on several sites how it can find and replace or append to the end of a line. Can it also count characters and insert a delimiter?
So the file looks like this: '18240087A +17135555555 3333333333', where 18, 24, 00, 87, A are different columns, then there is a blank space between the A and the +, that is character count 10-19 which is another column, then characters 20-30 are a column, characters 31-36 are a space which is new column and so on. So I want to insert a '|' or a ',' so that sql understands where the columns end. Is this possible for PowerShell to count randomly?
This may not be the way to respond to all who did answer, i apologize in advance. As this is my first PowerShell script, I appreciate the input from each of you. This is an Avaya SIP server that is generating CDR records, which I must pull from the server and insert in to SQL for later reports. The file exported looks like this:
18:47 10/15
18470214A +14434444444 3013777777 CME-SBC HHHH-CM 4 M00 0
At first I just thought to delete the first line and run a script against the output, which I modified from Kieranties post:
$test = Get-Content C:\Share\CDR\testCDR.txt
$pattern = "^(.{2})(.{2})(.{1})(.{2})(.{1})(.{1})\s*(.{15})(.{10})\s*(.{7})\s*(.{7})\s*(.{1})\s*(.{1})(.{1})(.{1})\s*(.*)$"
if($test -match $pattern){
$result = $matches.Values | select -first ($matches.Count-1)
[array]::Reverse($result, 0, $result.Length)
$result = $result -join "|"
$result | Out-File c:\Share\CDR\results1.txt
}
But then I realized I need that first line, as it contains the date. I can try to work that out another way though.
I also now see that there are times when the file contains 2 or more lines of CDR info, such as:
18:24 10/15
18240087A +14434444444 3013777777 CME-SBC HRSA-CM 4 M00 0
18240096A +14434444445 3013777778 CME-SBC HRSA-CM 4 M00 0
Whereas the .ps1 file I made does not give the second string, so I tried adding in this:
foreach ($Data in $test)
{
$Data = $Data -split(',')
and it fails to run. How can I do multiple lines (and possibly that first line)? If you know of a tutorial that can help, that's greatly appreciated as well!

PowerShell is a great tool that I love and it can do many things. I see that you are using SQL Server 2008. Depending on the edition of SQL Server you have running on the server, it most likely has SQL Server Integration Services (SSIS), which is an Extract, Transform, and Load (ETL) tool designed to help migrate data in many scenarios, such as yours. The file you describe here sounds like a fixed-width file, which SSIS can easily handle and import. SQL Server also has great ways to automate the loads if this is a recurring need (which it sounds like), including automating the sftp task and even running PowerShell scripts as part of the ETL (I've done that several times).
If your file truly is fixed width and you want to use PowerShell to transform it into a delimited file, the regex approach you have in your answer works well, or there are several approaches using the System.String methods, like .insert() which allows you to insert a delimiter character using a character index in your line (use Get-Content to read the file and create one String object per line, then loop through them using Foreach loop or Foreach-Object and the pipeline). A slightly more difficult approach would be to use the .Substring() method. You could build your new String line using Substring to extract each column and concatenating those values with a delimiter. That's probably a lot for someone new to PowerShell, but one of the best ways to learn and gain proficiency with it is to practice writing the same script multiple ways. You can learn new techniques that may solve other problems you might encounter in the future.
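The .Substring() approach described above can be sketched as follows. This is only an illustration: the column table (`$columns`) and the function name `Convert-FixedWidthLine` are hypothetical, and the start/length values are guessed from the single sample line in the question, so they would need adjusting to the real record layout.

```powershell
# Hypothetical column layout (0-based start index, length); an assumption for illustration only.
$columns = @(
    @{ Start = 0;  Length = 2 },
    @{ Start = 2;  Length = 2 },
    @{ Start = 4;  Length = 2 },
    @{ Start = 6;  Length = 2 },
    @{ Start = 8;  Length = 1 },
    @{ Start = 10; Length = 12 },
    @{ Start = 23; Length = 10 }
)

function Convert-FixedWidthLine {
    param([string]$Line, [object[]]$Columns)
    $fields = foreach ($col in $Columns) {
        # A short line may end before this column starts: emit an empty field
        if ($col.Start -ge $Line.Length) { ''; continue }
        # Clamp the length so Substring never reads past the end of the line
        $len = [Math]::Min($col.Length, $Line.Length - $col.Start)
        $Line.Substring($col.Start, $len).Trim()
    }
    $fields -join '|'
}

$sample = '18240087A +17135555555 3333333333'
$converted = Convert-FixedWidthLine -Line $sample -Columns $columns
$converted   # 18|24|00|87|A|+17135555555|3333333333
```

To process a whole file, the same function could be called once per line from Get-Content piped to ForEach-Object, with the results piped to Set-Content.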

This is one way (really ugly IMO; I think it can be done better):
$a = '18240087A +17135555555 3333333333'
$b = @( ($a[0..1] -join ''), ($a[2..3] -join ''), ($a[4..5] -join ''),
        ($a[6..7] -join ''), ($a[8] -join ''), ($a[10..19] -join ''),
        ($a[20..30] -join ''), ($a[31..36] -join ''))
$c = $b -join '|'
$c
18|24|00|87|A|+171355555|55 33333333|33
I don't know if this is the right split for your data, but by changing the values in each [x..y] you can get whatever best fits your needs. Remember that character arrays are 0-based, so the first char is at index 0, and so on.

I don't quite follow the splitting rules. What kind of software writes the text file anyway? Maybe it can be instructed to change the structure?
That being said, inserting pipes is easy enough with .Insert()
$a= '18240087A +17135555555 3333333333'
$a.Substring(0, $a.IndexOf('+')).Insert(2, '|').insert(5,'|').insert(8, '|').insert(11, '|').insert(13, '|')
# Output: 18|24|00|87|A|
# Rest of the line:
$a.Substring($a.IndexOf('+')+1)
# Output: 17135555555 3333333333
From there you can proceed to splitting the rest of the row data.

I've improved my answer based on your response (note, it's probably best you update your actual question to include that information!)
The nice thing about Get-Content in Powershell is that it returns the content as an array split on the end of line characters. Couple that with allowing multiple assignment from an array and you end up with some neat code.
The following has a function to process each line based on your modified version of my original answer. It's then wrapped by a function which processes the file.
This reads the given file, setting the first line to $date and the rest of the content to $content. It then creates an output file, adds the date to the output, then loops over the rest of the content, performing the regex check and adding the parsed version of each line if the check is successful.
Function Parse-CDRFileLine {
Param(
[string]$line
)
$pattern = "^(.{2})(.{2})(.{1})(.{2})(.{1})(.{1})\s*(.{15})(.{10})\s*(.{7})\s*(.{7})\s*(.{1})\s*(.{1})(.{1})(.{1})\s*(.*)$"
if($line -match $pattern){
$result = $matches.Values | select -first ($matches.Count-1)
[array]::Reverse($result, 0, $result.Length)
$result = $result -join "|"
$result
}
}
Function Parse-CDRFile{
Param(
[string]$filepath
)
# Read content, setting first line to $date, the rest to $content
$date,$content = Get-Content $filepath
# Create the output file, overwrite if necessary
$outputFile = New-Item "$filepath.out" -ItemType file -Force
# Add the date line
Set-Content $outputFile $date
# Process the rest of the content
$content |
? { -not([string]::IsNullOrEmpty($_)) } |
% { Add-Content $outputFile (Parse-CDRFileLine $_) }
}
Parse-CDRFile "C:\input.txt"
I used your sample input and the result I get is:
18:24 10/15
18|24|0|08|7|A|+14434444444 30|13777777 C|ME-SBC |HRSA-CM|4|M|0|0|0
18|24|0|09|6|A|+14434444445 30|13777778 C|ME-SBC |HRSA-CM|4|M|0|0|0
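One caveat with the code above: $matches is a hashtable, so the order of $matches.Values is not something to rely on, and the select/Reverse step depends on it. A possibly more robust sketch is to read the capture groups by number. The pattern below is a simplified, hypothetical one used only to keep the example short; it is not the full CDR layout.

```powershell
$line = '18240087A +14434444444 3013777777'
# Simplified, hypothetical pattern for illustration only (not the full CDR layout)
$pattern = '^(.{2})(.{2})(.{2})(.{2})(.)\s+(\S+)\s+(\S+)$'

if ($line -match $pattern) {
    # $Matches[0] is the whole match; keys 1..N hold the capture groups
    $groupCount = $Matches.Count - 1
    # Look the groups up by number so the order is explicit
    $result = (1..$groupCount | ForEach-Object { $Matches[$_] }) -join '|'
    $result   # 18|24|00|87|A|+14434444444|3013777777
}
```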
There are an incredible number of resources out there, but one I particularly suggest is Douglas Finke's PowerShell for Developers. It's short, concise and full of great info that will get you thinking in the right mindset with PowerShell.

Related

Powershell - Efficient way to keep content and append to the same file?

I want to keep the first comment section lines of a file and overwrite everything else. Currently this section is 27 lines long.
Each line begins with a # (think of it as a giant comment section).
What I want to do is keep the initial comment section, delete everything following the comment section, then append a new string to this file just below this comment section.
I found a way to hardcode it, but I think this is pretty inefficient. I don't think it's best to hardcode 27 as a literal.
The way I've handled it is:
$fileProc = Get-Content $someFile
$keep = $fileProc[0..27]
$keep | Set-Content $someFile
Add-Content $someFile "`n`n# Insert new string here"
Add-Content $someFile "`n EMPTY_PROCESS.EXE"
Is there a more efficient way to handle this?
You can use a switch statement to efficiently extract the section of comment lines at the start.
Set-Content out.txt -Value $(
  @(
    switch -Wildcard -File $someFile {
      '#*' { $_ }
      default { break } # End of comments section reached.
    }
  ) + "`n`n# Insert new string here", "`n EMPTY_PROCESS.EXE"
)
Note:
To be safe, the above writes to a new file, out.txt, but you can write directly back to $someFile, if desired.
Wildcard expression #* assumes that each line in the comment section starts with #, with no preceding whitespace; if you need to account for preceding whitespace, use the -Regex switch in lieu of -Wildcard, and use regex '^\s*#' in lieu of '#*'
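A minimal sketch of the -Regex variant mentioned in the note, for comment lines that may be indented. The sample file and its contents are made up here purely so the snippet is self-contained.

```powershell
# Hypothetical sample file, created only so the sketch can run on its own
$someFile = Join-Path ([System.IO.Path]::GetTempPath()) 'switch-regex-demo.txt'
Set-Content $someFile -Value '# header 1', '   # indented header 2', 'data line', '# later comment'

$header = @(
    switch -Regex -File $someFile {
        '^\s*#' { $_ }       # comment line, possibly indented: keep it
        default { break }    # first non-comment line: stop reading the file
    }
)
$header   # the two leading comment lines only

Remove-Item $someFile
```

Because the switch stops at the first non-comment line, the later comment after the data line is not picked up.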
Not sure about limiting it to first set of 27 or so lines but this should work.
First line below is to only keep the lines of file that start with '#'.
(Get-Content $somefile) | Where { $_ -match "^#" } | Set-Content $somefile
Add-Content $somefile "`n`nblah blah"
Add-Content $somefile "`nglug glug blug glug"
You can then use Add-Content for additional lines. Hope this helps :]
Efficient way [...] pretty inefficient [...] a more efficient way
Don't open the file many times, paying the cost of ACL security and AntiVirus checks and disk access delays.
Avoid PowerShell cmdlets and scriptblocks.
Avoid loops in PowerShell, push work to lower layers.
Avoid heavyweight searches like regex and wildcard.
Avoid making arrays of string for the lines.
Open file once, do a single linear scan and truncate when the pattern is found then write new data. Assuming no other comment lines in the data the pattern is "the last "\n#" is the start of the last comment, then the newline after that is the cutoff". e.g.:
$f = [System.IO.FileStream]::new('d:\test.txt', 'open')
$content = [System.IO.StreamReader]::new($f).ReadToEnd()
$lastComment = $content.LastIndexOf("`n#")
$nextLine = $content.IndexOf("`n", 1+$lastComment)
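$f.SetLength($nextLine + 1) # truncate just after the newline that ends the last comment line (assumes a single-byte encoding with no BOM, so character index equals byte offset)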
$w = [System.IO.StreamWriter]::new($f)
$w.WriteLine("new next Line")
$w.Close()
If there could be other comment lines, redesign the file so there is a sentinel value to find - easier than finding the absence of a thing.
Compared to mklement0's answer this doesn't cost any PowerShell cmdlet startup time, uses no subshells, no wildcard pattern matching, no arrays of string, and doesn't open the file twice. On a file with 10,000 comment lines:
your original code takes ~0.4 seconds
mklement0's code takes ~0.04 seconds
this code takes ~0.02 seconds.
A more efficient way - QED.

In Powershell, how do I build a file by appending a string with another files' contents?

I am working on a Powershell script that does the following:
Pulls down git ignore patterns from https://gitignore.io
Reads custom git ignore patterns from a text file on the filesystem
Appends both to a .gitignore file on disk, overwriting previous contents.
The end result should be a file with gitignore.io patterns PLUS any custom defined patterns from another file.
Here is what I've come up with so far:
Function gig {
param(
[Parameter(Mandatory=$true)]
[string[]]$list
)
$params = ($list | ForEach-Object { [uri]::EscapeDataString($_) }) -join ","
Invoke-WebRequest -Uri "https://www.gitignore.io/api/$params" | select -ExpandProperty content
}
gig angular,csharp,images,vagrant | cat gitignore-patterns.txt | sc .gitignore
I realize this script is invalid, as the last line doesn't even work. But I typed it up this way to try to show, logically, what I want. I'm normally working in Linux Bash scripts, so what I'm trying models that a bit, although I am realizing it doesn't translate well to Powershell. I'm hoping there's a way to pipe commands together, to build a string, that eventually gets piped out to a file. Or something equally simple.
Note that my research so far into this has shown piecemeal solutions, but no solution pulling this all together. For example, I saw SO questions showing how to concat two strings, write a string to a file, append to a file, etc. But no single solution involving all of them together. I'm hoping to get that bigger picture solution here.
Wrap both statements in an array subexpression @(), and pipe to Set-Content:
@(
  gig angular,csharp,images,vagrant
  Get-Content gitignore-patterns.txt
) | Set-Content .gitignore

Aligning the corrupted data records in a text file using powershell

My data file(.txt) has records of 31 fields/columns each and the fields are pipe delimited. Somehow, few records are corrupted(the record is split into multiple lines).
Can anyone guide in writing a script that reads this input data file and shapes it into a file containing exactly 31 fields in each record?
PS: I am new to powershell.
Sample data:
Good data - Whole record shows up in a single line.
Bad data - Record is broken into multiple lines.
Below is the structure of the record.
11/16/2007||0007327| 3904|1000|M1||CCM|12/31/2009|000|East 89th Street|01CM1| 11073|DONALD INC|001|Project 077|14481623.8100|0.0000|1.00000|1|EA|September 2007 Invoice|Project 027||000000000000|1330|11/16/2007|X||11/29/2007|2144.57
Here is what i have tried and script hangs
#Setup paths
$Input = "Path\Input.txt"
$Output = "Path\Output.txt"
#Create empty variables to set types
$Record=""
$Collection = @()
#Loop through text file
gc Path\Input.txt | %{
$Record = "$Record$_"
If($Record -Match "(\d{1,2}/\d{1,2}/\d{4}(?:\|.*?){31})(\d{1,2}/\d{1,2}/\d{4}\|.*?\|.*)"){
$Collection+=$Matches[1]
$Record=$Matches[2]
}
}
#Add last record to the collection
$Collection+=$Record
$Collection | Out-File $Output
I see some issues that need to be clarified or addressed. First, the line $Record=$Matches[2] does not appear to serve a purpose. Second, your regex string appears to have some flaws, which is what you were looking for. I tested your regex against your test data here: http://regex101.com/r/yA9tZ1/1
At least on that site the forward slashes needed to be escaped. Once I escaped them, the tester threw this error at me:
Your expression took too long to evaluate.
I know the root of that issue comes from this portion of your regex which is trying to match your passive group with a non greedy quantifier 31 times. (?:\|.*?){31}
So taking a guess as to your true intention I have the following regex string
(\d{1,2}\/\d{1,2}\/\d{4}.{31}).*?(\d{1,2}\/\d{1,2}\/\d{4}\|.*?\|.*)
You can see the results here: http://regex101.com/r/qY1jZ7/2
While i doubt it is exactly what you wanted I hope this leads you in the right direction.
I just tried this, and while that solution worked for an extremely similar issue where the user only had 11 fields per record, apparently it's just no good for your 31 field records. I'd like to suggest an alternative using -Split alongside a couple of regex matches. This should work faster for you I think.
#Create regex objects to match against
[RegEx]$Regex = "(.*?)(\d{2}/\d{2}/\d{4})$"
[RegEx]$Regex2 = "(\d{2}/\d{2}/\d{4}.*)"
#Setup paths
$Input = "Path\Input.txt"
$Output = "Path\Output.txt"
#Create empty variables to set types
$Record=""
$Collection = @()
#Loop through text file
gc $Input | %{
If($_ -match "^\d{1,2}/\d{1,2}/\d{4}" -and $record.split("|").count -eq 31){$collection+=$record;$record=$_}
else{
$record="$record$_"
if($record.split("|").count -gt 31){
$collection+=$regex.matches(($record.split("|")[0..30]) -join "|").groups[1].value
$record=$regex2.matches(($record.split("|")[30..($record.split("|").count)]) -join "|").groups[1].value
}
}
}
#Add last record to the collection
$collection+=$record
#Output everything to a file
$collection|out-file $Output

handling a CSV with line feed characters in a column in powershell

Currently, I have a system which creates a delimited file like the one below in which I've mocked up the extra line feeds which are within the columns sporadically.
Column1,Column2,Column3,Column4
Text1,Text2[LF],text3[LF],text4[CR][LF]
Text1,Text2[LF][LF],text3,text4[CR][LF]
Text1,Text2,text3[LF][LF],text4[CR][LF]
Text1,Text2,text3[LF],text4[LF][LF][CR][LF]
I've been able to remove the line feeds causing me concern by using Notepad++ using the following REGEX to ignore the valid carriage return/Line feed combinations:
(?<![\r])[\n]
I am unable however to find a solution using powershell, because I think when I get-content for the csv file the line feeds within the text fields are ignored and the value is stored as a separate object in the variable assigned to the get-content action. My question is how can I apply the regex to the csv file using replace if the cmdlet ignores the line feeds when loading the data?
I've also tried the following method below to load the content of my csv which doesn't work either as it just results in one long string, which would be similar to using -join(get-content).
[STRING]$test = [io.file]::ReadAllLines('C:\CONV\DataOutput.csv')
$test.replace("(?<![\r])[\n]","")
$test | out-file .\DataOutput_2.csv
Nearly there, may I suggest just 3 changes:
use ReadAllText(…) instead of ReadAllLines(…)
use -replace … instead of .Replace(…), only then will the first argument be treated as a regex
do something with the replacement result (e.g. assign it back to $test)
Sample code:
[STRING]$test = [io.file]::ReadAllText('C:\CONV\DataOutput.csv')
$test = $test -replace '(?<![\r])[\n]',''
$test | out-file .\DataOutput_2.csv
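The difference between the .Replace() method and the -replace operator can be seen in a small sketch. The sample string is made up; the pattern is the one from the question.

```powershell
# Made-up sample: one stray LF inside the row, and a valid CRLF at the end
$text = "Text1,Text2`n,text3`r`n"

# .Replace() treats its first argument as literal text, so nothing matches
$literal = $text.Replace('(?<![\r])[\n]', '')

# -replace treats it as a regex: the bare LF is removed, the CRLF is kept
$regex = $text -replace '(?<![\r])[\n]', ''

$literal -eq $text                    # True: the string is unchanged
$regex -eq "Text1,Text2,text3`r`n"    # True: only the stray LF is gone
```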

Find and Replace in a Large File

I want to find a piece of text in a large xml file and want to replace with some other text. The size of the file is around (50GB). I want to do this in command line. I am looking at PowerShell and want to know if it can handle the large size.
Currently I am trying something like this but it does not like it
Get-Content C:\File1.xml | Foreach-Object {$_ -replace "xmlns:xsi=\"http:\/\/www\.w3\.org\/2001\/XMLSchema-instance\"", ""} | Set-Content C:\File1.xml
The text I want to replace is xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" with an empty string "".
Questions
Can PowerShell handle large files?
I don't want the replace to happen in memory and prefer streaming, assuming that will not bring the server to its knees.
Are there any other approaches I can take (different tools/strategy)?
Thanks
I had a similar need (and similar lack of powershell experience) but cobbled together a complete answer from the other answers on this page plus a bit more research.
I also wanted to avoid the regex processing, since I didn't need it either -- just a simple string replace -- but on a large file, so I didn't want it loaded into memory.
Here's the command I used (adding linebreaks for readability):
Get-Content sourcefile.txt |
Foreach-Object {$_.Replace('http://example.com', 'http://another.example.com')} |
Set-Content result.txt
Worked perfectly! Never sucked up much memory (it very obviously didn't load the whole file into memory), and just chugged along for a few minutes then finished.
Aside from worrying about reading the file in chunks to avoid loading it into memory, you need to dump to disk often enough that you aren't storing the entire contents of the resulting file in memory.
Get-Content sourcefile.txt -ReadCount 10000 |
Foreach-Object {
$line = $_.Replace('http://example.com', 'http://another.example.com')
Add-Content -Path result.txt -Value $line
}
The -ReadCount <number> parameter sets the number of lines to read at a time; ForEach-Object then writes each batch of lines as it is read. For a 30GB file filled with SQL inserts, I topped out around 200MB of memory and 8% CPU, whereas piping it all into Set-Content hit 3GB of memory before I killed it.
It does not like it because you can't read from a file and write back to it at the same time using Get-Content/Set-Content. I recommend using a temp file and then at the end, rename file1.xml to file1.xml.bak and rename the temp file to file1.xml.
Yes as long as you don't try to load the whole file at once. Line-by-line will work but is going to be a bit slow. Use the -ReadCount parameter and set it to 1000 to improve performance.
Which command line? PowerShell? If so then you can invoke your script like so .\myscript.ps1 and if it takes parameters then c:\users\joe\myscript.ps1 c:\temp\file1.xml.
In general for regexes I would use single quotes if you don't need to reference PowerShell variables. Then you only need to worry about regex escaping and not PowerShell escaping as well. If you need to use double-quotes then the back-tick character is the escape char in double-quotes e.g. "`$p1 is set to $ps1". In your example single quoting simplifies your regex to (note: forward slashes aren't metacharacters in regex):
'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
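As a small sketch of the quoting point, here is the same pattern written both ways and applied with -replace. The sample element in `$xml` is made up, and the dots are escaped only to make the regex strict; everything else is literal.

```powershell
# Made-up sample line containing the attribute to strip
$xml = '<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="1">'

# Single-quoted: no PowerShell escaping at all; only the regex dots are escaped
$pattern = 'xmlns:xsi="http://www\.w3\.org/2001/XMLSchema-instance"'

# Double-quoted equivalent: the embedded quotes need a backtick, backslashes do not
$pattern2 = "xmlns:xsi=`"http://www\.w3\.org/2001/XMLSchema-instance`""

$cleaned = $xml -replace $pattern, ''
$cleaned                 # <root  id="1">
$pattern -ceq $pattern2  # True: both spellings produce the same string
```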
Absolutely you want to stream this since 50GB won't fit into memory. However, this poses an issue if you process line-by-line. What if the text you want to replace is split across multiple lines?
If you don't have the split line issue then I think PowerShell can handle this.
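For the case where the text never spans lines, a line-by-line streaming pass could be sketched with the .NET StreamReader and StreamWriter classes. The temp-file paths and the three-line sample input are assumptions made so the snippet is self-contained; it also assumes the real file has reasonably sized lines, since each line is read whole.

```powershell
# Hypothetical paths; a tiny sample file stands in for the large XML file
$inPath  = Join-Path ([System.IO.Path]::GetTempPath()) 'stream-demo-in.xml'
$outPath = Join-Path ([System.IO.Path]::GetTempPath()) 'stream-demo-out.xml'
Set-Content $inPath -Value '<a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">', '<b/>', '</a>'

$find   = ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
$reader = [System.IO.StreamReader]::new($inPath)
$writer = [System.IO.StreamWriter]::new($outPath)
try {
    # Only one line is held in memory at a time
    while ($null -ne ($line = $reader.ReadLine())) {
        $writer.WriteLine($line.Replace($find, ''))
    }
}
finally {
    # Always release the file handles, even if the loop throws
    $reader.Dispose()
    $writer.Dispose()
}

$result = Get-Content $outPath
Remove-Item $inPath, $outPath
```

Full paths are used because the .NET classes resolve relative paths against the process working directory, which may differ from the current PowerShell location.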
This is my take on it, building on some of the other answers here:
Function ReplaceTextIn-File{
Param(
$infile,
$outfile,
$find,
$replace
)
if( -Not $outfile)
{
$outfile = $infile
}
$temp_out_file = "$outfile.temp"
Get-Content $infile | Foreach-Object {$_.Replace($find, $replace)} | Set-Content $temp_out_file
if( Test-Path $outfile)
{
Remove-Item $outfile
}
Move-Item $temp_out_file $outfile
}
And called like so:
ReplaceTextIn-File -infile "c:\input.txt" -find 'http://example.com' -replace 'http://another.example.com'
The escape character in powershell strings is the backtick ( ` ), not backslash ( \ ). I'd give an example, but the backtick is also used by the wiki markup. :(
The only thing you should have to escape is the quotes - the periods and such should be fine without.