I have an XML file that contains a line like this:
<!--<__AMAZONSITE id="-123456780" instance ="CATZ00124"__/>-->
I need the id and instance values from that particular line,
i.e. I need -123456780 as well as CATZ00124 in 2 different variables.
Below is the sample code that I have tried:
$xmlfile = 'D:\Test\sample.xml'
$find_string = '__AMAZONSITE'
$array = @((Get-Content $xmlfile) | select-string $find_string)
Write-Host $array.Length
foreach ($commentedline in $array)
{
Write-Host $commentedline.Line.Split('id=')
}
I am getting below result:
<!--<__AMAZONSITE
"-123456780"
nstance
"CATZ00124"__/>
The preferred way still is to use XML tools for XML files.
As long as a line with AMAZONSITE and instance is unique in the file, this could do:
## Q:\Test\2019\09\13\SO_57923292.ps1
$xmlfile = 'D:\Test\sample.xml' # '.\sample.xml' #
## see following RegEx live and with explanation on https://regex101.com/r/w34ieh/1
$RE = '(?<=AMAZONSITE id=")(?<id>[\d-]+)" instance ="(?<instance>[^"]+)"'
if((Get-Content $xmlfile -raw) -match $RE){
$AmazonSiteID = $Matches.id
$Instance = $Matches.instance
}
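For comparison, here is a minimal sketch of the "XML tools" route mentioned above. It assumes a small, well-formed sample document (the <root> and <item> elements are made up for illustration). Since the body of the comment is not well-formed XML itself, a regex is still needed to pull the values out of the comment text; the XML parser merely locates the comment nodes.

```powershell
# Hypothetical sample document; the real file's structure is an assumption.
$doc = [xml] @'
<root>
  <!--<__AMAZONSITE id="-123456780" instance ="CATZ00124"__/>-->
  <item>value</item>
</root>
'@

# Regex with named capture groups, applied to the text of each comment node.
$re = 'AMAZONSITE id="(?<id>[\d-]+)" instance ="(?<instance>[^"]+)"'

# Visit all comment nodes and emit one object per matching comment.
$found = foreach ($comment in $doc.SelectNodes('//comment()')) {
    if ($comment.InnerText -match $re) {
        [pscustomobject] @{ Id = $Matches.id; Instance = $Matches.instance }
    }
}
$found
```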
LotPings' answer sensibly recommends using a regular expression with capture groups to extract the substrings of interest from each matching line.
You can incorporate that into your Select-String call for a single-pipeline solution (the assumption is that the XML comments of interest are all on a single line each):
# Define the regex to use with Select-String, which both
# matches the lines of interest and captures the substrings of interest
# ('id' and 'instance' attributes) via capture groups, (...)
$regex = '<!--<__AMAZONSITE id="(.+?)" instance ="(.+?)"__/>-->'
Select-String -LiteralPath $xmlfile -Pattern $regex | ForEach-Object {
# Output a custom object with properties reflecting
# the substrings of interest reported by the capture groups.
[pscustomobject] @{
id = $_.Matches.Groups[1].Value
instance = $_.Matches.Groups[2].Value
}
}
The result is an array of custom objects that each have an .id and .instance property with the values of interest (which is preferable to setting individual variables); in the console, the output would look something like this:
id instance
-- --------
-123456780 CATZ00124
-123456781 CATZ00125
-123456782 CATZ00126
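If you want to try this without a file on disk, the following sketch pipes in-memory strings to Select-String instead. The sample lines are made up, and the last two statements show how individual variables could still be filled from one of the result objects.

```powershell
# Hypothetical in-memory lines standing in for the file's content.
$lines = @(
    '<root>'
    '<!--<__AMAZONSITE id="-123456780" instance ="CATZ00124"__/>-->'
    '<!--<__AMAZONSITE id="-123456781" instance ="CATZ00125"__/>-->'
    '</root>'
)

$regex = '<!--<__AMAZONSITE id="(.+?)" instance ="(.+?)"__/>-->'

# One custom object per matching line.
$results = $lines | Select-String -Pattern $regex | ForEach-Object {
    [pscustomobject] @{
        id       = $_.Matches.Groups[1].Value
        instance = $_.Matches.Groups[2].Value
    }
}

# Individual variables, if still wanted, can be taken from any one result.
$firstId       = $results[0].id
$firstInstance = $results[0].instance
```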
As for what you tried:
Note: I'm discussing your use of .Split(), though for extracting a substring, as is your intent, .Split() is not the best tool, given that it is only the first step toward isolating the substring of interest.
As LotPings notes in a comment, in Windows PowerShell, $commentedline.Line.Split('id=') causes the String.Split() method to split the input string by any of the individual characters in split string 'id=', because the method overload that Windows PowerShell selects takes a char[] value, i.e. an array of characters, which is not your intent.
You could rectify this as follows, by forcing use of the overload that accepts string[] (even though you're only passing one string), which also requires passing an options argument:
$commentedline.Line.Split([string[]] 'id=', 'None') # OK, splits by whole string
Note that in PowerShell Core the logic is reversed, because .NET Core introduced a new overload with just [string] (with an optional options argument), which PowerShell Core selects by default. Conversely, this means that if you do want by-any-character splitting in PowerShell Core, you must cast the split string to [char[]].
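A small sketch of how to make the intent explicit in either edition, using a sample line: casting the separator to [string[]] or [char[]] selects the overload yourself, so the result no longer depends on which overload PowerShell would pick by default.

```powershell
$line = '<!--<__AMAZONSITE id="-123456780" instance ="CATZ00124"__/>-->'

# Split by the whole string 'id=': force the string[] overload.
$byString = $line.Split([string[]] 'id=', [StringSplitOptions]::None)

# Split by ANY of the characters 'i', 'd', '=': force the char[] overload.
$byChars = $line.Split([char[]] 'id=')

$byString.Count   # 2 parts: before and after 'id='
$byChars.Count    # more parts, because 'i' and '=' also occur in 'instance ='
```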
On a general note, PowerShell has the -split operator, which is regex-based and offers much more flexibility than String.Split() - see this answer.
Applied to your case:
$commentedline.Line -split 'id='
While id= is interpreted as a regex by -split, that makes no difference here, given that the string contains no regex metacharacters (characters with special meaning); if you do want to safely split by a literal substring, use [regex]::Escape('...') as the RHS.
Note that -split is case-insensitive by default, as PowerShell generally is; however, you can use the -csplit variant for case-sensitive matching.
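To illustrate both points with made-up sample strings: a separator that contains a regex metacharacter needs escaping, and -csplit only matches the exact case.

```powershell
$text = 'a.b.c'

# '.' is a regex metacharacter (any character), so this does NOT split by literal dots.
$unescaped = $text -split '.'

# Escaping makes -split treat the separator literally.
$escaped = $text -split [regex]::Escape('.')

# Case-insensitive (default) vs. case-sensitive splitting.
$insensitive = 'aXbxc' -split 'x'
$sensitive   = 'aXbxc' -csplit 'x'
```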
Related
After playing around with a PowerShell script for a while, I was wondering whether there is a version of this without using C#. It feels like I am missing some information on how to pipe things properly.
$packages = Get-ChildItem "C:\Users\A\Downloads" -Filter "*.nupkg" |
%{ $_.Name }
# Select-String -Pattern "(?<packageId>[^\d]+)\.(?<version>[\w\d\.-]+)(?=.nupkg)" |
# %{ @($_.Matches[0].Groups["packageId"].Value, $_.Matches[0].Groups["version"].Value) }
foreach ($package in $packages){
$match = [System.Text.RegularExpressions.Regex]::Match($package, "(?<packageId>[^\d]+)\.(?<version>[\w\d\.-]+)(?=.nupkg)")
Write-Host "$($match.Groups["packageId"].Value) - $($match.Groups["version"].Value)"
}
Originally I tried to do this with PowerShell only and thought that with @(1,2,3) you could create an array.
I ended up bypassing the issue by doing the regex with C# instead of PowerShell, which works, but I am curious how this would have been done with PowerShell only.
While there are 4 packages, the PowerShell-only version produced 8 lines. So accessing my data like $packages[0][0] to get a package id never worked, because the 8 lines were strings, whereas I expected 4 arrays to be returned.
Terminology note re without using c#: You mean without direct use of .NET APIs. By contrast, C# is just another .NET-based language that can make use of such APIs, just like PowerShell itself.
Note:
The next section answers the following question: How can I avoid direct calls to .NET APIs for my regex-matching code in favor of using PowerShell-native commands (operators, automatic variables)?
See the bottom section for the Select-String solution that was your true objective; the tl;dr is:
# Note the `, `, which ensures that the array is output *as a single object*
%{ , @($_.Matches[0].Groups["packageId"].Value, $_.Matches[0].Groups["version"].Value) }
The PowerShell-native (near-)equivalent of your code is (note that the assumption is that $package contains the content of the input file):
# Caveat: -match is case-INSENSITIVE; use -cmatch for case-sensitive matching.
if ($package -match '(?<packageId>[^\d]+)\.(?<version>[\w\d\.-]+)(?=.nupkg)') {
"$($Matches['packageId']) - $($Matches['Version'])"
}
-match, the regular-expression matching operator, is the equivalent of [System.Text.RegularExpressions.Regex]::Match() (which you can shorten to [regex]::Match()) in that it only looks for (at most) one match.
Caveat re case-sensitivity: -match (and its rarely used alias -imatch) is case-insensitive by default, as all PowerShell operators are; for case-sensitive matching, use the c-prefixed variant, -cmatch.
By contrast, .NET APIs are case-sensitive by default; you'd have to pass the [System.Text.RegularExpressions.RegexOptions]::IgnoreCase flag to [regex]::Match() for case-insensitive matching (you may use 'IgnoreCase', which PowerShell auto-converts for you).
As of PowerShell 7.2.x, there is no operator that is the equivalent of the related return-ALL-matches .NET API, [regex]::Matches(). See GitHub issue #7867 for a green-lit but yet-to-be-implemented proposal to introduce one, named -matchall.
However, instead of directly returning an object describing what was (or wasn't) matched, -match returns a Boolean, i.e. $true or $false, to indicate whether matching succeeded.
Only if -match returns $true does information about a match become available, namely via the automatic $Matches variable, which is a hashtable reflecting the matching parts of the input string: entry 0 is always the full match, with optional additional entries reflecting what any capture groups ((...)) captured, either by index, if they're anonymous (starting with 1) or, as in your case, for named capture groups ((?<name>...)) by name.
Syntax note: Given that PowerShell allows use of dot notation (property-access syntax) even with hashtables, the above command could have used $Matches.packageId instead of $Matches['packageId'], for instance, which also works with the numeric (index-based) entries, e.g., $Matches.0 instead of $Matches[0]
Caveat: If an array (enumerable) is used as the LHS operand, -match's behavior changes:
$Matches is not populated.
filtering is performed; that is, instead of returning a Boolean indicating whether matching succeeded, the subarray of matching input strings is returned.
Note that the $Matches hashtable only provides the matched strings, not also metadata such as index and length, as found in [regex]::Match()'s return object, which is of type [System.Text.RegularExpressions.Match].
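The behavior described above can be tried out with a sketch like the following; the file name is a made-up example.

```powershell
$name = 'Newtonsoft.Json.13.0.1.nupkg'   # hypothetical package file name
$re   = '(?<packageId>[^\d]+)\.(?<version>[\w\d\.-]+)(?=.nupkg)'

# Scalar LHS: -match returns a Boolean and populates $Matches on success.
$ok = $name -match $re

$id      = $Matches.packageId    # dot notation
$version = $Matches['version']   # index notation
$whole   = $Matches[0]           # entry 0 is the full match

# Array LHS: -match acts as a filter and returns the matching elements instead.
$filtered = 'a1', 'b', 'c2' -match '\d'
```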
Select-String solution:
$packages |
Select-String '(?<packageId>[^\d]+)\.(?<version>[\w\d\.-]+)(?=.nupkg)' |
ForEach-Object {
"$($_.Matches[0].Groups['packageId'].Value) - $($_.Matches[0].Groups['version'].Value)"
}
Select-String outputs Microsoft.PowerShell.Commands.MatchInfo instances, whose .Matches collection contains one or more [System.Text.RegularExpressions.Match] instances, i.e. instances of the same type as returned by [regex]::Match()
Unless -AllMatches is also passed, .Matches only ever has one entry, hence the use of [0] to target that entry above.
As you can see, working with Select-String's output objects requires you to ultimately work with the same .NET type as when you call [regex]::Match() directly.
However, no method calls are required, and discovering the properties of the output objects is made easy in PowerShell via the Get-Member cmdlet.
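A short sketch of the difference that -AllMatches makes, using a made-up input string; the last statement shows the direct .NET counterpart, [regex]::Matches(), for comparison.

```powershell
$text = 'a1 b22 c333'

# Without -AllMatches: only the first match per input string is reported.
$first = ($text | Select-String -Pattern '\d+').Matches

# With -AllMatches: .Matches contains every match in the input string.
$all = ($text | Select-String -Pattern '\d+' -AllMatches).Matches

# The direct .NET API that returns all matches.
$viaNet = [regex]::Matches($text, '\d+')

$all.Value   # member-access enumeration yields the matched strings
```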
If you want to capture the matches in a jagged array:
$capturedStrings = @(
$packages |
Select-String '(?<packageId>[^\d]+)\.(?<version>[\w\d\.-]+)(?=.nupkg)' |
ForEach-Object {
# Output an array of all capture-group matches,
# *as a single object* (note the `, `)
, $_.Matches[0].Groups.Where({ $_.Name -ne '0' }).Value
}
)
This returns an array of arrays, each element of which is the array of capture-group matches for a given package, so that $capturedStrings[0][0] returns the packageId value for the first package, for instance.
Note:
$_.Matches[0].Groups.Where({ $_.Name -ne '0' }).Value programmatically enumerates all capture-group matches and returns their .Value property values as an array, using member-access enumeration; note how name '0' must be excluded, as it represents the whole match.
With the capture groups in your specific regex, the above is equivalent to the following, as shown in a commented-out line in your question:
@($_.Matches[0].Groups['packageId'].Value, $_.Matches[0].Groups['version'].Value)
, ..., the unary form of the array-construction operator, is used as a shortcut for outputting the array (symbolized by ... here) as a whole, as a single object. By default, enumeration would occur and the elements would be emitted one by one. , ... is in effect a shortcut to the conceptually clearer Write-Output -NoEnumerate ... - see this answer for an explanation of the technique.
Additionally, @(...), the array subexpression operator, is needed in order to ensure that a jagged array (nested array) is returned even in the event that only one array is returned across all $packages.
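The effect of the unary comma is easiest to see in isolation. The following sketch uses made-up values instead of real regex matches:

```powershell
# Without the unary comma, each array is enumerated: 4 strings come out.
$flat = @(1..2 | ForEach-Object { @("id$_", "v$_") })

# With the unary comma, each array is output as a single object: 2 arrays come out.
$jagged = @(1..2 | ForEach-Object { , @("id$_", "v$_") })

# Thanks to the outer @(...), a single result is still wrapped in an outer array.
$single = @('only' | ForEach-Object { , @('id', 'v') })

$flat.Count      # 4
$jagged.Count    # 2
$jagged[0][0]    # id1
```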
I'm using powershell to run a command like so:
$getlist=rclone sha1sum remote:"\My Pictures\2009\03" --dry-run
Write-Output $getlist
That outputs an object with the results. The problem is that I only want the first column of those results. I've tried things like Format-Custom -Depth 1 and the other Format-* commands, but they don't work on this object.
that outputs an object with the results
While that is technically true, it is more specifically an [object[]]-typed array of lines ([string] instances) that assigning the stream of output lines - produced by the external rclone program - to a PowerShell variable implicitly created. (Arrays created by PowerShell are [object[]]-typed, even if all the elements are of the same type, such as [string] in this case).
PowerShell fundamentally only "speaks text" when communicating with external programs.
Therefore, to extract substrings from these lines you must perform text parsing, as implied by AdminOfThings' comment on the question.
A simplified approach is to use the unary form of the -split operator:
# Simulate lines input whose first whitespace-separated token is to
# be extracted.
$getlist = 'foo bar baz', 'more stuff here'
$getlist.ForEach({ (-split $_)[0] })
The above yields:
foo
more
zett42's helpful answer shows a simpler alternative that relies on the -replace operator's (among others) ability to operate directly on each element of an array-valued LHS.
However, the -split approach is useful if you want to extract multiple column values.
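For instance, here is a sketch that turns each line into an object with two properties. The sample lines and the "hash, then name" column layout are assumptions for illustration; names containing whitespace would need different parsing.

```powershell
# Hypothetical two-column lines (hash, then name).
$lines = 'da39a3ee  photo1.jpg', '2fd4e1c6  photo2.jpg'

$rows = $lines | ForEach-Object {
    # Destructure the whitespace-separated tokens into two variables.
    $hash, $name = -split $_
    [pscustomobject] @{ Hash = $hash; Name = $name }
}

$rows.Hash   # just the first column
```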
If you don't need / want to capture all of the external program's (rclone's) output in memory first, you can use streaming processing in the pipeline, via the ForEach-Object cmdlet:
'foo bar baz', 'more stuff here' | ForEach-Object { (-split $_)[0] }
Note: While slightly slower than collecting all lines in memory up front, the advantage of a pipeline-based approach is reduced memory load: only the extracted substrings are kept in memory (if assigned to a variable).
You can use a regular expression to remove the undesired parts of the output:
$getlist = $getlist -replace '\s.*'
When a PowerShell operator such as -replace is applied to a collection, it will be applied to each element individually, creating a new array that stores the results (see Substitution in a collection).
The regular expression removes everything from the first whitespace up to the end of the string.
RegEx breakdown:
\s - a single whitespace character like space and tab
.* - any character, zero or more times
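Tried on the same simulated input as in the other answer, as a quick sketch:

```powershell
# Simulated output lines; only the first whitespace-separated column is wanted.
$getlist = 'foo bar baz', 'more stuff here'

# -replace is applied to each element; with no replacement operand the match is removed.
$firstColumn = $getlist -replace '\s.*'

$firstColumn
```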
Suppose I have a file database_partial.xml.
I am trying to strip the file from "_partial" as well as extension (xml) and then capitalize the name so that it becomes DATABASE.
Param($xmlfile)
$xml = Get-ChildItem "C:\Files" -Filter "$xmlfile"
$db = [IO.Path]::GetFileNameWithoutExtension($xml).ToUpper()
That returns DATABASE_PARTIAL, but I don't know how to strip the _PARTIAL part.
You don't need GetFileNameWithoutExtension() for removing the extension. The FileInfo objects returned by Get-ChildItem have a property BaseName that gives you the filename without extension. Uppercase that, then remove the "_PARTIAL" suffix. I would also recommend processing the output of Get-ChildItem in a loop, just in case it doesn't return exactly one result.
Get-ChildItem "C:\Files" -Filter "$xmlfile" | ForEach-Object {
$_.BaseName.ToUpper().Replace('_PARTIAL', '')
}
If the substring after the underscore can vary, use a regular expression replacement instead of a string replacement, e.g. like this:
Get-ChildItem "C:\Files" -Filter "$xmlfile" | ForEach-Object {
$_.BaseName.ToUpper() -replace '_[^_]*$'
}
Ansgar Wiechers's helpful answer provides an effective solution.
To focus on the more general question of how to strip (remove) part of a file name (string):
Use PowerShell's -replace operator, whose syntax is: <stringOrStrings> -replace <regex>, <replacement>:
<regex> is a regex (regular expression) that matches the part to replace,
<replacement> is the replacement operand (the string to replace what the regex matched).
In order to effectively remove what the regex matched, specify '' (the empty string) or simply omit the operand altogether - in either case, the matched part is effectively removed from the input string.
For more information about -replace, see this answer.
Applied to your case:
$db = 'DATABASE_PARTIAL' # sample input value
PS> $db -replace '_PARTIAL$', '' # removes suffix '_PARTIAL' from the end ($)
DATABASE
PS> $db -replace '_PARTIAL$' # ditto, with '' implied as the replacement string.
DATABASE
Note:
-replace is case-insensitive by default, as are all PowerShell operators. To explicitly perform case-sensitive matching, use the -creplace variant.
By contrast, the [string] type's .Replace() method (e.g., $db.Replace('_PARTIAL', '')):
matches by string literals only, and therefore offers less flexibility; in this case, you couldn't stipulate that _PARTIAL should only be matched at the end of the string, for instance.
is invariably case-sensitive in the .NET Framework (though .NET Core offers a case-insensitive overload).
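The differences are easy to verify with two made-up sample values:

```powershell
$upper = 'DB_PARTIAL_X_PARTIAL'
$lower = 'db_partial'

# -replace: regex-based, so the match can be anchored to the end with $.
$r1 = $upper -replace '_PARTIAL$'        # only the trailing '_PARTIAL' is removed

# .Replace(): literal, replaces ALL occurrences, cannot be anchored.
$r2 = $upper.Replace('_PARTIAL', '')

# Case sensitivity: -replace ignores case, -creplace and .Replace() do not.
$r3 = $lower -replace '_PARTIAL$'
$r4 = $lower -creplace '_PARTIAL$'
$r5 = $lower.Replace('_PARTIAL', '')
```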
Building on Ansgar's answer, your script can therefore be streamlined as follows:
Param($xmlfile)
$db = ((Get-ChildItem C:\Files -Filter $xmlfile).BaseName -replace '_PARTIAL$').ToUpper()
Note that in PSv3+ this works even if $xmlfile matches multiple files: thanks to member-access enumeration and the ability of -replace to accept an array of strings as input, the substring removal is performed on the base names of all files, as is the subsequent uppercasing. $db then receives an array of stripped base names.
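The multi-file case can be simulated without touching the file system. In this sketch, custom objects with a BaseName property stand in for the FileInfo objects that Get-ChildItem would return:

```powershell
# Stand-ins for the FileInfo objects returned by Get-ChildItem (assumption for illustration).
$files = @(
    [pscustomobject] @{ BaseName = 'database_partial' }
    [pscustomobject] @{ BaseName = 'orders_partial' }
)

# .BaseName is collected from every element (member-access enumeration),
# -replace processes each string, and .ToUpper() is again applied per element.
$db = ($files.BaseName -replace '_PARTIAL$').ToUpper()

$db
```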
I would expect that Select-String consider \r\n (carriage-return + newline) the end of a line in Powershell.
However, as can be seen below, abc matches the whole input:
PS C:\Tools\hashcat> "abc`r`ndef" | Select-String -Pattern "abc"
abc
def
If I break the string up into two parts, then Select-String behaves as I would expect:
PS C:\Tools\hashcat> "abc", "def" | Select-String -Pattern "abc"
abc
How can I give Select-String a string whose lines are terminated by \r\n, and then make this cmdlet return only those strings that contain a match?
Select-String operates on each (stringified on demand[1]) input object.
A multi-line string such as "abc`r`ndef" is a single input object.
By contrast, "abc", "def" is a string array with two elements, passed as two input objects.
To ensure that the lines of a multi-line string are passed individually, split the string into an array of lines using PowerShell's -split operator: "abc`r`ndef" -split "`r?`n"
(The ? makes the `r optional so as to also correctly deal with `n-only (LF-only, Unix-style) line endings.)
In short:
"abc`r`ndef" -split "`r?`n" | Select-String -Pattern "abc"
The equivalent, using a PowerShell string literal with regular-expression (regex) escape sequences (the RHS of -split is a regex):
"abc`r`ndef" -split '\r?\n' | Select-String -Pattern "abc"
It is somewhat unfortunate that the Select-String documentation talks about operating on lines of text, given that the real units of operations are input objects - which may themselves comprise multiple lines, as we've seen.
Presumably, this comes from the typical use case of providing input objects via the Get-Content cmdlet, which outputs a text file's lines one by one.
Note that Select-String doesn't return the matching strings directly, but wraps them in [Microsoft.PowerShell.Commands.MatchInfo] objects containing helpful metadata about the match.
Even there the line metaphor is present, however, as it is the .Line property that contains the matching string.
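To get back plain strings rather than MatchInfo objects, access the .Line property, as in this short sketch:

```powershell
# Split into lines first, then search; each line is a separate input object.
$found = "abc`r`ndef" -split '\r?\n' | Select-String -Pattern 'abc'

# $found holds MatchInfo objects; .Line is the matching input string itself.
$matchingLines = $found.Line

$matchingLines
```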
[1] Optional reading: How Select-String stringifies input objects
If an input object isn't a string already, it is converted to one, though possibly not in the way you might expect:
Loosely speaking, the .ToString() method is called on each non-string input object[2], which for non-strings is not the same as the representation you get with PowerShell's default output formatting (the latter is what you see when you print an object to the console or use Out-File, for instance); by contrast, it is the same representation you get with string interpolation in a double-quoted string (when you embed a variable reference or command in "...", e.g., "$HOME" or "$(Get-Date)").
Often, .ToString() just yields the name of the object's type, without containing any instance-specific information; e.g., $PSVersionTable stringifies to System.Management.Automation.PSVersionHashTable.
# Matches NOTHING, because Select-String sees
# 'System.Management.Automation.PSVersionHashTable' as its input.
$PSVersionTable | Select-String PSVersion
In case you do want to search the default output format line by line, use the following idiom:
... | Out-String -Stream | Select-String ...
However, note that for non-string input it is more robust and preferable for subsequent processing to filter the input by querying properties with a Where-Object condition.
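The three approaches can be compared side by side with a sketch like this one; the hashtable and its entries are made up for illustration:

```powershell
# A hypothetical hashtable; its .ToString() result is just the type name.
$info = @{ Tool = 'rclone'; Version = '1.0' }

# Direct search: Select-String only sees 'System.Collections.Hashtable'.
$direct = @($info | Select-String 'rclone')

# Searching the formatted output line by line finds the entry's display line.
$viaText = @($info | Out-String -Stream | Select-String 'rclone')

# More robust: filter on properties instead of on text.
$viaProps = $info.GetEnumerator() | Where-Object Value -eq 'rclone'
```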
That said, there is a strong case to be made for Select-String needing to implicitly apply Out-String -Stream stringification, as discussed in this GitHub feature request.
[2] More accurately, .psobject.ToString() is called, either as-is, or - if the object's ToString method supports an IFormatProvider-typed argument - as .psobject.ToString([cultureinfo]::InvariantCulture) so as to obtain a culture-invariant representation - see this answer for more information.
"abc`r`ndef"
is one string which, if you echo it (Write-Output) to the console, results in:
PS C:\Users\gpunktschmitz> echo "abc`r`ndef"
abc
def
Select-String outputs every string that contains "abc". As "abc" is part of this string, the whole string is selected.
"abc", "def"
is a list of two strings. Select-String first tests "abc" and then "def" against the pattern "abc". As only the first one matches, only it is selected.
Use the following to split the string into a list and select only the elements containing "abc"
"abc`r`ndef".Split("`r`n") | Select-String -Pattern "abc"
Basically Mr. Guenther Schmitz explained the correct usage of Select-String, but I want to just add some points to support his answer.
I did some reverse engineering on the Select-String cmdlet. It is located in Microsoft.PowerShell.Utility.dll. Some relevant code snippets follow; note that this is decompiled code for reference, not the actual source code.
string text = inputObject.BaseObject as string;
...
matchInfo = (inputObject.BaseObject as MatchInfo);
object operand = ((object)matchInfo) ?? ((object)inputObject);
flag2 = doMatch(operand, out matchInfo2, out text);
We can see that it simply treats the inputObject as one whole string; it doesn't do any splitting.
I couldn't find the actual source code of this cmdlet on GitHub; probably this utility part is not open source yet. But I did find the unit tests for Select-String.
$testinputone = "hello","Hello","goodbye"
$testinputtwo = "hello","Hello"
The test inputs used in the unit tests are actually lists of strings. This suggests that your use case wasn't even considered and that the cmdlet is very possibly just designed to accept a collection of strings as input.
However, if we look at Microsoft's official documentation for Select-String, we see that it talks about lines a lot, even though the cmdlet can't recognize a line within a string. My personal guess is that the concept of a line is only meaningful when the cmdlet accepts a file as input, in which case the file is like a list of strings, with each item in the list representing a single line.
Hope it can make things more clear.
I am attempting to isolate and return a small variable string from a larger string.
I am struggling because the larger string I am extracting from is in list format. I can split this into substrings successfully, but I do not know how to select one of these substrings without returning the entire string. The string is generated by a command line process.
$StringList
AppTitle1.1.1221.aaa111
AppSubTitle
AnotherAppTitle1.1.1221.aaa111
AnotherAppSubTitle
...and so on
I can split the list string into substrings by line using regular expressions to split at whitespace (there is no whitespace within any given line).
$StringList -split "\s"
Once I have split the string into the desired substrings, however, I am not sure how to select the desired substring. The length of the list (i.e. the number of apps present in it) and the location of the app I need to retrieve the title of within that list are entirely variable, so I cannot simply use substring reference numbers. I've tried several approaches to selecting the substring, but each has simply returned the entire string, or nothing at all.
Here are two approaches I've attempted. The first returns the entire string list and the second returns nothing.
$DesiredAppTitle = Select-String -InputObject $StringList -Pattern "AnotherAppTitle"
or
$DesiredAppTitle = foreach ($_.substring in $StringList)
{
if ($_.substring -contains "AnotherAppTitle")
{
return $_.name
}
}
What I'd like for it to return is:
AnotherAppTitle1.1.1221.aaa111
I'm sure there are a million ways to do this, so if neither of my approaches seems like a good fit, I'm open to other suggestions. Any assistance would be greatly appreciated. Thanks in advance!
# Multi-line input string.
$StringList = @'
AppTitle1.1.1221.aaa111
AppSubTitle
AnotherAppTitle1.1.1221.aaa111
AnotherAppSubTitle
'@
# Split it into whitespace-separated tokens.
$tokens = -split $StringList
# Match the token of interest.
$tokens -match '^AnotherAppTitle'
The above yields:
AnotherAppTitle1.1.1221.aaa111
Note the use of regex-matching operator with anchor ^ to ensure that the search term matches at the start of a token, and the use of the unary form of the -split operator, which splits the input by any nonempty whitespace runs.
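The difference between the unary form and the binary -split "\s" from the question shows up with CRLF line endings, where \r and \n are two separate whitespace characters. A sketch with a made-up two-line string:

```powershell
$text = "AppTitle`r`nAppSubTitle"

# Binary form with '\s': splits at EACH whitespace character,
# so the CR+LF pair produces an empty token in between.
$binary = $text -split '\s'

# Unary form: splits by runs of whitespace and ignores leading/trailing whitespace.
$unary = -split $text

$binary.Count   # 3 (includes an empty string)
$unary.Count    # 2
```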
As for what you tried:
If you pass a multi-line string to Select-String, it is considered a single "line" and, in case of a match, that whole "line" is output.
foreach ($_.substring in $StringList) won't even run, because $_.substring is not a valid iteration variable (you shouldn't use $_, which is an automatic variable, as an enumeration variable at all, and the .substring access breaks the syntax).
If you used $_ instead of $_.substring, the loop would technically work (even though, again, $_ shouldn't be used as an iteration variable), but the loop would only execute once, for the entire multi-line string.
Even if $_.substring did refer to a line (it doesn't), -contains is the wrong operator to use, because it tests if a LHS collection contains the RHS value in full.
Also, use break to exit a loop, not return.
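To make the -contains point concrete, here is a sketch with shortened, made-up tokens:

```powershell
$tokens = 'AppTitle1.1', 'AnotherAppTitle1.1'

# -contains tests whether an ELEMENT equals the RHS in full; a prefix is not enough.
$prefixFound = $tokens -contains 'AnotherAppTitle'
$exactFound  = $tokens -contains 'AnotherAppTitle1.1'

# -match with an array LHS filters by regex and returns the matching element(s).
$byRegex = $tokens -match '^AnotherAppTitle'
```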
Using the -match approach as demonstrated at the top is the better approach, but if you did want to solve this with a foreach loop:
$DesiredAppTitle = foreach ($token in -split $StringList) {
if ($token -match '^AnotherAppTitle') { $token; break }
}