Substitute text only between tokens (single-line) - sed

I'd like to remove spaces in strings that are between square brackets, with a single-line input.
More precisely, strings that match \[[a-zA-Z0-9 ,]+\] (caseless alphanum comma and space, between square brackets)
For example:
[ "This is a test": [a, b, c] ]
Should become:
[ "This is a test": [a,b,c] ]
I have made several attempts with branching but couldn't find a syntax that worked.
For example:
/\[[a-zA-Z ,]\+\]/ba; :a;s/ //g;
but this replaces spaces on the whole line, as sed is line-based (my input is single-line).
I also tried the ;e command which would work if the whole string was prefixed with echo " and suffixed with ", but then that would be a single/double-quote escape hell (the whole string may contain ' and ").
GNU sed is welcome, but I would like to keep the dependencies minimal, so no perl unless required and no ruby, python, php...
Indeed, I know the following works perfectly, but PHP is too large a dependency:
echo preg_replace_callback(
"/\[[a-zA-Z ,]+\]/",
function ($m) { return str_replace(" ", "", $m[0]); },
'{"a":{"a":{"a":"a b c"},"b":{"b":[a, b]}}}'
);
outputs:
{"a":{"a":{"a":"a b c"},"b":{"b":[a,b]}}}

First pass — it works, but it is clumsy
Here is a solution that works with GNU and BSD sed:
sed -E \
-e '/\[[[:alnum:] ,]+\]/ {
s/\[([[:alnum:] ,]+)\]/^B\1^E/
:a
s/(^B[[:alnum:],]*) +/\1/
t a
s/^B/[/
s/^E/]/
}' \
data
The ^B and ^E in the script stand for literal control characters (Control-B and Control-E in the original) that aren't going to appear in the actual text.
Explanation:
/\[[[:alnum:] ,]+\]/ { — match lines containing square brackets with alphanumerics plus space and comma between them, and do the action sequence from { to the matching }.
s/\[([[:alnum:] ,]+)\]/^B\1^E/ — replace the square brackets with the control characters.
:a — set a label
s/(^B[[:alnum:],]*) +/\1/ — replace a ^B plus a sequence of alphanumerics or commas (which is captured) and a string of one or more spaces with just the capture.
t a — if the s/// command made a change, jump back to label a.
s/^B/[/ — replace the ^B with open square bracket.
s/^E/]/ — replace the ^E with close square bracket.
} — done
The branch is necessary because normally, the s/// operator won't rescan material that it has just substituted, whereas it is crucial that this keeps rescanning.
Given the slightly more extensive input file:
\[[a-zA-Z0-9 ,]+\] (caseless alphanum comma and space, between square brackets)
For example:
[ "This is a test": [a, b c] ]
[ "This is a test": [a, b, c] ]
[ "This is test 3": [ XXX, YYY, XXX ] ]
Should become:
[ "This is a test": [a,bc] ]
[ "This is a test": [a,b,c] ]
[ "This is test 3": [XXX,YYY,XXX] ]
the script generates:
\[[a-zA-Z0-9 ,]+\] (caseless alphanum comma and space, between square brackets)
For example:
[ "This is a test": [a,bc] ]
[ "This is a test": [a,b,c] ]
[ "This is test 3": [XXX,YYY,XXX] ]
Should become:
[ "This is a test": [a,bc] ]
[ "This is a test": [a,b,c] ]
[ "This is test 3": [XXX,YYY,XXX] ]
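Typing literal control characters into a script is easy to get wrong, so here is a minimal sketch of the first-pass script with the two markers built by printf instead. The function name strip_first_bracket_spaces and the variables B and E are made up for this illustration; the sed commands are the ones explained above, and the sketch assumes a sed that accepts -E (GNU or BSD).

```shell
# Hypothetical names: B and E hold Control-B and Control-E, generated by
# printf so that no control character has to be typed into the script.
B=$(printf '\002')
E=$(printf '\005')

# Reads stdin, writes stdout. Handles only the first bracketed group per line,
# exactly like the first-pass script.
strip_first_bracket_spaces() {
  sed -E \
    -e "/\[[[:alnum:] ,]+\]/ {
      s/\[([[:alnum:] ,]+)\]/${B}\1${E}/
      :a
      s/(${B}[[:alnum:],]*) +/\1/
      t a
      s/${B}/[/
      s/${E}/]/
    }"
}

printf '%s\n' '[ "This is a test": [a, b c] ]' | strip_first_bracket_spaces
# prints: [ "This is a test": [a,bc] ]
```

The script is double-quoted so that the shell expands B and E; the backslashes in front of [, ] and 1 are left alone by the shell and reach sed unchanged.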
Second pass — it pays to review and refine
Looking at it, the ^E is not necessary, and maybe not the ^B either. The version above only deals with the first such set of square brackets on the line. You need more sensitive detector regexes (ones that insist on at least one space in between the markers) to handle multiple such patterns on a single line.
For example:
sed -E \
-e ':a
/\[[[:alnum:],]* [[:alnum:] ,]*\]/ s/(\[[[:alnum:],]*) +/\1/
t a' \
data
Explanation:
:a – set a label
/\[[[:alnum:],]* [[:alnum:] ,]*\]/ — if the line contains an open square bracket, zero or more alphanumeric-or-comma characters, one or more blanks, and zero or more alphanumeric-or-comma-or-blank followed by close square bracket, then …
s/(\[[[:alnum:],]*) +/\1/ — replace the open square and sequence of zero or more alphanumeric-or-comma characters and one or more blanks by just the non-blanks, and …
t a — jump to label a if there was a substitution done
Given:
[ "This is a test": [a, b c] ]
[ "This is test 2": [a, b, c] ]
[ "This is test 3": [ XXX , YYY , XXX ] ]
[ "This is test 4": [ XXX , YYY , XXX ] [ 1 , 2 , 3 ] ]
[ "This is test 5": [ XXX , YYY , XXX ] [ 1 , 2 , 3 ] [ abc ] [ ] ]
this produces:
["This is a test": [a,bc] ]
["This is test 2": [a,b,c] ]
["This is test 3": [XXX,YYY,XXX] ]
["This is test 4": [XXX,YYY,XXX] [1,2,3] ]
["This is test 5": [XXX,YYY,XXX] [1,2,3] [abc] [] ]
This is mostly equivalent to the answer by Beta; it could be further simplified by eliminating the match before the substitute command and modifying (slightly complicating) the substitute so it matches the work by Beta.

I think this will work:
sed -e ':a' -e 's#\(\[[a-zA-Z0-9,]*\) \([a-zA-Z0-9 ,]*\]\)#\1\2#
t a' filename
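As a rough sketch, the same loop can be written with extended regular expressions and POSIX character classes, which saves most of the backslashes. The function name squeeze_bracket_spaces is invented for this example, and it assumes a sed that supports -E. Because the substitution insists on a closing bracket after nothing but alphanumerics, commas and spaces, it leaves an outer bracket that contains quotes or colons alone and handles several groups on one line.

```shell
# Hypothetical helper: remove spaces inside every [...] group whose contents
# are only alphanumerics, commas and spaces. Reads stdin, writes stdout.
squeeze_bracket_spaces() {
  sed -E \
    -e ':a' \
    -e 's/(\[[[:alnum:],]*) ([[:alnum:] ,]*\])/\1\2/' \
    -e 't a'
}

printf '%s\n' '[ "t": [ X , Y ] [ 1 , 2 ] ]' | squeeze_bracket_spaces
# prints: [ "t": [X,Y] [1,2] ]
```

Each pass through the loop removes exactly one space, so a group with many spaces takes as many iterations as it has spaces.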

Related

Use sed to replace `,` within brackets

I'd like to replace commas within brackets with spaces (and also remove the brackets). I used sed, but the solution I came up with depends on the number of elements in the list.
sed 's/\[\(.*\), \(.*\)\]/\1 \2/g'
# [-0.0, 1.23] => -0.0 1.23 (works)
# [-0.0, 1.23, 4.56] => -0.0, 1.23 4.56 (doesn't work)
# foo=[12.3, 4.5, 3.0, 4.1], bar=123.0, xyz=6.7 => foo=12.3, 4.5, 3.0 4.1, bar=123.0, xyz=6.7 (doesn't work, expected: foo=12.3 4.5 3.0 4.1, bar=123.0, xyz=6.7)
Is there any way sed can be used to do what I want?
Consider this test file:
$ cat file
[-0.0, 1.23]
[-0.0, 1.23, 4.56]
foo=[12.3, 4.5, 3.0, 4.1], bar=123.0, xyz=6.7
[1,2,-3,4]
To replace any commas within square brackets with spaces and also remove the square brackets:
$ sed -E ':a; s/(\[[^],]*), */\1 /; ta; s/\[([^]]*)\]/\1/g' file
-0.0 1.23
-0.0 1.23 4.56
foo=12.3 4.5 3.0 4.1, bar=123.0, xyz=6.7
1 2 -3 4
How it works
:a
This defines a label a.
s/(\[[^],]*), */\1 /
This looks for the first comma within a square bracket and replaces it, together with any spaces that follow it, with a single space.
[^],] matches any character except ] or ,. Thus, (\[[^],]*) matches [ followed by any number of characters not ] or , and stores the result in group 1.
ta
If the above substitution resulted in a change, jump back to label a so we can try the substitution again.
s/\[([^]]*)\]/\1/g
After we have finished removing commas, this removes the square brackets.
Note that [^]] matches any character that is not ]. Thus \[([^]]*)\] matches a [ followed by any number of any character except ] followed by ]. In other words, it matches a single bracketed expression and the contents of the expression, excluding the square brackets, are stored in group 1.
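The one-liner above separates commands with semicolons, and some sed implementations may read everything after a label up to the end of the line as part of the label name. A more conservative sketch, assuming only that -E is available, passes each command as its own -e expression. The function name commas_to_spaces is made up for this illustration.

```shell
# Hypothetical wrapper around the same four commands, one per -e expression.
# Reads stdin, writes stdout.
commas_to_spaces() {
  sed -E \
    -e ':a' \
    -e 's/(\[[^],]*), */\1 /' \
    -e 'ta' \
    -e 's/\[([^]]*)\]/\1/g'
}

printf '%s\n' 'foo=[12.3, 4.5, 3.0, 4.1], bar=123.0' | commas_to_spaces
# prints: foo=12.3 4.5 3.0 4.1, bar=123.0
```

The comma after the closing bracket survives because the loop only ever matches a comma that is reached from an opening bracket without crossing a closing one.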

How to find the first element of a block of strings whose first character matches an input character?

Given weapons: ["rock" "scissors" "paper"]
If I did player-choice: ask "(r)ock, (p)aper, (s)cissors or (q)uit? "
how could I look up the character entered by the user in the block that the word weapons refers to?
If you only want one match, and to use only the actual item names in your block, your own solution is fine. But one of the important things about Red is how you can structure your data to make things easier. For example, if you want to select items from a list based only on a known key (e.g. first character), you can make that explicit, rather than implicit.
weapons: ["r" "rock" "s" "scissors" "p" "paper"]
player-choice: ask "(r)ock, (p)aper, (s)cissors or (q)uit? "
print select weapons player-choice
weapons: ["rock" "scissors" "paper"]
matching-weapon: func [abbrev][
foreach weapon weapons [
if (first weapon) = first abbrev [
return weapon
]
]
]
>> abr: "p"
== "p"
>> parse weapons [some [into [x: abr (print x)] | skip] ]
paper
or
>> parse weapons [collect some [into [x: abr keep (x)] | skip] ]
== ["paper"]
If you want the block starting from what is found, remove index?
switch player-choice [
"r" [index? find weapons "rock"]
"s" [index? find weapons "scissors"]
"p" [index? find weapons "paper"]
"q" ["quit"]
]

Netlogo - read and import string data from txt file

I am trying to read a .txt file containing strings:
Delivery LHR 2018
Delivery LHR 2016
Delivery LHR 2014
Delivery LHR 2011
Delivery LHR 2019
Delivery LHR 1998
I tried the code below, but it does not work. It reported "expect a literal value" when running file-read.
globals [input]
to setup
set input []
file-open "test.txt"
while [not file-at-end?]
[
let a quote file-read
let b quote file-read
set input lput a input
set input lput b input
print input
]
file-close
end
to-report quote [ #thing ]
ifelse is-number? #thing
[ report #thing ]
[ report (word "\"" #thing "\"") ]
end
You can kind of get what you want with the csv extension, which comes with NetLogo. It at least lets you specify a delimiter, here " ", but you'll have to manually skip past all the blank columns it will see.
extensions [csv]
globals [input]
to setup
set input []
let lines (csv:from-file "test.txt" " ")
foreach lines [ line ->
let col1 (item 0 line)
let i 1
while [item i line = ""] [ set i (i + 1) ]
let col2 (item i line)
show col2
set i (i + 1)
while [item i line = ""] [ set i (i + 1) ]
let col3 (item i line)
show col3
set input lput col1 input
]
show input
end
The reason it doesn't work can be found in the file-read description in the NetLogo Dictionary Manual (https://ccl.northwestern.edu/netlogo/docs/dictionary.html#file-read)
[...]Note that strings need to have quotes around them.[...]
Adding the quotes within NetLogo is not a solution, because file-read already throws an error if the next entry in the file is not a number, list, string, boolean, or the special value nobody. And a string, in this case, must have quotes around it.
Thus, to read the file into NetLogo you have to put quotes around the strings in your input file. Alternatively, if the strings in your input file always have the same length, you could try to read the file using the primitive file-read-characters. Here is an example that should work with your input file:
to setup
file-open "test.txt"
while [not file-at-end?]
[
let a file-read-characters 8
let skip file-read-characters 4
let b file-read-characters 3
let c file-read
print (list a b c)
]
file-close
end

Datastage, Remove only last two characters of string

This function: Trim(In.Col, Right(In.Col, 2), 'T') works unless the last >2 characters are the same.
What I want:
abczzzz -> abczz
What I get:
abczzzz -> abc
How do I solve this?
The "T" option removes all trailing occurrences. Since you are limiting your input to only two characters with the Right() function, the second occurrence will never be a trailing char.
It sounds, though, like you are just taking a substring. If so, then you might just want to use the substring operator [ ] instead.
expression [ [ start, ] length ]
In.Col[1, Len(In.Col) - 2]
I prefer the Left() function, although it's equivalent here, as it's self-documenting.
Left(InLink.MyString, Len(InLink.MyString) - 2)

why can the character's order in regex expression affect sed?

The tv.txt file is as follows:
mms://live21.gztv.com/gztv_gz 广州台[可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3]
mms://live21.gztv.com/gztv_news 广州新闻台·直播广州(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
mms://live21.gztv.com/gztv_kids 广州少儿台(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
mms://live21.gztv.com/gztv_econ 广州经济台
I want to group it into three groups.
sed -r 's/([^ ]*)\s([^][()]*)((\(.+\))*|(\[.+\])*)/\3/' tv.txt
got the result:
[可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3]
(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
When I write it into
sed -r 's/([^ ]*)\s([^[]()]*)((\(.+\))*|(\[.+\])*)/\3/' tv.txt
it doesn't work.
The only difference is [^][()] versus [^[]()]; escaping the brackets, as in [^\[\]()], does not make it work either.
I want to know the reason.
The POSIX rules for getting ] into a character class are a little arcane, but they make sense when you think about it hard.
For a positive (non-negated) character class, the ] must be the first character:
[]and]
This recognizes any character a, n, d or ] as part of the character class.
For a negated character class, the ] must be the first character after the ^:
[^]and]
This recognizes any character except a, n, d or ] as part of the character class.
Otherwise, the first ] after the [ marks the end of the character class. Inside a character class, most of the normal regex special characters lose their special meaning, and others (notably - minus) acquire special meanings. (If you want a - in a character class, it has to be 'first' or last, where 'first' means 'after the optional ^ and only if ] is not present'.)
In your examples:
[^][()] — this is a negated character class that recognizes any character except [, ], ( or ), but
[^[]()] — this is a negated character class that recognizes any character except [, followed by whatever () symbolizes in the regex family you're using, and ] which represents itself.
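The rules above can be checked with a few throwaway sed commands. This is only a sketch: the function names are invented, and the third one deliberately uses basic regular expressions (no -E or -r), where ( and ) are ordinary characters, so that its behaviour is well defined.

```shell
# ']' right after '^' is a literal member of the negated class:
# delete every character that is not one of [ ] ( )
keep_brackets() { sed 's/[^][()]//g'; }

# ']' first in a positive class: replace each a, n, d or ] with X
mark_and() { sed 's/[]and]/X/g'; }

# Misordered form, as a basic regex: the class is just [^[] (any character
# except '['), followed by the literal text "()]"
misordered() { sed 's/[^[]()]/X/'; }

printf '%s\n' 'a[b](c)d' | keep_brackets   # prints: []()
printf '%s\n' 'band]'    | mark_and        # prints: bXXXX
printf '%s\n' 'x()]y'    | misordered      # prints: Xy
```

In the last case the class ends at the first ], so only x is matched by the class and the remaining ()] has to appear literally in the input.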