Extracting values from a single file - sed

I have a file with multiple lines, but one specific line contains a huge amount of information, with several repeated expressions. I'm trying to extract some specific values from it. I first tried some commands with sed, but with no success, so I was wondering if you could give me some insights.
Here is a fragment of that single line:
[...]6[&length_range={0.19
[... a lot of more information here in between ...]
0.01},habitat.set.prob={0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01},DLOOP.rate_median=0.04131395026396427,length=
[...]
10[&length_range={0.19
[... a lot of more information here in between ...]
0.01},habitat.set.prob={0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61},DLOOP.rate_median=0.04131395026396427,length=
[...]
My aim is first to extract all the values between the braces after "habitat.set.prob={" and put them on a single line in a text file.
It is also important to extract the number that appears just before the expression "[&length_range=", which in this case is "6" or "10". It is the label of the set of numbers after "prob={".
So the set of numbers I want always appears between "habitat.set.prob={" and "},DLOOP.rate_median", while the other number (the label) is always right before "[&length_range=". What comes before the label is not a fixed expression; it is actually a random number.
The goal is to end up with a file that looks like this:
6 0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01
10 0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
and so on …
What do you think? Is this possible?
I started with this very basic command, at least to try to extract the set of numbers, but it didn't work:
sed -n "/habitat.set.prob={/,/},DLOOP.rate_median=/ p"
Update: I made some progress. I was able to get the values at least:
awk '{gsub("habitat.set.prob={","\n");printf"%s",$0}' filename | awk -F'},' '{print $1"}"}' | grep -iv "TREE" > stats.txt
Many thanks in advance.
Cheers,
Luiz

Something like this:
sed -rn '/.*[0-9]+\[&length_range=\{/,/habitat.set.prob=\{/{s/.*\b([0-9]+)\[&length_range.*/\1/p; s/.*habitat.set.prob=\{([^D]+)\},DLOOP.rate.*/\1/p}' habitat
6
0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01
10
0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
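The first part, '/.a./,/.b./', selects the range of lines from a line matching pattern a to the next line matching pattern b, so it works when the text is spread over multiple lines. The -n option tells sed not to print lines by default.
In '/.a./,/.b./{s/.c./.d./p; s/.e./.f./p}'
the curly braces group two substitution commands; the p flag prints the line only when the substitution succeeds.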
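The sed command above prints the label and the values on separate lines. As a rough sketch of how the two could be paired into the "label values" format asked for, here is a Python version. The sample string is made up to mimic the structure in the question, and the sketch assumes that every label is preceded by a non-digit character and that every "[&length_range=" block also contains a "habitat.set.prob={...}" entry; neither assumption is confirmed by the question.

```python
import re

# Hypothetical sample mimicking the structure described in the question;
# the real line has far more text between the markers.
sample = (
    "(6[&length_range={0.19,0.01},habitat.set.prob={0.01,0.03,0.56},"
    "DLOOP.rate_median=0.04,length=1.0]:0.5,"
    "10[&length_range={0.19,0.01},habitat.set.prob={0.21,0.33,0.56},"
    "DLOOP.rate_median=0.04,length=2.0]:0.7)"
)

# Label digits, then "[&length_range=", then the shortest stretch of text
# up to "habitat.set.prob={", then everything up to the closing brace.
PATTERN = re.compile(
    r"(\d+)\[&length_range=.*?habitat\.set\.prob=\{([^}]*)\},DLOOP\.rate_median",
    re.DOTALL,
)

def extract_pairs(text):
    """Return one 'label values' string per match, in order of appearance."""
    return [f"{label} {values}" for label, values in PATTERN.findall(text)]

for line in extract_pairs(sample):
    print(line)
```

To run it on a real file, read the whole file into a string (for example with open(path).read()) and pass it to extract_pairs; re.DOTALL lets the match span line breaks.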

I am not sure how much you have dug into this yourself, so I am not providing the complete answer, but hopefully this helps:
For the first part, getting the number (which you call the label): you didn't mention whether there is any specific pattern around it, so try this (data is the file that contains the actual input). You will need to tweak the RE a bit, because as written the greedy leading .* leaves only a single digit for the capture group:
sed -n 's/.*\([0-9][0-9]*\).*length_range.*/\1/p' data
For the other part, which gives the values between habitat and DLOOP:
sed -n 's/.*habitat.set.prob=\(.*\),DLOOP.*/\1/pg' data | tr '{' ' ' | tr '}' ' '
Now take this as a starting point and work on the output to get your desired result!
To explain a bit:
In the first command I capture the digits that sit between anything (.*) and (.*)length_range [you can match the characters [ and & literally by putting \ in front of them].
In the second command I capture the text between habitat.set.prob= and ,DLOOP and then use tr to replace the braces with spaces.

#include <iostream>
#include <string>
using namespace std;

int main()
{
    string p = "1:2:3:4"; // input string: single digits separated by ':'
    int arr[4] = {};      // zero-initialized array that receives the digits
    for (size_t i = 0, j = 0; i < p.length() && j < 4; i++) { // walk the string, never past the array
        if (p[i] == ':') { continue; } // skip the separators
        arr[j++] = p[i] - '0';         // convert the digit character to its integer value
    }
    cout << "String={" << arr[0] << " " << arr[1] << " " << arr[2] << " " << arr[3] << "}"; // print the array
    return 0;
}

Related

How to replace groups of characters between flags in MATLAB

Suppose I have a char variable in Matlab like this:
x = 'hello ### my $ name is Sean Daley.';
I want to replace the first '###' with the char '&', and the first '$' with the char '&&'.
Note that the character groups I wish to swap have different lengths [e.g., length('###') is 3 while length('&') is 1].
Furthermore, if I have a more complicated char such that pairs of '###' and '$' repeat many times, I want to implement the same swapping routine. So the following:
y = 'hello ### my $ name is ### Sean $ Daley ###.$.';
would be transformed into:
'hello & my && name is & Sean && Daley &.&&.'
I have tried coding this (for any arbitrary char) manually via for loops and while loops, but the code is absolutely hideous and does not generalize to arbitrary character group lengths.
Are there any simple functions that I can use to make this work?
y = replace(y,["###" "$"],["&" "&&"])
The function strrep is what you are looking for.

convert row to column based on text

I have a rather large file (single column) with data similar to this:
BT1111
2.2.2.2/3
3.3.3.3/4
7.2.1.1/5
BT6766
2.2.1.1/5
4.5.1.1/7
BT9898
4.4.4.4/2
8.8.8.8/9
I want to split it into two columns by moving every entry that starts with a digit one column to the right ($1 to $2) and prefixing it with the corresponding BT field, so the desired output is:
BT1111;2.2.2.2/3
BT1111;3.3.3.3/4
BT1111;7.2.1.1/5
BT6766;2.2.1.1/5
BT6766;4.5.1.1/7
BT9898;4.4.4.4/2
BT9898;8.8.8.8/9
I can't figure out how to do the "look for the next occurrence" part, but I hope there is a function for it that I have managed to overlook.
perl -nle'if (/^\D/) { $n=$_ } else { print "$n;$_" }' input.txt
See Specifying file to process to Perl one-liner for alternate usages.
$ awk '/BT/{a=$1; next}{print a ";" $1}' input.txt
BT1111;2.2.2.2/3
BT1111;3.3.3.3/4
BT1111;7.2.1.1/5
BT6766;2.2.1.1/5
BT6766;4.5.1.1/7
BT9898;4.4.4.4/2
BT9898;8.8.8.8/9
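For comparison, the same "remember the last header, prefix every data line with it" idea can be sketched in Python. The function name attach_headers is made up for this illustration, and the sketch assumes, as in the sample data, that data lines start with a digit and that the first non-empty line is a header.

```python
def attach_headers(lines, sep=";"):
    """Prefix each line that starts with a digit with the most recent header line."""
    header = None
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line[0].isdigit():
            out.append(f"{header}{sep}{line}")
        else:
            header = line  # remember the current BT field
    return out

sample = ["BT1111", "2.2.2.2/3", "3.3.3.3/4", "BT6766", "2.2.1.1/5"]
print("\n".join(attach_headers(sample)))
```

For a real file, pass the open file object instead of the list, since iterating over a file yields its lines.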

sed: replace letter between square brackets

I have the following string:
signal[i]
signal[bg]
output [10:0]
input [i:1]
What I want is to replace the letters between square brackets (with underscores, for example) and to keep the other strings, which represent table declarations:
signal[_]
signal[__]
output [10:0]
input [i:1]
thanks
try:
awk '{gsub(/\[[a-zA-Z]+\]/,"[_]")} 1' Input_file
This globally replaces each bracketed run of letters (longest match) with [_]. The trailing 1 prints every line, edited or not.
EDIT: The command above replaces all the letters with a single _. To get as many underscores as there are letters, the following may help:
awk '{i=0;q="";match($0,/\[[a-zA-Z]+\]/);VAL=substr($0,RSTART+1,RLENGTH-2);if(VAL){len=length(VAL);while(i<len){q=q?q"_":"_";i++}};gsub(/\[[a-zA-Z]+\]/,"["q"]")}1' Input_file
OR
awk '{
  i=0; q="";
  match($0,/\[[a-zA-Z]+\]/);
  VAL=substr($0,RSTART+1,RLENGTH-2);
  if(VAL){
    len=length(VAL);
    while(i<len){
      q=q?q"_":"_";
      i++
    }
  };
  gsub(/\[[a-zA-Z]+\]/,"["q"]")
}
1
' Input_file
Will add explanation soon.
EDIT2: The same script with explanatory comments, for the OP and other users:
awk '{
  i=0; q="";                          #### Reset the counter i and the underscore string q for each line.
  match($0,/\[[a-zA-Z]+\]/);          #### Use the built-in match function to find [letters], as the OP requires.
  VAL=substr($0,RSTART+1,RLENGTH-2);  #### VAL holds the letters between the brackets: the substring that starts at RSTART+1 and has length RLENGTH-2.
                                      #### RSTART and RLENGTH are built-in variables set by match (RSTART is 0 and RLENGTH is -1 when nothing matches).
  if(VAL){                            #### If VAL is not empty, perform the following actions.
    len=length(VAL);                  #### len holds the length of VAL.
    while(i<len){                     #### Loop once for each letter in VAL.
      q=q?q"_":"_";                   #### Append one "_" to q.
      i++                             #### Increment i by 1 each time.
    }
  };
  gsub(/\[[a-zA-Z]+\]/,"["q"]")       #### Globally replace [letters] with [ followed by q (the underscores) and ].
}
1                                     #### The 1 prints every line, edited or not.
' Input_file                          #### Input_file is the file to process.
Alternative gawk solution:
awk -F'\\[|\\]' '$2!~/^[0-9]+:[0-9]$/{ gsub(/./,"_",$2); $2="["$2"]" }1' OFS= file
The output:
signal[_]
signal[__]
output [10:0]
-F'\\[|\\]' - treating [ and ] as field separators
$2!~/^[0-9]+:[0-9]$/ - performing action if the 2nd field does not represent table declaration
gsub(/./,"_",$2) - replace each character with _
This might work for you (GNU sed);
sed ':a;s/\(\[_*\)[[:alpha:]]\([[:alpha:]]*\]\)/\1_\2/;ta' file
Match on opening and closing square brackets with any number of _'s and at least one alpha character and replace said character by an underscore and repeat.
awk '{sub(/\[i\]/,"[_]"); sub(/\[bg\]/,"[__]")}1' file
signal[_]
signal[__]
output [10:0]
input [i:1]
Explanation: since the square bracket is a special character in a regex, it has to be escaped to be matched literally; after that it is easy to use sub.
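The answers above differ mainly in how they produce one underscore per letter. As a small sketch of the same idea outside awk and sed, Python's re.sub accepts a function as the replacement, so the underscore count can be derived from each match. The name mask_letters is made up for this example, and "letters" is assumed to mean ASCII letters only.

```python
import re

# A bracket pair containing only letters; "[10:0]" and "[i:1]" do not match.
BRACKETED_LETTERS = re.compile(r"\[([A-Za-z]+)\]")

def mask_letters(line):
    """Replace each letter inside [letters] with an underscore."""
    return BRACKETED_LETTERS.sub(
        lambda m: "[" + "_" * len(m.group(1)) + "]", line
    )

for line in ["signal[i]", "signal[bg]", "output [10:0]", "input [i:1]"]:
    print(mask_letters(line))
```

Because the replacement is computed per match, a line with several bracketed names gets the right number of underscores for each one.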

How to search/replace with random array value in perl from command line

I'm trying to search/replace where the replacement has a random value from an array -- all from the command line in Perl. I can't figure out what's wrong here. I've tried a lot of variations of this, and read many examples online (which are all for non-command-line).
echo "Test z|Test z|Test z|" | tr '|' '\n' | \
perl -pe '@numbers=[12.3, 45.6, 78.9]; $number = $numbers[rand @numbers]; s/z/" : ".( int ($number) )/ge'
The actual output is something like this (the numbers change):
Test : 24591392
Test : 24591752
Test : 24591416
The expected output is:
Test : 45.6
Test : 12.3
Test : 78.9
Where the actual numbers are randomly selected. Any tips are welcome, including pointing out silly typos. Thanks!
You want to say @numbers=(12.3, 45.6, 78.9), not @numbers=[12.3, 45.6, 78.9]. The latter creates an array with a reference to another array as its only element. The output you are currently seeing is the numification of a reference value, not the contents of the array.
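For comparison only, here is a sketch of the intended behaviour (each z replaced by a randomly chosen element of a list) in Python. It is not a translation of the Perl fix; the function name replace_with_random and its rng parameter are made up for this example.

```python
import random
import re

numbers = [12.3, 45.6, 78.9]

def replace_with_random(line, rng=random):
    """Replace every 'z' with ': ' followed by a randomly chosen number."""
    # The replacement function runs once per match, so each 'z' gets its own draw.
    return re.sub(r"z", lambda m: f": {rng.choice(numbers)}", line)

for line in ["Test z", "Test z", "Test z"]:
    print(replace_with_random(line))
```

Passing a seeded random.Random instance as rng makes the output reproducible, which helps when testing.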

Insert double quotes multiple times into string

I have inherited a flat html file with a few hundred lines similar to this:
<blink>
<td class="pagetxt bordercolor="#666666 width="203 colspan="3 height="20>
</blink>
So far I have not been able to work out a sed way of inserting the closing double quotes for each element. Probably needs something other than sed to do this. Can anyone suggest an easy way to do this?
Thanks
sed -i 's/"\([^" >]\+\)\( \|>\)/"\1"\2/g' file.html
Explanation:
" - leading double quote
\([^" >]\+\) - non-quote-or-space-or-'>' chars, grouped (into group 1)
\( \|>\) - terminating space or '>', grouped (into group 2)
We replace it with '"<group1>"<group2>'.
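The same substitution can be sketched with Python's re module, which may be easier to experiment with than sed -i. The pattern is a direct transcription of the one above; the function name close_quotes is made up for this example, and like the sed version it assumes that attribute values contain no spaces.

```python
import re

# An opening quote, a run of characters that are not a quote, space or '>',
# then the space or '>' that should have been preceded by a closing quote.
UNCLOSED = re.compile(r'"([^" >]+)( |>)')

def close_quotes(line):
    """Insert the missing closing double quote after each attribute value."""
    return UNCLOSED.sub(r'"\1"\2', line)

broken = '<td class="pagetxt bordercolor="#666666 width="203 colspan="3 height="20>'
print(close_quotes(broken))
```

Values that are already closed are left alone, because the character after the value is then a quote rather than a space or '>'.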
One solution that pops out at me is to parse through each line of the file looking for the quote. When it finds one, activate a flag to keep track of being inside a quoted area, then continue parsing the line until it hits the first space or > it comes to and inserts an additional " just before it. Flip the flag off, then continue through the string looking for the next quote. Probably not a perfect solution, but a start perhaps.
If all lines share the same structure, you could use a simple text editor to globally replace
' bordercolor'
with
'" bordercolor'
(without the single quotes). This is independent of the field values and works similarly for the other fields. You still have to do some manual work, but if it's just one big file, I'd bite the bullet this time rather than spend even more time working out a sed solution.
This should do if your file is simple - it won't work if you have whitespace which should be inside the quotes - in that case, a more complex code will be needed, but can be done along the same lines.
#!/usr/bin/env python3
# Change "utf-8" below to your file's encoding.
with open("<myfile.html>", encoding="utf-8") as inputfile:
    data = inputfile.read()

new_data = []
inside_tag = False
inside_quotes = False
for char in data:
    if char == "<":
        inside_tag = True
    if char == '"':
        inside_quotes = True
    # A space or ">" inside a tag ends an unterminated quoted value: close it.
    if inside_tag and (char.isspace() or char == ">") and inside_quotes:
        new_data.append('"')
        inside_quotes = False
    if char == ">":
        inside_tag = False
    new_data.append(char)

with open("<mynewfile.html>", "w", encoding="utf-8") as outputfile:
    outputfile.write("".join(new_data))
with bash
for file in *
do
    flag=0
    while read -r line
    do
        case "$line" in
            *"<blink>"*)
                flag=1
                ;;
        esac
        if [ "$flag" -eq 1 ]; then
            case "$line" in
                *class=\"pagetxt*">" )
                    line="${line%>}\">"
                    flag=0
                    ;;
            esac
        fi
        echo "${line}"
    done <"$file" > temp
    mv temp "$file"
done
Regular expressions are your friend:
Find: (="[^" >]+)([ >])
Replace: \1"\2
After you've done that, make sure to run this one too:
Find: </?blink>
Replace: \n
(This won't fix more than one class on an element, like <element class="class1 class2 id="jimmy">)