sed/awk/cut/grep - Best way to extract string - sed

I have a results.txt file that is structured in this format:
Uncharted 3: Javithaxx l Rampant l Graveyard l Team Deathmatch HD (D1VpWBaxR8c)
Matt Darey feat. Kate Louise Smith - See The Sun (Toby Hedges Remix) (EQHdC_gGnA0)
The Matrix State (SXP06Oax70o)
Above & Beyond - Group Therapy Radio 014 (guest Lange) (2013-02-08) (8aOdRACuXiU)
I want to create a new file containing the YouTube video ID given in the last parentheses of each line, for example "8aOdRACuXiU".
I'm trying to build a URL like this in a new file:
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Note that I appended &hd=1 to the URL I am trying to build. I have tried using rev and cut, but rev mangles my data. The hard part is that a line may contain several sets of parentheses, and I only care about the text between the last pair. Each line has a variable length, so fixed offsets aren't helpful either. What about using grep and .$ for the end of the line?
In summary, I want to extract the youtube ID from results.txt and export it to a new file in the following format: http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1

Using awk:
awk '{
v = substr( $NF, 2, length( $NF ) - 2 )
printf "%s%s%s\n", "http://www.youtube.com/watch?v=", v, "&hd=1"
}' infile
It yields:
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1

$ sed 's!.*(\(.*\))!http://www.youtube.com/watch?v=\1\&hd=1!' results.txt
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Here, .*(\(.*\)) looks for the last occurrence of a pair of parentheses, and captures the characters inside those parentheses. The captured group is then inserted into the URL using \1.
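The same "greedy prefix, then capture the last parentheses" idea can be sketched in Python. This is only an illustration of the pattern explained above, not a replacement for the sed command; the helper name to_watch_url is hypothetical, and the URL layout is copied from the question.

```python
import re
from typing import Optional

# Greedy ".*" consumes as much as possible, so the "(" that follows it
# is the last opening parenthesis on the line.
LAST_PARENS = re.compile(r".*\((.*)\)")


def to_watch_url(line: str) -> Optional[str]:
    """Return the watch URL for the ID in the last (...) of line, or None."""
    match = LAST_PARENS.match(line)
    if match is None:
        return None  # no parenthesised group on this line
    return f"http://www.youtube.com/watch?v={match.group(1)}&hd=1"


sample = "Above & Beyond - Group Therapy Radio 014 (guest Lange) (2013-02-08) (8aOdRACuXiU)"
print(to_watch_url(sample))  # http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
```

Lines without any parentheses yield None here, whereas the sed command would print them unchanged.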

Using a perl one-liner :
perl -lne 'printf "http://www.youtube.com/watch?v=%s&hd=1\n", $& if /[^\(]+(?=\)$)/' file.txt
Or multi-line version :
perl -lne '
printf(
"http://www.youtube.com/watch?v=%s&hd=1\n",
$&
) if /[^\(]+(?=\)$)/
' file.txt


Decode binary octet string in a file with perl

I have a file in which some lines contain a number that is encoded as text -> binary -> octets, and I need to decode it to get the number back.
All the lines that contain this encoded string begin with STRVID:
For example I have in one of the lines:
STRVID: SarI3gXp
If I do this echo "SarI3gXp" | perl -lpe '$_=unpack"B*"' I get the number in binary
0101001101100001011100100100100100110011011001110101100001110000
Now, to decode from binary to octets, I do this (assign the output of the previous command to a variable, then convert the binary to octets):
variable=$(echo "SarI3gXp" | perl -lpe '$_=unpack"B*"') ; printf '%x\n' "$((2#$variable))"
The result is the number but not in the correct order
5361724933675870
To get the number in the correct order, I have to swap the two digits in each pair, so that the second digit comes first. Something like this:
variable=$(echo "SarI3gXp" | perl -lpe '$_=unpack"B*"') ; printf '%x\n' "$((2#$variable))" | gawk 'BEGIN {FS = ""} {print $2 $1 $4 $3 $6 $5 $8 $7 $10 $9 $12 $11 $14 $13 $16 $15}'
And finally I have the number I'm looking for:
3516279433768507
I don't have any clue on how to do this automatically for every line that begins with STRVID: in my file. At the end what I need is the whole file but when a line begins with STRVID: then the decoded value.
When I find this:
STRVID: SarI3gXp
I will have in my file
STRVID: 3516279433768507
Can someone help with this?
First of all, all you need for the conversion is
unpack "h*", "SarI3gXp"
A perl one-liner using -p will execute the provided program for each line, and s///e allows us to modify a string with code as the replacement expression.
perl -pe's/^STRVID:\s*\K\S+/ unpack "h*", $& /e'
See Specifying file to process to Perl one-liner.
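For readers who want to see what the "h*" template does, here is a small Python sketch of the same conversion: each byte is written as two hex digits with the low nibble first. The function names are hypothetical, and treating each character as one byte (latin-1) is an assumption about the input.

```python
def unpack_low_nibble_first(text: str) -> str:
    """Mimic Perl's unpack "h*": two hex digits per byte, low nibble first."""
    # Assumption: every character fits in one byte (latin-1).
    return "".join(f"{b & 0x0F:x}{b >> 4:x}" for b in text.encode("latin-1"))


def decode_line(line: str) -> str:
    """Decode the value on STRVID: lines; return every other line unchanged."""
    prefix = "STRVID:"
    if not line.startswith(prefix):
        return line
    value = line[len(prefix):].strip()
    return f"{prefix} {unpack_low_nibble_first(value)}"


print(decode_line("STRVID: SarI3gXp"))  # STRVID: 3516279433768507
```

For example, "S" is byte 0x53, which comes out as "35", matching the first two digits of the expected result.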
Please inspect the following sample demo code snippet for compliance with your problem.
You do not need double conversion when it can be done in one go.
Note: please read the pack documentation; unpack uses the same TEMPLATE.
use strict;
use warnings;
use feature 'say';
while( <DATA> ) {
chomp;
/^STRVID: (.+)/
? say 'STRVID: ' . unpack("h*",$1)
: say;
}
__DATA__
It would be nice if you provide proper input data sample
STRVID: SarI3gXp
Perhaps the result of this script complies with your requirements.
To work with real input data file replace
while( <DATA> ) {
with
while( <> ) {
and pass filename as an argument to the script.
Output
It would be nice if you provide proper input data sample
STRVID: 3516279433768507
Perhaps the result of this script complies with your requirements.
To work with real input data file replace
while( <DATA> ) {
with
while( <> ) {
and pass filename as an argument to the script.
./script.pl input_file.dat
You can also cross-flip the digits entirely via regex (and without back-references either):
variable=$(echo "SarI3gXp" | perl -lpe '$_=unpack"B*"') ;
printf '%x\n' "$((2#$variable))" |
mawk -F'^$' 'gsub("..", "_&=&_") + gsub(\
"(^|[0-9]_)(_[0-9]|$)", _)+gsub("=",_)^_'
1 3516279433768507
The idea is to make a duplicate copy on the other side, like this :
_53=53__61=61__72=72__49=49__33=33__67=67__58=58__70=70_
then scrub out the leftovers, since the digits you want now sit on either side of each equals sign ("=").

Replacing all occurrence after nth occurrence in a line in perl

I need to replace all occurrences of a string after nth occurrence in every line of a Unix file.
My file data:
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
My output data:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
I tried using sed: sed 's/://3g' test.txt
Unfortunately, the g option combined with the occurrence number is not working as expected. Instead, it is replacing all the occurrences.
Another approach using awk
awk -v c=':' -v n=2 'BEGIN{
FS=OFS=""
}
{
j=0;
for(i=0; ++i<=NF;)
if($i==c && j++>=n)$i=""
}1' file
$ cat file
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
$ awk -v c=':' -v n=2 'BEGIN{FS=OFS=""}{j=0;for(i=0; ++i<=NF;)if($i==c && j++>=n)$i=""}1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
With GNU awk, using gensub please try following. This is completely based on your shown samples, where OP wants to remove : from 3rd occurrence onwards. Using gensub to segregate parts of matched values and removing all colons from 2nd part(from 3rd colon onwards) in it as per OP's requirement.
awk -v regex="^([^:]*:)([^:]*:)(.*)" '
{
firstPart=restPart=""
firstPart=gensub(regex, "\\1\\2", "1", $0)
restPart=gensub(regex,"\\3","1",$0)
gsub(/:/,"",restPart)
print firstPart restPart
}
' Input_file
I have inferred based on the limited data you've given us, so it's possible this won't work. But I wouldn't use regex for this job. What you have there is colon delimited fields.
So I'd approach it using split to extract the data, and then some form of string formatting to reassemble exactly what you like:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my ( undef, $first, @rest ) = split /:/;
print ":$first:", join ( "", @rest ),"\n";
}
__DATA__
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
This gives you the desired result, whilst IMO being considerably clearer for the next reader than a complicated regex.
You can use the perl solution like
perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
See the online demo and the regex demo.
The ^(?:[^:]*:){2}(*SKIP)(?!)|: regex means:
^(?:[^:]*:){2}(*SKIP)(?!) - match
^ - start of string (here, a line)
(?:[^:]*:){2} - two occurrences of any zero or more chars other than a : and then a : char
(*SKIP)(?!) - skip the match and go on to search for the next match from the failure position
| - or
: - match a : char.
And only run the replacement if the current line starts with :account_id: (see if /^:account_id:/').
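Python's standard re module has no (*SKIP) verb, so the following sketch gets the same result by splitting instead: keep the first n separators and strip the rest. The function name and its parameters are hypothetical. Unlike the one-liner above, it processes every line rather than only those starting with :account_id:.

```python
def strip_after_nth(line: str, sep: str = ":", keep: int = 2) -> str:
    """Keep the first `keep` separators and remove every later one."""
    # split(sep, keep) splits at most `keep` times, so the last element
    # holds everything after the keep-th separator.
    parts = line.split(sep, keep)
    if len(parts) <= keep:
        return line  # fewer than `keep` separators: nothing to remove
    head, tail = parts[:keep], parts[keep]
    return sep.join(head) + sep + tail.replace(sep, "")


print(strip_after_nth(":account_id:12345:6789:Melbourne:Aus"))
# :account_id:123456789MelbourneAus
```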
Or an awk solution like
awk 'BEGIN{OFS=FS=":"} /^:account_id:/ {result="";for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result}' test.txt
See this online demo. Details:
BEGIN{OFS=FS=":"} - sets the input/output field separator to :
/^:account_id:/ - line must start with :account_id:
result="" - sets result variable to an empty string
for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result} - iterates over the fields and if the field number is greater than 2, just append the current field value to result, else, append the value + output field separator; then print the result.
If n is fixed and equal to 2, I would use GNU AWK the following way. Let file.txt content be
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
then
awk 'BEGIN{FS=":";OFS=""}{$2=FS $2 FS;print}' file.txt
output
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
Explanation: use : as the field separator and an empty string as the output field separator. This by itself removes every :, so I add back the two that have to be preserved: the 1st (before the second column) and the 2nd (after the second column). Beware that I tested it solely on this data, so you should first test it with more possible inputs before using it.
(tested in gawk 4.2.1)
This might work for you (GNU sed):
sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//' file
Replace the third occurrence of : by a newline.
Make a copy of the line.
Delete all occurrences of :'s.
Append the amended line to the copy.
Join the two lines by removing everything from third occurrence of the copy to the third occurrence of the amended line.
N.B. A newline is the best delimiter to use with sed, because the line presented to sed's commands is initially devoid of newlines. The important property of the delimiter is that it is unique, so any character can be used as long as it does not occur anywhere in the data set.
An alternative solution uses a loop to remove all :'s after the first two:
sed -E ':a;s/^(([^:]*:){2}[^:]*):/\1/;ta' file
With GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(:[^:]+:)(.*)/,a){ $0=a[1] gensub(/:/,"","g",a[2]) } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
and with any awk in any shell on every Unix box:
$ awk 'match($0,/:[^:]+:/){ tgt=substr($0,1+RLENGTH); gsub(/:/,"",tgt); $0=substr($0,1,RLENGTH) tgt } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus

Extracting fasta ids after string match

I have a list of fasta sequences as following:
>Product_1_001:299:H377WBGXB:1:11101
TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA
>Product_2_001:299:H377WBGXB:2:11101
CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT
The original fasta sequence is much longer than the subset posted here. I wanted to extract the 10 characters after the pattern "TCAT" into a separate file and did this
grep -oP "(?<=TCAT).{10}"
I do get the needed result as:
CTCACCTACT
TGATAAGGGG
I would like their corresponding fasta ids as one column and the extracted pattern as second column like:
>Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
>Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
Try this one-liner
perl -lne ' /^[^>].+?(?<=TCAT)(.{10})/ and print $p,"\t",$1; $p=$_ ' file
with your given inputs
$ cat fasta.txt
>Product_1_001:299:H377WBGXB:1:11101
TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA
>Product_2_001:299:H377WBGXB:2:11101
CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT
$ perl -lne ' /^[^>].+?(?<=TCAT)(.{10})/ and print $p,"\t",$1; $p=$_ ' fasta.txt
>Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
>Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
$
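The "remember the previous header, then search the sequence line" logic can also be sketched in Python. The function name is hypothetical. The sketch assumes each sequence sits on a single line after its header, as in the sample, and reports only the first TCAT match per sequence line.

```python
import re

# Ten characters immediately following the motif, as in the grep -oP pattern.
AFTER_TCAT = re.compile(r"(?<=TCAT).{10}")


def ids_with_motif_tail(lines):
    """Yield (header, ten_chars) for each sequence line containing TCAT."""
    header = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            header = line  # remember the most recent FASTA ID
            continue
        match = AFTER_TCAT.search(line)
        if header is not None and match:
            yield header, match.group(0)


fasta = [
    ">Product_1_001:299:H377WBGXB:1:11101",
    "TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA",
    ">Product_2_001:299:H377WBGXB:2:11101",
    "CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT",
]
for header, tail in ids_with_motif_tail(fasta):
    print(f"{header}\t{tail}")
```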
Another way is to use awk like this (the 35 in substr is the hard-coded width of the ID, so adjust it to match your headers):
cat <your_file>| awk -F"_" '/Product/{printf "%s", $0; next} 1'|awk -F"TCAT" '{ print substr($1,1,35) "\t" substr($2,1,10)}'
The output:
Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
Hope it helps.

Replace first 3 occurrences of a character in each line

I have a tab-delimited file of genetic variants with an INFO column of many semicolon-delimited tags:
Chr Start End Ref Alt ExAC_ALL ExAC_AFR ExAC_AMR ExAC_EAS ExAC_FIN ExAC_NFE ExAC_OTH ExAC_SAS Otherinfo QUAL DP Chr Start Ref Alt QUAL FILTER INFO
1 15847952 15847952 G C . . . . . . . . . 241.9 76196 1 15847952 . G C 241.9 PASS AC=2;AF=0;AN=18332;BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406 culprit=MQ
1 15847963 15847963 A C . . . . . . . . . 1607.1 126156 1 15847963 . A C 1607.1 PASS AC=2;AF=0;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=2;MLEAF=0;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995 culprit=QD
1 15847964 15847966 GCC - . . . . . . . . . 1607.1 126156 1 15847963 . AGCC A 1607.1 PASS AC=63;AF=0.003;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=55;MLEAF=0.002;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995 culprit=QD
1 15847978 15847978 C T . . . . . . . . . 648.41 234344 1 15847978 . C T 648.41 PASS AC=9;AF=0;AN=25894;BaseQRankSum=-0.572;ClippingRankSum=-0.404;DP=234344;ExcessHet=3.348;FS=2.639;InbreedingCoeff=-0.0098;MLEAC=6;MLEAF=0;MQ=58.71;MQRankSum=-0.456;NEGATIVE_TRAIN_SITE;QD=4.13;ReadPosRankSum=-0.456;SOR=0.452;VQSLOD=-1.238 culprit=QD
I want to split the first 3 semicolon-delimited terms in the INFO column:
AC=2;AF=0;AN=18332
So that they become:
AC=2 AF=0 AN=18332 BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406 culprit=M
So far I've tried the following expression with sed:
sed -i .bk 's/\(A.=.*\);/\1 /g' allChr_ExAC38.hg38_multianno.txt
But this yields no changes.
Ideally I am looking for a way to tell sed to replace the first 3 occurrences of a semicolon ; with a tab, but 's/;/ /g3' doesn't seem to mean that.
Use Perl instead of sed:
perl -i.bk -pe '$c = 0; s/;/\t/ while $c++ < 3' -- file.txt
You can try this awk
awk '{for(i=1;i<4;i++)sub(";","\t")}1' infile
The .* in your regex is greedy, and will match as much text as possible on the line, up to just before the last semicolon (but not beyond, because then the entire regex won't match at all).
You cannot usefully combine /3 and /g here. /g means replace all occurrences on every line, while /3 means replace only the third occurrence on a line, not the first three. (GNU sed accepts 3g, but there it means "replace the third match and every one after it", which is not what you want either.)
"No changes" seems wrong, though; if your regex matched at all, the last semicolon on matching lines will have been replaced.
Some regex engines support non-greedy matching, but sed isn't one of them. As long as there is a single delimiter character you can use to limit the greediness, using that is a much better solution anyway. In your case, simply replace . with [^;] to say "any character except (newline or) semicolon" instead of "any character (except newline)."
Because a numeric flag selects a single occurrence, repeat the substitution three times to replace the first three semicolons:
sed -e 's/\(A.=[^;]*\);/\1 /' -e 's/\(A.=[^;]*\);/\1 /' -e 's/\(A.=[^;]*\);/\1 /' allChr_ExAC38.hg38_multianno.txt
(This will print to standard output for verification; put back the -i .bk once you see the result is correct.)
Based on your example data, perhaps consider replacing the remaining . in the expression with [A-Z] and [^;] with [^;=] or even [0-9]. The more specific you can make your regex, the better.
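For comparison, "replace only the first N occurrences" is a built-in operation in Python, where str.replace takes an optional count. This is a minimal sketch with a hypothetical function name. It assumes, as in the sample rows, that the first semicolons on a line are the ones in the INFO column.

```python
def split_first_tags(line: str, count: int = 3) -> str:
    """Replace the first `count` semicolons on the line with tabs."""
    # The third argument of str.replace limits how many matches are replaced.
    return line.replace(";", "\t", count)


row = "PASS\tAC=2;AF=0;AN=18332;BaseQRankSum=0.731;DP=76196"
print(split_first_tags(row))
```

Lines with fewer than three semicolons, such as the header row, are handled without special casing because the count is only an upper limit.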
Could you please try following and let me know if this helps you.
awk '
FNR==1{
print;
next}
{
num=split($(NF-1),array,";");
for(i=4;i<=num;i++){
val=val?val ";"array[i]:array[i]};
$(NF-1)=array[1] OFS array[2] OFS array[3] OFS val;
val="";
$1=$1
}
1
' OFS="\t" Input_file
This might work for you (GNU sed):
sed -i.bak 's/;/\n/3;h;y/;/\t/;G;s/\n.*\n/\t/' file
Replace the third ; with a newline, make a copy of the line, replace all ;'s with \t's, append the copy, and replace everything from the end of the first line to the middle of the second line with a \t.
Since by definition a line is demarcated by a newline, lines cannot contain a newline unless they are introduced by a programmer.
If the number of occurrences is small, you can pipe sed multiple times, e.g.
sed -E -e 's/[0-9]{4}/****/'| sed -E -e 's/[0-9]{4}/****/'| sed -E -e 's/[0-9]{4}/****/'
This will mask the first three 4-digit groups of a credit card number, like so:
Input:
1234 5678 9101 1234
Output:
**** **** **** 1234

Keeping first character in string, in a specific single field

I am trying to remove all but the first character of a specific field in a .tab file. I want to keep only first character in fields 10 and 11.
Normally the fields have 35 characters in them, so I used:
awk '{gsub("..................................$","",$10); print}' file
However, some fields have fewer than 35 characters, and these were ignored by this replace function. I tried using substring, but I cannot figure out how to make it field specific. I believe there is a way to use perl inside awk so that I can use the function
perl -pe 's/(.).*/$1/g'
but I am not sure how to do that and use the field as the input value, so the file comes out identical except for the altered field.
Is there a way to do the perl equivalent with gsub, or the awk equivalent with perl?
Help is appreciated!
One way using awk:
awk '{ for (i=10;i<=11;i++) { $i = substr( $i, 1, 1) } } { print }' infile
Another way using gensub function of gawk
gawk '{ for (i=10;i<=11;i++) { $i = gensub(/(.).*/ , "\\1", "g" , $i) } }1' infile
The shortest awk version I could come up with:
awk '($10=substr($10,1,1))&&$11=substr($11,1,1)' infile
If the 10th and/or 11th field does not exist, the line is not printed.
Similar version in perl
perl -ane '$F[9]=~s/(.).*/$1/;$F[10]=~s/(.).*/$1/;print "@F\n"' infile
This prints the line even if 10th and/or 11th field is not defined.
Another way with perl:
perl -pe '$c=0; s/(\S+)/(++$c < 10 || $c > 11) ? $1 : substr($1,0,1)/eg' filename
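The field-specific truncation can also be sketched in Python. The function name and defaults are hypothetical. Splitting on a tab is an assumption based on the .tab extension mentioned in the question, and it preserves the original separators, whereas the awk and perl -a versions above rejoin fields with single spaces.

```python
def keep_first_char(line: str, fields=(10, 11), sep: str = "\t") -> str:
    """Truncate the given 1-based fields to their first character."""
    cols = line.rstrip("\n").split(sep)
    for number in fields:
        if number <= len(cols):  # skip fields the line does not have
            cols[number - 1] = cols[number - 1][:1]
    return sep.join(cols)


record = "\t".join(f"field{i}" for i in range(1, 13))
print(keep_first_char(record))
```

Slicing with [:1] handles fields of any length, including empty ones, which is where the fixed-width gsub pattern in the question fell short.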