Thanks for reading.
I have a plain text file with some simple user information. The thing is, sometimes one of those items is missing. Notice how Norman and Reggie have an email address, but Missy doesn't:
Name: Norman Normalrecord
Email: norman@ooga.com
Addr: 123 Main street

Name: Missy Missington
Addr: 789 Back street

Name: Reggie Regularrecord
Email: reggie@booga.com
Addr: 456 Middle street
I would like to use grep/sed to say "if no email address is found, substitute the text MISSING_EMAIL_ADDR", so I get this result:
Norman Normalrecord
norman@ooga.com
123 main street

Missy Missington
MISSING_EMAIL_ADDR
789 back street

Reggie Regularrecord
reggie@booga.com
456 middle street
The problem is that in all my experiments, when nothing is found, grep/sed produce absolutely nothing, so I can't even do a second pass to global-replace.
What I dream of is something like this (obviously pseudo-grep) that lets me say what to print when a search doesn't find anything:
grep /Name:/MISSING_NAME/email:/MISSING_EMAIL_ADDR/Addr:/MISSING_STREET_ADDR/
Is there any way to do something like this? Thanks again.
Here's a start. It replaces missing e-mail lines with "Email: N/A".
awk -v RS='\n\n' -v FS='\n' -v OFS='\n' \
'{ if (!$3) $3 = "Email: N/A"; print; print "" }' users.txt
Output:
Name: Norman Normalrecord
Email: norman@ooga.com
Addr: 123 Main street

Name: Missy Missington
Addr: 789 Back street
Email: N/A

Name: Reggie Regularrecord
Email: reggie@booga.com
Addr: 456 Middle street
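Note that a multi-character RS like '\n\n' is an extension (GNU awk treats it as a regex; a strictly POSIX awk may only use the first character of RS). If your awk doesn't support it, here is a sketch of the same idea using paragraph mode (RS=""), which any POSIX awk understands:

awk 'BEGIN { RS=""; ORS="\n\n"; FS=OFS="\n" }
     { if (!$3) $3 = "Email: N/A"; print }' users.txt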
This might work for you (GNU sed):
sed '/^Name: /!b;:a;$!N;/\nAddr: /!ba;/\nEmail: /!s/\n/&Email: MISSING_EMAIL_ADDR&/' file
If you want to remove the labels:
sed -r '/^Name: /!b;:a;$!N;/\nAddr: /!ba;/\nEmail: /!s/\n/&Email: MISSING_EMAIL_ADDR&/;s/(Name|Email|Addr): //g' file
Using GNU awk for gensub():
$ cat tst.awk
BEGIN { RS=""; ORS="\n\n"; FS=OFS="\n" }
NF<3 { $3=$2; $2="Email: MISSING_EMAIL_ADDR" }
{ print gensub(/(^|\n)[^:]+:[[:space:]]*/,"\\1","g") }
$ gawk -f tst.awk file
Norman Normalrecord
norman@ooga.com
123 Main street

Missy Missington
MISSING_EMAIL_ADDR
789 Back street

Reggie Regularrecord
reggie@booga.com
456 Middle street
You can do the same in any awk using sub(/^..) then gsub(/\n...) instead of gensub(/(^|\n)...).
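For example, a rough sketch of what that portable variant might look like (same structure as tst.awk above; sub() strips the label from the first line and gsub() strips the labels after each embedded newline):

BEGIN { RS=""; ORS="\n\n"; FS=OFS="\n" }
NF<3 { $3=$2; $2="Email: MISSING_EMAIL_ADDR" }
{
    sub(/^[^:]+:[[:space:]]*/, "")       # remove the leading "Label: "
    gsub(/\n[^:]+:[[:space:]]*/, "\n")   # remove "Label: " after each newline
    print
}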
In case it's useful: to identify ANY missing field and print a "missing" indicator for it, in the order the fields appear in your input and without having to explicitly name any of the fields up front (assuming every field appears in at least one record), you could use:
$ cat tst.awk
BEGIN { RS=""; FS=OFS="\n" }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        split($fldNr,nameVal,/:[[:space:]]*/)
        name = nameVal[1]
        val = nameVal[2]
        rec[NR,name] = val
        if (!seen[name]++) {
            for (nameNr=++numNames; nameNr>fldNr; nameNr--) {
                names[nameNr] = names[nameNr-1]
            }
            names[nameNr] = name
        }
    }
}
END {
    for (recNr=1; recNr<=NR; recNr++) {
        for (nameNr=1; nameNr<=numNames; nameNr++) {
            name = names[nameNr]
            key = recNr SUBSEP name
            if (key in rec) {
                print rec[key]
            }
            else {
                print "MISSING_" toupper(name)
            }
        }
        print ""
    }
}
$
$ cat file
Name: Norman Normalrecord
Email: norman@ooga.com
Addr: 123 Main street

Name: Missy Missington
Addr: 789 Back street

Name: Reggie Regularrecord
Email: reggie@booga.com
Addr: 456 Middle street
Whatever: Some useful info
$
$ awk -f tst.awk file
Norman Normalrecord
norman@ooga.com
123 Main street
MISSING_WHATEVER

Missy Missington
MISSING_EMAIL
789 Back street
MISSING_WHATEVER

Reggie Regularrecord
reggie@booga.com
456 Middle street
Some useful info
Here is a sed script that seems to do what you "dream" about (it assumes that the entries are separated with blank lines):
$ cat s.sed
# collect the lines from one entry in the pattern space
# removing the empty line for consistency
:a; $!{N;/\n$/!ba}; s/\n$//
# make substitutions
/Name:/!s/^/MISSING_NAME\n/
/Email:/!s/\n/\nMISSING_EMAIL_ADDR\n/
/Addr:/!s/$/\nMISSING_STREET_ADDR/
# add an empty line back
s/$/\n/p
With your data:
$ sed -nf s.sed info.txt
Name: Norman Normalrecord
Email: norman@ooga.com
Addr: 123 Main street

Name: Missy Missington
MISSING_EMAIL_ADDR
Addr: 789 Back street

Name: Reggie Regularrecord
Email: reggie@booga.com
Addr: 456 Middle street
Another demo:
$ cat info_ext.txt
Email: norman@ooga.com
Addr: 123 Main street

Name: Missy Missington
Addr: 789 Back street

Name: Reggie Regularrecord
Email: reggie@booga.com
$ sed -nf s.sed info_ext.txt
MISSING_NAME
Email: norman@ooga.com
Addr: 123 Main street

Name: Missy Missington
MISSING_EMAIL_ADDR
Addr: 789 Back street

Name: Reggie Regularrecord
Email: reggie@booga.com
MISSING_STREET_ADDR
Related
I altered some code from the SoloLearn app but got confused:
import re

pattern = r'(.+)(.+) \2'

match = re.match(pattern, 'ABC bca cab ABC')
if match:
    print('Match 1', match.group())

match = re.match(pattern, 'abc BCA cab BCA')
if match:
    print('Match 2', match.group())

match = re.match(pattern, 'abc bca CAB CAB')
if match:
    print('Match 3', match.group())
And I am getting this output:
Match 1 ABC bca ca
Match 3 abc bca CAB CAB
Any help?!
I have a text file with the below format:
Text: htpps:/xxx
Expiry: ddmm/yyyy
object_id: 00
object: ABC
auth: 333
RequestID: 1234

Text: htpps:/yyy
Expiry: ddmm/yyyy
object_id: 01
object: NNN
auth: 222
RequestID: 3456
and so on
...
I want to delete all lines except those with the prefixes "Expiry:", "object:" and "object_id:",
and then load the result into a table in PostgreSQL.
Would really appreciate your help on the above two.
thanks
Nick
I'm sure there will be other methods, but I found an iterative approach that works if every object has the same format:
Text: htpps:/xxx
Expiry: ddmm/yyyy
object_id: 00
object: ABC
auth: 333
RequestID: 1234
Then you can transform the above with
more test.txt | awk '{ printf "%s\n", $2 }' | tr '\n' ',' | sed 's/,,/\n/g' | sed '$ s/.$//'
and, for your example, it will generate the entries in CSV format:
htpps:/xxx,ddmm/yyyy,00,ABC,333,1234
htpps:/yyy,ddmm/yyyy,01,NNN,222,3456
The above code does:
awk '{ printf "%s\n", $2 }': prints only the second whitespace-separated field of each line
tr '\n' ',': translates the newlines into commas, producing one long comma-separated line
sed 's/,,/\n/g': splits that line back into one line per record, at the double commas left by the blank separator lines
sed '$ s/.$//': removes the trailing ,
Of course this is probably an oversimplified example, but you could use it as a basis. Once the data is in CSV form you can load it with psql, as sketched below.
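For the loading step, here is a rough sketch, assuming you have saved the CSV output to test.csv and are free to choose the table layout; the database name mydb, the table name requests and its column names are all made up, so adjust them to your schema:

psql -d mydb <<'SQL'
-- all columns as text, since values like ddmm/yyyy aren't parseable as dates
CREATE TABLE requests (text_url text, expiry text, object_id text, object text, auth text, request_id text);
\copy requests FROM 'test.csv' WITH (FORMAT csv)
SQL

\copy runs the COPY on the client side, so test.csv only needs to be readable by you, not by the database server.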
My input is split across multiple lines. I want it output on a single line.
For example, the input is:
1|23|ABC
DEF
GHI
newline
newline
2|24|PQR
STU
LMN
XYZ
newline
Output:
1|23|ABC DEF GHI
2|24|PQR STU LMN XYZ
Well, here is one for awk:
$ awk -v RS="" -F"\n" '{$1=$1}1' file
Output:
1|23|ABC DEF GHI
2|24|PQR STU LMN XYZ
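In case the one-liner looks cryptic, here is the same idea spelled out; it is just an expanded form of the answer above, not a different method:

awk 'BEGIN { RS=""; FS="\n" }  # blank-line-separated records, one line per field
     { $1 = $1; print }        # assigning to a field rebuilds $0 with OFS (a single space)
    ' file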
I have a text file with the below content:
.....
Phone: 123-456-7899, 555-555-5555, 999-333-7890
Names: Bob Jones, Mary Smith, Bob McAlly,
Sally Fields, Tom Hanks, Jeffery Cook,
Betty White, Tom McDonald, Bruce Harris
Address: 1234 Main, 445 Westlake, 3332 Front Street
.....
I am looking to grab all of the names starting from Bob Jones and ending with Bruce Harris from the file. I have this Scala code, but it only gets the first line:
Bob Jones, Mary Smith, Bob McAlly,
Here is the code:
val addressBookRDD = sc.textFile(file);
val myRDD = addressBookRDD.filter(line => line.contains("Names: "))
I don’t know how to deal with the returns or newlines in the text file, so the code only grabs the first line of the names, but not the rest of the names which are separate lines. I am looking for this type of result:
Bob Jones, Mary Smith, Bob McAlley, Sally Fields, Tom Hanks, Jeffery
Cook, Betty White, Tom McDonald, Bruce Harris
As I pointed out in a comment, to read a file structured this way is not really something Spark is very suitable for. If the file is not very large, using only Scala would probably be a better way to do it. Here is a Scala implementation:
val lines = scala.io.Source.fromFile(file).getLines
val nameLines = lines
  .dropWhile(line => !line.startsWith("Names: "))
  .takeWhile(line => !line.startsWith("Address: "))
  .toSeq

val names = (nameLines.head.drop(7) +: nameLines.tail)
  .mkString(",")
  .split(",")
  .map(_.trim)
  .filter(_.nonEmpty)
Printing names using names foreach println will give you:
Bob Jones
Mary Smith
Bob McAlly
Sally Fields
Tom Hanks
Jeffery Cook
Betty White
Tom McDonald
Bruce Harris
I am trying to remove duplicate numbers within parentheses using sed.
So I have the following string:
Abdc 1234 1234 (5678) (5678) (9012) (9012) (3456)
I want to use sed to remove any 4-digit numbers within the parentheses, including the parentheses. So my string should look like this:
Abdc 1234 1234 (5678) (9012) (3456)
In this case, "(5678)" and "(9012)" were removed because they were repeated 4-digit numbers within parentheses. The "1234" numbers were not removed because they were not within parentheses. "(3456)" was not removed because it was not repeated.
I do not know how to do this with sed but you could try the following with awk:
$ echo "Abdc 1234 1234 (5678) (5678) (9012) (9012) (3456)" | awk '
{
    for (i=1; i<=NF; i++) {
        if (substr($i,1,1) != "(" || seen[$i] != 1) {
            seen[$i] = 1
            printf "%s ", $i
        }
    }
    print ""
}'
Output:
Abdc 1234 1234 (5678) (9012) (3456)
This loops through the fields of the line and prints each field only if it does not start with ( or has not been seen before.
This works for your input:
echo 'Abdc 1234 1234 (5678) (5678) (9012) (9012) (3456)' |
sed 's/\(([0-9][0-9]*)\) \1/\1/g'
It assumes duplicates follow each other; if that is not the case, use this version:
echo 'Abdc 1234 1234 (5678) (5678) (9012) (9012) (3456)' |
sed 's/\(([0-9][0-9]*)\) \(.*\)\1/\1\2/g'
Or a bit shorter with GNU sed extended expressions:
echo 'Abdc 1234 1234 (5678) (5678) (9012) (9012) (3456)' |
sed -r 's/(\([0-9]+\)) (.*)\1/\1\2/g'
Output in all cases:
Abdc 1234 1234 (5678) (9012) (3456)
Edit - handle situation where more than two identical items exist
This can be done by looping over the pattern until it no longer matches:
echo 'Abdc 1234 1234 (5678) (5678) (9012) (9012) (3456) (5678) (5678)' |
sed -r ':a; s/(\([0-9]+\))(.*)\1 ?/\1\2/g; ta'
Using Perl:
$ echo "Abdc 1234 1234 (5678) (5678) (9012) (9012) (3456)" |
perl -ne '
    my (@arr, %hash);
    for (split) {
        if (/^\(.*\)/) {
            $hash{$_}++;
            push @arr, $_ if $hash{$_} == 1;
        }
        else {
            push @arr, $_;
        }
    }
    print join " ", @arr, "\n";
'
That will work with multi-line input and N occurrences of repeated parenthesized items.
This might work for you (GNU sed):
sed ':a;s/\(\(([0-9]\+) *\).*\)\2/\1/g;ta' file
awk -F"(" '{for(i in a)delete a[i];for(i=2;i<=NF;i++){if($i in a){$i="";}else{a[$i];$i="("$i}}print $0}' your_file
Tested below:
input:
> cat temp
Abdc 1234 1234 (5678) (5678) (9012) (9012) (3456)
1234 1234 (1234) (5678) (9012) (1234) (3456)
(5678) (6467) (6467) (9012) (5678)
Now the execution:
> awk -F"(" '{for(i in a)delete a[i];for(i=2;i<=NF;i++){if($i in a){$i="";}else{a[$i];$i="("$i}}print $0}' temp
Abdc 1234 1234 (5678) (9012) (3456)
1234 1234 (1234) (5678) (9012) (3456)
(5678) (6467) (9012) (5678)
>