How to replace white space between 2 strings using sed

How can I replace a single white space with an underscore, but only between the 2 strings shown in the example below, using sed?
xxx ccc vvv bbb        333  444  555
   ^   ^   ^   ^^^^^^^^   ^^   ^^   <--- spaces visualized for easier counting
Desired output:
xxx_ccc_vvv_bbb        333  444  555

That's easy, you just do a global (g) replace (s) of single whitespace characters (\s) surrounded by word boundaries (\b) with underscores (_):
sed 's/\b\s\b/_/g'
Your example could be run like this:
echo "xxx ccc vvv bbb 333 444 555" | sed 's/\b\s\b/_/g'
which produces the output you want:
xxx_ccc_vvv_bbb        333  444  555
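Note that the runs of multiple spaces survive because \b only matches at a transition between a word character and a non-word character; inside a run of spaces there is no such transition, so no space in that run is bounded on both sides. A quick check (this assumes GNU sed, since \b and \s are GNU extensions):
echo "aa bb  cc" | sed 's/\b\s\b/_/g'
prints:
aa_bb  cc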

Related

How to output the first row with a matching key in Scala Spark [closed]

I have a data.txt text file as below.
0000007 aaa 20060201 117
0000007 aaa 20060202 136
0000007 aaa 20060203 221
0000017 bbbb 20060201 31
0000017 bbbb 20060202 127
0000017 bbbb 20060203 514
0000021 ccccc 20060201 900
0000021 ccccc 20060202 324
0000021 ccccc 20060203 129
Exp1: Now, I want to output the first row for each matching key of column(1) and column(2).
What should I do?
The desired output is below.
0000007 aaa 20060201 117
0000017 bbbb 20060201 31
0000021 ccccc 20060201 900
Exp2: Same as above, but I want to output the first rows for each matching key of column(1) and column(3). What should I do?
The desired output is below.
0000007 aaa 20060201 117
0000007 aaa 20060203 136
0000017 bbbb 20060201 31
0000017 bbbb 20060203 127
0000021 ccccc 20060201 900
0000021 ccccc 20060202 324
0000021 ccccc 20060201 129
This is my code:
val lines = sc.textFile("/home/ubuntu/spark-2.4.3-bin-hadoop2.6/data.txt")
val keyed = lines.map(line => line.split(" ")(0) -> line)
val deduplicated = keyed.reduceByKey((a, b) => a)
deduplicated.values.foreach(println)
I'm not sure which version of Spark you are using - from the code you have posted it looks like you are using the old RDD-API, in which case you are almost there. You just need to key on both columns - either (col1, col2) or (col1, col3) - and then call collect before you print:
val lines = sc.textFile("/home/ubuntu/spark-2.4.3-bin-hadoop2.6/data.txt")
val keyed = lines.map(line => {
  val cols = line.split(" ")
  // scenario 1: key on (column 1, column 2)
  ((cols(0), cols(1)), (cols(2), cols(3)))
  // scenario 2: key on (column 1, column 3)
  //((cols(0), cols(2)), (cols(1), cols(3)))
})
val deduplicated = keyed.reduceByKey((a, b) => a)
deduplicated.values.collect foreach println // add collect
Without the collect, your data will be printed to stdout on the different workers, so you won't see the output on the driver. Note that collect should be used with care (usually only while debugging), as it pulls all the data from the workers to the driver. If you have a large data set, your driver will die with an OOM exception.
As a side note, I would generally recommend that you move from the old RDD-API to either the Dataframe-API or the Dataset-API, but there may of course be reasons why you have not made the shift...
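As an aside, given the command-line flavour of the rest of this page: if the data comfortably fits on one machine, the same "first row per key" result can be sketched with plain awk (this assumes single-space-delimited columns and that the original line order should be kept):
awk '!seen[$1,$2]++' data.txt   # Exp1: first row per (column 1, column 2)
awk '!seen[$1,$3]++' data.txt   # Exp2: first row per (column 1, column 3)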

Deleting lines containing duplicated strings

I always appreciate your help.
I would like to delete lines containing duplicated strings in the second column.
test.txt
658 invert_d2e_q_reg_0_/Qalu_ecl_zlow_e 0.825692
659 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[31] 0.825692
660 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[63] 0.825692
661 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
665 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[62] 0.825692
666 invert_d2e_q_reg_0_/Qalu_ecl_zlow_e 0.825692
668 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
670 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
673 invert_d2e_q_reg_0_/Qalu_ecl_zlow_e 0.825692
675 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
677 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
678 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[27] 0.825692
679 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[27] 0.8120
.
.
.
output.txt
658 invert_d2e_q_reg_0_/Qalu_ecl_zlow_e 0.825692
659 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[31] 0.825692
660 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[63] 0.825692
661 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
665 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[62] 0.825692
678 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[27] 0.825692
.
.
.
I know sed can delete lines containing predefined specific strings, but in my case I cannot predict which strings will be duplicated. Also, there may be more than 1000 duplicated strings.
I used "uniq" to do this job, but it does not work.
uniq -u -f 4 test.txt
(-u prints unique lines. -f skips the first 4 letters.)
Is there any way to do this with sed/awk/perl? Or please correct my uniq semantics.
Best,
Jaeyoung
This might work for you (GNU sed):
sed -r 'G;/^\S+\s+(\S+)\s+.*\n.*\1/!{P;s/\S+\s+(\S+)\s+.*/\1/;H};d' file
Test the second column against all unique values of that column stored in the hold space (HS); if it is not present, print the line and store its value in the HS.
Or use sort:
sort -suk2,2 file | sort -nk1,1
The first (stable, unique) sort keeps only the first line for each value of column 2; the second sort then restores the original order using the numeric first column.
Awk would do this with one tool, but here is a fairly straightforward way to do it with Bash associative arrays. Loop over the lines and pull out column three (after turning the slash into a space); if there is no associative array entry for it yet, echo the line and set a value so it won't be printed again.
unset col3 && declare -A col3 && IFS=$(echo -en "\n\b") && for a in $(< test.txt); do
  lncol3=$(echo "${a}" | tr '/' ' ' | awk '{print $3}')
  [[ -z "${col3["${lncol3}"]}" ]] && echo "${a}" && col3["${lncol3}"]=1
done
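The same idea fits in a single awk invocation; this sketch mirrors the loop above by splitting on spaces and slashes and keying on the third token:
awk -F'[ /]+' '!seen[$3]++' test.txt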
Or, keyed on the whole second column:
awk '!seen[$2]++' test.txt > output.txt

How do I right justify columns in a file [duplicate]

How do I right justify the columns of a file in awk, sed, or bash?
My file is currently left justified and space delimited.
Can I use printf or rev?
Here is what my file looks like :
$ cat file
14,107 aaa 12,436 0.0 0 0 313 0 373
3,806,201 bbb 1,573 0.0 0 0 -25 0 -25
And using rev doesn't give me the output I'm looking for.
$ rev file | column -t | rev
14,107 aaa 12,436 0.0 0 0 313 0 373
3,806,201 bbb 1,573 0.0 0 0 -25 0 -25
In lieu of a specific example here is a general solution using a trick with rev:
$ cat file
a 10000.00 x
b 100 y
c 1 zzzZZ
$ rev file | column -t | rev
a  10000.00      x
b       100      y
c         1  zzzZZ
Where column -t is replaced by whatever you are trying to do.
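If fixed widths are acceptable, awk's printf can also right-justify each column directly; in this sketch the width 12 is an arbitrary assumption that must be at least as wide as the widest field:
awk '{for (i = 1; i <= NF; i++) printf "%12s", $i; print ""}' file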

How to delete all characters but the last

I want to parse a file and delete all leading 0's of a number using sed. (Of course, if I have something like 0000, the result should be 0.) How do I do that?
I think you may be searching for this. Here lies your answer; you will need to modify it, of course:
How to remove first/last character from a string using SED
This is probably over complicated, but it catches all the corner cases I tested:
sed 's/^\([^0-9]*\)0/\1\n0/;s/$/}/;s/\([^0-9\n]\)0/\1\n/g;s/\n0\+/\n/g;s/\n\([^0-9]\)/0\1/g;s/\n//g;s/}$//' inputfile
Explanation:
This uses the divide-and-conquer technique of inserting newlines to delimit segments of a line so they can be manipulated individually.
s/^\([^0-9]*\)0/\1\n0/ - insert a newline before the first zero
s/$/}/ - add a buffer character at the end
s/\([^0-9\n]\)0/\1\n/g - insert newlines before each leading zero (and remove the first)
s/\n0\+/\n/g - remove the remaining leading zeros
s/\n\([^0-9]\)/0\1/g - replace bare zeros
s/\n//g - remove the newlines
s/}$// - remove the end-of-line buffer
This file:
0 foo 1 bar 01 10 001 baz 010 100 qux 000 00 0001 0100 0010
100 | 00100
010 | 010
001 | 001
100 | 100
0 | 0
00 | 0
000 | 0
00 | 00
00 | 00
00 | 00 z
Becomes:
0 foo 1 bar 1 10 1 baz 10 100 qux 0 0 1 100 10
100 | 100
10 | 10
1 | 1
100 | 100
0 | 0
0 | 0
0 | 0
0 | 0
0 | 0
0 | 0 z
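For comparison, a much shorter GNU sed expression reproduces the same output, under the assumption that every number in the input is delimited by word boundaries:
sed 's/\b0\+\([0-9]\)/\1/g' inputfile
It removes zeros only when at least one more digit follows them, so a lone 0 stays 0, 000 collapses to 0, and numbers such as 10 are left untouched.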
If the leading zeroes are attached to a string of digits, all you have to do is convert it to an integer. Something like this:
$ echo "000123 test " | awk '{$1=$1+0}1'
123 test
This does not require any significant amount of regex, simple or overly complicated.
Similarly (Ruby1.9+)
$ echo "000123 test " | ruby -lane '$F[0]=$F[0].to_i; print $F.join(" ")'
123 test
For cases of all 0000's
$ echo "0000 test " | ruby -lane '$F[0]=$F[0].to_i; print $F.join(" ")'
0 test
$ echo "000 test " | awk '{$1=$1+0}1'
0 test
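If leading zeroes can occur in any numeric field rather than just the first, the same coercion can be applied per field; a sketch (note that awk rebuilds the record with single spaces between fields):
awk '{for (i = 1; i <= NF; i++) if ($i ~ /^0+[0-9]/) $i = $i + 0} 1' file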

Line extraction depending on range for specific columns

I would like to extract some lines from a text file; I have started to tweak sed lately.
I have a file with the structure
88 3 3 0 0 1 101 111 4 3
89 3 3 0 0 1 3 4 112 102
90 3 3 0 0 1 102 112 113 103
91 3 3 0 0 2 103 113 114 104
What I would like to do is extract the information according to the second column. I use something like this in my bash script (argument 2 is the input file):
sed -n '/^[0-9]* [23456789]/ p' < $2 > out
However, the second column can also hold entries outside the range [23456789], for instance 10. Since 10 is composed of 1 and 0, those two characters would have to be in the range as well, but there are entries with '1' in the second column that I do not want to keep. So how can I match '10's but not '1's?
Best,
Umut
sed -rn '/^[0-9]* ([23456789]|10)/ p' < $2 > out
You need extended-regexp support (-r) to get the | (or) operator.
Another interesting way is:
sed -rn '/^[0-9]* ([23456789]|[0-9]{2,})/ p' < $2 > out
Which means [23456789], or two or more digits in a row.
The instant you see variable-sized columns in your data, you should start thinking about awk:
awk '$2 > 1 && $2 < 11 {print}'
will do the trick assuming your file format is correct.
sed -rn '/^[0-9]* (2|3|4|5|6|7|8|9|10)/p' < $2 > out
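Note that the sed patterns above would also accept second columns such as 20 or 100, because nothing anchors the end of the number. If the range really is only 2 through 10, anchoring on the trailing delimiter restricts it; a sketch assuming single-space-separated columns:
sed -rn '/^[0-9]+ ([2-9]|10) /p' < "$2" > out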