sed - Addressing using two strings

I am picking up sed. I am having trouble understanding how line addressing in sed works when a pattern is used to specify the line address.
I have a sample text file named emp.lst with the following contents:
2233|a.k. shukla |g.m. |sales |12/12/52|6000
9876|jai sharma |director |production|12/03/50|7000
5678|sumit chakrobarty|d.g.m. |marketing |19/04/43|6000
2365|barun sengupta |director |personnel |11/05/47|7800
5423|n.k. gupta |chairman |admin |30/08/56|5400
1006|chanchal singhvi |director |sales |03/09/38|6700
6213|karuna ganguly |g.m. |accounts |05/06/62|6300
1265|s.n. dasgupta |manager |sales |12/09/63|5600
4290|jayant Choudhury |executive|production|07/09/50|6000
2476|anil aggarwal |manager |sales |01/05/59|5000
6521|lalit chowdury |director |marketing |26/09/45|8200
3212|shyam saksena |d.g.m. |accounts |12/12/55|6000
3564|sudhir Agarwal |executive|personnel |06/07/47|7500
2345|j.b. saxena |g.m. |marketing |12/03/45|8000
0110|v.k. agrawal |g.m. |marketing |31/12/40|9000
As I understand it, a line address can be specified either as line number(s) or as a pattern to match, given as plain text or as a regular expression.
I understand how sed -n '1p' emp.lst and sed -n '1,2p' emp.lst print line 1 and lines 1 and 2 respectively, without echoing all lines (-n).
I also understand and appreciate how sed -n '/director/p' emp.lst matches all the lines containing the string director and outputs:
9876|jai sharma |director |production|12/03/50|7000
2365|barun sengupta |director |personnel |11/05/47|7800
1006|chanchal singhvi |director |sales |03/09/38|6700
6521|lalit chowdury |director |marketing |26/09/45|8200
Now, when I specify multiple patterns, as in sed -n '/director/,/executive/p' emp.lst, the output shown is:
9876|jai sharma |director |production|12/03/50|7000
5678|sumit chakrobarty|d.g.m. |marketing |19/04/43|6000
2365|barun sengupta |director |personnel |11/05/47|7800
5423|n.k. gupta |chairman |admin |30/08/56|5400
1006|chanchal singhvi |director |sales |03/09/38|6700
6213|karuna ganguly |g.m. |accounts |05/06/62|6300
1265|s.n. dasgupta |manager |sales |12/09/63|5600
4290|jayant Choudhury |executive|production|07/09/50|6000
6521|lalit chowdury |director |marketing |26/09/45|8200
3212|shyam saksena |d.g.m. |accounts |12/12/55|6000
3564|sudhir Agarwal |executive|personnel |06/07/47|7500
What does this output represent?
Is it all lines containing the pattern director or executive? Clearly not, as there are some lines containing neither pattern.
Is it all lines from the first one matching either pattern to the last one matching either pattern? No again: if that were the logic, the line 2476|anil aggarwal |manager |sales |01/05/59|5000 would be included, yet it is missing from the output.
I have not been able to work out how the command sed -n '/director/,/executive/p' emp.lst works. I have gone through the sed man page and have still been unable to deduce it.
How should I approach understanding this behaviour?
For context, I am running the sed command built into macOS High Sierra 10.13.6, in Bash version 4.4.
Note: I am a sed newbie. Please correct any mistakes or incorrect terminology that I may have used.

https://www.gnu.org/software/sed/manual/sed.html#Range-Addresses:
An address range can be specified by specifying two addresses separated by a comma (,). An address range matches lines starting from where the first address matches, and continues until the second address matches (inclusively):
$ seq 10 | sed -n '4,6p'
4
5
6
Thus 1,2p does not mean "print lines 1 and 2" but "print all lines from line 1 through line 2". The difference becomes clearer with e.g. 3,7p, which prints not just lines 3 and 7, but lines 3, 4, 5, 6, 7.
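You can check this with the same seq input:
$ seq 10 | sed -n '3,7p'
3
4
5
6
7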
/director/,/executive/p prints all lines between a starting line (matching director) and an ending line (matching executive).
In your case, you have two matching ranges (each starting with director and ending with executive):
9876|jai sharma |director |production|12/03/50|7000
5678|sumit chakrobarty|d.g.m. |marketing |19/04/43|6000
2365|barun sengupta |director |personnel |11/05/47|7800
5423|n.k. gupta |chairman |admin |30/08/56|5400
1006|chanchal singhvi |director |sales |03/09/38|6700
6213|karuna ganguly |g.m. |accounts |05/06/62|6300
1265|s.n. dasgupta |manager |sales |12/09/63|5600
4290|jayant Choudhury |executive|production|07/09/50|6000
6521|lalit chowdury |director |marketing |26/09/45|8200
3212|shyam saksena |d.g.m. |accounts |12/12/55|6000
3564|sudhir Agarwal |executive|personnel |06/07/47|7500
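Two details of range addressing explain the rest of the behaviour: once a range has closed, sed starts looking for the first address again, so a new range can open further down; and if the second address never matches, the range runs to the end of the input. A minimal sketch with a toy input built by printf:
$ printf '%s\n' a start b end c start d | sed -n '/start/,/end/p'
start
b
end
start
d
The first range is start through end; the second opens at the later start and, with no closing end after it, prints to the end of the input.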

From man sed:
0,addr2
Start out in "matched first address" state, until addr2 is found.
This is similar to 1,addr2, except that if addr2 matches the very
first line of input the 0,addr2 form will be at the end of its range,
whereas the 1,addr2 form will still be at the beginning of its range.
This works only when addr2 is a regular expression.
Not 100% sure if this is the manual section that applies, but it looks like you have 2 blocks from "director" to "executive" in your output above.
There happen to be some other "director" lines between the first "director" and the first succeeding "executive"; while a range is open, those do not start a new range.
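One way to see the two ranges concretely is to list the matching lines together with their line numbers first:
$ grep -nE 'director|executive' emp.lst
2:9876|jai sharma |director |production|12/03/50|7000
4:2365|barun sengupta |director |personnel |11/05/47|7800
6:1006|chanchal singhvi |director |sales |03/09/38|6700
9:4290|jayant Choudhury |executive|production|07/09/50|6000
11:6521|lalit chowdury |director |marketing |26/09/45|8200
13:3564|sudhir Agarwal |executive|personnel |06/07/47|7500
So the ranges are lines 2-9 (opened by the first director; the director lines at 4 and 6 fall inside the already-open range) and lines 11-13, i.e. the output is the same as sed -n '2,9p;11,13p' emp.lst. Line 10 (anil aggarwal) sits between the two ranges, which is why it is skipped.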

Related

Remove the repeated punctuation from pyspark dataframe

I need to remove repeated punctuation characters and keep only the last occurrence.
For example: !!!! -> !
!!$$ -> !$
I have a dataset that looks like below
temp = spark.createDataFrame([
(0, "This is Spark!!!!"),
(1, "I wish Java could use case classes!!##"),
(2, "Data science is cool#$#!"),
(3, "Machine!!$$")
], ["id", "words"])
+---+--------------------------------------+
|id |words |
+---+--------------------------------------+
|0 |This is Spark!!!! |
|1 |I wish Java could use case classes!!##|
|2 |Data science is cool#$#! |
|3 |Machine!!$$ |
+---+--------------------------------------+
I tried a regex to remove specific punctuation characters, shown below:
df2 = temp.select(
[F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in temp.columns]
)
but the above does not work. Can anyone tell me how to achieve this in PySpark?
Below is the desired output.
id words
0  This is Spark!
1  I wish Java could use case classes!#
2  Data science is cool#$#!
3  Machine!$
You can use this regex.
df2 = temp.select('id',
F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words'))
Regex explanation.
( ... ) -> anything between ( and ) forms a capturing group
[ ... ] -> a character class: matches any one of the characters between [ and ]
([!$#]) -> a capturing group that matches any one of !, $, #
\1 -> a backreference to whatever the first capturing group matched
+ -> matches 1 or more of the preceding group or character
([!$#])\1+ -> matches any of !, $, # repeated two or more times
The last argument of regexp_replace is set to $1, which references the first capturing group (a single !, $ or # character), so each repeated run is replaced with just that single character.
You can add more characters between the [] to match more special characters.
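Putting it together, a minimal end-to-end sketch (assuming a local SparkSession and the sample data from the question):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

temp = spark.createDataFrame([
    (0, "This is Spark!!!!"),
    (1, "I wish Java could use case classes!!##"),
    (2, "Data science is cool#$#!"),
    (3, "Machine!!$$")
], ["id", "words"])

# Collapse each run of repeated !, $ or # down to a single character.
df2 = temp.select(
    'id',
    F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words')
)
df2.show(truncate=False)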

Using an iterator on a table

I have this table:
A:2.34889 2.484112 1.045939 3.359097 1.642348 1.298948 3.046995 4.077684
B:3.845017 3.762336 3.287893 3.338063 5.861462 5.401914 3.537128 5.27197
t:([] AA:A;BB:B)
-1 + prd select (-1#AA)%(1#AA) from t
-1 + prd select (-1#BB)%(1#BB) from t
which outputs
AA| 0.7360047
BB| 0.3711175
I was wondering how I can modify the last two lines into a single line that iterates over AA and BB? For example, if I had 10 symbols, I would only have to write a single line to output the 10 results.
Also apologies on the question title, I am not sure how to phrase it well but am happy to edit if required.
Iterators on tables can be either row-wise (demonstrated by 0N! below):
0N!/: t
`AA`BB!2.34889 3.845017
`AA`BB!2.484112 3.762336
`AA`BB!1.045939 3.287893
`AA`BB!3.359097 3.338063
`AA`BB!1.642348 5.861462
`AA`BB!1.298948 5.401914
`AA`BB!3.046995 3.537128
`AA`BB!4.077684 5.27197
Or column wise with flip:
0N!/: flip t
2.34889 2.484112 1.045939 3.359097 1.642348 1.298948 3.046995 4.077684
3.845017 3.762336 3.287893 3.338063 5.861462 5.401914 3.537128 5.27197
For this case, you could do the latter and apply your function to all columns with the each iterator:
{-1+prd last[x]%first x} each flip t
AA| 0.7360047
BB| 0.3711175
Use # or select to get the subset of columns you want to apply the function to, if need be:
{-1+prd last[x]%first x} each flip `AA`BB#t
AA| 0.7360047
BB| 0.3711175
More generally, when trying to build up similar code to apply to a list of columns, functional form can be useful to be aware of: https://code.kx.com/q/basics/funsql/
parse "exec AA:{-1+prd last[x]%first x} AA from t"
?
`t
()
()
(,`AA)!,({-1+prd last[x]%first x};`AA)
// or cls:cols t
cls:`AA`BB ;
?[t;();();cls!({-1+prd last[x]%first x}),/:cls]
AA| 0.7360047
BB| 0.3711175
Matt's answer is a better and more general answer, but in your particular example the logic can be as simple as:
q)-1+last[t]%first t
AA| 0.7360047
BB| 0.3711175
No need to iterate through the columns.
The best use of iterators here is Each Left to apply both first and last to t.
q)(last;first)@\:t
AA BB
-----------------
4.077684 5.27197
2.34889 3.845017
That table is a 2-list, so you can apply Divide.
q)-1+(%).(last;first)@\:t
AA| 0.7360047
BB| 0.3711175
To define this for re-use, it’s a composition of three unaries, here spaced for clarity:
q)f:-1+ (%). (last;first)@\:
q)f t
AA| 0.7360047
BB| 0.3711175
Works for any number of columns.
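For example, with a hypothetical extra column CC joined on, the same f applies unchanged:
q)t2:t,'([] CC:1 2 3 4 5 6 7 8f)  / hypothetical third column
q)f t2
AA| 0.7360047
BB| 0.3711175
CC| 7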

How to split an address in TALEND based on UPPER CASE values?

I want to split the below address from single column to multiple columns using talend.
Input
|ADDRESS|
|15 St. Patrick Rd NORTH WEST LONDON|
Expected Output
|ADDRESS_LINE1 | ADDRESS_LINE2 |
|15 St. Patrick Rd | NORTH WEST LONDON |
You can use the two following regex expressions to split your input ADDRESS as specified:
ADDRESS_LINE1 = StringHandling.TRIM(input.ADDRESS.replaceAll("^(.+?)(([A-Z]{2,}\\s*?)+)$", "$1"));
ADDRESS_LINE2 = StringHandling.TRIM(input.ADDRESS.replaceAll("^(.+?)(([A-Z]{2,}\\s*?)+)$", "$2"));
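The pattern works because ^(.+?) lazily captures everything up to the trailing run of all-uppercase words, while (([A-Z]{2,}\s*?)+)$ captures that run itself; the replacements $1 and $2 then select one half each. As a standalone sketch of the same idea in plain Java (the class and variable names here are illustrative):
public class AddressSplit {
    public static void main(String[] args) {
        String address = "15 St. Patrick Rd NORTH WEST LONDON";
        // Group 1: everything before the trailing all-caps run.
        String line1 = address.replaceAll("^(.+?)(([A-Z]{2,}\\s*?)+)$", "$1").trim();
        // Group 2: the trailing run of words of 2+ uppercase letters.
        String line2 = address.replaceAll("^(.+?)(([A-Z]{2,}\\s*?)+)$", "$2").trim();
        System.out.println(line1); // 15 St. Patrick Rd
        System.out.println(line2); // NORTH WEST LONDON
    }
}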

Remove row from a file perl

I have a file with |-delimited rows, and I want to check the value at the 8th position: if the value matches, remove that row from the file; if not, leave it in the file.
Below is the file format. I want to remove all the rows which have the value U at the 8th position:
A|B|DADD|H|O| |123 A Street; Apt.2|U|M
A|B|DADD|H|O| |123 A Street; Apt.2|A|M
A|B|DADD|H|O| |123 A Street; Apt.2|B|M
A|B|DADD|H|O| |123 A Street; Apt.2|U|M
How can I do this in Perl, or is there a way to use awk or sed? After removing the rows, I also want to print them.
I have tried sed, but it matches throughout the line; I want to match at a specific position.
sed -i '' "/$pattern/d" $file
perl -F'\|' -wlane'print if $F[7] ne "U"' file > new
With the -a switch each line is split into fields, available in the @F array. The separator to split on can be set with the -F option (the default is whitespace), and here it's |. See switches in perlrun. Then we just check the 8th field and print.
In order to change the input file in-place, add the -i switch
perl -i -F'\|' -wlane'print if $F[7] ne "U"' file
or use -i.bak to also keep a backup (.bak) file.
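That is, the same one-liner keeping a backup:
perl -i.bak -F'\|' -wlane'print if $F[7] ne "U"' file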
I see that a question popped up about logging those lines that aren't kept in the file.
One way is to hijack the STDERR stream for them
perl -i -F'\|' -wlane'$F[7] ne "U" ? print : print STDERR $_' file 2> excluded
where the file excluded gets the STDERR stream, redirected (in bash) using 2>. However, that can be outright dangerous, since possible warnings are now hidden and would corrupt the file intended for excluded lines (as they also go to that file).
So better collect those lines and print them at the end
perl -i -F'\|' -wlanE'
    $F[7] ne "U" ? print : push @exclude, $_;
    END { say for @exclude }
' input > excluded
where the file excluded gets all omitted (excluded) lines. (I switched -e to -E so as to have say.)
Sounds like this might be what you want:
$ cat file
A|B|DADD|H|O| |123 A Street; Apt.2|U|M
A|B|DADD|H|O| |123 A Street; Apt.2|A|M
A|B|DADD|H|O| |123 A Street; Apt.2|B|M
A|B|DADD|H|O| |123 A Street; Apt.2|U|M
$ awk -i inplace -F'[|]' '$8=="U"{print|"cat>&2"; next} 1' file
A|B|DADD|H|O| |123 A Street; Apt.2|U|M
A|B|DADD|H|O| |123 A Street; Apt.2|U|M
$ cat file
A|B|DADD|H|O| |123 A Street; Apt.2|A|M
A|B|DADD|H|O| |123 A Street; Apt.2|B|M
The above uses GNU awk for -i inplace. With other awks you'd just do:
awk -F'[|]' '$8=="U"{print|"cat>&2"; next} 1' file > tmp && mv tmp file
To log the deleted line to a file named log1:
awk -F'[|]' '$8=="U"{print >> "log1"; next} 1' file
To log it and print it to stderr:
awk -F'[|]' '$8=="U"{print|"tee -a log1 >&2"; next} 1' file

Spark TSV file and incorrect column split

I have a TSV file with many lines. Most of the lines work fine, but I have an issue with the following line:
tt7841930 tvEpisode "Stop and Hear the Cicadas/Cold-Blooded "Stop and Hear the Cicadas/Cold-Blooded 0 2018 \N 24 Animation,Family
I use Spark and Scala in order to load the file into DataFrame:
val titleBasicsDf = spark.read
.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.option("delimiter", " ")
.csv("title.basics.tsv.gz")
As result I receive:
+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+
|tconst |titleType|primaryTitle |originalTitle|isAdult|startYear|endYear|runtimeMinutes |genres|averageRating|numVotes|parentTconst|seasonNumber|episodeNumber|
+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+
|tt7841930|tvEpisode|"Stop and Hear the Cicadas/Cold-Blooded "Stop and Hear the Cicadas/Cold-Blooded|0 |2018 |\N |24 |Animation,Family|null |null |null |tt4947580 |6 |2 |
+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+
So as you may see, the following data in the line:
"Stop and Hear the Cicadas/Cold-Blooded "Stop and Hear the Cicadas/Cold-Blooded
is not properly split into two different values for the primaryTitle and originalTitle columns, and primaryTitle contains both of them:
{
"runtimeMinutes":"Animation,Family",
"tconst":"tt7841930",
"seasonNumber":"6",
"titleType":"tvEpisode",
"averageRating":null,
"originalTitle":"0",
"parentTconst":"tt4947580",
"startYear":null,
"endYear":"24",
"numVotes":null,
"episodeNumber":"2",
"primaryTitle":"\"Stop and Hear the Cicadas/Cold-Blooded\t\"Stop and Hear the Cicadas/Cold-Blooded",
"isAdult":2018,
"genres":null
}
What am I doing wrong, and how do I configure Spark to properly understand and split this line? As I mentioned previously, many other lines from this file are split correctly into the proper columns.
I found the answer here: https://github.com/databricks/spark-csv/issues/89
The way to turn off the default escaping of the double quote character
(") with the backslash character (\) - i.e. to avoid escaping for all
characters entirely, you must add an .option() method call with just
the right parameters after the .write() method call. The goal of the
option() method call is to change how the csv() method "finds"
instances of the "quote" character as it is emitting the content. To
do this, you must change the default of what a "quote" actually means;
i.e. change the character sought from being a double quote character
(") to a Unicode "\u0000" character (essentially providing the Unicode
NUL character, assuming it won't ever occur within the document).
the following magic option did the trick:
.option("quote", "\u0000")