Sed replace with conditions - sed

I have the following text file.
2017-03-01 10:57:50,892 [Thread-977] limits.compiler : ERROR - Error in formula Undefined_CountryDom
cannot get field LOCKS_CountryDom, String, CountryDom, belongs to Header, scalar (dynamic index: 172)
limits.compiler.LimitsVariablesException: cannot get field LOCKS_CountryDom, String, CountryDom, belongs to Header, scalar (dynamic index: 172)
at limits.compiler.ExpressionHandler.evaluateBoolean(ExpressionHandler.java:170)
at limits.compiler.ExpressionHandler.getBoolean(ExpressionHandler.java:266)
2017-03-01 10:57:50,700 [Thread-231] console : ERROR - at limits.compiler.ExpressionHandler.getString(ExpressionHandler.java:700)
2017-03-01 10:57:50,892 [Thread-977] console : ERROR - at limits.compiler.compliance.ComplianceCheckFactoryImpl.compileDefaultMessageExpression(ComplianceCheckFactoryImpl.java:107)
2017-03-01 10:57:50,892 [Thread-564] console : ERROR - at limits.compiler.compliance.ComplianceCheckFactoryImpl.createOverflow(ComplianceCheckFactoryImpl.java:231)
2017-03-01 10:57:50,893 [Thread-977] console : ERROR - at limits.compiler.compliance.ComplianceCheckFactoryImpl.evaluateTickLockCombinations(ComplianceCheckFactoryImpl.java:498)
2017-03-01 10:57:50,893 [Thread-977] console : ERROR - at limits.engine.stream.TickWriterImpl.doMLCOperations(TickWriterImpl.java:2488)
I need to remove the prefix (for example 2017-03-01 10:57:50,700 [Thread-231] console : ERROR -) from the lines where it is followed by at, so that those lines look like the unprefixed stack-trace lines above them.
The result should be something like this:
2017-03-01 10:57:50,892 [Thread-977] limits.compiler : ERROR - Error in formula Undefined_CountryDom
cannot get field LOCKS_CountryDom, String, CountryDom, belongs to Header, scalar (dynamic index: 172)
limits.compiler.LimitsVariablesException: cannot get field LOCKS_CountryDom, String, CountryDom, belongs to Header, scalar (dynamic index: 172)
at limits.compiler.ExpressionHandler.evaluateBoolean(ExpressionHandler.java:170)
at limits.compiler.ExpressionHandler.getBoolean(ExpressionHandler.java:266)
at limits.compiler.ExpressionHandler.getString(ExpressionHandler.java:700)
at limits.compiler.compliance.ComplianceCheckFactoryImpl.compileDefaultMessageExpression(ComplianceCheckFactoryImpl.java:107)
at limits.compiler.compliance.ComplianceCheckFactoryImpl.createOverflow(ComplianceCheckFactoryImpl.java:231)
at limits.compiler.compliance.ComplianceCheckFactoryImpl.evaluateTickLockCombinations(ComplianceCheckFactoryImpl.java:498)
at limits.engine.stream.TickWriterImpl.doMLCOperations(TickWriterImpl.java:2488)
How can I do that?

You can group parts of a regular expression in sed's substitution command, s/regex/replacement/, and refer back to the groups in the replacement. In this case we use two groups, \(...\), and keep only the second one, \2.
sed 's/\(^[0-9]*.*-\s\s\)\(.*$\)/\t\2/'
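This matches from the start of the line: optional leading digits ^[0-9]*, then any characters .* up to a dash followed by two whitespace characters -\s\s. That whole prefix is the first group and is dropped; the rest of the line, \2, is kept and given a leading tab \t. Lines in which no dash is followed by two whitespace characters, such as the first line of the log entry, are left unchanged. Note that \s and \t are GNU sed extensions and may not work in other sed implementations.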
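For comparison, here is a minimal Python sketch of the same idea with a stricter pattern. It is not the sed command above: it assumes the prefix always has the shape "timestamp [thread] logger : LEVEL -" and only strips it when the message starts with "at ". The names PREFIX and strip_prefix are made up for this illustration.

```python
import re

# Assumed prefix shape: "YYYY-MM-DD HH:MM:SS,mmm [thread] logger : LEVEL - ",
# stripped only when the message that follows starts with "at ".
PREFIX = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \[[^\]]+\] \S+ : \w+ -\s+(?=at )"
)

def strip_prefix(line: str) -> str:
    """Replace the log prefix with a tab on stack-trace lines; leave other lines alone."""
    return PREFIX.sub("\t", line)

if __name__ == "__main__":
    sample = [
        "2017-03-01 10:57:50,892 [Thread-977] limits.compiler : ERROR - Error in formula Undefined_CountryDom",
        "at limits.compiler.ExpressionHandler.getBoolean(ExpressionHandler.java:266)",
        "2017-03-01 10:57:50,700 [Thread-231] console : ERROR - at limits.compiler.ExpressionHandler.getString(ExpressionHandler.java:700)",
    ]
    for line in sample:
        print(strip_prefix(line))
```

Anchoring on the full prefix shape means a line that merely contains a dash is never touched, at the cost of having to update the pattern if the log layout changes.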

Related

Lex Parsing for exponent

I am trying to parse a file whose data looks like this:
size = [5e+09, 5e+09, 5e+09]
I have 'size OSQUARE NUMBER COMMA NUMBER COMMA NUMBER ESQUARE'
And NUMBER is defined in tokrules as
t_NUMBER = r'[-]?[0-9]*[\.]*[0-9]+([eE]-?[0-9]+)*'
But I get
Syntax error in input!
LexToken(ID,'e',6,113)
Illegal character '+'
Illegal character '+'
Illegal character '+'
What is wrong with my NUMBER definition?
I am using https://www.dabeaz.com/ply/
The part of your rule which matches exponents is
([eE]-?[0-9]+)*
Clearly, that won't match a +. It should be:
([eE][-+]?[0-9]+)*
Also, it will match 0 or more exponents, which is not correct. It should match 0 or 1:
([eE][-+]?[0-9]+)?
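Since PLY token rules are ordinary Python regular expressions, the difference can be checked directly with the re module. The sketch below is only a quick check outside of PLY; the helper name matched_prefix is made up for this illustration.

```python
import re

ORIGINAL = r'[-]?[0-9]*[\.]*[0-9]+([eE]-?[0-9]+)*'
FIXED = r'[-]?[0-9]*[\.]*[0-9]+([eE][-+]?[0-9]+)?'

def matched_prefix(pattern, text):
    """Return the text a lexer rule would consume at the start of `text`, or None."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

if __name__ == "__main__":
    # The original rule stops before the exponent, leaving "e+09" for other rules.
    print(matched_prefix(ORIGINAL, "5e+09"))
    # The fixed rule consumes the whole number.
    print(matched_prefix(FIXED, "5e+09"))
```

With the original rule only the leading 5 is consumed, which is consistent with the lexer then reporting an ID token for e and an illegal character for +.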

How to check for valid file name format in kdb/q?

I'd like to check that the file names in my directory are all formatted properly. First I create a variable dir and then use the keyword key to see what files are listed...
q)dir:`:/myDirectory/data/files
q)dirkey:key dir
q)dirkey
`FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
`FILEB_ABC_20190430_b556nyc1_OrderSale_000456.meta
I select and parse the .json file name...
q)dirjsn:dirkey where dirkey like "*.json"
q)sepname:raze{"_" vs string x}'[dirjsn]
"FILEA"
"XYZ"
"20190501"
"b233nyc9"
"OrderPurchase"
"000123.json"
Next I'd like to confirm that every character in sepname[0] and sepname[1] is a letter, that the characters in sepname[2] are numerical/temporal, and that sepname[3] contains alphanumeric values.
What is the best way to optimize the following sequential if statements for performance and how can I check for alphanumeric values, like in the case of sepname[3], not just one or the other?
q)if[not sepname[0] like "*[A-Z]";:show "Incorrect Submitter"];
if[not sepname[1] like "*[A-Z]";:show "Incorrect Reporter"];
if[not sepname[2] like "*[0-9]";:show "Incorrect Date"];
if[not sepname[3] like " ??? ";:show "Incorrect Kind"];
show "Correct File Format"
If your valid filenames always have the same structure (specifically 5 chars, 3 chars, 8 chars, 8 chars), then you can use a single regex-like pattern with like, as follows:
dirjsn:("FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json";"F2ILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json";"FILEA_XYZ2_20190501_b233nyc9_OrderPurchase_000123.json";"FILEA_XYZ_2A190501_b233nyc9_OrderPurchase_000123.json";"FILEA_XYZ_20190501_b233%yc9_OrderPurchase_000123.json";"FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json");
q)dirjsn
FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
F2ILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
FILEA_XYZ2_20190501_b233nyc9_OrderPurchase_000123.json
FILEA_XYZ_2A190501_b233nyc9_OrderPurchase_000123.json
FILEA_XYZ_20190501_b233%yc9_OrderPurchase_000123.json
FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
q)AZ:"[A-Z]";n:"[0-9]";Azn:"[A-Za-z0-9]";
q)dirjsn where dirjsn like raze(AZ;"_";AZ;"_";n;"_";Azn;"*")where 5 1 3 1 8 1 8 1
"FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json"
"FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json"
like will not work in this case as we need to check each character. One way to do that is to use in and inter:
q) a: ("FILEA"; "XYZ"; "20190501"; "b233nyc9")
Create a character set
q) c: .Q.a, .Q.A
For the first 3 cases, check whether each character belongs to a specific set:
q) r1: all#'(3#a) in' (c;c;.Q.n) / output 111b
For alphanumeric case, check if it contains both number and character and no other symbol.
q)r2: (sum[b]=count a[3]) & all b:sum#'a[3] in/: (c;.Q.n) / output 1b
Print output/errors:
q) errors: ("Incorrect Submitter";"Incorrect Reporter";"Incorrect Date";"Incorrect Kind")
q) show $[0=count r:where not r1,r2;"All good";errors r]
q) "All good"
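For readers more used to regular expressions, the replicated like pattern from the first answer (the pieces repeated 5 1 3 1 8 1 8 1 times) can be sketched in Python. This is only an illustration of how the pattern is assembled, not q code; the names PIECES, COUNTS and valid_names are made up here, and q's * wildcard is written as .* in the regex.

```python
import re

# One character class (or literal) per piece, mirroring (AZ;"_";AZ;"_";n;"_";Azn;"*").
PIECES = ["[A-Z]", "_", "[A-Z]", "_", "[0-9]", "_", "[A-Za-z0-9]", ".*"]
# How many times each piece is repeated, mirroring `where 5 1 3 1 8 1 8 1`.
COUNTS = [5, 1, 3, 1, 8, 1, 8, 1]

PATTERN = re.compile("".join(piece * n for piece, n in zip(PIECES, COUNTS)))

def valid_names(names):
    """Keep only the names whose first four fields have the expected shape."""
    return [name for name in names if PATTERN.fullmatch(name)]

if __name__ == "__main__":
    names = [
        "FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json",
        "F2ILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json",
        "FILEA_XYZ2_20190501_b233nyc9_OrderPurchase_000123.json",
        "FILEA_XYZ_2A190501_b233nyc9_OrderPurchase_000123.json",
        "FILEA_XYZ_20190501_b233%yc9_OrderPurchase_000123.json",
    ]
    print(valid_names(names))
```

Because every position gets its own character class, a single stray digit, letter or symbol in any of the fixed-width fields makes the whole name fail.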

Perl one liner to simulate awk script

I'm new to both awk and perl, so please bear with me.
I have the following awk script:
awk '/regex1/{p = 0;} /regex2/{p = 1;} p'
What this basically does is print all lines starting from a line that matches regex2 until a line that matches regex1 is found.
Example:
regex1
regex2
line 1
line 2
regex1
regex2
regex1
Output:
regex2
line 1
line 2
regex2
Is it possible to simulate this using a perl one-liner? I know I can do it with a script saved in a file.
Edit:
A practical example:
24 May 2017 17:00:06,827 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
24 May 2017 17:00:06,828 [INFO] 567890 (Blah : Blah1) Service-name:: Content( May span multiple lines)
24 May 2017 17:00:06,829 [INFO] 123456 (Blah : Blah2)
Service-name: Multiple line content. Printing Object[ ID1=fac-adasd
ID2=123231
ID3=123108 Status=Unknown
Code=530007 Dest=CA
]
24 May 2017 17:00:06,830 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
24 May 2017 17:00:06,831 [INFO] 567890 (Blah : Blah2) Service-name:: Content( May span multiple lines)
Given the search key 123456 I want to extract the following:
24 May 2017 17:00:06,827 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
24 May 2017 17:00:06,829 [INFO] 123456 (Blah : Blah2)
Service-name: Multiple line content. Printing Object[ ID1=fac-adasd
ID2=123231
ID3=123108 Status=Unknown
Code=530007 Dest=CA
]
24 May 2017 17:00:06,830 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
The following awk script does the job:
awk '/[0-9]{2}\s\w+\s[0-9]{4}/{n = 0} /123456/ {n =1}n' file
perl -ne 'print if (/regex2/ .. /regex1/) =~ /^\d+$/'
This is slightly crazy, but here's how it works:
-n adds an implicit loop over the input lines
the current line is in $_
the two bare regex matches (/regex2/, /regex1/) implicitly test against $_
we use .. in scalar context, which turns it into a stateful flip-flop operator
By that I mean: X .. Y starts out in the "false" state. In the "false" state it only evaluates X. If X returns a false value, it remains in the "false" state (and returns false itself). Once X returns a true value, it moves into the "true" state and returns true.
In the "true" state it only evaluates Y. If Y returns false, it remains in the "true" state (and returns true itself). Once Y returns a true value, it moves into the "false" state but it still returns true.
had we just used print if /regex2/ .. /regex1/, it would have printed all the terminating regex1 lines, too
a close reading of Range Operators in perldoc perlop reveals that you can distinguish the end points of the range
the "true" value returned by .. is actually a sequence number starting from 1, so the start of a range can be identified by checking for 1
when the end of the range is reached (i.e. we're about to move from the "true" state to the "false" state again), the return value gets a "E0" tacked on to the end
Adding "E0" to an integer doesn't affect its numeric value. Perl implicitly converts strings to numbers when needed, and something like "5E0" is just scientific notation (meaning 5 * 10**0, which is 5 * 1, which is 5).
the "false" value returned by .. is the empty string, ""
We check that the result of .. matches the regex /^\d+$/, i.e. is all digits. This excludes the empty string (because we require at least one digit to match), so we don't print lines outside of the range. It also excludes the last line in our range, because E is not a digit.
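The behaviour described above can be imitated in Python to make the return values concrete. This is a rough emulation written for illustration, not how Perl implements the operator; the names FlipFlop and select are made up, and the sketch assumes the two-dot form, which may test the end condition on the same line that opened the range.

```python
import re

class FlipFlop:
    """Rough emulation of Perl's scalar `X .. Y` applied to one line at a time."""

    def __init__(self, start, end):
        self.start = re.compile(start)
        self.end = re.compile(end)
        self.seq = 0  # 0 means we are in the "false" state

    def __call__(self, line):
        if self.seq == 0:
            if not self.start.search(line):
                return ""          # still outside any range
            self.seq = 1           # the range opens on this line
        else:
            self.seq += 1          # already inside a range
        if self.end.search(line):
            value = f"{self.seq}E0"  # last line of the range gets "E0" appended
            self.seq = 0
            return value
        return str(self.seq)

def select(lines, start="regex2", end="regex1"):
    """Keep lines whose flip-flop value is all digits, like `=~ /^\\d+$/`."""
    ff = FlipFlop(start, end)
    return [line for line in lines if re.fullmatch(r"\d+", ff(line))]

if __name__ == "__main__":
    sample = ["regex1", "regex2", "line 1", "line 2", "regex1", "regex2", "regex1"]
    print(select(sample))  # ['regex2', 'line 1', 'line 2', 'regex2']
```

On the example input the flip-flop yields "", "1", "2", "3", "4E0", "1", "2E0", so the all-digits test drops both the lines outside a range and the terminating regex1 lines.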
Not sure if awk prints both the start and end of the range, but Perl does:
perl -ne 'if(/regex2/ ... /regex1/){print}' file
Edit: awk also has a range pattern (it is part of standard awk, not just GNU awk). Like the Perl command above, it prints both the start and the end line of each range:
awk '/regex2/,/regex1/' file

How to recognize ID, Literals and Comments in Lex file

I have to write a lex program that has these rules:
Identifiers: String of alphanumeric (and _), starting with an alphabetic character
Literals: Integers and strings
Comments: Start with ! character, go to until the end of the line
Here is what I came up with
[a-zA-Z][a-zA-Z0-9]+ return(ID);
[+-]?[0-9]+ return(INTEGER);
[a-zA-Z]+ return ( STRING);
!.*\n return ( COMMENT );
However, I still get a lot of errors when I compile this lex file.
What do you think the error is?
It would have helped if you'd shown more clearly what the problem was with your code. For example, did you get an error message or did it not function as desired?
There are a couple of problems with your code, but it is mainly correct. The first issue I see is that you have not divided your lex program into the necessary parts with the %% divider. The first part of a lex program is the definitions section, where named regular-expression definitions and other declarations go. The second part is the rules section, where each pattern is paired with the action to run when it matches. The (optional) third section is where any code (for the compiler) is placed. Code for the compiler can also be placed in the definitions section when delineated by %{ and %} at the start of a line.
If we put your code through lex we would get this error:
"SoNov16.l", line 1: bad character: [
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: ]
"SoNov16.l", line 1: bad character: +
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: (
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: )
"SoNov16.l", line 1: bad character: ;
Did you get something like that? In your example code you are specifying actions (the return(ID); is an example of an action) and thus your code is for the second section. You therefore need to put a %% line ahead of it. It will then be a valid lex program.
Your code is dependent on (probably) a parser, which consumes (and declares) the tokens. For testing purposes it is often easier to just print the tokens first. I solved this problem by writing a C macro that does the print and can be redefined to do the return at a later stage. Something like this:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
[a-zA-Z][a-zA-Z0-9]+ TOKEN(ID);
[+-]?[0-9]+ TOKEN(INTEGER);
[a-zA-Z]+ TOKEN (STRING);
!.*\n TOKEN (COMMENT);
If we build and test this, we get the following:
abc
String: abc Matched: ID
abc123
String: abc123 Matched: ID
! comment text
String: ! comment text
Matched: COMMENT
Not quite correct. The ID rule is matching what should be a STRING. When two rules match the same number of characters, lex picks the one listed first, so the STRING rule has to come before the ID rule (unless, of course, you were supposed to match strings inside quotes?). You also left the underscore out of the ID pattern. It's also a good idea to match and discard any whitespace characters:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
[a-zA-Z]+ TOKEN (STRING);
[a-zA-Z][a-zA-Z0-9_]+ TOKEN(ID);
[+-]?[0-9]+ TOKEN(INTEGER);
!.*\n TOKEN (COMMENT);
[ \t\r\n]+ ;
Which when tested shows:
abc
String: abc Matched: STRING
abc123_
String: abc123_ Matched: ID
-1234
String: -1234 Matched: INTEGER
abc abc123 ! comment text
String: abc Matched: STRING
String: abc123 Matched: ID
String: ! comment text
Matched: COMMENT
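The effect of rule order can be reproduced outside lex. The following Python sketch imitates the two selection rules involved (the longest match wins; on a tie, the rule listed first wins). It is a simplified model for illustration only, not how lex is implemented; the function tokenize and the SKIP name are made up here.

```python
import re

def tokenize(text, rules):
    """Tokenize `text` with an ordered list of (name, pattern) rules.

    At each position the longest match wins; if several rules match the
    same length, the one listed first wins. Tokens named "SKIP" are dropped.
    """
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict ">" keeps the earlier rule when lengths are equal.
            if m and len(m.group(0)) > best_len:
                best_name, best_len = name, len(m.group(0))
        if best_len == 0:
            pos += 1  # no rule matched; this sketch simply skips the character
            continue
        if best_name != "SKIP":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

ID = ("ID", r"[a-zA-Z][a-zA-Z0-9_]+")
STRING = ("STRING", r"[a-zA-Z]+")
SPACE = ("SKIP", r"[ \t\r\n]+")

if __name__ == "__main__":
    print(tokenize("abc abc123", [ID, STRING, SPACE]))
    print(tokenize("abc abc123", [STRING, ID, SPACE]))
```

With ID listed first, abc is reported as an ID; with STRING listed first, abc becomes a STRING while abc123 is still an ID because that match is longer.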
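Just in case you wanted strings in quotes, that is easy too. Since bare letters no longer match STRING here, the ID pattern uses * instead of + so that a single-letter identifier still matches:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
\"[^"]+\" TOKEN (STRING);
[a-zA-Z][a-zA-Z0-9_]* TOKEN(ID);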
[+-]?[0-9]+ TOKEN(INTEGER);
!.*\n TOKEN (COMMENT );
[ \t\r\n] ;
"abc"
String: "abc" Matched: STRING

How to read a string containing a comma and an at sign with textread?

My prototype data line looks like this:
(1) 11 July England 0-0 Uruguay # Wembley Stadium, London
Currently I'm using this:
[no,dd,mm,t1,p1,p2,t2,loc]=textread('1966.txt','(%d) %d %s %s %d-%d %s # %[%s \n]');
But it gives me the following error:
Error using dataread
Trouble reading string from file (row 1, field 12) ==> Wembley Stadium, London\n
Error in textread (line 174)
[varargout{1:nlhs}]=dataread('file',varargin{:}); %#ok<REMFF1>
So it seems to have trouble with reading a string that contains a comma, or it's the at sign that causes trouble. I read the documentation thoroughly, but nowhere does it mention what to do when you have special characters such as # or when you want to read a string that contains a delimiter without having it recognized as a delimiter.
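You want %[^\n] as the last conversion, which reads every character up to (but not including) the newline, commas and spaces included: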
[no,dd,mm,t1,p1,p2,t2,loc] = ...
textread('1966.txt','(%d) %d %s %s %d-%d %s # %[^\n]');
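The same "read everything that is not a newline" idea appears in most regex dialects as the negated class [^\n]. As a cross-check of the field layout, here is a small Python sketch that parses the prototype line; it is an illustration only, not a substitute for textread, and the names LINE and parse_fixture are made up.

```python
import re

# Mirrors '(%d) %d %s %s %d-%d %s # %[^\n]': the last group takes the rest of the line.
LINE = re.compile(r"\((\d+)\) (\d+) (\S+) (\S+) (\d+)-(\d+) (\S+) # ([^\n]+)")

def parse_fixture(line):
    """Split one fixture line into its eight fields, or return None if it does not fit."""
    m = LINE.match(line)
    if m is None:
        return None
    no, dd, mm, t1, p1, p2, t2, loc = m.groups()
    return int(no), int(dd), mm, t1, int(p1), int(p2), t2, loc

if __name__ == "__main__":
    print(parse_fixture("(1) 11 July England 0-0 Uruguay # Wembley Stadium, London\n"))
```

The location field keeps its comma and inner space because the negated class stops only at the end of the line.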