How can I remove multiline sections with Perl?
I have wiki text like this:
{|
|-
| colspan="2"|
: <math>
[\underbrace{\color{Red}4,2}_{4 > 2},5,1,7] \rightarrow
[2,\underbrace{\color{OliveGreen}4,5}_{4 < 5},1,7] \rightarrow
[2,4,\underbrace{\color{Red}5,1}_{5 > 1},7] \rightarrow
[2,4,1,\underbrace{\color{OliveGreen}5,7}_{5 < 7}]
</math>
|-
|
: <math>
[\underbrace{\color{OliveGreen}2,4}_{2 < 4},1,5,{\color{Blue}7}] \rightarrow
[2,\underbrace{\color{Red}4,1}_{4 > 1},5,{\color{Blue}7}] \rightarrow
[2,1,\underbrace{\color{OliveGreen}4,5}_{4 < 5},{\color{Blue}7}]
</math>
: <math>
[\underbrace{\color{Red}2,1}_{2 > 1},4,{\color{Blue}5},{\color{Blue}7}] \rightarrow
[1,\underbrace{\color{OliveGreen}2,4}_{2 < 4},{\color{Blue}5},{\color{Blue}7}]
</math>
: <math>
[\underbrace{\color{OliveGreen}1,2}_{1 < 2},{\color{Blue}4},{\color{Blue}5},{\color{Blue}7}]
</math>
|}
I want to remove all the <math>...</math> sections from this text. How can I do it? I tried this:
cat math-text.txt | perl -e 'while(<>) { s/<math>.+?<\/math>//gs; print $_; }'
It does not work, but it should, since the documentation explains that with the s modifier . will match newlines. How can I do it?
The following is a Python script which I use to extract all the mathematical formulae from Wikipedia dumps. Rather than using a multi-line regexp, it scans each line for occurrences of <math> and </math>, uses their positions on the line to work out where each equation starts and ends, and uses a finite state machine with two states, determined by inEqn, to track whether it is currently inside an equation. It also does a few other things, like finding the title, the namespace, and the attributes in the math tags.
As dumps are on the order of 100 MB, a line-by-line approach may well end up being more efficient than multi-line regexps.
import sys
import re

titleRE = re.compile('<title>(.*)</title>')
nsRE = re.compile('<ns>(.*)</ns>')
mathRE = re.compile('</?math(.*?)>')
pageEndRE = re.compile('</page>')

title = ""
attr = ""
ns = -1
inEqn = 0        # 1 while inside a <math> ... </math> block
expression = ""  # equation text collected so far

for line in sys.stdin:
    # A new title starts a new page, so reset the equation state
    m = titleRE.search(line)
    if m:
        title = m.group(1)
        expression = ""
        inEqn = 0
    m = nsRE.search(line)
    if m:
        ns = m.group(1)
    start = 0
    pos = 0
    m = mathRE.search(line, pos)
    while m:
        if m.group().startswith('<math'):
            # Opening tag: remember its attributes and where the equation starts
            attr = m.group(1)
            start = m.end()
            pos = start
            expression = ""
            inEqn = 1
        if m.group() == '</math>':
            # Closing tag: finish the equation and print it
            end = m.start()
            expression = ' '.join([expression, line[start:end]])
            # Decode the XML entities left in the equation text
            print(title, '\t', attr, '\t', expression.lstrip().replace('&lt;', '<').replace('&gt;', '>').replace('&amp;', '&'))
            pos = m.end()
            expression = ""
            start = 0
            inEqn = 0
        m = mathRE.search(line, pos)
    if start > 0:
        # Equation opened on this line but not closed: keep the tail
        expression = line[start:].rstrip()
    elif inEqn:
        # Line lies entirely inside an equation
        expression = ' '.join([expression, line.rstrip()])
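For input that fits in memory, the multi-line regexp route from the question can also work. The one-liner there fails because while(<>) feeds the substitution one line at a time, so . never has a newline to cross; the whole text has to be read first (in Perl, slurp mode such as -0777 does that). A minimal Python sketch of the same idea, where strip_math and the sample string are made up for illustration:

```python
import re

# DOTALL lets "." match newlines, so one match can span several lines.
MATH_BLOCK = re.compile(r"<math>.+?</math>", re.DOTALL)


def strip_math(text):
    """Remove every <math>...</math> section from text held fully in memory."""
    return MATH_BLOCK.sub("", text)


# Hypothetical sample shaped like the wiki text in the question.
sample = ": <math>\n[4,2] \\rightarrow\n[2,4]\n</math>\n|-\n: <math>x</math>\n"
cleaned = strip_math(sample)
print(cleaned)
```

The non-greedy .+? matters here: a greedy .+ would swallow everything from the first <math> to the last </math>.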
Another option might be to consider an XML parser. A SAX- or DOM-based parser would be able to find the equations. This might be worth considering if you want to do more sophisticated analysis of the wiki text.
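As a rough sketch of the parser route, here is a streaming version using Python's xml.etree.ElementTree.iterparse. It assumes a simplified dump layout with no XML namespace (real dumps declare one, so the tag names would need adjusting), and formulas_by_title and SAMPLE are hypothetical names for illustration only:

```python
import io
import re
import xml.etree.ElementTree as ET

# After the parser has decoded entities, the tags appear as plain <math>.
MATH_RE = re.compile(r"<math[^>]*>(.*?)</math>", re.DOTALL)


def formulas_by_title(xml_stream):
    """Yield (title, formula) pairs from a MediaWiki-style XML stream."""
    title = None
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "text":
            for formula in MATH_RE.findall(elem.text or ""):
                yield title, formula.strip()
        elif elem.tag == "page":
            elem.clear()  # drop the finished page to keep memory use flat


# Hypothetical, heavily trimmed stand-in for a dump.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Bubble sort</title>
    <revision>
      <text>Intro &lt;math&gt;
a &gt; b
&lt;/math&gt; outro &lt;math&gt;x^2&lt;/math&gt;</text>
    </revision>
  </page>
</mediawiki>"""

pairs = list(formulas_by_title(io.BytesIO(SAMPLE)))
print(pairs)
```

The parser takes care of entity decoding, so no manual replace chain is needed, and the multi-line regexp only ever sees one page's text at a time.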
I would like to split a string in a column to n rows in Talend.
For example :
column
2aabbccdd
The first number is the "n" which I use to define the row length, so the expected result should be:
row 1 = aa
row 2 = bb
row 3 = cc
row 4 = dd
The idea here is to iterate on the string and cut it every 2 characters.
Any ideas, please?
I would use a tJavaFlex to split the string, with a trick to have n rows coming out of it.
tJavaFlex's main code:
int n = Integer.parseInt(row1.str.substring(0, 1)); // n is the leading digit, as in "2aabbccdd"
String str2 = row1.str.substring(1); // the string after n
int nbParts = (str2.length() + n - 1) / n; // ceiling division, so a shorter last part is kept
System.out.println("number of parts = " + nbParts);
for (int i = 0; i < nbParts; i++)
{
    String part = str2.substring(i * n);
    if (part.length() > n)
    {
        part = part.substring(0, n);
    }
    row2.str = part;
And tJavaFlex's end code is just a closing brace:
}
The trick is to use a for loop in the main code, but only close it in the end code.
tFixedFlowInput contains just one column holding the input string.
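Outside Talend, the splitting logic itself can be checked with a few lines of Python. This is only a sketch of the idea, assuming n is a single leading digit greater than zero; split_by_prefix is a made-up helper name, not part of Talend:

```python
def split_by_prefix(value, prefix_len=1):
    """Split value into chunks whose length is given by its numeric prefix."""
    n = int(value[:prefix_len])   # chunk length, e.g. 2 for "2aabbccdd"
    body = value[prefix_len:]     # the text to cut up
    # Step through the body n characters at a time; the last chunk may be shorter.
    return [body[i:i + n] for i in range(0, len(body), n)]


rows = split_by_prefix("2aabbccdd")
print(rows)
```

Each element of rows corresponds to one row emitted by the tJavaFlex loop.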
I'm using ag-Grid with AngularJS, and the filter does not work with formatted numbers. I use formatted numbers for currency values.
Below is the columndef code:
{ headerName:"GBO", field: "GBO", width: 200, editable:true, cellClass: "number-cell", filter:'agNumberColumnFilter',
    cellRenderer : function(params){
        if(params.value == "" || params.value == null)
            return '-';
        else return params.value;
    }
}
Before assigning the data to the grid, I format the numbers using :
$scope.formatNumberOnly = function(num, c, d, t){
    //console.log(num);
    // One var chain, so s, i and j are locals rather than implicit globals
    var n = getNumber(num),
        c = isNaN(c = Math.abs(c)) ? 2 : c,
        d = d == undefined ? "." : d,
        t = t == undefined ? "," : t,
        s = n < 0 ? "-" : "",
        i = parseInt(n = Math.abs(+n || 0).toFixed(c)) + "",
        j = (j = i.length) > 3 ? j % 3 : 0;
    return s + (j ? i.substr(0, j) + t : "") + i.substr(j).replace(/(\d{3})(?=\d)/g, "$1" + t) + (c ? d + Math.abs(n - i).toFixed(c).slice(2) : "");
};
});
The problem here is that the filter doesn't work with these formatted numbers and only seems to work for values up to 999.
Can anyone please help me with a solution to this filtering problem?
If you want the filter to work on these formatted values, you should use a valueGetter instead of a valueFormatter.
You should implement the above formatter function as a valueGetter in the column definition.
Also, a number filter won't work: for your formatted numbers to be interpreted, the column should use a text filter.
Here is an example from official docs.
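One plausible reason the filter seems to work only up to 999 is that values below 1000 contain no thousands separator, so they still read as plain numbers, while "1,000.00" does not. The snippet below illustrates that parsing issue in Python; it is not ag-Grid code, and naive_number and parse_formatted are hypothetical helpers:

```python
def naive_number(text):
    """Convert text to a number the way a plain numeric conversion would."""
    try:
        return float(text)
    except ValueError:
        return None  # the separator makes the string unreadable as a number


def parse_formatted(text, thousands=",", decimal="."):
    """Undo the formatting first, then convert."""
    cleaned = text.replace(thousands, "").replace(decimal, ".")
    return float(cleaned)


print(naive_number("999.00"))
print(naive_number("1,000.00"))
print(parse_formatted("1,000.00"))
```

Whichever filter type is chosen, the comparison has to happen on a representation the filter can actually interpret.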
I have a bunch of numbers in a text file as follows (example):
r0 = 204
r1 = 205
max_gap = 20u
min = 0
max = 8
thickness = 2
color = green
fill_under = yes
fill_color = green
r0 = 205
r1 = 206
I would like to divide any line with r0 = by 100 so that the line will then read
r0 = 20.4
I would like to do this for all lines with r0 and also for r1. Is there a way to do this in perl?
This is my attempt, but it doesn't work, mainly because I've never used Perl before, which is why I'm asking such a simple question:
#!/usr/bin/perl
$string= r0\s+=\s+\\(d+)
$num= $1/100
$num2= r0\s+=\s+\\$num
s/$string/$num2;
A one liner I could run from bash would be much better though. I know it'll involve the s/find/replace function but not sure how to specify the integer part
perl -pi -e 's#^(r[01]\s*=\s*)(\d+)$#$1.$2/100#e' filename
The options mean:
-p = Run the code in a loop that prints the modified input
-e = Execute the code in the first argument
-i = Replace the input file(s) with the output
The regular expression bits mean:
^ = beginning of line
r[01] = r0 or r1
\s*=\s* = any amount of whitespace, an =, and any amount of whitespace
\d+ = digits
$ = end of line
The replacement uses the e modifier, which means that it should be executed as a Perl expression. $1 and $2 are the contents of the two capture groups: $1 is everything before the number, $2 is the number. $2/100 divides the number by 100, and . concatenates the two pieces together.
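For comparison, the same capture-and-compute substitution can be sketched in Python, where a function passed to re.sub plays the role of the e modifier. scale_r_lines and the sample text are made up for illustration:

```python
import re

# Same pattern as the Perl version; MULTILINE makes ^ and $ work per line.
R_LINE = re.compile(r"^(r[01]\s*=\s*)(\d+)$", re.MULTILINE)


def scale_r_lines(text, divisor=100):
    """Divide the number on every r0/r1 line by divisor, leaving other lines alone."""
    def replace(match):
        # group(1) is everything before the number, group(2) is the number
        return match.group(1) + str(int(match.group(2)) / divisor)
    return R_LINE.sub(replace, text)


sample = "r0 = 204\nr1 = 205\nmax_gap = 20u\nmin = 0\n"
print(scale_r_lines(sample))
```

Lines such as max_gap = 20u and min = 0 do not start with r0 or r1, so they pass through untouched.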
As a one-liner:
perl -pi -e 's{^r[01]\s*=\s*\K(\d+)$}{$1/10}e' filename.txt
Here is an awk solution:
awk '/^r[01]/ {$3/=100} 1' file
r0 = 2.04
r1 = 2.05
max_gap = 20u
min = 0
max = 8
thickness = 2
color = green
fill_under = yes
fill_color = green
r0 = 2.05
r1 = 2.06
function final = fcn(sensor1, sensor2, sensor3)
% resolution = res
res = 10;
% value1 = ((sensor1+sensor2+sensor3)/3);
% | is used for 'or' command
if
sensor1 > res+sensor2 | sensor1> res+sensor3;
value1 = ((sensor2+sensor3)/2);
elseif
sensor2 > res+sensor1 | sensor2> res+sensor3;
value1 = ((sensor1+sensor3)/2);
elseif
sensor3 > res+sensor1 | sensor3> res+sensor2;
value1 = ((sensor1+sensor2)/2);
else
value1 = ((sensor1+sensor2+sensor3)/3);
end
final = value1;
I want it to display the final value based on the average. If any single value is greater than either of the other two by a certain amount (the resolution in this case), then it should neglect that value and just use the average of the other two. In MATLAB, my if/elseif block has an error saying 'Parse error at , and Parse error at elseif'.
You should have your if and your conditions on the same line, with no semicolon after the conditions:
.
.
.
if sensor1 > res+sensor2 || sensor1> res+sensor3
value1 = ((sensor2+sensor3)/2);
elseif sensor2 > res+sensor1 || sensor2> res+sensor3
value1 = ((sensor1+sensor3)/2);
.
.
.
By the way, you should be using || in this case because you're dealing with scalars.
You can get rid of almost all the conditional statements with a vectorized approach. Additionally, it scales automatically if you have many sensors as inputs with the same conditions.
Code
function value = fcn(sensor1, sensor2, sensor3)
res = 10;
sensor = [sensor1;sensor2;sensor3];
% index of the first sensor that exceeds another sensor by more than res
ind_first_cond_met = find(any(bsxfun(@gt,sensor,(res+sensor)'),2),1,'first');
if isempty(ind_first_cond_met)
    value = mean(sensor);
else
    sum_mat = bsxfun(@plus,sensor,sensor');
    % entry k is the mean of the two sensors other than sensor k
    mean_every_other_two = [sum_mat(2,3) sum_mat(3,1) sum_mat(1,2)]./2;
    value = mean_every_other_two(ind_first_cond_met);
end
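To sanity-check the rule independently of MATLAB, here is a small Python sketch of the same logic for any number of sensors. It assumes, like the if/elseif chain in the question, that only the first sensor found to exceed another by more than res is dropped; robust_average is a hypothetical name:

```python
def robust_average(sensors, res=10):
    """Average the readings, dropping the first one that exceeds another by more than res."""
    for i, value in enumerate(sensors):
        others = sensors[:i] + sensors[i + 1:]
        # Same test as "sensor1 > res+sensor2 || sensor1 > res+sensor3"
        if any(value > res + other for other in others):
            return sum(others) / len(others)
    # No reading stands out, so use them all
    return sum(sensors) / len(sensors)


print(robust_average([10, 50, 14]))
print(robust_average([20, 22, 24]))
```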
I am involved in a project that I think you can help me with. I have multiple images, which you can see here: Images to recognize. The goal is to extract the numbers between the dashed lines. What is the best approach to do that? My initial idea is to find the coordinates of the dashed lines, crop the image to that region, and then run OCR software on the result. But those coordinates are not easy to find. Can you help me? If you have a better approach, please tell me.
Best regards,
Pedro Pimenta
You may start by looking at more obvious (bigger) objects in your images. The dashed lines are way too small in some images. Searching for the "euros milhoes" logo and the barcode will be easier and it will help you have an idea of the scale and rotation involved.
To find these objects without using match template you can binarize your image (watch out for the background texture) and use the Hu moments on the contours/blobs.
Don't expect a good OCR accuracy on images where the numbers are smaller than 8-10 pixels.
You can use python-tesseract (https://code.google.com/p/python-tesseract/); it works with your image. What you need to do is split the result string. I used your https://www.dropbox.com/sh/kcybs1i04w3ao97/u33YGH_Kv6#f:euro9.jpg to test. The source code is below. UPDATE:
# -*- coding: utf-8 -*-
from PIL import Image
from PIL import ImageEnhance
import tesseract

# Boost contrast, binarize, and scale the image to a height of 416 pixels
im = Image.open('test.jpg')
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(4)
im = im.convert('1')
w, h = im.size
# Float arithmetic: 416 / h is integer division in Python 2
im = im.resize((int(w * 416.0 / h), 416))
w, h = im.size  # the loops below index the resized image
pix = im.load()

LINE_CR = 0.01
WHITE_HEIGHT_CR = int(h * (20 / 416.0))
status = 0

# Collect rows that are (almost) entirely white
white_line = []
for i in xrange(h):
    line = []
    for j in xrange(w):
        line.append(pix[(j, i)])
    p = line.count(0) / float(w)
    if not p > LINE_CR:
        white_line.append(i)

# Find the first run of WHITE_HEIGHT_CR consecutive white rows
wp = None
for i in range(10, len(white_line) - WHITE_HEIGHT_CR):
    k = white_line[i]
    if white_line[i + WHITE_HEIGHT_CR] == k + WHITE_HEIGHT_CR:
        wp = k
        break

# Walk upwards from that run, recording where text rows start and end
result = []
flag = 0
while 1:
    if wp < 0:
        result.append(wp)
        break
    line = []
    for i in xrange(w):
        line.append(pix[(i, wp)])
    p = line.count(0) / float(w)
    if flag == 0 and p > LINE_CR:
        # Stop if the left or right margin of this row is too dark
        l = []
        for xx in xrange(20):
            l.append(pix[(xx, wp)])
        if l.count(0) > 5:
            break
        l = []
        for xx in xrange(416-1, 416-100-1, -1):
            l.append(pix[(xx, wp)])
        if l.count(0) > 17:
            break
        result.append(wp)
        wp -= 1
        flag = 1
        continue
    if flag == 1 and p < LINE_CR:
        result.append(wp)
        wp -= 1
        flag = 0
        continue
    wp -= 1

# Drop boundaries that are too close together, then crop to the remaining span
result.reverse()
for i in range(1, len(result)):
    if result[i] - result[i - 1] < 15:
        result[i - 1] = -1
result = filter(lambda x: x >= 0, result)
im = im.crop((0, result[0], w, result[-1]))
im.save('test_converted.jpg')

# Run OCR on the cropped image
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_AUTO)
mImgFile = "test_converted.jpg"
mBuffer=open(mImgFile,"rb").read()
result = tesseract.ProcessPagesBuffer(mBuffer,len(mBuffer),api)
print "result(ProcessPagesBuffer)=",result
Depends on Python 2.7, python-tesseract-win32, python-opencv, numpy and PIL, and be sure to follow python-tesseract's "remember to" notes.
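The core of the script is a row scan: for each row of the binarized image, measure the fraction of black pixels and treat rows above a threshold as part of a line or of text. A stripped-down sketch of just that step, using a plain list of lists instead of a PIL image (dark_rows and the tiny sample image are made up for illustration):

```python
def dark_rows(pixels, threshold=0.01):
    """Return the indices of rows whose share of black (0) pixels exceeds threshold."""
    rows = []
    for y, row in enumerate(pixels):
        if row.count(0) / float(len(row)) > threshold:
            rows.append(y)
    return rows


W, B = 255, 0  # white and black in a binarized image
image = [
    [W, W, W, W],
    [B, W, B, W],  # upper dashed line
    [W, W, W, W],
    [W, W, W, W],
    [B, W, B, W],  # lower dashed line
    [W, W, W, W],
]

lines = dark_rows(image)
# Keep only the rows strictly between the two dashed lines
region = image[lines[0] + 1:lines[-1]]
print(lines, len(region))
```

On real photos the threshold and the minimum gap between boundaries need tuning, which is what LINE_CR and the 15-pixel check do in the full script.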