Extract lines from input text file to text file with some transformation with Mule 4 - mule-studio

I have a requirement to read a text file, extract some data, and send the extracted data to another system, but I have been unable to do it.
Input file:
1BoraBora Island
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
3BR 209078 BoraBora 6798989 99999
1 BR 67854 JAIHIND 789 000Y247 9898983
2 BR CR9 BoraBora 123 QK J12Y64 00010520
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Output should be:
1BoraBora Island
0000000000000000000000
1 BR 67854 JAIHIND 789 000Y247 9898983
2 BR CR9 BoraBora 123 QK J12Y64 00010520
I need to extract only the rows that have "BR" starting at the third character.
Please guide me on how to achieve this while keeping the output in text format.

Assuming that the input is `text/plain`, you can use a DataWeave script and the substring() function to extract a given position from each line:
%dw 2.0
import * from dw::core::Strings
output text/plain
var lines=payload splitBy "\n" // separate text into an array of lines
---
lines[0] ++"\n" ++ lines[1] ++"\n"
++ (lines[2 to -1] // use the range selector to get the remaining lines
filter (substring($,2,4)=="BR") // filter lines that have "BR" at the right position
reduce ($$++"\n"++$) // concatenate the remaining lines again into a single text file
)
Output:
1BoraBora Island
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
1 BR 67854 JAIHIND 789 000Y247 9898983
2 BR CR9 BoraBora 123 QK J12Y64 00010520
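For comparison outside Mule, the same selection logic can be sketched in Python. This is only an illustration and not part of the DataWeave answer; the function name `keep_br_lines`, the `header_count` parameter, and the shortened sample rows are hypothetical.

```python
def keep_br_lines(text: str, header_count: int = 2) -> str:
    """Keep the first `header_count` lines, then only lines with "BR" at index 2..3."""
    lines = text.split("\n")
    header = lines[:header_count]
    # line[2:4] plays the role of substring($, 2, 4) in the DataWeave script
    body = [line for line in lines[header_count:] if line[2:4] == "BR"]
    return "\n".join(header + body)


# Shortened, made-up version of the question's input
sample = "\n".join([
    "1BoraBora Island",
    "0000000000",
    "3BR 209078 BoraBora 6798989 99999",
    "1 BR 67854 JAIHIND 789 000Y247 9898983",
    "2 BR CR9 BoraBora 123 QK J12Y64 00010520",
    "0000000000",
])
print(keep_br_lines(sample))
```

The row starting with "3BR" is dropped because its "BR" sits one position too early.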

Since you are working with text, you can also use a regex with the scan function to find all lines that match your condition, then joinBy a newline character:
%dw 2.0
output text/plain
---
flatten(payload scan /(?<=^|\n).{2}BR.*/)
joinBy "\n"
(?<=^|\n).{2}BR.* Regex breakdown:
(?<= opens a positive lookbehind: the rest of the pattern matches only if it is immediately preceded by the pattern inside the lookbehind
(?<=^|\n) is a positive lookbehind that accepts either the start of the string (^) or a newline (\n)
.{2}BR.* indicates any character twice followed by the literal BR then any number of any character thereafter
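The same line-matching idea can be tried in Python as a rough sketch. Python's `re` module does not accept the variable-width lookbehind `(?<=^|\n)`, so this version swaps in the `^` anchor with `re.MULTILINE`, which is a different technique with the same effect here. The name `scan_br_lines` is hypothetical.

```python
import re

# ^ under re.MULTILINE matches at the start of every line,
# replacing the (?<=^|\n) lookbehind used in the DataWeave regex.
BR_LINE = re.compile(r"^.{2}BR.*$", re.MULTILINE)


def scan_br_lines(text: str) -> str:
    """Return only the lines that have "BR" as their 3rd and 4th characters."""
    return "\n".join(BR_LINE.findall(text))


text = "1BoraBora Island\n3BR 2090\n1 BR 678\n2 BR CR9\n0000"
print(scan_br_lines(text))
```

Like the scan-based script, this returns only the matching rows; the two header lines are not kept.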

Related

matlab: delimit .csv file where no specific delimiter is available

i wonder if there is the possibility to read a .csv file looking like:
0,0530,0560,0730,....
90,15090,15290,157....
i should get:
0,053 0,056 0,073 0,...
90,150 90,152 90,157 90,...
when using dlmread(path, '') matlab spits out an error saying
Mismatch between file and Format character vector.
Trouble reading 'Numeric' field from file (row 1, field number 2) ==> ,053 0,056 0,073 ...
i also tried using "0," as the delimiter but matlab prohibits this.
Thanks,
jonnyx
str= importdata('file.csv',''); %importing the data as a cell array of char
for k=1:length(str) %looping till the last line
str{k}=myfunc(str{k}); %applying the required operation
end
where
function new=myfunc(str)
old = str(1:regexp(str, ',', 'once')); %finding the characters till the first comma
%old is the pattern of the current line
new=strrep(str,old,[' ',old]); %adding a space before that pattern
new=new(2:end); %removing the space at the start
end
and file.csv :
0,0530,0560,073
90,15090,15290,157
Output:
>> str
str=
'0,053 0,056 0,073'
'90,150 90,152 90,157'
You can actually do this using textscan without any loops and using a few basic string manipulation functions:
fid = fopen('no_delim.csv', 'r');
C = textscan(fid, ['%[0123456789' 10 13 ']%[,]%3c'], 'EndOfLine', '');
fclose(fid);
C = strcat(C{:});
output = strtrim(strsplit(sprintf('%s ', C{:}), {'\n' '\r'})).';
And the output using your sample input file:
output =
2×1 cell array
'0,053 0,056 0,073'
'90,150 90,152 90,157'
How it works...
The format string specifies 3 items to read repeatedly from the file:
A string containing any number of characters from 0 through 9, newlines (ASCII code 10), or carriage returns (ASCII code 13).
A comma.
Three individual characters.
Each set of 3 items are concatenated, then all sets are printed to a string separated by spaces. The string is split at any newlines or carriage returns to create a cell array of strings, and any spaces on the ends are removed.
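As a rough illustration of the "digits, comma, three characters" reading pattern described above, here is a Python sketch. It assumes the three characters after each comma are always digits, as in the sample file; the function name `split_fused_numbers` is hypothetical.

```python
import re


def split_fused_numbers(line: str) -> str:
    """Split a line of fused numbers into space-separated 'int,ddd' items."""
    # Each item: one or more digits, a comma, then exactly three digits
    items = re.findall(r"\d+,\d{3}", line)
    return " ".join(items)


for row in ["0,0530,0560,073", "90,15090,15290,157"]:
    print(split_fused_numbers(row))
```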
If you have access to a GNU / *NIX command line, I would suggest using sed to preprocess your data before feeding into matlab. The command would be in this case : sed 's/,[0-9]\{3\}/& /g' .
$ echo "90,15090,15290,157" | sed 's/,[0-9]\{3\}/& /g'
90,150 90,152 90,157
$ echo "0,0530,0560,0730,356" | sed 's/,[0-9]\{3\}/& /g'
0,053 0,056 0,073 0,356
You can also easily change the commas (,) to decimal points (.):
$ echo "0,053 0,056 0,073 0,356" | sed 's/,/./g'
0.053 0.056 0.073 0.356
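If sed is not available, the two substitutions can be approximated in Python. This is a sketch under the same assumption as the sed command (every decimal part has exactly three digits); the function names are hypothetical.

```python
import re


def space_after_decimals(line: str) -> str:
    """Append a space after every ',ddd' group, like sed 's/,[0-9]\\{3\\}/& /g'."""
    # \g<0> re-inserts the whole match, the counterpart of & in sed
    return re.sub(r",[0-9]{3}", r"\g<0> ", line)


def commas_to_points(line: str) -> str:
    """Turn decimal commas into decimal points, like sed 's/,/./g'."""
    return line.replace(",", ".")


spaced = space_after_decimals("90,15090,15290,157")
print(commas_to_points(spaced).strip())
```

As with the sed command, the first substitution leaves a trailing space after the last number, which is stripped here before printing.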

Converting a .txt file with 1 million digits of "e" into a vector in matlab

I have a text file with 1 million decimal digits of "e" number with 80 digits on each line excluding the first and the last line which have 76 and 4 digits and the file has 12501 lines. I want to convert it into a vector in matlab with each digit on each row. I tried num2str function, but the problem is that it gets converted like for example '7.1828e79' (13 characters). What can I do?
P.S.1: The first two lines of the text file (76 and 80 digits) are:
7182818284590452353602874713526624977572470936999595749669676277240766303535
47594571382178525166427427466391932003059921817413596629043572900334295260595630
P.S.2: I used "dlmread" and got a 12501x1 vector, with the first and second row of 7.18281828459045e+75 and 4.75945713821785e+79 and the problem is that when I use num2str for example for the first row value, I get: '7.182818284590453e+75' as a string and not the whole 76 digits. My aim was to do something like this:
e1=dlmread('e.txt');
es1=num2str(e1);
for i=1:12501
for j=1:length(es1(1,:))
a1((i-1)*length(es1(1,:))+j)=es1(i,j);
end
end
e_digits=a1.';
but I get a string like this:
a1='7.182818284590453e+754.759457138217852e+797.381323286279435e+799.244761460668082e+796.133138458300076e+791.416928368190255e+79 5...'
with 262521 characters instead of 1 million digits.
P.S.3: I think the problem might be solved if I can manipulate the text file in a way that I have one digit on each line and simply use dlmread.
Well, this is not hard, there are many ways to do it.
So first you want to load in your file as a Char Array using something simple like (you want a Char Array so that you can easily manipulate it to forget about the lines breaks) :
C = fileread('yourfile.txt'); %loads file as Char Array
D = C(~isspace(C)); %Removes SPACES which are line-breaks
Next, you want to append a SPACE after each char (this is because you want to use the str2num transform, and MATLAB needs to see the spaces to treat each digit as a separate number). You can do this using a RESHAPE, a STRTRIM or simply a REGEX:
E = strtrim(regexprep(D, '.{1}', '$0 ')); %Add a SPACE after each Numeric Digit
Now you can transform it using str2num into a Vector:
str2num(E)'; %Converts the Char Array back to Vector
which will give you a single digit each row.
Using your example, I get a vector of 156 x 1, with 1 digit each row as you required.
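The approach above (read the file as text, drop the line breaks, convert one character at a time) can be sketched in Python as follows. The function name `digits_from_text` and the short sample string are hypothetical; for the real file you would pass in its full contents.

```python
from typing import List


def digits_from_text(text: str) -> List[int]:
    """Return every digit in `text` as its own integer, ignoring whitespace."""
    # isspace() covers spaces, '\n' and '\r', so line breaks disappear here
    return [int(ch) for ch in text if not ch.isspace()]


# Tiny made-up sample with mixed line endings
sample = "718\n28 1\r\n"
print(digits_from_text(sample))
```

Working character by character avoids the floating-point conversion that truncated each line to about 16 significant digits in the dlmread attempt.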
You can get a digit per row like this
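fid = fopen('e.txt','r');
c = textscan(fid,'%s'); % read each line of digits as one string
fclose(fid); % close the file once it has been read
c = cat(1,c{:}); % unwrap into a cell array with one line per cell
c = cellfun(@(x) str2num(reshape(x,[],1)),c,'un',0); % one digit per row for each line
c = cat(1,c{:}); % stack all lines into a single column vector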
And it is not the only possible way.
Could you please tell what is the final task, how do you plan using the array of e digits?

How to read a specific number (or word) from an answer

I have an .nc file I'm reading in matlab, and getting info out of the time variable.
the code looks like this
>> ncreadatt(model_list{3},'T','units')
ans =
'months since 1850-01-01'
what I want to do is get just the '1850' out of the answer.
Regular expression is a very powerful tool to parse and manipulate strings.
Matlab has regexp command:
line = 'months since 1850-01-01';
res = regexp( line, '\s(\d+)-', 'tokens', 'once');
year = str2double(res{1})
And the result is:
year =
1850
The regular expression used '\s(\d+)-' means:
\s - look for a single white space character (the space before 1850).
'(\d+)' - look for one or more digits ('\d+'); the parentheses mean that all characters matching here will be saved as a "token".
'-' - look for a single '-' after the digits.
You can play with it on ideone.
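For readers who want to try the same pattern outside MATLAB, here is a minimal Python sketch. The function name `year_from_units` is hypothetical, and the input is assumed to look like the units string shown in the question.

```python
import re


def year_from_units(units: str) -> int:
    """Extract the year from a string such as 'months since 1850-01-01'."""
    # Same pattern as the MATLAB answer: whitespace, captured digits, then '-'
    match = re.search(r"\s(\d+)-", units)
    if match is None:
        raise ValueError("no year found in: " + units)
    return int(match.group(1))


print(year_from_units("months since 1850-01-01"))
```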

I want to convert as below via preg_replace

I want to convert as below via preg_replace.
How can I find the answer?
preg_replace($pattern, "$2/$1", "One001Two111Three");
result> Three/Two111/One001
You'd better use preg_split, it's much more simple than preg_replace and it works with any number of elements:
$str = "One001Two111Three";
$res = implode('/', array_reverse(preg_split('/(?<=\d)(?=[A-Z])/', $str)));
echo $res,"\n";
output:
Three/Two111/One001
The regex /(?<=\d)(?=[A-Z])/ splits on the boundary between a digit and a capital letter, array_reverse reverses the order of the array returned by preg_split, and implode then joins the elements of the reversed array with a /.
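The split, reverse, and join steps translate almost directly to Python. This sketch assumes Python 3.7 or later, where `re.split` can split on zero-width matches; the function name `reverse_segments` is hypothetical.

```python
import re


def reverse_segments(s: str) -> str:
    """Split between a digit and a capital letter, then join in reverse order."""
    # Zero-width split point: preceded by a digit, followed by an uppercase letter
    parts = re.split(r"(?<=\d)(?=[A-Z])", s)
    return "/".join(reversed(parts))


print(reverse_segments("One001Two111Three"))
```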
$string = "One001Two111Three";
$result = preg_replace('/^(.*?\d+)(.*?\d+)(.*?)$/im', '$3/$2/$1', $string );
echo $result;
RESULT: Three/Two111/One001
EXPLANATION:
^(.*?\d+)(.*?\d+)(.*?)$
-----------------------
Options: Case insensitive; Exact spacing; Dot doesn't match line breaks; ^$ match at line breaks; Greedy quantifiers; Regex syntax only
Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed) «^»
Match the regex below and capture its match into backreference number 1 «(.*?\d+)»
Match any single character that is NOT a line break character (line feed) «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match a single character that is a “digit” (any decimal number in any Unicode script) «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 2 «(.*?\d+)»
Match any single character that is NOT a line break character (line feed) «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match a single character that is a “digit” (any decimal number in any Unicode script) «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 3 «(.*?)»
Match any single character that is NOT a line break character (line feed) «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Assert position at the end of a line (at the end of the string or before a line break character) (line feed) «$»
$3/$2/$1
Insert the text that was last matched by capturing group number 3 «$3»
Insert the character “/” literally «/»
Insert the text that was last matched by capturing group number 2 «$2»
Insert the character “/” literally «/»
Insert the text that was last matched by capturing group number 1 «$1»
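To see the three capture groups and the reordered replacement in action, the same pattern can be run in Python. This is only an illustrative sketch: the function name `swap_three_groups` is hypothetical, and Python writes the backreferences as \3, \2 and \1 instead of $3, $2 and $1.

```python
import re


def swap_three_groups(s: str) -> str:
    """Reorder 'text+digits, text+digits, rest' as 'rest/second/first'."""
    pattern = r"^(.*?\d+)(.*?\d+)(.*?)$"
    # IGNORECASE and MULTILINE mirror the /im flags of the PHP pattern
    return re.sub(pattern, r"\3/\2/\1", s, flags=re.IGNORECASE | re.MULTILINE)


print(swap_three_groups("One001Two111Three"))
```

Unlike the split-based approach, this pattern is tied to exactly three segments.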

Unicode character transformation in SPSS

I have a string variable. I need to convert all non-digit characters to spaces (" "). I have a problem with Unicode characters: characters outside the basic charset are converted to invalid characters. See the code below for an example.
Is there another way to achieve the same result with a procedure that does not choke on special Unicode characters?
new file.
set unicode = yes.
show unicode.
data list free
/T (a10).
begin data
1234
5678
absd
12as
12(a
12(vi
12(vī
12āčž
end data.
string Z (a10).
comp Z = T.
loop #k = 1 to char.len(Z).
if ~range(char.sub(Z, #k, 1), "0", "9") sub(Z, #k, 1) = " ".
end loop.
comp Z = normalize(Z).
comp len = char.len(Z).
list var = all.
exe.
The result:
T Z len
1234 1234 4
5678 5678 4
absd 0
12as 12 2
12(a 12 2
12(vi 12 2
12(vī 12 � 6
>Warning # 649
>The first argument to the CHAR.SUBSTR function contains invalid characters.
>Command line: 1939 Current case: 8 Current splitfile group: 1
12āčž 12 �ž 7
Number of cases read: 8 Number of cases listed: 8
The substr function should not be used on the left hand side of an expression in Unicode mode, because the replacement character may not be the same number of bytes as the character(s) being replaced. Instead, use the replace function on the right hand side.
The corrupted characters you are seeing are due to this size mismatch.
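The underlying idea, replacing whole characters instead of overwriting bytes in place, can be illustrated in Python, where strings are sequences of code points. This is a sketch of the concept only, not SPSS syntax; the function name `non_digits_to_spaces` is hypothetical.

```python
import re


def non_digits_to_spaces(value: str) -> str:
    """Replace every character that is not an ASCII digit with one space."""
    # The substitution works per character, so multi-byte characters
    # such as U+012B are replaced as a whole and never split.
    return re.sub(r"[^0-9]", " ", value)


for raw in ["1234", "12(a", "12(v\u012b", "12\u0101\u010d\u017e"]:
    print(repr(non_digits_to_spaces(raw)))
```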
Instead of replacing the non-numeric characters, how about cycling through the string, pulling out the numeric characters, and rebuilding Z? (Note that my version here uses the string functions that predate the CHAR. functions.)
data list free
/T (a10).
begin data
1234
5678
absd
12as
12(a
12(vi
12(vī
12āčž
12as23
end data.
STRING Z (a10).
STRING #temp (A1).
COMPUTE #len = LENGTH(RTRIM(T)).
LOOP #i = 1 to #len.
COMPUTE #temp = SUBSTR(T,#i,1).
DO IF INDEX('0123456789',#temp) > 0.
COMPUTE Z = CONCAT(SUBSTR(Z,1,#i-1),#temp).
ELSE.
COMPUTE Z = CONCAT(SUBSTR(Z,1,#i-1)," ").
END IF.
END LOOP.
EXECUTE.