How to extract a pattern from a txt file in Talend

I have a txt file like the following, and I would like to extract the accession IDs of the form "GSE????" or "GSE**" with Talend. I tried "tPatternextract", but it does not seem to work in Talend 7.1. Is there a way to extract all text that matches a pattern?
Best,
Xinhui
Integrated analysis of DNA methylation and gene expression profiles identified S100A9 as a potential biomarker in ulcerative colitis
(Submitter supplied) In this research, 90 differential expression mRNAs (DEMs).
Organism: Homo sapiens
Type: Expression profiling by array; Non-coding RNA profiling by array
Platform: GPL20115 6 Samples
FTP download: GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE160nnn/GSE160804/
Series Accession: GSE160804 ID: 200160804
Induced organoids derived from patients with ulcerative colitis recapitulate the colitic reactivity
(Submitter supplied) We report the application of single nucleus RNA-seq.
Organism: Homo sapiens
Type: Expression profiling by high throughput sequencing
Platform: GPL24676 11 Samples
FTP download: GEO (MTX, TSV) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE152nnn/GSE152999/
SRA Run Selector: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA641142
Series Accession: GSE152999 ID: 200152999

Use a tFilterRow.
In the Component tab, click on "Use advanced mode" and enter a condition such as:
input_row.columnName1.startsWith("AA12")
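Outside Talend, the extraction the question asks for comes down to a regular-expression search. The following is a minimal Python sketch, not a Talend job. It assumes an accession ID is the literal prefix GSE followed by digits only, and the function name extract_accessions is made up for illustration.

```python
import re

# Assumption: an accession ID is "GSE" followed by digits only, so the
# placeholder directory name "GSE160nnn" in the FTP path is not matched.
ACCESSION = re.compile(r"\bGSE\d+\b")


def extract_accessions(text):
    """Return the distinct accession IDs in order of first appearance."""
    seen = []
    for match in ACCESSION.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Three lines taken from the sample file in the question.
sample = """FTP download: GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE160nnn/GSE160804/
Series Accession: GSE160804 ID: 200160804
Series Accession: GSE152999 ID: 200152999"""

print(extract_accessions(sample))
```

The word boundaries keep the pattern from matching a prefix of a longer token, which is why the series directory name is skipped while the accession in the same path is kept.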


ruamel.yaml.cmd rt breaks lists, if containing long string, or hash

I just noticed that the command line tool, called like this: "ruamel.yaml.cmd rt --save $YAML_FILE", will break lists that contain either long strings or hashes:
Example list containing a hash:
Source:
telegraf::inputs:
cpu:
- percpu: true
totalcpu: true
report_active: true
Output:
telegraf::inputs:
cpu:
- percpu: true
totalcpu: true
report_active: true
Example list containing a long string:
Source:
rsyslog::config::snippets:
00_forward:
ensure: 'present'
lines:
- 'if $syslogfacility != 1 then {'
- 'action(Name="collector-syslog" Type="omfwd" Target="%{hiera("rsyslog_server")}" Port="514" Action.ResumeInterval="5" Protocol="tcp")'
- '}'
Output:
rsyslog::config::snippets:
00_forward:
ensure: present
lines:
- if $syslogfacility != 1 then {
- action(Name="collector-syslog" Type="omfwd" Target="%{hiera("rsyslog_server")}"
Port="514" Action.ResumeInterval="5" Protocol="tcp")
- '}'
I already created a bug report for this, but it was deleted with a comment pointing to https://yaml.readthedocs.io/en/latest/example.html?highlight=indent#output-of-dump-as-a-string.
But I am not sure how this code snippet should help me with the command line tool.
Or is the tool deprecated, so that I have to roll my own?
The automatic detection of the indent seems incorrect for your input, as that input is inconsistent (your mappings are indented 2 positions and your sequences 4 positions with an offset for the block sequence indicator of 2). ruamel.yaml.cmd as on PyPI doesn't support different indentation levels for sequences and mappings (ruamel.yaml didn't when that was written, it does now).
Apart from that, older versions of ruamel.yaml.cmd (before 2020-12-01) do not let you set the line width for the output, and they wrap at the default of 80 characters.
I recommend you upgrade to 0.5.6 and use the command line options:
yaml rt --indent 2 --width 1024 --save <yourfile>
The appropriate repository for ruamel.yaml.cmd is https://sourceforge.net/p/ruamel-yaml-cmd/code/ci/default/tree/ . A bug report on ruamel.yaml, which can only be used from a Python program, should include the minimal source code of the program that reproduces the error. Issues that do not provide it will be removed, as announced on its create issue page.

snort | pcre| rule specification

My objective is to write a rule that detects a simple truth exploit (SQLi).
An example string has the form:
% ' or 1 = 1 #
To identify the string above and some of its variations, I have developed the following pcre:
pcre: "/\W\s*\W\s*or\s*([\d\w])\s*\W\s*\1\s*\W/";
I ran a test at regextester and my regex seems to work. In Snort, however, the rule does not pick up the request and never triggers.
The rule has the format:
alert 192.168.x.x any -> 192.168.y.y 80 (msg: "SQL Query"; pcre: "/\W\s*\W\s*or\s*([\d\w])\s*\W\s*\1\s*\W/"; sid: 1001;);
I'd appreciate any help.
GET request from Wireshark:
GET /dvwa/vulnerabilities/sqli/?id=%25+%27+or+1+%3D+1+%23&Submit=Submit
The rule fails because of URL encoding. %25 means %, %27 means ', + (or %20) means space, and %3D means =. https://www.w3schools.com/tags/ref_urlencode.asp
Snort has an HTTP normalization module, but I think it is not perfect.
Consider the following rule:
alert tcp any any -> any any (content:"+or+"; nocase; pcre:"/\+or\+\w\+%3D\+\w/";)
Using pcre alone can degrade performance. When used with content, it narrows the scope of the pcre inspection and improves performance.
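The effect of URL encoding on the original pattern can be checked outside Snort. The sketch below is plain Python rather than a Snort rule: it applies the pcre from the question to the query string from the captured GET request, once as sent and once after decoding. The variable names are illustrative only.

```python
import re
from urllib.parse import unquote_plus

# Query string as captured in the GET request from the question.
raw_query = "id=%25+%27+or+1+%3D+1+%23&Submit=Submit"

# The pcre from the question, written as a Python regex.
tautology = re.compile(r"\W\s*\W\s*or\s*([\d\w])\s*\W\s*\1\s*\W")

# unquote_plus turns %XX escapes into characters and '+' into a space.
decoded_query = unquote_plus(raw_query)

matches_raw = tautology.search(raw_query) is not None
matches_decoded = tautology.search(decoded_query) is not None

print(decoded_query)
print(matches_raw, matches_decoded)
```

The pattern only matches the decoded form, which is consistent with the rule never firing when it inspects the encoded request as it appears on the wire.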

Mallet POS-Tagging learning time

I've been trying to use the Mallet Simple Tagger (http://mallet.cs.umass.edu/sequences.php) to learn a CRF model for POS tagging.
I am now starting to get worried and confused, as my computer has been training this one model for over a week.
It does not seem to have hung, as it still gives me output of the form:
...
Punkte NN->Puppenkönig NN(Puppenkönig NN) Punkte NN,Puppenkönig NN
Punkte NN->Obere NN(Obere NN) Punkte NN,Obere NN
Punkte NN->Entfernung NN(Entfernung NN) Punkte NN,Entfernung NN
...
So I wanted to ask whether it is normal for Mallet to take this long, or whether something went wrong.
I used the command specified on the webpage:
hough@gobur:~/tagger-test$ java -cp
"/home/hough/mallet/class:/home/hough/mallet/lib/mallet-deps.jar"
cc.mallet.fst.SimpleTagger
--train true --model-file nouncrf sample
The training data contains 96903 Tokens.
Edit:
We assume it might have something to do with the form of the input. The website specifies the form:
Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
The documentation for the SimpleTagger (http://mallet.cs.umass.edu/api/) states that each instance should be a separate block, with blocks separated by blank lines. While I'm not sure what is meant by "instance", I thought the expected form is something like this:
word pos
word pos
. $.
word pos
word pos
word pos
. $.
word pos
word pos
. $.
...
Is this the right format? Does someone have an example file that shows what the format should look like?
A week for a 100k token corpus seems much too long. I would expect on the order of a half hour at most.

use perl to extract specific output lines

I'm endeavoring to create a system to generalize rules from input text. I'm using reVerb to create my initial set of rules. Using the following command[*], for instance:
$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n
To generate output of the form:
1 stdin
2 1
3 Bananas
4 are an excellent source of
5 potassium
6 0
7 1
8 1
9 6
10 6
11 7
12 0.9999999997341693
13 Bananas are an excellent source of potassium .
14 NNS VBP DT JJ NN IN NN .
15 B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
16 bananas
17 be source of
18 potassium
I'm currently piping the output to a file, which includes the preceding whitespace and numbers as depicted above.
What I'm really after is just the simple rule at the end, i.e. lines 16, 17 & 18. I've been trying to create a script to extract just that component and put it into a new file in the form of a Prolog clause, i.e. be source of(bananas, potassium).
Is that feasible? Can Prolog rules contain white space like that?
I think I'm locked into getting all that output from reVerb, so what would be the best way to extract the desired component? With a Perl script? Or maybe sed?
*Later I plan to replace this with a larger input file as opposed to just single sentences.
This seems wasteful. Why not leave the tabs as they are, and use:
$ echo "Bananas are an excellent source of potassium." \
| ./reverb -q | cut --fields=16,17,18
And yes, you can have rules like this in Prolog. See the answer by @mat. You need to know a bit of Prolog before you move on, I guess.
It is easier, however, to just make the string a valid name for a predicate:
be_source_of with underscores instead of spaces
or 'be source of' with spaces, and enclosed in single quotes.
You can probably use awk to do what you want with the three fields. See for example the printf command in awk. Or you can parse the output again from Prolog directly. Both are beyond the scope of your current question, I feel.
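As a rough illustration of the formatting step, here is a Python sketch that turns fields 16 to 18 of one tab-separated line into a Prolog fact with a single-quoted predicate name. The helper names quote_atom and to_prolog_fact are hypothetical, the example line is a shortened stand-in rather than real ReVerb output, and quoting the two arguments as well is an extra precaution of this sketch.

```python
def quote_atom(text):
    """Wrap text in single quotes, escaping backslashes and single quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_prolog_fact(line):
    """Build a fact from fields 16-18 (1-based) of one tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    arg1, relation, arg2 = fields[15:18]
    return f"{quote_atom(relation)}({quote_atom(arg1)}, {quote_atom(arg2)})."


# Stand-in for one ReVerb line: 15 filler fields, then the three normalized ones.
example = "\t".join(["x"] * 15 + ["bananas", "be source of", "potassium"])

print(to_prolog_fact(example))
```

Quoting every atom avoids having to check whether an extracted phrase happens to be a valid unquoted Prolog atom.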
sed -n 'N;N
:cycle
$!{N
D
b cycle
}
s/\(.*\)\n\(.*\)\n\(.*\)/\2 (\1,\3)/p' YourFile
If the numbers are part of the output and not just there for reference, replace the last sed action with
s/^ *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)/\2 (\1,\3)/p
This assumes the last 3 lines are the source of your "rules".
Regarding the Prolog part of the question:
Yes, Prolog facts can contain whitespace like this, with suitable operator declarations present.
For example:
:- op(700, fx, be).
:- op(650, fx, source).
:- op(600, fx, of).
Example query and its result, to let you see the shape of terms that are created with this syntax:
?- write_canonical(be source of(a, b)).
be(source(of(a,b))).
Therefore, with these operator declarations, a fact like:
be source of(a, b).
is exactly the same as stating:
be(source(of(a,b))).
Depending on use cases and other definitions, it may even be an advantage to create this kind of facts (i.e., facts of the form be/1 instead of source_of/2). If this is the only kind of facts you need, you can simply write:
source_of(a, b).
This creates no redundant wrappers and is easier to use.
Or, as Boris suggested, you can use single quotes as in 'be source of'/2.

Find all differences between .mat files

I am looking for a way to list the differences between two .mat files, something that could be useful for many people.
Though I searched everywhere I could think of, I have not found anything that meets my requirements:
Pick 2 mat files
Find the differences
Save them properly
The closest I have come is visdiff. As long as I stay within MATLAB, it lets me browse the differences, but when I save the result it only shows me the top level.
Here is a simplified example of what my files typically look like:
a = 6;
b.c.d = 7;
b.c.e = 'x';
save f1
f = a;
clear a
b.c.e = 'y';
save f2
visdiff('f1.mat','f2.mat')
If I click here on b, I can find the difference. However if I run this and use 'file>save', I am not able to click on b. Thus I still don't know what has been changed.
Note: I don't have Simulink
Hence my question is:
How can I show all differences between 2 mat files to someone without MATLAB?
Here are the answers that I personally consider to be most suitable for different situations:
Answer for users with Simulink
General answer
Answer displaying all value differences
Find all differences between mat files without MATLAB?
You can find the differences between HDF5 based .mat files with the HDF5 Tools.
Example
Let me shorten your MATLAB example and assume you create two mat files with
clear ; a = 6 ; b.c = 'hello' ; save -v7.3 f1
clear ; a = 7 ; b.e = 'world' ; save -v7.3 f2
Outside MATLAB use
h5ls -v -r f1.mat
to get a listing of the kind of data included in f1.mat:
Opened "f1.mat" with sec2 driver.
/ Group
Location: 1:96
Links: 1
/a Dataset {1/1, 1/1}
Attribute: MATLAB_class scalar
Type: 6-byte null-terminated ASCII string
Data: "double"
Location: 1:2576
Links: 1
Storage: 8 logical bytes, 8 allocated bytes, 100.00% utilization
Type: native double
/b Group
Attribute: MATLAB_class scalar
Type: 6-byte null-terminated ASCII string
Data: "struct"
Location: 1:800
Links: 1
/b/c Dataset {5/5, 1/1}
Attribute: H5PATH scalar
Type: 2-byte null-terminated ASCII string
Data: "/b"
Attribute: MATLAB_class scalar
Type: 4-byte null-terminated ASCII string
Data: "char"
Attribute: MATLAB_int_decode scalar
Type: native int
Data: 2
Location: 1:1832
Links: 1
Storage: 10 logical bytes, 10 allocated bytes, 100.00% utilization
Type: native unsigned short
Use of
h5ls -d -r f1.mat
returns the values of the stored data:
/ Group
/a Dataset {1, 1}
Data:
(0,0) 6
/b Group
/b/c Dataset {5, 1}
Data:
(0,0) 104, 101, 108, 108, 111
The data 104, 101, 108, 108, 111 represents the word hello, which can be seen with
h5ls -d -r f1.mat | tail -1 | awk '{FS=",";printf("%c%c%c%c%c \n",$2,$3,$4,$5,$6)}'
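The same decoding can be sketched in Python. This is only an illustration of how the numeric values map back to characters: it hard-codes the five values from the listing above instead of reading the .mat file, and it assumes every value is a single character code, which holds for plain ASCII text.

```python
# Values printed by `h5ls -d` for /b/c in the listing above.
code_units = [104, 101, 108, 108, 111]

# Assumption: each stored value is one character code (true for ASCII text).
text = "".join(chr(unit) for unit in code_units)

print(text)
```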
You can get the same listing for f2.mat and compare the two outputs with the tool of your choice.
Comparison also works directly with HDF5 Tools. To compare the two numbers a from both files use
h5diff -r f1.mat f2.mat /a
which will show you the values and their difference
dataset: </a> and </a>
size: [1x1] [1x1]
position        a               a               difference
------------------------------------------------------------
[ 0 0 ]         6               7               1
1 differences found
attribute: <MATLAB_class of </a>> and <MATLAB_class of </a>>
0 differences found
Remarks
There are a few more commands and options in the HDF5 Tools, which may help to get your real problem solved.
Binary distributions are available for Linux and Windows from The HDF Group. For OS X you can get them installed via MacPorts. If needed there is also a GUI: HDFView.
If you have Simulink, you can use Simulink.saveVars to generate an m-file that, when executed, creates the same variables in the workspace:
a = 6;
b.c.d = 7;
b.c.e = 'x';
Simulink.saveVars('f1');
f = a;
clear a
b.c.e = 'y';
Simulink.saveVars('f2');
visdiff('f1.m','f2.m')
as illustrated in this screenshot.
Note that by default it limits the number of elements in arrays to 1000 and you can increase it to 10000. Arrays larger than that limit will be saved in a separate mat-file.
UPDATE: As of R2014a, a new function similar to Simulink.saveVars has been added to MATLAB. See matlab.io.saveVariablesToScript.
This is only part of the answer, but maybe it helps.
You could use gencode, a MATLAB function that generates MATLAB code from a variable such that running the code reproduces the variable. You do this for all of the variables in each mat-file (this takes some programming, but should be doable) and put the results in different .m-files.
Then you use a standard text comparison tool (maybe even visdiff) to compare the .m-files.
There are several good tools for comparing XML files, so I would proceed this way:
Download struct2xml.m
Load both matfiles
Export each with struct2xml
compare, using XMLSpy or similar
Simple general answer, without displaying value differences
Thanks to the insight I gained from the answers of @BHF, @Daniel R and @Dennis Jaheruddin, I have managed to find a simple scalable solution:
[fs1, fs2, er] = comp_struct(load('f1.mat'),load('f2.mat'))
Note that it works for .mat files containing an arbitrary number of variables.
This uses the Compare Structures - File Exchange submission.
Answer for small files, displaying all value differences
Based on the suggestion by @A. Donda, I have tried to use gencode to create a variable for everything.
Though it works for my toy example, it is quite slow and, for my real .mat files, tells me that I exceed the allowed number of variables.
Anyway, for those who are looking for something that works with small files, I will post this option:
wList=who;
for iLoop = 1:numel(wList)
    eval(['generated_' wList{iLoop} '= gencode(' wList{iLoop} ');'])
    for jLoop = 1:numel(eval(['generated_' wList{iLoop}]))
        eval(['generated_' wList{iLoop} '_' num2str(jLoop) '= generated_' wList{iLoop} '(' num2str(jLoop) ');' ])
    end
end
Though it may work, I don't feel like this is the best way to go.
General answer, without displaying value differences
Thanks to the insight I gained from the answers of @BHF and @Daniel R, I have managed to find a reasonably scalable solution.
Step 1: Save all variables from each file as a single struct
This uses the Save workspace to struct - File Exchange submission.
Here are the steps to take assuming you want to compare f1.mat and f2.mat:
clear
load f1
myStruct1 = ws2struct;
save myStruct1 myStruct1
clear
load f2
myStruct2 = ws2struct;
save myStruct2 myStruct2
clear
load myStruct1
load myStruct2
Step 2: Compare the structs
This uses the Compare Structures - File Exchange submission
Given that you want to compare myStruct1 and myStruct2 you can simply call:
[fs1, fs2, er] = comp_struct(myStruct1,myStruct2)
I was positively surprised at how readable the list of differences in er is. Here is the output for the example that was used in the question:
er =
's2 is missing field a'
's1(1).b(1).c(1).e and s2(1).b(1).c(1).e do not match'
Note that it does not show values. From a technical point of view, it is probably not too hard to change the m-file if displaying value differences is desirable. However, especially with big matrices, I suppose this could result in problematic output.
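To illustrate what a value-reporting comparison could look like, here is a small Python sketch of a recursive comparison. It is not the comp_struct implementation: nested dicts stand in for MATLAB structs, and the function name diff_structs and the message format are invented for this example.

```python
def diff_structs(s1, s2, path=""):
    """Recursively compare two nested dicts and list the differences."""
    messages = []
    for key in sorted(set(s1) | set(s2)):
        where = f"{path}.{key}" if path else key
        if key not in s1:
            messages.append(f"s1 is missing field {where}")
        elif key not in s2:
            messages.append(f"s2 is missing field {where}")
        elif isinstance(s1[key], dict) and isinstance(s2[key], dict):
            # Descend into nested structures and keep track of the path.
            messages.extend(diff_structs(s1[key], s2[key], where))
        elif s1[key] != s2[key]:
            # Report both values, which is the part the output above lacks.
            messages.append(f"{where}: {s1[key]!r} != {s2[key]!r}")
    return messages


# Nested dicts standing in for the contents of f1.mat and f2.mat from the question.
f1 = {"a": 6, "b": {"c": {"d": 7, "e": "x"}}}
f2 = {"f": 6, "b": {"c": {"d": 7, "e": "y"}}}

for message in diff_structs(f1, f2):
    print(message)
```

For large arrays one would probably print a summary (shape, number of differing elements) instead of the raw values, in line with the concern about big matrices.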