I have a text file like the one below, and I would like to extract the accession IDs ("GSE????" or "GSE**") with Talend. I tried "tPatternextract", but it does not seem to work in Talend 7.1. Is there a way to extract all text that matches a pattern?
Best,
Xinhui
Integrated analysis of DNA methylation and gene expression profiles identified S100A9 as a potential biomarker in ulcerative colitis
(Submitter supplied) In this research, 90 differential expression mRNAs (DEMs).
Organism: Homo sapiens
Type: Expression profiling by array; Non-coding RNA profiling by array
Platform: GPL20115 6 Samples
FTP download: GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE160nnn/GSE160804/
Series Accession: GSE160804 ID: 200160804
Induced organoids derived from patients with ulcerative colitis recapitulate the colitic reactivity
(Submitter supplied) We report the application of single nucleus RNA-seq.
Organism: Homo sapiens
Type: Expression profiling by high throughput sequencing
Platform: GPL24676 11 Samples
FTP download: GEO (MTX, TSV) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE152nnn/GSE152999/
SRA Run Selector: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA641142
Series Accession: GSE152999 ID: 200152999
Use a tFilterRow
In the Component tab, click "Use advanced mode" and enter this condition:
input_row.columnName1.startsWith("AA12")
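The condition above only tests how a row starts. To show what pulling out the IDs themselves could look like, here is a minimal sketch in Python, outside Talend. It assumes an accession ID is "GSE" followed only by digits; the names ACCESSION_RE and extract_accessions are made up for this illustration.

```python
import re

# Assumed pattern: the literal "GSE" followed by one or more digits,
# bounded as a whole word so that "GSE160nnn" in the FTP path is skipped.
ACCESSION_RE = re.compile(r"\bGSE\d+\b")

def extract_accessions(text):
    """Return the unique accession IDs in order of first appearance."""
    return list(dict.fromkeys(ACCESSION_RE.findall(text)))

sample = """FTP download: GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE160nnn/GSE160804/
Series Accession: GSE160804 ID: 200160804
Series Accession: GSE152999 ID: 200152999"""

print(extract_accessions(sample))
```

The same regular expression should carry over to any Talend component that accepts a Java regex, since `\bGSE\d+\b` uses only syntax common to both engines.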
In http://en.wikipedia.org/wiki/Addressing_mode
Indexed absolute
+------+-----+-----+--------------------------------+
| load | reg |index| address |
+------+-----+-----+--------------------------------+
(Effective address = address + contents of specified index register)
Note that this is more or less the same as base-plus-offset addressing mode, except that the offset in this case is large enough to address any memory location.
I still don't understand the difference between an offset and an index. And what is the difference between the base-plus-offset addressing mode and the indexed absolute addressing mode?
Thanks.
Offset is an absolute number of bytes. So if address = 0x1000 and offset = 0x100 then the effective address = 0x1000 + 0x100 = 0x1100.
Index is an offset that is multiplied by a constant (the element size). So if address = 0x1000, index = 0x100 and size of element = 4, then the effective address = 0x1000 + 0x100*4 = 0x1400. You would use this when indexing into an array of 32-bit values.
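The two calculations can be put side by side in a short sketch. The function names are invented for illustration, and the values are the ones used in this answer.

```python
def effective_address_offset(address, offset):
    # Offset: a plain byte count added to the address.
    return address + offset

def effective_address_indexed(address, index, element_size):
    # Index: an element count, scaled by the element size before adding.
    return address + index * element_size

print(hex(effective_address_offset(0x1000, 0x100)))      # byte offset
print(hex(effective_address_indexed(0x1000, 0x100, 4)))  # index into 32-bit elements
```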
To me, the address+index example sounds like the x86 LEA instruction:
http://www.cs.virginia.edu/~evans/cs216/guides/x86.html#instructions
With that said, when I read the Wikipedia article, I can't see a difference between "Indexed absolute", "Base plus index", and "Scaled". They look like exactly the same thing, except that the terms "address" and "base" are interchanged. This looks to me like too many authors writing the same thing over again. If this response gets enough upvotes, I'm editing the article. :-)
I'm trying to unpack a binary vector of 140 million bits into a list.
I'm checking the memory usage of this function, and it looks weird: the memory usage rises to 35GB (GB, not MB). How can I reduce the memory usage?
sub bin2list {
# This sub translates a binary vector to a list of "1","0"
my $vector = shift;
my @unpacked = split //, (unpack "B*", $vector);
return @unpacked;
}
Scalars contain a lot of information.
$ perl -MDevel::Peek -e'Dump("0")'
SV = PV(0x42a8330) at 0x42c57b8
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x42ce670 "0"\0
CUR = 1
LEN = 16
In order to keep them as small as possible, a scalar consists of two memory blocks[1], a fixed-sized head, and a body that can be "upgraded" to contain more information.
The smallest type of scalar that can contain a string (such as the ones returned by split) is a SVt_PV. (It's usually called PV, but PV can also refer to the name of the field that points to the string buffer, so I'll go with the name of the constant.)
The first block is the head.
ANY is a pointer to the body.
REFCNT is a reference count that allows Perl to know when the scalar can be deallocated.
FLAGS contains information about what the scalar actually contains. (e.g. SVf_POK means the scalar contains a string.)
TYPE contains information about the type of scalar (what kind of information it can contain).
For an SVt_PV, the last field points to the string buffer.
The second block is the body. The body of an SVt_PV has the following fields:
STASH is not used in the scalars in question since they're not objects.
MAGIC is not used for the scalars in question. Magic allows code to be called when the variable is accessed.
CUR is the length of the string in the buffer.
LEN is the length of the string buffer. Perl over-allocates to speed up concatenation.
The block on the right is the string buffer. As you might have noticed, Perl over-allocates. This speeds up concatenation.
Ignore the block on the bottom. It's an alternative to the string buffer format for special strings (e.g. hash keys).
To how much does that add up?
$ perl -MDevel::Size=total_size -E'say total_size("0")'
28 # 32-bit Perl
56 # 64-bit Perl
That's just for the scalar itself. It doesn't take into account the overhead in the memory allocation system of three memory blocks.
These scalars are in an array. An array is really just a scalar.
So an array has overhead.
$ perl -MDevel::Size=total_size -E'say total_size([])'
56 # 32-bit Perl
64 # 64-bit Perl
That's an empty array. You have 140 million of the scalars in yours, so it needs a buffer that can contain 140 million pointers. (In this particular case, the array won't be over-allocated, at least.) Each pointer is 4 bytes on a 32-bit system, 8 on a 64-bit system.
That brings the total up to:
32-bit: 56 + (4 + 28) * 140,000,000 = 4,480,000,056
64-bit: 64 + (8 + 56) * 140,000,000 = 8,960,000,064
That doesn't factor in the memory allocation overhead, but it's still very different from the numbers you gave. Why? Well, the scalars returned by split are actually different than the scalars inside the array. So for a moment, you actually have 280,000,000 scalars in memory!
The rest of the memory is probably held by lexical variables in subs that aren't currently executing. Lexical variables aren't normally freed on scope exit since it's expected that the sub will need the memory the next time it's called. That means bin2list continues to use up 140MB of memory after it exits.
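The estimate above can be recomputed with a few lines of Python. This is only the arithmetic from this answer, using the sizes reported by Devel::Size. The function name array_estimate is made up for the sketch, and allocator overhead is still ignored.

```python
def array_estimate(array_overhead, pointer_size, scalar_size, count):
    # Empty-array overhead, plus one pointer and one scalar per element.
    return array_overhead + (pointer_size + scalar_size) * count

COUNT = 140_000_000  # number of bits unpacked into individual scalars

print(f"32-bit: {array_estimate(56, 4, 28, COUNT):,}")
print(f"64-bit: {array_estimate(64, 8, 56, COUNT):,}")
```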
Footnotes
Scalars that are undefined can get away without a body until a value is assigned to them. Scalars that contain only an integer can get away without allocating a memory block for the body by storing the integer in the same field as a SVt_PV stores the pointer to the string buffer.
The images are from illguts. They are protected by Copyright.
A single integer value in Perl is going to be stored in an SVt_IV or SVt_UV scalar, whose size will be four machine-sized words - so on a 32bit machine, 16 bytes. An array of 140 million of those, therefore, is going to consume 2.2 billion bytes, presuming it is densely packed together. Add to that the SV * pointers in the AvARRAY used to reference them and we're now at 2.8 billion bytes. Now double that, because you copied the array when you returned it, and we're now at 5.6 billion bytes.
That of course was on a 32bit machine - on a 64bit machine we're at double again, so 11.2 billion bytes. This presumes totally dense packing inside the memory - in practice this will be allocated in stages and chunks, so RAM fragmentation will further add to this. I could imagine a total size around the 35 billion byte mark for this. It doesn't sound outlandishly unreasonable.
For a very easy way to massively reduce the memory usage (not to mention CPU time required), rather than returning the array itself as a list, return a reference to it. Then a single reference is returned rather than a huge list of 140 million SVs; this avoids a second copy also.
sub bin2list {
# This sub translates a binary vector to a list of "1","0"
# and returns a reference to that list
my $vector = shift;
my @unpacked = split //, (unpack "B*", $vector);
return \@unpacked;
}
I am trying to investigate an access violation issue in code. As you can see, some of the address values contain an apostrophe-like character (as in 7fb`80246000).
0:000> !address -summary
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free 424 7fb`80246000 ( 7.982 Tb) 99.78%
The tick mark (grave accent) is just used to separate the lower 4 bytes from the higher 4 bytes of a 64-bit number.
7fb`80246000
is the same as
0x7fb80246000
It is purely for visual aesthetics (making the value easier to parse by humans).
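If such a value needs to be processed outside the debugger, the separator can simply be dropped before parsing. A minimal sketch (the helper name parse_windbg_address is hypothetical):

```python
def parse_windbg_address(text):
    # Remove the grave accent separator, then parse the rest as hexadecimal.
    return int(text.replace("`", ""), 16)

value = parse_windbg_address("7fb`80246000")
print(hex(value))
print(round(value / 2**40, 3))  # size in units of 2**40 bytes
```

The second line of output matches the 7.982 Tb shown in the !address -summary output in the question.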
In windbg, I can list loaded modules with lm.
How can I find the memory footprint of those assemblies?
I'm analyzing a dump of a process suspected of using too much memory, and one thing I'm noticing is the number of assemblies, but not sure what's the size they occupy in memory.
Also, they don't seem to be in contiguous memory positions. Or are they if I sort lm's output some way?
Thanks!
The !address -summary gives you a good overview.
Check the Image row
0:008> !address -summary
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free 212 b13cb000 ( 2.769 Gb) 69.23%
Heap 455 25281000 ( 594.504 Mb) 47.18% 14.51%
<unknown> 861 2168d000 ( 534.551 Mb) 42.42% 13.05%
Image 662 4e8e000 ( 78.555 Mb) 6.23% 1.92%
Stack 156 3400000 ( 52.000 Mb) 4.13% 1.27%
Other 39 54000 ( 336.000 kb) 0.03% 0.01%
TEB 52 34000 ( 208.000 kb) 0.02% 0.00%
PEB 1 1000 ( 4.000 kb) 0.00% 0.00%
You can check each module's size by using lmvm module_name. There is an ImageSize output indicating the hexadecimal size of that module.
Edited: Another way is to first run lm to show all modules, and then use !lmi start_address or !lmi module_name to get information about a specific module. !lmi has a Size field that indicates the image size.
Note that for .NET 4 native images loaded, you have to use !lmi start_address, as module name resolution fails.
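Since these sizes are reported in hexadecimal bytes, a small conversion helps when comparing them. The sketch below uses the Image and Stack values from the summary above; the helper name hex_size_to_mb is made up for the example.

```python
def hex_size_to_mb(hex_text):
    # Parse a hexadecimal byte count and convert it to units of 2**20 bytes.
    return int(hex_text, 16) / (1024 * 1024)

print(f"Image: {hex_size_to_mb('4e8e000'):.3f} Mb")
print(f"Stack: {hex_size_to_mb('3400000'):.3f} Mb")
```

Summing the converted sizes of the individual modules of interest would give an estimate of their combined footprint.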