Analyzing and storing text in a data structure - text-processing

I hope you understand what I want to do. It is hard to choose the best words, because English is not my first language and I distrust automatic translators. I will try to explain as well as I can.
I was thinking about analyzing a long text. Suppose, for example, that I have a string divided into paragraphs.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla vitae elit libero, a pharetra augue. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras mattis consectetur purus sit amet fermentum.
Duis mollis, est non commodo luctus, nisi erat porttitor ligula, eget lacinia odio sem nec elit. Aenean eu leo quam. Pellentesque ornare sem lacinia quam venenatis vestibulum. Cras justo odio, dapibus ac facilisis in, egestas eget quam. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur blandit tempus porttitor. Maecenas sed diam eget risus varius blandit sit amet non magna.
I would like to store this string in an array or a similar structure, in such a way that I can find the length or location of the two paragraphs very quickly. For example (pseudocode):
Array => {
paragraphs => {
"Lorem ipsum dolor sit amet, [...] fermentum.",
...
}
}
I don't really know whether this has a name. I suppose there is a lot of theory about how to do this type of task. I am particularly interested in practices that take performance into account when processing a large amount of text. I would like to have something to study and read carefully.
Any help would be much appreciated. Thanks in advance,
—Alberto

Perhaps read up on Apache UIMA; it's all about analyzing unstructured information, with text analysis being a major component of it.
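As a rough illustration of the kind of structure the question asks for, here is a minimal Python sketch (not taken from UIMA). It assumes that each non-empty line is one paragraph and that it is enough to store character offsets rather than copies of the text. The names Span and index_paragraphs are hypothetical and used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Half-open character range [start, end) into the original text."""
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


def index_paragraphs(text: str) -> list[Span]:
    """Record where each non-empty line (treated as a paragraph) starts and ends."""
    spans = []
    pos = 0
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        if body.strip():
            spans.append(Span(pos, pos + len(body)))
        pos += len(line)  # advance past the line, including its line ending
    return spans


text = "Lorem ipsum dolor sit amet.\nDuis mollis, est non commodo luctus."
paragraphs = index_paragraphs(text)
second = paragraphs[1]
print(second.start, second.length)    # location and length without rescanning
print(text[second.start:second.end])  # the paragraph itself, sliced on demand
```

The text is scanned once; afterwards the location and length of any paragraph are plain list lookups, and the original string is never duplicated. The same idea extends to sentences or words by adding further lists of spans.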

Related

Dart: Is there a way to split strings into sentences without using Dart's split method?

I'm looking to split a paragraph of text into individual sentences using Dart. The problem I am having is that sentences can end in a number of punctuation marks (e.g. '.', '!', '?') and in some cases (such as the Japanese language), sentences can end in unique symbols (e.g. '。').
Additionally, Dart's split method removes the split value from the string. For example, "Hello World!" becomes "Hello World" when using the code text.split('! ');
I've looked around at Dart packages available but I'm unable to find anything that does what I'm looking for.
Ideally, I'm looking for something similar to BreakIterator in Java which allows the programmer to define which locale they wish to use when detecting punctuation and also maintains the punctuation mark when splitting the string into sentences. I'm happy to use a solution in Dart that doesn't automatically detect sentence endings based on Locale but if this isn't available I would like to have the ability to define all sentence endings to look for when splitting a string.
Any help is appreciated. Thank you in advance.
It can be done with a regular expression, something like this:
String str1 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. In vulputate odio eros, sit amet ultrices ipsum auctor sed. Mauris in faucibus elit. Nulla quam orci? ultrices a leo a, feugiat pharetra ex. Nunc et ipsum lorem. Integer quis congue nisi! In et sem eget leo ullamcorper consectetur dignissim vitae massa。Nam quis erat ac tellus laoreet posuere. Vivamus eget sapien eget neque euismod mollis.";
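// Regular expression: a run of word characters, whitespace, commas or
// apostrophes, then optional sentence-ending punctuation and trailing whitespace.
final re = RegExp(r"(\w|\s|,|')+[。.?!]*\s*");
// Get all the matches:
final matches = re.allMatches(str1);
// Iterate over all matches:
for (final m in matches) {
  // group(0) is the whole match, including its punctuation mark.
  final match = m.group(0);
  print("match: $match");
}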
output:
// match: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
// match: In vulputate odio eros, sit amet ultrices ipsum auctor sed.
// match: Mauris in faucibus elit.
// match: Nulla quam orci?
// match: ultrices a leo a, feugiat pharetra ex.
// match: Nunc et ipsum lorem.
// match: Integer quis congue nisi!
// match: In et sem eget leo ullamcorper consectetur dignissim vitae massa。
// match: Nam quis erat ac tellus laoreet posuere.
// match: Vivamus eget sapien eget neque euismod mollis.
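For comparison, the same idea can be sketched in Python. This is an illustration of the technique only, not a Dart API. The set of terminators is an assumption to be extended as needed, and split_sentences is a hypothetical helper name. Unlike the pattern above, it accepts any character that is not a terminator inside a sentence, so digits, quotes and non-Latin text do not cut a match short.

```python
import re

# Sentence terminators to recognise; this set is an assumption, extend as needed.
TERMINATORS = ".!?。"

# A sentence is a run of non-terminator characters followed by any terminators.
# A trailing fragment without a terminator is kept as well.
_SENTENCE = re.compile(
    rf"[^{re.escape(TERMINATORS)}]+[{re.escape(TERMINATORS)}]*"
)


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping each sentence's punctuation mark."""
    pieces = (m.group(0).strip() for m in _SENTENCE.finditer(text))
    return [p for p in pieces if p]


sample = "Nulla quam orci? Integer quis congue nisi! Vitae massa。Nam quis erat."
for sentence in split_sentences(sample):
    print(sentence)
```

A purely punctuation-based splitter like this will also break after abbreviations such as "e.g.", so locale-aware segmentation (what BreakIterator provides in Java) is still the more robust option where it is available.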

Is it possible to crop an Element using Elm's std library? If not, is there a 3rd party lib for this?

I often run across a situation where I've constructed an Element or Form and wish to crop the view down to a given area (i.e. for scrolling within a smaller rectangle) though I haven't been able to find any methods for this within their respective modules.
Is it possible to do this using Elm's std library? If not, are there any 3rd-party libraries capable of doing this?
Otherwise, perhaps there is a better way of achieving this?
Any help or suggestions appreciated!
No scrollbars (using the std library)
I can't find a way to crop and still have scrollbars with the current Graphics.Element. What is possible is to crop without scrollbars, either through a container that's smaller than its contents or by resizing an element with size. I think the container way is more robust, as resizing an image will actually warp the image.
Here's an example:
import Graphics.Element exposing (..)
import Text
string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam viverra nec consectetur ante hendrerit. Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ut gravida lorem. Ut turpis felis, pulvinar a semper sed, adipiscing id dolor. Pellentesque auctor nisi id magna consequat sagittis. Curabitur dapibus enim sit amet elit pharetra tincidunt feugiat nisl imperdiet. Ut convallis libero in urna ultrices accumsan. Donec sed odio eros. Donec viverra mi quis quam pulvinar at malesuada arcu rhoncus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. In rutrum accumsan ultricies. Mauris vitae nisi at sem facilisis semper ac in est."
main : Element
main =
  let element =
        leftAligned (Text.fromString string)
        |> container 400 300 topLeft
  in container 205 200 topLeft element
Scrollbars (via "3rd party" library)
If you want scrollbars, you'll probably need at least a little bit of html from elm-html. Note that the author of the library is also the author of Elm, so it's not quite 3rd party :P . You can keep it minimal by using conversions to html and from html and wrapping it in a div with style attributes that define the smaller size and the right overflow property. As long as that div has a known size, it should be easy to convert back to an Element.

Use of sections within a module group in doxygen

I am looking for the preferred way to structure the contents of a doxygen module group. For example, I want to split the #details text in the following module group into different sections. In particular, each of the sections should appear in the bookmarks of the generated PDF (as child elements of the module group):
#defgroup lorem
#{
#brief
Lorem ipsum
#details
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum
ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu
libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu
neque.
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis
egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum
urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat.
Integer sapien est, iaculis in, pretium quis, viverra ac, nunc.
Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla,
malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla.
Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend,
sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.
#}
A way could be using #section #subsection etc, but the doxygen manual says:
Warning:
This command only works inside related page documentation
and not in other documentation blocks!
Is it possible to use #section or are there other (better) ways to do this?
Edit:
The behavior using #section seems indeed odd, for example I tried something like this:
#defgroup lorem
#{
#brief
Lorem ipsum
#section sec0 Lorem ipsum
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum
ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu
libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu
neque.
#section sec1 Pellentesque
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis
egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum
urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat.
#section sec2 Integer sapien est
Integer sapien est, iaculis in, pretium quis, viverra ac, nunc.
Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla,
malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla.
Donec varius orci eget risus.
Duis nibh mi, congue eu, accumsan eleifend,
sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.
#}
The result looks fine and has in this case the following structure in the PDF:
4.1 lorem
4.1.1 Lorem ipsum
4.1.2 Pellentesque
4.1.3 Integer sapien est
Now, if I add #file, a #details section containing no text appears in the output, and I can't get rid of it. It looks like this (I also tested adding another group containing a #file, with the same result):
4.1 lorem
4.1.1 Detailed Description
4.1.2 Lorem ipsum
4.1.3 Pellentesque
4.1.4 Integer sapien est
Then I tried to move the sections and make them subsections of the "Detailed Description", which would be logically okay. But when I change them to #subsection commands, they completely disappear. Doxygen warns in this case that it has found a subsection outside of a section context, so obviously it doesn't realize that it generated this mysterious empty #details section.
My next idea was to use the Markdown support. In this case the sections aren't put into the PDF bookmarks, so it looks okay, but in the LaTeX code they are still on the same level as the #details section. And subsections in Markdown disappear as well. I have no idea what is going on, but I can't be the first person trying to structure things in a module group.
I tried your example myself and if you put the #file between your #{ ... #} like this,
/**
#defgroup lorem
#{
...
#file
#}
*/
then the standard Doxygen Layout is created.
If you move the #file outside the group, like this, then it should work.
/**
#defgroup lorem
#{
...
#}
#file
*/
However, if you really need the #file within your #defgroup lorem #{ ... #}, there are two ways to achieve it.
FIRST:
/**
#defgroup file
#{
#file
#defgroup lorem
#{
...
#}
#}
*/
SECOND:
Change the standard doxygen layout.
To do this, follow the manual's description, beginning at "Changing the layout of pages", and create your custom DoxygenLayout.xml.
Now, edit the DoxygenLayout.xml and search for the <group> tag.
There you will find a <detaileddescription title=""/> tag. Change it to <detaileddescription visible="no" title=""/> and the "Detailed Description" that is bugging you should vanish.
If you simply want to structure the sections within a group description, then I'd simply use HTML and insert <h1>Section name here</h1> etc.
Alternatively Markdown ## section marking may work.
I don't know how well these percolate to LaTeX and PDF, as I've never used that output route. However, I'm confident that using HTML will give more reliable results than using #section which, as the manual warns, is simply not designed for use here but, instead, as part of the #page / #section / #subsection hierarchy for bulk prose. These command sets do not mix well in Doxygen.
Locating #file in exotic positions (as per aldr's suggestion) is prone to cause very strange results. It's simply the description block for the physical file, and is best used as just that.

How do I compose format=flowed emails that include hanging indents with vim?

Is there a good way to configure vim to send format=flowed emails that include hanging indents?
My complete vimrc (for testing purposes) is:
set nocompatible
set fo+=awn
set tw=72
set ai
I'm typing something like:
1. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam
posuere dui lorem, et condimentum nulla. Sed pharetra justo nec ante
fringilla non mattis nisi blandit. Donec molestie ligula dolor.
Nulla facilisi. Aliquam vel nulla elit, mollis facilisis metus. Sed
id eros a ante blandit convallis id sit amet elit. Duis malesuada
lobortis leo a placerat. Sed ut ipsum nisl. Sed pretium mauris vitae
velit sollicitudin iaculis.
vim adds a trailing space to each line except the last, per set fo+=w. It also adds spaces for the hanging indent. It looks great!
My mail client sets the format=flowed header. The result when this email is viewed in either Mail.app or mutt is not pretty:
1. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam posuere dui lorem, et condimentum nulla. Sed pharetra justo nec ante fringilla non mattis nisi blandit. Donec molestie ligula dolor. Nulla facilisi. Aliquam vel nulla elit, mollis facilisis metus. Sed id eros a ante blandit convallis id sit amet elit. Duis malesuada lobortis leo a placerat. Sed ut ipsum nisl. Sed pretium mauris vitae velit sollicitudin iaculis.
The paragraph wraps correctly, in the sense that resizing the reader client reflows it (which is not what you'll see here on stackoverflow, but you get the idea). The problem is, there are 5 spaces between "Etiam" and "posuere" and all the other lines that have been joined back together.
Is there a fix for this in vim? Or is this a limitation of the format=flowed spec? How do other people handle this?
This is a limitation of the "format=flowed" MIME parameter as specified in RFC 3676. There is nothing in the specification that would allow a client to recognize the leading spaces as ornaments intended only for plaintext versions of the mail.
Section 4.1 of the RFC states:
If the first character of a line is a space, the line has been space-stuffed (see Section 4.4). Logically, this leading space is deleted before examining the line further (that is, before checking for flowed).
The referenced "space-stuffing" from Section 4.4:
Space-stuffing adds a single space to the start of any line which needs protection when the message is generated. On reception, if the first character of a line is a space, it is logically deleted. This occurs after the test for a quoted line (which logically counts and deletes any quote marks), and before the test for a flowed line.
So an RFC 3676-compliant mail client would remove a single leading space from each line that begins with one, and then (optionally) remove any line breaks that follow a trailing space character. This process would not touch the remaining leading whitespace, which is why the rest of the hanging indent shows up in the middle of the reflowed paragraph.
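The effect can be reproduced with a small Python sketch of the receiving side. This is a simplified reading of the two rules quoted above, not a complete RFC 3676 implementation: quoted lines, the "-- " signature separator and the DelSp parameter are deliberately left out, and unflow is a hypothetical helper name.

```python
def unflow(lines: list[str]) -> list[str]:
    """Join flowed lines into paragraphs (simplified RFC 3676, DelSp=no)."""
    paragraphs = []
    current = ""
    for line in lines:
        # Space-stuffing: exactly one leading space is logically deleted.
        if line.startswith(" "):
            line = line[1:]
        current += line
        # A trailing space marks a soft break, so the paragraph continues.
        if not line.endswith(" "):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs


# Two lines as vim would write them with fo+=awn: a trailing space on the
# first line and a three-space hanging indent on the second.
wrapped = [
    "1. Lorem ipsum dolor sit amet, ",
    "   consectetur adipiscing elit.",
]
print(unflow(wrapped))
```

Only one of the three indent spaces is consumed as space-stuffing. The other two survive and end up next to the trailing space of the previous line, which produces the run of spaces seen in the reflowed message.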

Can a pixbuf inserted into a GTK+ text buffer be set as "floating"?

I'm writing an application [a Pidgin plugin, actually], which inserts an image embedded into a GtkTextBuffer. Currently, I add it using:
gtk_text_buffer_insert_pixbuf(textBuffer, &iter, pixbuf);
However, this just puts the image "inline" with the text. What I'm looking for is something similar to HTML's "float". For example, assuming my image is about twice the height of a line of text, I currently get this [where X is the image]:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam gravida
XXXX
XXXX ante in massa dignissim aliquam. Nullam tempus quam luctus eros volutpat laoreet.
XXXX
XXXX sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
Mauris semper, nunc quis gravida molestie,
leo neque imperdiet nulla, vel consectetur nisi nisl non metus. Maecenas pharetra
magna nec magna mattis faucibus convallis nibh
Ideally, I'd like to have:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam gravida
XXXX ante in massa dignissim aliquam. Nullam tempus quam luctus eros volutpat laoreet.
XXXX
XXXX sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
XXXX Mauris semper, nunc quis gravida molestie,
leo neque imperdiet nulla, vel consectetur nisi nisl non metus. Maecenas pharetra
magna nec magna mattis faucibus convallis nibh
Note that there are four paragraphs, where the second and third have an image in the beginning.
Is this possible?
The short answer is no; images in TextView are just treated as a character (which may be a lot bigger than a usual character). There isn't any layout engine in the HTML sense. (Layout is limited to what PangoLayout can do.)
You could probably hack something together, using an approach such as:
leave a margin the size of the image on your paragraph
add an expose event handler to paint the image to the window (see the "border windows" examples, which I think are in gtk-demo or somewhere in the docs, but draw to the main window rather than to the border windows)
Some amount of work, but it would probably get the job done.