Extract a number from the first line of an .rtf file - numbers

I would need to extract a number from the first line of a .rtf document.
The first line is:
1 of 51 DOCUMENTS
I need to extract the number 51.
How can I do this? (Probably an easy question, but I am not good at text analysis yet.)
Thank you in advance,
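One way to approach this, sketched in Python on the assumption that any scripting language is acceptable (the question does not name one). A raw .rtf file normally begins with control words such as {\rtf1, so the sketch searches the text for the "N of M DOCUMENTS" phrase instead of relying on it being the literal first line. The function name extract_total is hypothetical.

```python
import re

def extract_total(text):
    """Return M from the first 'N of M DOCUMENTS' phrase, or None if absent."""
    match = re.search(r"\d+\s+of\s+(\d+)\s+DOCUMENTS", text)
    return int(match.group(1)) if match else None

# Works on the plain phrase and on text that still carries RTF control words.
print(extract_total("1 of 51 DOCUMENTS"))
print(extract_total(r"{\rtf1\ansi 1 of 51 DOCUMENTS\par}"))
```

To apply it to a file, read the beginning of the file as text (for example with open(path, errors="ignore")) and pass that string to the function.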

Related

Extracting data from old text file into usable format

I have some data in a text file in the following format:
1079,40,011,1,301 17,310 4,668 6,680 1,682 1,400 7,590 2,591 139,592 332,565 23,568 2,569 2,595 1,471 1,470 10,481 12,540 117,510 1,522 187,492 9,533 41,558 15,555 12,556 9,558 27,546 1,446 1,523 4000,534 2000,364 1,999/
1083,40,021,1,301 4,310 2,680 1,442 1,400 2,590 2,591 90,592 139,595 11,565 6,470 2,540 66,522 4,492 1,533 19,546 3,505 1,523 3000,534 500,999/
These examples represent what would be two rows in a spreadsheet. The first four values (in the first example, "1079,40,011,1") each go into their own column. The rest of the data are in a paired format, first listing a name of a column, designated by a number, then a space followed by the value that should appear in that column. So again, example: 301 17,310 4,668 6: in this row, column 301 has a value of 17, column 310 has value of 4, column 668 has value of 6, etc. Then 999/ indicates an end to that row.
Any suggestions on how I can transform this text file format into a usable spreadsheet would be greatly appreciated. There are thousands of "rows", so I can't convert them manually, and I don't possess the coding skills to execute such a transformation myself.
This is messy but since there is a pattern it should be doable. What software are you using?
My first idea would be to identify where the delimiter changes from comma to space. Is it based on a fixed width, like always after 14 characters? Or is it based on the delimiter, like it is always after the 4th comma?
Once you've done that, you could make two passes at the data. The first pass imports the first four values from the beginning of the line which are separated by comma. The second pass imports the remaining values which are separated by space.
If you include a row number when importing you can then use it to join first and second passes at importing.
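If a short script is an option, both passes can be done in one step. The following is a minimal Python sketch, assuming every row has exactly four leading values and ends with 999/ as in the samples; the function name parse_row is hypothetical.

```python
def parse_row(line):
    """Split one row into its four leading values and a {column: value} dict."""
    parts = line.strip().split(",")
    fixed = parts[:4]                    # first four comma-separated values
    columns = {}
    for pair in parts[4:]:
        if pair == "999/":               # end-of-row marker
            break
        col, value = pair.split(" ", 1)  # "301 17" -> column "301", value "17"
        columns[col] = value             # a repeated column keeps its last value
    return fixed, columns

fixed, columns = parse_row("1083,40,021,1,301 4,310 2,680 1,999/")
print(fixed)
print(columns)
```

Note that the first sample row lists column 558 twice, so it is worth deciding how duplicates should be handled before writing the result out, for example with csv.DictWriter.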

Is there a way for spark to read this odd text format?

The file format I have is sort of like csv and looks like this (abinitio .dat file of some sort):
1,apple,10.00,\n
2,banana,12.35,\n
3,orange,9.23,\n
The commas are actually "Start of Header" 0x01 byte characters, but I will use commas for simplicity. I can easily read the above sample by reading the file as a string RDD with a custom line split ,\n and then passing that into spark.read.csv. I am currently splitting lines by ,\n because there may be newlines in the data and I thought that those two characters were unique for each record. However a problem occurs when there are newline characters at the start of text fields. For example:
1,one \n apple,10.00,\n
2,two banana,12.35,\n
3,\n three orange,9.23,\n
My current code is able to ignore the newline in record 1 but picks up the ,\n after the 3 and splits the 3 lines into 4. How can I reliably read in this format?
My current ideas are:
Check that there is the right number of , column delimiters before allowing a split. I am not sure how to implement this. Is it possible to do a regex lookbehind when Spark sees a ,\n and check for the correct number of delimiters?
Try to coerce the file into some other format besides CSV
Make my own InputFormatClass, although I am not sure what this entails.
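The first idea (counting delimiters before accepting a record boundary) can be sketched outside Spark in plain Python. This assumes the field count per record is known, that the 0x01 delimiter never occurs inside a field, and that the whole input fits in memory; split_records is a hypothetical name, not a Spark API.

```python
def split_records(text, n_fields, delim="\x01"):
    """Group delimiter-separated tokens into records of n_fields each."""
    tokens = text.split(delim)
    # The token after the last delimiter is only the final newline (or empty).
    if tokens and tokens[-1].strip("\r\n") == "":
        tokens.pop()
    if len(tokens) % n_fields != 0:
        raise ValueError("token count is not a multiple of the field count")
    records = []
    for i in range(0, len(tokens), n_fields):
        row = tokens[i:i + n_fields]
        # The newline that ended the previous record is glued to this token.
        if i > 0 and row[0].startswith("\n"):
            row[0] = row[0][1:]
        records.append(row)
    return records

sample = ("1\x01one \n apple\x0110.00\x01\n"
          "2\x01two banana\x0112.35\x01\n"
          "3\x01\n three orange\x019.23\x01\n")
for record in split_records(sample, 3):
    print(record)
```

Because records are delimited by counting fields, a newline at the start of a text field is kept as data instead of being mistaken for a record boundary.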

How to create mat file containing video in it

I'm new to MATLAB programming. I have image processing code that loads a .mat file. The code accepts a .mat file as input, with a video file in it.
filename=('C:\Users\HP\Desktop\Folder\Image\NVR_ch2_main_cut_35-41.asf');
s=load(filename);
s=struct2cell(s);
M=double(s{1});
if (length(size(M))==4)
M=squeeze(M(:,:,1,:));
end
Error using load
Unknown text on line number 1 of ASCII file C:\Users\HP\Desktop\Folder\Image\NVR_ch2_main_cut_35-41.asf
"Seh".
Just use v = VideoReader(filename) instead of the load function.
For further information: http://ch.mathworks.com/help/matlab/ref/videoreader.html
Well, obviously MATLAB won't read your file, because it contains things load won't accept.
Does your file comply with this (from the MATLAB reference; next time you should read it)?
ASCII files must contain a rectangular table of numbers, with an equal
number of elements in each row. The file delimiter (the character
between elements in each row) can be a blank, comma, semicolon, or tab
character. The file can contain MATLAB comments (lines that begin with
a percent sign, %).
http://de.mathworks.com/help/matlab/ref/load.html#responsive_offcanvas
Read your first sentence. You say you want to load a .mat file. But filename ends with .asf which is some video format if I remember correctly.
You can't feed a video file into load.

updating line in large text file using scala

I have a large text file (around 43 GB) in .ttl format that contains triples in the form:
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://la.dbpedia.org/resource/Mahatma_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .
I want to find the fastest way to update a specific line inside the file without rewriting all of the text that follows it, either by updating it in place or by deleting it and appending it to the end of the file.
To access the specific line, I use this code:
import scala.io.Source
val lines = Source.fromFile("text.txt").getLines()
val targetLine = lines.drop(10000000).next()
If you want to use text files, consider a fixed length/record size for each line/record.
This way you can use a RandomAccessFile to seek to the exact position of each line by number: You just seek to line * LineSize, and then update it.
It will not really help if you have to insert a new line. Other limitations are: The file size will grow (because of the fixed record length), and there will always be one record which is too big.
As for the initial conversion:
Get the maximum line length of the current file, then add 10% for example.
Now you have to convert the file once: Read a line from the text file, and convert it into a fixed-size record.
You could use a special character like | to separate the fields. If possible, use something like ;, so you get a .csv file.
I suggest padding the remaining space with spaces, so it still looks like a text file which you can parse with shell utilities.
You could use a \n to terminate the record.
For example
http://x.com|http://x.com|http://x.com|...\n
or
http://x.com;http://x.com;http://x.com;...\n
where each . at the end represents a space character. So it's still somehow compatible with a "normal" text file.
On the other hand, looking at your data, consider using a key-value data store like Redis: You could use the line number or the 1st URL as the key.
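The fixed-length record idea can be sketched as follows. The question uses Scala; this sketch is in Python, where seek on a binary file plays the role of RandomAccessFile. The function names and the record size are hypothetical, and the sketch assumes every line fits into the chosen size.

```python
def to_fixed_record(line, size):
    """Pad a line with spaces to size - 1 bytes and terminate it with a newline."""
    data = line.encode("utf-8")
    if len(data) > size - 1:
        raise ValueError("line is longer than the fixed record size")
    return data.ljust(size - 1, b" ") + b"\n"

def update_record(path, index, new_line, size):
    """Overwrite record number `index` in place, leaving the rest of the file untouched."""
    with open(path, "r+b") as f:
        f.seek(index * size)  # records are fixed-size, so the offset is index * size
        f.write(to_fixed_record(new_line, size))

def read_record(path, index, size):
    """Read record number `index` and strip the padding."""
    with open(path, "rb") as f:
        f.seek(index * size)
        return f.read(size).decode("utf-8").rstrip(" \n")
```

After the one-time conversion with to_fixed_record, updating line n costs a single seek and a write of one record, regardless of how large the file is.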

How to fill a field with spaces until a length in Notepad++

I've prepared a macro in Notepad++ to transform an LDIF file into a CSV file with a few fields. Everything is OK, but I have one final problem: two of the fields must have a specific length, and at the moment I cannot guarantee that length because they do not come that way in the source file.
For instance, I generate this line:
12345,namenamename,123456
And I have to ensure that the 2nd and 3rd fields have 30 (filling with spaces at right side) and 9 (filling with zeros at left) characters, so in this case I should generate:
12345,namenamename                  ,000123456
I haven't found a way for Notepad++ to match a pattern in order to add spaces/zeros, so I have thought of adding 1 space/zero to the proper field and repeating this step as many times as needed to ensure the lengths (that is, 29 and 8 times, because the fields cannot be empty), searching by length in the regex (for instance: \d{1,8} for the third field).
My question is: can I repeat only one step of the macro several times (and the rest of the macro only 1 repetition)?
I've read the wiki related to this point (http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Editing_Configuration_Files#.3CMacros.3E) and I didn't find anything there either.
If that is not possible, what would be a good solution? Create 2 more macros and, after executing the main one, execute these 2 new macros several times?
Thanks in advance!
A two-pass solution with Notepad++ is possible. Find a pair of characters or two short sequences of characters that never occur in your data file. I will use =#<= and =>#= here.
First pass, generate or convert the input text into the form 12345,=#<=namenamename______________________________,000000000123456=>#=. I.e., add 30 spaces after the name and nine zeroes before the number (underscores used here just to make things clearer).
Second pass, do a regular expression search for =#<=(.{30})_*,0*(\d{9})=>#= and replace with \1,\2.
I have just suggested a similar solution in special timestamp format of csv
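To check the two passes outside Notepad++, they can be reproduced with Python's re module. This is only a sketch: it assumes three comma-separated fields per line, uses real spaces where the answer shows underscores, and the function names are hypothetical.

```python
import re

def pass_one(line):
    """Add the markers, 30 spaces after the name and nine zeros before the number."""
    first, name, number = line.split(",")
    return f"{first},=#<={name}{' ' * 30},{'0' * 9}{number}=>#="

def pass_two(line):
    """Keep the first 30 characters of the name and the last 9 digits of the number."""
    return re.sub(r"=#<=(.{30}) *,0*(\d{9})=>#=", r"\1,\2", line)

padded = pass_two(pass_one("12345,namenamename,123456"))
print(repr(padded))
```

The second pattern works because 0* gives back zeros through backtracking until exactly nine digits remain for \d{9}, so the number ends up left-padded to nine characters.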