Updating a line in a large text file using Scala

I have a large text file, around 43 GB, in .ttl format, containing triples of the form:
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://la.dbpedia.org/resource/Mahatma_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .
I want to find the fastest way to update a specific line inside the file without rewriting all the text that follows, either by updating the line in place or by deleting it and appending the new version to the end of the file.
To access the specific line I use this code:
val lines = io.Source.fromFile("text.txt").getLines()
val targetLine = lines.drop(10000000).next() // e.g. the 10,000,001st line

If you want to use text files, consider a fixed length/record size for each line/record.
This way you can use a RandomAccessFile to seek to the exact position of each line by number: you just seek to lineNumber * recordSize, and then overwrite the record in place.
It will not really help if you have to insert a new line. Other limitations are: the file size will grow (because of the fixed record length), and sooner or later there will be a record that is too big for the fixed length.
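For illustration, a minimal sketch of the seek-and-overwrite idea in Scala, assuming a fixed record size in bytes (including the trailing \n) and single-byte encoded content; recordSize and the paths are placeholders:

import java.io.RandomAccessFile

// Assumed fixed record size in bytes, including the trailing '\n'.
val recordSize = 256

def updateLine(path: String, lineNumber: Long, newContent: String): Unit = {
  require(newContent.length < recordSize, "new content exceeds the fixed record size")
  val raf = new RandomAccessFile(path, "rw")
  try {
    // Jump straight to the record; no need to read the preceding lines.
    raf.seek(lineNumber * recordSize)
    // Pad with spaces so the file still parses as plain text.
    val padded = newContent.padTo(recordSize - 1, ' ') + "\n"
    raf.write(padded.getBytes("US-ASCII"))
  } finally raf.close()
}

Note that padTo counts characters, not bytes, so this only holds as long as the content stays single-byte encoded.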
As for the initial conversion:
Get the maximum line length of the current file, then add 10% for example.
Now you have to convert the file once: Read a line from the text file, and convert it into a fixed-size record.
You could use a special character like | to separate the fields. If possible, use something like ;, so you get a .csv file.
I suggest padding the remaining space with spaces, so the result still looks like a text file which you can parse with shell utilities.
You could use a \n to terminate the record.
For example
http://x.com|http://x.com|http://x.com|...\n
or
http://x.com;http://x.com;http://x.com;...\n
where each . at the end represents a space character. So it's still somewhat compatible with a "normal" text file.
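A sketch of the one-time conversion under the same assumptions (file names are placeholders; the record size is derived from the longest line plus 10%):

import java.io.{BufferedWriter, FileWriter}
import scala.io.Source

// First pass: find the longest line and add 10% headroom.
val maxLen = Source.fromFile("text.ttl").getLines().map(_.length).max
val recordSize = (maxLen * 1.1).toInt + 1 // +1 for the trailing '\n'

// Second pass: rewrite every line as a fixed-size, space-padded record.
val out = new BufferedWriter(new FileWriter("text_fixed.ttl"))
for (line <- Source.fromFile("text.ttl").getLines()) {
  out.write(line.padTo(recordSize - 1, ' '))
  out.write("\n")
}
out.close()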
On the other hand, looking at your data, consider using a key-value data store like Redis: you could use the line number or the first URL as the key.

Related

Is there a way for Spark to read this odd text format?

The file format I have is sort of like CSV and looks like this (an Ab Initio .dat file of some sort):
1,apple,10.00,\n
2,banana,12.35,\n
3,orange,9.23,\n
The commas are actually "Start of Header" 0x01 byte characters, but I will use commas for simplicity. I can easily read the above sample by reading the file as a string RDD with a custom line split of ,\n and then passing that into spark.read.csv. I am currently splitting lines on ,\n because there may be newlines in the data, and I thought those two characters were unique to each record boundary. However, a problem occurs when there are newline characters at the start of text fields. For example:
1,one \n apple,10.00,\n
2,two banana,12.35,\n
3,\n three orange,9.23,\n
My current code is able to ignore the newline in record 1, but it picks up the ,\n after the 3 and splits the three lines into four. How can I reliably read in this format?
My current ideas are:
Check that there is the right number of , column delimiters before allowing a split (a sketch of this idea follows this list). I am not sure how to implement this; is it possible to do a regex look-behind when Spark sees a ,\n and check for the correct number of delimiters?
Try to coerce the file into some other format besides CSV
Make my own InputFormatClass, although I am not sure what this entails.
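For the first idea, one possible direction (a local, untested sketch, with expectedSeps assumed from the three-column sample above) is to split on ,\n anyway and then glue back together any piece that ends up with too few delimiters:

// Each complete record loses its trailing ",\n" to the split, leaving
// exactly two commas; fewer commas means the split was premature.
val expectedSeps = 2

def mergeBrokenRecords(pieces: Seq[String]): Seq[String] =
  pieces.foldLeft(List.empty[String]) {
    case (head :: tail, piece) if head.count(_ == ',') < expectedSeps =>
      // Re-attach the delimiter the split consumed, then keep going.
      (head + ",\n" + piece) :: tail
    case (acc, piece) => piece :: acc
  }.reverse

In Spark this would only be safe per partition; records spanning partition boundaries would still need special handling.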

Matlab dataimport

My Matlab code for data import is giving me different results for what appear to be similar text files as input. Input1 gives me a normal cell array with all lines from the text file as entries, which I can reference using {i}.
Input2 gives me a scalar data structure where all numeric entries in my text file are converted into the input.data structure. I want all files to be converted to regular cell entries, and I do not understand why some files are converted to scalar data structures.
Code: input = importdata(strcat(direct,'\',filename));
Input1 example: Correctly working dataimport, with text file on the right
File link: https://drive.google.com/open?id=1aHK4xivqEqJEmaA8Dv8Y0uW5giG-Bbip
Input2 example: Incorrectly working data import, with text file on the right. File link: https://drive.google.com/open?id=1nzUj_wR1bNXFcGaSLGva6uVsxrk-R5vA
UTSL!
I'm guessing you are using GNU Octave, although you wrote "Matlab" as the topic of your question.
In importdata.m around line 178, the code tries to automatically detect the delimiter for your data:
delim = regexpi (row, '[-+\d.e*ij ]+([^-+\de.ij])[-+\de*.ij ]','tokens', 'once');
If you run this against W40A0060; you get A as the delimiter, because there is basically a number before and after it.
If you run this against W39E0016; you get {} (empty) as the delimiter, because the E could be part of a number in scientific notation and is therefore excluded.
Solution:
You really should add the correct delimiter to the importdata call and not trust that it is magically detected.
And if you just want the lines in a cell, use
strsplit (fileread ("W39E0016_Input2.txt"), "\n")
Analysis
This indeed looks strange!
EDIT: The cause of this strange-looking behaviour has been deciphered by @Andy (see his solution).
When you use all outputs of the importdata() function, you can see what happens when reading the data:
[dat1,del1,headerrows1]=importdata('Input1.txt')
[dat2,del2,headerrows2]=importdata('Input2.txt')
For your first file it recognizes 69 header rows and no delimiter:
del1 = []
headerrows1 = 69
while in your second file only two header rows and a comma (,) delimiter are recognized:
del2 = ','
headerrows2 = 2
I cannot find an obvious reason in your files for this different interpretation of the data.
Suggestion
Your data format is rather complex. It is not a simple table like one produced by Excel. It has multiple lines with a different number of fields per line and varying data types. importdata() is not designed for this type of data. I suggest writing a specific import function for this kind of file. Have a look at textread() for a first attempt. You can use it to read the lines of the files as text and later interpret them with sscanf(), or use strsplit() to split the line contents into fields.

How to create a mat file containing video

I'm new to Matlab programming. I have image processing code which loads a mat file; the code expects a .mat file with a video in it as input.
filename=('C:\Users\HP\Desktop\Folder\Image\NVR_ch2_main_cut_35-41.asf');
s=load(filename);
s=struct2cell(s);
M=double(s{1});
if (length(size(M))==4)
    M=squeeze(M(:,:,1,:));
end
Error using load
Unknown text on line number 1 of ASCII file C:\Users\HP\Desktop\Folder\Image\NVR_ch2_main_cut_35-41.asf
"Seh".
Just use v = VideoReader(filename) instead of the load function.
For further information: http://ch.mathworks.com/help/matlab/ref/videoreader.html
Well obviously Matlab won't read your file because it contains things load won't accept.
Does your file comply with this (from the Matlab reference; next time you should read it first):
ASCII files must contain a rectangular table of numbers, with an equal
number of elements in each row. The file delimiter (the character
between elements in each row) can be a blank, comma, semicolon, or tab
character. The file can contain MATLAB comments (lines that begin with
a percent sign, %).
http://de.mathworks.com/help/matlab/ref/load.html#responsive_offcanvas
Read your first sentence. You say you want to load a .mat file, but filename ends with .asf, which is a video format if I remember correctly.
You can't feed a video file into load.

Spark: Split CSV with newlines in octet-stream field

I am using Scala to parse CSV files. Some of these files have fields which are non-textual data like images or octet-streams. I would like to use Apache Spark's textFile() method to split up the CSV into rows, and
split(",[ ]*(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
to split the row into fields. Unfortunately this does not work with files that have the mentioned binary fields. There are two problems: 1) the octet-streams can contain newlines, which make textFile() split rows which should be one, and 2) the octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
The files are usually big, from a couple of MBs up to a couple of hundred MBs. I have to take the CSVs as they are, although I could preprocess them.
All I want to achieve is a working split function so I can ignore the field with the octet-stream. Nevertheless, a great bonus would be to extract the textual information in the octet-stream.
So how would I go forward to solve my problems?
Edit: A typical record obtained with cat; the newlines are from the file, not added for cosmetic purposes (shortened):
7,url,user,02/24/2015 02:29:00 AM,03/22/2015 03:12:36 PM,octet-stream,27156,"MSCF^#^#^#^#�,^#^#^#^#^#^#D^#^#^#^#^#^#^#^C^A^A^#^C^#^D^#^#^#^#^#^T^#^#^#^#^#^P^#�,^#^#^X=^#^#^#^#^#^#^#^#^#^#�^#^#^#^E^#^A^#��^A^#^#^#^#^#^#^#WF6�!^#Info.txt^#=^B^#^#��^A^#^#^#WF7�^#^#List.xml^#^�^#^#��^A^#^#^#WF:�^#^#Filename.txt^#��>��
^#�CK�]�r��^Q��T�^O�^#�-�j�]��FI�Ky��Ei�Je^K""!�^Qx #�*^U^?�^_�;��ħ�^LI^#$(�^Q���b��\N����t�����+������ȷgvM�^L̽�LǴL�^L��^ER��w^Ui^M��^X�Kޓ�^QJȧ��^N~��&�x�bB��D]1�^B|^G���g^SyG�����:����^_P�^T�^_�����U�|B�gH=��%Z^NY���,^U�^VI{��^S�^U�!�^Lpw�T���+�a�z�l������b����w^K��or��pH� ��ܞ�l��z�^\i=�z�:^C�^S!_ESCW��ESC""��g^NY2��s�� u���X^?�^R^R+��b^]^Ro�r���^AR�h�^D��^X^M�^]ޫ���ܰ�^]���0^?��^]�92^GhCx�DN^?
mY<{��L^Zk�^\���M�^V^HE���-Ե�$f�f����^D�e�^R:�u����� ^E^A�Ȑ�^B�^E�sZ���Yo��8Eސ�}��&JY���^A9^P������^P����~Jʭy��`�^9«�""�U� �:�}3���6�Hߧ�v���A7^Xi^L^]�sA�^Q�7�5d�^Xo˛�tY
Bp��4�Y���7DkV_���\^_q~�w�|�a�s̆���#�g�ӳu�^�!W}�n��Rgż_2�]�p�2}��b�G9�M^Q
�����:�X����bR[ԳZV!^G����^U�tq�&�Y6b��GR���s#mn6Z=^ZH^]�b��R^G�C�0R��{r1��4�#�
=r/X2�^O�����r^M�Rȕ�goG^X-����}���P+˥Qf�#��^C�Բ�z1�I�j����6�^Np���ܯ^P�[�^Tzԏ���^F2�e��\�E�߻6c�%���$�:E�*�*©t�y�J�,�S�2U�S�^X}ME�]��]�i��G�su�""��!�-��!r'ܷe_et Y^K^?0���l^A��^^�m�1/q����|�_r�5$�%�([x��W^E�G^^y���#����Z2^?ڠ�^_��^AҶ�OO��^]�vq%:j�^?�jX��\�]����^S�^^n�^C��>.^CY^O-� �_�\K����:p�<7Sֺnj���-Yk�r���^Q^M�n�J^B��^Z0^?�(^C��^W³!�g�Z�~R�A^M�^O^^�%;��Ԗ�p^S�w���*m^S���jڒ|�����<�^S�;Z^^Fc�1���^O�G_o����8��CS���w��^?��n�2~��m���G;��rx4�(�]�'��^E���eƧ�x��.�w�9WO�^^�י3��0,�y��H�Y�.H�x�""'���h}灢^T�Gm;^XE�̼�J��c�^^񾠫;�^A�qZ1ׁBZ^Q�^A^FB�^QbQ�_�3|ƺ�EvZ���^S�w���^P���9^MT��ǩY[+�+�9�Ԩ�^O�^Q���Fy(+�9p�^^Mj�2��Y^?��ڞ��^Ķb�^Z�ψMр}�ڣ�^^S�^?��^U�^Wڻ����z�^#��uk��k^^�>^O�^W�ݤO�h�^G�����Kˇ�.�R|�)-��e^G�^]�/J����U�ϴ�a���i5HO�^L�ESCg�R'���.����d���+~�}��ڝ^Y5]l�3jg54M�������2t�5^Y}�q)��^O;�X\�q^Ox~Vۗ�t�^\f� >k;^G�K5��,��X�t/�ǧ^G""5��4^MiΟ�n��^B^]�|�����V��ߌ֗Q~�H���8��t��5��ܗ�
�Z�^c�6N�ESCG����^_��>��t^L^R�^:�x���^]v�{^#+KM��qԎ�.^S�%&��=^W-�=�^S�����^CI���&^]_�s�˞�y�z�Jc^W�kڠ�^\��^]j�����^O��;�oY^^�^V59;�c��^B��T�nb����^C��^N��s�x�<{�9-�F�T�^N�5�^Se-���^T�Y[���`^ZsL��v�բ<C�+�~�^ۚ��""�Yκ2^_�^VxT�>��/ݳ^U�m�^#���3^Ge�n^Vc�V�^#�NVn�,�q��^^^]gy�R�S��Ȃ$���>A�d����xg�^GB3�M�J�^QJ^]�^\�{.�D��碎�^W�8a����qޠl?,'^R�^X�Cgy�P[����mڞ��H�Z�s�SD&蠤�s�E��nu�O#O<��3wj`C-%w�W�J�^WP^T�^]r^NT�TC�Lq�Z�f�!�;�l�Y��Gb��>�ud�hx�Ԭ^N)9�^N!k�҉s�35v������.�""^]��~4������۴�Z^]u�^Ti^^�i:�)K��P᳕!�#�^?�>��EE^VE-u�^SgV^L��<��^D�O<�+�J.�c�Z#>�.l����^S�
ESC��(��E�j�π쬖���2{^U&b\��P^S�`^O^XdL�^ 6bu��FD��^#^#^#^#","field_x, data",field_y,field_z
Expected output would be an array
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","field_x, data",field_y",field_z")
Or, but this is probably another question, such an array (like running strings on the octet-stream field):
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","Info.txt List.xml Filename.txt","field_x, data",field_y",field_z")
Edit 2: Every file that has a binary field also contains a length field for it. So instead of splitting directly, I can walk left to right through my record and extract the fields. This is certainly a great improvement over my current situation, but problem 1) still persists. How can I split those files reliably?
I took a closer look at the files and a header looks like this:
RecordId, Field_A, Content_Type, Content_Length, Content, Field_B
(Where Content_Type can be "octet-stream", Content_Length is the number of bytes in the Content field, and Content is obviously the data.) And, good for me, the value of Field_B is predictable; let's assume for a certain file it's always "Hello World".
So instead of using Spark's default behaviour of splitting on newlines, how can I make Spark split only on newlines that follow "Hello World"? (I also edited the question title, since the focus of the question changed.)
As answered in Spark: Reading files using different delimiter than new line, I used textinputformat.record.delimiter to split on "Hello World\n", because I am a bit lucky that the last column always contains the same value. After that I simply walk left to right through the record, and when I reach the length field I skip the next n bytes. Everything works now. Thanks for pointing me in the right direction.
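For reference, a sketch of that configuration, assuming sc is a SparkContext and the input path is a placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Clone the Hadoop configuration so the custom delimiter
// does not leak into other reads on the same context.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "Hello World\n")

val records = sc
  .newAPIHadoopFile("data.csv", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString } // one logical record per element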
There are two problems: 1) The octet-streams can contain newlines which make textFile() split rows which should be one, and 2) The octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
Well, actually that csv file is properly escaped:
the multiline field is enclosed in double quotes: "MSCF^# .. ^#^#" (which also handles possible separators inside the field)
double quotes inside the field are escaped with another double quote as it should be: Je^K""!
Of course a simple split will not work in this case (and should never be used on csv data), but any csv reader able to handle multiline fields should parse that data correctly.
Also keep in mind that the double quotes inside the octet-stream have to be unescaped, or that data won't be valid (another reason not to use split, but a csv reader that handles this).
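As a sketch of that suggestion, here is how a CSV parser such as Apache Commons CSV (one possible choice among several) would read such records; the file name is a placeholder:

import java.io.FileReader
import org.apache.commons.csv.CSVFormat
import scala.jdk.CollectionConverters._

// CSVFormat.DEFAULT follows RFC 4180: quoted fields may span multiple
// lines, and "" inside a quoted field is an escaped double quote.
val parser = CSVFormat.DEFAULT.parse(new FileReader("records.csv"))
try {
  for (record <- parser.asScala) {
    val fields = record.iterator().asScala.toVector
    // In the sample above, fields(7) would hold the unescaped octet-stream.
    println(s"record with ${fields.length} fields")
  }
} finally parser.close()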

Long text file to SAS dataset

I am trying to load a large text file (a report) as a single cell in a SAS dataset, but because of multiple spaces and formatting, the data is getting split into multiple cells.
Data l1.MD;
infile 'E:\Sasfile\f1.txt' truncover;
input text $char50. #;
run;
I have 50 such files to upload, so keeping each file as a single cell is most important. What am I missing here?
Data l1.MD;
infile 'E:\Sasfile\f1.txt' recfm=f lrecl=32767 pad;
input text $char32767.;
run;
That would do it. RECFM=F tells SAS to read fixed-length records (ignoring line feeds), LRECL sets the record length to the maximum for a single variable (records can be longer, but one variable is limited to 32,767 characters), and PAD fills the record with blank spaces if it is too short.
You'd only end up with more than one cell if your text file is longer than that. Note that the line feed and/or carriage return characters will end up in this value, which may be good or bad. You can identify them with '0A'x and/or '0D'x (depending on your OS you may have one or both), and you can remove them with compress() using the 'c' modifier or translate them to a line separator of your preference.