Two consecutive fields having the same value make Papaparse fail? - papaparse

Parsing following file ( in my case with 'header:true' ) :
FN1,FN2,FN3
A1,A2,A3
B1,B2,B3
C1,C1,C3
D1,D2,D3
makes Papaparse fail on
Row 4 : Too few fields: expected 3 fields but parsed 1
Please note this is a stripdown of a much larger file where the consecutive values where deep down the file.
Is this a bug or am I doing something wrong ?

There is some kind of anomaly, but not exactly as I pointed out.
The problem seems to stem from the last bytes of the file being \r\n.
Papaparse interprets one additional (empty) line therefore.
It is complaining on that one. (and I misinterpreted the row information : 1 based counting , stripped header row , my double value coincidentally on the one but last row)
Configuring skipEmptyLines : true solves the issue.
I am still inclined though to call it a bug as there's not really an empty line at the end.

Related

Extracting data from old text file into usable format

I have some data in a text file in the following format:
1079,40,011,1,301 17,310 4,668 6,680 1,682 1,400 7,590 2,591 139,592 332,565 23,568 2,569 2,595 1,471 1,470 10,481 12,540 117,510 1,522 187,492 9,533 41,558 15,555 12,556 9,558 27,546 1,446 1,523 4000,534 2000,364 1,999/
1083,40,021,1,301 4,310 2,680 1,442 1,400 2,590 2,591 90,592 139,595 11,565 6,470 2,540 66,522 4,492 1,533 19,546 3,505 1,523 3000,534 500,999/
These examples represent what would be two rows in a spreadsheet. The first four values (in the first example, "1079,40,011,1") each go into their own column. The rest of the data are in a paired format, first listing a name of a column, designated by a number, then a space followed by the value that should appear in that column. So again, example: 301 17,310 4,668 6: in this row, column 301 has a value of 17, column 310 has value of 4, column 668 has value of 6, etc. Then 999/ indicates an end to that row.
Any suggestions on how I can transform this text file format into a usable spreadsheet would be greatly appreciated. There are thousands of "rows" and so can't just manually convert them and I don't possess the coding skills to execute such a transformation myself.
This is messy but since there is a pattern it should be doable. What software are you using?
My first idea would be to identify when the delimeter changes from comma to space. Is it based on a fixed width, like always after 14 characters? Or is it based on the delimiter, like it is always after the 4th comma?
Once you've done that, you could make two passes at the data. The first pass imports the first four values from the beginning of the line which are separated by comma. The second pass imports the remaining values which are separated by space.
If you include a row number when importing you can then use it to join first and second passes at importing.

Why are formulas not propagating in null fields?

I'm trying to change the value of nulls to something else that can be used to filter. This data comes from a QVD file. The field that contains nulls, contains nulls due to no action taken on those items ( they will eventually change to something else once an action has been taken). I found this link which was very informative but i tried multiple solutions from the document to no avail.
What i don't quite understand is that whenever i make a new field (in the script or as an expression) the formula does not propagate in the records that are null, it shows " - ". For instance, the expression isNull(ActionTaken) will return false in a field that that not null, but only " - " in fields that are null. If i export the table to Excel, the " - " is exported, i copy this cell to a text analyzer i the UTF-8 encoded is \x2D\x0A\x0A, i'm not sure if that's an artifact of the export process.
I also tried using the NullAsValue statement but no luck. Using a combination of Len & Trim = 0 will return the same result as above. This is only one table, no other tables are involved.
Thanks in advance.
I had a similar case few years ago where the field looked empty but actually it was filled with a character which just looked empty. Trimming the field also didnt worked as expected in this case, because the character code was different
What I can suggest you is to check if the character number, returned for the empty value, is actually an empty string. You can use the ord to check the character number for the empty values. Once you have the number then you can use this number to replace it with whatever you want (for example empty string)

Preparing data for (stanford) Deepdive (ValueError)

I started using Stanford-Deepdive a while ago.
I am currently facing the problem, that deepdive will interpret some of the rows he gets as incomplete.
Value Error: Expected 6 attributes, but found 5 in input row:
<Row()>
I already had this problem with another data-set. At this set there were some rows, that contained "\n" within the text. So i removed that and everything went flawlessly.
For my new set of data I am removing "\n", "\t", and any occurence of multiple spaces. Also I replace any empty text value by "EMPTY" - still the error refuses to go away.
Are there any other formatting errors or characters that I need to take care of?
Is my way of approaching this reasonable?
I found the problem. It was caused by a singular TAB (\t) entry. I replaced that by a singe SPACE and in the end it would not be a valid antry anymore
so if you use some text for deepdive you will want to treat etrys consisting of a single SPACE as if they were empty.

Spark: Split CSV with newlines in octet-stream field

I am using Scala to parse CSV files. Some of these files have fields which are non-textual data like images or octet-streams. I would like to use Apache Spark's textFile() method to split up the CSV into rows, and
split(",[ ]*(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
to split the row into fields. Unfortunatly this does not work with files that have these mentioned binary fields. There are two problems: 1) The octet-streams can contain newlines which make textFile() split rows which should be one, and 2) The octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
The files are usually big, couple of MBs up to couple of 100MBs. I have to take the CSV's as they are, although I could preprocess them.
All I want to achieve is a working split function so I can ignore the field with the octet-stream. Nevertheless, a great bonus would be to extract the textual information in the octet-stream.
So how would I go forward to solve my problems?
Edit: A typical record obtained with cat, the newlines are from the file, not for cosmetic purposes (shortened):
7,url,user,02/24/2015 02:29:00 AM,03/22/2015 03:12:36 PM,octet-stream,27156,"MSCF^#^#^#^#�,^#^#^#^#^#^#D^#^#^#^#^#^#^#^C^A^A^#^C^#^D^#^#^#^#^#^T^#^#^#^#^#^P^#�,^#^#^X=^#^#^#^#^#^#^#^#^#^#�^#^#^#^E^#^A^#��^A^#^#^#^#^#^#^#WF6�!^#Info.txt^#=^B^#^#��^A^#^#^#WF7�^#^#List.xml^#^�^#^#��^A^#^#^#WF:�^#^#Filename.txt^#��>��
^#�CK�]�r��^Q��T�^O�^#�-�j�]��FI�Ky��Ei�Je^K""!�^Qx #�*^U^?�^_�;��ħ�^LI^#$(�^Q���b��\N����t�����+������ȷgvM�^L̽�LǴL�^L��^ER��w^Ui^M��^X�Kޓ�^QJȧ��^N~��&�x�bB��D]1�^B|^G���g^SyG�����:����^_P�^T�^_�����U�|B�gH=��%Z^NY���,^U�^VI{��^S�^U�!�^Lpw�T���+�a�z�l������b����w^K��or��pH� ��ܞ�l��z�^\i=�z�:^C�^S!_ESCW��ESC""��g^NY2��s�� u���X^?�^R^R+��b^]^Ro�r���^AR�h�^D��^X^M�^]ޫ���ܰ�^]���0^?��^]�92^GhCx�DN^?
mY<{��L^Zk�^\���M�^V^HE���-Ե�$f�f����^D�e�^R:�u����� ^E^A�Ȑ�^B�^E�sZ���Yo��8Eސ�}��&JY���^A9^P������^P����~Jʭy��`�^9«�""�U� �:�}3���6�Hߧ�v���A7^Xi^L^]�sA�^Q�7�5d�^Xo˛�tY
Bp��4�Y���7DkV_���\^_q~�w�|�a�s̆���#�g�ӳu�^�!W}�n��Rgż_2�]�p�2}��b�G9�M^Q
�����:�X����bR[ԳZV!^G����^U�tq�&�Y6b��GR���s#mn6Z=^ZH^]�b��R^G�C�0R��{r1��4�#�
=r/X2�^O�����r^M�Rȕ�goG^X-����}���P+˥Qf�#��^C�Բ�z1�I�j����6�^Np���ܯ^P�[�^Tzԏ���^F2�e��\�E�߻6c�%���$�:E�*�*©t�y�J�,�S�2U�S�^X}ME�]��]�i��G�su�""��!�-��!r'ܷe_et Y^K^?0���l^A��^^�m�1/q����|�_r�5$�%�([x��W^E�G^^y���#����Z2^?ڠ�^_��^AҶ�OO��^]�vq%:j�^?�jX��\�]����^S�^^n�^C��>.^CY^O-� �_�\K����:p�<7Sֺnj���-Yk�r���^Q^M�n�J^B��^Z0^?�(^C��^W³!�g�Z�~R�A^M�^O^^�%;��Ԗ�p^S�w���*m^S���jڒ|�����<�^S�;Z^^Fc�1���^O�G_o����8��CS���w��^?��n�2~��m���G;��rx4�(�]�'��^E���eƧ�x��.�w�9WO�^^�י3��0,�y��H�Y�.H�x�""'���h}灢^T�Gm;^XE�̼�J��c�^^񾠫;�^A�qZ1ׁBZ^Q�^A^FB�^QbQ�_�3|ƺ�EvZ���^S�w���^P���9^MT��ǩY[+�+�9�Ԩ�^O�^Q���Fy(+�9p�^^Mj�2��Y^?��ڞ��^Ķb�^Z�ψMр}�ڣ�^^S�^?��^U�^Wڻ����z�^#��uk��k^^�>^O�^W�ݤO�h�^G�����Kˇ�.�R|�)-��e^G�^]�/J����U�ϴ�a���i5HO�^L�ESCg�R'���.����d���+~�}��ڝ^Y5]l�3jg54M�������2t�5^Y}�q)��^O;�X\�q^Ox~Vۗ�t�^\f� >k;^G�K5��,��X�t/�ǧ^G""5��4^MiΟ�n��^B^]�|�����V��ߌ֗Q~�H���8��t��5��ܗ�
�Z�^c�6N�ESCG����^_��>��t^L^R�^:�x���^]v�{^#+KM��qԎ�.^S�%&��=^W-�=�^S�����^CI���&^]_�s�˞�y�z�Jc^W�kڠ�^\��^]j�����^O��;�oY^^�^V59;�c��^B��T�nb����^C��^N��s�x�<{�9-�F�T�^N�5�^Se-���^T�Y[���`^ZsL��v�բ<C�+�~�^ۚ��""�Yκ2^_�^VxT�>��/ݳ^U�m�^#���3^Ge�n^Vc�V�^#�NVn�,�q��^^^]gy�R�S��Ȃ$���>A�d����xg�^GB3�M�J�^QJ^]�^\�{.�D��碎�^W�8a����qޠl?,'^R�^X�Cgy�P[����mڞ��H�Z�s�SD&蠤�s�E��nu�O#O<��3wj`C-%w�W�J�^WP^T�^]r^NT�TC�Lq�Z�f�!�;�l�Y��Gb��>�ud�hx�Ԭ^N)9�^N!k�҉s�35v������.�""^]��~4������۴�Z^]u�^Ti^^�i:�)K��P᳕!�#�^?�>��EE^VE-u�^SgV^L��<��^D�O<�+�J.�c�Z#>�.l����^S�
ESC��(��E�j�π쬖���2{^U&b\��P^S�`^O^XdL�^ 6bu��FD��^#^#^#^#","field_x, data",field_y,field_z
Expected output would be an array
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","field_x, data",field_y",field_z")
Or, but this is probably another question, such an array (like running strings on the octet-stream field):
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","Info.txt List.xml Filename.txt","field_x, data",field_y",field_z")
Edit 2: Every file that has a binary field also contains a length field for it. So instead of splitting directly I can walk left to right through my record and extract the fields. This is certainly a great improvement of my current situation but problem 1) still persists. How can I split those files reliably?
I took a closer look at the files and a header looks like this:
RecordId, Field_A, Content_Type, Content_Length, Content, Field_B
(Where Content_Type can be "octet-stream", Content_Length the number of bytes in the Content field, and Content obviously the data). And good for me, the value of Field_B is predictable, let's assume for a certain file it's always "Hello World".
So instead of using Spark's default behaviour splitting on newlines, how can I achieve that Spark is only splitting on newlines following "Hello World"? (I also edited the question title since the focus of the question changed)
As answered in Spark: Reading files using different delimiter than new line, I used textinputformat.record.delimiter to split on "Hello World\n" because I am a bit lucky that the last column always contains the same value. After that I simply walk left to right through the record and when I reach the length field I skip the next n bytes. Everything works now. Thanks for pointing me in the right direction.
There are two problems: 1) The octet-streams can contain newlines
which make textFile() split rows which should be one, and 2) The
octet-streams contain commas and/or double quotes which are not
escaped and mess up my schema.
Well, actually that csv file is properly escaped:
the multiline field is enclosed in double quotes: "MSCF^# .. ^#^#" (which also handles possible separators inside the field)
double quotes inside the field are escaped with another double quote as it should be: Je^K""!
Of course a simple split will not work in this case (and should never be used on csv data), but any csv reader able to handle multiline fields should parse that data correctly.
Also keep in mind that the double quotes inside the octet-stream have to be unescaped, or that data won't be valid (another reason not to use split, but a csv reader that handles this).

How to fill a field with spaces until a length in Notepad++

I've prepared a macro in Notepad++ to transform a ldif file in a csv file with a few fields. Everything is OK but I have a final problem: I have to have 2 fields with a specific length and in this moment I cannot ensure that length because in the source file they are not coming so
For instance, I generate this line:
12345,namenamename,123456
And I have to ensure that the 2nd and 3rd fields have 30 (filling with spaces at right side) and 9 (filling with zeros at left) characters, so in this case I should generate:
12345,namenamename ,000123456
I haven't found how Notepad++ could match a pattern in order to add spaces/zeros, so I have though in to add 1 space/zero to the proper field and repeat this step so many times as needed to ensure the lengths (this is, 29 and 8, because they cannot come empty) and search with the length in the regex (for instance: \d{1,8} for the third field)
My question is: can I repeat only one step of the macro several times (and the rest of the macro only 1 repetition)?
I've read the wiki related to this point (http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Editing_Configuration_Files#.3CMacros.3E) and I don't found anything neither
If not possible, how could be a good solution? Create another 2 different macros and after execute the main one, execute this new 2 macros several times?
Thanks in advance!
A two pass solution with Notepad++ is possible. Find a pair of characters or two short sequence of characters that never occurs in your data file. I will use =#<= and =>#= here.
First pass, generate or convert the input text into the form 12345,=#<=namenamename______________________________,000000000123456=>#=. Ie add 30 spaces after the name and nine zeroes before the number (underscores used here just to make things clearer).
Second pass, do a regular expression search for =#<=(.{30})_*,0*(\d{9})=>#= and replace with \1,\2.
I have just suggested a similar solution in special timestamp format of csv