Spark: Split CSV with newlines in octet-stream field - scala

I am using Scala to parse CSV files. Some of these files have fields containing non-textual data such as images or octet-streams. I would like to use Apache Spark's textFile() method to split the CSV into rows, and
split(",[ ]*(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
to split each row into fields. Unfortunately this does not work with files that have these binary fields. There are two problems: 1) the octet-streams can contain newlines, which make textFile() split a record that should be one row into several, and 2) the octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
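For reference, this is roughly what I am doing right now (a minimal sketch; the path is a placeholder and sc is the usual SparkContext):
// Current, broken approach: textFile() splits on every newline, and the
// regex split gets confused by the commas and quotes inside the binary field.
val fieldSplit = ",[ ]*(?=([^\"]*\"[^\"]*\")*[^\"]*$)"
val rows = sc.textFile("/path/to/file.csv").map(_.split(fieldSplit, -1))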
The files are usually big, from a couple of MBs up to a few hundred MBs. I have to take the CSVs as they are, although I could preprocess them.
All I want to achieve is a working split function so I can ignore the field with the octet-stream. A great bonus, though, would be to extract the textual information from the octet-stream as well.
So how would I go about solving these problems?
Edit: A typical record, obtained with cat; the newlines come from the file and are not there for cosmetic purposes (shortened):
7,url,user,02/24/2015 02:29:00 AM,03/22/2015 03:12:36 PM,octet-stream,27156,"MSCF^#^#^#^#�,^#^#^#^#^#^#D^#^#^#^#^#^#^#^C^A^A^#^C^#^D^#^#^#^#^#^T^#^#^#^#^#^P^#�,^#^#^X=^#^#^#^#^#^#^#^#^#^#�^#^#^#^E^#^A^#��^A^#^#^#^#^#^#^#WF6�!^#Info.txt^#=^B^#^#��^A^#^#^#WF7�^#^#List.xml^#^�^#^#��^A^#^#^#WF:�^#^#Filename.txt^#��>��
^#�CK�]�r��^Q��T�^O�^#�-�j�]��FI�Ky��Ei�Je^K""!�^Qx #�*^U^?�^_�;��ħ�^LI^#$(�^Q���b��\N����t�����+������ȷgvM�^L̽�LǴL�^L��^ER��w^Ui^M��^X�Kޓ�^QJȧ��^N~��&�x�bB��D]1�^B|^G���g^SyG�����:����^_P�^T�^_�����U�|B�gH=��%Z^NY���,^U�^VI{��^S�^U�!�^Lpw�T���+�a�z�l������b����w^K��or��pH� ��ܞ�l��z�^\i=�z�:^C�^S!_ESCW��ESC""��g^NY2��s�� u���X^?�^R^R+��b^]^Ro�r���^AR�h�^D��^X^M�^]ޫ���ܰ�^]���0^?��^]�92^GhCx�DN^?
mY<{��L^Zk�^\���M�^V^HE���-Ե�$f�f����^D�e�^R:�u����� ^E^A�Ȑ�^B�^E�sZ���Yo��8Eސ�}��&JY���^A9^P������^P����~Jʭy��`�^9«�""�U� �:�}3���6�Hߧ�v���A7^Xi^L^]�sA�^Q�7�5d�^Xo˛�tY
Bp��4�Y���7DkV_���\^_q~�w�|�a�s̆���#�g�ӳu�^�!W}�n��Rgż_2�]�p�2}��b�G9�M^Q
�����:�X����bR[ԳZV!^G����^U�tq�&�Y6b��GR���s#mn6Z=^ZH^]�b��R^G�C�0R��{r1��4�#�
=r/X2�^O�����r^M�Rȕ�goG^X-����}���P+˥Qf�#��^C�Բ�z1�I�j����6�^Np���ܯ^P�[�^Tzԏ���^F2�e��\�E�߻6c�%���$�:E�*�*©t�y�J�,�S�2U�S�^X}ME�]��]�i��G�su�""��!�-��!r'ܷe_et Y^K^?0���l^A��^^�m�1/q����|�_r�5$�%�([x��W^E�G^^y���#����Z2^?ڠ�^_��^AҶ�OO��^]�vq%:j�^?�jX��\�]����^S�^^n�^C��>.^CY^O-� �_�\K����:p�<7Sֺnj���-Yk�r���^Q^M�n�J^B��^Z0^?�(^C��^W³!�g�Z�~R�A^M�^O^^�%;��Ԗ�p^S�w���*m^S���jڒ|�����<�^S�;Z^^Fc�1���^O�G_o����8��CS���w��^?��n�2~��m���G;��rx4�(�]�'��^E���eƧ�x��.�w�9WO�^^�י3��0,�y��H�Y�.H�x�""'���h}灢^T�Gm;^XE�̼�J��c�^^񾠫;�^A�qZ1ׁBZ^Q�^A^FB�^QbQ�_�3|ƺ�EvZ���^S�w���^P���9^MT��ǩY[+�+�9�Ԩ�^O�^Q���Fy(+�9p�^^Mj�2��Y^?��ڞ��^Ķb�^Z�ψMр}�ڣ�^^S�^?��^U�^Wڻ����z�^#��uk��k^^�>^O�^W�ݤO�h�^G�����Kˇ�.�R|�)-��e^G�^]�/J����U�ϴ�a���i5HO�^L�ESCg�R'���.����d���+~�}��ڝ^Y5]l�3jg54M�������2t�5^Y}�q)��^O;�X\�q^Ox~Vۗ�t�^\f� >k;^G�K5��,��X�t/�ǧ^G""5��4^MiΟ�n��^B^]�|�����V��ߌ֗Q~�H���8��t��5��ܗ�
�Z�^c�6N�ESCG����^_��>��t^L^R�^:�x���^]v�{^#+KM��qԎ�.^S�%&��=^W-�=�^S�����^CI���&^]_�s�˞�y�z�Jc^W�kڠ�^\��^]j�����^O��;�oY^^�^V59;�c��^B��T�nb����^C��^N��s�x�<{�9-�F�T�^N�5�^Se-���^T�Y[���`^ZsL��v�բ<C�+�~�^ۚ��""�Yκ2^_�^VxT�>��/ݳ^U�m�^#���3^Ge�n^Vc�V�^#�NVn�,�q��^^^]gy�R�S��Ȃ$���>A�d����xg�^GB3�M�J�^QJ^]�^\�{.�D��碎�^W�8a����qޠl?,'^R�^X�Cgy�P[����mڞ��H�Z�s�SD&蠤�s�E��nu�O#O<��3wj`C-%w�W�J�^WP^T�^]r^NT�TC�Lq�Z�f�!�;�l�Y��Gb��>�ud�hx�Ԭ^N)9�^N!k�҉s�35v������.�""^]��~4������۴�Z^]u�^Ti^^�i:�)K��P᳕!�#�^?�>��EE^VE-u�^SgV^L��<��^D�O<�+�J.�c�Z#>�.l����^S�
ESC��(��E�j�π쬖���2{^U&b\��P^S�`^O^XdL�^ 6bu��FD��^#^#^#^#","field_x, data",field_y,field_z
The expected output would be an array:
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","field_x, data","field_y","field_z")
Or, though this is probably a separate question, an array like this (as if running strings on the octet-stream field):
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","Info.txt List.xml Filename.txt","field_x, data","field_y","field_z")
Edit 2: Every file that has a binary field also contains a length field for it. So instead of splitting directly I can walk left to right through my record and extract the fields. This is certainly a great improvement over my current situation, but problem 1) still persists. How can I split those files reliably?
I took a closer look at the files and a header looks like this:
RecordId, Field_A, Content_Type, Content_Length, Content, Field_B
(Where Content_Type can be "octet-stream", Content_Length is the number of bytes in the Content field, and Content is obviously the data.) Luckily for me, the value of Field_B is predictable; let's assume for a certain file it's always "Hello World".
So instead of relying on Spark's default behaviour of splitting on newlines, how can I make Spark split only on newlines that follow "Hello World"? (I also edited the question title since the focus of the question changed.)

As answered in Spark: Reading files using different delimiter than new line, I used textinputformat.record.delimiter to split on "Hello World\n" because I am a bit lucky that the last column always contains the same value. After that I simply walk left to right through the record and when I reach the length field I skip the next n bytes. Everything works now. Thanks for pointing me in the right direction.
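For completeness, the record-delimiter part looks roughly like this (a sketch; the path is a placeholder and sc is the usual SparkContext):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Make Hadoop's TextInputFormat treat "Hello World\n" as the record
// delimiter instead of a plain newline.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "Hello World\n")

val records = sc.newAPIHadoopFile("/path/to/file.csv",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }
// Each element of records is now a complete CSV record, newlines inside the
// octet-stream included; fields can then be extracted left to right, skipping
// Content_Length bytes when the Content field is reached.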

There are two problems: 1) The octet-streams can contain newlines which make textFile() split rows which should be one, and 2) The octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
Well, actually that csv file is properly escaped:
the multiline field is enclosed in double quotes: "MSCF^# .. ^#^#" (which also handles possible separators inside the field)
double quotes inside the field are escaped with another double quote, as they should be: Je^K""!
Of course a simple split will not work in this case (and should never be used on csv data), but any csv reader able to handle multiline fields should parse that data correctly.
Also keep in mind that the double quotes inside the octet-stream have to be unescaped, or that data won't be valid (another reason not to use split, but a csv reader that handles this).
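For example, with Spark 2.2 or later the built-in CSV reader can handle multiline quoted fields and doubled quotes (a sketch, not tested against the exact data above; the path is a placeholder):
val df = spark.read
  .option("multiLine", "true") // let quoted fields span newlines
  .option("escape", "\"")      // "" inside a quoted field is a literal quote
  .option("header", "false")
  .csv("/path/to/file.csv")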

Related

Trouble rendering CSV data as an interactive table in GitHub

When viewed, any .csv file committed to a GitHub repository automatically renders as an interactive table, complete with headers and row numbering. By default, the first row is your header row, and the table is supposed to render nicely.
However, GitHub flags an error in my tabular data, and despite the error being indicated, I can't figure out how to fix it.
I'm using a .csv file with a semicolon separator. Does anyone have an idea of what's happening?
According to the docs, GitHub can only do its table layout with .csv (comma-separated) and .tsv (tab-separated) files.
Using a semicolon as a separator isn't supported, at least not officially, and a spurious comma in a semicolon-separated file could well throw the algorithm off.
You could try replacing all semicolons with tabs and see how you fare.
If that doesn't work, try using commas as separators and enclose all text table cell data with quotes, like:
"Liver fibrosis, sclerosis, and cirrhosis","c370800","102922","Cystic fibrosis related cirrhosis","Diagnosis of liver fibrosis, sclerosis, and cirrhosis"
Note: no spaces after the commas. Also, if you have quotes in the text fields, you will have to escape those as "" (two quotes), or the algorithm will get confused.
You may get away with using quotes only for the offending text data, but that could well be more difficult to generate than just putting the quotes around all fields.

Is there a way for spark to read this odd text format?

The file format I have is sort of like csv and looks like this (abinitio .dat file of some sort):
1,apple,10.00,\n
2,banana,12.35,\n
3,orange,9.23,\n
The commas are actually "Start of Header" 0x01 byte characters, but I will use commas for simplicity. I can easily read the above sample by reading the file as a string RDD with a custom line split ,\n and then passing that into spark.read.csv. I am currently splitting lines by ,\n because there may be newlines in the data and I thought that those two characters were unique for each record. However a problem occurs when there are newline characters at the start of text fields. For example:
1,one \n apple,10.00,\n
2,two banana,12.35,\n
3,\n three orange,9.23,\n
My current code is able to ignore the newline in record 1, but it picks up the ,\n after the 3 and splits the 3 records into 4. How can I reliably read in this format?
My current ideas are:
Check that there are the right number of , column delimiters before allowing a split. I am not sure how to implement this; is it possible to do a regex look-behind when Spark sees a ,\n and check for the correct number of delimiters? (A rough sketch of this idea is shown after this list.)
Try to coerce the file into some other format besides CSV
Write my own InputFormat class, although I am not sure what this entails.
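As a rough sketch of the first idea (everything here is an assumption: rawRecords is the RDD[String] obtained by splitting on ,\n, each complete record has exactly 2 field delimiters, and the fragments of a broken record land in the same partition):
val expectedDelims = 2  // assumption: id, name, price -> 2 delimiters per complete record
val merged = rawRecords.mapPartitions { fragments =>
  val out = scala.collection.mutable.ArrayBuffer.empty[String]
  val buf = new StringBuilder
  for (fragment <- fragments) {
    if (buf.nonEmpty) buf.append(",\n")                   // restore the delimiter we split on
    buf.append(fragment)
    if (buf.toString.count(_ == ',') >= expectedDelims) { // record looks complete
      out += buf.toString
      buf.clear()
    }
  }
  if (buf.nonEmpty) out += buf.toString                   // flush a trailing fragment
  out.iterator
}
// With the real files the character to count would be the 0x01 byte rather
// than a comma, which avoids miscounting commas that appear inside fields.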

Removing spaces from a string using Powershell

I have an issue where, when extracting data from a database, spaces are quite often added within strings of text where they should not be.
What I'm trying to do is create a small script that will look at these strings and remove the spaces.
The problem is that the spaces can be in any position in the string, and the string is a variable that changes.
Example:
"StaffID": "0000 25" <- The space in the number should not be there.
Is there a way to have the script look at this particular line, and if it finds spaces, to remove them.
Or:"DateOfBirth": "23-10-199 0" <-It would also need to look at these spaces and remove them.
The problem is that the same data also has lines such as:
"Address": " 91 Broad street" <- The spaces should be here obviously.
I've tried using TRIM, but that only removes spaces from start/end.
Worth mentioning that the extracted data is in JSON format and is then imported into the new system via an API.
You should think about the logic of what you want to do, and whether it's programmatically possible to teach your script where it is and is not appropriate to have spaces. In the general case that is a genuinely hard problem, so unfortunately you're probably going to have to do some of this by hand.
If it were me, I'd specify the kind of data format I expect from each column and try to parse the strings accordingly. For example, if you know that StaffID doesn't contain spaces, you can have a rule that just deletes them (note that PowerShell's -replace operator takes a regex, whereas the .Replace() string method does not):
$staffid = $staffid -replace '\s+', ''
There are some more complicated things you can do with forced formatting (-replace) that have already been covered in this answer, but again, that requires some expectation of exactly what data is going to come out of which column.
You might want to look more closely at where those spaces are coming from, rather than process the output like this. Is the retrieval script doing it? Maybe you can optimize the database that you're drawing from?

PostgreSQL Trimming Leading and Trailing Characters: = and "

I'm working to build an import tool that utilizes a quoted CSV file. However, several of the fields in the CSV file are reported as such:
"=""38000"""
Where 38000 is the data I need. The data integration software I use (Talend 6.11) already strips the leading and trailing double quotes for me (so, "38000" becomes 38000), but I can't find a way to get rid of those others.
So, essentially, I need "=""38000""" to become "38000" where the leading "=" is removed and the trailing "" is removed.
Is there a TRIM function that can accomplish this for me? Perhaps there is a method in Talend that can do this?
As the other answer stated, you could do that operation in SQL, or you could do it in Java, Groovy, etc. within Talend. However, if there is an existing Talend component which does the job, my preference is to use it. That leads to faster development, potentially less testing, and easier maintenance. Having said that, it is important to review the components that are available so you know what you have to work with.
You can use the Talend component tReplace to inspect each of the input columns you want to strip of quotes and equal signs. A single tReplace component can do search-and-replace operations on multiple input columns. If all of the replaces are related to each other, I would keep them within a single tReplace. When it gets to the point of doing unrelated replacements, I might place those within a new tReplace so that logical operations are organized and grouped together.
tReplace
For a given Input Column
search for "=", replace with ""
search for "\"", replace with ""
Something like this:
SELECT format( '"%s"', trim( both '"=' from '"=""38000"""' ) );
-[ RECORD 1 ]---
format | "38000"
1st: the trim() function removes all " and = characters, leaving simply 38000.
2nd: format() adds the double quotes back to produce the desired end result.
Alternatively, you can use regular expressions (e.g. regexp_replace()) and other Postgres string functions.
See more:
https://www.postgresql.org/docs/current/static/functions-string.html

How to fill a field with spaces until a length in Notepad++

I've prepared a macro in Notepad++ to transform an ldif file into a csv file with a few fields. Everything is OK, but I have a final problem: two of the fields must have a specific length, and at the moment I cannot ensure that length because the values do not come padded in the source file.
For instance, I generate this line:
12345,namenamename,123456
And I have to ensure that the 2nd and 3rd fields have 30 characters (padded with spaces on the right) and 9 characters (padded with zeros on the left), so in this case I should generate:
12345,namenamename ,000123456
I haven't found a way for Notepad++ to match a pattern and pad with spaces/zeros, so I have thought of adding one space/zero to the proper field and repeating that step as many times as needed to reach the required lengths (that is, 29 and 8 times, because the fields cannot be empty), searching with the length in the regex (for instance, \d{1,8} for the third field).
My question is: can I repeat only one step of the macro several times (and run the rest of the macro only once)?
I've read the wiki related to this point (http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Editing_Configuration_Files#.3CMacros.3E) and I haven't found anything there either.
If that's not possible, what would be a good solution? Create another 2 macros and, after executing the main one, run these 2 macros several times?
Thanks in advance!
A two-pass solution with Notepad++ is possible. Find a pair of characters or two short sequences of characters that never occur in your data file. I will use =#<= and =>#= here.
First pass: generate or convert the input text into the form 12345,=#<=namenamename______________________________,000000000123456=>#=. I.e. add 30 spaces after the name and nine zeroes before the number (underscores are used here just to make things clearer).
Second pass, do a regular expression search for =#<=(.{30})_*,0*(\d{9})=>#= and replace with \1,\2.
I have just suggested a similar solution in special timestamp format of csv