Text data source supports only a single column, and you have 8 columns - scala

This is the error I got when I tried to save a data frame to text:
org.apache.spark.sql.AnalysisException: Text data source supports only a single column, and you have 8 columns
This is the code:
df.write.text("/tmp/wt")
What am I doing wrong?

In Spark 1.6, the easiest solution is to use Databricks' spark-csv library and write:
df.write.format("com.databricks.spark.csv").save("pathToFile.csv")
If you do not want to use it, you can simply convert the rows of your DataFrame into CSV lines like this:
df.rdd
  .map(_.toSeq.mkString(";"))  // join all fields with ";" (nulls become "null")
  .saveAsTextFile("pathToFile.csv")
Note that if your fields can contain separators or quotes, you will have to add enclosing quotes and escape any existing quotes yourself, things the library would otherwise do for you transparently.
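If you specifically want df.write.text to work, note that it requires exactly one string column, so you can collapse all columns into a single one first. A minimal sketch, assuming Spark 1.6, where concat_ws lives in org.apache.spark.sql.functions:
import org.apache.spark.sql.functions.{col, concat_ws}

// Collapse all columns into one string column named "value";
// write.text is then happy to save it.
val single = df.select(concat_ws(";", df.columns.map(col): _*).as("value"))
single.write.text("/tmp/wt")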

Related

Trying to work around the error DF-CSVWriter-InvalidEscapeSetting

So I have a dataset which I want to export to CSV with a pipe as separator and no escape character.
That dataset in fact contains 4 source columns: 3 regular ones (just text) and one variable one.
That last column holds another subset of values that are also separated with a pipe.
The purpose is that the export looks like this, where the values come from my 4th field:
COL1|COL2|COL3|VAL1|VAL2|VAL3|....
The number of values can be different for each record.
When I set the CSV export separator to ";", I get this result, which is expected:
COL1;COL2;COL3;VAL1|VAL2|VAL3|....
However, setting it to "|" throws the error DF-CSVWriter-InvalidEscapeSetting,
most likely because it detected the separator character in my 4th field and therefore enforces that an escape character be set.
That is a logical thing in most cases, but here I would like it to ignore this and just export as-is.
Is there any way to work around this, perhaps with a different approach or some additional settings?
Split & flatten produces extra rows, but that's not what I want.
Because the column value contains the same character as your delimiter and your dataset has no escape character set, the export will throw an error.
You have to change the delimiter to a different character, or set both the Quote character and the Escape character to a double quote (").
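For comparison, here is the same idea expressed in plain Spark (a sketch only, with a hypothetical output path; ADF data flows run on Spark under the hood). With quote and escape both set to a double quote, embedded pipes survive the export:
// Quote fields and escape embedded quotes, so "|" inside a value is safe.
df.write
  .option("sep", "|")
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("/tmp/out")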

Trouble rendering CSV data as an interactive table in GitHub

When viewed, any .csv file committed to a GitHub repository automatically renders as an interactive table, complete with headers and row numbering. By default, the first row is your header row, and the table is supposed to look nice that way.
However, GitHub reports an error in my tabular data, and despite the error being indicated, I can't fix it.
I'm using a .csv file with a semicolon separator. Does anyone have an idea of what's happening?
According to the docs, GitHub can only do its layout thing with .csv (comma-separated) and .tsv (tab-separated) files.
Using a semicolon as a separator isn't supported, at least not officially, and a spurious comma in a semicolon-separated file could well throw the algorithm off.
You could try replacing all semicolons with tabs and see how you fare.
If that doesn't work, try using commas as separators and enclose all text table cell data with quotes, like:
"Liver fibrosis, sclerosis, and cirrhosis","c370800","102922","Cystic fibrosis related cirrhosis","Diagnosis of liver fibrosis, sclerosis, and cirrhosis"
Note: no spaces after the commas. Also, if you have quotes in the text fields, you will have to escape them as "" (two quotes), or the algorithm will get confused.
You may get away with using quotes only for the offending text data, but that could well be more difficult to generate than just putting the quotes around all fields.
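A minimal sketch of that conversion in Scala, assuming the source file contains no quoted semicolons (the file names are hypothetical):
import java.io.PrintWriter
import scala.io.Source

// Convert a semicolon-separated file to comma-separated CSV,
// quoting every field and doubling any embedded quotes.
val out = new PrintWriter("data.csv")
for (line <- Source.fromFile("data.ssv").getLines()) {
  val fields = line.split(";", -1)  // -1 keeps trailing empty fields
  out.println(fields.map(f => "\"" + f.replace("\"", "\"\"") + "\"").mkString(","))
}
out.close()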

Matlab dataimport

My Matlab code for data import is giving me different results for what appear to be similar text files as input. Input1 gives me a normal cell array with all lines from the text file as entries, which I can reference using {i}.
Input2 gives me a scalar data structure where all numeric entries in my text file are converted into the input.data structure. I want all files to be converted to regular cell entries, and I do not understand why some files are converted to scalar data structures instead.
Code: input = importdata(strcat(direct,'\',filename));
Input1 example: correctly working data import, with the text file on the right.
File link: https://drive.google.com/open?id=1aHK4xivqEqJEmaA8Dv8Y0uW5giG-Bbip
Input2 example: incorrectly working data import, with the text file on the right.
File link: https://drive.google.com/open?id=1nzUj_wR1bNXFcGaSLGva6uVsxrk-R5vA
UTSL!
I'm guessing you are using GNU Octave, although you wrote "Matlab" as the topic of your question.
In importdata.m around line 178, the code tries to automatically detect the delimiter for your data:
delim = regexpi (row, '[-+\d.e*ij ]+([^-+\de.ij])[-+\de*.ij ]','tokens', 'once');
If you run this against W40A0060; you get A as the delimiter, because there is basically a number before and after it.
If you run this against W39E0016; you get {} (empty) as the delimiter, because the E could be part of a number in scientific notation and is therefore excluded.
Solution:
You really should pass the correct delimiter to the importdata call and not trust that it is magically detected.
And if you just want the lines in a cell, use
strsplit (fileread ("W39E0016_Input2.txt"), "\n")
Analysis
This indeed looks strange!
EDIT: The cause of this strange-looking behaviour has been deciphered by @Andy (see his solution).
When you use all output arguments of the importdata() function, you can see what happens while the data is read:
[dat1,del1,headerrows1]=importdata('Input1.txt')
[dat2,del2,headerrows2]=importdata('Input2.txt')
For your first file it recognizes 69 header rows and no delimiter:
del1 = []
headerrows1 = 69
while in your second file only two header rows and a comma (,) delimiter are recognized:
del2 = ','
headerrows2 = 2
I cannot find an obvious reason in your files for this different interpretation of the data.
Suggestion
Your data format is rather complex. It is not a simple table like one produced from Excel; it has multiple lines with a different number of fields per line and varying data types. importdata() is not designed for this type of data. I suggest writing a specific import function for this kind of file. Have a look at textread() for a first guess. You can use it to read the lines of the files as text and later interpret them with sscanf(), or use strsplit() to split the line contents into fields.

Is it possible to have multiple row separators in Talend?

I'm facing a challenge for one of my first projects as a junior dev. I'm using Talend to open some metadata files that have a series of "key=value" pairs within the files. I eventually need to transform the metadata and write it as a new row in an Excel file.
The metadata file looks something like this:
DOCTYPE=some_data
DOCNBR=some_data
DOCREV=some_data
DOCBASE=some_data
DOCNAME=some_data
RELEASE=some_data
DWG=TYPE=2;NAME=some_data;SIZE=some_data
DESCRIPTION=some_data
Line 7 of the example above (DWG=TYPE=2;NAME=some_data;SIZE=some_data) is what I'm stuck on when I'm attempting to create a new delimited metadata file, using "=" as the field separator and "\n" as the row separator.
Is there a way to have multiple row separators to include ";" so that I could have the other items on line 7 on their own rows?
Yes, you can.
Write a regex that matches both \n and ;, and supply it as the row separator.
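Outside Talend, the idea looks like this (a minimal Scala sketch, assuming every ";"-separated item on such a line is itself a key=value pair):
// Treat both "\n" and ";" as row separators by splitting on a regex
// that matches either, then split each row on the first "=" only.
val raw = "DWG=TYPE=2;NAME=some_data;SIZE=some_data"
val pairs = raw.split("[\n;]").map { row =>
  val Array(key, value) = row.split("=", 2)  // 2 = split at the first '=' only
  key -> value
}
// pairs: Array((DWG,TYPE=2), (NAME,some_data), (SIZE,some_data))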

Spark: Split CSV with newlines in octet-stream field

I am using Scala to parse CSV files. Some of these files have fields containing non-textual data like images or octet-streams. I would like to use Apache Spark's textFile() method to split up the CSV into rows, and
split(",[ ]*(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
to split each row into fields. Unfortunately, this does not work for files that have the binary fields mentioned. There are two problems: 1) the octet-streams can contain newlines, which make textFile() split rows that should be one, and 2) the octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
The files are usually big, from a couple of MB up to a few hundred MB. I have to take the CSVs as they are, although I could preprocess them.
All I want to achieve is a working split function so I can ignore the field with the octet-stream. Nevertheless, a great bonus would be to extract the textual information in the octet-stream.
So how would I go forward to solve my problems?
Edit: A typical record obtained with cat; the newlines are from the file, not for cosmetic purposes (shortened):
7,url,user,02/24/2015 02:29:00 AM,03/22/2015 03:12:36 PM,octet-stream,27156,"MSCF^#^#^#^#�,^#^#^#^#^#^#D^#^#^#^#^#^#^#^C^A^A^#^C^#^D^#^#^#^#^#^T^#^#^#^#^#^P^#�,^#^#^X=^#^#^#^#^#^#^#^#^#^#�^#^#^#^E^#^A^#��^A^#^#^#^#^#^#^#WF6�!^#Info.txt^#=^B^#^#��^A^#^#^#WF7�^#^#List.xml^#^�^#^#��^A^#^#^#WF:�^#^#Filename.txt^#��>��
^#�CK�]�r��^Q��T�^O�^#�-�j�]��FI�Ky��Ei�Je^K""!�^Qx #�*^U^?�^_�;��ħ�^LI^#$(�^Q���b��\N����t�����+������ȷgvM�^L̽�LǴL�^L��^ER��w^Ui^M��^X�Kޓ�^QJȧ��^N~��&�x�bB��D]1�^B|^G���g^SyG�����:����^_P�^T�^_�����U�|B�gH=��%Z^NY���,^U�^VI{��^S�^U�!�^Lpw�T���+�a�z�l������b����w^K��or��pH� ��ܞ�l��z�^\i=�z�:^C�^S!_ESCW��ESC""��g^NY2��s�� u���X^?�^R^R+��b^]^Ro�r���^AR�h�^D��^X^M�^]ޫ���ܰ�^]���0^?��^]�92^GhCx�DN^?
mY<{��L^Zk�^\���M�^V^HE���-Ե�$f�f����^D�e�^R:�u����� ^E^A�Ȑ�^B�^E�sZ���Yo��8Eސ�}��&JY���^A9^P������^P����~Jʭy��`�^9«�""�U� �:�}3���6�Hߧ�v���A7^Xi^L^]�sA�^Q�7�5d�^Xo˛�tY
Bp��4�Y���7DkV_���\^_q~�w�|�a�s̆���#�g�ӳu�^�!W}�n��Rgż_2�]�p�2}��b�G9�M^Q
�����:�X����bR[ԳZV!^G����^U�tq�&�Y6b��GR���s#mn6Z=^ZH^]�b��R^G�C�0R��{r1��4�#�
=r/X2�^O�����r^M�Rȕ�goG^X-����}���P+˥Qf�#��^C�Բ�z1�I�j����6�^Np���ܯ^P�[�^Tzԏ���^F2�e��\�E�߻6c�%���$�:E�*�*©t�y�J�,�S�2U�S�^X}ME�]��]�i��G�su�""��!�-��!r'ܷe_et Y^K^?0���l^A��^^�m�1/q����|�_r�5$�%�([x��W^E�G^^y���#����Z2^?ڠ�^_��^AҶ�OO��^]�vq%:j�^?�jX��\�]����^S�^^n�^C��>.^CY^O-� �_�\K����:p�<7Sֺnj���-Yk�r���^Q^M�n�J^B��^Z0^?�(^C��^W³!�g�Z�~R�A^M�^O^^�%;��Ԗ�p^S�w���*m^S���jڒ|�����<�^S�;Z^^Fc�1���^O�G_o����8��CS���w��^?��n�2~��m���G;��rx4�(�]�'��^E���eƧ�x��.�w�9WO�^^�י3��0,�y��H�Y�.H�x�""'���h}灢^T�Gm;^XE�̼�J��c�^^񾠫;�^A�qZ1ׁBZ^Q�^A^FB�^QbQ�_�3|ƺ�EvZ���^S�w���^P���9^MT��ǩY[+�+�9�Ԩ�^O�^Q���Fy(+�9p�^^Mj�2��Y^?��ڞ��^Ķb�^Z�ψMр}�ڣ�^^S�^?��^U�^Wڻ����z�^#��uk��k^^�>^O�^W�ݤO�h�^G�����Kˇ�.�R|�)-��e^G�^]�/J����U�ϴ�a���i5HO�^L�ESCg�R'���.����d���+~�}��ڝ^Y5]l�3jg54M�������2t�5^Y}�q)��^O;�X\�q^Ox~Vۗ�t�^\f� >k;^G�K5��,��X�t/�ǧ^G""5��4^MiΟ�n��^B^]�|�����V��ߌ֗Q~�H���8��t��5��ܗ�
�Z�^c�6N�ESCG����^_��>��t^L^R�^:�x���^]v�{^#+KM��qԎ�.^S�%&��=^W-�=�^S�����^CI���&^]_�s�˞�y�z�Jc^W�kڠ�^\��^]j�����^O��;�oY^^�^V59;�c��^B��T�nb����^C��^N��s�x�<{�9-�F�T�^N�5�^Se-���^T�Y[���`^ZsL��v�բ<C�+�~�^ۚ��""�Yκ2^_�^VxT�>��/ݳ^U�m�^#���3^Ge�n^Vc�V�^#�NVn�,�q��^^^]gy�R�S��Ȃ$���>A�d����xg�^GB3�M�J�^QJ^]�^\�{.�D��碎�^W�8a����qޠl?,'^R�^X�Cgy�P[����mڞ��H�Z�s�SD&蠤�s�E��nu�O#O<��3wj`C-%w�W�J�^WP^T�^]r^NT�TC�Lq�Z�f�!�;�l�Y��Gb��>�ud�hx�Ԭ^N)9�^N!k�҉s�35v������.�""^]��~4������۴�Z^]u�^Ti^^�i:�)K��P᳕!�#�^?�>��EE^VE-u�^SgV^L��<��^D�O<�+�J.�c�Z#>�.l����^S�
ESC��(��E�j�π쬖���2{^U&b\��P^S�`^O^XdL�^ 6bu��FD��^#^#^#^#","field_x, data",field_y,field_z
Expected output would be an array
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","field_x, data",field_y",field_z")
Or, but this is probably another question, such an array (like running strings on the octet-stream field):
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","Info.txt List.xml Filename.txt","field_x, data",field_y",field_z")
Edit 2: Every file that has a binary field also contains a length field for it. So instead of splitting directly, I can walk left to right through my record and extract the fields. This is certainly a great improvement over my current situation, but problem 1) still persists. How can I split those files reliably?
I took a closer look at the files and a header looks like this:
RecordId, Field_A, Content_Type, Content_Length, Content, Field_B
(Where Content_Type can be "octet-stream", Content_Length is the number of bytes in the Content field, and Content is obviously the data.) And luckily for me, the value of Field_B is predictable; let's assume for a certain file it's always "Hello World".
So instead of using Spark's default behaviour of splitting on newlines, how can I make Spark split only on newlines that follow "Hello World"? (I also edited the question title, since the focus of the question changed.)
As answered in Spark: Reading files using different delimiter than new line, I used textinputformat.record.delimiter to split on "Hello World\n", because I am a bit lucky in that the last column always contains the same value. After that I simply walk left to right through the record, and when I reach the length field I skip the next n bytes. Everything works now. Thanks for pointing me in the right direction.
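A minimal sketch of that custom-delimiter trick (the path is a placeholder, and sc is the usual SparkContext):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Records end with "Hello World\n", so newlines inside the
// octet-stream field no longer break a record apart.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "Hello World\n")
val records = sc
  .newAPIHadoopFile("/path/to/file.csv", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, line) => line.toString }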
There are two problems: 1) The octet-streams can contain newlines which make textFile() split rows which should be one, and 2) The octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
Well, actually that CSV file is properly escaped:
the multiline field is enclosed in double quotes: "MSCF^# .. ^#^#" (which also handles possible separators inside the field),
and double quotes inside the field are escaped with another double quote, as they should be: Je^K""!
Of course a simple split will not work in this case (and should never be used on CSV data), but any CSV reader able to handle multiline fields should parse that data correctly.
Also keep in mind that the double quotes inside the octet-stream have to be unescaped, or that data won't be valid (another reason not to use split, but a CSV reader that handles this).
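For illustration, a sketch of such a reader, assuming a Spark version (2.2+) whose built-in CSV source supports multiline quoted fields:
// "multiLine" allows newlines inside quoted fields; with quote and escape
// both set to '"', a doubled "" inside a field is read back as one quote.
val df = spark.read
  .option("multiLine", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("/path/to/file.csv")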