Remove quotation marks - Swift - swift

From a .txt file I get strings that are full of quotation marks which of course are terrible to use in code. e.g.
3820,"20170217",8,752,119,"Rh",,"fr",,,,"iC1.2","iC1.2",,"IS6a","Z",,0,"IS6a",,201702161517,"-"
3821,"20170227",2,753,207,"Dd","Kru","sc",,,,"iB8","iB8",,"IS9b","Z",,2097152,"IS9b",,201702270804,"+~-"
3822,"20170227",3,753,8,"Dd",,"phH_1",,,,"iB8",,,"IS12~IT12","Z",,2097153,"IS12~IT12",C,201702270804,"-"
3823,"20170227",4,753,29,"Dd",,"phH_1",,,,"iB8",,,"IS11~IT11","Z",,2097153,"IS11~IT11",C,201702270804,"-"
How could I best get rid of these quotation marks??
Thanks in advance

This is CSV, so you'll need to parse it as CSV. There are several problems with your approach:
The quotes mess everything up
It's a mixture of strings and numbers - you won't be able to use an array
There are consecutive commas, that would be illegal in an array literal.
The best approach is to put the data in a file (in your application bundle if you like) and find a CSV parser to read the data. My quick Google yielded SwiftCSV which looks reasonable, but I've never used it.

Related

Trouble rendering CSV data as an interactive table in GitHub

When viewed, any .csv file committed to a GitHub repository automatically renders as an interactive table, complete with headers and row numbering. By default, the first row is your header row. The tables were supposed to look nice as below:
However, there's an error happening in my tabular data, and despite indicating the error, I can't fix it:
I'm using a .csv file with a semicolon separator. Does anyone have an idea of what's happening?
According to the docs, Github can only do its lay-out thing with .csv (comma-separated) and .tsv (tab-separated) files.
Using a semicolon as a separator isn't supported, at least not officially, and a spurious comma in a semicolon-separated file could well throw the algorithm off.
You could try replacing all semicolons with tabs and see how you fare.
If that doesn't work, try using commas as separators and enclose all text table cell data with quotes, like:
"Liver fibrosis, sclerosis, and cirrhosis","c370800","102922","Cystic fibrosis related cirrhosis","Diagnosis of liver fibrosis, sclerosis, and cirrhosis"
Note: no spaces after the commas. Also, if you have quotes in the text fields, you will have to escape those to "" (two quotes), or the algorithm will get confused.
You may get away with using quotes only for the offending text data, but that could well be more difficult to generate than just putting the quotes around all fields.

Removing spaces from a string using Powershell

I have an issue where extracting data from database it sometimes (quite often) adds spaces in between strings of texts that should not be there.
What I'm trying to do is create a small script that will look at these strings and remove the spaces.
The problem is that the spaces can be in any position in the string, and the string is a variable that changes.
Example:
"StaffID": "0000 25" <- The space in the number should not be there.
Is there a way to have the script look at this particular line, and if it finds spaces, to remove them.
Or:"DateOfBirth": "23-10-199 0" <-It would also need to look at these spaces and remove them.
The problem is that the same data also has lines such as:
"Address": " 91 Broad street" <- The spaces should be here obviously.
I've tried using TRIM, but that only removes spaces from start/end.
Worth mentioning that the data extracted is in json format and is then imported using API into the new system.
You should think about the logic of what you want to do, and whether or not it's programmatically possible to determine if you can teach your script where it is or is not appropriate to put spaces. As it is, this is one of the biggest problems facing AI research right now, so unfortunately you're probably going to have to do this by hand.
If it were me, I'd specify the kind of data format that I expect from each column, and try my best to attempt to parse those strings. For example, if you know that StaffID doesn't contain spaces, you can have a rule that just deletes them:
$staffid = $staffid.replace("\s+",'')
There are some more complicated things that you can do with forced formatting (.replace) that have already been covered in this answer, but again, that requires some expectation of exactly what data is going to come out of what column.
You might want to look more closely at where those spaces are coming from, rather than process the output like this. Is the retrieval script doing it? Maybe you can optimize the database that you're drawing from?

Modify certain section of csv file

I have a csv file that has multiople lines, which uses "|" as delimiter as the following shows.
MAIN|02/21/2018|7695154880|082.21|00021
MAIN|02/21/2018|7695154880|000.21|00210
I would like to remove the leading zeros of forth and fifth sections so the following shows desired result. Is there any way can do it in powershell? Thank you.
MAIN|02/21/2018|7695154880|82.21|21
MAIN|02/21/2018|7695154880|0.21|210
Agree with Olaf here, including the zero thing.
Yet, maybe it's just the visual thing for you.
Anyway, what you are asking for is similar to this post.
Remove leading zeros in csv file from int values only
I have this csv file I'm trying to remove leading zeros from
Remove leading zeros in csv file from int values only
I initially thought Regex like [regex]::replace($value,'^(0+)','')
But - in my opinion it would be better just to do
$csv|Foreach-Object{$_.ValueTORemoveLeadingZeroesFrom=
[float]$_.ValueTORemoveLeadingZeroesFrom}
By casting string to float You actually get this done easier I think?

Spark: Split CSV with newlines in octet-stream field

I am using Scala to parse CSV files. Some of these files have fields which are non-textual data like images or octet-streams. I would like to use Apache Spark's textFile() method to split up the CSV into rows, and
split(",[ ]*(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
to split the row into fields. Unfortunatly this does not work with files that have these mentioned binary fields. There are two problems: 1) The octet-streams can contain newlines which make textFile() split rows which should be one, and 2) The octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
The files are usually big, couple of MBs up to couple of 100MBs. I have to take the CSV's as they are, although I could preprocess them.
All I want to achieve is a working split function so I can ignore the field with the octet-stream. Nevertheless, a great bonus would be to extract the textual information in the octet-stream.
So how would I go forward to solve my problems?
Edit: A typical record obtained with cat, the newlines are from the file, not for cosmetic purposes (shortened):
7,url,user,02/24/2015 02:29:00 AM,03/22/2015 03:12:36 PM,octet-stream,27156,"MSCF^#^#^#^#�,^#^#^#^#^#^#D^#^#^#^#^#^#^#^C^A^A^#^C^#^D^#^#^#^#^#^T^#^#^#^#^#^P^#�,^#^#^X=^#^#^#^#^#^#^#^#^#^#�^#^#^#^E^#^A^#��^A^#^#^#^#^#^#^#WF6�!^#Info.txt^#=^B^#^#��^A^#^#^#WF7�^#^#List.xml^#^�^#^#��^A^#^#^#WF:�^#^#Filename.txt^#��>��
^#�CK�]�r��^Q��T�^O�^#�-�j�]��FI�Ky��Ei�Je^K""!�^Qx #�*^U^?�^_�;��ħ�^LI^#$(�^Q���b��\N����t�����+������ȷgvM�^L̽�LǴL�^L��^ER��w^Ui^M��^X�Kޓ�^QJȧ��^N~��&�x�bB��D]1�^B|^G���g^SyG�����:����^_P�^T�^_�����U�|B�gH=��%Z^NY���,^U�^VI{��^S�^U�!�^Lpw�T���+�a�z�l������b����w^K��or��pH� ��ܞ�l��z�^\i=�z�:^C�^S!_ESCW��ESC""��g^NY2��s�� u���X^?�^R^R+��b^]^Ro�r���^AR�h�^D��^X^M�^]ޫ���ܰ�^]���0^?��^]�92^GhCx�DN^?
mY<{��L^Zk�^\���M�^V^HE���-Ե�$f�f����^D�e�^R:�u����� ^E^A�Ȑ�^B�^E�sZ���Yo��8Eސ�}��&JY���^A9^P������^P����~Jʭy��`�^9«�""�U� �:�}3���6�Hߧ�v���A7^Xi^L^]�sA�^Q�7�5d�^Xo˛�tY
Bp��4�Y���7DkV_���\^_q~�w�|�a�s̆���#�g�ӳu�^�!W}�n��Rgż_2�]�p�2}��b�G9�M^Q
�����:�X����bR[ԳZV!^G����^U�tq�&�Y6b��GR���s#mn6Z=^ZH^]�b��R^G�C�0R��{r1��4�#�
=r/X2�^O�����r^M�Rȕ�goG^X-����}���P+˥Qf�#��^C�Բ�z1�I�j����6�^Np���ܯ^P�[�^Tzԏ���^F2�e��\�E�߻6c�%���$�:E�*�*©t�y�J�,�S�2U�S�^X}ME�]��]�i��G�su�""��!�-��!r'ܷe_et Y^K^?0���l^A��^^�m�1/q����|�_r�5$�%�([x��W^E�G^^y���#����Z2^?ڠ�^_��^AҶ�OO��^]�vq%:j�^?�jX��\�]����^S�^^n�^C��>.^CY^O-� �_�\K����:p�<7Sֺnj���-Yk�r���^Q^M�n�J^B��^Z0^?�(^C��^W³!�g�Z�~R�A^M�^O^^�%;��Ԗ�p^S�w���*m^S���jڒ|�����<�^S�;Z^^Fc�1���^O�G_o����8��CS���w��^?��n�2~��m���G;��rx4�(�]�'��^E���eƧ�x��.�w�9WO�^^�י3��0,�y��H�Y�.H�x�""'���h}灢^T�Gm;^XE�̼�J��c�^^񾠫;�^A�qZ1ׁBZ^Q�^A^FB�^QbQ�_�3|ƺ�EvZ���^S�w���^P���9^MT��ǩY[+�+�9�Ԩ�^O�^Q���Fy(+�9p�^^Mj�2��Y^?��ڞ��^Ķb�^Z�ψMр}�ڣ�^^S�^?��^U�^Wڻ����z�^#��uk��k^^�>^O�^W�ݤO�h�^G�����Kˇ�.�R|�)-��e^G�^]�/J����U�ϴ�a���i5HO�^L�ESCg�R'���.����d���+~�}��ڝ^Y5]l�3jg54M�������2t�5^Y}�q)��^O;�X\�q^Ox~Vۗ�t�^\f� >k;^G�K5��,��X�t/�ǧ^G""5��4^MiΟ�n��^B^]�|�����V��ߌ֗Q~�H���8��t��5��ܗ�
�Z�^c�6N�ESCG����^_��>��t^L^R�^:�x���^]v�{^#+KM��qԎ�.^S�%&��=^W-�=�^S�����^CI���&^]_�s�˞�y�z�Jc^W�kڠ�^\��^]j�����^O��;�oY^^�^V59;�c��^B��T�nb����^C��^N��s�x�<{�9-�F�T�^N�5�^Se-���^T�Y[���`^ZsL��v�բ<C�+�~�^ۚ��""�Yκ2^_�^VxT�>��/ݳ^U�m�^#���3^Ge�n^Vc�V�^#�NVn�,�q��^^^]gy�R�S��Ȃ$���>A�d����xg�^GB3�M�J�^QJ^]�^\�{.�D��碎�^W�8a����qޠl?,'^R�^X�Cgy�P[����mڞ��H�Z�s�SD&蠤�s�E��nu�O#O<��3wj`C-%w�W�J�^WP^T�^]r^NT�TC�Lq�Z�f�!�;�l�Y��Gb��>�ud�hx�Ԭ^N)9�^N!k�҉s�35v������.�""^]��~4������۴�Z^]u�^Ti^^�i:�)K��P᳕!�#�^?�>��EE^VE-u�^SgV^L��<��^D�O<�+�J.�c�Z#>�.l����^S�
ESC��(��E�j�π쬖���2{^U&b\��P^S�`^O^XdL�^ 6bu��FD��^#^#^#^#","field_x, data",field_y,field_z
Expected output would be an array
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","field_x, data",field_y",field_z")
Or, but this is probably another question, such an array (like running strings on the octet-stream field):
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","Info.txt List.xml Filename.txt","field_x, data",field_y",field_z")
Edit 2: Every file that has a binary field also contains a length field for it. So instead of splitting directly I can walk left to right through my record and extract the fields. This is certainly a great improvement of my current situation but problem 1) still persists. How can I split those files reliably?
I took a closer look at the files and a header looks like this:
RecordId, Field_A, Content_Type, Content_Length, Content, Field_B
(Where Content_Type can be "octet-stream", Content_Length the number of bytes in the Content field, and Content obviously the data). And good for me, the value of Field_B is predictable, let's assume for a certain file it's always "Hello World".
So instead of using Spark's default behaviour splitting on newlines, how can I achieve that Spark is only splitting on newlines following "Hello World"? (I also edited the question title since the focus of the question changed)
As answered in Spark: Reading files using different delimiter than new line, I used textinputformat.record.delimiter to split on "Hello World\n" because I am a bit lucky that the last column always contains the same value. After that I simply walk left to right through the record and when I reach the length field I skip the next n bytes. Everything works now. Thanks for pointing me in the right direction.
There are two problems: 1) The octet-streams can contain newlines
which make textFile() split rows which should be one, and 2) The
octet-streams contain commas and/or double quotes which are not
escaped and mess up my schema.
Well, actually that csv file is properly escaped:
the multiline field is enclosed in double quotes: "MSCF^# .. ^#^#" (which also handles possible separators inside the field)
double quotes inside the field are escaped with another double quote as it should be: Je^K""!
Of course a simple split will not work in this case (and should never be used on csv data), but any csv reader able to handle multiline fields should parse that data correctly.
Also keep in mind that the double quotes inside the octet-stream have to be unescaped, or that data won't be valid (another reason not to use split, but a csv reader that handles this).

How to parse special characters in XML for iPad?

I am getting problem while parsing xml files that contains some special characters like single quote,double quote (', "")etc.I am using NSXMLParser's parser:foundCharacters:method to collect characters in my code.
<synctext type = "word" >They raced to the park Arthur pointed to a sign "Whats that say" he asked Zoo said DW Easy as pie</synctext>
When i parse and save the text from above tag of my xml file,the resultant string is appearing,in GDB, as
"\n\t\tThey raced to the park Arthur pointed to a sign \"Whats that say\" he asked Zoo said DW Easy as pie";
Observe there are 2 issues:
1)Unwanted characters at the beginning of the string.
2)The double quotes around Whats that say.
Can any one please help me how to get rid of these unwanted characters and how to read special characters properly.
NSString*string =[string stringByTrimmingCharactersInSet:[NSCharacterSet characterSetWithCharactersInString:#" \n\t"]];
The parser is apparently returning exactly what's in the string. That is, the XML was coded with the starting tag on one line, a newline, two tabs, and the start of the string. And quotes in the string are obviously there in the original (and it's not clear in at least this example why you'd want to delete them).
But if you want these characters gone then you need to post-process the string. You can use Rams' statement to eliminate the newline and tabs, and stringByReplacingOccurrencesOfString:WithString: to zap the quotes.
(Note that some XML parsers can be instructed to return strings like this with the leading/trailing stuff stripped, but I'm not sure about this one. The quotes will always be there, though.)