sed or perl: pattern match and then multiple actions

I want to search for a variable numeric value at the end of a line and, once found, do two things:
append something to the end of that line (for example, an HTML tag);
move down as many lines as that numeric value and append a line there as well (for example, closing the preceding HTML tag).
Any quick pointers, please? Attached is a screenshot of my sample data and desired changes. Note that in some cases, if there are no members, there won't be a numeric value at the end of the line.
Sample data

Your data is well structured, so it is totally fine to use sed.
You need to identify the headers, then operate on them.
input
header 1
    item 1.1
    item 1.2
    item 1.3
header 2
    item 2.1
    item 2.2
    item 2.3
header 3
    item 3.1
    item 3.2
    item 3.3
command
sed -r '/^\S/{s#$#<wrap>#; 1!s#^#</wrap>\n#}; $s#$#\n</wrap>#' input
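Reading the command left to right: /^\S/ selects the header lines (the item lines are indented, so they start with whitespace); s#$#<wrap># opens the tag at the end of each header; 1!s#^#</wrap>\n# puts a closing tag on its own line before every header except the first; and $s#$#\n</wrap># closes the last block after the final line.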
output
header 1<wrap>
    item 1.1
    item 1.2
    item 1.3
</wrap>
header 2<wrap>
    item 2.1
    item 2.2
    item 2.3
</wrap>
header 3<wrap>
    item 3.1
    item 3.2
    item 3.3
</wrap>

OK, I guess I'll settle for this one-liner for now (see my sample data in the question):
perl -pe '/\((\d+) members?\)/ && do {$close = $.+$1; s/$/OPEN_TAG/;}; $.==$close && do {s/$/CLOSE_TAG/};' input-file
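Taking it apart: $. is the current line number. Whenever a line matches \((\d+) members?\), the one-liner records the line where the group ends ($close = $. + $1) and appends OPEN_TAG to that line; once the input reaches line $close, it appends CLOSE_TAG there. It tracks only one pending close at a time, so it assumes the groups do not nest or overlap.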

Related

github markdown split source table lines

I have a pretty complex markdown table with plenty of columns.
I want to keep the linter (in my case markdownlint) happy and keep lines within the 80-character limit. But the header data is complex, so my table looks like this:
| fooooooooooooo | baaaaaaaaar | foooooooooooo | baaaaaaaaar | fooooooooooo |
|----------------|-------------|---------------|-------------|--------------|
|1|2|3|4|5|
The rendered result of that table is what I need, and it looks OK on GitHub.
I'm not sure this is a great idea, but is there any way to split table cells across lines in the source, while keeping the rendered output the same?
Something like this:
| fooooooooooooo |\
| baaaaaaaaar \
| foooooooooooo \
| baaaaaaaaar \
| fooooooooooo |
In short: No.
GitHub's spec does not provide for breaking a row across lines. Of note is the description of rows:
Each row consists of cells containing arbitrary text, in which inlines
are parsed, separated by pipes (|). A leading and trailing pipe is
also recommended for clarity of reading, and if there’s otherwise
parsing ambiguity. Spaces between pipes and cell content are trimmed.
Block-level elements cannot be inserted in a table.
Of course, while that doesn't specifically support it, it also doesn't explicitly exclude breaking a row across multiple lines. However, notice that the syntax does not offer any way (outside of a line break) to define when one row ends and another begins (unlike the header row, which requires a "delimiter row" to divide it from the body of the table). As you cannot define the division between rows, the only way the parser can determine where one row ends and another begins is with a line break.
And then we have this issue:
The remainder of the table’s rows may vary in the number of cells. If
there are a number of cells fewer than the number of cells in the
header row, empty cells are inserted. If there are greater, the excess
is ignored:
In other words, the parser cannot count columns to determine whether the next line is a continuation of the previous row or a new row.
Finally, elsewhere the spec states that:
A backslash at the end of the line is a hard line break:
There are some exceptions for specific types of content, but tables are not mentioned at all in the backslash escapes section of the spec and therefore do not fit any of those exceptions. Thus, using a backslash escape at the end of the line only reinforces the fact that the line ends a row, which is the opposite of your desired effect.
So, no, there is no way to wrap a table row across multiple lines.
For comparison consider MultiMarkdown, which had support for the same table syntax long before GitHub offered it. MultiMarkdown's documentation plainly states:
Cell content must be on one line only
This behavior matches PHP Markdown Extra, which first introduced the syntax. In fact, I'm not aware of any implementation of this specific table syntax which supports any way for one row to be defined on multiple lines.

Is it possible to have multiple row separators in Talend?

I'm facing a challenge on one of my first projects as a junior dev. I'm using Talend to open some metadata files that contain a series of "key=value" pairs. I eventually need to transform the metadata and write it as a new row in an Excel file.
The metadata file looks something like this:
DOCTYPE=some_data
DOCNBR=some_data
DOCREV=some_data
DOCBASE=some_data
DOCNAME=some_data
RELEASE=some_data
DWG=TYPE=2;NAME=some_data;SIZE=some_data
DESCRIPTION=some_data
Line 7 of the example above (DWG=TYPE=2;NAME=some_data;SIZE=some_data) is what I'm stuck on when I'm attempting to create a new delimited metadata file, using "=" as the field separator and "\n" as the row separator.
Is there a way to have multiple row separators to include ";" so that I could have the other items on line 7 on their own rows?
Yes, you can.
Write a regex that matches both \n and ; and give it to the row separator field.
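Assuming the component really does accept a regular expression there (as this answer suggests), a pattern such as "\n|;" would treat both a newline and a semicolon as the end of a record, so line 7 becomes three rows. A quick sketch of what that pattern does (plain Scala, just to illustrate the split; Talend ultimately runs the same Java regex engine):

"DWG=TYPE=2;NAME=some_data;SIZE=some_data".split("\n|;")
// -> Array("DWG=TYPE=2", "NAME=some_data", "SIZE=some_data")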

Exiftool - modify metadata format

Suppose I have 5000 images with the following metadata in the LABEL field.
0001 ELEPHANT
0002 ELEPHANT
0003 ELEPHANT
...
4999 ELEPHANT
5000 ELEPHANT
I wish to change the format to:
ELEPHANT-0001
ELEPHANT-0002
ELEPHANT-0003
…
ELEPHANT-4999
ELEPHANT-5000
In other words, I want to do the following for a metadata field of multiple images:
#### NAME —> NAME-####
From what I can gather, there could be two ways of doing this:
Ignore the current metadata in the images, and reference a (plain text? CSV?) file that I prepare separately; or
Read the file's metadata as a string, identify the space and the number preceding it, save that number, and finally make a new string with the name and the number swapped and a hyphen in between.
Any suggestions?
Expanding upon the answer I gave in the exiftool forums.
The basic command would be
exiftool "-LABEL<${LABEL;s/(\d{4}) (.*)/$2-$1/}" <FileOrDir>
You basically want to copy a tag into the same tag, with some modifications. The option to copy a tag is the less-than (or greater-than) symbol, < or >. A common mistake is to use the equals sign =, which is used to assign a static value to a tag.
The modification to the tag uses the advanced formatting option, which is actually some inline Perl code. In this example, the tag is treated as a Perl string and a regex substitution is used. It matches and captures the first four digits (\d{4}), matches the space (but doesn't capture it), then matches and captures the rest of the tag (.*). The two captures are assigned to the variables $1 and $2, respectively. In the replacement half of the substitution, $2-$1, the two captures are reversed with a hyphen between them.
To take full advantage of the advanced formatting, some basic Perl and regex knowledge is helpful.
Once you are sure of the command, you can add -overwrite_original to suppress the generation of backup files and -r to recurse into subdirectories.
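Before committing to the change, it can help to print the current values with exiftool -LABEL <FileOrDir> to see which files will be affected; and since the first runs are done without -overwrite_original, every modified image keeps a FILE_original backup while you check the results, so nothing is lost if the regex needs adjusting.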

Spark: Split CSV with newlines in octet-stream field

I am using Scala to parse CSV files. Some of these files have fields containing non-textual data like images or octet-streams. I would like to use Apache Spark's textFile() method to split the CSV into rows, and
split(",[ ]*(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
to split each row into fields. Unfortunately this does not work for files with those binary fields. There are two problems: 1) the octet-streams can contain newlines, which make textFile() split rows that should be one, and 2) the octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
The files are usually big, from a couple of MB up to a few hundred MB. I have to take the CSVs as they are, although I could preprocess them.
All I want to achieve is a working split function so I can ignore the field with the octet-stream. Nevertheless, a great bonus would be to extract the textual information from the octet-stream.
So how would I go forward to solve my problems?
Edit: A typical record obtained with cat, the newlines are from the file, not for cosmetic purposes (shortened):
7,url,user,02/24/2015 02:29:00 AM,03/22/2015 03:12:36 PM,octet-stream,27156,"MSCF^#^#^#^#�,^#^#^#^#^#^#D^#^#^#^#^#^#^#^C^A^A^#^C^#^D^#^#^#^#^#^T^#^#^#^#^#^P^#�,^#^#^X=^#^#^#^#^#^#^#^#^#^#�^#^#^#^E^#^A^#��^A^#^#^#^#^#^#^#WF6�!^#Info.txt^#=^B^#^#��^A^#^#^#WF7�^#^#List.xml^#^�^#^#��^A^#^#^#WF:�^#^#Filename.txt^#��>��
^#�CK�]�r��^Q��T�^O�^#�-�j�]��FI�Ky��Ei�Je^K""!�^Qx #�*^U^?�^_�;��ħ�^LI^#$(�^Q���b��\N����t�����+������ȷgvM�^L̽�LǴL�^L��^ER��w^Ui^M��^X�Kޓ�^QJȧ��^N~��&�x�bB��D]1�^B|^G���g^SyG�����:����^_P�^T�^_�����U�|B�gH=��%Z^NY���,^U�^VI{��^S�^U�!�^Lpw�T���+�a�z�l������b����w^K��or��pH� ��ܞ�l��z�^\i=�z�:^C�^S!_ESCW��ESC""��g^NY2��s�� u���X^?�^R^R+��b^]^Ro�r���^AR�h�^D��^X^M�^]ޫ���ܰ�^]���0^?��^]�92^GhCx�DN^?
mY<{��L^Zk�^\���M�^V^HE���-Ե�$f�f����^D�e�^R:�u����� ^E^A�Ȑ�^B�^E�sZ���Yo��8Eސ�}��&JY���^A9^P������^P����~Jʭy��`�^9«�""�U� �:�}3���6�Hߧ�v���A7^Xi^L^]�sA�^Q�7�5d�^Xo˛�tY
Bp��4�Y���7DkV_���\^_q~�w�|�a�s̆���#�g�ӳu�^�!W}�n��Rgż_2�]�p�2}��b�G9�M^Q
�����:�X����bR[ԳZV!^G����^U�tq�&�Y6b��GR���s#mn6Z=^ZH^]�b��R^G�C�0R��{r1��4�#�
=r/X2�^O�����r^M�Rȕ�goG^X-����}���P+˥Qf�#��^C�Բ�z1�I�j����6�^Np���ܯ^P�[�^Tzԏ���^F2�e��\�E�߻6c�%���$�:E�*�*©t�y�J�,�S�2U�S�^X}ME�]��]�i��G�su�""��!�-��!r'ܷe_et Y^K^?0���l^A��^^�m�1/q����|�_r�5$�%�([x��W^E�G^^y���#����Z2^?ڠ�^_��^AҶ�OO��^]�vq%:j�^?�jX��\�]����^S�^^n�^C��>.^CY^O-� �_�\K����:p�<7Sֺnj���-Yk�r���^Q^M�n�J^B��^Z0^?�(^C��^W³!�g�Z�~R�A^M�^O^^�%;��Ԗ�p^S�w���*m^S���jڒ|�����<�^S�;Z^^Fc�1���^O�G_o����8��CS���w��^?��n�2~��m���G;��rx4�(�]�'��^E���eƧ�x��.�w�9WO�^^�י3��0,�y��H�Y�.H�x�""'���h}灢^T�Gm;^XE�̼�J��c�^^񾠫;�^A�qZ1ׁBZ^Q�^A^FB�^QbQ�_�3|ƺ�EvZ���^S�w���^P���9^MT��ǩY[+�+�9�Ԩ�^O�^Q���Fy(+�9p�^^Mj�2��Y^?��ڞ��^Ķb�^Z�ψMр}�ڣ�^^S�^?��^U�^Wڻ����z�^#��uk��k^^�>^O�^W�ݤO�h�^G�����Kˇ�.�R|�)-��e^G�^]�/J����U�ϴ�a���i5HO�^L�ESCg�R'���.����d���+~�}��ڝ^Y5]l�3jg54M�������2t�5^Y}�q)��^O;�X\�q^Ox~Vۗ�t�^\f� >k;^G�K5��,��X�t/�ǧ^G""5��4^MiΟ�n��^B^]�|�����V��ߌ֗Q~�H���8��t��5��ܗ�
�Z�^c�6N�ESCG����^_��>��t^L^R�^:�x���^]v�{^#+KM��qԎ�.^S�%&��=^W-�=�^S�����^CI���&^]_�s�˞�y�z�Jc^W�kڠ�^\��^]j�����^O��;�oY^^�^V59;�c��^B��T�nb����^C��^N��s�x�<{�9-�F�T�^N�5�^Se-���^T�Y[���`^ZsL��v�բ<C�+�~�^ۚ��""�Yκ2^_�^VxT�>��/ݳ^U�m�^#���3^Ge�n^Vc�V�^#�NVn�,�q��^^^]gy�R�S��Ȃ$���>A�d����xg�^GB3�M�J�^QJ^]�^\�{.�D��碎�^W�8a����qޠl?,'^R�^X�Cgy�P[����mڞ��H�Z�s�SD&蠤�s�E��nu�O#O<��3wj`C-%w�W�J�^WP^T�^]r^NT�TC�Lq�Z�f�!�;�l�Y��Gb��>�ud�hx�Ԭ^N)9�^N!k�҉s�35v������.�""^]��~4������۴�Z^]u�^Ti^^�i:�)K��P᳕!�#�^?�>��EE^VE-u�^SgV^L��<��^D�O<�+�J.�c�Z#>�.l����^S�
ESC��(��E�j�π쬖���2{^U&b\��P^S�`^O^XdL�^ 6bu��FD��^#^#^#^#","field_x, data",field_y,field_z
Expected output would be an array:
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","field_x, data","field_y","field_z")
Or, but this is probably another question, an array like this (as if running strings on the octet-stream field):
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","Info.txt List.xml Filename.txt","field_x, data","field_y","field_z")
Edit 2: Every file that has a binary field also contains a length field for it. So instead of splitting directly I can walk left to right through a record and extract the fields. This is certainly a great improvement over my current situation, but problem 1) still persists. How can I split those files reliably?
I took a closer look at the files and a header looks like this:
RecordId, Field_A, Content_Type, Content_Length, Content, Field_B
(Where Content_Type can be "octet-stream", Content_Length is the number of bytes in the Content field, and Content is obviously the data.) And luckily for me, the value of Field_B is predictable; let's assume for a certain file it's always "Hello World".
So instead of using Spark's default behaviour of splitting on newlines, how can I make Spark split only on newlines that follow "Hello World"? (I also edited the question title, since the focus of the question changed.)
As answered in Spark: Reading files using different delimiter than new line, I used textinputformat.record.delimiter to split on "Hello World\n", because I am a bit lucky that the last column always contains the same value. After that I simply walk left to right through the record, and when I reach the length field I skip the next n bytes. Everything works now. Thanks for pointing me in the right direction.
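For reference, a minimal sketch of that setup (assuming sc is the SparkContext and the path is a placeholder; textinputformat.record.delimiter is a Hadoop property, so the file is read through the Hadoop input format rather than textFile()):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
// Records now end with the predictable last column instead of a bare newline,
// so newlines inside the octet-stream field no longer split a row.
conf.set("textinputformat.record.delimiter", "Hello World\n")

val records = sc
  .newAPIHadoopFile("path/to/data.csv", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString } // copy out of Hadoop's reused Text object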
There are two problems: 1) the octet-streams can contain newlines, which make textFile() split rows that should be one, and 2) the octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
Well, actually that CSV file is properly escaped:
the multiline field is enclosed in double quotes: "MSCF^# .. ^#^#" (which also handles possible separators inside the field)
double quotes inside the field are escaped with another double quote, as they should be: Je^K""!
Of course a simple split will not work in this case (and should never be used on CSV data), but any CSV reader able to handle multiline fields should parse that data correctly.
Also keep in mind that the double quotes inside the octet-stream have to be unescaped, or that data won't be valid (another reason not to use split, but a CSV reader that handles this).
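If Spark itself is an option for the parsing, the DataFrame CSV reader (Spark 2.2 and later, via a SparkSession) handles exactly this case; a small sketch with a placeholder path:

// Quoted fields may span lines, and "" inside a quoted field is an escaped quote.
val df = spark.read
  .option("multiLine", "true") // don't split records on newlines inside quotes
  .option("escape", "\"")      // treat the doubled "" as one literal quote
  .csv("path/to/data.csv")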

format a word document

I have written the index of my report in a Word document, but the page numbers are not properly formatted. I even tried to use a table for it, but it is still not working.
TABLE OF CONTENTS
Chapter: 1 Introduction…………………………………………………………….…....……..1
1.1 Project Summary……………………………………………………….......………..2
1.2 Objective……………………………………………………….……….…….….........2
1.3 Scope…………………………………………………….…………………...........…...2
1.4 Technology and literature……………………………….……………………..2
That is how my index looks. In the Word document the page numbers are not aligned in a line. Kindly help me.
Try using the Tab key instead of the space bar.
You can add dots (".") as well, and the numbers will end up in the right places.
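Concretely, in Word: select the contents lines, open the Tabs dialog (Home > Paragraph dialog launcher > Tabs...), add a right-aligned tab stop near the right margin, pick the dotted leader style, and then press Tab once between each heading and its page number. The dots and the alignment are then generated by the tab stop, so every page number lands in the same column.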