Pulling out the "cell" values using SED from an RTF file

How can I extract the "cell" information from an RTF file using sed (in a bash shell), i.e. all the character strings between any pair of { }, of which there can be several on one line of RTF? I want to strip out all the RTF code and keep just the table values.

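This might work for you (GNU sed):
sed '/{/!d;s/[^{]*{\([^}]*\)}/\1\n/;P;D' file
It deletes any line that does not contain an opening brace. Otherwise it removes everything up to and including the first opening brace, keeps the string up to (but not including) the next closing brace, and puts a newline after it. P prints that string on a separate line, and D deletes it and restarts the cycle on the remainder of the line, so every { } pair on a line is extracted in turn.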
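
To see the command at work, here is a minimal sketch that wraps it in a shell function and runs it on a few made-up lines. The function name extract_cells and the sample input are assumptions for illustration (the input is not real RTF), and GNU sed is assumed, as in the answer.

```shell
# Hypothetical helper wrapping the sed command from the answer (GNU sed assumed).
extract_cells() {
  sed '/{/!d;s/[^{]*{\([^}]*\)}/\1\n/;P;D' "$@"
}

# Made-up sample lines; the '%s\n' format keeps the backslashes literal.
printf '%s\n' \
  '\trowd{Name}\cell{Age}\cell' \
  'no braces on this line' \
  '\row{Alice}{42}' | extract_cells
# prints Name, Age, Alice and 42, one per line
```

Because the pattern stops at the first closing brace, nested groups such as {a{b}c} are not extracted cleanly; RTF with nested groups would probably need more than this one-liner.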

Related

Prevent newline in (.md) files

How do I prevent newlines in the readme.md files (GitHub)?
We can always write the whole thing on one line to prevent it. But is there a dedicated tag or option to prevent this, especially for elements that create newlines (headings), something like span in HTML?

Doesn't a space followed by a backslash do the concatenation you want? It does for me. That way I can break a paragraph into one sentence per line.

How to replace the same character in multiple text files?

I have over 100 text files, all of which are too large to open in a normal text editor (e.g. Notepad, Notepad++), so I cannot use those.
All the text files have the same format:
abc0001:00000009a
abc0054:000000809a
abc00888:054450000009a
and so on..
I was wondering how to replace the ":" in each of those text files with "\n" (the regex for a new line).
The result would then be:
abc0001
00000009a
abc0054
000000809a
abc00888
054450000009a
How would I do this for all 100 text files without doing it manually for each one (if there is any way)?
Any help is appreciated.

You can use sed. The question linked below does something similar to what you want. It concerns Unix, but many Unix utilities (including sed) have been ported to MS Windows: http://gnuwin32.sourceforge.net/packages/sed.htm
UNIX: Replace Newline w/ Colon, Preserving Newline Before EOF
Something like the following (where you provide your text file as input, and the output becomes your new text file):
sed 's/:/\n/g'
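
To apply the same substitution to many files at once, one option is GNU sed's -i (in-place) flag with a glob. The sketch below is an assumption-laden illustration: the function name split_on_colon, the scratch directory, and the file names are hypothetical, and -i plus \n in the replacement both rely on GNU sed.

```shell
# Hypothetical helper: replace every ":" with a newline, in place,
# in each file passed as an argument (GNU sed assumed for -i and \n).
split_on_colon() {
  sed -i 's/:/\n/g' "$@"
}

# Demo on two small made-up files in a scratch directory.
demo_dir=$(mktemp -d)
printf 'abc0001:00000009a\nabc0054:000000809a\n' > "$demo_dir/one.txt"
printf 'abc00888:054450000009a\n' > "$demo_dir/two.txt"
split_on_colon "$demo_dir"/*.txt
cat "$demo_dir/one.txt" "$demo_dir/two.txt"
```

Since -i overwrites the originals, it may be worth trying this on copies first, or using -i.bak to keep a backup of each file.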

What is this field separator (^M)?

I ran into this while trying to parse a text file with Perl. The original file looks like this in vim:
When I tried to print the 2nd column (87 here), somehow, the ^M showed up in vim:
I'm just curious what this "^M" is? Does anyone know? Thanks!

^M is ASCII character 13, known as a carriage return. MS-DOS uses a carriage return followed by a line feed (ASCII 10) to mark the end of a line. Unix systems use a line feed only. Usually you will "see" a carriage return when an editor thinks your file uses Unix-style line endings but the file actually has MS-DOS-style line endings.
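
If the stray carriage returns get in the way of parsing, they can be stripped before the file is processed. This is only a sketch: the function name strip_cr and the sample line are made up, and \r in the sed pattern is a GNU sed extension.

```shell
# Hypothetical helper: remove a trailing carriage return from each line.
# \r in the pattern is a GNU sed extension; tr -d '\r' is a portable
# alternative that removes every carriage return, not only trailing ones.
strip_cr() {
  sed 's/\r$//' "$@"
}

# Made-up two-column line with a DOS line ending; od -c shows the raw bytes.
printf '12 87\r\n' | strip_cr | od -c
```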

What code format shows proper line breaks?

I am exporting some Access tables to txt files and there are a lot of problems with the txt file. One of those problems is that line breaks are not visible in the txt file itself. If I copy a line with a line break from Notepad into Notepad++, it breaks into 2 lines.
So I believe this may be a code format problem, but I can't find the proper one to resolve it. I'm currently exporting to the default Western European, but should I export to UTF, Unicode, ASCII or something else?

When exporting from MS Access (or VB/VBA in general), make sure you're using the vbCrLf constant (Carriage Return plus Line Feed) for line breaks. That constant corresponds to the hex values 0D 0A.
In Windows, the convention is to use these 2 characters together as a line break, while on many other platforms, such as Unix/Linux/macOS, typically just 0A is used.
That brings up an issue: Notepad, the standard Windows text file viewer, cannot deal with 0A alone (at least in older Windows versions) and does not treat such characters as line breaks. More advanced editors, such as Notepad++ or UltraEdit, display such files correctly, though.
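
For a file that already exists with bare 0A line endings, the endings can also be converted after the fact. The following is a rough sketch, assuming GNU sed is available (for example via a Windows port); the function name to_crlf and the sample text are hypothetical.

```shell
# Hypothetical helper: make every line end in CR+LF (0D 0A).
# Existing trailing carriage returns are collapsed, so lines that
# already end in CR+LF are left as they are (GNU sed assumed for \r).
to_crlf() {
  sed 's/\r*$/\r/' "$@"
}

# Made-up input mixing both conventions; od -c shows the raw bytes.
printf 'first\nsecond\r\n' | to_crlf | od -c
```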

The CSV export function in Microsoft Office applications (Excel, Access) terminates each data row with CR+LF and writes just LF for a line break within a data value (multi-line string). (I think just CR was written into the CSV file for a line break in versions of Office before Office 2007.)
Most text editors detect those LF without CR (or CR without LF) and convert them to CR+LF when loading the CSV file. Viewed in a text editor, the CSV file then appears to contain wrong lines, because the number of data values is not correct on data rows whose values contain a line break.
However, newline characters within a double-quoted value in a CSV file are correct according to the CSV specification as described in the Wikipedia article on comma-separated values.
But most applications that support importing from CSV files do not support newline characters within a double-quoted value, so some data values are imported incorrectly. Regular expression replaces also can't be done reliably on a CSV file with newline characters within a data value, because the number of separator characters is not the same on all lines.
UltraEdit has a special configuration setting for editing such CSV files with only LF (or CR) for a line break within a data value. At Advanced - Configuration - File Handling - DOS/Unix/Mac Handling, select either the option Never prompt to convert files to DOS format, or the option Prompt to convert if file is not DOS format (and click the No button when the prompt is displayed), and additionally enable Only recognize DOS terminated lines (CR/LF) as new lines for editing.
With those settings, UltraEdit loads a CSV file that uses CR+LF to end data rows and only LF (or CR) for line breaks within data values so that the number of lines equals the number of data rows. The line feeds without carriage return (or carriage returns without line feed) are displayed within the lines as a small rectangle, because no font defines a glyph for carriage return or line feed; they are whitespace characters with no width. A Perl regular expression find searching for \r(?!\n)|(?<!\r)\n can then be used to find those line breaks within data values and replace them with something else, such as a space character, or remove them.
Which character encoding (ASCII, ANSI, Unicode (UTF-16), UTF-8) to use on export depends on which characters can occur in string values. A Unicode encoding is necessary if string values can contain characters not included in the local code page.

How to handle new line characters when using COPY in POSTGRESQL

I have text that has the following form in my csv:
'0001'|'text1'|'\ntext2'|'text3'\n
However, when I try to import the data into my postgres instance, the import keeps breaking because it treats the first newline character as the start of a new row. Is there an easy way to tell postgres to import the newline character into the field?

If you set the delimiter explicitly, special characters are no longer interpreted and are taken literally instead. The same goes for the quote character: the parser needs to know how to recognize strings so that it does not interpret \n as a newline.
Here's the documentation:
Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column delimiters.
In particular, the following characters must be preceded by a
backslash if they appear as part of a column value: backslash itself,
newline, carriage return, and the current delimiter character.
So, you might have
COPY data FROM STDIN WITH CSV HEADER DELIMITER E'|' QUOTE E'\'';
