Remove an unknown part of a line using sed

Remove an unknown part of a line using sed - sed

I want to go from:
[important text, numbers/letters only] [unknown text]/Windows/[text that I want to keep, will have slashes]
to
[important text, numbers/letters only] Windows/[text that I want to keep, will have slashes]
The unknown text could be a range of possibilities but it will not contain Windows
So basically I want to cut off the text before Windows that comes after [important text, numbers/letters only]
The space after important text will always be the first space in the line
I also want it to be able to work if the unknown text and the slash after it are not present

You didn't give any live examples so I'm assumming something like this is a valid example
abc123 unknown text which might have spaces/Windows/There might be /es here
My main assumption is that the first section doesn't have spaces. Otherwise You don't really have a way to deliniate between the important and the unknown text.
You can achieve what you are after with
sed -e 's/^\([^ ]\) [^/]*\/Windows/\1 Windows/'
The first set of parantheses captures everything from the start of the line that doesn't have a space in it. We then throw away everything up until '/Windows' and keep everything after that.

Related

Crystal Report is automatically (randomly) adding white space after a sequence of text X's before a number in a field. Want to prevent the whitespace

I have a field that I'm sure does not have white space in it at a certain location, but for some reason Crystal Reports is automatically adding space to the field but only after X's in certain scenarios. This is a text field which I'm sure is pulling down from a text field in the db as well.
This is hard to explain without a pictured example, so here's what I mean:
Random Spaces
As you can see, there is whitespace after the X's. it should not be there, as this is not what is coming from the db. And strangely, when copy/pasted from the report, there is no white space either! Here it is when copy/pasted straight from the report:
"Test 15dig w spaces XXXXXXXXXXX2345"
Why is this occurring, and how can it be corrected? Currently there is no real formula for the field, it is just taking whatever it coming from the db straight into that field. The whitespace is being added automatically somehow, and I'm not sure at all why.
Here is what I've tried: Have tried calling ToText on the field (even though it is already a text field). Have also tried formatting the field in various different ways. Tried asking on SAP forum but no help as of yet.

Copy & Paste the text "Test 15dig w spaces XXXXXXXXXXX2345" into a text object in Crystal. Apply the same font and formatting (use the brush toolbar button to clone the formatting).
If you don't see the same strange space, the problem is due to non-printable characters in the database field. In that case, you can strip away such characters using a formula and the Replace() function.

Why do I get extra characters (arrow symbol) while fetching text through Tesseract?

Whenever I fetch text in any language, the output has this extra character (arrow symbol), which is not there in the image. I'd like to understand, why it is present, and how to avoid these extra characters in the output.

That's most likely the implicit page separator \f, which Notepad shows as that arrow. For some details on that topic, see: What page separators are used in txt output by Tesseract 4.0.0?
You can try to add -c page_separator="" to your config. You shouldn't see that symbol in your output then. Please notice, page breaks are entirely disabled then also.

Last string appears at beginning of line in formatted output

Has anyone any idea why the following would format itself in a weird way? In several years I've had no problem with creating simple text output but this problem has me baffled.
I'm using the line
print "$BC,$Ttl,$FN,$SN,$Finalage,$OurLoc,$OurDT,$FinalPC\n";
Every value is a simple text string on which I've run "chomp" to remove return characters.
I would expect the output to look like the following:
*DD10099999,,Information Services,Guest Ticket 2,41,C G,03/11/2020,NE8 9BB*
$BC is the first item and $FinalPC is the postcode at the end.
Instead I get:
*,NE8 9BB99, ,Information Services,Guest Ticket 2,41,C G,03/11/2020*
The final item has somehow moved to the beginning of the line and overwritten the first item. This is happening consistently on every line of my screen and text file output and I'm completely stumped as to why. The data is read from a text file and compared with database output which is also simple text. There are no occurrences of \b anywhere in my code. Why would a backspace character get into it?

The string in $OurDT ends with a carriage return, which causes your terminal to home the cursor. Presumably, the value of $OurDT came from a Windows file read on a unixy machine.
One option is to fix the file (e.g. by using the dos2unix utility).
Another is to accept both CRLF and LF as line endings (e.g. by using s/\s+\z// instead of chomp).

Why is this LSEP symbol showing up on Chrome and not Firefox or Edge?

So this web page is rendering with these symbols and they are found throughout this website/application but on no other sites. Can anyone tell me
What this symbol is?
Why it is showing up only in one browser?

That character is U+2028 Line Separator, which is a kind of newline character. Think of it as the Unicode equivalent of HTML’s .
As to why it shows up here: my guess would be that an internal database uses LSEP to not conflict with literal newlines or HTML tags (which might break the database or cause security errors), and either:
The server-side scripts that convert the database to HTML neglected to replace LSEP with 
Chrome just breaks standards by displaying LSEP as a printing (visible) character, or
You have a font installed that displays LSEP as a printing character that only Chrome detects. To figure out which font it is, right click on the offending text and click “Inspect”, then switch to the “Computed” tab on the right-hand panel. At the very bottom you should see a section labeled “Rendered Fonts” which will help you locate the offending font.
More information on the line separator, excerpted from the Unicode standard, Chapter 5.8, Newline Guidelines (on p. 12 of this PDF):
Line Separator and Paragraph Separator
A paragraph separator—independent of how it is encoded—is used to indicate a
separation between paragraphs. A line separator indicates where a line break
alone should occur, typically within a paragraph. For example:
This is a paragraph with a line separator at this point,
causing the word “causing” to appear on a different line, but not causing
the typical paragraph indentation, sentence breaking, line spacing, or
change in flush (right, center, or left paragraphs).
For comparison, line separators basically correspond to HTML , and
paragraph separators to older usage of HTML (modern HTML delimits
paragraphs by enclosing them in ...). In word processors, paragraph
separators are usually entered using a keyboard RETURN or ENTER; line
separators are usually entered using a modified RETURN or ENTER, such as
SHIFT-ENTER.
A record separator is used to separate records. For example, when exchanging
tabular data, a common format is to tab-separate the cells and to use a CRLF
at the end of a line of cells. This function is not precisely the same as line
separation, but the same characters are often used.
Traditionally, NLF started out as a line separator (and sometimes record
separator). It is still used as a line separator in simple text editors such as
program editors. As platforms and programs started to handle word processing
with automatic line-wrap, these characters were reinterpreted to stand for
paragraph separators. For example, even such simple programs as the Windows
Notepad program and the Mac SimpleText program interpret their platform’s NLF
as a paragraph separator, not a line separator. Once NLF was reinterpreted to
stand for a paragraph separator, in some cases another control character was
pressed into service as a line separator. For example, vertical tabulation VT
is used in Microsoft Word. However, the choice of character for line separator
is even less standardized than the choice of character for NLF. Many Internet
protocols and a lot of existing text treat NLF as a line separator, so an
implementer cannot simply treat NLF as a paragraph separator in all
circumstances.
Further reading:
Unicode Technical Report #13: Newline Guidelines
General Punctuation (U+2000–U+206F) chart PDF
SE: Why are there so many spaces and line breaks in Unicode?
SO: What is unicode character 2028 (LS / Line Separator) used for?
U+2028 on codepoints.net A misprint here says that U+2028 was added in v. 1.1 of the Unicode standard, which is false — it was added in 1.0

I found that in WordPress the easiest way to remove "L SEP" and "P SEP" characters is to execute this two SQL queries:
UPDATE wp_posts SET post_content = REPLACE(post_content, UNHEX('e280a9'), '')
UPDATE wp_posts SET post_content = REPLACE(post_content, UNHEX('e280a8'), '')
The javascript way (mentioned in some of the answers) can break some things (in my case some modal windows stopped working).

You can use this tool...
http://www.nousphere.net/cleanspecial.php
...to remove all the special characters that Chrome displays.
Steps:
Paste your HTML and Clean using HTML option.
You can manually delete the characters in the editor on this page and see the result.
Paste back your HTML in file and save :)

I recently ran into this issue, tried a number of fixes but ultimately I had to paste the text into VIM and there was an extra space I had to delete. I tried a number of HTML cleaners but none of them worked, VIM was the key!

9999years answers is great.
In case you use Symfony with Twig template I would recommend to check for an empty Twig block. In my case it was an empty Twig block with an invisible char inside.
The LSEP char was only displayed on certain device / browser.
On the other I had a blank space above the header and I could not see any invisible char.
I had to inspect the GET request to see that the value 1f18 was before the open html tag.
Once I removed an empty Twig block it was gone.
hope this can help someone one day ...

My problem was similar, it was "PSEP" or "P SEP". Similar issue, an invisible character in my file.
I replaced \x{2029} with a normal space. Fixed. This problem only appeared on Windows Chrome. Not on my Mac.

I agree with #Kapil Bathija - Basically you can copy & paste your HTML code into http://www.nousphere.net/cleanspecial.php and convert it.
Then it will convert the special characters for you - Just remove the spaces in between the words and you will realize you have to press backspace 2x meaning there is an invalid character that can't be translated.
I had the same issue and it worked just fine afterwards.

You can also copy the text, paste it into a HTML editor such as Coda, remove the linebreak, copy it and paste it back into your site.
Video here: https://www.loom.com/share/501498afa7594d95a18382f1188f33ce

Looks like my client pasted HTML into Wordpress after initially creating it with MS-Word. Even deleting the and visible spaces did not fix the issue. The extended characters became visible in vi/vim.
If you don't have vi/vim available, try highlighting from 2 chars before the LSEP to 2 chars after the LSEP; delete that chunk, and re-type the correct characters.

Odd characters in Sublime Text 2: `SOH` and `ACK`

In converting old notes from org syntax to mmd, I used the Clean Text app to remove extra line breaks and non-unicode characters, and to convert to safe line-endings. When pasting the text back into Sublime Text 2, I noticed several odd characters. I don't really care too much about why they're there, I'd just like to know what the characters are, and if they're searchable using a regex?

They are control characters, they don't have a printable representation. I don't know how they ended up in your file.
In a regex, you can search for SOH with \u0001 and ACK with \u0006

I said I didn't care about the reason at the time. I did eventually find out what the problem was the next time this happened. Turned out it had to do with running a LaTeX auto-build tool in the background. It was already being compiled in one shell and hung due to an error. The next time I tried to compile, these control characters started showing up in my editor.
I no longer make assumptions about the number of processes I have running. ps aux | grep <whatever> is your friend.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse