I have a CSV file source that has a field containing "CR" and "LF" type escape characters.
I'm trying to use a Data Flow and Derived Column to remove the unwanted characters, but it's not exactly working.
Here is my input with the unwanted escape chars
I'm using this expression against the synopsis column:
regexReplace(regexReplace(synopsis, `[\n]`, ''),`[\r]`, '')
As suggested in this similar post - How to replace CR and LF in Data Factory expressions
However, I'm still getting some LF characters, and also a lot of extra commas.
Here is my output still with LF and extra Commas
Original in text format:
"TTL-100912","False",,"Bad Guys, The","GEN-ANI",,"Nobody has ever failed so hard at trying to be good as The Bad Guys.
In the new action comedy from DreamWorks Animation, based on the New York Times best-selling book series, a crackerjack criminal crew of animal outlaws are about to attempt their most challenging con yet--becoming model citizens.
Never have there been five friends as infamous as The Bad Guys--dashing pickpocket Mr. Wolf (Academy Award® winner Sam Rockwell, Three Billboards Outside Ebbing, Missouri), seen-it-all safecracker Mr. Snake (Marc Maron, GLOW), chill master-of-disguise Mr. Shark (Craig Robinson, Hot Tub Time Machine franchise), short-fused "muscle" Mr. Piranha (Anthony Ramos, In the Heights) and sharp-tongued expert hacker Ms. Tarantula (Awkwafina, Crazy Rich Asians), aka "Webs." The film co-stars Zazie Beetz (Joker), Lilly Singh (Bad Moms) and Emmy winner Alex Borstein (The Marvelous Mrs. Maisel).
Based on the blockbuster Scholastic book series by Aaron Blabey, THE BAD GUYS is directed by Pierre Perifel (animator, the Kung Fu Panda films), making his feature-directing debut. The film is produced by Damon Ross (development executive Trolls, The Boss Baby, co-producer Nacho Libre) and Rebecca Huntley (associate producer, The Boss Baby). The executive producers are Aaron Blabey, Etan Cohen and Patrick Hughes. ",,"BAD GUYS, THE","Bad Guys, The",,,,"2021-10-12 15:39:24",,
The way the source file was being produced was my complicating factor.
I changed the Row Delimiter in the production of my source csv to "¬" from "," and then was able to use this expression in a data flow to clean the contents of the Synopsis field:
regexReplace(synopsis, ',|[\n]|[\r]', ' ')
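For anyone preprocessing the file outside Data Factory, here is a minimal Python sketch of the same cleanup. It assumes the file is standard double-quoted CSV, so the csv module keeps a multi-line quoted value together as one field. The function name clean_rows and the sample data are hypothetical, not taken from the original pipeline.

```python
import csv
import io
import re

def clean_rows(text):
    """Parse quoted CSV text and flatten CR/LF inside every field."""
    reader = csv.reader(io.StringIO(text, newline=""))
    for row in reader:
        # Collapse each run of CR/LF characters into a single space.
        yield [re.sub(r"[\r\n]+", " ", field) for field in row]

# Hypothetical two-column sample with a line break inside a quoted field.
sample = '"TTL-1","first line\r\nsecond line"\r\n"TTL-2","plain"\r\n'
rows = list(clean_rows(sample))
print(rows)
```

Because the parser resolves the quoting first, commas inside the synopsis stay part of the field and do not need to be replaced.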
Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me so I'm really hoping for help from this community.
What I've tried (piecemeal, but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be to split the input on [\s\W]+ (e.g. with Python's re.split; if you really need strings with more than one word, you can check the length of the result), keep each resulting word whose first character is uppercase (e.g. with Python's str.isupper), and finally filter against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" into a regex is kind of impossible, and using "isupper" also works with non-Latin letters (e.g. when your strings are Unicode, [A-Z] won't be sufficient to detect uppercase). Filtering with a dictionary is a much more straightforward approach and far easier to maintain (I would recommend using a set or another data type suited for fast lookups).
Maybe if you can define your use case more clearly we can work out a pure regex solution...
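A rough sketch of the split-and-filter approach described above, in Python. The allow-list PROPER_NOUNS and the function name are hypothetical, and removing the {curly} parameters before splitting is my own assumption about how to "disregard" them.

```python
import re

# Hypothetical allow-list of proper nouns that should never trigger a flag.
PROPER_NOUNS = {"TikTok"}

def needs_title_case_flag(text, proper_nouns=PROPER_NOUNS):
    """Return True if any word after the first one is capitalized."""
    # Drop {placeholders} so their contents are never inspected.
    stripped = re.sub(r"\{[^{}]*\}", " ", text)
    # Split on runs of whitespace / non-word characters and drop empty pieces.
    words = [w for w in re.split(r"[\s\W]+", stripped) if w]
    if len(words) < 2:
        return False  # single-word strings are out of scope
    return any(w[0].isupper() and w not in proper_nouns for w in words[1:])

print(needs_title_case_flag(
    "Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street."))
print(needs_title_case_flag("Don't post that video on TikTok."))
```

Note that this treats the whole string as one sentence, so the first word of a second sentence (or the pronoun "I") would be flagged unless it is added to the allow-list.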
I've run into an interesting problem I'm hoping someone can shed some light on.
I'm trying to pull a unique list of names from an MS SQL Database - but the company has been sloppy with their names. They were tacking on a code to the end of last name for some users. I need to remove that code.
Example:
firstname lastname
John Doe
Mary Smith AST
Mike Jackson AST
Brian Astor
Jackie Masterson
In the example, "AST" is the code they tack on. It's not tacked on to all last names either. I need to get an output of just the last names without the code.
I would have expected this is a simple use of REPLACE. I tried:
select REPLACE(lastname, ' AST', '') from table
Note the leading space in the quotes for the search phrase... this does work to remove the "AST" appended to the last names.
However - my problem is that it will also remove anywhere AST appears at the BEGINNING of the field. So Brian Astor comes out as "Brian or" since the field started with AST. However... it correctly does not remove ast from the middle, so Jackie Masterson is fine.
Any ideas why it is ignoring the leading space in my search phrase for the beginning of the field? I've tried ltrim to eliminate the possibility the field has leading spaces.
Thanks!
Replace with an empty string will eliminate the searched string anywhere in your source string. So the behaviour is as expected.
If you only need to replace ' ast' at the end of your searched string, try something like this:
select replace(replace(lastname + '$$$', ' AST$$$', ''), '$$$', '') from table
Of course you need to be sure that the $$$ appended don't appear by chance in your source string (lastname). Which I guess is not that likely.
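If the cleanup can happen outside SQL, the same "only at the end" idea can be sketched in Python with an end-of-string anchor instead of a sentinel. This is an illustration under that assumption, not part of the original answer; the function name strip_trailing_code is hypothetical.

```python
import re

def strip_trailing_code(lastname, code="AST"):
    """Remove ' <code>' only when it is the last thing in the string."""
    # re.escape guards against codes that contain regex metacharacters;
    # \s*$ also tolerates trailing padding after the code.
    pattern = r"\s+" + re.escape(code) + r"\s*$"
    return re.sub(pattern, "", lastname)

for name in ["Doe", "Smith AST", "Jackson AST", "Astor", "Masterson"]:
    print(repr(strip_trailing_code(name)))
```

Because the pattern requires whitespace before the code and the end of the string after it, "Astor" and "Masterson" are left alone.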
Here is an example on my github profile - https://github.com/jack17529
I want to change this -
Silver Bullet in Issue KILLING.____
Master Mind to create Issues.______
My strongest language is Python not English.
I want to have newline instead of blanks.
like this -
Silver Bullet in Issue KILLING.
Master Mind to create Issues.
My strongest language is Python not English.
I have checked Bitbucket Bio is nowhere related to Github Bio.
Maybe they don't allow us to do it the normal way, but it is of course possible. We can exploit the automatic wrapping rule, which moves a word that is too long for the current line onto the next line. All we need to do is use other Unicode space characters instead of normal spaces within a line, and a normal space between lines, so that the wrapping rule works around the ban on newlines.
And if you want a free line, because of the character limitations, you can use the longer one;
" " instead of " " (Try selecting spaces between quotes with your mouse)
Also, this trick allows me to create unnecessary spaces on Stack Overflow too, as above in the quote box.
Here is the result: github.com/cosmicog:
I have tried other answers, html ways, but no, they handle html tricks of course.
Note: This causes a bad look in the list view and the profile overview tooltip:
Maybe that is why it is not allowed but I hope they will fix this in the future.
As GitHub support told me, there is no way; see here:
According to Github Support
I just did it by simply copying and pasting the character corresponding to this codepoint | unicode-table.com | as many times as needed in order to align the text the way I wanted.
This is the procedure I followed: at the end of each line I pressed Enter, then I filled the new line with 7 instances of the character mentioned above; then I pressed Enter again and started the new line with its text.
This question is a little stale, but I found it before I solved this myself, so I thought I'd drop my solution.
The bio doesn't appear to honor markdown, but neither does it accept HTML entities or elements. I worked around this with non-breaking characters to create long "words" similar to how you've used "_".
You can see in my bio that I needed a " " and a "‑" to format mine. The long word will pop to the next line. If you have a real short line, you can extend it with a lot of non-breaking spaces, but this probably isn't necessary. Since you cannot enter " " you need to use copy/paste or ALT codes (not looked up, but someone might add these for you). Those are the real characters above, so you can take them from this answer.
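To avoid pasting the special characters by hand, the bio text can be generated. The sketch below is only an illustration of the trick described above: it assumes the two characters are U+00A0 (no-break space) and U+2011 (non-breaking hyphen), and build_bio is a hypothetical helper. Whether the result wraps where you want still depends on the width GitHub gives the bio.

```python
NBSP = "\u00a0"       # NO-BREAK SPACE
NB_HYPHEN = "\u2011"  # NON-BREAKING HYPHEN

def build_bio(lines):
    """Turn each line into one unbreakable 'word', joined by ordinary spaces."""
    glued = [line.replace(" ", NBSP).replace("-", NB_HYPHEN) for line in lines]
    # The ordinary space between lines is the only place a wrap can occur.
    return " ".join(glued)

bio = build_bio([
    "Silver Bullet in Issue KILLING.",
    "Master Mind to create Issues.",
])
print(bio)
```

Paste the printed string into the bio field; the only ordinary spaces left are the ones between the intended lines.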
Refer: How to create newline in Github Bio
Just use in HTML editor mode to new line is OK, This is my GitHub Bio
I have a large .csv file with 9 million rows. Some of the columns contain text with quotes or other special characters in them, and I would like to import this file into the database. For example, I would like to import this row:
ID BH Units Name Type_building Year_cons
1 4 900.00 schoolgebouw "De Bolster Schoolgebouw 2014-01-01
As you can see there is a double quote in the fourth column. None of the values in the .csv file are quoted, but sometimes a double quote or backslash '\' appears in the text. When I try to upload the data using:
\COPY <tablename> FROM <path to file> WITH CSV DELIMITER ';' NULL '\N';
It gives an error message: ERROR value too long for type character varying(25).
Apparently it sees the double quote as the start of a string and tries to combine everything after it in the .csv file (including the fifth and sixth column) into a single cell (so that cell will contain 'De Bolster Schoolgebouw 2014-01-01'), which doesn't fit because the 'Name' column allows max 25 characters.
I found a similar topic (Is it possible to turn off quote processing in the Postgres COPY command with CSV format?) in which this solution was presented:
\COPY <tablename> FROM <path to file> WITH CSV DELIMITER ';' QUOTE E'\b' NULL '\N';
I think what it does is sets the quote value (default is double quote) to something else, in this case a backspace, so it won't recognize a double quote as a quote anymore. However when I run this I get another error: INVALID input syntax for integer.
What has happened is that every value now is quoted, so ID with value '1' becomes value '"1"' and because ID is defined as an integer it won't accept quotes.
Do you have any idea how to import double quotes and other special characters from a .csv file into a postgres database?
Thanks in advance!!
Based on the error message, I doubt it has anything to do with double quoting or anything of the sort -- had it been so, it would have been a widely reported bug and fixed ages ago.
When it comes to Postgres, the error messages are almost always correct and helpful. As such, consider the very real possibility that there are more characters than meets the eye.
My own guess is that you've some trailing (or leading) spaces in there somewhere, and as such have pieces of data that look 24 characters long when viewed in a spreadsheet while being, in fact, longer.
If you don't, my second guess would be some kind of bizarro character sets conflicts or effects. Perhaps you've some double byte characters, or two single characters behaving as a single one due to a diacritic in there. These look fine in the viewer you're using for your data; but then when these get interpreted or viewed as utf8 they end up counting as two distinct characters. Unlikely imo, but possible (example).
Lastly, and per Frank's suggestion, try removing the length constraint. As things stand it only slows down inserts and prevents you from moving forward. Once you're done importing, you'll be able to find the offending rows using the likes of:
select name from table where length(name) > 25;
... and upon fixing them, you'll be able to re-add your constraint if it serves any purpose. (Hint: it doesn't, or at the very least shouldn't have. There's a real person out there whose name is: "Kim-Jong Sexy Glorious Beast Divine Dick Father Lovely Iron Man Even Unique Poh Un Winn Charlie Ghora Khaos Mehan Hansa Kimmy Humbero Uno Master Over Dance Shake Bouti Bepop Rocksteady Shredder Kung Ulf Road House Gilgamesh Flap Guy Theo Arse Hole Im Yoda Funky Boy Slam Duck Chuck Jorma Jukka Pekka Ryan Super Air Ooy Rusell Salvador Alfons Molgan Akta Papa Long Nameh Ek.")
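The same length check can also be run on the file before importing. The sketch below is a hedged illustration, not part of the answer: it assumes a semicolon-delimited file with a header row, disables quote handling so a stray double quote stays an ordinary character, and uses hypothetical sample rows. find_overlong is a made-up helper name.

```python
import csv
import io

def find_overlong(lines, column, limit, delimiter=";"):
    """Yield (line_number, value) for values in `column` longer than `limit`."""
    # QUOTE_NONE makes the parser treat double quotes as plain characters.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    idx = header.index(column)
    for row in reader:
        if len(row) > idx and len(row[idx]) > limit:
            yield reader.line_num, row[idx]

# Hypothetical sample: the first data row has a stray quote, the second is too long.
sample = io.StringIO(
    'ID;BH;Units;Name;Type_building;Year_cons\n'
    '1;4;900.00;schoolgebouw "De Bolster;Schoolgebouw;2014-01-01\n'
    '2;4;900.00;a name that is far too long for the column;Schoolgebouw;2014-01-01\n'
)
print(list(find_overlong(sample, "Name", 25)))
```

For a real file, pass an open file object (opened with newline="") instead of the StringIO sample.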
In my app before I send a string off I need to work out if the text entered in the textbox is a UK Postcode. I don't have the regex ability to work that out for myself and after searching around I can't seem to work it out! Just wondered if anyone has done a similar thing in the past?
Or if anyone can point me in the right direction I would be most appreciative!
Tom
Wikipedia has a good section about this. Basically the answer depends on what sort of pathological cases you want to handle. For example:
An alternative short regular expression from BS7666 Schema is:
[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}
The above expressions fail to exclude many non-existent area codes (such as A, AA, Z and ZY).
Basically, read that section of Wikipedia thoroughly and decide what you need.
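As a quick illustration, here is how the BS7666 expression quoted above could be applied in Python. Using fullmatch and normalising the input to upper case are my own assumptions about how you would want to call it; is_bs7666_postcode is a hypothetical helper name.

```python
import re

# The BS7666 pattern quoted above, unchanged.
BS7666_POSTCODE = re.compile(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}")

def is_bs7666_postcode(text):
    """True if the whole input matches the BS7666 pattern."""
    # Assumption: surrounding whitespace and letter case should not matter.
    return BS7666_POSTCODE.fullmatch(text.strip().upper()) is not None

for candidate in ["SE1 9QZ", "ec1a 1bb", "SE19QZ", "ZY1 1AA"]:
    print(candidate, is_bs7666_postcode(candidate))
```

The pattern requires the single space, so "SE19QZ" is rejected, while a made-up area such as "ZY1 1AA" is accepted, which is the limitation mentioned above.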
For postcodes written with a space (e.g. SE1 9QZ) I use the following (it's not failed me yet ;-) ):
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})
If the space may be missing (e.g. SE19QZ), then:
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$
You can match most post codes with this regex:
/[A-Z]{1,2}[0-9]{1,2}\s?[0-9]{1,2}[A-Z]{1,2}/i
Which means... A-Z one or two times ({1,2}) followed by 0-9 1 or two times, followed by a space \s optionally ? followed by 0-9 one or two times, followed by A-Z one or two times.
This will match some false positives, as I can make up post codes like ZZ00 00ZZ, but to accurately match all post codes, the only way is to buy post code data from the post office - which is quite expensive. You could also download free post code databases, but they do not have 100% coverage.
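A small Python sketch of the permissive pattern above, mainly to show what it does and does not accept. Requiring the whole input to match (fullmatch) is my own assumption, and looks_like_postcode is a hypothetical helper name.

```python
import re

# Same pattern as above; re.IGNORECASE plays the role of the /i flag.
LOOSE_POSTCODE = re.compile(
    r"[A-Z]{1,2}[0-9]{1,2}\s?[0-9]{1,2}[A-Z]{1,2}", re.IGNORECASE)

def looks_like_postcode(text):
    """True if the whole (trimmed) input fits the loose pattern."""
    return LOOSE_POSTCODE.fullmatch(text.strip()) is not None

for candidate in ["SE1 9QZ", "se19qz", "ZZ00 00ZZ", "EC1A 1BB", "hello"]:
    print(candidate, looks_like_postcode(candidate))
```

Besides the made-up "ZZ00 00ZZ" being accepted, note that this pattern has no place for a letter at the end of the outward code, so an input shaped like "EC1A 1BB" is rejected.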
Hope this helps.
Wikipedia has some regexes for UK Postcodes: http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation