I have a CSV file source that has a field containing "CR" and "LF" type escape characters.
I'm trying to use a DataFlow and Derived Column to remove the unwanted characters, but it's not quite working.
Here is my input with the unwanted escape chars
I'm using this expression against the synopsis column:
regexReplace(regexReplace(synopsis, `[\n]`, ''),`[\r]`, '')
As suggested in this similar post - How to replace CR and LF in Data Factory expressions
But I'm still getting some LF chars, and also a lot of extra commas.
Here is my output still with LF and extra Commas
Original in text format:
"TTL-100912","False",,"Bad Guys, The","GEN-ANI",,"Nobody has ever failed so hard at trying to be good as The Bad Guys.
In the new action comedy from DreamWorks Animation, based on the New York Times best-selling book series, a crackerjack criminal crew of animal outlaws are about to attempt their most challenging con yet--becoming model citizens.
Never have there been five friends as infamous as The Bad Guys--dashing pickpocket Mr. Wolf (Academy Award® winner Sam Rockwell, Three Billboards Outside Ebbing, Missouri), seen-it-all safecracker Mr. Snake (Marc Maron, GLOW), chill master-of-disguise Mr. Shark (Craig Robinson, Hot Tub Time Machine franchise), short-fused "muscle" Mr. Piranha (Anthony Ramos, In the Heights) and sharp-tongued expert hacker Ms. Tarantula (Awkwafina, Crazy Rich Asians), aka "Webs." The film co-stars Zazie Beetz (Joker), Lilly Singh (Bad Moms) and Emmy winner Alex Borstein (The Marvelous Mrs. Maisel).
Based on the blockbuster Scholastic book series by Aaron Blabey, THE BAD GUYS is directed by Pierre Perifel (animator, the Kung Fu Panda films), making his feature-directing debut. The film is produced by Damon Ross (development executive Trolls, The Boss Baby, co-producer Nacho Libre) and Rebecca Huntley (associate producer, The Boss Baby). The executive producers are Aaron Blabey, Etan Cohen and Patrick Hughes. ",,"BAD GUYS, THE","Bad Guys, The",,,,"2021-10-12 15:39:24",,
The way the source file was being produced was my complicating factor.
I changed the Row Delimiter in the production of my source csv to "¬" from "," and then was able to use this expression in a data flow to clean the contents of the Synopsis field:
regexReplace(synopsis, ',|[\n]|[\r]', ' ')
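To sanity-check the pattern outside Data Factory, the same replacement can be sketched in Python. This is only an approximation: Python's `re` module stands in for the data flow's regexReplace, and both the function name `clean_synopsis` and the sample string are made up for illustration.

```python
import re

# Same alternation as the data flow expression: a comma, an LF, or a CR.
PATTERN = re.compile(r",|[\n]|[\r]")


def clean_synopsis(synopsis: str) -> str:
    """Replace every comma, LF and CR with a single space."""
    return PATTERN.sub(" ", synopsis)


if __name__ == "__main__":
    sample = "Bad Guys, The\r\nIn the new action comedy"  # hypothetical input
    print(clean_synopsis(sample))
```

Note that each matched character becomes its own space, so a CR LF pair turns into two spaces.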
If I search on a table with a Name field using "^Word$" it will find it.
If I have a Wordform in a Word1 Word2 > Word3 construction e.g.
United States of America > USA
the same query will work. However if I do the same wordform in reverse e.g. Word3 > Word1 Word2:
USA > United States of America
Then it is not found using the same start/end modifier. However, my habit is to do Word1 > Word2 Word3 so that Word2 and Word3 can still be found in a search, which won't work the other way.
Is there a way to set up the Start/End modifier search so that it still finds W1 > W2 W3?
The only suggestion I have is to use regexp_filter to do the expansion, rather than wordforms.
regexp_filter = \bUSA\b => United States of America
or similar. A benefit is that you have more control over capitalization (e.g. only expand uppercase USA).
This means the expansion happens much earlier in the tokenization process, so it has less effect on extended query syntax.
In theory a query
"^Word$"
should then be turned into
"^United States of America$"
which still works :)
I think wordforms don't work because America$ will have been put into the index as a keyword, but the query is looking for both ^ and $ on the one word.
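As a rough illustration of what such a filter does to the text before tokenization, here is a Python sketch. It is not Sphinx itself: Python's `re` stands in for the regexp engine, and `expand` is a hypothetical helper name.

```python
import re

# Stand-in for: regexp_filter = \bUSA\b => United States of America
# The pattern is case-sensitive, so only uppercase "USA" is expanded.
FILTER = re.compile(r"\bUSA\b")


def expand(text: str) -> str:
    """Apply the expansion to raw text, as a filter would before tokenizing."""
    return FILTER.sub("United States of America", text)


if __name__ == "__main__":
    print(expand('"^USA$"'))
    print(expand("usa"))
```

Because the substitution works on the raw text, the ^ and $ modifiers in the query are left in place around the expanded phrase.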
I used pg_trgm to check string matches and I am pretty happy with the results. But it is not perfectly the way I want it. I want searches like "poduto" to find "produtos" (the r was missing), and also "sofáa" to find "sofa". I am using PostgreSQL 9.6.
It does find "vermelho" when I type "vermelo" (h is missing). And it does find "sofa" when I type "sof". It seems that only some letters in the middle can be left out, and I can always miss a final letter. I want to be able to miss any letter in the middle of the word, and also be able to make "two mistakes" as in the case of sofáa and sofá (I used an accent and one additional "a").
The solution is to lower pg_trgm.similarity_threshold (or pg_trgm.word_similarity_threshold if you are using <% or %>).
Then words with lower similarity will also be found.
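To see why the threshold matters, here is a simplified Python sketch of a trigram similarity. It assumes single lowercase words padded with two spaces in front and one behind, and shared trigrams divided by all distinct trigrams. The real extension handles multi-word strings and locale-dependent characters, and the exact comparison at the boundary is an assumption here, so treat the numbers as approximate.

```python
def trigrams(word: str) -> set:
    """Set of 3-character windows over the padded, lowercased word."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def matches(a: str, b: str, threshold: float = 0.3) -> bool:
    """Hypothetical stand-in for a threshold-based match test."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    for query, target in [("vermelo", "vermelho"), ("poduto", "produtos")]:
        print(query, target, round(similarity(query, target), 3))
```

A letter missing in the middle of a word breaks up to three trigrams at once, which is why such typos score lower than a missing final letter and need a lower threshold to match.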
I'm trying to use Postgres' full text search, but I'm struggling to get certain query phrases working properly when stemming is involved.
strawberries matches strawberry
fruity does not match fruit
From what I've read these stemming algorithms are internal to Postgres and can't necessarily be modified easily. Does anyone know if the -y suffix can be stemmed properly?
This is too long for a comment.
I assume you are at least familiar with the documentation on the subject. I think the simplest method would be to create a synonym dictionary with the pairs that are equivalent.
You need to be careful. There are lots of words in English where you cannot remove the "y":
lay <> la
Gaily <> Gail (woman's name)
Daily <> Dail (Irish parliament)
foxy <> fox
analog <> analogy
And this doesn't include the zillions of words where removing the "y" creates a non-word (a large class are -ly words; another, -way words).
You will need to manually create these yourself.
I am not intimately familiar with Postgres's dictionaries. But you should be able to accomplish what you want.
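The difference between a hand-maintained synonym list and a blanket "strip the y" rule can be sketched in Python. This is not how Postgres dictionaries are implemented; the `SYNONYMS` table and both function names are hypothetical and only illustrate the idea.

```python
# Hypothetical hand-maintained pairs; only words you have vetted go in here.
SYNONYMS = {"fruity": "fruit"}


def normalize(token: str) -> str:
    """Map a token through the synonym table, leaving unknown words alone."""
    token = token.lower()
    return SYNONYMS.get(token, token)


def naive_strip_y(token: str) -> str:
    """Blanket rule that drops a trailing 'y'; shown to illustrate the risk."""
    token = token.lower()
    return token[:-1] if token.endswith("y") else token


if __name__ == "__main__":
    for word in ["fruity", "daily", "lay"]:
        print(word, normalize(word), naive_strip_y(word))
```

The explicit table only touches the pairs you list, while the blanket rule also rewrites words such as "daily" and "lay".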
Can I request the start and stop position of matches in a document with Sphinx?
Given say
select ID from idx_1 WHERE (MATCH('#(name) "New York"'))
can I ask it to tell me the character position of the first letter, 'N', in New York and the last letter, 'K', in New Yor'K' in the match?
Sphinx does not track character positions, so can't directly tell you that.
You could use BuildExcerpts or the SNIPPET function, and then compare the output with the document to deduce the position yourself.
Or there is the PACKEDFACTORS function, which will give you many details of the ranking calculation, including the WORD position of each keyword. (Sphinx does track word positions, as all its matching is word (well, token) based.)
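If the original document text is available on the application side, deducing the character positions yourself could look something like this Python sketch. It does a plain case-insensitive substring search, which is an assumption: it does not reproduce Sphinx's tokenization, stemming or wordforms, and `phrase_spans` is a hypothetical helper.

```python
import re


def phrase_spans(document: str, phrase: str) -> list:
    """Return (first_char, last_char) offsets, both inclusive, per occurrence."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return [(m.start(), m.end() - 1) for m in pattern.finditer(document)]


if __name__ == "__main__":
    text = "I love New York City"  # hypothetical document body
    print(phrase_spans(text, "New York"))
```

Offsets are zero-based, so the first tuple element is the position of the 'N' and the second is the position of the final 'k'.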
Yes, you can use a query like the one below when using SphinxQL:
select * from courses where MATCH('#title PMP') limit 0, 100;