Why do some characters become ? and others become ☐ (␇) when encoding into a code page? - encoding

Short version
What's the reasoning behind the mapper sometimes using ? and other times using ☐?
Unicode:  €‚„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ
In 850:   ?'".┼╬^%S<OZ''""☐--~Ts>ozY
          ^\_____________/^\_______/
          |       |       |    |
          |    best fit   | best fit
          |               |
     replacement     replacement
CS Fiddle
Long Version
I was encoding some text to code page 850, and while a lot of characters that users use exist perfectly in the 850 code page, there are some that don't match exactly. Instead the mapper (e.g. .NET System.Text.Encoding, or the WinAPI function WideCharToMultiByte) provides a best fit:
| Character | In code-page 850 |
|-----------|------------------|
| | |
| | |
| ‚ U+201A | ' |
| | |
| „ U+201E | " |
| … U+2026 | . |
| † U+2020 | ┼ |
| ‡ U+2021 | ╬ |
| ˆ U+02C6 | ^ |
| ‰ U+2030 | % |
| Š U+0160 | S |
| ‹ U+2039 | < |
| ΠU+0152 | O |
| | |
| Ž U+017D | Z |
| | |
| | |
| ‘ U+2018 | ' |
| ’ U+2019 | ' |
| “ U+201C | " |
| ” U+201D | " |
| | |
| – U+2013 | - |
| — U+2014 | - |
| ˜ U+02DC | ~ |
| ™ U+2122 | T |
| š U+0161 | s |
| › U+203A | > |
| œ U+0153 | o |
| | |
| ž U+017E | z |
| Ÿ U+0178 | Y |
These best fits are right, good, appropriate, wanted, and entirely reasonable.
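If you want to watch the mapper make those choices yourself, here is a minimal Windows-only sketch (Python via ctypes; the sample string and buffer size are just illustrative) that pushes a few of the characters above through WideCharToMultiByte for code page 850:

import ctypes  # Windows-only sketch

kernel32 = ctypes.windll.kernel32
WC_NO_BEST_FIT_CHARS = 0x0400

text = "Š†–"                              # characters that only have best-fit equivalents
buf = ctypes.create_string_buffer(32)
used_default = ctypes.c_int(0)

# Default flags: the mapper may substitute best fits (S, ┼, - per the table above).
n = kernel32.WideCharToMultiByte(850, 0, text, len(text),
                                 buf, len(buf), None, ctypes.byref(used_default))
print(buf.raw[:n], bool(used_default.value))

# WC_NO_BEST_FIT_CHARS forbids best fits; anything unmappable falls back to the
# default character (?) and lpUsedDefaultChar gets set.
n = kernel32.WideCharToMultiByte(850, WC_NO_BEST_FIT_CHARS, text, len(text),
                                 buf, len(buf), None, ctypes.byref(used_default))
print(buf.raw[:n], bool(used_default.value))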
But some characters do not map:
| Character | In code-page 850 |
|-----------|------------------|
| € U+20AC | ? 0x3F (literally U+003F QUESTION MARK) |
| • U+2022 | ☐ 0x07 (literally U+0007 BELL) |
What's the deal?
Why is it sometimes a question mark, and other times a ␇?
Note: This lack of mapping isn't terribly important to me. If the federal government doesn't support a reasonable encoding, then they'll take the garbage i give them. So i'm fine with it.
A problem comes later when i try to call MultiByteToWideChar to reverse the mapping, and the function fails due to invalid characters. And while i can try to figure out the issue with reverse encoding back into characters later, i'm curious what the encoding mapper is trying to tell me.
Bonus fun
The careful observer will understand why i chose the characters i did, in the order i did, and why there are gaps. I didn't want to mention it, so as not to confuse readers of the question.

The answer is both subtle and obvious.
When performing a mapping, the encoder tries to perform a best fit. So while a lot of the characters don't exist in the target code page, they can be approximated well enough.
Some characters don't have any equivalent, nor any best-fit mapping, and so are simply replaced with ?:
U+003F QUESTION MARK
So the text:
The price of petrol is € 1.56 in Germany.
Will unfortunately become:
The price of petrol is ? 1.56 in Germany.
The question mark means that the character has no equivalent and was just lost.
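As an aside, you can see the same fallback with Python's built-in cp850 codec (which has no best-fit table at all, so the ? replacement is the only behaviour it reproduces):

# Python's cp850 codec has no best-fit data; 'replace' substitutes ? for anything missing.
print("The price of petrol is € 1.56 in Germany.".encode("cp850", errors="replace"))
# b'The price of petrol is ? 1.56 in Germany.'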
The other character is more subtle
In ASCII, the first 32 characters are control characters, e.g.
13: Carriage Return (␍)
10: Line Feed (␊)
9: Horizontal Tab (␉)
11: Vertical Tab (␋)
7: Bell (␇)
27: Escape (␛)
30: Record Separator (␞)
These control codes are generally unprintable. But code page 437 did something unique: it defined printable characters for those first 32 codes:
13: Eighth note (♪)
10: Inverse white circle (◙)
9: White circle (○)
11: Male Sign (♂)
7: Bullet (•)
27: Left Arrow (←)
30: Black up-pointing triangle (▲)
This has interesting implications if you had some text such as:
The price of petrol␍␊
• Germany: €1.56␍␊
• France: €1.49
When encoded in Code Page 850 becomes:
The price of petrol♪◙• Germany: ?1.56♪◙• France: ?1.49
Notice 3 things:
The € symbol was lost; replaced with ?
The • symbol was retained
The CR LF symbols were lost; replaced with ♪ and ◙
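If you want to reproduce that glyph view of the bytes without the WinAPI, here's a portable sketch using the list above (hand-rolled, and only covering the handful of glyphs listed; the real code page assigns glyphs to all of 1..31):

# Glyphs that code page 437 assigns to a few of the control codes (from the list above).
GLYPHS = {13: "♪", 10: "◙", 9: "○", 11: "♂", 7: "•", 27: "←", 30: "▲"}

raw = b"The price of petrol\r\n\x07 Germany: ?1.56"   # 0x07 is the encoded bullet, ? is the lost €
as_controls = raw.decode("cp437")          # Python keeps bytes 0..31 as control characters
as_glyphs = as_controls.translate(GLYPHS)  # swap the controls for their CP437 glyphs
print(repr(as_controls))                   # 'The price of petrol\r\n\x07 Germany: ?1.56'
print(repr(as_glyphs))                     # 'The price of petrol♪◙• Germany: ?1.56'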
Trying to decode the code page 437/850 back to real characters presents a problem:
If i want to retain my CRLF, i have to assume that the characters in the 1..32 range actually are ASCII control characters
The price of petrol␍␊
␇ Germany: €1.56␍␊
␇ France: €1.49
If i want to retain my characters (e.g. ¶, •, §), i have to permanently lose my CRLF, and assume that the characters in 1..32 are actually characters:
The price of petrol♪◙• Germany: €1.56♪◙• France: €1.49
There's no good way out of this.
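To see the two poisons side by side on Windows itself, here is a ctypes sketch of the decode direction. MB_USEGLYPHCHARS is the real flag; the rest is illustration, and exactly which low bytes get the glyph treatment is the subtlety Kaplan's article below digs into:

import ctypes  # Windows-only sketch

kernel32 = ctypes.windll.kernel32
MB_USEGLYPHCHARS = 0x0004

raw = bytes([0x0D, 0x0A, 0x07]) + b" Germany: ?1.56"   # CR LF bullet-as-BEL + text
out = ctypes.create_unicode_buffer(64)

# Poison 1 (default flags): bytes 1..31 come back as ASCII control characters,
# so the CRLF survives but the bullet is now a BEL.
n = kernel32.MultiByteToWideChar(437, 0, raw, len(raw), out, len(out))
print(repr(out[:n]))

# Poison 2 (MB_USEGLYPHCHARS): the low bytes come back as CP437 glyph characters
# instead of controls, so the bullet survives at the cost of the CRLF.
n = kernel32.MultiByteToWideChar(437, MB_USEGLYPHCHARS, raw, len(raw), out, len(out))
print(repr(out[:n]))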
Ideally Code Page 437 would not have done this to the first 32 characters in the code page, and would have kept them as control characters. And ideally anyone trying to convert this text to 437:
• The price of petrol is € 1.56 in Germany ♪sad song♪
would come back with
? The price of petrol is ? 1.56 in Germany ?sad song?
But that's not what the 437 code page is.
It's a horrible mess, where you have to pick your poison and die slowly.
Rest in Peace Michael Kaplan
This answer brought to you by "☼" (U+263c, a.k.a. WHITE SUN WITH RAYS)
A proud member of the glyph chars collection for more years than Seattle has seen sun
See Michael Kaplan's archived blog entry (🕗):
What the &%#$ does MB_USEGLYPHCHARS do?
I'm still angry at the Microsoft PM who shut down his blog out of spite.

Related

PySpark SQL query to return row with most number of words

I am trying to come up with a PySpark SQL query to return the row within the text column of the review DataFrame with the greatest number of words.
I would like to return both the full text as well as the number of words. This question concerns the reviews in the Yelp dataset. Here is what I have so far, but apparently it is not (fully) correct:
query = """
SELECT text,LENGTH(text) - LENGTH(REPLACE(text,' ', '')) + 1 as count
FROM review
GROUP BY text
ORDER BY count DESC
"""
spark.sql(query).show()
Here is an example of a few rows from the dataframe:
[Row(business_id='ujmEBvifdJM6h6RLv4wQIg', cool=0, date='2013-05-07 04:34:36', funny=1, review_id='Q1sbwvVQXV2734tPgoKj4Q', stars=1.0, text='Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.', useful=6, user_id='hG7b0MtEbXx5QzbzE6C_VA'),
Row(business_id='NZnhc2sEQy3RmzKTZnqtwQ', cool=0, date='2017-01-14 21:30:33', funny=0, review_id='GJXCdrto3ASJOqKeVWPi6Q', stars=5.0, text="I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon! I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level! \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit. Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room. Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure. That was superb! Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen. The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement. It was so much fun to be there! \n\nNext Travis started with the flat iron. The way he flipped his wrist to get volume all around without over-doing it and making me look like a Texas pagent girl was admirable. It's also worth noting that he didn't fry my hair -- something that I've had happen before with less skilled stylists. At the end of the blowout & style my hair was perfectly bouncey and looked terrific. The only thing better? That this awesome blowout lasted for days! \n\nTravis, I will see you every single time I'm out in Vegas. You make me feel beauuuutiful!", useful=0, user_id='yXQM5uF2jS6es16SJzNHfg'),
Row(business_id='WTqjgwHlXbSFevF32_DJVw', cool=0, date='2016-11-09 20:09:03', funny=0, review_id='2TzJjDVDEuAW6MR5Vuc1ug', stars=5.0, text="I have to say that this office really has it together, they are so organized and friendly! Dr. J. Phillipp is a great dentist, very friendly and professional. The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable! I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit! I highly recommend this office for the nice synergy the whole office has!", useful=3, user_id='n6-Gk65cPZL6Uz8qRm3NYw')]
And the expected output, if this were the review with the most words:
I have to say that this office really has it together, they are so organized and friendly! Dr. J. Phillipp is a great dentist, very friendly and professional. The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable! I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit! I highly recommend this office for the nice synergy the whole office has!
And then something like Word count = xxxx
Edit: Here is the example output for the first returned review using this code:
query = """
SELECT text, size(split(text, ' ')) AS word_count
FROM review
ORDER BY word_count DESC
"""
spark.sql(query).show(20, False)
Review returned with highest number of words:
Got a date with de$tiny?
** A ROMANTIC MOMENT WITH **
** THE BEST VIEW IN TOWN**
**CN TOWER'S**
**REVOLVING RESTAURANT**
[... dozens of lines of ASCII art drawing the revolving restaurant, reduced to empty "| |" rows in this paste ...]
uhm, maybe not. the view may be great but a $30 to
$40 bleh $teak ain't necessarily gonna get you some
action later. Cheaper to get takeout from Harvey's and
eat and the beach!
(word_count column for this row: 4329)
This encapsulates the UDF you had in native SQL logic, by splitting the string into an array of words and taking the size of the array.
spark.sql("SELECT text, size(split(text, ' ')) as word_count FROM review ORDER BY word_count DESC").show(200, False)
Example
data = [("This is a sentence.",), ("This sentence has 5 words.", )]
review = spark.createDataFrame(data, ("text", ))
review.registerTempTable("review")
spark.sql("SELECT text, size(split(text, ' ')) as word_count FROM review ORDER BY word_count DESC").show(200, False)
Output
+--------------------------+----------+
|text |word_count|
+--------------------------+----------+
|This sentence has 5 words.|5 |
|This is a sentence. |4 |
+--------------------------+----------+
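If you only need the single longest review together with its count, which is what the question asks for, you can keep just the first row of the same query. A small sketch:

top = spark.sql(
    "SELECT text, size(split(text, ' ')) AS word_count "
    "FROM review ORDER BY word_count DESC LIMIT 1"
).first()

print(top["text"])
print("Word count =", top["word_count"])

Bear in mind that splitting on a single space also counts runs of whitespace, which is why the ASCII-art review in the question's edit comes out on top.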

Generating a markdown table with key bindings in Spacemacs

What is the best way to generate a markdown table with key bindings in Spacemacs (evil mode)?
Update: To clarify, this question is not about editing markdown, but automatically generating the table content for a large number of key bindings.
This could be an elisp function iterating through the possible single keystrokes (letters, numbers, punctuation, and possibly space and some control characters, with and without modifier keys), seeing what function each key is bound to (if any), and getting the description of the function.
You can do that manually using SPC h d k, but it would be handy to generate a table, given the number of possible key bindings and the way they can depend on the buffer mode and the state.
The table should show single keystrokes (letters, numbers, punctuation) with and without modifiers, the function bound to them, and the first line of the function description.
The result should look something like this:
https://github.com/cjolowicz/howto/blob/master/spacemacs.md
| Key | Mnemonic | Description | Function |
| ------ | -------- | --------------------------------------------------------------- | ------------------------ |
| a | *append* | Switch to Insert state just after point. | `evil-append` |
| b | *backward* | Move the cursor to the beginning of the COUNT-th previous word. | `evil-backward-word-begin` |
| c | *change* | Change text from BEG to END with TYPE. | `evil-change` |
| d | *delete* | Delete text from BEG to END with TYPE. | `evil-delete` |
| e | *end* | Move the cursor to the end of the COUNT-th next word. | `evil-forward-word-end` |
| f | *find* | Move to the next COUNT’th occurrence of CHAR. | `evil-find-char` |
| g | *goto* | (prefix) | |
| h | | Move cursor to the left by COUNT characters. | `evil-backward-char` |
| i | *insert* | Switch to Insert state just before point. | `evil-insert` |
| j | | Move the cursor COUNT lines down. | `evil-next-line` |
| k | | Move the cursor COUNT lines up. | `evil-previous-line` |
| l | | Move cursor to the right by COUNT characters. | `evil-forward-char` |
| m | *mark* | Set the marker denoted by CHAR to position POS. | `evil-set-marker` |
| n | *next* | Goes to the next occurrence. | `evil-ex-search-next` |
| o | *open* | Insert a new line below point and switch to Insert state. | `evil-open-below` |
| p | *paste* | Disable paste transient state if there is more than 1 cursor. | `evil-mc-paste-after` |
| q | | Record a keyboard macro into REGISTER. | `evil-record-macro` |
| r | *replace* | Replace text from BEG to END with CHAR. | `evil-replace` |
| s | *substitute* | Change a character. | `evil-substitute` |
| t | *to* | Move before the next COUNT’th occurrence of CHAR. | `evil-find-char-to` |
| u | *undo* | Undo changes. | `evil-tree-undo` |
| v | *visual* | Characterwise selection. | `evil-visual-char` |
| w | *word* | Move the cursor to the beginning of the COUNT-th next word. | `evil-forward-word-begin` |
| x | *cross* | Delete next character. | `evil-delete-char` |
| y | *yank* | Saves the characters in motion into the kill-ring. | `evil-yank` |
| z | *scroll* | (prefix) | |
(The Mnemonic column would of course be handcrafted.)
The orgtbl-mode minor mode that comes with Org (and therefore Emacs itself) should be able to help here. Activate it, then use Tab and Ret to navigate from cell to cell, letting orgtbl create and balance cells as you go. (Balancing happens when you navigate to a new cell, e.g. with Tab.)
You'll have to start the table yourself, e.g. with something like
| Key | Mnemonic | Description | Function |
|-
but from there orgtbl can take over. You can also use things like org-table-insert-column and org-table-move-row-down to make other kinds of tabular changes.
I'm not entirely sure how nicely this will play with evil-mode or what bindings it will use out of the box, but it's worth a try.

Full text search configuration on postgresql

I'm facing an issue concerning the text search configuration on postgresql.
I have a table users which contains a column name. The names of users may be in French, English, Spanish, or any other language.
So I need to use the Full Text Search of PostgreSQL. The default text search configuration I'm using now is simple, but it is not adequate for the search and does not give suitable results.
I'm trying to combine different text search configuration like this:
(to_tsvector('english', document) || to_tsvector('french', document) || to_tsvector('spanish', document) || to_tsvector('russian', document)) @@
(to_tsquery('english', query) || to_tsquery('french', query) || to_tsquery('spanish', query) || to_tsquery('russian', query))
But this query didn't give suitable results, if we test for example:
select (to_tsvector('english', 'adam and smith') || to_tsvector('french', 'adam and smith') || to_tsvector('spanish', 'adam and smith') || to_tsvector('russian', 'adam and smith'))
tsvector: 'adam':1,4,7,10 'and':5,8 'smith':3,6,9,12
Using the origin language of the word:
select (to_tsvector('english', 'adam and smith'))
tsvector: 'adam':1 'smith':3
The first thing to mention is that the stopwords were not taken into consideration when we combine different configurations with the || operator.
Is there any solution to combine different text search configurations and use the suitable language when a user searches for text?
Maybe you think that || is an “or” operator, but it concatenates text search vectors.
Take a look at what happens in your expression.
Running \dF+ french in psql will show you that for asciiwords, a French Snowball stemmer is used. That removes stop words and reduces the words to their stem. Similar for English and Russian.
You can use ts_debug to see this in operation:
test=> SELECT * FROM ts_debug('english', 'adam and smith');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | adam | {english_stem} | english_stem | {adam}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | and | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | smith | {english_stem} | english_stem | {smith}
(5 rows)
test=> SELECT * FROM ts_debug('french', 'adam and smith');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+---------------+-------------+---------
asciiword | Word, all ASCII | adam | {french_stem} | french_stem | {adam}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | and | {french_stem} | french_stem | {and}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | smith | {french_stem} | french_stem | {smith}
(5 rows)
Now if you concatenate these four tsvectors, you end up with adam in position 1, 4, 7 and 10.
There is no good way to use full text search for different languages at once.
But if it is really personal names you are searching, I would do the following:
Create a text search configuration with a simple dictionary for asciiwords, and either use an empty stopword file for the dictionary or one that contains stopwords that are acceptable in all languages.
Personal names normally should not be stemmed, so you avoid that problem. And if you miss a stopword, that's no big deal. It only makes the resulting tsvector (and index) larger, but with personal names there should not be too many stopwords anyway.
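A sketch of that suggestion, driven from Python (psycopg2, the connection string, and all the object names here are placeholders; the interesting part is the DDL):

import psycopg2  # any client that can run DDL works; psycopg2 is just an example

conn = psycopg2.connect("dbname=test")   # placeholder connection string
cur = conn.cursor()

# A dictionary based on the built-in 'simple' template: lower-cases words, never stems.
# Add STOPWORDS = somefile to point it at a hand-made
# $SHAREDIR/tsearch_data/somefile.stop; omitting it means no stopwords at all.
cur.execute("CREATE TEXT SEARCH DICTIONARY names_dict (TEMPLATE = pg_catalog.simple)")

# A configuration that uses that dictionary for plain words.
cur.execute("CREATE TEXT SEARCH CONFIGURATION names (COPY = pg_catalog.simple)")
cur.execute("""ALTER TEXT SEARCH CONFIGURATION names
                 ALTER MAPPING FOR asciiword, word WITH names_dict""")
conn.commit()

# Use the same configuration on both sides of the match.
cur.execute("SELECT to_tsvector('names', %s) @@ to_tsquery('names', %s)",
            ("adam smith", "adam & smith"))
print(cur.fetchone()[0])   # True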

Error in org-table-sum org-mode?

I am just getting started with Emacs org-mode and I am already getting really confused about a simple column sum (org-table-sum). I start with
| date | sum |
|------+-------|
| | 16.2 |
| | 6.16 |
| | 6.16 |
| | |
When I hit C-c + (org-table-sum) below the second column I get the correct sum 28.52. If I add another line to make it
| date | sum |
|------+-------|
| | 16.2 |
| | 6.16 |
| | 6.16 |
| | 13.11 |
| | |
C-c + gives me 41.629999999999995. ???
If I change the last line from 13.11 to 13.12, C-c + will give me (the correct) 41.64.
WTF?
Any explanation appreciated! Thanks!
Most decimal numbers cannot be represented exactly in binary floating point encoding (either single or double precision).
Test 13.11 here, to see that after conversion to single precision, the nearest representable number is 13.109999656677246 (double precision gets much closer, but is still not exact).
This problem is not emacs related, but is a fundamental issue when working with floating point representation in a different base (binary rather than decimal).
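You can reproduce the effect outside Emacs in any language that uses IEEE-754 doubles; a small Python illustration:

values = [16.2, 6.16, 6.16, 13.11]
print(sum(values))                  # 41.629999999999995 -- the same artefact as C-c +

# 13.11 itself is not exactly representable; this prints the double that is actually stored:
from decimal import Decimal
print(Decimal(13.11))

# Exact decimal arithmetic (in spirit, what calc's arbitrary precision buys you) gives 41.63:
print(sum(Decimal(str(v)) for v in values))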
Using calc's vsum, the result is OK:
| date | sum |
|------+-------|
| | 16.2 |
| | 6.16 |
| | 6.16 |
| | 13.11 |
|------+-------|
| | 41.63 |
#+TBLFM: @6$2=vsum(@I..@II)
This works because calc works with arbitrary precision and will not encode the numbers in a binary floating point format.

Escaping special characters in to_tsquery

How do you escape special characters in a string passed to to_tsquery? For instance, this kind of query:
select to_tsquery('AT&T');
Produces:
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
to_tsquery
------------
(1 row)
Edit: I also noticed that there is the same issue in to_tsvector.
A simple solution is to create the tsquery as follows:
select $$'AT&T'$$::tsquery;
You can make more complex queries:
select $$'AT&T' & Phone | '|Bang!'$$::tsquery;
See the text search docs for more.
I found this comment very useful: it uses the plainto_tsquery('AT&T') function https://stackoverflow.com/a/16020565/350195
If you want 'AT&T' to be treated as a search word, you're going to need some customised components, because the default parser splits it as two words:
steve#steve#[local] =# select * from ts_parse('default', 'AT&T');
tokid | token
-------+-------
1 | AT
12 | &
1 | T
(3 rows)
steve#steve#[local] =# select * from ts_debug('simple', 'AT&T');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+--------------+------------+---------
asciiword | Word, all ASCII | AT | {simple} | simple | {at}
blank | Space symbols | & | {} | |
asciiword | Word, all ASCII | T | {simple} | simple | {t}
(3 rows)
As you can see from the documentation for CREATE TEXT SEARCH PARSER, this is not trivial, as the parser appears to need to be a C function.
You might find this post of someone getting "underscore_word" to be recognised as a single token useful: http://postgresql.1045698.n5.nabble.com/Configuring-Text-Search-parser-td2846645.html