Is There a Way to do Binomial Regression in PySpark?

I'm working with a PySpark dataframe and I need to do a binomial regression with more than one trial in each row. For example, my table looks like this:
┌──────────┬──────────┬─────────────┬────────────┐
│ Features │ # Trials │ # Successes │ # Failures │
├──────────┼──────────┼─────────────┼────────────┤
│ ... │ 10 │ 4 │ 6 │
│ ... │ 7 │ 2 │ 5 │
│ ... │ 5 │ 4 │ 1 │
└──────────┴──────────┴─────────────┴────────────┘
I don't want to 'ungroup' the data. In statsmodels, it is possible to run a binomial regression directly on the grouped data with a Patsy formula:
formula = '# Successes + # Failures ~ Features'
Is there a way to do so in PySpark as well?
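One possible approach (a sketch, not a confirmed answer): Spark ML's GeneralizedLinearRegression supports family="binomial", and, following R's glm semantics, grouped binomial data can be expressed as a proportion label with the number of trials as the instance weight. The column names below (grouped_df, trials, successes, f1, f2) are assumptions based on the table above; verify the weight semantics against the Spark version in use.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.sql import functions as F

# Assumed columns: "trials", "successes", and raw feature columns f1, f2.
df = grouped_df.withColumn("proportion", F.col("successes") / F.col("trials"))
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")

# A fractional label with the trial count as the instance weight mirrors
# R's glm(cbind(successes, failures) ~ ..., family = binomial) formulation.
glr = GeneralizedLinearRegression(
    family="binomial",
    link="logit",
    labelCol="proportion",
    weightCol="trials",
)
model = glr.fit(assembler.transform(df))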

Related

mongodb configuration: specify dbpath relative to config file

Let's consider 3 directories, each holds config and data files of a running mongod instance:
D:\ELIAV\PROJECTS\TMP\MONGODB-CLUSTER
│
├───shard1-1
│ │ mongod.conf
│ │
│ ├───data
│ │ └───db
│ └───logs
├───shard1-2
│ │ mongod.conf
│ │
│ ├───data
│ │ └───db
│ └───logs
└───shard1-3
│ mongod.conf
│
├───data
│ └───db
└───logs
I want to be able to run a cmd script from the MONGODB-CLUSTER folder and start all the instances.
Let's take a peek at one of the conf files:
...
systemLog:
  destination: file
  path: "./logs/mongod.log"
...
I want to be able to specify relative paths for storage (dbpath) and logs for each instance, regardless of the directory from which the CLI running the commands is invoked.
The dbpath and log paths should be relative, not absolute, because I want the setup to keep working even if the folder is relocated.
So, is it possible to specify a path relative to the mongod config file?
Alternatively, maybe there is a magic variable for the current conf file, like $conf, so I could specify the path as $conf/logs/mongod.log, for example.
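As far as I know, mongod resolves relative paths in the config file against the process's working directory, not the config file's location, so one workaround is a launcher script that enters each instance directory before starting mongod. A minimal batch sketch (assuming the directory names from the tree above and that mongod is on the PATH):
@echo off
rem Start each mongod from its own directory so the relative paths in
rem its mongod.conf resolve against that directory.
for /d %%d in (shard1-*) do (
    start "mongod %%d" /d "%%d" mongod --config mongod.conf
)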

Kustomize: how to apply the same patch in multiple overlays without LoadRestrictionsNone

I have a kustomize layout something like this:
├──release
│ ├──VariantA
│ │ └──kustomization.yaml
│ │ cluster_a.yaml
│ └──VariantB
│ └──kustomization.yaml
│ cluster_b.yaml
└──test
├──TestVariantA
│ └──kustomization.yaml; resources=[VariantA]
│ common_cluster_patch.yaml
└──TestVariantB
└──kustomization.yaml; resources=[VariantB]
common_cluster_patch.yaml
My issue is the duplication of common_cluster_patch.yaml. It is a common patch which I need to apply to the different base cluster objects. I would prefer not to have to maintain identical copies of it for each test variant.
The 2 unsuccessful solutions I tried are:
A common patch resource
├──release
│ ├──VariantA
│ │ └──kustomization.yaml
│ │ cluster_a.yaml
│ └──VariantB
│ └──kustomization.yaml
│ cluster_b.yaml
└──test
├──TestVariantA
│ └──kustomization.yaml; resources=[VariantA, TestPatch]
├──TestVariantB
│ └──kustomization.yaml; resources=[VariantB, TestPatch]
└──TestPatch
└──kustomization.yaml
common_cluster_patch.yaml
This fails with no matches for Id Cluster..., presumably because TestPatch is trying to patch an object it doesn't contain.
A common patch directory
├──release
│ ├──VariantA
│ │ └──kustomization.yaml
│ │ cluster_a.yaml
│ └──VariantB
│ └──kustomization.yaml
│ cluster_b.yaml
└──test
├──TestVariantA
│ └──kustomization.yaml; resources=[VariantA]; patches=[../TestPatch/common_cluster_patch.yaml]
├──TestVariantB
│ └──kustomization.yaml; resources=[VariantB]; patches=[../TestPatch/common_cluster_patch.yaml]
└──TestPatch
└──common_cluster_patch.yaml
This fails with: '/path/to/test/TestPatch/common_cluster_patch.yaml' is not in or below '/path/to/test/TestVariantA'.
I can work around this and successfully generate my templates with kustomize build --load-restrictor LoadRestrictionsNone, but this comes with dire warnings and portents. I am hoping that there is some better way of organising my resources which doesn't require either workarounds or duplication.
Thanks to criztovyl for this answer! The solution is kustomize components. Components are currently only defined in kustomize.config.k8s.io/v1alpha1 and the reference documentation is a stub, but they are included in current release versions of kustomize.
My solution now looks like:
├──release
│ ├──VariantA
│ │ └──kustomization.yaml
│ │ cluster_a.yaml
│ └──VariantB
│ └──kustomization.yaml
│ cluster_b.yaml
└──test
├──TestVariantA
│ └──kustomization.yaml; resources=[VariantA]; components=[../TestCommon]
├──TestVariantB
│ └──kustomization.yaml; resources=[VariantB]; components=[../TestCommon]
└──TestCommon
└──kustomization.yaml; patches=[common_cluster_patch.yaml]
common_cluster_patch.yaml
where test/TestCommon/kustomization.yaml has the header:
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
The crucial difference between a component and a resource is that a component is applied after other processing. This means it can patch an object in the resource which included it.
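For illustration, the two kustomization files could look something like this (a sketch; the relative resource path and the patches syntax are assumptions based on the layout above and may need adjusting for your kustomize version):
# test/TestCommon/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - path: common_cluster_patch.yaml

# test/TestVariantA/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../release/VariantA
components:
  - ../TestCommon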

How to compile Unicode characters in LaTeX?

I'm trying to write a little piece of text with LaTeX in Overleaf. Everything works fine until I use Unicode characters.
For example, I want to insert this Devanagari symbol: ऄ and have it appear after LaTeX compiles the document.
This is an example of my document:
\documentclass[a4paper,12pt,openright,notitlepage,twoside]{book}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
symbol: ऄ
\end{document}
When I compile with LaTeX, the symbol doesn't appear and I get this error:
Package inputenc Error: Unicode character ऄ (U+0904) not set up for use with LaTeX.
When I compile with LuaLaTeX or XeLaTeX, the character still does not appear, but the error message disappears.
I tried all the methods described in this post: https://tex.stackexchange.com/questions/34604/entering-unicode-characters-in-latex but none of them worked for me.
Does anyone have a solution to this problem?
If you compile with xelatex or lualatex, you'll need to select a font which contains this glyph.
If you work locally on your computer, you can run the command albatross ऄ from the command line to find out which of your fonts has it:
__ __ __
.---.-.| | |--.---.-.| |_.----.-----.-----.-----.
| _ || | _ | _ || _| _| _ |__ --|__ --|
|___._||__|_____|___._||____|__| |_____|_____|_____|
Unicode code point [904] mapping to ऄ
┌─────────────────────────────────────────────────────────────────────────────┐
│ Font name │
├─────────────────────────────────────────────────────────────────────────────┤
│ .LastResort │
├─────────────────────────────────────────────────────────────────────────────┤
│ Devanagari Sangam MN,देवनागरी संगम एम॰एन॰ │
├─────────────────────────────────────────────────────────────────────────────┤
│ ITF Devanagari Marathi,आई॰टी॰एफ़॰ देवनागरी मराठी │
├─────────────────────────────────────────────────────────────────────────────┤
│ ITF Devanagari,आई॰टी॰एफ़॰ देवनागरी │
├─────────────────────────────────────────────────────────────────────────────┤
│ Kohinoor Devanagari,कोहिनूर देवनागरी │
├─────────────────────────────────────────────────────────────────────────────┤
│ Lohit Devanagari │
├─────────────────────────────────────────────────────────────────────────────┤
│ Lohit Hindi │
├─────────────────────────────────────────────────────────────────────────────┤
│ Shobhika,Shobhika Regular │
├─────────────────────────────────────────────────────────────────────────────┤
│ Shree Devanagari 714,श्री देवनागरी ७१४ │
└─────────────────────────────────────────────────────────────────────────────┘
Or, if you are using Overleaf, consult this list of installed fonts: https://www.overleaf.com/latex/examples/fontspec-all-the-fonts/hjrpnxhrrtxc
So in my case, I can take e.g. the Shobhika font:
% !TeX TS-program = xelatex
\documentclass[a4paper,12pt,openright,notitlepage,twoside]{book}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{fontspec}
\setmainfont{Shobhika}
\begin{document}
symbol: ऄ
\end{document}

postgresql [42883] ERROR: function to_tsvector("unknown", "unknown") does not exist

I am new to PostgreSQL and am attempting to use the full-text search function to_tsvector; however, I am running into an error.
SQL and error
SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog.');
[42883] ERROR: function to_tsvector("unknown", "unknown") does not exist Hint: No function matches the given name and argument types. You may need to add explicit type casts.
SQL and error for a different attempt
SELECT to_tsvector('english'::character, 'The quick brown fox jumped over the lazy dog.'::character);
[42883] ERROR: function to_tsvector(character, character) does not exist Hint: No function matches the given name and argument types. You may need to add explicit type casts.
This is frustrating because this feels like the 'hello world' of getting to_tsvector working, yet I cannot even get this to return. I am using DataGrip 2020.2 with Postgres, but I'm not sure how to see which version of Postgres I am using (I think it is a newer version). Is there a clear mistake in my code above?
You can check what argument types are accepted (I am using the psql client):
postgres=# \df to_tsvector
List of functions
┌────────────┬─────────────┬──────────────────┬─────────────────────┬──────┐
│ Schema │ Name │ Result data type │ Argument data types │ Type │
╞════════════╪═════════════╪══════════════════╪═════════════════════╪══════╡
│ pg_catalog │ to_tsvector │ tsvector │ json │ func │
│ pg_catalog │ to_tsvector │ tsvector │ jsonb │ func │
│ pg_catalog │ to_tsvector │ tsvector │ regconfig, json │ func │
│ pg_catalog │ to_tsvector │ tsvector │ regconfig, jsonb │ func │
│ pg_catalog │ to_tsvector │ tsvector │ regconfig, text │ func │
│ pg_catalog │ to_tsvector │ tsvector │ text │ func │
└────────────┴─────────────┴──────────────────┴─────────────────────┴──────┘
(6 rows)
There is no variant for the types character, character.
Your first query works on my machine. Please check the version of Postgres that you use; only very old, long-unsupported releases lack this functionality:
postgres=# SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog.');
┌───────────────────────────────────────────────────────┐
│ to_tsvector │
╞═══════════════════════════════════════════════════════╡
│ 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2 │
└───────────────────────────────────────────────────────┘
(1 row)
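To see which Postgres version you are running, you can use the built-in version() function:
postgres=# SELECT version();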
When you want to use explicit types, you can use regconfig and text:
postgres=# SELECT to_tsvector('english'::regconfig,
'The quick brown fox jumped over the lazy dog.'::text);
┌───────────────────────────────────────────────────────┐
│ to_tsvector │
╞═══════════════════════════════════════════════════════╡
│ 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2 │
└───────────────────────────────────────────────────────┘
(1 row)

What is the exact difference between Windows-1252(1/3/4) and ISO-8859-1?

We are hosting PHP apps on a Debian based LAMP installation.
Everything is quite ok - performance, administrative and management wise.
However, being somewhat new devs (we're still in high school), we've run into some problems with character encoding for Western charsets.
After doing a lot of research, I have come to the conclusion that the information online is somewhat confusing. It talks about Windows-1252 being ANSI and totally ISO-8859-1 compatible.
So anyway, what is the difference between Windows-1252(1/3/4) and ISO-8859-1?
And where does ANSI come into this anyway?
What encoding should we use on our Debian servers (and workstations) in order to ensure that clients get all information in the intended way and that we don't lose any chars on the way?
I'd like to answer this in a more web-oriented manner, and to answer it we need a little history. Joel Spolsky has written a very good introductory article on the absolute minimum every dev should know about Unicode and character encodings.
Bear with me here, because this is going to be somewhat of a looong answer. :)
For the history, I'll point to some quotes from there. (Thank you very much, Joel! :))
The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes.
And all was good, assuming you were an English speaker.
Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.
So now "OEM character sets" were distributed with PCs and these were still all different and incompatible. And to our contemporary amazement - it was all fine! They didn't have the Internet back than and people rarely exchanged files between systems with different locales.
Joel goes on saying:
In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes.
Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.
And this is how the "Windows code pages" were born, eventually. They were actually "parented" by the DOS code pages. And then Unicode was born! :) UTF-8 is "another system for storing your string of Unicode code points", in which "every code point from 0-127 is stored in a single byte", i.e. the same as ASCII. I will not go into any more specifics of Unicode and UTF-8, but you should read up on the BOM, endianness, and character encodings in general.
On "the ANSI conspiracy", Microsoft actually admits the miss-labeling of Windows-1252 in a glossary of terms:
The so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) uses some of those positions for printable characters. Thus, the Windows character set is NOT identical with ISO 8859-1. The Windows character set is often called "ANSI character set", but this is SERIOUSLY MISLEADING. It has NOT been approved by ANSI.
So "ANSI", when referring to Windows character sets, is not ANSI-certified! :)
As Jukka pointed out (credit goes to him for the nice answer):
Windows-1252 differs from ISO Latin 1, also known as ISO-8859-1 as a character encoding, in that the code range 0x80 to 0x9F is reserved for control characters in ISO-8859-1 (the so-called C1 controls), whereas in Windows-1252 some of the codes there are assigned to printable characters (mostly punctuation), and others are left undefined.
However, my personal opinion and technical understanding is that both Windows-1252 and ISO-8859-1 ARE NOT WEB ENCODINGS! :) So:
For web pages, please use UTF-8 as the encoding for the content.
So store data as UTF-8 and "spit it out" with the HTTP header: Content-Type: text/html; charset=utf-8.
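In a PHP app (per the LAMP setup in the question), a minimal sketch would be to set the header before any output:
<?php
// Declare the encoding explicitly; must run before any output is sent.
header('Content-Type: text/html; charset=utf-8');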
There is also a thing called the HTML content-type meta-tag:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Now, what browsers actually do when they encounter this tag is restart parsing from the beginning of the HTML document, so that they can reinterpret it in the declared encoding. This should happen only if there is no 'Content-Type' header.
Use other specific encodings if the users of your system need files generated from it.
For example, some Western users may need Excel-generated files or CSVs in Windows-1252. If this is the case, encode the text in that locale, store it on the filesystem, and serve it as a downloadable file.
There is another thing to be aware of in the design of HTTP:
The content-encoding negotiation mechanism should work like this:
I. The client requests a web page in specific content types and encodings via the 'Accept' and 'Accept-Charset' request headers.
II. Then the server (or web application) returns the content transcoded to that encoding and character set.
This is NOT THE CASE in most modern web apps. What actually happens is that web applications serve (force on the client) content as UTF-8. And this works because browsers interpret received documents based on the response headers and not on what they actually expected.
We should all go Unicode, so please, please, please use UTF-8 to distribute your content wherever possible and applicable. Or else the elders of the Internet will haunt you! :)
P.S.
Some more nice articles on using MS Windows characters in Web Pages can be found here and here.
The most authoritative reference to meanings of character encoding names is the IANA registry Character Sets.
Windows-1252 is commonly known as Windows Latin 1 or as Windows West European or something like that. It differs from ISO Latin 1, also known as ISO-8859-1 as a character encoding, in that the code range 0x80 to 0x9F is reserved for control characters in ISO-8859-1 (the so-called C1 controls), whereas in Windows-1252 some of the codes there are assigned to printable characters (mostly punctuation), and others are left undefined.
ANSI comes here as a misnomer. Microsoft once submitted Windows-1252 to American National Standards Institute (ANSI) to be adopted as a standard; the proposal was rejected, but Microsoft still calls their code “ANSI”. For further confusion, they may use “ANSI” for different encodings (basically, the “native 8-bit encoding” of a Windows installation).
In the web context, declaring ISO-8859-1 will be taken as if you declared Windows-1252. The reason is that C1 Controls are not used, or useful, on the web, whereas the added characters are often used, even on pages mislabelled as ISO-8859-1. So in practical terms it does not matter which one you declare.
There might still be some browsers that actually interpret data as ISO-8859-1 if declared so, but they must be very rare (the last I remember seeing was a version of Opera about ten years ago).
You do not describe what problems you have encountered. The most common cause of problems seems to be that data is actually UTF-8 encoded but declared as ISO-8859-1 (or Windows-1252), or vice versa. This becomes a real problem for web page authors if a server forces a Content-Type header declaring a character encoding and it is one that they cannot deal with in their authoring environment (or don't know how to).
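The classic symptom of that mismatch can be reproduced in a couple of lines (a Python sketch added for illustration, not part of the original answer): UTF-8 bytes reinterpreted as Windows-1252 turn each non-ASCII character into two.
# "é" is 0xC3 0xA9 in UTF-8; read as Windows-1252, those bytes are 'Ã' and '©'.
s = "é"
print(s.encode("utf-8").decode("cp1252"))  # prints "é"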
This table gives an overview about the differences. It shows all characters which are defined in Windows-1252 but not available in ISO-8859-1/ISO-8859-15:
│ …0 │ …1 │ …2 │ …3 │ …4 │ …5 │ …6 │ …7 │ …8 │ …9 │ …A │ …B │ …C │ …D │ …E │ …F │
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
8… │ € │ │ ‚ │ ƒ │ „ │ … │ † │ ‡ │ ˆ │ ‰ │ Š │ ‹ │ Œ │ │ Ž │ │
Unicode │ 20AC │ │ 201A │ 0192 │ 201E │ 2026 │ 2020 │ 2021 │ 02C6 │ 2030 │ 0160 │ 2039 │ 0152 │ │ 017D │ │
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
9… │ │ ‘ │ ’ │ “ │ ” │ • │ – │ — │ ˜ │ ™ │ š │ › │ œ │ │ ž │ Ÿ │
Unicode │ │ 2018 │ 2019 │ 201C │ 201D │ 2022 │ 2013 │ 2014 │ 02DC │ 2122 │ 0161 │ 203A │ 0153 │ │ 017E │ 0178 │
Unlike in Windows-1252, the range 0x80…0x9F is used for control codes in ISO-8859-1.
This table shows the differences between Windows-1252, ISO-8859-1 and ISO-8859-15:
Character │ € │ Š │ š │ Ž │ ž │ Œ │ œ │ Ÿ │ ¤ │ ¦ │ ¨ │ ´ │ ¸ │ ¼ │ ½ │ ¾ │
───────────────────────────────────────────────────────────────────────────────────────────────────────
ISO 8859-1 │ – │ – │ – │ – │ – │ – │ – │ – │ A4 │ A6 │ A8 │ B4 │ B8 │ BC │ BD │ BE │
ISO 8859-15 │ A4 │ A6 │ A8 │ B4 │ B8 │ BC │ BD │ BE │ – │ – │ – │ – │ – │ – │ – │ – │
Windows-1252 │ 80 │ 8A │ 9A │ 8E │ 9E │ 8C │ 9C │ 9F │ A4 │ A6 │ A8 │ B4 │ B8 │ BC │ BD │ BE │
Unicode │ 20AC │ 160 │ 161 │ 17D │ 17E │ 152 │ 153 │ 178 │ A4 │ A6 │ A8 │ B4 │ B8 │ BC │ BD │ BE │
In countries with an English/Latin alphabet (e.g. UK/US/France/Germany and others), "ANSI" refers to the Windows-1252 encoding: https://web.archive.org/web/20170916200715/http://www.microsoft.com:80/resources/msdn/goglobal/default.mspx
Windows-1252 and ISO-8859-1 are very similar: they differ in only 32 characters.
In Windows-1252, the characters from 128 to 159 are used for some useful characters such as the euro symbol.
In ISO-8859-1 these positions are mapped to control characters, which are useless in HTML.
So, a suggestion: check whether code 128 is the euro symbol; if it is, the data is Windows-1252.
The codes from 128 to 159 are not in use in ISO-8859-1, but many browsers will display the characters from the Windows-1252 character set instead of nothing.
These two links list them both:
http://www.w3schools.com/charsets/ref_html_ansi.asp
http://www.w3schools.com/charsets/ref_html_8859.asp
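To see the disagreement directly (a Python sketch added for illustration), decode the same high bytes with both encodings:
# 0x80-0x9F: printable punctuation in Windows-1252, C1 controls in ISO-8859-1.
data = bytes([0x80, 0x93, 0x94, 0x99])
print(data.decode("cp1252"))      # €""™
print(data.decode("iso-8859-1"))  # four invisible C1 control characters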
Some comments were very useful, and I amended my post based on them.
Chenfeng points out:
On Windows, "ANSI" refers to the system code page specified by the locale, whatever that is (Arabic/Chinese/Cyrillic/Vietnamese/...). It does not [necessarily] refer to Windows-1252. You can test this by changing your locale and then using notepad.exe to save a text file as "ANSI". According to this MS documentation, there are 14 different "ANSI" code pages: https://learn.microsoft.com/en-us/windows/desktop/intl/code-page-identifiers
Wernfriend points out:
https://web.archive.org/web/20170916200715/http://www.microsoft.com:80/resources/msdn/goglobal/default.mspx shows that the USA codepage 437 is the 'OEM codepage' (see the OEM column), and the OEM codepage is the one used by the cmd prompt. He also points out, based on that webpage, that in many countries that do not use an English/Latin alphabet, "ANSI" is not Windows-1252. I notice that, for example, Hebrew ANSI uses 1255 (the Hebrew OEM codepage is 862).