This is a small example of a PySpark column (string) in my DataFrame.
column | new_column
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Hoy es día de ABC/KE98789T983456 clase. | 98789
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Como ABC/KE 34562Z845673 todas las mañanas | 34562
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Hoy tiene ABC/KE 110330/L63868 clase de matemáticas, | 110330
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Marcos se ABC 898456/L56784 levanta con sueño. | 898456
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Marcos se ABC898456 levanta con sueño. | 898456
------------------------------------------------------------------------------------------------- |--------------------------------------------------
comienza ABC - KE 60014 -T60058 | 60014
------------------------------------------------------------------------------------------------- |--------------------------------------------------
inglés y FOR 102658/L61144 ciencia. Se viste, desayuna | 102658
------------------------------------------------------------------------------------------------- |--------------------------------------------------
y comienza FOR ABC- 72981 / KE T79581: el camino hacia la | 72981
------------------------------------------------------------------------------------------------- |--------------------------------------------------
escuela. Se FOR ABC 101665 - 103035 - 101926 - 105484 - 103036 - 103247 - encuentra con su | [101665,103035,101926,105484,103036,103247]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
escuela ABCS 206048/206049/206050/206051/205225-FG-matemáticas- | [206048,206049,206050,206051,205225]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
encuentra ABCS 111553/L00847 & 111558/L00895 - matemáticas | [111553, 111558]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ciencia ABC 163278/P20447 AND RETROFIT ABCS 164567/P21000 - 164568/P21001 - desayuna | [163278,164567,164568 ]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ABC/KE 71729/T81672 - 71781/T81674 71782/T81676 71730/T81673 71783/T81677 71784/T | [71729,71781,71782,71730,71783,71784]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ciencia ABC/KE2646/L61175:E/F-levanta con sueño L61/62LAV AT Z5CTR/XC D3-1593 | [2646]
-----------------------------------------------------------------------------------------------------------------------------------------------------
escuela ABCS 6048/206049/6050/206051/205225-FG-matemáticas- MSN 2345 | [6048,206049,6050,206051,205225]
-----------------------------------------------------------------------------------------------------------------------------------------------------
FOR ABC/KE 109038_L35674_DEFINE AND DESIGN IMPROVEMENTS OF 1618 FROM 118(PDS4 BRACKETS) | [109038]
-----------------------------------------------------------------------------------------------------------------------------------------------------
y comienza FOR ABC- 2981 / KE T79581: el camino hacia la 9856 | [2981]
I want to extract all numbers that contain 4, 5 or 6 digits from this text.
Conditions and cases for extracting them:
- Attached to ABC/KE (first line in the example above).
- after ABC/KE + space (second and third line).
- after ABC + space (line 4)
- after ABC without space (line 5)
- after ABC - KE + space
- after the word FOR
- after ABC- + space
- after ABC + space
- after ABCS (lines 10 and 11)
Examples of failed cases:
Column | new_column
------------------------------------------------------------------------------------------------------------------------
FOR ABC/KE 109038_L35674_DEFINE AND DESIGN IMPROVEMENTS OF 1618 FROM 118(PDS4 BRACKETS) | [1618] ==> should be [109038]
------------------------------------------------------------------------------------------------------------------------
ciencia ABC/KE2646/L61175:E/F-levanta con sueño L61/62LAV AT Z5CTR/XC D3-1593 | [1593] ==> should be [2646]
------------------------------------------------------------------------------------------------------------------------
escuela ABCS 6048/206049/6050/206051/205225-FG-matemáticas- MSN 2345 | [6048,206049,6050,206051,205225, 2345] ==> should be [6048,206049,6050,206051,205225]
I hope I have summarized the cases well; you can see my example above and the expected output.
How can I do it?
Thank you
One way is to use regexes to clean up the data and set up a single anchor with the value ABC to identify the start of a potential match. After str.split(), iterate through the resulting array to flag and retrieve consecutive matching numbers that follow this anchor.
Edit: Added an underscore _ to the data pattern (\b(\d{4,6})(?=[A-Z/_]|$)) so that an underscore is now allowed to follow the matched 4-6 digit substring; this fixes the first failed case. Failed cases 2 and 3 should already work with the existing regex patterns.
import re
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf
(1) Use regex patterns to clean up the raw data so that we have only one anchor, ABC, to identify the start of a potential match:
clean1: use [-&\s]+ to convert '&', '-' and whitespace into a single SPACE ' '; these characters are used to connect a chain of numbers
example: `ABC - KE` --> `ABC KE`
`103035 - 101926 - 105484` --> `103035 101926 105484`
`111553/L00847 & 111558/L00895` --> `111553/L00847 111558/L00895`
clean2: convert text matching the following sub-patterns into 'ABC '
+ ABCS?(?:[/\s]+KE|(?=\s*\d))
+ ABC followed by an optional `S`
+ followed by at least one slash or whitespace character and then `KE` --> `[/\s]+KE`
example: `ABC/KE 110330/L63868` --> `ABC 110330/L63868`
+ or followed by optional whitespace and then at least one digit --> (?=\s*\d)
example: `ABC898456` --> `ABC 898456`
+ \bFOR\s+(?:[A-Z]+\s+)*
+ `FOR` followed by zero or more uppercase words
example: `FOR DEF HJK 12345` --> `ABC 12345`
data: \b(\d{4,6})(?=[A-Z/_]|$) is the regex to match the actual numbers: 4-6 digits followed by [A-Z/_] or end of string
(2) Create a dict to save all 3 patterns:
ptns = {
'clean1': re.compile(r'[-&\s]+', re.UNICODE)
, 'clean2': re.compile(r'\bABCS?(?:[/\s-]+KE|(?=\s*\d))|\bFOR\s+(?:[A-Z]+\s+)*', re.UNICODE)
, 'data' : re.compile(r'\b(\d{4,6})(?=[A-Z/_]|$)', re.UNICODE)
}
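As a quick illustration (a plain-Python sketch, outside Spark, using the 'comienza ABC - KE 60014 -T60058' row from the sample above), the two cleaning steps reduce everything to a single ABC anchor, and the data pattern then only matches the token that follows it:
s = 'comienza ABC - KE 60014 -T60058'.upper()
s1 = re.sub(ptns['clean1'], ' ', s)        # 'COMIENZA ABC KE 60014 T60058'
s2 = re.sub(ptns['clean2'], 'ABC ', s1)    # 'COMIENZA ABC  60014 T60058'
# after str.split(), only the token right after the ABC anchor yields a match for the data pattern
print([re.findall(ptns['data'], t) for t in s2.split()])
# [[], [], ['60014'], []]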
(3) Create a function to find matched numbers and save them into an array
def find_number(s_t_r, ptns, is_debug=0):
    try:
        arr = re.sub(ptns['clean2'], 'ABC ', re.sub(ptns['clean1'], ' ', s_t_r.upper())).split()
        if is_debug: return arr
        # f: flag to identify if a chain of matches is started, default is 0 (false)
        f = 0
        new_arr = []
        # iterate through the above arr and start checking numbers when the anchor is detected and f=1
        for x in arr:
            if x == 'ABC':
                f = 1
            elif f:
                new = re.findall(ptns['data'], x)
                # if there are any matches, keep them, else reset the flag
                if new:
                    new_arr.extend(new)
                else:
                    f = 0
        return new_arr
    except Exception as e:
        # only use print in local debugging
        print('ERROR:{}:\n    [{}]\n'.format(s_t_r, e))
        return []
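For a quick sanity check in plain Python (outside Spark), you could call the function on the first sample row; the expected value comes from the table above:
print(find_number('Hoy es día de ABC/KE98789T983456 clase.', ptns))
# ['98789']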
(4) Define the udf function:
udf_find_number = udf(lambda x: find_number(x, ptns), ArrayType(StringType()))
(5) Get the new_column:
df.withColumn('new_column', udf_find_number('column')).show(truncate=False)
+------------------------------------------------------------------------------------------+------------------------------------------------+
|column |new_column |
+------------------------------------------------------------------------------------------+------------------------------------------------+
|Hoy es día de ABC/KE98789T983456 clase.                                                   |[98789]                                         |
|Como ABC/KE 34562Z845673 todas las mañanas                                                |[34562]                                         |
|Hoy tiene ABC/KE 110330/L63868 clase de matemáticas,                                      |[110330]                                        |
|Marcos se ABC 898456/L56784 levanta con sueño.                                            |[898456]                                        |
|Marcos se ABC898456 levanta con sueño.                                                    |[898456]                                        |
|comienza ABC - KE 60014 -T60058 |[60014] |
|inglés y FOR 102658/L61144 ciencia. Se viste, desayuna                                    |[102658]                                        |
|y comienza FOR ABC- 72981 / KE T79581: el camino hacia la |[72981] |
|escuela. Se FOR ABC 101665 - 103035 - 101926 - 105484 - 103036 - 103247 - encuentra con su|[101665, 103035, 101926, 105484, 103036, 103247]|
|escuela ABCS 206048/206049/206050/206051/205225-FG-matemáticas-                           |[206048, 206049, 206050, 206051, 205225]        |
|encuentra ABCS 111553/L00847 & 111558/L00895 - matemáticas                                |[111553, 111558]                                |
|ciencia ABC 163278/P20447 AND RETROFIT ABCS 164567/P21000 - 164568/P21001 - desayuna |[163278, 164567, 164568] |
|ABC/KE 71729/T81672 - 71781/T81674 71782/T81676 71730/T81673 71783/T81677 71784/T |[71729, 71781, 71782, 71730, 71783, 71784] |
+------------------------------------------------------------------------------------------+------------------------------------------------+
(6) Code for debugging: use find_number(row.column, ptns, 1) to check whether the first two regex patterns work as expected:
for row in df.limit(10).collect():
    print('{}:\n    {}\n'.format(row.column, find_number(row.column, ptns, 1)))
Some Notes:
In the clean2 pattern, ABCS and ABC are treated the same way. If they need to be handled differently, just remove the 'S' and add a new alternative ABCS\s*(?=\d) at the end of the pattern:
re.compile(r'\bABC(?:[/\s-]+KE|(?=\s*\d))|\bFOR\s+(?:[A-Z]+\s+)*|ABCS\s*(?=\d)')
The current clean1 pattern only treats '-', '&' and whitespace as connectors of consecutive numbers; you might add more characters or words such as 'AND' or 'OR', for example:
re.compile(r'[-&\s]+|\b(?:AND|OR)\b')
The FOR-words pattern is \bFOR\s+(?:[A-Z]+\s+)*; this might need adjusting based on whether numbers are allowed inside the words, etc.
This was tested on Python 3. With Python 2 there might be Unicode issues, which you can fix by using the method in the first answer of the reference.
`Reservation_branch_code | ON_ACCOUNT | Rcount
:-------------------------------------------------:
0 1101 | 170 | 5484
1 1103 | 101 | 5111
2 1118 | 1 | 232
3 1121 | 0 | 27
4 1126 | 90 | 191`
I would like the chart sorted by "Rcount", with "Reservation_branch_code" on the x axis.
The code below gives me the chart without sorting by Rcount:
base = alt.Chart(df1).transform_fold(
    ['Rcount', 'ON_ACCOUNT'],
    as_=['column', 'value']
)
bars = base.mark_bar().encode(
    # x='Reservation_branch_code:N',
    x='Reservation_branch_code:O',
    y=alt.Y('value:Q', stack=None),  # stack=None enables layered bars
    color=alt.Color('column:N', scale=alt.Scale(range=["#f50520", "#bab6b7"])),
    tooltip=alt.Tooltip(['ON_ACCOUNT', 'Rcount']),
    # order=alt.Order('value:Q')
)
text = base.mark_text(
    align='center',
    color='black',
    baseline='middle',
    dx=0, dy=-8,  # nudges the text upward so it doesn't sit on top of the bar
).encode(
    x='Reservation_branch_code:N',
    y='value:Q',
    text=alt.Text('value:Q', format='.0f')
)
rule = alt.Chart(df1).mark_rule(color='blue').encode(
    y='mean(Rcount):Q'
)
(bars + text + rule).properties(width=790, height=330)
I sorted the data in the DataFrame and used that df in the Altair chart, but the x axis is still not sorted by the Rcount column. Thanks.
You can pass a list with the sort order:
import altair as alt
from vega_datasets import data

source = data.barley()

alt.Chart(source).mark_bar().encode(
    x='sum(yield):Q',
    y=alt.Y('site:N', sort=source['site'].unique().tolist())
)
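Applied to the chart in the question (a sketch assuming the df1 DataFrame defined there), you could derive the order from Rcount and pass it to the x encoding:
import altair as alt

# df1 is assumed to be the pandas DataFrame from the question
order = df1.sort_values('Rcount', ascending=False)['Reservation_branch_code'].tolist()

alt.Chart(df1).transform_fold(
    ['Rcount', 'ON_ACCOUNT'],
    as_=['column', 'value']
).mark_bar().encode(
    x=alt.X('Reservation_branch_code:O', sort=order),
    y=alt.Y('value:Q', stack=None),
    color='column:N'
)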
For the expression parse "x[y]", I think it means a function/map/list application of x on the argument y, possibly a projection. Please correct me if I'm wrong.
Now, entering it into the q console, I get the output below.
q)parse"x[y]"
x
y
q)l:parse"x[y]"
q)count l
2
q)type l
0h
q)l[0]
`x
q)l[1]
`y
q)type l[0]
-11h
q)type l[1]
-11h
We see that l has type 0h. It has length 2. Both elements have type -11h. Why, then, is the list not of type 11h?
In short, it's the 'type' signature of a parse-tree construct, very similar to a lambda (which would show the type byte 0x64, i.e. 100h). k recognises and evaluates the expression in the parse tree when the object reaches the call stack, dereferencing the symbols, which act like pointers.
TL;DR - Step by step breakdown
Basic evaluation constructs
"x[y]" ~ "x y"
q)x:count
q)y: 1 2 3
q)parse "x y"
x
y
q)x y
3
q)`x y
3
q)eval parse "x y"
3
Simple indexing, 'simplified' 2nd param -> int atom:
q)parse "y 2" //this nicely shows parse tree
`y
2
q)eval parse "y 2"
3
q)eval (`y;2)
3
q)`y 2
3
q)y 2
3
Both examples above are mixed lists (0h), with the actual elements following immediately and the symbols acting as references; this is unlike the normal mixed-list construct, where all lengths and types are defined later.
Looking at the serialised IPC frame helps to understand this construct.
Establish what expressions are the same
q)parse["x[y]"] ~ parse "x y"
1b
q)parse["x[y]"] ~ parse "x#y"
0b
q)parse["x[y]"] ~ parse "{x y}"
0b
Compare byte frames
q)-8!parse "x[y]"
0x0100000014000000000002000000f57800f57900
q)-8!parse "x y"
0x0100000014000000000002000000f57800f57900
q)-8!`x`y
0x01000000120000000b000200000078007900
q)-8!enlist `x`y
0x01000000180000000000010000000b000200000078007900
Compare parse tree output from 1st paragraph with actual obj instead of reference
q)-8!parse "x y"
0x0100000014000000000002000000f57800f57900
q)-8!parse "x 2"
0x010000001a000000000002000000f57800f90200000000000000
Lambda structure to complete the list for reference
q)-8!{x y}
0x010000001500000064000a00050000007b7820797d
/0x7b "{" ; 0x7d "}"
And finally the IPC frame breakdown
`x`y (symbol list)
q)symlist:`endian`isync`x`y`length`type`attr`msg!sums[0 1 1 1 1 4 1 1] _ -8!`x`y
q)symlist
endian| ,0x01 (little endian)
isync | ,0x00
x | ,0x00
y | ,0x00
length| 0x12000000
type | ,0x0b (11h - symbol list - 1st difference)
attr | ,0x00
msg | 0x0200000078007900
dig into the symlist:
q)msg:`length`msg!sums[0 4] _ symlist`msg
q)msg
length| 0x02000000
msg | 0x78007900
Here we can clearly see how the null-terminated symbols immediately follow the vector length, due to the explicit type 0x0b (11h, symbol list):
q)`char$msg`msg
"x\000y\000"
parse "x[y]" - parse tree breakdown
q)o1:`endian`isync`x`y`length`type`attr`msg!sums[0 1 1 1 1 4 1 1]_ -8!parse"x[y]"
q)o1
endian| ,0x01
isync | ,0x00
x | ,0x00
y | ,0x00
length| 0x14000000
type | ,0x00 (could expect 64 - 100h lambda)
attr | ,0x00
msg | 0x02000000f57800f57900
Difference from simple list:
q)`length`type1`msg1`type2`msg2!sums[0 4 1 2 1] _ o1`msg
length| 0x02000000
type1 | ,0xf5 (-11h - symbol atom) - acts as a reference
msg1 | 0x7800 null terminated "x"
type2 | ,0xf5 (-11h - symbol atom)
msg2 | 0x7900 null terminated "y" - acts as a reference
I want to conduct a simple two sample t-test in R to compare marginal effects that are generated by ggpredict (or ggeffect).
Both ggpredict and ggeffect provide nice output: (1) a table (predicted probability / std. error / CIs) and (2) a plot. However, they do not provide p-values for assessing the statistical significance of the marginal effects (i.e., is the difference between the two predicted probabilities different from zero?). Further, since I'm working with interaction effects, I'm also interested in two-sample t-tests for the first differences (between two marginal effects) and the second differences.
Is there an easy way to run the relevant t-tests with the ggpredict/ggeffect output? Other options?
Attaching:
. reprex code with fictitious data
. To be specific, I want to test the following "first differences":
--> .67 - .33 = .34 (different from zero?)
--> .5 - .5 = 0 (different from zero?)
...and the following second difference:
--> 0.0 - .34 = -.34 (different from zero?)
See also Figure 12 / Table 3 in Mize 2019 (interaction effects in nonlinear models).
Thanks, Scott
library(mlogit)
#> Loading required package: dfidx
#>
#> Attaching package: 'dfidx'
#> The following object is masked from 'package:stats':
#>
#> filter
library(sjPlot)
library(ggeffects)
# create ex. data set. 1 row per respondent (dataset shows 2 resp). Each resp answers 3 choice sets, w/ 2 alternatives in each set.
cedata.1 <- data.frame( id = c(1,1,1,1,1,1,2,2,2,2,2,2), # respondent ID.
QES = c(1,1,2,2,3,3,1,1,2,2,3,3), # Choice set (with 2 alternatives)
Alt = c(1,2,1,2,1,2,1,2,1,2,1,2), # Alt 1 or Alt 2 in choice set
LOC = c(0,0,1,1,0,1,0,1,1,0,0,1), # attribute describing alternative. binary categorical variable
SIZE = c(1,1,1,0,0,1,0,0,1,1,0,1), # attribute describing alternative. binary categorical variable
Choice = c(0,1,1,0,1,0,0,1,0,1,0,1), # if alternative is Chosen (1) or not (0)
gender = c(1,1,1,1,1,1,0,0,0,0,0,0) # male or female (repeats for each individual)
)
# convert dep var Choice to factor as required by sjPlot
cedata.1$Choice <- as.factor(cedata.1$Choice)
cedata.1$LOC <- as.factor(cedata.1$LOC)
cedata.1$SIZE <- as.factor(cedata.1$SIZE)
# estimate model.
glm.model <- glm(Choice ~ LOC*SIZE, data=cedata.1, family = binomial(link = "logit"))
# estimate MEs for use in IE assessment
LOC.SIZE <- ggpredict(glm.model, terms = c("LOC", "SIZE"))
LOC.SIZE
#>
#> # Predicted probabilities of Choice
#> # x = LOC
#>
#> # SIZE = 0
#>
#> x | Predicted | SE | 95% CI
#> -----------------------------------
#> 0 | 0.33 | 1.22 | [0.04, 0.85]
#> 1 | 0.50 | 1.41 | [0.06, 0.94]
#>
#> # SIZE = 1
#>
#> x | Predicted | SE | 95% CI
#> -----------------------------------
#> 0 | 0.67 | 1.22 | [0.15, 0.96]
#> 1 | 0.50 | 1.00 | [0.12, 0.88]
#> Standard errors are on the link-scale (untransformed).
# plot
# plot(LOC.SIZE, connect.lines = TRUE)
ClassNN= TDBEditArpa16
I want AutoHotKey to get this value as a number and save it to an integer variable. How can I do this?
Note: I tried to do it with the following code, but the program cannot interpret the value as a number.
ControlGetText, qtp1, TDBEditArpa16, Alteração de Produtos, Informações de Custo
StringTrimRight, qtp1, qtp1, 4
qtp1 = (%qtp1% + 2)
msgbox %qtp1%
Storing the result of an expression:
To assign a result to a variable, use the := operator.
ControlGetText, qtp1, TDBEditArpa16, Alteração de Produtos, Informações de Custo
StringTrimRight, qtp1, qtp1, 4
qtp1 := (qtp1 + 2)
msgbox %qtp1%
Variable names in an expression are not enclosed in percent signs
Can someone please explain the third line in the sample diff output below (i.e., the one that starts with @@)? I understand the changes represented by the remaining lines but am having trouble making sense of that third line...
--- a/code/c-skeleton/Makefile
+++ b/code/c-skeleton/Makefile
@@ -9,9 +9,10 @@
TEST_SRC=$(wildcard tests/*_tests.c)
TESTS=$(patsubst %.c,%,$(TEST_SRC))
@@ -9,9 +9,10 @@
...specifies where in the source and destination files changes take place, by line number and size of the chunk being edited, both before and after the changes.
Specifically:
@@ -9,9 +9,10 @@
^  ^^ ^ ^^ ^
|  || | || \----- The "10" is the number of lines in the hunk after being
|  || | ||        modified; this patch, then, must add a line, since the
|  || | ||        new count (of 10) is longer than the old count (of 9).
|  || | |\------- This "9" is the line number in the new file where the
|  || | |         modified hunk is placed.
|  || | \-------- This "+" is a hint that the following numbers refer to
|  || |           the new file, after modification.
|  || \---------- This "9" is the number of lines in the hunk before being
|  ||             modified.
|  |\------------ This "9" is the line number in the original file.
|  \------------- This "-" is a hint that the following numbers refer to the
|                 original file.
\---------------- This "@@" is a marker indicating that this is the start of a
                  new hunk.
That is to say: in the original file, the hunk being modified consists of 9 lines starting at line 9; in the destination file, it's 10 lines starting at line 9.
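For example, here is a minimal illustrative hunk (not taken from the Makefile above): the header says the old version of this region was 4 lines starting at line 3, while the new version is 5 lines starting at line 3, so exactly one line was added.
@@ -3,4 +3,5 @@
 line three
 line four
+a newly added line
 line five
 line six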
See the detailed description of unified diff format in the GNU diffutils documentation.