Substring function to extract part of the string - substring

data = {'desc': ['ADRIAN PETER - ANN 80020355787C - 11 Baillon Pass.pdf', 'AILEEN MARCUS - ANC 800E15432922 - 5 Mandarin Way.pdf',
'AJITH SINGH - ANN 80020837750 - 11 Berkeley Loop.pdf', 'ALEX MARTIN-CURTIS - ANC 80021710355 - 26 Dovedale St.pdf',
'Alice.Smith\Jodee - Karen - ANE 80020428377 - 58 Harrisdale Dr.pdf']}
df = pd.DataFrame(data, columns = ['desc'])
df
From the data frame, I want to create a new column called ID, and in that ID, I want to have only those values starting after ANN, ANC or ANE. So I am expecting a result as below.
ID
80020355787C
800E15432922
80020837750
80021710355
80020428377
I tried running the code below, but it did not get the desired result. Appreciate your help on this.
df['id'] = df['desc'].str.extract(r'\-([^|]+)\-')

You can use - AN[NCE] (800[0-9A-Z]+) -, where:
AN[NCE] matches literally AN followed by N or C or E;
800[0-9A-Z]+ matches literally 800 followed by one or more characters between 0 and 9 or between A and Z.
>>> df['desc'].str.extract(r'- AN[NCE] (800[0-9A-Z]+) -')
0
0 80020355787C
1 800E15432922
2 80020837750
3 80021710355
4 80020428377
If not all your ids start with "800", you can just remove it from the pattern.

Related

An issue with argument "sortv" of function seqIplot()

I'm trying to plot individual sequences by means of function seqIplot() in TraMineR. These individual sequences represent work trajectories, completed by former school's graduates via a WEB questionnaire.
Using argument "sortv", I'd like to sort my sequences according to the order of the levels of one covariate, the year of graduation, named "PROMO".
"PROMO" is a factor variable contained in a data frame named "covariates.seq", gathering covariates together:
str(covariates.seq)
'data.frame': 733 obs. of 6 variables:
$ ID_SQ : Factor w/ 733 levels "1","2","3","5",..: 1 2 3 4 5 6
7 8 9 10 ...
$ SEXE : Factor w/ 2 levels "Féminin","Masculin": 1 1 1 1 2 1
1 2 2 1 ...
$ PROMO : Factor w/ 6 levels "1997","1998",..: 1 2 2 4 4 3 2 2
2 2 ...
$ DEPARTEMENT : Factor w/ 10 levels "BC","GCU","GE",..: 1 4 7 8 7 9
9 7 7 4 ...
$ NIVEAU_ADMISSION: Factor w/ 2 levels "En Premier Cycle",..: NA 1 1 1 1
1 NA 1 1 1 ...
$ FILIERE_SECTION : Factor w/ 4 levels "Cursus Classique",..: NA 4 2 NA
1 1 NA NA 4 3 ..
I'm also using "SEXE", the graduates' gender, as a grouping variable. To plot the individual sequences so, my command is as follows:
seqIplot(sequences, group = covariates.seq$SEXE,
sortv = covariates.seq$PROMO,
cex.axis = 0.7, cex.legend = 0.7)
I expected that, by using a process time axis (with the year of graduation as sequence-dependent origin), sorting the sequences according to the order of the levels of "PROMO" would give a plot with groups of sequences from the longest (for the older graduates) to the shortest (for the younger graduates).
But I've got an issue: in the output plot, the sequences don't appear to be correctly sorted according to the levels of "PROMO". Indeed, by using "sortv = covariates.seq$PROMO" as in the command above, the plot doesn't show groups of sequences from the longest to the shortest, as expected. It looks like the plot obtained without using the argument "sortv" (see Figures below).
Without using argument "sortv"
Using "sortv = covariates.seq$PROMO"
Note that I have 733 individual sequences in my object "sequences", created as follows:
labs <- c("En poste","Au chômage (d'au moins 6 mois)", "Autre situation
(d'au moins 6 mois)","En poursuite d'études (thèse ou hors
thèse)", "En reprise d'études / formation (d'au moins 6 mois)")
codes <- c("En poste", "Au chômage", "Autre situation", "En poursuite
d'études", "En reprise d'études / formation")
sequences <- seqdef(situations, alphabet = labs, states = codes, left =
NA, right = "DEL", missing = NA,
cnames = as.character(seq(0,7400/365,1/365)),
xtstep = 365)
The values of the covariates are sorted in the same order as the individual sequences. The covariate "PROMO" doesn't contain any missing value.
Something's going wrong, but what?
Thank you in advance for your help,
Best,
Arnaud.
Using a factor as sortv argument in seqIplot works fine as illustrated by the example below:
sdc <- c("aabbccdd","bbbccc","aaaddd","abcabcab")
sd <- seqdecomp(sdc, sep="")
seq <- seqdef(sd)
fac <- factor(c("2000","2001","2001","2000"))
par(mfrow=c(1,3))
seqIplot(seq, with.legend=FALSE)
seqIplot(seq, sortv=fac, with.legend=FALSE)
seqlegend(seq)

Text file processing in Matlab

I have a text output from a program with a set format. I need to parse ~200 of them to extract an information. I tried in MATLAB with 'textscan' but did not work. Following is the input:
MOTIFS SUMMARY:
1) TTATAGCCGC (GCGGCTATAA) 1.986
2) AAACCGCCTC (GAGGCGGTTT) 1.865
DETAILED RESULTS:
1) TTATAGCCGC (GCGGCTATAA) 1.986
Matrix: MAT1 TTATAGCCGC
A 0.1249 0.177 0.7364 0.1189 0.7072 0.1149 0.09858 0.1096
C 0.0899 0.07379 0.1136 0.1298 0.08662 0.1293 0.7528 0.721
G 0.06828 0.1284 0.07195 0.1031 0.1352 0.6708 0.05556 0.0713
T 0.7169 0.6209 0.07802 0.6482 0.07096 0.08492 0.09305 0.09804
OCCURRENCES:
>GENE_1 1 TTATAGCCGC 1 561 +
>GENE_2 24 TAATAGCCGC 0.928699 762 -
>GENE_3 10 ATATAGCCGC 0.904905 185 -
>GENE_1 7 TTATAGCAGC 0.901785 726 +
**********
2) AAACCGCCTC (GAGGCGGTTT) 1.865
Matrix: MAT2 AAACCGCCTC
A 0.653 0.7401 0.7763 0.1323 0.09619 0.09134 0.07033 0.1383
C 0.1163 0.07075 0.09441 0.749 0.6347 0.1132 0.6559 0.6982
G 0.09136 0.09402 0.07385 0.04209 0.1799 0.7332 0.1241 0.07568
T 0.1393 0.09518 0.05541 0.07659 0.08921 0.06234 0.1497 0.08786
OCCURRENCES:
>GENE_1 21 AAACCGCCTC 1 963 +
>GENE_2 14 AAACGGCCTC 0.928198 212 +
>GENE_2 8 AAACCGTCTC 0.92009 170 +
>GENE_4 3 TAACCGCCTC 0.918883 370 +
**********
I am trying to count the unique() occurrence under each motif and add it to the MOTIF SUMMARY and a final average of them. My expected output is:
MOTIFS SUMMARY:
1) TTATAGCCGC (GCGGCTATAA) 1.986 3
2) AAACCGCCTC (GAGGCGGTTT) 1.865 3
AVERAGE OCCURRENCE: 3
For motif 1, unique occurrence is 3 (GENE_1, GENE_2, GENE_3). Similarly for motif 2, it is again 3 (GENE_1, GENE_2, GENE_4)
How can I use OCCURRENCES and ****** as blocks ? so that, I can regexp GENE_x to store it and count.
Kindly help.
Thanks,
AP
You better try to change the original text file so that it will be legal matlab m file code, then just use 'eval' function to run it .
Most of the job will be to find where to insert '=' and '[' ']' and '%' for ignore parts.
If all files are identical in format than it will be easy.

How to comment on a specific line number on a PR on github

I am trying to write a small script that can comment on github PRs using eslint output.
The problem is eslint gives me the absolute line numbers for each error.
But github API wants the line number relative to the diff.
From the github API docs: https://developer.github.com/v3/pulls/comments/#create-a-comment
To comment on a specific line in a file, you will need to first
determine the position in the diff. GitHub offers a
application/vnd.github.v3.diff media type which you can use in a
preceding request to view the pull request's diff. The diff needs to
be interpreted to translate from the line in the file to a position in
the diff. The position value is the number of lines down from the
first "##" hunk header in the file you would like to comment on.
The line just below the "##" line is position 1, the next line is
position 2, and so on. The position in the file's diff continues to
increase through lines of whitespace and additional hunks until a new
file is reached.
So if I want to add a comment on new line number 5 in the above image, then I would need to pass 12 to the API
My question is how can I easily map between the new line numbers which the eslint will give in it's error messages to the relative line numbers required by the github API
What I have tried so far
I am using parse-diff to convert the diff provided by github API into json object
[{
"chunks": [{
"content": "## -,OLD_TOTAL_LINES +NEW_STARTING_LINE_NUMBER,NEW_TOTAL_LINES ##",
"changes": [
{
"type": STRING("normal"|"add"|"del"),
"normal": BOOLEAN,
"add": BOOLEAN,
"del": BOOLEAN,
"ln1": OLD_LINE_NUMBER,
"ln2": NEW_LINE_NUMBER,
"content": STRING,
"oldStart": NUMBER,
"oldLines": NUMBER,
"newStart": NUMBER,
"newLines": NUMBER
}
}]
}]
I am thinking of the following algorithm
make an array of new line numbers starting from NEW_STARTING_LINE_NUMBER to
NEW_STARTING_LINE_NUMBER+NEW_TOTAL_LINESfor each file
subtract newStart from each number and make it another array relativeLineNumbers
traverse through the array and for each deleted line (type==='del') increment the corresponding remaining relativeLineNumbers
for another hunk (line having ##) decrement the corresponding remaining relativeLineNumbers
I have found a solution. I didn't put it here because it involves simple looping and nothing special. But anyway answering now to help others.
I have opened a pull request to create the similar situation as shown in question
https://github.com/harryi3t/5134/pull/7/files
Using the Github API one can get the diff data.
diff --git a/test.js b/test.js
index 2aa9a08..066fc99 100644
--- a/test.js
+++ b/test.js
## -2,14 +2,7 ##
var hello = require('./hello.js');
-var names = [
- 'harry',
- 'barry',
- 'garry',
- 'harry',
- 'barry',
- 'marry',
-];
+var names = ['harry', 'barry', 'garry', 'harry', 'barry', 'marry'];
var names2 = [
'harry',
## -23,9 +16,7 ## var names2 = [
// after this line new chunk will be created
var names3 = [
'harry',
- 'barry',
- 'garry',
'harry',
'barry',
- 'marry',
+ 'marry', 'garry',
];
Now just pass this data to diff-parse module and do the computation.
var parsedFiles = parseDiff(data); // diff output
parsedFiles.forEach(
function (file) {
var relativeLine = 0;
file.chunks.forEach(
function (chunk, index) {
if (index !== 0) // relative line number should increment for each chunk
relativeLine++; // except the first one (see rel-line 16 in the image)
chunk.changes.forEach(
function (change) {
relativeLine++;
console.log(
change.type,
change.ln1 ? change.ln1 : '-',
change.ln2 ? change.ln2 : '-',
change.ln ? change.ln : '-',
relativeLine
);
}
);
}
);
}
);
This would print
type (ln1) old line (ln2) new line (ln) added/deleted line relative line
normal 2 2 - 1
normal 3 3 - 2
normal 4 4 - 3
del - - 5 4
del - - 6 5
del - - 7 6
del - - 8 7
del - - 9 8
del - - 10 9
del - - 11 10
del - - 12 11
add - - 5 12
normal 13 6 - 13
normal 14 7 - 14
normal 15 8 - 15
normal 23 16 - 17
normal 24 17 - 18
normal 25 18 - 19
del - - 26 20
del - - 27 21
normal 28 19 - 22
normal 29 20 - 23
del - - 30 24
add - - 21 25
normal 31 22 - 26
Now you can use the relative line number to post a comment using github api.
For my purpose I only needed the relative line numbers for the newly added lines, but using the table above one can get it for deleted lines also.
Here's the link for the linting project in which I used this. https://github.com/harryi3t/lint-github-pr

Matlab - read unstructured file

I'm quite new with Matlab and I've been searching, unsucessfully, for the following issue: I have an unstructure txt file, with several rows I don't need, but there are a number of rows inside that file that have an structured format. I've been researching how to "load" the file to edit it, but cannot find anything.
Since i don't know if I was clear, let me show you the content in the file:
8782 PROJCS["UTM-39",GEOGC.......
1 676135.67755473056 2673731.9365976951 -15 0
2 663999.99999999302 2717629.9999999981 -14.00231124135486 3
3 709999.99999999162 2707679.2185399458 -10 2
4 679972.20003752434 2674637.5679516452 0.070000000000000007 1
5 676124.87132483651 2674327.3183533219 -18.94794942571912 0
6 682614.20527054626 2671000.0000000549 -1.6383425512446661 0
...........
8780 682247.4593014461 2676571.1515358146 0.1541080392180566 0
8781 695426.98657108378 2698111.6168302582 -8.5039945992245904 0
8782 674723.80100125563 2675133.5486935056 -19.920312922947179 0
16997 3 21
1 2147 658 590
2 1855 2529 5623
.........
I'd appreciate if someone can just tell me if there is the possibility to open the file to later load only the rows starting with 1 to the one starting with 8782. First row and all the others are not important.
I know than manually copy and paste to a new file would be a solution, but I'd like to know about the possibility to read the file and edit it for other ideas I have.
Thanks!
% Now lines{i} is the string of the i'th line.
lines = strsplit(fileread('filename'), '\n')
% Now elements{i}{j} is the j'th field of the i'th line.
elements = arrayfun(#(x){strsplit(x{1}, ' ')}, lines)
% Remove the first row:
elements(1) = []
% Take the first several rows:
n_rows = 8782
elements = elements(1:n_rows)
Or if the number of rows you need to take is not fixed, you can replace the last two statements above by:
firsts = arrayfun(#(x)str2num(x{1}{1}), elements)
n_rows = find((firsts(2:end) - firsts(1:end-1)) ~= 1, 1, 'first')
elements = elements(1:n_rows)

How to calculate a Mod b in Casio fx-991ES calculator

Does anyone know how to calculate a Mod b in Casio fx-991ES Calculator. Thanks
This calculator does not have any modulo function. However there is quite simple way how to compute modulo using display mode ab/c (instead of traditional d/c).
How to switch display mode to ab/c:
Go to settings (Shift + Mode).
Press arrow down (to view more settings).
Select ab/c (number 1).
Now do your calculation (in comp mode), like 50 / 3 and you will see 16 2/3, thus, mod is 2. Or try 54 / 7 which is 7 5/7 (mod is 5).
If you don't see any fraction then the mod is 0 like 50 / 5 = 10 (mod is 0).
The remainder fraction is shown in reduced form, so 60 / 8 will result in 7 1/2. Remainder is 1/2 which is 4/8 so mod is 4.
EDIT:
As #lawal correctly pointed out, this method is a little bit tricky for negative numbers because the sign of the result would be negative.
For example -121 / 26 = -4 17/26, thus, mod is -17 which is +9 in mod 26. Alternatively you can add the modulo base to the computation for negative numbers: -121 / 26 + 26 = 21 9/26 (mod is 9).
EDIT2: As #simpatico pointed out, this method will not work for numbers that are out of calculator's precision. If you want to compute say 200^5 mod 391 then some tricks from algebra are needed. For example, using rule
(A * B) mod C = ((A mod C) * B) mod C we can write:
200^5 mod 391 = (200^3 * 200^2) mod 391 = ((200^3 mod 391) * 200^2) mod 391 = 98
As far as I know, that calculator does not offer mod functions.
You can however computer it by hand in a fairly straightforward manner.
Ex.
(1)50 mod 3
(2)50/3 = 16.66666667
(3)16.66666667 - 16 = 0.66666667
(4)0.66666667 * 3 = 2
Therefore 50 mod 3 = 2
Things to Note:
On line 3, we got the "minus 16" by looking at the result from line (2) and ignoring everything after the decimal. The 3 in line (4) is the same 3 from line (1).
Hope that Helped.
Edit
As a result of some trials you may get x.99991 which you will then round up to the number x+1.
You need 10 ÷R 3 = 1
This will display both the reminder and the quoitent
÷R
There is a switch a^b/c
If you want to calculate
491 mod 12
then enter 491 press a^b/c then enter 12. Then you will get 40, 11, 12. Here the middle one will be the answer that is 11.
Similarly if you want to calculate 41 mod 12 then find 41 a^b/c 12. You will get 3, 5, 12 and the answer is 5 (the middle one). The mod is always the middle value.
You can calculate A mod B (for positive numbers) using this:
Pol( -Rec( 1/2πr , 2πr × A/B ) , Y ) ( πr - Y ) B
Then press [CALC], and enter your values for A and B, and any value for Y.
/ indicates using the fraction key, and r means radians ( [SHIFT] [Ans] [2] )
type normal division first and then type shift + S->d
Here's how I usually do it. For example, to calculate 1717 mod 2:
Take 1717 / 2. The answer is 858.5
Now take 858 and multiply it by the mod (2) to get 1716
Finally, subtract the original number (1717) minus the number you got from the previous step (1716) -- 1717-1716=1.
So 1717 mod 2 is 1.
To sum this up all you have to do is multiply the numbers before the decimal point with the mod then subtract it from the original number.
Note: Math error means a mod m = 0
It all falls back to the definition of modulus: It is the remainder, for example, 7 mod 3 = 1.
This because 7 = 3(2) + 1, in which 1 is the remainder.
To do this process on a simple calculator do the following:
Take the dividend (7) and divide by the divisor (3), note the answer and discard all the decimals -> example 7/3 = 2.3333333, only worry about the 2. Now multiply this number by the divisor (3) and subtract the resulting number from the original dividend.
so 2*3 = 6, and 7 - 6 = 1, thus 1 is 7mod3
Calculate x/y (your actual numbers here), and press a b/c key, which is 3rd one below Shift key.
Simply just divide the numbers, it gives yuh the decimal format and even the numerical format. using S<->D
For example: 11/3 gives you 3.666667 and 3 2/3 (Swap using S<->D).
Here the '2' from 2/3 is your mod value.
Similarly 18/6 gives you 14.833333 and 14 5/6 (Swap using S<->D).
Here the '5' from 5/6 is your mod value.