assign value in spark dataframe - pyspark

I have a Spark DataFrame:
+-----------------+------------+--------------------+------------------+------------------+
|opp_id__reference|oplin_status| stage| std_amount| std_line_amount|
+-----------------+------------+--------------------+------------------+------------------+
|OP-180618-7456377| Pending|7 - Deliver & Val...|31395.462999391966|13072.069816517043|
|OP-180618-7456377| Pending|7 - Deliver & Val...|31395.462999391966| 13.85958009943131|
+-----------------+------------+--------------------+------------------+------------------+
I would like to assign GREAT to each opp_line where std_line_amount >= 30% of std_amount.
The expected output:
     opp_id__reference oplin_status                   stage     std_amount  std_line_amount  greater
542  OP-180112-6925769      Pending  7 - Deliver & Validate  363802.836296     31261.159197    False
543  OP-180112-6925769      Pending  7 - Deliver & Validate  363802.836296     46832.656747    False
544  OP-180112-6925769      Pending  7 - Deliver & Validate  363802.836296    118542.329840    False
359  OP-180222-7065558      Pending  7 - Deliver & Validate   2.434888e+05       670.785793    False
389  OP-160712-5051474      Pending  7 - Deliver & Validate   1.288711e+05      1288.780000    False
770  OP-180720-7563258      Pending  7 - Deliver & Validate   1.366182e+05        13.859580    False
For this, in a pandas DataFrame, I did:
DF_BR6['greater'] = DF_BR6.std_line_amount.gt(DF_BR6.groupby('opp_id__reference').std_amount.transform('sum') * 0.3)
Can you help me achieve this with a Spark DataFrame, please?
Thanks
Best
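For reference, a minimal PySpark sketch of the same logic (my assumption, mirroring the pandas transform('sum') with a window over opp_id__reference; df is the DataFrame shown above):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# window equivalent of the pandas groupby(...).transform('sum')
w = Window.partitionBy('opp_id__reference')

df = df.withColumn(
    'greater',
    F.col('std_line_amount') > 0.3 * F.sum('std_amount').over(w)
)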

Related

Substring function to extract part of the string

import pandas as pd

data = {'desc': ['ADRIAN PETER - ANN 80020355787C - 11 Baillon Pass.pdf',
                 'AILEEN MARCUS - ANC 800E15432922 - 5 Mandarin Way.pdf',
                 'AJITH SINGH - ANN 80020837750 - 11 Berkeley Loop.pdf',
                 'ALEX MARTIN-CURTIS - ANC 80021710355 - 26 Dovedale St.pdf',
                 'Alice.Smith\\Jodee - Karen - ANE 80020428377 - 58 Harrisdale Dr.pdf']}
df = pd.DataFrame(data, columns=['desc'])
df
From the data frame, I want to create a new column called ID containing only the values that come after ANN, ANC, or ANE. So I am expecting a result as below.
ID
80020355787C
800E15432922
80020837750
80021710355
80020428377
I tried running the code below, but it did not produce the desired result. I would appreciate your help on this.
df['id'] = df['desc'].str.extract(r'\-([^|]+)\-')
You can use the pattern - AN[NCE] (800[0-9A-Z]+) -, where:
AN[NCE] matches a literal AN followed by N, C, or E;
800[0-9A-Z]+ matches a literal 800 followed by one or more characters in the range 0-9 or A-Z.
>>> df['desc'].str.extract(r'- AN[NCE] (800[0-9A-Z]+) -')
0
0 80020355787C
1 800E15432922
2 80020837750
3 80021710355
4 80020428377
If not all your ids start with "800", you can just remove it from the pattern.
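For instance, a sketch of the relaxed pattern (my assumption: the ids stay alphanumeric and keep the surrounding "- ... -" delimiters):

df['ID'] = df['desc'].str.extract(r'- AN[NCE] ([0-9A-Z]+) -')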

How to get the difference between two columns using Talend

I have two Excel sheets that I want to compare using a Talend job.
First Excel sheet, named Compare_Me_1:
PN          STT  Designation
AY73101000  20   RC0402FR-0743K2L
AY73101000  22   RK73H1ETTP4322F
AY73101000  22   ERJ-2RKF4322X
Ac2566      70   CRCW040243K2FKED
Second Excel sheet, named Compare_Me_2:
PN          STT  Designation
AY73101000  20   RC0402FR-0743K2L
AY73101000  22   RK73H1ETTP4322F
AY73101000  21   ERJ-2RKF4322X
Ac2566      70   CRCW040243K2FKED
What I want to achieve is this output:
PN1         STT1  STT2  STT_OK_KO  Designation1      Designation2      Designation_OK_KO
AY73101000  20    20    ok         RC0402FR-0743K2L  RC0402FR-0743K2L  ok
AY73101000  22    22    ok         RK73H1ETTP4322F   RK73H1ETTP4322F   ok
AY73101000  22    21    ko         ERJ-2RKF4322X     ERJ-2RKF4322X     ok
Ac2566      70    70    ok         CRCW040243K2FKED  CRCW040243K2FKED  ok
To achieve this, I developed a Talend job that looks like the one below:
In my tMap I joined on PN with a left outer join and the "All matches" match model.
And to get, for example, STT_OK_KO, I used the code below to compare my two inputs:
(!Relational.ISNULL(row14.STT) && !Relational.ISNULL(row13.STT) &&
row14.STT.equals(row13.STT) ) ||
(Relational.ISNULL(row14.STT) && Relational.ISNULL(row13.STT))
?"ok":"ko"
Is this the correct way to achieve my output? If not, please recommend another method.
Any suggestion is welcome.
You probably need to follow the long steps below:

How to get Geoserver to use the MongoDB-Geo-Index for Heatmap Transformations?

We are currently trying to switch the backend for our GeoServer from PostGIS to MongoDB.
Apart from our GeoServer heatmap transformations, it is working quite well.
While visualizing the heatmap, GeoServer seems to ignore the BBox constraints, which results in a full table scan even when zooming in.
Normally, filtering with $geoIntersects is done on the defined BBox, but not while using the heatmap transformation.
Has anybody seen a similar problem and possibly found a solution?
We are using GeoServer version 2.15.2.
Here are the GeoServer logs with the logged MongoDB query:
2020-04-12 20:46:27,226 DEBUG [org.geotools.data.util] - CRSConverterFactory can be applied from Strings to CRS only.
2020-04-12 20:46:27,227 DEBUG [org.geotools.data.util] - InterpolationConverterFactory can be applied from Strings to Interpolation only.
2020-04-12 20:46:27,227 DEBUG [org.geotools.data.util] - CRSConverterFactory can be applied from Strings to CRS only.
2020-04-12 20:46:27,227 DEBUG [org.geotools.data.util] - InterpolationConverterFactory can be applied from Strings to Interpolation only.
2020-04-12 20:46:27,227 INFO [org.geoserver.flow] - Control-flow inactive, there are no configured rules
2020-04-12 20:46:27,229 DEBUG [org.geotools.process.factory] - Failed to locate the field 1 in class class org.geotools.process.vector.HeatmapProcess
2020-04-12 20:46:27,230 DEBUG [org.geotools.process.factory] - Failed to locate the field 1 in class class java.lang.Integer
2020-04-12 20:46:27,232 DEBUG [org.geotools.renderer.lite] - Computed scale denominator: 1063.361240291149
2020-04-12 20:46:27,233 DEBUG [org.geotools.styling] - number of fts set 1
2020-04-12 20:46:27,233 DEBUG [org.geotools.renderer.lite] - creating rules for scale denominator - 1,063.361
2020-04-12 20:46:27,233 DEBUG [org.geotools.renderer.lite] - Processing 1 stylers for test:stations2
2020-04-12 20:46:27,233 DEBUG [org.geotools.renderer.lite] - Expanding rendering area by 1 pixels to consider stroke width
2020-04-12 20:46:27,233 DEBUG [org.geotools.renderer.lite] - Querying layer test:stations2 with bbox: ReferencedEnvelope[16.83694749302019 : 16.83789164596449, 47.864875483641775 : 47.86694080816731]
2020-04-12 20:46:27,234 DEBUG [org.geotools.process.factory] - Failed to locate the field 1 in class class org.geotools.process.vector.HeatmapProcess
2020-04-12 20:46:27,234 DEBUG [org.geotools.process.factory] - Failed to locate the field 1 in class class java.lang.Integer
2020-04-12 20:46:27,235 DEBUG [org.geotools.data.mongodb] - find({ }, { "geometry" : 1 , "contact.mail" : 1 , "name" : 1 , "count" : 1 , "measurements.unit" : 1 , "measurements.values.time" : 1 , "_id" : 1 , "id" : 1 , "measurements.name" : 1 , "measurements.values.value" : 1})

select and calculate a new column in a spark dataframe pyspark

I have a Spark DataFrame in this format:
+-----------------+------------+--------------------+----------------+----------------+
|opp_id__reference|oplin_status|               stage|      std_amount| std_line_amount|
+-----------------+------------+--------------------+----------------+----------------+
|  OP-171102-67318|         Won|7 - Deliver & Val...|  6243.316662349|6243.31666234948|
|  OP-180910-77114|         Won|7 - Deliver & Val...|5014.57880858921|5014.57880858921|
|  OP-180910-76544|     Pending|7 - Deliver & Val...|5014.57880858921|5014.57880858921|
|  OP-180910-76544|     Pending|7 - Deliver & Val...|5014.57880858921|5614.57880858921|
|  OP-180910-76544|         Won|7 - Deliver & Val...|5014.57880858921|5994.57880858921|
+-----------------+------------+--------------------+----------------+----------------+
I would like to extract the list of opp_id__reference values for which the sum of std_line_amount over records with oplin_status = "Pending" is bigger than std_amount.
This is how I did it:
# select opp lines in stage '7 - Deliver & Validate'
DF_BR8 = df.filter(df.stage.contains("7 - Deliver")).select('opp_id__reference', 'oplin_status', 'stage', 'std_amount', 'std_line_amount')
# sum std_line_amount per opportunity, amount and status
DF_BR8_1 = DF_BR8.groupby('opp_id__reference', 'std_amount', 'oplin_status').agg({'std_line_amount': 'sum'}).withColumnRenamed('sum(std_line_amount)', 'sum_column')
# keep only the Pending lines, then apply the threshold
DF_res = DF_BR8_1.filter(DF_BR8_1.oplin_status.contains("Pending"))
DF_res1 = DF_res.filter(DF_res.sum_column <= 0.3 * DF_res.std_amount)
My question is: is what I did correct? Is there a simpler way to do it?
Thanks
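A slightly more compact equivalent might be to filter on "Pending" first and chain the steps; a sketch under the same assumptions as the snippet above, keeping its 0.3 * std_amount threshold:

from pyspark.sql import functions as F

DF_res1 = (
    df.filter(df.stage.contains('7 - Deliver'))
      .filter(df.oplin_status.contains('Pending'))
      .groupby('opp_id__reference', 'std_amount')
      .agg(F.sum('std_line_amount').alias('sum_column'))
      .filter(F.col('sum_column') <= 0.3 * F.col('std_amount'))
)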

How to comment on a specific line number on a PR on github

I am trying to write a small script that can comment on GitHub PRs using ESLint output.
The problem is that ESLint gives me absolute line numbers for each error,
but the GitHub API wants the line number relative to the diff.
From the GitHub API docs: https://developer.github.com/v3/pulls/comments/#create-a-comment
To comment on a specific line in a file, you will need to first
determine the position in the diff. GitHub offers an
application/vnd.github.v3.diff media type which you can use in a
preceding request to view the pull request's diff. The diff needs to
be interpreted to translate from the line in the file to a position in
the diff. The position value is the number of lines down from the
first "@@" hunk header in the file you would like to comment on.
The line just below the "@@" line is position 1, the next line is
position 2, and so on. The position in the file's diff continues to
increase through lines of whitespace and additional hunks until a new
file is reached.
So if I want to add a comment on new line number 5 in the above image, then I would need to pass 12 to the API.
My question is: how can I easily map from the new line numbers that ESLint gives in its error messages to the relative line numbers required by the GitHub API?
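For illustration, a minimal Python sketch of the rule described in the docs (my assumptions: a single-file diff, and diff_positions is a hypothetical helper, not part of any library):

def diff_positions(diff_text):
    """Map new-file line numbers to positions in a single-file unified diff."""
    positions = {}   # new-file line number -> position in the diff
    position = None  # None until the first hunk header is seen
    new_line = None
    for line in diff_text.splitlines():
        if line.startswith('@@'):
            # hunk header, e.g. "@@ -2,14 +2,7 @@"; the line below the first
            # header is position 1, and later headers still consume a position
            new_line = int(line.split('+')[1].split(',')[0].split(' ')[0])
            position = 0 if position is None else position + 1
            continue
        if position is None:
            continue  # still in the file header (diff --git, index, ---, +++)
        position += 1
        if not line.startswith('-'):
            # context ("normal") and added lines exist in the new file
            positions[new_line] = position
            new_line += 1
    return positions

Under those assumptions, positions.get(5) would give the number to pass as position.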
What I have tried so far
I am using parse-diff to convert the diff provided by the GitHub API into a JSON object:
[{
  "chunks": [{
    "content": "@@ -OLD_STARTING_LINE_NUMBER,OLD_TOTAL_LINES +NEW_STARTING_LINE_NUMBER,NEW_TOTAL_LINES @@",
    "oldStart": NUMBER,
    "oldLines": NUMBER,
    "newStart": NUMBER,
    "newLines": NUMBER,
    "changes": [
      {
        "type": STRING("normal"|"add"|"del"),
        "normal": BOOLEAN,
        "add": BOOLEAN,
        "del": BOOLEAN,
        "ln1": OLD_LINE_NUMBER,
        "ln2": NEW_LINE_NUMBER,
        "ln": ADDED_OR_DELETED_LINE_NUMBER,
        "content": STRING
      }
    ]
  }]
}]
I am thinking of the following algorithm:
- for each file, make an array of new line numbers from NEW_STARTING_LINE_NUMBER to NEW_STARTING_LINE_NUMBER + NEW_TOTAL_LINES
- subtract newStart from each number to get another array, relativeLineNumbers
- traverse the array and, for each deleted line (type === 'del'), increment the corresponding remaining relativeLineNumbers
- for each additional hunk (a line having @@), decrement the corresponding remaining relativeLineNumbers
I have found a solution. I didn't put it here earlier because it involves simple looping and nothing special, but I am answering now to help others.
I have opened a pull request to create a situation similar to the one shown in the question:
https://github.com/harryi3t/5134/pull/7/files
Using the GitHub API, one can get the diff data:
diff --git a/test.js b/test.js
index 2aa9a08..066fc99 100644
--- a/test.js
+++ b/test.js
@@ -2,14 +2,7 @@
var hello = require('./hello.js');
-var names = [
- 'harry',
- 'barry',
- 'garry',
- 'harry',
- 'barry',
- 'marry',
-];
+var names = ['harry', 'barry', 'garry', 'harry', 'barry', 'marry'];
var names2 = [
'harry',
@@ -23,9 +16,7 @@ var names2 = [
// after this line new chunk will be created
var names3 = [
'harry',
- 'barry',
- 'garry',
'harry',
'barry',
- 'marry',
+ 'marry', 'garry',
];
Now just pass this data to the parse-diff module and do the computation:
var parseDiff = require('parse-diff'); // parse-diff module from npm

var parsedFiles = parseDiff(data); // data = raw diff text from the GitHub API
parsedFiles.forEach(
function (file) {
var relativeLine = 0;
file.chunks.forEach(
function (chunk, index) {
if (index !== 0) // relative line number should increment for each chunk
relativeLine++; // except the first one (see rel-line 16 in the image)
chunk.changes.forEach(
function (change) {
relativeLine++;
console.log(
change.type,
change.ln1 ? change.ln1 : '-',
change.ln2 ? change.ln2 : '-',
change.ln ? change.ln : '-',
relativeLine
);
}
);
}
);
}
);
This would print
type  old line (ln1)  new line (ln2)  added/deleted line (ln)  relative line
normal 2 2 - 1
normal 3 3 - 2
normal 4 4 - 3
del - - 5 4
del - - 6 5
del - - 7 6
del - - 8 7
del - - 9 8
del - - 10 9
del - - 11 10
del - - 12 11
add - - 5 12
normal 13 6 - 13
normal 14 7 - 14
normal 15 8 - 15
normal 23 16 - 17
normal 24 17 - 18
normal 25 18 - 19
del - - 26 20
del - - 27 21
normal 28 19 - 22
normal 29 20 - 23
del - - 30 24
add - - 21 25
normal 31 22 - 26
Now you can use the relative line number to post a comment using the GitHub API.
For my purpose I only needed the relative line numbers for the newly added lines, but using the table above one can get them for deleted lines as well.
Here's the link to the linting project in which I used this: https://github.com/harryi3t/lint-github-pr
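For completeness, a hedged sketch of the create-comment call from the linked v3 docs (OWNER, REPO, PR_NUMBER, TOKEN, and the commit SHA are placeholders):

import requests

resp = requests.post(
    'https://api.github.com/repos/OWNER/REPO/pulls/PR_NUMBER/comments',
    headers={'Authorization': 'token TOKEN'},
    json={
        'body': 'ESLint: unexpected console statement',  # hypothetical message
        'commit_id': 'HEAD_COMMIT_SHA',  # SHA of the latest commit in the PR
        'path': 'test.js',
        'position': 12,  # the relative line number computed above
    },
)
resp.raise_for_status()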