How to update the value of a dataframe if it satisfies a specific condition inside a nested loop in spark scala - scala

Sample data is below. I just need to know how we can update values inside a DataFrame based on a specific condition.
My DataFrame contains some store-related data: store ID, store name, address, latitude, longitude, and so on.
I need to find the radius using the latitude and longitude:
sqrt((x1-x2)^2 + (y1-y2)^2)
where x1 is the first row's latitude and x2 is the second row's latitude, and likewise for longitude.
Here I need to compare each store with every other store, hence a nested loop.
So I converted latitude and longitude into lists, and I iterate with the help of these two lists.
I have already added the new columns Radius and New_ID.
After running this, the value of result does not get updated in the DataFrame.
Please help me out. If any more details are required, please let me know.
while (i < latlist.length - 1) {
  j = 1
  id = id + 1
  while (j < longlist.length) {
    result = sqrt(pow(latlist(i) - latlist(j), 2) + pow(longlist(i) - mylonglist(j), 2))
    df3 = df2.withColumn("Radius", col("Radius") + result)
    j = j + 1
  }
  df4 = df3.filter(df3("Radius") <= 1.32).withColumn("New_ID", col("New_ID") + id)
  i = i + 1
}
df4.show(10)
Sample Data
StoreName StoreReg Latitude Longitude Radius New_ID
Abc MH 50.5684 6.9894 0 0
Xyz DE 47.9783 7.4984 0 0
Pqr AS 67.8479 10.7029 0 0
Qwr LI 53.8733 8.8393 0 0
Dsg GY 49.0832 9.78946 0 0
Hnr TY 51.8937 8.5678 0 0
Erf ER 52.7689 7.9763 0 0
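For reference, a minimal sketch of how the same pairwise comparison could be written without mutable loops, using a self cross join (the column names come from the sample data above, and the 1.32 threshold and distance formula come from the question; this is an illustrative sketch, not the original code):
import org.apache.spark.sql.functions._

// Pair every store with every other store via a self cross join on df.
val pairs = df.as("a").crossJoin(df.as("b"))
  .filter(col("a.StoreName") =!= col("b.StoreName"))

// Euclidean distance sqrt((x1-x2)^2 + (y1-y2)^2) per pair.
val withRadius = pairs.withColumn("Radius",
  sqrt(pow(col("a.Latitude") - col("b.Latitude"), 2) +
       pow(col("a.Longitude") - col("b.Longitude"), 2)))

// Keep only the pairs within the question's 1.32 threshold.
val nearby = withRadius.filter(col("Radius") <= 1.32)
nearby.show(10)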

Related

Apply groupby in udf from an increase function Pyspark

I have the following function:
import copy

rn = 0

def check_vals(x, y):
    global rn
    if (y != None) & (int(x)+1) == int(y):
        return rn + 1
    else:
        # Using copy to deepcopy and not form a shallow one.
        res = copy.copy(rn)
        # Increment so that the next value will start from +1
        rn += 1
        # Return the same value as we want to group using this
        return res + 1
    return 0

@pandas_udf(IntegerType(), functionType=PandasUDFType.GROUPED_AGG)
def check_final(x, y):
    return lambda x, y: check_vals(x, y)
I need to apply this function to the following df:
index initial_range final_range
1 1 299
1 300 499
1 500 699
1 800 1000
2 10 99
2 100 199
So I need the following output:
index min_val max_val
1 1 699
1 800 1000
2 10 199
Note that for each grouping field there are new ranges, whose values are min(initial_range) and max(final_range), extended until the sequence is broken; that is what applying the groupBy should produce.
I tried:
w = Window.partitionBy('index').orderBy(sf.col('initial_range'))
df = (df.withColumn('nextRange', sf.lead('initial_range').over(w))
        .fillna(0, subset=['nextRange'])
        .groupBy('index')
        .agg(check_final("final_range", "nextRange").alias('check_1'))
        .withColumn('min_val', sf.min("initial_range").over(Window.partitionBy("check_1")))
        .withColumn('max_val', sf.max("final_range").over(Window.partitionBy("check_1")))
)
But it didn't work.
Can anyone help me?
I think the pure Spark SQL API can solve your question without needing any UDF, which might otherwise hurt your Spark performance. Also, I think two window functions are enough to solve this question:
df.withColumn(
    'next_row_initial_diff',
    func.col('initial_range') - func.lag('final_range', 1).over(Window.partitionBy('index').orderBy('initial_range'))
).withColumn(
    'group',
    func.sum(
        func.when(func.col('next_row_initial_diff').isNull() | (func.col('next_row_initial_diff') == 1), func.lit(0))
            .otherwise(func.lit(1))
    ).over(
        Window.partitionBy('index').orderBy('initial_range')
    )
).groupBy(
    'group', 'index'
).agg(
    func.min('initial_range').alias('min_val'),
    func.max('final_range').alias('max_val')
).drop(
    'group'
).show(100, False)
Column next_row_initial_diff: just like the lead you used, it shifts/lags the row and checks whether it is in sequence.
Column group: groups the sequences inside each index partition.
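For completeness, a minimal self-contained version of the same approach (the SparkSession setup and the sample rows are my own reconstruction from the question's tables, so treat it as a sketch):
from pyspark.sql import SparkSession, Window, functions as func

spark = SparkSession.builder.getOrCreate()

# Sample data reconstructed from the question.
df = spark.createDataFrame(
    [(1, 1, 299), (1, 300, 499), (1, 500, 699), (1, 800, 1000),
     (2, 10, 99), (2, 100, 199)],
    ['index', 'initial_range', 'final_range'])

w = Window.partitionBy('index').orderBy('initial_range')

(df
 # Gap to the previous row's final_range; null on the first row of a partition.
 .withColumn('next_row_initial_diff',
             func.col('initial_range') - func.lag('final_range', 1).over(w))
 # Running sum of "sequence broken" flags labels each contiguous run.
 .withColumn('group',
             func.sum(func.when(func.col('next_row_initial_diff').isNull()
                                | (func.col('next_row_initial_diff') == 1),
                                func.lit(0)).otherwise(func.lit(1))).over(w))
 .groupBy('index', 'group')
 .agg(func.min('initial_range').alias('min_val'),
      func.max('final_range').alias('max_val'))
 .drop('group')
 .orderBy('index', 'min_val')
 .show())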

How to match the coordinates (UTM and geometry) of this df/sp objects?

I'd be really happy if you could help me with this problem. I want to geom_point the df "daa_84" onto the shp file "shp_5". After viewing multiple related questions on Stack Overflow and testing their answers (such as creating an sp object from "daa_84" and transforming the UTM coordinates to match the coordinates of "shp_5"), I only get something like the plot below. Also, I know that the UTM zone (19S) and the EPSG code related to my country (32719) for the coordinate system (WGS84) are needed for "something" haha. Any ideas?
> head(daa_84)
# A tibble: 6 x 2
utm_este utm_norte
<dbl> <dbl>
1 201787 6364077
2 244958 6247258
3 245947 6246281
4 246100 6247804
5 246358 6242918
6 246470 6332356
> head(shp_5)
Simple feature collection with 6 features and 1 field
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -7973587 ymin: -3976507 xmax: -7838155 ymax: -3766040
projected CRS: WGS 84 / Pseudo-Mercator
Comuna geometry
1 Rinconada MULTIPOLYGON (((-7871440 -3...
2 Cabildo MULTIPOLYGON (((-7842610 -3...
3 Petorca MULTIPOLYGON (((-7873622 -3...
4 Panquehue MULTIPOLYGON (((-7874932 -3...
5 Olmué MULTIPOLYGON (((-7916865 -3...
6 Cartagena MULTIPOLYGON (((-7973501 -3...
ggplot() + geom_sf(data = shp_5, aes()) +
geom_point(data = daa_84, aes(x= "utm_este", "utm_norte"),
alpha = 0.05, size = 0.5) +
labs(x = "Latitude", y = "Longitude")+
theme_bw()
my progress so far
EDIT
In addition to william3031's answer, this code also works:
library(tidyverse)  # for tribble() and %>%
library(sf)
daa_84 = tribble(~utm_este, ~utm_norte,
                 201787, 6364077,
                 244958, 6247258,
                 245947, 6246281,
                 246100, 6247804,
                 246358, 6242918,
                 246470, 6332356)
daa_84 = st_as_sf(daa_84,
                  coords = c('utm_este', 'utm_norte'),
                  crs = st_crs(32719)) %>%
  st_transform(st_crs(shp_5))
This will work for you. I have used a different dataset for South America as you haven't provided a reproducible example.
library(tidyverse)
library(sf)
library(spData) # just for the 'world' dataset
# original
daa_84 <- data.frame(
utm_este = c(201787L, 244958L, 245947L, 246100L, 246358L, 246470L),
utm_norte = c(6364077L, 6247258L, 6246281L, 6247804L, 6242918L, 6332356L)
)
# converted
daa_84_sf <- st_as_sf(daa_84, coords = c("utm_este", "utm_norte"), crs = 32719)
# load world to get South America
data("world")
sam <- world %>%
filter(continent == "South America")
# plot
ggplot() +
geom_sf(data = sam) +
geom_sf(data = daa_84_sf)

Why is there a -1 at the end of the range function?

I understand the whole code; I just want to know why there has to be a -1 at the end of the range function. I've been checking it out with Python Tutor but I can't make it out.
#Given 2 strings, a and b, return the number of the positions where they
#contain the same length 2 substring. So "xxcaazz" and "xxbaaz" yields 3,
#since the "xx", "aa", and "az" substrings appear in the same place in
#both strings.
def string_match(a, b):
    shorter = min(len(a), len(b))
    count = 0
    for i in range(shorter - 1):  # <<<<<<<<< This is the -1 I don't understand.
        a_sub = a[i:i+2]
        b_sub = b[i:i+2]
        if a_sub == b_sub:
            count = count + 1
    return count

string_match('xxcaazz', 'xxbaaz')
string_match('abc', 'abc')
string_match('abc', 'axc')
I expect to understand why there has to be a -1 at the end of the range function. I would appreciate your help and explanation!
The loop index counts from 0, and each iteration compares the length-2 slices a[i:i+2] and b[i:i+2]. The last valid starting position for a length-2 substring is shorter - 2, and range(shorter - 1) stops exactly there, because range excludes its end value.
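A quick illustration of the boundary, using the strings from the question (a minimal sketch):
a, b = 'xxcaazz', 'xxbaaz'
shorter = min(len(a), len(b))  # 6
# range(shorter - 1) yields i = 0..4, so the last slices compared are
# a[4:6] and b[4:6]; starting at index 5 would leave only a length-1 slice.
for i in range(shorter - 1):
    print(i, a[i:i+2], b[i:i+2], a[i:i+2] == b[i:i+2])
# Three True lines ("xx", "aa", "az") match the expected count of 3.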

Compute the Frequency of bigrams in Matlab

I am trying to compute and plot the distribution of bigram frequencies.
First I generated all possible bigrams, which gives 1296 bigrams.
Then I extract the bigrams from a given file and save them in words1.
My question is: how do I compute the frequency of these 1296 bigrams for the file a.txt?
If some bigrams did not appear at all in the file, their frequencies should be zero.
a.txt is any text file.
clear
clc
%************ create bigrams 1296 ***************************************
chars  = '1234567890abcdefghijklmonpqrstuvwxyz';
chars1 = '1234567890abcdefghijklmonpqrstuvwxyz';
bigram = '';
for i = 1:36
    for j = 1:36
        bigram = sprintf('%s%s%s', bigram, chars(i), chars1(j));
    end
end
temp1 = regexp(bigram, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y], temp1(1:end-1)', temp1(2:end)', 'un', 0);
bigrams = temp2;
bigrams = unique(bigrams);
bigrams = rot90(bigrams);
bigram = char(bigrams(1:end));
all_bigrams_len = length(bigrams);
clear temp temp1 temp2 i j chars1 chars;
%****** 1. Cleaning Data ******************************
collection = fileread('e:\a.txt');
collection = regexprep(collection, '<.*?>', '');
collection = lower(collection);
collection = regexprep(collection, '\W', '');
collection = strtrim(regexprep(collection, '\s*', ''));
%*******************************************************
temp = regexp(collection, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y], temp(1:end-1)', temp(2:end)', 'un', 0);
words1 = rot90(temp2);
%*******************************************************
words1_len = length(words1);
vocab1 = unique(words1);
vocab_len1 = length(vocab1);
[vocab1, void1, index1] = unique(words1);
frequencies1 = hist(index1, vocab_len1);
I. Character counting problem for a string
bsxfun based solution for counting characters -
counts = sum(bsxfun(@eq,[string1-0]',65:90))
Output -
counts =
2 0 0 0 0 2 0 1 0 0 ....
If you would like a tabulated output of counts against each letter -
out = [cellstr(['A':'Z']') num2cell(counts)']
Output -
out =
'A' [2]
'B' [0]
'C' [0]
'D' [0]
'E' [0]
'F' [2]
'G' [0]
'H' [1]
'I' [0]
....
Please note that this was a case-sensitive count for upper-case letters.
For lower-case letter counting, use this edit of the earlier code -
counts = sum(bsxfun(@eq,[string1-0]',97:122))
For case-insensitive counting, use this -
counts = sum(bsxfun(@eq,[upper(string1)-0]',65:90))
II. Bigram counting case
Let us suppose that you have all the possible bigrams saved in a 1D cell array bigrams1 and the incoming bigrams from the file are saved into another cell array words1. Let us also assume certain values in them for demonstration -
bigrams1 = {
'ar';
'de';
'c3';
'd1';
'ry';
't1';
'p1'}
words1 = {
'de';
'c3';
'd1';
'r9';
'yy';
'de';
'ry';
'de';
'dd';
'd1'}
Now, you can get the counts of the bigrams from words1 that are present in bigrams1 with this code -
[~,~,ind] = unique(vertcat(bigrams1,words1));
bigrams_lb = ind(1:numel(bigrams1)); %// label bigrams1
words1_lb = ind(numel(bigrams1)+1:end); %// label words1
counts = sum(bsxfun(@eq,bigrams_lb,words1_lb'),2)
out = [bigrams1 num2cell(counts)]
The output on code run is -
out =
'ar' [0]
'de' [3]
'c3' [1]
'd1' [2]
'ry' [1]
't1' [0]
'p1' [0]
The result shows that the first element, ar, from the list of all possible bigrams has no match in words1; the second element, de, has three occurrences in words1, and so on.
Hey, similar to Dennis's solution, you can just use histc():
string1 = 'ASHRAFF'
histc(string1,'ABCDEFGHIJKLMNOPQRSTUVWXYZ')
This checks the number of entries in the bins defined by the string 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', which is hopefully the alphabet (just wrote it fast, so no guarantee). The result is:
Columns 1 through 21
2 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0
Columns 22 through 26
0 0 0 0 0
Just a little modification of my solution:
string1 = 'ASHRAFF'
alphabet1='A':'Z'; %%// as stated by Oleg Komarov
data=histc(string1,alphabet1);
results = cell(2,26);
for k = 1:26
    results{1,k} = alphabet1(k);
    results{2,k} = data(k);
end
If you look at results now you can easily check whether it works or not :D
This answer creates all bigrams, loads in the file, does a little cleanup, and then uses a combination of unique and histc to count the rows.
Generate all Bigrams
Note the order here is important, as unique will sort the array; this way it is created presorted, so the output matches expectation:
[y,x] = ndgrid(['0':'9','a':'z']);
allBigrams = [x(:),y(:)];
Read The File
This removes capitalisation, pulls out any 0-9 or a-z character, and then creates a column vector of these:
fileText = lower(fileread('d:\loremipsum.txt'));
cleanText = regexp(fileText,'([a-z0-9])','tokens');
cleanText = cell2mat(vertcat(cleanText{:}));
Create bigrams from the file by shifting by one and concatenating:
fileBigrams = [cleanText(1:end-1),cleanText(2:end)];
Get Counts
The set of all bigrams is appended to the bigrams from the file (so values are created for all possible bigrams). Then a value ∈ {1,2,...,1296} is assigned to each unique row using unique's third output. Counts are then created with histc, with the bins equal to the set of values from unique's output; 1 is subtracted from each bin to remove the complete set of bigrams we added:
[~,~,c] = unique([fileBigrams;allBigrams],'rows');
counts = histc(c,1:1296)-1;
Display
To view counts against the text:
[allBigrams, counts+'0']
or for something potentially more useful...
[sortedCounts,sortInd] = sort(counts,'descend');
[allBigrams(sortInd,:), sortedCounts+'0']
ans =
or9
at8
re8
in7
ol7
te7
do6 ...
Did not look into the entire code fragment, but from the example at the top of your question, I think you are looking to make a histogram:
string1 = 'ASHRAFF'
nr = histc(string1,'A':'Z')
Will give you:
2 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
(I got a working solution with hist, but as @The Minion shows, histc is easier to use here.)
Note that this solution only deals with upper case letters.
You may want to do something like this if you want to put lower-case letters in their correct bin:
string1 = 'ASHRAFF'
nr = histc(upper(string1),'A':'Z')
Or if you want them to be shown separately:
string1 = 'ASHRaFf'
nr = histc(upper(string1),['a':'z' 'A':'Z'])
bi_freq1 = zeros(1, all_bigrams_len);
for k = 1:vocab_len1
    for i = 1:all_bigrams_len
        if char(vocab1(k)) == char(bigrams(i))
            bi_freq1(i) = frequencies1(k);
        end
    end
end

Matlab HDF5: Read DIMENSION_LIST attribute

I'm trying to read HDF5 files with Matlab. I created the files in Fortran, which is only relevant in that I used h5dsattach_scale_f to attached scale datasets to each dimension of my given primary dataset. Most of my logic works well, but I'm having trouble reading the attributes of my primary dataset in order to get at the attached scales.
I start by iterating through each dataset in the file. Once I know I have my primary dataset, I iterate through its attributes with this call:
[status, index_out, SD] = H5A.iterate(dset_id, 'H5_INDEX_NAME', 'H5_ITER_NATIVE', 0, @hdf5_sds_attr_iter, SD);
That calls this function for every attribute:
function [status, SD] = hdf5_sds_attr_iter(dset_id, attr_name, info, SD)
status = 0;
disp(attr_name);
if ~strcmp(attr_name, 'DIMENSION_LIST')
    return;
end
attr_id = H5A.open(dset_id, attr_name, 'H5P_DEFAULT');
space = H5A.get_space(attr_id);
[~, dims, ~] = H5S.get_simple_extent_dims(space);
info2 = H5A.get_info(attr_id);
disp(info2);
rdata = H5A.read(attr_id, 'H5ML_DEFAULT');
disp(rdata);
for i = 1:dims
    disp(rdata{i});
end
H5S.close(space);
H5A.close(attr_id);
end
This is the output:
DIMENSION_LIST
3
corder_valid: 1
corder: 0
cset: 0
data_size: 48
[8x1 uint8]
[8x1 uint8]
[8x1 uint8]
184
17
0
0
0
0
0
0
32
28
0
0
0
0
0
0
240
29
0
0
0
0
0
0
If I do h5dump on the dataset, this is what that attribute looks like:
ATTRIBUTE "DIMENSION_LIST" {
DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): (DATASET 1400 /beamdata scale rank 1 ),
(1): (DATASET 6512 /beamdata scale rank 2 ),
(2): (DATASET 6976 /beamdata scale rank 3 )
}
}
Since those numbers (1400, 6512, 6976) do not appear elsewhere in the dump, I don't know how to use them or the output of H5A.read (rdata) to actually get at the scale data. The Matlab HDF5 documentation is rather silent on what to do with attribute data. Does anyone know how to process attribute reference data correctly?
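For what it's worth, here is a minimal sketch of how those references might be resolved with MATLAB's low-level H5R interface (this is an assumption on my part rather than a confirmed answer, and it is untested against the file above):
% Hedged sketch: each [8x1 uint8] entry in rdata should be an object
% reference to one attached scale dataset.
for i = 1:dims
    scale_id = H5R.dereference(dset_id, 'H5R_OBJECT', rdata{i});
    disp(H5I.get_name(scale_id));      % path of the referenced scale dataset
    scale_data = H5D.read(scale_id);   % read the scale values themselves
    H5D.close(scale_id);
end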