How to match the coordinates (UTM and geometry) of these df/sp objects?

I'd be really happy if you could help me with this problem. I want to plot the points of the data frame "daa_84" with geom_point on top of the shapefile "shp_5". After reading several related questions on Stack Overflow and testing their answers (such as creating an sp object from "daa_84" and transforming its UTM coordinates to match the coordinates of "shp_5"), I only get something like the plot. I also know that the UTM zone (19S) and the matching EPSG code for my country (32719, WGS 84 / UTM zone 19S) are needed for "something" haha. Any ideas?
> head(daa_84)
# A tibble: 6 x 2
utm_este utm_norte
<dbl> <dbl>
1 201787 6364077
2 244958 6247258
3 245947 6246281
4 246100 6247804
5 246358 6242918
6 246470 6332356
> head(shp_5)
Simple feature collection with 6 features and 1 field
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -7973587 ymin: -3976507 xmax: -7838155 ymax: -3766040
projected CRS: WGS 84 / Pseudo-Mercator
Comuna geometry
1 Rinconada MULTIPOLYGON (((-7871440 -3...
2 Cabildo MULTIPOLYGON (((-7842610 -3...
3 Petorca MULTIPOLYGON (((-7873622 -3...
4 Panquehue MULTIPOLYGON (((-7874932 -3...
5 Olmué MULTIPOLYGON (((-7916865 -3...
6 Cartagena MULTIPOLYGON (((-7973501 -3...
ggplot() + geom_sf(data = shp_5, aes()) +
geom_point(data = daa_84, aes(x= "utm_este", "utm_norte"),
alpha = 0.05, size = 0.5) +
labs(x = "Latitude", y = "Longitude")+
theme_bw()
[Image: my progress so far]
EDIT
in addition to the answer of william3031, this code also works
library(sf)
library(tidyverse) # provides tribble() and %>%
daa_84 = tribble(~utm_este, ~utm_norte,
201787, 6364077,
244958, 6247258,
245947, 6246281,
246100, 6247804,
246358, 6242918,
246470, 6332356)
daa_84 = st_as_sf(daa_84,
coords=c('utm_este', 'utm_norte'),
crs=st_crs(32719)) %>%
st_transform(st_crs(shp_5))

This will work for you. I have used a different dataset for South America as you haven't provided a reproducible example.
library(tidyverse)
library(sf)
library(spData) # just for the 'world' dataset
# original
daa_84 <- data.frame(
utm_este = c(201787L, 244958L, 245947L, 246100L, 246358L, 246470L),
utm_norte = c(6364077L, 6247258L, 6246281L, 6247804L, 6242918L, 6332356L)
)
# converted
daa_84_sf <- st_as_sf(daa_84, coords = c("utm_este", "utm_norte"), crs = 32719)
# load world to get South America
data("world")
sam <- world %>%
filter(continent == "South America")
# plot
ggplot() +
geom_sf(data = sam) +
geom_sf(data = daa_84_sf)


Create LineString from Lat/Lon columns using PySpark

I have a PySpark dataframe containing Lat/Lon points for different trajectories identified by a column "trajectory_id".
trajectory_id latitude longitude
1 45 5
1 45 6
1 45 7
2 46 5
2 46 6
2 46 7
What I want to do is to extract for each trajectory_id a LineString and store it in another dataframe, where each row represents a trajectory with "id" and "geometry" columns. In this example, the output should be:
trajectory_id geometry
1 LINESTRING (5 45, 6 45, 7 45)
2 LINESTRING (5 46, 6 46, 7 46)
This is similar to what has been asked in this question, but in my case I need to use PySpark.
I have tried the following:
import pandas as pd
from pyspark.sql import functions as F  # needed for F.col below
from shapely.geometry import Point,LineString
df = pd.DataFrame([[1, 45,5], [1, 45,6], [1, 45,7],[2, 46,5], [2, 46,6], [2, 46,7]], columns=['trajectory_id', 'latitude','longitude'])
df1 = spark.createDataFrame(df)
idx_ = df1.select("trajectory_id").rdd.flatMap(lambda x: x).distinct().collect()
geo_df = pd.DataFrame(index=range(len(idx_)),columns=['geometry','trajectory_id'])
k=0
for i in idx_:
    df2=df1.filter(F.col("trajectory_id").isin(i)).toPandas()
    df2['points']=df2[["longitude", "latitude"]].apply(Point, axis=1)
    geo_df.geometry.iloc[k]=str(LineString(df2['points']))
    geo_df['trajectory_id'].iloc[k]=i
    k=k+1
This code works, but my real task involves many more trajectories (more than 2 million), and it takes forever because I convert to pandas in each iteration.
Is there a way to obtain the same output more efficiently?
As mentioned, I know that toPandas() (and/or collect()) is something I should avoid, especially inside a for loop.
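You can do this with PySpark SQL's native functions, without converting to pandas.
import pyspark.sql.functions as func
# df1 is the Spark DataFrame from the question; build "lon lat" pairs (WKT order is x y)
long_lat_df = df1.withColumn('joined_long_lat', func.concat(func.col("longitude"), func.lit(" "), func.col("latitude")))
# collect the pairs per trajectory (collect_list does not guarantee row order after a shuffle)
grouped_df = long_lat_df.groupby('trajectory_id').agg(func.collect_list('joined_long_lat').alias("geometry"))
# join the pairs and wrap them in the LINESTRING (...) text
final_df = grouped_df.withColumn('geometry', func.concat(func.lit("LINESTRING ("), func.concat_ws(", ", func.col("geometry")), func.lit(")")))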
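As a plain-Python illustration of the same three steps (pair up longitude and latitude, collect the pairs per trajectory, join them into WKT text), the sketch below builds the expected strings for the sample rows. It is not Spark code. The function name rows_to_linestrings is made up for this example, and the rows are assumed to arrive already in the desired point order.

```python
from collections import defaultdict


def rows_to_linestrings(rows):
    """Group (trajectory_id, latitude, longitude) rows into WKT LINESTRING text."""
    coords = defaultdict(list)
    for trajectory_id, latitude, longitude in rows:
        # WKT lists x (longitude) before y (latitude)
        coords[trajectory_id].append(f"{longitude} {latitude}")
    return {
        trajectory_id: "LINESTRING (" + ", ".join(points) + ")"
        for trajectory_id, points in coords.items()
    }


# Sample rows from the question: (trajectory_id, latitude, longitude)
rows = [(1, 45, 5), (1, 45, 6), (1, 45, 7), (2, 46, 5), (2, 46, 6), (2, 46, 7)]
result = rows_to_linestrings(rows)
```

If the point order matters in Spark, one option is to carry an explicit sequence or timestamp column and sort on it before joining, since grouping alone does not promise any order.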

How to update a value of a DataFrame if it satisfies a specific condition inside a nested loop in Spark Scala

I just need to know how to update values inside a DataFrame when a specific condition is met.
My DataFrame contains store-related data such as store ID, store name, address, latitude and longitude.
I need to compute a "radius" (the distance between two stores) from the latitude and longitude:
sqrt((x1-x2)^2 + (y1-y2)^2)
Here x1 is the latitude of the first row and x2 the latitude of the second row; y1 and y2 are the longitudes, likewise.
I need to compare each store with every other store, hence the nested loop.
So I converted the Latitude and Longitude columns to lists and iterate over those two lists.
I have already added the new columns Radius and New_ID.
After running this, the value of result is not updated in the DataFrame.
Please help me out.
If any more details are required, please let me know.
while (i<latlist.length-1)
{
j=1
id=id+1
while(j<longlist.length)
{
result = sqrt(pow(latlist(i)- latlist(j),2) + pow(longlist(i) - mylonglist(j),2))
df3=df2.withColumn("Radius", col("Radius")+result)
} j=j+1;
df4= df3.filter(df3("Radius")<=1.32).withColumn("New_ID", col("New_ID")+id)
}
i=i+1
}
df4.show(10)
}
Sample Data
StoreName StoreReg Latitude Longitude Radius New_ID
Abc MH 50.5684 6.9894 0 0
Xyz DE 47.9783 7.4984 0 0
Pqr AS 67.8479 10.7029 0 0
Qwr LI 53.8733 8.8393 0 0
Dsg GY 49.0832 9.78946 0 0
Hnr TY 51.8937 8.5678 0 0
Erf ER 52.7689 7.9763 0 0
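To make the intended comparison concrete, here is a minimal sketch in plain Python (not Scala or Spark) that applies the formula from the question to every pair of stores in the sample data and keeps the pairs within the 1.32 threshold used in the filter. It assumes "Radius" means the plain Euclidean distance in degrees, exactly as the formula is written; the name close_pairs is hypothetical.

```python
import math
from itertools import combinations

# (StoreName, Latitude, Longitude) copied from the sample data
stores = [
    ("Abc", 50.5684, 6.9894),
    ("Xyz", 47.9783, 7.4984),
    ("Pqr", 67.8479, 10.7029),
    ("Qwr", 53.8733, 8.8393),
    ("Dsg", 49.0832, 9.78946),
    ("Hnr", 51.8937, 8.5678),
    ("Erf", 52.7689, 7.9763),
]


def close_pairs(stores, threshold):
    """Return (name_a, name_b, distance) for every pair within the threshold."""
    pairs = []
    for (name_a, lat_a, lon_a), (name_b, lat_b, lon_b) in combinations(stores, 2):
        # sqrt((x1-x2)^2 + (y1-y2)^2), in degrees
        distance = math.hypot(lat_a - lat_b, lon_a - lon_b)
        if distance <= threshold:
            pairs.append((name_a, name_b, distance))
    return pairs


pairs = close_pairs(stores, 1.32)
```

Spark DataFrames are immutable, so withColumn returns a new DataFrame rather than changing the existing one. Computing the pairs first and building the result once is therefore usually easier than reassigning columns inside a loop.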

How to implement a Slope graph in R for two variables

I'm analyzing how many users have used a particular hashtag and how they have contributed to the total number of tweets. My results are:
Data:
20.68% of tweets related to #HashtagX are created by 20 users. These 20 users represent only about 0.14% of the 14,432 users who have ever used the hashtag #HashtagX.
What happens if we take the top 100 users by number of tweets? 44% of tweets are created by the top 100 users.
If we extend to the top 500 users by number of tweets, we see that 72% of tweets are created by the top 500.
I am wondering how to implement a slope graph, because I think it is a good way to show the relationship between both variables, but it is not a default graph provided by any library.
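The percentages quoted above can be checked with a few lines of arithmetic. This Python sketch assumes the total of 14,432 users from the question and takes the tweet counts (26797, 58211, 93259 out of 129529) from the Tweets_N vector in the answer; the variable names are only for illustration.

```python
total_users = 14432    # users who ever used the hashtag (from the question)
total_tweets = 129529  # last entry of Tweets_N in the answer

# top-N users -> tweets written by those users (from Tweets_N)
tweets_by_top = {20: 26797, 100: 58211, 500: 93259}

shares = {}
for top_n, tweets in tweets_by_top.items():
    user_share = 100 * top_n / total_users     # % of all users
    tweet_share = 100 * tweets / total_tweets  # % of all tweets
    shares[top_n] = (user_share, tweet_share)

for top_n, (user_share, tweet_share) in shares.items():
    print(f"top {top_n}: {user_share:.2f}% of users, {tweet_share:.2f}% of tweets")
```

These two percentages per group are what the slope chart connects: the share of users on the left axis and the share of tweets on the right.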
One of the ways to show the relationship between both variables ("Users" vs "Tweets") is a Slope Chart.
Visualization obtained (solved graph for the question):
[Image: Slope Chart]
1) Libraries
library(ggplot2)
library(scales)
library(ggrepel)
theme_set(theme_classic())
2) Data example
Country = c('20 accounts', '50 accounts', '100 accounts','200 accounts','300 accounts',
'500 accounts','1000 accounts','14.443 accounts')
January = c(0.14, 0.34, 0.69,1.38,2.07,3.46,6.92,100)
April = c(20.68, 33.61, 44.94, 57.49,64.11,72,80,100)
Tweets_N = c(26797, 43547, 58211, 74472,83052,93259,103898,129529)
a = data.frame(Country, January, April)
left_label <- paste(a$Country, paste0(a$January,"%"),sep=" | ")
right_label <- paste(paste0(round(a$April),"%"),paste0(Tweets_N," tweets"),sep=" | ")
a$color_class <- "green"
3) Plot
p <- ggplot(a) + geom_segment(aes(x=1, xend=2, y=January, yend=April, col=color_class), size=.25, show.legend=F) +
geom_vline(xintercept=1, linetype="dashed", size=.1) +
geom_vline(xintercept=2, linetype="dashed", size=.1) +
scale_color_manual(labels = c("Up", "Down"),
values = c("blue", "red")) +
labs(
x="", y = "Percentage") +
xlim(.5, 2.5) + ylim(0,(1.1*(max(a$January, a$April)))) # X and Y axis limits
# Add texts
p <- p + geom_text_repel(label=left_label, y=a$January, x=rep(1, NROW(a)), hjust=1.1, size=3.5,direction = "y")
p <- p + geom_text(label=right_label, y=a$April, x=rep(2, NROW(a)), hjust=-0.1, size=3.5)
p <- p + geom_text(label="Accounts", x=1, y=1.1*(max(a$January, a$April)), hjust=1.2, size=4, check_overlap = TRUE) # title
p <- p + geom_text(label="Tweets (% of Total)", x=2, y=1.1*(max(a$January, a$April)), hjust=-0.1, size=4, check_overlap = TRUE) # title
# Minify theme
p + theme(panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
panel.border = element_blank(),
plot.margin = unit(c(1,2,1,2), "cm"))

How can I sum up functions that are made of elements of the imported dataset?

See the code and error. I have already tried Do, For,...and it is not working.
CODE + Error from Mathematica:
Import of survival probabilities _{k}p_x and _{k}p_y (calculated in excel)
px = Import["C:\Users\Eva\Desktop\kpx.xlsx"];
px = Flatten[Take[px, All], 1];
NOTE: The probability _{k}p_x can be found at position px[[k+2, x - 16]]
i = 0.04;
v = 1/(1 + i);
JointLifeIndep[x_, y_, n_] = Sum[v^k*px[[k + 2, x - 16]]*py[[k + 2, y - 16]], {k , 0, n - 1}]
Part::pkspec1: The expression 2+k cannot be used as a part specification.
Part::pkspec1: The expression 2+k cannot be used as a part specification.
Part::pkspec1: The expression 2+k cannot be used as a part specification.
General::stop: Further output of Part::pkspec1 will be suppressed during this calculation.
Part of dataset (left corner of the dataset):
k\x 18 19 20
0 1 1 1
1 0.999478086278185 0.999363078716059 0.99927911905056
2 0.998841497412202 0.998642656911039 0.99858030519133
3 0.998121451605207 0.99794428814123 0.99788275311401
4 0.997423447323642 0.997247180349674 0.997174407432264
5 0.996726703362208 0.996539285828369 0.996437857252448
6 0.996019178300768 0.995803204773039 0.99563600297737
7 0.995283481416241 0.995001861216016 0.994823584922968
8 0.994482556091416 0.994189960607964 0.99405569519175
9 0.993671079225432 0.99342255996206 0.993339856748282
10 0.992904079096455 0.992707177451333 0.992611817294026
11 0.992189069953677 0.9919796017009 0.991832027835091
Without having the exact same data files to work with it is often easy for each of us to make mistakes that the other cannot reproduce or understand.
From your snapshot of your data set I used Export in Mathematica to try to reproduce your .xlsx file. Then I tried the following
px = Import["kpx.xlsx"];
px = Flatten[Take[px, All], 1];
py = px; (* fake some py data *)
i = 0.04;
v = 1/(1 + i);
JointLifeIndep[x_, y_, n_] := Sum[v^k*px[[k+2,x-16]]*py[[k+2,y-16]], {k,0,n-1}];
JointLifeIndep[17, 17, 12]
and it displays 362.402
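Notice I used := (SetDelayed) instead of = (Set) in my definition of JointLifeIndep. := and = do different things in Mathematica. = evaluates the right-hand side immediately, while x, y, n and therefore k are still symbols, so Part is handed the symbolic index 2+k. := postpones evaluation until the function is called with actual numbers. This is very likely the reason that you are getting the Part::pkspec1 error.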
You should also be careful with your subscript values and make sure that every subscript is between 1 and the number of rows (or columns) in your matrix.
So see if you can try this example with an Excel sheet containing only the snapshot of data that you showed and see if you get the same result that I do.
Hopefully that will be enough for you to make progress.
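For comparison, the same sum can be written in Python, where the function body is only evaluated when it is called, which is the behaviour := gives in Mathematica. This is a sketch under stated assumptions: only the first four rows of the snapshot (k = 0..3, ages 18 to 20) are copied in, KPY reuses the KPX values as a stand-in just as the answer fakes py, and the names KPX, KPY and joint_life_indep are chosen for this example.

```python
# Survival probabilities k_p_x from the snapshot, keyed by age x, indexed by k
KPX = {
    18: [1.0, 0.999478086278185, 0.998841497412202, 0.998121451605207],
    19: [1.0, 0.999363078716059, 0.998642656911039, 0.99794428814123],
    20: [1.0, 0.99927911905056, 0.99858030519133, 0.99788275311401],
}
KPY = KPX  # stand-in for the second life, as in the answer


def joint_life_indep(x, y, n, i=0.04):
    """Sum of v^k * k_p_x * k_p_y for k = 0 .. n-1, with v = 1 / (1 + i)."""
    v = 1 / (1 + i)
    return sum(v**k * KPX[x][k] * KPY[y][k] for k in range(n))


value = joint_life_indep(18, 19, 3)
```

In the snapshot layout, column 1 holds the k labels, so px[[k + 2, x - 16]] selects real probabilities only for x >= 18; keying the Python table by age sidesteps that offset.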

Total distance of route using Leaflet routing machine in rMaps/rCharts

I would like to produce a shiny app that asks for two addresses, maps an efficient route, and calculates the total distance of the route. This can be done using the Leaflet Routing Machine using the javascript library, however I would like to do a bunch of further calculations with the distance of the route and have it all embedded in a shiny app.
You can produce the map using rMaps by following this demo by Ramnathv here. But I'm not able to pull out the total distance travelled even though I can see that it has been calculated in the legend or controller. There exists another discussion on how to do this using the javascript library - see here. They discuss using this javascript code:
alert('Distance: ' + routes[0].summary.totalDistance);
Here is my working code for the rMap. If anyone has any ideas for how to pull out the total distance of a route and store it, I would be very grateful. Thank you!
# INSTALL DEPENDENCIES IF YOU HAVEN'T ALREADY DONE SO
library(devtools)
install_github("ramnathv/rCharts@dev") # "@dev" selects the dev branch; "#" would refer to a pull request
install_github("ramnathv/rMaps")
# CREATE FUNCTION to convert address to coordinates
library(RCurl)
library(RJSONIO)
construct.geocode.url <- function(address, return.call = "json", sensor = "false") {
root <- "http://maps.google.com/maps/api/geocode/"
u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
return(URLencode(u))
}
gGeoCode <- function(address,verbose=FALSE) {
if(verbose) cat(address,"\n")
u <- construct.geocode.url(address)
doc <- getURL(u)
x <- fromJSON(doc)
if(x$status=="OK") {
lat <- x$results[[1]]$geometry$location$lat
lng <- x$results[[1]]$geometry$location$lng
return(c(lat, lng))
} else {
return(c(NA,NA))
}
}
# GET COORDINATES
x <- gGeoCode("Vancouver, BC")
way1 <- gGeoCode("645 East Hastings Street, Vancouver, BC")
way2 <- gGeoCode("2095 Commercial Drive, Vancouver, BC")
# PRODUCE MAP
library(rMaps)
map = Leaflet$new()
map$setView(c(x[1], x[2]), 16)
map$tileLayer(provider = 'Stamen.TonerLite')
mywaypoints = list(c(way1[1], way1[2]), c(way2[1], way2[2]))
map$addAssets(
css = "http://www.liedman.net/leaflet-routing-machine/dist/leaflet-routing-machine.css",
jshead = "http://www.liedman.net/leaflet-routing-machine/dist/leaflet-routing-machine.js"
)
routingTemplate = "
<script>
var mywaypoints = %s
L.Routing.control({
waypoints: [
L.latLng.apply(null, mywaypoints[0]),
L.latLng.apply(null, mywaypoints[1])
]
}).addTo(map);
</script>"
map$setTemplate(
afterScript = sprintf(routingTemplate, RJSONIO::toJSON(mywaypoints))
)
# map$set(width = 800, height = 800)
map
You can easily create a route via the google maps api. The returned data frame will have distance info. Just sum up the legs for total distance.
library(ggmap)
x <- gGeoCode("Vancouver, BC")
way1txt <- "645 East Hastings Street, Vancouver, BC"
way2txt <- "2095 Commercial Drive, Vancouver, BC"
route_df <- route(way1txt, way2txt, structure = 'route')
dist<-sum(route_df[,1],na.rm=T) # total distance in meters
#
qmap(c(x[2],x[1]), zoom = 12) +
geom_path(aes(x = lon, y = lat), colour = 'red', size = 1.5, data = route_df, lineend = 'round')