I am struggling to understand how I am going to create the following in GraphX in Apache Spark. I am given the following:
an HDFS file with lots of data that comes in the form:
node: ConnectingNode1, ConnectingNode2..
For example:
123214: 521345, 235213, 657323
I need to somehow store this data in an EdgeRDD so that I can create my graph in GraphX, but I have no idea how to go about this.
After you read your HDFS source and have your data in an RDD, you can try something like the following:
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.Edge
// Sample data
val rdd = sc.parallelize(Seq("1: 1, 2, 3", "2: 2, 3"))
val edges: RDD[Edge[Int]] = rdd.flatMap { row =>
  // split around ":"
  val splitted = row.split(":").map(_.trim)
  // the value to the left of ":" is the source vertex
  val srcVertex = splitted(0).toLong
  // the values to the right of ":" are split around "," to get the other vertices
  val otherVertices = splitted(1).split(",").map(_.trim)
  // for each vertex to the right of ":", create an Edge connecting it to srcVertex
  otherVertices.map(v => Edge(srcVertex, v.toLong, 1))
}
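For the sample data above, collecting the resulting RDD should produce one Edge per destination vertex (a quick sanity check; the ordering may vary):
edges.collect().foreach(println)
// Edge(1,1,1)
// Edge(1,2,1)
// Edge(1,3,1)
// Edge(2,2,1)
// Edge(2,3,1)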
Edit
Additionally, if all your vertices can take the same default attribute, you can create your graph straight from the edges, so you don't need to build a separate vertices RDD:
import org.apache.spark.graphx.Graph
val g = Graph.fromEdges(edges, defaultValue = 1)
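Putting it together, a minimal end-to-end sketch might look like this; the HDFS path is a placeholder, and the parsing assumes the exact "node: a, b, c" format shown in the question:
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// hypothetical HDFS location; replace with your actual file
val lines = sc.textFile("hdfs:///path/to/nodes.txt")

val edgeRdd: RDD[Edge[Int]] = lines.flatMap { row =>
  val parts = row.split(":").map(_.trim)
  val src = parts(0).toLong
  parts(1).split(",").map(_.trim).map(dst => Edge(src, dst.toLong, 1))
}

// every vertex gets the same default attribute (1)
val graph = Graph.fromEdges(edgeRdd, defaultValue = 1)
println(graph.numEdges)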
Help me write a Scala program to convert a given date & time to GMT based on an airport code.
The input will have 3 arguments.
For example, for Chennai:
2021-07-26, 00:23, MAA (3 arguments)
The output will have two arguments:
2021-07-25, 18:54
The Chennai time 26-07-2021 00:23 equals the GMT time 25-07-2021 18:54.
You will first require a mapping of AirportCode -> TimeZone (which can give you the UTC offset) or directly AirportCode -> UTC offset.
Then you can use this timezone/offset to convert the given date-time to GMT.
Here is a file containing airport code to timezone mappings for 20,000+ airports - https://raw.githubusercontent.com/opentraveldata/opentraveldata/master/opentraveldata/optd_por_public.csv
You can process this file and create that map of airport code to time zone.
Once you have that map (let's name it airportCodeToTimezoneMap), you can use the following code.
import java.time.format.DateTimeFormatter
import java.time.{LocalDateTime, ZoneId}
val airportCodeToTimezoneMap = Map("MAA" -> "Asia/Kolkata")
val inputString = "2021-07-26, 00:23, MAA"
val Array(date, time, airportCode) = inputString.split(",").map(_.strip())
// prepare ZoneId's
val airportTimezoneNameOption = airportCodeToTimezoneMap.get(airportCode)
// .get will throw if the airport code is missing from the map; handle the None case in real code
val airportTimezoneName = airportTimezoneNameOption.get
val airportZoneId = ZoneId.of(airportTimezoneName)
val gmtZoneId = ZoneId.of("GMT")
// parse the local date-time and shift it to GMT
val dateTimeString = s"${date}T${time}"
val airportLocalDateTime = LocalDateTime.parse(dateTimeString, DateTimeFormatter.ISO_LOCAL_DATE_TIME)
val airportZonedDateTime = airportLocalDateTime.atZone(airportZoneId)
val gmtZonedDateTime = airportZonedDateTime.withZoneSameInstant(gmtZoneId)
// gmtZonedDateTime: java.time.ZonedDateTime = 2021-07-25T18:53Z[GMT]
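To produce the two output arguments in the shape shown in the question (date, then time), you can format the converted value back out; the pattern below is an assumption based on the sample output:
val outputFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd, HH:mm")
val output = gmtZonedDateTime.format(outputFormatter)
// output: String = 2021-07-25, 18:53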
I need help with the below use case:
Question 1: My RDD is of the below format. Now, from this RDD, I want to exclude the rows where airport.code is in ("PUN","HAR","KAS").
case class airport(code:String,city:String,airportname:String)
val airportRdd=sparkSession.sparkContext.textFile("src/main/resources/airport_data.csv").
map(x=>x.split(","))
val airPortRddTransformed=airportRdd.map(x=>airport(x(0),x(1),x(2)))
val trasnformedRdd=airPortRddTransformed.filter(air=>!(air.code.contains(seqValues:_*)))
But ! is not working. It says cannot resolve symbol !. Can someone please help me? How do I negate in an RDD? I have to use the RDD approach only.
Also, another question:
Question 2: The data file has 70 columns. I have a column sequence:
val seqColumns=List("lat","longi","height","country")
I want to exclude these columns while loading the RDD. How can I do it? My production RDD has 70 columns, and I only know the names of the columns to exclude, not the index of every column. Again, I'm looking for an RDD approach; I'm aware of how to do it with the DataFrame approach.
Question 1
Use a broadcast variable to pass the list of values to the filter function. It seems _* in filter does not work here, so I changed the condition to !seqValues.value.contains(air.code)
Data sample: airport_data.csv
C001,Pune,Pune Airport
C002,Mumbai,Chhatrapati Shivaji Maharaj International Airport
C003,New York,New York Airport
C004,Delhi,Delhi Airport
Code snippet
case class airport(code:String,city:String,airportname:String)
val seqValues=spark.sparkContext.broadcast(List("C001","C003"))
val airportRdd = spark.sparkContext.textFile("D:\\DataAnalysis\\airport_data.csv").map(x=>x.split(","))
val airPortRddTransformed = airportRdd.map(x=>airport(x(0),x(1),x(2)))
//airPortRddTransformed.foreach(println)
val trasnformedRdd = airPortRddTransformed.filter(air => !seqValues.value.contains(air.code))
trasnformedRdd.foreach(println)
Output ->
airport(C002,Mumbai,Chhatrapati Shivaji Maharaj International Airport)
airport(C004,Delhi,Delhi Airport)
Things I would change:
1- You are reading the .csv as a plain text file and then splitting the lines on ",". You can save this step by reading the file directly like:
val df = spark.read.csv("src/main/resources/airport_data.csv")
2- Reverse the contains check so that contains is called on the sequence of values rather than on the airport code:
val trasnformedRdd = airPortRddTransformed.filter(air => !(seqValues.contains(air.code)))
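For Question 2, one possible RDD-only approach (a sketch, assuming the file's first line is a header row from which the indices of the named columns can be resolved) would be:
val seqColumns = List("lat", "longi", "height", "country")

val rawRdd = spark.sparkContext.textFile("src/main/resources/airport_data.csv")
// assumption: the first line names all 70 columns
val headerLine = rawRdd.first()
val dropIndexes = headerLine.split(",").zipWithIndex.collect {
  case (name, idx) if seqColumns.contains(name) => idx
}.toSet

val withoutColumns = rawRdd
  .filter(_ != headerLine)          // drop the header row
  .map(_.split(","))
  .map(cols => cols.zipWithIndex.collect {
    case (value, idx) if !dropIndexes.contains(idx) => value
  })
If the file has no header, you would need some other source for the column names and their positions.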
How can I create a key/value-array pair in Scala? By this I mean that, in place of the value, I need an array.
val newRdd1 = rdd1.flatMap(x=>x.split(" "))
.map({case (key, Array(String)) => Array(String) })
You can achieve it using map(); the approach is the same in a plain Scala program or within a Spark context.
For example, say you have a list of strings:
var sRec = List("key1,a1,a2,a3", "key2,b1,b2,b3", "key3,c1,c2,c3")
You can split it and convert it to key/value (array of strings) pairs, assuming the key is in position 0, using:
sRec.map(x => (x.split(",")(0), Array(x.split(",")(1), x.split(",")(2), x.split(",")(3)))).
foreach(println)
This prints the arrays with their default toString:
(key1,[Ljava.lang.String;@7a81197d)
(key2,[Ljava.lang.String;@5ca881b5)
(key3,[Ljava.lang.String;@24d46ca6)
If you want to read a particular array element by key:
sRec.map(x => (x.split(",")(0),Array(x.split(",")(1), x.split(",")(2), x.split(",")(3)))).
map(x => (x._1, x._2(0))).foreach(println)
Output:
(key1,a1)
(key2,b1)
(key3,c1)
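As a side note, the snippet above splits each string three times. A sketch that splits only once and keeps everything after the key as the value array (so it is not tied to exactly three elements) could look like:
val keyed = sRec.map { line =>
  val parts = line.split(",")
  (parts.head, parts.tail)   // (key, Array of the remaining fields)
}
keyed.foreach { case (k, v) => println((k, v.mkString(","))) }
// (key1,a1,a2,a3)
// (key2,b1,b2,b3)
// (key3,c1,c2,c3)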
Assuming I have the following Scala classes:
Human(id: String, task: Task)
Task(id: String, time: Duration)
And having a List[(Human, Task)] with the following elements:
("H2", Task("T3", 5 minute))
("H3", Task("T1", 10 minute))
("H1", Task("T1", 10 minute))
("H1", Task("T2", 5 minute))
Now I want to functionally check whether adjacent elements have the same duration, and if so, order them by the human id.
In this case, the final list would have the elements sorted like so:
("H2", Task("T3", 5 minute))
("H1", Task("T1", 10 minute))
("H3", Task("T1", 10 minute))
("H1", Task("T2", 5 minute))
I tried to use sortBy, but the way I'm doing it, the final list ends up fully ordered by the Human ID, without comparing the times.
Does anyone have any idea how I can do this?
Your question is a bit confused. You say you have a List of (Human,Task) tuples, but then you describe a collection of (String,Task) tuples.
Here's a way to sort a List[Human] according to the rules you've described.
def sortHumans(hs: List[Human]): List[Human] =
  if (hs.isEmpty) Nil
  else {
    // all leading humans whose task has the same duration form one group,
    // which is sorted by id; the remainder is handled recursively
    val target = hs.head.task.time
    hs.takeWhile(_.task.time == target).sortBy(_.id) ++
      sortHumans(hs.dropWhile(_.task.time == target))
  }
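A minimal usage sketch, assuming case-class definitions along the lines of the question and scala.concurrent.duration for the times:
import scala.concurrent.duration._

case class Task(id: String, time: Duration)
case class Human(id: String, task: Task)

val humans = List(
  Human("H2", Task("T3", 5.minutes)),
  Human("H3", Task("T1", 10.minutes)),
  Human("H1", Task("T1", 10.minutes)),
  Human("H1", Task("T2", 5.minutes))
)

sortHumans(humans).foreach(println)
// Human(H2,Task(T3,5 minutes))
// Human(H1,Task(T1,10 minutes))
// Human(H3,Task(T1,10 minutes))
// Human(H1,Task(T2,5 minutes))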
val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text = sc.textFile("/home/tobbyj/HW1_INF553/shortTwitter.txt")
val twitter = text
.map(_.toLowerCase)
.map(_.replace("\t", ""))
.map(_.replace("\"", ""))
.map(_.replace("\n", ""))
.map(_.replace(".", ""))
.map(_.replaceAll("[\\p{C}]", ""))
.map(_.split("text:")(1).split(",source:")(0))
.zipWithIndex.map(_.swap)
Using the above code I get the results below.
(0,a rose by any other name would smell as sweet)
(1,a rose is a rose is a rose)
(4,rt #nba2k: the battle of two young teams tough season but one will emerge victorious who will it be? lakers or 76ers? https:\/\/tco\/nukkjq\u2026)
(2,love is like a rose the joy of all the earth)
(5,i was going to bake a cake and listen to the football flour refund?)
(3,at christmas i no more desire a rose than wish a snow in may’s new-fangled mirth)
However, the result I want has the 'key' starting from 1 and the 'value' separated into words, like below for your understanding (even though I'm not sure it will look exactly like this):
(1,(a, rose, by, any, other, name, would, smell, as, sweet))
(2,(a, rose, is, a, rose, is, a, rose))
...
The code I tried is
.map{case(key, value)=>(key+1, value.split(" "))}
but it gives me the results below:
(1,[Ljava.lang.String;@1dff58b)
(2,[Ljava.lang.String;@167179a3)
(3,[Ljava.lang.String;@73e8c7d7)
(4,[Ljava.lang.String;@7bffa418)
(5,[Ljava.lang.String;@2d385beb)
(6,[Ljava.lang.String;@4f1ab87e)
Any suggestions? After this step, I am going to map them like (1, a), (1, rose), (1, by)...(2, love), (2, rose), ....
You can import org.apache.spark.rdd.PairRDDFunctions (documented here) to work more easily with key-value pairs.
At that point, you can use the flatMapValues method to obtain what you want; here is a minimal working example (just copy from the line containing val tweets if you are in the Spark console):
import org.apache.spark._
import org.apache.spark.rdd.PairRDDFunctions
val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val sc = new SparkContext(conf)
val tweets = sc.parallelize(Seq(
"this is my first tweet",
"and this is my second",
"ok this is getting boring"))
val results =
tweets.
zipWithIndex.
map(_.swap).
flatMapValues(_.split(" "))
results.collect.foreach(println)
This is the output of these few lines of code:
(0,this)
(0,is)
(0,my)
(0,first)
(0,tweet)
(1,and)
(1,this)
(1,is)
(1,my)
(1,second)
(2,ok)
(2,this)
(2,is)
(2,getting)
(2,boring)
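Since you also want the keys to start at 1 rather than 0, a small variation (same idea, just shifting the index before flattening) would be:
val resultsFromOne =
  tweets.
    zipWithIndex.
    map { case (line, idx) => (idx + 1, line) }.
    flatMapValues(_.split(" "))
resultsFromOne.collect.foreach(println)
// (1,this)
// (1,is)
// ...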
If you are interested in seeing a small example showing how to analyze a live Twitter feed with Spark Streaming you can find one here.