Apache-Spark: method in foreach doesn't work

Apache-Spark: method in foreach doesn't work - scala

I read file from HDFS, which contains x1,x2,y1,y2 representing a envelope in JTS.
I would like to use those data to build STRtree in foreach.
val inputData = sc.textFile(inputDataPath).cache()
val strtree = new STRtree
inputData.foreach(line => {val array = line.split(",").map(_.toDouble);val e = new Envelope(array(0),array(1),array(2),array(3)) ;
println("envelope is " + e);
strtree.insert(e,
new Rectangle(array(0),array(1),array(2),array(3)))})
As you can see, I also print the e object.
To my surprise, when I log the size of strtree, it is zero! It seems that insert method make no senses here.
By the way, if I write hard code some test data line by line, the strtree can be built well.
One more thing, those project is packed into jar and submitted in the spark-shell.
So, why does the method in foreach not work ?

You will have to collect() to do this:
inputData.collect().foreach(line => {
... // your code
})
You can do this (for avoiding collecting all data):
val pairs = inputData.map(line => {
val array = line.split(",").map(_.toDouble);
val e = new Envelope(array(0),array(1),array(2),array(3)) ;
println("envelope is " + e);
(e, new Rectangle(array(0),array(1),array(2),array(3)))
}
pairs.collect().foreach(pair => {
strtree.insert(pair._1, pair._2)
}

Use .map() instead of .foreach() and reassign the outcome.
Foreach does not return the outcome of applyied function. It can be used for sending data somewhere, storing to db, printing, and so on.

Related

SLICK: To insert value in 3 different tables using one slick api call

The question is related to slick:
I have three tables:
1) Users
2)Team2members
3)Team2Owners
In my post request to users, I am passing values of memberOf and managerOf, these values will be inserted in the tables Team2members and Team2Owners respectively and not in Users table. Though other values of the post request will be inserted in 'Users' table.
My Post request looks like below:
{"kind": "via#user",
"userReference":{"userId":"priya16"},
"user":"preferredNameSpecialChar#domain1.com","memberOf":{"teamReference":{"organizationId":"airtel","teamId":"supportteam"}},
"managerOf":{"teamReference":{"organizationId":"airtel","teamId":"supportteam"}},
"firstName":"Special_fn1",
"lastName":"specialChar_ln1",
"preferredName":[{"locale":"employee1","value":"##$%^&*(Z0FH"}],
"description":" preferredNameSpecialChar test "}
I am forming the query which is shown below:
The query seems to work fine when only memberInsert is defined, when I try to define both the values i.e.memberInsert and managerInsert then insertion happens only for second value.
val query = config.api.customerTableDBIO(apiRequest.parameters.organizationId).flatMap { tables =>
val userInsert = tables.Users returning tables.Users += empRow
val memberInsert = inputObject.memberOf.map(m => m.copy(teamReference = m.teamReference.copy(organizationId = apiRequest.parameters.organizationId))).map { r =>
for {
team2MemberRow <- tables.Team2members returning tables.Team2members += Teams2MembersEntity.fromEmtToTeams2Members(r, empRow.id)
team <- tables.Teams.filter(_.id === r.teamReference.teamId.toLowerCase).map(_.friendlyName).result.headOption
} yield (team2MemberRow, team)
}
val managerInsert = inputObject.managerOf.map(m => m.copy(teamReference = m.teamReference.copy(organizationId = apiRequest.parameters.organizationId))).map { r =>
for {
team2OwnerRow <- tables.Team2owners returning tables.Team2owners += Teams2OwnersEntity.fromEmtToTeam2owners(r, empRow.id)
team <- tables.Teams.filter(_.id === r.teamReference.teamId.toLowerCase).map(_.friendlyName).result.headOption
} yield (team2OwnerRow, team)
}
userInsert.flatMap { userRow =>
val user = UserEntity.fromDbEntity(userRow)
if (memberInsert.isDefined) memberInsert.get
.map(r => user.copy(memberOf = Some(Teams2MembersEntity.fromEmtToMemberRef(r._1, r._2.map(TeamEntity.toApiFriendlyName).getOrElse(List.empty)))))
else DBIO.successful(user)
if (managerInsert.isDefined) managerInsert.get
.map(r => user.copy(managerOf = Some(Teams2OwnersEntity.fromEmtToManagerRef(r._1, r._2.map(TeamEntity.toApiFriendlyName).getOrElse(List.empty)))))
else DBIO.successful(user)
}
}

The query seems to work fine when only memberInsert is defined, when I try to define both the values i.e.memberInsert and managerInsert then insertion happens only for second value.
The problem looks to be with the final call to flatMap.
That should return a DBIO[T]. However, your expression generates a DBIO[T] in various branches, but only one value will be returned from flatMap. That would explain why you don't see all the actions being run.
Instead, what you could do is assign each step to a value and sequence them. There are lots of ways you could do that, such as using DBIO.seq or andThen.
Here's a sketch of one approach that might work for you....
val maybeInsertMemeber: Option[DBIO[User]] =
member.map( your code for constructing an action here )
val maybeInsertManager Option[DBIO[User]] =
manager.map( your code for constructing an action here )
DBIO.sequenceOption(maybeInsertMember) andThen
DBIO.sequenceOption(maybeInsertManager) andThen
DBIO.successful(user)
The result of that expression is a DBIO[User] which combines three queries together.

write immutable code for storing data in listBuffer in scala

I have the below code where I am using a mutable list buffer to store files recieved from kafka consumer , and then when the list size reached 15 I insert them into cassandra .
But Is their any way to do the same thing using immutable list.
val filesList = ListBuffer[SystemTextFile]()
storeservSparkService.configFilesTopicInBatch.subscribe.atLeastOnce(Flow[SystemTextFile].mapAsync(4) { file: SystemTextFile =>
filesList += file
if (filesList.size == 15) {
storeServSystemRepository.config.insertFileInBatch(filesList.toList)
filesList.clear()
}
Future(Done)
})

Something along these lines?
Flow[SystemTextFile].grouped(15).mapAsync(4){ files =>
storeServSystemRepository.config.insertFileInBatch(files)
}

Have you tried using Vector?
val filesList = Vector[SystemTextFile]()
storeservSparkService.configFilesTopicInBatch.subscribe.
atLeastOnce(Flow[SystemTextFile].mapAsync(4) { file: SystemTextFile =>
filesList = filesList :+ file
if (filesList.length == 15) {
storeServSystemRepository.config.insertFileInBatch(filesList.toList)
}
Future(Done)
})

Simplify Scala loop to one line

How do I simplify this loop to some function like foreach or map or other thing with Scala? I want to put hitsArray inside that filter shipList.filter.
val hitsArray: Array[String] = T.split(" ");
for (hit <- hitsArray) {
shipSize = shipList.length
shipList = shipList.filter(!_.equalsIgnoreCase(hit))
}
if (shipList.length == 0) {
shipSunk = shipSunk + 1
} else if (shipList.length < shipSize) {
shipHit = shipHit + 1
}

To be fair, I don't understand why you are calling shipSize = shipList.length as you don't use it anywhere.
T.split(" ").foreach{ hit =>
shipList = shipList.filter(!_.equalsIgnoreCase(hit))
}
which gets you to where you want to go. I've made it 3 lines because you want to emphasize you're working via side effect in that foreach. That said, I don't see any advantage to making it a one-liner. What you had before was perfectly readable.

Something like this maybe?
shipList.filter(ship => T.split(" ").forall(!_.equalsIgnoreCase(ship)))
Although cleaner if shipList is already all lower case:
shipList.filterNot(T.split(" ").map(_.toLowerCase) contains _)
Or if your T is large, move it outside the loop:
val hits = T.split(" ").map(_.toLowerCase)
shipList.filterNot(hits contains _)

Scala Set different operations

So here is my code snippet:
val check = collection.mutable.Set[String]()
if (check.isEmpty) {
println("There is no transfer data available yet, please use the 'load' command to initialize the application!")
}
val userSelection = scala.io.StdIn.readLine("Enter your command or type 'help' for more information:")
val loadCommand = """^(?:load)\s+(.*)$""".r
userSelection match {
case loadCommand(fileName) => check add fileName ;print(check);val l= loadOperation(fileName);println(l.length); if (check.contains(fileName)){
programSelector()
}else{ loadOperation(fileName)}
So first of all I have an input with the match and one case is the "load" input.
Now my program is able to load different .csv files and store it to a List. However I want my program whenever it loads a new .csv file to check if it was loaded previously. If it was loaded before, it should not be loaded twice!!
So i thought it could work with checking with an Set but it always overwrites my previous fileName entry...

Your set operations are fine. The issue is the check add fileName straight after case loadCommand(fileName) should be in the else block. (One of those "d'oh" moments that we have now and then!?) :)
Update
Code
val check = collection.mutable.Set[String]()
def process(userSelection: String): String = {
val loadCommand = """^(?:load)\s+(.*)$""".r
userSelection match {
case loadCommand(fileName) =>
if (check.contains(fileName))
s"Already processed $fileName"
else {
check add fileName
s"Process $fileName"
}
case _ =>
"User selection not recognised"
}
}
process("load 2013Q1.csv")
process("load 2013Q2.csv")
process("load 2013Q1.csv")
Output
res1: String = Process 2013Q1.csv
res2: String = Process 2013Q2.csv
res3: String = Already processed 2013Q1.csv

Inline parsing of IObservable<byte>

I have an observable query that produces an IObservable<byte> from a stream that I want to parse inline. I want to be able to use different strategies depending on the data source to parse discrete messages from this sequence. Bear in mind I am still on the upward learning curve of RX. I have come up with a solution, but am unsure if there is a way to accomplish this using out-of-the-box operators.
First, I wrote the following extension method to IObservable:
public static IObservable<IList<T>> Parse<T>(
this IObservable<T> source,
Func<IObservable<T>, IObservable<IList<T>>> parsingFunction)
{
return parsingFunction(source);
}
This allows me to specify the message framing strategy in use by a particular data source. One data source might be delimited by one or more bytes while another might be delimited by both start and stop block patterns while another might use a length prefixing strategy. So here is an example of the Delimited strategy that I have defined:
public static class MessageParsingFunctions
{
public static Func<IObservable<T>, IObservable<IList<T>>> Delimited<T>(T[] delimiter)
{
if (delimiter == null) throw new ArgumentNullException("delimiter");
if (delimiter.Length < 1) throw new ArgumentException("delimiter must contain at least one element.");
Func<IObservable<T>, IObservable<IList<T>>> parser =
(source) =>
{
var shared = source.Publish().RefCount();
var windowOpen = shared.Buffer(delimiter.Length, 1)
.Where(buffer => buffer.SequenceEqual(delimiter))
.Publish()
.RefCount();
return shared.Buffer(windowOpen)
.Select(bytes =>
bytes
.Take(bytes.Count - delimiter.Length)
.ToList());
};
return parser;
}
}
So ultimately, as an example, I can use the code in the following fashion to parse discrete messages from the sequence any time the byte pattern for the string '<EOF>' is encountered in the sequence:
var messages = ...operators that surface an IObservable<byte>
.Parse(MessageParsingFunctions.Delimited(Encoding.ASCII.GetBytes("<EOF>")))
...further operators to package discrete messages along with additional metadata
Questions:
Is there a more straight-forward way to accomplish this using just out of the box operators?
If not, would it be preferable to just define the different parsing functions (i.e. ParseDelimited, ParseLengthPrefixed, etc.) as local extensions instead of having a more generic Parse extension method that accepts a parsing function?
Thanks in advance!

Take a look at Rxx Parsers. Here's a related lab. For example:
IObservable<byte> bytes = ...;
var parsed = bytes.ParseBinary(parser =>
from next in parser
let magicNumber = parser.String(Encoding.UTF8, 3).Where(value => value == "RXX")
let header = from headerLength in parser.Int32
from header in next.Exactly(headerLength)
from headerAsString in header.Aggregate(string.Empty, (s, b) => s + " " + b)
select headerAsString
let message = parser.String(Encoding.UTF8)
let entry = from length in parser.Int32
from data in next.Exactly(length)
from value in data.Aggregate(string.Empty, (s, b) => s + " " + b)
select value
let entries = from count in parser.Int32
from entries in entry.Exactly(count).ToList()
select entries
select from _ in magicNumber.Required("The file's magic number is invalid.")
from h in header.Required("The file's header is invalid.")
from m in message.Required("The file's message is invalid.")
from e in entries.Required("The file's data is invalid.")
select new
{
Header = h,
Message = m,
Entries = e.Aggregate(string.Empty, (acc, cur) => acc + cur + Environment.NewLine)
});

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Apache-Spark: method in foreach doesn't work - scala

Use .map() instead of .foreach() and reassign the outcome. Foreach does not return the outcome of applyied function. It can be used for sending data somewhere, storing to db, printing, and so on.

Related

SLICK: To insert value in 3 different tables using one slick api call

write immutable code for storing data in listBuffer in scala

Simplify Scala loop to one line

Scala Set different operations

Inline parsing of IObservable<byte>

Categories

Resources