Access RDD elements by index - scala

I have an RDD as below and would like to access the elements within each row by its index number in a loop. Is this possible?
(98,(344,(Dead Man Walking (1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0)))
(50,(501,(Richard III (1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1)))
(1,(321,(Toy Story (1995),0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)))
So far, I have come up with the below code
val combgenr = mvnmcnt.map(x => (x._2._2.productIterator.foreach {
  var n = 4
  i => (
    if (i == "1") {
      println((x._1.toInt, x._2._1.toInt, x._2._2._1), n)
    })
    n += 1
}))
But I get an extra line, (), after each row in my result (below), and I don't know why.
((98,344,Dead Man Walking (1995)),13)
()
((50,501,Richard III (1995)),13)
((50,501,Richard III (1995)),22)
((50,501,Richard III (1995)),23)
()
((1,321,Toy Story (1995)),8)
((1,321,Toy Story (1995)),9)
((1,321,Toy Story (1995)),10)
()
Any ideas?
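A possible explanation, sketched in plain Scala rather than Spark: foreach returns Unit, so the map in the question turns every row into (), and if the resulting RDD is printed afterwards (an assumption, since that code is not shown) each row adds one () line. One way around it is to pair each field with its position using productIterator.zipWithIndex and return the matches instead of printing them. The names GenreIndexSketch, Row and flaggedPositions are hypothetical, a Seq stands in for the RDD, and the row has fewer flags than the real data.

```scala
object GenreIndexSketch {
  // Hypothetical stand-in for one row: (userId, (movieId, (title, flag, flag, ...)))
  type Row = (String, (String, Product))

  // Return one (userId, movieId, title, position) tuple per field equal to "1".
  // `offset` mirrors the starting value n = 4 used in the question.
  def flaggedPositions(row: Row, offset: Int = 4): List[(Int, Int, String, Int)] = {
    val (userId, (movieId, fields)) = row
    val title = fields.productElement(0).toString
    fields.productIterator.zipWithIndex.collect {
      case ("1", idx) => (userId.toInt, movieId.toInt, title, idx + offset)
    }.toList
  }

  def main(args: Array[String]): Unit = {
    // Shortened sample rows; the real rows carry 19 flags after the title.
    val rows: Seq[Row] = Seq(
      ("98", ("344", ("Dead Man Walking (1995)", "0", "0", "1"))),
      ("1", ("321", ("Toy Story (1995)", "1", "1", "0")))
    )
    // flatMap keeps only the matches, so no Unit values end up in the result.
    rows.flatMap(flaggedPositions(_)).foreach(println)
  }
}
```

On an RDD the same function could be passed to flatMap, which would leave an RDD of tuples with no () entries.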

Related

Spark iterate over dataframe rows, cells

(Spark beginner) I wrote the code below to iterate over the rows and columns of a data frame (Spark 2.4.0 + Scala 2.12).
I have computed the row and cell counts as a sanity check.
I was surprised to find that the method returns 0, even though the counters are incremented during the iteration.
To be precise: while the code is running, it prints messages showing that it has found
rows 10, 20, ..., 610, as expected;
cells 100, 200, ..., 15800, as expected.
After the iteration is done, it prints "Found 0 cells", and returns 0.
I understand that Spark is a distributed processing engine, and that code is not executed exactly as written - but how should I think about this code?
The row/cell counts were just a sanity check; in reality I need to loop over the data and accumulate some results, but how do I prevent Spark from zeroing out my results as soon as the iteration is done?
def processDataFrame(df: sql.DataFrame): Int = {
  var numRows = 0
  var numCells = 0
  df.foreach { row =>
    numRows += 1
    if (numRows % 10 == 0) println(s"Found row $numRows") // prints 10,20,...,610
    row.toSeq.foreach { c =>
      if (numCells % 100 == 0) println(s"Found cell $numCells") // prints 100,200,...,15800
      numCells += 1
    }
  }
  println(s"Found $numCells cells") // prints 0
  numCells
}

The closure passed to foreach is serialized and shipped to the executors, so each task increments its own copy of numRows and numCells, and the variables on the driver are never updated. Spark has accumulator variables, which provide functionality such as counting in a distributed environment. You can use the built-in long and double accumulators, and custom accumulator types can also be implemented quite easily.
Changing your counting variables to accumulators, as below, will give you the correct result.
val numRows = sc.longAccumulator("numRows Accumulator") // the name is only for debugging
val numCells = sc.longAccumulator("numCells Accumulator")
df.foreach { row =>
  numRows.add(1)
  // note: inside a task, value reflects only that task's partial count
  if (numRows.value % 10 == 0) println(s"Found row ${numRows.value}") // prints 10,20,...,610
  row.toSeq.foreach { c =>
    if (numCells.value % 100 == 0) println(s"Found cell ${numCells.value}") // prints 100,200,...,15800
    numCells.add(1)
  }
}
println(s"Found ${numCells.value} cells") // prints the total number of cells
numCells.value
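The idea behind the accumulator can be modelled without Spark: every task computes a partial result for its own partition, and the driver merges the partial results. The sketch below is plain Scala collections, not the Spark API; PartitionCountSketch, countPartition and countAll are hypothetical names, and each inner Seq stands in for one partition.

```scala
object PartitionCountSketch {
  // Hypothetical model of a row: a sequence of cells.
  type Row = Seq[Any]

  // Each "task" counts the rows and cells of its own partition and returns
  // the result, instead of mutating a variable owned by the driver.
  def countPartition(partition: Seq[Row]): (Long, Long) =
    partition.foldLeft((0L, 0L)) { case ((rows, cells), row) =>
      (rows + 1, cells + row.size)
    }

  // The "driver" merges the per-partition results into the totals.
  def countAll(partitions: Seq[Seq[Row]]): (Long, Long) =
    partitions.map(countPartition).foldLeft((0L, 0L)) {
      case ((r1, c1), (r2, c2)) => (r1 + r2, c1 + c2)
    }

  def main(args: Array[String]): Unit = {
    val partitions: Seq[Seq[Row]] = Seq(
      Seq(Seq(1, "a"), Seq(2, "b")),
      Seq(Seq(3, "c"))
    )
    val (numRows, numCells) = countAll(partitions)
    println(s"Found $numRows rows and $numCells cells")
  }
}
```

For results richer than a counter, the same shape (compute per partition, then merge) is usually easier to reason about than side effects inside foreach.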

ForAll in scala check skips some input and do not respect containers size

I am new to ScalaCheck and I want to test the following piece of my application. I want to generate 30 random events and 20 random instances and check whether my application code correctly computes a result.
// generate 30 random events
val eventGenerator: Gen[Event] = for {
  d <- Gen.oneOf[String](Seq("es1", "es2", "es3"))
  t <- Gen.choose[Long](minEvent.getTime, maxEvent.getTime)
  s <- Gen.oneOf[String](Seq("s1", "s2", "s3", "s4", "s5", "s6", "s7"))
} yield Event(d, t, s)
val eventsGenerator: Gen[List[VpSearchLog]] = Gen.containerOfN[List, VpSearchLog](30, eventGenerator)
// generate 20 random instances
val instanceGenerator: Gen[Instance] = for {
  d <- Gen.oneOf[String](Seq("es1", "es2", "es3"))
  t <- Gen.choose[Long](minInstance.getTime, maxInstance.getTime)
} yield Instance(d, new Timestamp(t))
val instancesGenerator: Gen[List[Instance]] = Gen.containerOfN[List, Instance](20, instanceGenerator)
val p: Prop = forAll(instancesGenerator, eventsGenerator) { (i, e) =>
  println(i.size)
  println(e.size)
  println()
  val instancesWithFeature = computeExpected(instance)
  isEqual(transform(instance), instancesWithFeature)
}
For some reason I see this in the stdout
20
15
20
7
20
3
20
1
20
0
starting to compute expected:
Basically it looks like forAll generates a couple of inputs of a certain size and then skips them. For some reason, it starts to compute things when one of the inputs has size 0, and only then starts the proper check. My questions are:
Why don't I get input of exactly the specified size when I use containerOfN or listOfN? How can I generate input like this?
Is it normal that forAll starts to explore the space of possible inputs and skips some of them? Am I missing something here? This behaviour is quite counterintuitive to me.

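You may need to use forAllNoShrink to avoid the known defect in ScalaCheck where shrinking fails to respect generators. The sizes 15, 7, 3, 1, 0 in your output look like shrinking at work: after a failing case, ScalaCheck retries with progressively smaller lists and ignores the size the generator asked for.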
val thirtyInts: Gen[List[Int]] =
  Gen.listOfN[Int](30, Gen.const(99))
val twentyLongs: Gen[List[Long]] =
  Gen.listOfN[Long](20, Gen.const(44L))
property("listOfN") = {
  Prop.forAllNoShrink(thirtyInts, twentyLongs) { (ii, ll) =>
    ii.size == 30 && ll.size == 20
  }
}

Functional version of a typical nested while loop

I hope this question pleases functional programming lovers. Could I ask for a way to translate the following fragment of code into a purely functional implementation in Scala, with a good balance between readability and execution speed?
Description: for each element in a sequence, produce a sub-sequence containing the elements that come after the current element (including itself) at a distance smaller than a given threshold. Once the threshold is crossed, it is not necessary to process the remaining elements.
def getGroupsOfElements(input : Seq[Element]) : Seq[Seq[Element]] = {
  val maxDistance = 10 // put any number you may
  var outerSequence = Seq.empty[Seq[Element]]
  for (index <- 0 until input.length) {
    var anotherIndex = index + 1
    var distance = input(index) - input(anotherIndex) // let assume a separate function for computing the distance
    var innerSequence = Seq(input(index))
    while (distance < maxDistance && anotherIndex < (input.length - 1)) {
      innerSequence = innerSequence ++ Seq(input(anotherIndex))
      anotherIndex = anotherIndex + 1
      distance = input(index) - input(anotherIndex)
    }
    outerSequence = outerSequence ++ Seq(innerSequence)
  }
  outerSequence
}

You know, this would be a ton easier if you added a description of what you're trying to accomplish along with the code.
Anyway, here's something that might get close to what you want.
def getGroupsOfElements(input: Seq[Element]): Seq[Seq[Element]] =
  input.tails.map(x => x.takeWhile(y => distance(x.head,y) < maxDistance)).toSeq
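A runnable variant of the tails/takeWhile idea, as a sketch only: the question leaves Element and the distance function unspecified, so Element = Int and an absolute-difference distance are assumptions here, and maxDistance is passed as a parameter. Since tails always ends with an empty sequence, the sketch filters that one out so there is exactly one group per input element.

```scala
object GroupsSketch {
  // Hypothetical stand-ins: the question does not define Element or distance.
  type Element = Int
  def distance(a: Element, b: Element): Int = math.abs(a - b)

  // For every element, collect the run of following elements (itself included)
  // whose distance to it stays below maxDistance.
  def getGroupsOfElements(input: Seq[Element], maxDistance: Int = 10): Seq[Seq[Element]] =
    input.tails
      .filter(_.nonEmpty) // tails ends with an empty sequence; drop it
      .map(tail => tail.takeWhile(e => distance(tail.head, e) < maxDistance))
      .toList

  def main(args: Array[String]): Unit =
    println(getGroupsOfElements(Seq(1, 5, 12, 30)))
}
```

Unlike the loop in the question, this version never indexes past the end of the input, and takeWhile stops scanning each tail as soon as the threshold is crossed.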

Using big unicode signs in java

I'm writing a program for school that should compress text. First I want to build a kind of dictionary from a large number of texts, to use for compression later.
My idea is that whenever I have 2 signs, I replace them with just 1. So first I build a TreeMap with all the pairs that occur in my String.
So for example: String s = "Hello";
He -> 1
el -> 1
ll -> 1
lo -> 1
At the end the values in my TreeMap differ, and at a given point I want to write a rule into my dictionary. For example:
He -> x
el -> y
lo -> z
So here is the point: I want to start the "new signs" at Unicode code point 65536 and increase it by 1 for every rule.
When I reanalyze my text into pairs, I think I get an error, but I am not sure about this.
TreeMap<String, Integer> map = new TreeMap<String, Integer>();
char[] text = s.toCharArray();
String signPair = "";
// search sign in map
for (int i = 0; i < s.length()-1; i++) {
    // check whether 1st sign > 65535 -> 2 chars
    if (Character.codePointAt(text, i) > 65535) {
        // check whether 2nd sign > 65535 -> 2 chars
        if (Character.codePointAt(text, i + 2) > 65535) {
            signPair = s.substring(i, i + 4);
            // compensate additional chars
            i += 2;
            // if not there
            if (!map.containsKey(signPair)) {
                // create key, set value to 1
                map.put(signPair, 1);
            } else {
                // key exists -> increase value by 1
                int value = map.get(signPair);
                value++;
                map.put(signPair, value);
            }
At the end, when I print my map to the console, I only get � signs followed by a second one, and later I also get a lot of 𐃰-type signs which I can't interpret. In my output text most signs are between 5000 and 60000; none is higher than 65535.
Is it wrong to inspect the chars and take substrings like this, or is it a mistake to get the code point from them?
Thanks for help!
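Some background that may help: a Java char is a 16-bit UTF-16 unit, so a code point above 65535 is stored as two chars (a surrogate pair), and looking at a single char never shows a value above 65535. One way to avoid the index arithmetic is to split the text into code points first and build the pairs from those. The sketch below does this in Scala using the same java.lang APIs (String.codePoints, Character.toChars); CodePointPairs and pairCounts are hypothetical names, and it only counts pairs, it does not build the dictionary rules.

```scala
import scala.collection.immutable.TreeMap

object CodePointPairs {
  // Split a string into code points, so that a supplementary character
  // (code point > 0xFFFF, stored as two chars) counts as one sign.
  def codePoints(s: String): Vector[Int] = s.codePoints().toArray.toVector

  // Count every pair of adjacent signs, keyed by the pair as a String.
  def pairCounts(s: String): TreeMap[String, Int] =
    codePoints(s)
      .sliding(2)
      .filter(_.size == 2) // a one-sign text yields a single short window
      .foldLeft(TreeMap.empty[String, Int]) { (counts, pair) =>
        val key = new String(pair.toArray, 0, pair.size)
        counts.updated(key, counts.getOrElse(key, 0) + 1)
      }

  def main(args: Array[String]): Unit = {
    // The first code point above 65535, as proposed in the question.
    val newSign = new String(Character.toChars(0x10000))
    println(pairCounts("Hello"))
    println(pairCounts("a" + newSign + "b").size)
  }
}
```

Whether such signs show up correctly on the console depends on the console's encoding and font, which is a separate matter from how the pairs are counted.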

coffeescript for loop refactoring

I am quite new to CoffeeScript and I am wondering if any more experienced users could suggest a refactoring for the following code:
splitCollection: =>
  maxLength = Math.ceil(@collection.length / 3)
  sets = Math.ceil(@collection.length / maxLength)
  start = 0
  for x in [1..sets]
    if x != sets
      @render new BusinessUnits(@collection.models.slice(start, (maxLength + start)))
    else
      @render new BusinessUnits(@collection.models.slice(start, (@collection.length)))
    start += maxLength
There does not appear to be a while loop in coffeescript which seems to suggest a better mechanism.
Any suggestions appreciated.

Looks like you are using Backbone.js, which includes Underscore.js, which has the groupBy function.
You could create a "bucketNumber" function:
bucketNumber = (value, index) ->
  Math.floor( index / @collection.length * 3 )
Then group your collection:
sets = @collection.groupBy bucketNumber
Now, assuming ten items, sets should look something like this:
{0: [{}, {}, {}, {}], 1: [{}, {}, {}], 2: [{}, {}, {}]}
From here, it becomes rather straightforward:
for bucketNumber, bucket of sets
  @render new BusinessUnits( bucket )
Here is a jsFiddle showing it in action

You don't need to keep track of your position twice, x is enough:
splitCollection: =>
  setSize = Math.ceil @collection.length / 3
  sets = Math.ceil @collection.length / setSize
  for x in [0...sets]
    @render new BusinessUnits @collection.models[x * setSize...(x + 1) * setSize]
Note that there is nothing wrong with passing slice an end greater than the array length.

If I understand your code, you want to split an array into 3 parts (the last one can have fewer items). In this case, write a reusable abstraction for the task. Using Underscore:
splitCollection: =>
  group_size = Math.ceil(@collection.size() / 3)
  _.each _(@collection.models).inGroupsOf(group_size), (group) =>
    @render(new BusinessUnits(group))
_.inGroupsOf can be written:
_.mixin({
  inGroupsOf: function(array, n) {
    var output = [];
    for(var index=0; index < array.length; index += n) {
      output.push(array.slice(index, index+n));
    }
    return output;
  }
});
