Io - List Shuffling

I tried to shuffle a list in Io:
list(1, 2, 3, 4) shuffle println
However, when I tried to run the program, Io gave me an error:
Exception: List does not respond to 'shuffle'
---------
List shuffle .code.tio 1
CLI doFile Z_CLI.io 140
CLI run IoState_runCLI() 1
Is there an alternative for shuffling in Io? If not, how do I implement it?

IIRC, you need to load the Random addon, which will add the List shuffle method.

Related

Using scastie and scalafiddle to evaluate code execution time

I'm trying to reduce the execution time of some small Scala routine, say, concatenation of strings. Since I'm too lazy to set up a local environment, I'm using online Scala compilers, but I found that the comparison results differ between Scastie and ScalaFiddle with the following code:
// routine 1
var startT1 = System.nanoTime()
(1 until 100 * 1000).foreach { x =>
  val sb = new StringBuilder("a")
  sb.append("b").append("c").append("d").append("e").append("f")
}
println(System.nanoTime() - startT1)

// routine 2
var startT2 = System.nanoTime()
(1 until 100 * 1000).foreach { x =>
  val arr = Array[Char]('a', 'b', 'c', 'd', 'e', 'f')
}
println(System.nanoTime() - startT2)
In scalafiddle routine 1 is faster but in scastie routine 2 is faster.
I've read this article https://medium.com/@otto.chrons/what-makes-scalafiddle-so-fast-9a3edf33ed4d, so it seems that ScalaFiddle actually runs JavaScript instead of Scala. But the remaining question is: can I really use Scastie for execution-time benchmarks?
Short answer: NO, I don't think you can rely on ANY online tools like Scastie and ScalaFiddle to verify performance.
There are a thousand and more reasons why some benchmark will show X millis of execution for some operation, and 99% of those reasons are the running environment: hardware, operating system, CPU architecture, load on the machine, the Scala compiler used, the JVM used, etc. And we don't know how the environment changes between runs on Scastie, for instance, so you can get totally different numbers without knowing why; hence the benchmark results won't be reliable.
If you would like results you can rely on at least a bit, take a look at JMH (https://openjdk.java.net/projects/code-tools/jmh/) and its sbt helper plugin (https://github.com/ktoso/sbt-jmh), and run the benchmarks in a known environment.
And along with the benchmark results, please post the details of the environment where they were run.
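For example, a minimal sbt-jmh benchmark of the two routines might look roughly like the sketch below, run with something like sbt jmh:run. The class and method names are mine, not from the question.

// A minimal sketch of a JMH benchmark for the two routines above, meant to be
// compiled in a project using the sbt-jmh plugin. Names are illustrative.
import org.openjdk.jmh.annotations._

class ConcatBenchmark {

  @Benchmark
  def stringBuilderAppend(): String = {
    val sb = new StringBuilder("a")
    sb.append("b").append("c").append("d").append("e").append("f")
    sb.toString // return the result so JMH cannot dead-code-eliminate the work
  }

  @Benchmark
  def charArray(): Array[Char] =
    Array('a', 'b', 'c', 'd', 'e', 'f') // returned for the same reason
}

JMH takes care of warm-up, forking and iteration counts, which is exactly what the hand-rolled System.nanoTime() loops above do not control.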

Scala: Is it possible to run loops asynchronously?

I have a for loop that iterates over an Iterable[String] and puts data into a mutable.Map. Is it possible to process everything in the iterable at once, or a certain amount at a time?
Use .par (converting the collection to a parallel one) and then map/foreach over it.
Or you can map each element to a Future.
Don't forget about the thread safety of the map: you should use a ConcurrentHashMap.
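A rough sketch of both suggestions, with placeholder keys and placeholder per-key work (the real loop body would go where k.length is used). On Scala 2.13+ the .par call additionally needs the scala-parallel-collections module and import scala.collection.parallel.CollectionConverters._:

// Sketch only: placeholder data and work, ConcurrentHashMap as the thread-safe target.
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val keys: Iterable[String] = List("alpha", "beta", "gamma")
val results = new ConcurrentHashMap[String, Int]()

// Option 1: parallel collection, elements are processed concurrently on a shared pool.
keys.par.foreach(k => results.put(k, k.length))

// Option 2: one Future per element, then block until all of them complete.
val all = Future.traverse(keys)(k => Future { results.put(k, k.length); () })
Await.ready(all, 1.minute)

Which option fits better depends on the work: .par suits CPU-bound tasks on the default pool, while Futures (with an appropriate execution context) are the usual choice for I/O-bound work.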

Scala immutable list internal implementation

Suppose I have a huge list with elements from 1 to 1 million.
val initialList = List(1,2,3,.....1 million)
and
val myList = List(1,2,3)
Now when I apply an operation such as foldLeft on myList, giving initialList as the starting value, such as
val output = myList.foldLeft(initialList)(_ :+ _)
// result ==>> List(1,2,3,.....1 million, 1 , 2 , 3)
Now here comes my question: with both lists being immutable, the intermediate lists that were produced were
List(1,2,3,.....1 million, 1)
List(1,2,3,.....1 million, 1 , 2)
List(1,2,3,.....1 million, 1 , 2 , 3)
By the concept of immutability, each time a new list is created and the old one is discarded. So isn't this operation a performance killer in Scala, since each time a list of 1 million elements has to be copied to create the new one?
Please correct me if I am wrong as I am trying to understand the internal implementation of an immutable list.
Thanks in advance.
Yup, this is a performance killer, but that is the cost of having immutable structures (which are amazing and safe, and make programs much less buggy). That's why you should avoid appending to a list if you can. There are many tricks that avoid this issue (try to use accumulators).
For example:
Instead of:
val initialList = List(1,2,3,.....1 million)
val myList = List(1,2,3,...,100)
val output = myList.foldLeft(initialList)(_ :+ _)
You can write:
val initialList = List(1,2,3,.....1 million)
val myList = List(1,2,3,...,100)
val output = List(initialList,myList).flatten
flatten is implemented so that it copies the first list only once, instead of copying it for every single append.
P.S.
At least adding an element to the front of a list is fast (O(1)), because the old list can be shared. Let's look at an example of how memory sharing works for immutable linked lists: the computer only keeps one copy of the (b, c, d) tail. But if you want to append bar to the end of baz, you cannot modify baz, because you would destroy foo, bar and raz! That's why you have to copy the first list.
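The same sharing can be sketched in code (a minimal illustration; the names loosely follow the answer's foo/raz/baz):

// Structural sharing: prepending reuses the old list, appending must copy it.
val baz = List("b", "c", "d")
val foo = "f" :: baz        // O(1): a new head cell pointing at baz
val raz = "r" :: baz        // O(1): another head cell sharing the same (b, c, d)

// Only one copy of (b, c, d) exists in memory; foo and raz both end in it.
// Appending cannot share: if baz's last cell were changed to point at "bar",
// foo and raz would see the extra element too, so the whole chain is copied.
val appended = baz :+ "bar" // O(n): copies "b", "c", "d", then adds "bar"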
Appending to a List is not a good idea, because List has a linear cost for appending. So, if you can,
either prepend to the List (List has constant-time prepend),
or choose another collection that is efficient for appending, such as a Queue (both options are sketched below).
For the performance characteristics per operation of most Scala collections, see:
https://docs.scala-lang.org/overviews/collections/performance-characteristics.html
Note that, depending on your requirements, you may also build your own smarter collection, such as a chained iterable, for example.
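A small sketch of the two alternatives mentioned above (prepending vs. an immutable Queue), reusing the question's initialList and myList:

// Sketch of the two alternatives; not benchmarked, just to show the shapes.
import scala.collection.immutable.Queue

val initialList = (1 to 1000000).toList
val myList = List(1, 2, 3)

// 1) Prepend instead of append: each :: is O(1). The new elements end up at the
//    front, so this only helps when you can tolerate (or later undo) that order.
val prepended = myList.foldLeft(initialList)((acc, x) => x :: acc)
// prepended starts with 3, 2, 1 followed by the original million elements

// 2) Use an immutable Queue, whose enqueue (append) is amortised O(1).
val queued = myList.foldLeft(Queue(initialList: _*))((q, x) => q.enqueue(x))
// queued.toList == initialList ++ myList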

Count number of elements in a text or list using Spark

I know there are different ways to count the number of elements in a text or list. But I am trying to understand why this one does not work. I am trying to write code equivalent to
A_RDD=sc.parallelize(['a', 1.2, []])
acc = sc.accumulator(0)
acc.value
A_RDD.foreach(lambda _: acc.add(1))
acc.value
Where the result is 3.
To do so I defined the following function called my_count(_), but I don't know how to get the result. A_RDD.foreach(my_count) does not do anything. I didn't receive any error either. What did I do wrong?
counter = 0  # counter for the number of elements

def my_count(_):
    global counter
    counter += 1

A_RDD.foreach(my_count)
The A_RDD.foreach(my_count) operation doesn't run on your local Python virtual machine. It runs on your remote executor nodes. The driver ships your my_count method to each of the executor nodes along with the counter variable, since the method refers to that variable. So each executor node gets its own copy of the counter variable, which is updated by the foreach method, while the counter variable defined in your driver application is never incremented.
One easy but risky solution would be to collect the RDD on your driver and then compute the count, as below. This is risky because the entire RDD content is downloaded into the driver's memory, which may cause a MemoryError.
>>> len(A_RDD.collect())
3
So what if you were running locally and not on a cluster? In Spark/Scala this behaviour changes between local mode and a cluster. Locally the counter would have the value you expect, but on a cluster it wouldn't; it would happen just as you describe... Does the same thing happen in Spark/Python? My guess is it does.
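For the Scala side of that comparison, here is a hedged sketch of the accumulator approach (the Scala counterpart of the sc.accumulator snippet in the question), assuming a Spark 2.x+ SparkContext named sc as in spark-shell:

// Sketch only. A plain var would hit the same closure-copy problem as the Python
// counter; a LongAccumulator is merged back on the driver, so its value is
// reliable there in both local and cluster mode.
val rdd = sc.parallelize(Seq[Any]("a", 1.2, Seq.empty))
val acc = sc.longAccumulator("element counter")

rdd.foreach(_ => acc.add(1))
println(acc.value) // 3 on the driver

// For a plain count there is of course also the built-in action:
println(rdd.count()) // 3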

Why does partition parameter of SparkContext.textFile not take effect?

scala> val p=sc.textFile("file:///c:/_home/so-posts.xml", 8) //i've 8 cores
p: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:21
scala> p.partitions.size
res33: Int = 729
I was expecting 8 to be printed, but I see 729 tasks in the Spark UI.
EDIT:
After calling repartition() as suggested by @zero323:
scala> val p1 = p.repartition(8)
scala> p1.partitions.size
res60: Int = 8
scala> p1.count
I still see 729 tasks in the Spark UI even though the spark-shell prints 8.
If you take a look at the signature
textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
you'll see that the argument you use is called minPartitions, and this pretty much describes its function. In some cases even that is ignored, but that is a different matter. The input format used behind the scenes still decides how to compute splits.
In this particular case you could probably use mapred.min.split.size to increase the split size (this works during load) or simply repartition after loading (this takes effect after the data is loaded), but in general there should be no need for that.
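In spark-shell terms, those two options might look roughly like this (a sketch only; the 128 MB value is merely an example, not something from the question):

// Option 1: ask the Hadoop input format for larger splits before loading.
sc.hadoopConfiguration.set("mapred.min.split.size", (128L * 1024 * 1024).toString)
val bigSplits = sc.textFile("file:///c:/_home/so-posts.xml", 8)

// Option 2: load with the default splits, then repartition (this shuffles the data).
val p8 = sc.textFile("file:///c:/_home/so-posts.xml").repartition(8)
println(p8.partitions.size) // 8

Note that even after repartition the initial read stage still runs with the original number of splits; only the stages after the shuffle use 8 partitions, which is consistent with still seeing 729 tasks in the UI.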
@zero323 nailed it, but I thought I'd add a bit more (low-level) background on how this minPartitions input parameter influences the number of partitions.
tl;dr The partition parameter does have an effect on SparkContext.textFile, but only as the minimum (not the exact!) number of partitions.
In this particular case of using SparkContext.textFile, the number of partitions is calculated directly by org.apache.hadoop.mapred.TextInputFormat.getSplits(jobConf, minPartitions), which textFile uses. TextInputFormat alone knows how to partition (aka split) the distributed data, with Spark merely following its advice.
From Hadoop's FileInputFormat's javadoc:
FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobConf, int). Subclasses of FileInputFormat can also override the isSplitable(FileSystem, Path) method to ensure input-files are not split-up and are processed as a whole by Mappers.
It is a very good example of how Spark leverages the Hadoop API.
BTW, you may find the sources enlightening ;-)
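As a rough mental model of what getSplits ends up doing (my paraphrase of the old mapred FileInputFormat logic, not the verbatim Hadoop source):

// Rough paraphrase, not the actual Hadoop source; treat the details as approximate.
def approxNumSplits(totalSize: Long, minPartitions: Int,
                    blockSize: Long, configuredMinSplit: Long): Long = {
  val goalSize  = totalSize / math.max(minPartitions, 1)            // what minPartitions asks for
  val splitSize = math.max(configuredMinSplit, math.min(goalSize, blockSize))
  math.max(1L, (totalSize + splitSize - 1) / splitSize)             // roughly ceil(totalSize / splitSize)
}

// If splitSize ends up much smaller than totalSize / 8 (a large file with a modest
// block size), far more than 8 splits come out, e.g. the 729 seen above.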