Meaning of 2nd parameter in StringOps.split(String, Int) - scala

I was trying to split a string and keep the empty strings. Fortunately i found a promising solution which gave me my expected results as following REPL session depicts:
scala> val test = ";;".split(";",-1)
test: Array[String] = Array("", "", "")
I was curious what the second parameter actually does and dived into the scala documentation but found nothing except this:
Also inside the REPL interpreter i only get the following information:
scala> "asdf".split
TAB
def split(String): Array[String]
def split(String, Int): Array[String]
Question
Does anybody have an alternate source of documentation for such badly documented parameters?
Or can someone explain what this 2dn parameter does on this specific function?

This is the same split from java.lang.String, which as it so happens, has better documentation:
The limit parameter controls the number of times the pattern is
applied and therefore affects the length of the resulting array. If
the limit n is greater than zero then the pattern will be applied at
most n - 1 times, the array's length will be no greater than n, and
the array's last entry will contain all input beyond the last matched
delimiter. If n is non-positive then the pattern will be applied as
many times as possible and the array can have any length. If n is zero
then the pattern will be applied as many times as possible, the array
can have any length, and trailing empty strings will be discarded.

Related

How to access last argument in PowerShell script (args)

I have following code:
foreach ($arg in $args) {
Write-Host "Arg: $arg";
$param1=$args[0]
}
Write-host "Number of args: " $args.Length
write-host Last Arg is: "$($args.count)"
I get this, when I run it:
./print_last_arg.ps1 a b c
Arg: a
Arg: b
Arg: c
Number of args: 3
Last Arg is: 3
What I would like to have is name of last argument, so:
Last Arg is: 3
should be:
Last Arg is: c
Sorry for such a stupid question but I am totally begginer in PS and cannot google the result...
PowerShell supports negative indices to refer to elements from the end of a collection, starting with -1 to refer to the last element, -2 to the penultimate (second to last) one, and so on.
Therefore, use $args[-1] to refer to the last argument passed.
For more information, see the conceptual about_Arrays help topic.
Note that you can also use the results of expressions as indices; e.g., the equivalent of $args[-1] is $args[$args.Count-1] (assuming the array has at least one element).
Additionally, you may specify multiple indices to extract a sub-array of arbitrary elements. E.g., $args[0, -1] returns a (new) array comprising the input array's first and the last element (assuming the array has at least two elements).
.., the range operator is particularly useful for extracting a range of contiguous elements. E.g., $args[0..2] returns a (new) array comprising the first 3 elements (the elements with indices 0, 1, and 2).
You can even combine individual indices with ranges, courtesy of PowerShell's + operator performing (flat) array concatenation.
E.g., $args[0..2 + -1] extracts the first 3 elements as well as the last (assumes at least 4 elements).
Note: For syntactic reasons, if a single index comes first in the index expression, you need to make it an array with the unary form of , the array constructor operator, to make sure that + performs array concatention; e.g., $args[,-1 + 0..2] extracts the last element followed by the first 3.
Pitfall: Combining a positive .. start point with a negative end point for up-to-the-last-Nth-element logic does not work as intended:
Assume the following array:
$a = 'first', 'middle1', 'middle2', 'last'
It is tempting to use range expression 1..-2 to extract all elements "in the middle", i.e. starting with the 2nd and up to the penultimate element - but this does not work as expected:
# BROKEN attempt to extract 'middle1', 'middle2'
PS> $a[1..-2]
middle1
first
last
middle2
The reason is that 1..-2, as a purely arithmetic range expression, expanded to the following array (whose elements happen to be used as indices into another array): 1, 0, -1, -2. And it is these elements that were extracted: the 2nd, the first, the last, the penultimate.
To avoid this problem, you need to know the array's element count ahead of time, and use an expression to specify the end of the range as a positive number:
# OK: extract 'middle1', 'middle2'
# Note that the verbosity and the need to know $a's element count.
PS> $a[1..($a.Count-2)]
middle1
middle2
Unfortunately, this is both verbose and inconvenient, especially given that you may want to operate on a collection whose count you do not know in advance.
GitHub issue #7940 proposes a future enhancement to better support this use case with new syntax, analogous to C#'s indices-and-ranges feature, so that the above could be written more conveniently with syntax such as $a[1..^1]

What are the differences between these two Scala code segments regarding postfix toString method? [duplicate]

This question already has answers here:
toList on Range with suffix notation causes type mismatch
(2 answers)
Closed 5 years ago.
I am learning postfix unary operators in Scala.
The following can not compile:
val result = 43 toString
println(result)
However if I add one empty line inbetween the two lines, the code compile and produces right output:
val result = 43 toString
println(result)
What is the difference between these two segments?
BTW, I did not add "import scala.language.postfixOps".
Perhaps the issue is clearer if we use some other operator instead of toString.
// This parses as `List(1,2,3,4) ++ List(4,5,6)`
List(1,2,3,4) ++
List(4,5,6)
Basically, in order to make the above work, while also allowing things like foo ? (a postfix operator), Scala needs to know when it is OK to stop expecting a second argument (and accept that the expression is a postfix operator).
Its solution is give up on finding a second argument if there is an interceding new line.

scala - error: ')' expected but '(' found

I'm new to Scala and I cannot find out what is causing this error, I have searched similar topics but unfortunately, none of them worked for me. I've got a simple code to find the line from some README.md file with the most words in it. The code I wrote is:
val readme = sc.textFile("/PATH/TO/README.md")
readme.map(lambda line :len(line.split())).reduce(lambda a, b: a if (a > b) else b)
and the error is:
Name: Compile Error
Message: <console>:1: error: ')' expected but '(' found.
readme.map(lambda line :len(line.split()) ).reduce( lambda a, b: a
if (a > b) else b ) ^
<console>:1: error: ';' expected but ')' found.
readme.map(lambda line :len(line.split()) ).reduce( lambda a, b: a
if (a > b) else b ) ^
Your code isn't valid Scala.
I think what you might be trying to do is to determine the largest number of words on a single line in a README file using Spark. Is that right? If so, then you likely want something like this:
val readme = sc.textFile("/PATH/TO/README.md")
readme.map(_.split(' ').length).reduce(Math.max)
That last line uses some argument abbreviations. This alternative version is equivalent, but a little more explicit:
readme.map(line => line.split(' ').length).reduce((a, b) => Math.max(a, b))
The map function converts an RDD of Strings (each line in the file) into an RDD of Ints (the number of words on a single line, delimited - in this particular case - by spaces). The reduce function then returns the largest value of its two arguments - which will ultimately result in a single Int value representing the largest number of elements on a single line of the file.
After re-reading your question, it seems that you might want to know the line with the most words, rather than how many words are present. That's a little trickier, but this should do the trick:
readme.map(line => (line.split(' ').length, line)).reduce((a, b) => if(a._1 > b._1) a else b)._2
Now map creates an RDD of a tuple of (Int, String), where the first value is the number of words on the line, and the second is the line itself. reduce then retains whichever of its two tuple arguments has the larger integer value (._1 refers to the first element of the tuple). Since the result is a tuple, we then use ._2 to retrieve the corresponding line (the second element of the tuple).
I'd recommend you read a good book on Scala, such as Programming in Scala, 3rd Edition, by Odersky, Spoon & Venners. There's also some tutorials and an overview of the language on the main Scala language site. Coursera also has some free Scala training courses that you might want to sign up for.

Efficiency of flatMap vs map followed by reduce in Spark

I have a text file sherlock.txt containing multiple lines of text. I load it in spark-shell using:
val textFile = sc.textFile("sherlock.txt")
My purpose is to count the number of words in the file. I came across two alternative ways to do the job.
First using flatMap:
textFile.flatMap(line => line.split(" ")).count()
Second using map followed by reduce:
textFile.map(line => line.split(" ").size).reduce((a, b) => a + b)
Both yield the same result correctly. I want to know the differences in time and space complexity of the above two alternative implementations, if indeed there is any ?
Does the scala interpreter convert both into the most efficient form ?
I will argue that the most idiomatic way to handle this would be to map and sum:
textFile.map(_.split(" ").size).sum
but the end of the day a total cost will be dominated by line.split(" ").
You could probably do a little bit better by iterating over the string manually and counting consecutive whitespaces instead of building new Array but I doubt it is worth all the fuss in general.
If you prefer a little bit deeper insight count is defined as:
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
where Utils.getIteratorSize is pretty much a naive iteration over Iterator with a sum of ones and sum is equivalent to
_.fold(0.0)(_ + _)

ValueError when converting to int lines retrieved via getline.linecache

this is my first message here, I hope I will not commit any mistake.
I am writing a python 2.7 script which performs comparisons between lines from a long list of lines provided as an external input file. Some of these lines contain just numbers, and on those I perform simple sums after their retrieval via getline.linecache.
My problem is that after a certain number of lines I am getting the error:
ValueError: invalid literal for int() with base 10
I do understand that somehow this has to do with the fact that there is some problem when I try to convert the lines retrieved to the integer type, but according to what I read each line should be retrieved from a memory database as a string, and indeed if I try to print the type of the values retrieved I get str. I printed the problematic values in order to understand why they failed to be converted to int: at first i included some semantic mistakes (I was taking some wrong lines, which were containing letters, and this of course failed to be converted to int), but still I get the error on merely numerical strings. On all of those numerical strings, I tried len(linecache.getline('input', line_n)) to see if any extra characters were present, but I just found '\n', which does not give any problems when converting from str to int.
My input file is made by a series of lines, some numerical some not; here are few lines:
1
id3021-a
1
129485768
129485769
2
id2034
102
944709842
944709848
For examples, line 4 here can be retrieved, but not converted to int. How could I convert str to int without getting errors?
I found the solution! Adding a '0' to the beginning of the string fixes the problem (I do not know why, the problematic lines were not empty):
int('0' + linecache.getline('input', line_n))
See here: Trouble converting string to int in Django/Python