Spark / Scala Split - scala

I have this code:
rdd.map(_.split("-")).filter(row => { ... })
when I do row.length on:
This-is-a-test----on-split--
This-is-a-test-------
the output is 9 and 4 respectively. split does not count trailing tokens when they are empty. What is the workaround if I want both outputs to be 10?

You can accomplish what you want by passing -1 as the limit parameter to split, like this:
rdd.map(_.split("-", -1)).filter(row => { ... })
Btw, the expected result is 11, not 10: if you keep empty tokens and the string ends with the delimiter, split treats it as if there were one more empty token after that final delimiter. You can see this for more information.
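To check the token counts without a Spark cluster, here is a minimal sketch that runs the same split calls on plain strings. The object name SplitLimitDemo and its members are made up for illustration; the two sample strings are the ones from the question.

```scala
object SplitLimitDemo {
  // The two sample rows from the question.
  val samples: List[String] = List(
    "This-is-a-test----on-split--",
    "This-is-a-test-------"
  )

  // Default split (limit 0) drops trailing empty strings.
  def defaultLengths: List[Int] = samples.map(_.split("-").length)

  // A negative limit keeps every token, including trailing empty ones.
  def keepAllLengths: List[Int] = samples.map(_.split("-", -1).length)

  def main(args: Array[String]): Unit = {
    println(defaultLengths) // List(9, 4)
    println(keepAllLengths) // List(11, 11)
  }
}
```

Both strings contain 10 delimiters, which is why keeping every token gives 11 in each case.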

Related

String interpolation of variable value

I want the variable value to be processed by string interpolation.
val temp = "1 to 10 by 2"
println(s"$temp")
output expected:
inexact Range 1 to 10 by 2
but getting
1 to 10 by 2
Is there any way to get this done?
EDIT
The normal case for using StringContext is:
$> s"${1 to 10 by 2}"
inexact Range 1 to 10 by 2
This returns the Range from 1 to 10 with a step value of 2.
String interpolation won't evaluate the contents of a String variable as code, so is there a way to do something like
$> val temp = "1 to 10 by 2"
$> s"${$temp}" //hypothetical
such that the interpreter will evaluate this as
s"${$temp}" => s"${1 to 10 by 2}" => Range from 1 to 10 by step of 2 = {1,3,5,7,9}
By setting a string value to temp you are doing just that - creating a flat String. If you want this to be actual code, then you need to drop the quotes:
val temp = 1 to 10 by 2
Then you can print the results:
println(s"$temp")
This will print the following output string:
inexact Range 1 to 10 by 2
This is the toString(...) output of a variable representing a Range. If you want to print the actual results of the 1 to 10 by 2 computation, you need to do something like this:
val resultsAsString = temp.mkString(",")
println(resultsAsString)
> 1,3,5,7,9
or even this (watch out: here the curly brackets { } are used not for string interpolation but simply as normal string characters):
println(s"{$resultsAsString}")
> {1,3,5,7,9}
Edit
If what you want is to actually interpret/compile Scala code on the fly (not recommended though - for security reasons, among others), then you may be interested in this:
https://ammonite.io/ - Ammonite, Scala scripting
In any case, to interpret your code from a String, you may try using this:
https://docs.scala-lang.org/overviews/repl/embedding.html
See these lines:
val scripter = new ScriptEngineManager().getEngineByName("scala")
scripter.eval("""println("hello, world")""")
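If the strings always have the fixed shape "a to b by c", one option that avoids evaluating arbitrary code is a small hand-written parser. This is a different technique from the script engine above, not something the answers suggest, and it is only a sketch: RangeParser, Shape and parse are hypothetical names, and it assumes integer bounds and an optional integer step.

```scala
import scala.util.Try

object RangeParser {
  // Matches "<int> to <int>" with an optional " by <int>" suffix.
  private val Shape = """\s*(-?\d+)\s+to\s+(-?\d+)(?:\s+by\s+(-?\d+))?\s*""".r

  // Returns None for text that does not match, or for an invalid step such as 0.
  def parse(text: String): Option[Range] = text match {
    case Shape(lo, hi, step) =>
      Try {
        // The step group is null when the "by" part is absent.
        val stride = Option(step).map(_.toInt).getOrElse(1)
        lo.toInt to hi.toInt by stride
      }.toOption
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    val temp = "1 to 10 by 2"
    println(parse(temp).map(_.mkString(","))) // Some(1,3,5,7,9)
  }
}
```

Anything that is not of that shape comes back as None instead of being executed, which sidesteps the security concern mentioned above.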

How to strip extra spaces when writing from dataframe to csv

Read in multiple sheets (6) from an xlsx file and created individual dataframes. Want to write each one out to a pipe delimited csv.
ind_dim.to_csv (r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
Currently outputs like this:
1|value1 |value2 |word1 word2 word3 etc.
Want to strip trailing blanks
Suggestion
Call .apply(lambda x: x.str.rstrip()) on the DataFrame before the .to_csv() call to strip trailing blanks from every field. It would look like:
Change:
ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
To:
ind_dim.apply(lambda x: x.str.rstrip()).to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
It chains directly onto the existing call with '.'. Because the .str accessor only works on string columns, we can handle mixed data types by forcing every column to be read as a string on import with the argument dtype='str':
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
Or on the DataFrame itself by:
df = pd.DataFrame(df, dtype='str')
Proof
I did a mock-up where the .xlsx document has 5 sheets, each with three columns: the first column has all numbers except an empty cell in row 2; the second column has strings with both a leading and a trailing blank, an empty cell in row 3, and a number in row 4; and the third column has strings that all have a leading blank, plus an empty value in row 4. Integer indexes and integer columns have been included. The text in each sheet is:
0 1 2
0 11111 valueB1 valueC1
1 valueB2 valueC2
2 33333 valueC3
3 44444 44444
4 55555 valueB5 valueC5
This code reads our .xlsx file testing_xlsx_nums.xlsx into the DataFrame dictionary ind_dim.
Next, it loops through each sheet using a for loop to place the sheet name variable as a key to reference the individual sheet DataFrame. It applies the .str.rstrip() method to the entire sheet/DataFrame by passing the lambda x: x.str.rstrip() lambda function to the .apply() method called on the sheet/DataFrame.
Finally, it outputs the sheet/DataFrame as a .csv with the pipe delimiter using .to_csv() as seen in the OP post.
# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loops through sheets, applies rstrip(), output as csv '|' delimit
for sheet in ind_dim:
    ind_dim[sheet].apply(lambda x: x.str.rstrip()).to_csv(sheet + '_ind_dim_out.csv', sep='|')
Returns:
|0|1|2
0|11111| valueB1| valueC1
1|| valueB2| valueC2
2|33333|| valueC3
3|44444|44444|
4|55555| valueB5| valueC5
(Note our column 2 strings no longer have the trailing space).
We can also reference each sheet using a loop that cycles through the dictionary items; the syntax would look like for k, v in dict.items() where k and v are the key and value:
# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loops through sheets, applies rstrip(), output as csv '|' delimit
for k, v in ind_dim.items():
    v.apply(lambda x: x.str.rstrip()).to_csv(k + '_ind_dim_out.csv', sep='|')
Notes:
We'll still need to apply the correct arguments for selecting/ignoring indexes and columns with the header= and names= parameters as needed. For these examples I just passed =None for simplicity.
The related methods .str.lstrip() and .str.strip() strip leading spaces and both leading and trailing spaces, respectively. Either can be applied to an entire DataFrame the same way, for example df.apply(lambda x: x.str.strip()).
Only 1 Column: If we only wanted to strip from one column, we can call the .str methods directly on the column itself. For example, to strip leading & trailing spaces from a column named column2 in DataFrame df we would write: df.column2.str.strip().
Data types not string: When importing our data, pandas infers a data type for each column. We can override this by passing dtype='str' to the pd.read_excel() call.
pandas 1.0.1 documentation (04/30/2020) on pandas.read_excel:
"dtype: Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion."
We can pass the argument dtype='str' when importing with pd.read_excel() (as seen above). If we want to enforce a single data type on a DataFrame we are already working with, we can pass it to pd.DataFrame() with the argument dtype='str' and assign the result back, like: df = pd.DataFrame(df, dtype='str')
Hope it helps!
The following trims left and right spaces fairly easily:
if (!require(dplyr)) {
  install.packages("dplyr")
}
library(dplyr)
if (!require(stringr)) {
  install.packages("stringr")
}
library(stringr)
setwd("~/wherever/you/need/to/get/data")
outputWithSpaces <- read.csv("CSVSpace.csv", header = FALSE)
print(head(outputWithSpaces), quote=TRUE)
#str_trim(string, side = c("both", "left", "right"))
outputWithoutSpaces <- outputWithSpaces %>% mutate_all(str_trim)
print(head(outputWithoutSpaces), quote=TRUE)
Starting Data:
V1 V2 V3 V4
1 "Something is interesting. " "This is also Interesting. " "Not " "Intereting "
2 " Something with leading space" " Leading" " Spaces with many words." " More."
3 " Leading and training Space. " " More " " Leading and trailing. " " Spaces. "
Resulting:
V1 V2 V3 V4
1 "Something is interesting." "This is also Interesting." "Not" "Intereting"
2 "Something with leading space" "Leading" "Spaces with many words." "More."
3 "Leading and training Space." "More" "Leading and trailing." "Spaces."

Count filtered records in scala

As I am new to Scala, this problem might look very basic.
I have a file called data.txt with contents like this:
xxx.lss.yyy23.com-->mailuogwprd23.lss.com,Hub,12689,14.98904563,1549
xxx.lss.yyy33.com-->mailusrhubprd33.lss.com,Outbound,72996,1.673717588,1949
xxx.lss.yyy33.com-->mailuogwprd33.lss.com,Hub,12133,14.9381027,664
xxx.lss.yyy53.com-->mailusrhubprd53.lss.com,Outbound,72996,1.673717588,3071
I want to split the line and find the records depending upon the numbers in xxx.lss.yyy23.com
val data = io.Source.fromFile("data.txt").getLines().map { x => (x.split("-->"))}.map { r => r(0) }.mkString("\n")
which gives me
xxx.lss.yyy23.com
xxx.lss.yyy33.com
xxx.lss.yyy33.com
xxx.lss.yyy53.com
This is how I am trying to count the matching values:
data.count { x => x.contains("33")}
How do I get the count of records that do not contain 33?
The following will give you the number of lines that contain "33":
data.split("\n").count(a => a.contains("33"))
The reason your version isn't working is that data is no longer a collection of lines: mkString concatenated the results into a single String with newline as the separator, so collection operations like count run over its characters rather than its lines. You need to split it back into an array of strings first.
The following will work for getting the lines that do not contain "33":
data.split("\n").count(a => !a.contains("33"))
You simply need to negate the contains operation in this case.
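Another way to look at it is to never call mkString in the first place, so the hosts stay in a collection and both counts come from one pass. The following is only a sketch under that assumption: HostCounts and its members are invented names, and the lines are hard-coded from the question instead of being read from data.txt.

```scala
object HostCounts {
  // Sample lines copied from the question; a real run would read them from data.txt.
  val lines: List[String] = List(
    "xxx.lss.yyy23.com-->mailuogwprd23.lss.com,Hub,12689,14.98904563,1549",
    "xxx.lss.yyy33.com-->mailusrhubprd33.lss.com,Outbound,72996,1.673717588,1949",
    "xxx.lss.yyy33.com-->mailuogwprd33.lss.com,Hub,12133,14.9381027,664",
    "xxx.lss.yyy53.com-->mailusrhubprd53.lss.com,Outbound,72996,1.673717588,3071"
  )

  // Keep the part before "-->" for each line, still as a collection (no mkString).
  val hosts: List[String] = lines.map(_.split("-->")(0))

  // partition returns (elements matching the predicate, elements that do not).
  val (with33, without33) = hosts.partition(_.contains("33"))

  def main(args: Array[String]): Unit = {
    println(with33.size)    // 2
    println(without33.size) // 2
  }
}
```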

How to split string with trailing empty strings in result?

I am a bit confused about Scala string split behaviour as it does not work consistently and some list elements are missing. For example, if I have a CSV string with 4 columns and 1 missing element.
"elem1, elem2,,elem 4".split(",") = List("elem1", "elem2", "", "elem4")
Great! That's what I would expect.
On the other hand, if both element 3 and 4 are missing then:
"elem1, elem2,,".split(",") = List("elem1", "elem2")
Whereas I would expect it to return
"elem1, elem2,,".split(",") = List("elem1", "elem2", "", "")
Am I missing something?
As Peter mentioned in his answer, "string".split(), in both Java and Scala, does not return trailing empty strings by default.
You can, however, specify for it to return trailing empty strings by passing in a second parameter, like this:
String s = "elem1,elem2,,";
String[] tokens = s.split(",", -1);
And that will get you the expected result.
You can find the related Java doc here.
I believe that trailing empty strings are not included in the return value.
JavaDoc for split(String regex) says: "This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array."
So in your case split(String regex, int limit) should be used in order to get trailing empty string in a return value.
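The same behaviour can be seen from Scala, since split(String) and split(String, Int) on a Scala String are the Java methods. A minimal sketch comparing the three kinds of limit on the string from the answer above; SplitLimits and its field names are made up for illustration.

```scala
object SplitLimits {
  val csv = "elem1,elem2,,"

  // limit = 0 (what one-argument split uses): trailing empty strings are removed.
  val dropped: List[String] = csv.split(",").toList

  // limit < 0: the pattern is applied as often as possible and nothing is removed.
  val kept: List[String] = csv.split(",", -1).toList

  // limit = n > 0: at most n tokens; the last one holds the unsplit remainder.
  val capped: List[String] = csv.split(",", 3).toList

  def main(args: Array[String]): Unit = {
    println(dropped.size) // 2
    println(kept.size)    // 4
    println(capped.last)  // ,
  }
}
```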

The expression prints itself in unexpected order

When I print a log information like this:
val idList = getIdList
log info s"\n\n-------idList: ${idList foreach println}"
It shows me:
1
2
3
4
5
-------idList: ()
That makes sense because foreach returns Unit. But why does it print the list of id first? idList is already evaluated in the previous line (if that's the cause)!
And how can I make it print in the expected order, after idList:?
This is because the log string does not evaluate to what you want; it evaluates to:
\n\n -------idList: ()
However, the members of the list appear in the output stream as a side effect, due to the println call in the string interpolation.
EDIT: since clarification was requested by the OP, what happens is that the output comes from two sources:
${idList foreach println} evaluates to (), since println itself doesn't return anything.
However, you can see the elements printed out, because when the string interpolation is evaluated, println is being called. And println prints all the elements into the output stream.
In other words:
//line with log.info() reached, starts evaluating string before method call
1 //println from foreach
2 //println from foreach
3 //println from foreach
4 //println from foreach
5 //println from foreach
//string argument log.info() evaluated from interpolation
-------idList: () //log info prints the resultant string
To solve your problem, modify the expression in the interpolated string to actually return the correct string, e.g.:
log info s"\n\n-------idList: ${idList.mkString("\n")}"
Interpolation works in the following way:
evaluate all arguments
substitute their results into the resulting string
println returns Unit and prints to standard output as a side effect; you should use mkString instead, which returns a String:
log info s"\n\n-------idList: ${idList.mkString("(", ", ", ")")}"
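The two steps can be made visible by recording events in a buffer instead of printing them. This is just a sketch to illustrate the ordering: InterpolationOrder, trace and record are hypothetical names, and a ListBuffer stands in for the output stream and the logger.

```scala
import scala.collection.mutable.ListBuffer

object InterpolationOrder {
  // Returns the events in the order they happened.
  def trace(ids: List[Int]): List[String] = {
    val events = ListBuffer.empty[String]

    // Stand-in for println: returns Unit and has a side effect.
    def record(id: Int): Unit = events += s"side effect for $id"

    // Step 1 runs foreach (and its side effects); step 2 substitutes its result, ().
    val message = s"idList: ${ids.foreach(record)}"

    // Stand-in for the log call, which only sees the finished string.
    events += s"message built: $message"
    events.toList
  }

  def main(args: Array[String]): Unit =
    trace(List(1, 2)).foreach(println)
}
```

The side effects are recorded before the message exists, and the message itself only ever contains ().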
As pointed out by @TheTerribleSwiftTomato, you need to give an expression that returns a value and has no side effects. So simply do it like this:
val idList = getIdList
log info s"\n\n-------idList: ${idList mkString " "}"
For example, this works for me:
val idList = List(1, 2, 3, 4, 5)
println(s"\n\n-------idList: ${idList mkString " "}")
Output:
-------idList: 1 2 3 4 5