Instantiate a map with default - scala

So, I have a list of characters and I have to create a list that says how many there are of each kind:
List(a,a,a,a,b,b) => List((a,4), (b,2))
I wanted to create a map that would first map each character to 1:
((a->1), (a->1), (a->1), (a->1), (b->1), (b->1))
and then use groupBy + toList to return the final list. The problem is that creating a map seems to require a PhD in this language.
I tried
val m = xs.foldLeft(Map[Char,Int]()){c => c->1}
which doesn't work (the function passed to foldLeft needs two arguments: the accumulator and the element).
I tried
xs map (x => x -> 1) toMap
which compiles, but I can't do anything useful with the map afterwards: a Map collapses duplicate keys, so every character ends up mapped to 1 exactly once.
And xs.toMap(x, 1), which doesn't work either.
Could somebody tell me how I should proceed, please?

You can use groupBy to do the grouping and then find the count of each group:
list.groupBy(identity).mapValues(_.length)
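For instance, a runnable sketch of that answer (the sample values are mine):

val xs = List('a', 'a', 'a', 'a', 'b', 'b')
// groupBy(identity) gives Map(a -> List(a, a, a, a), b -> List(b, b));
// mapValues then replaces each group with its size
val counts = xs.groupBy(identity).mapValues(_.length).toList
// counts: List((a,4), (b,2)) -- a Map does not guarantee entry order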

How to explode a struct column with a prefix?

My goal is to explode (i.e., take the fields from inside the struct and expose them as top-level columns of the dataset) a Spark struct column (already done), but changing the inner field names by prepending an arbitrary string. One motivation is that my struct can contain columns that have the same name as columns outside of it, so I need a way to differentiate them easily. Of course, I do not know beforehand what the columns inside my struct are.
Here is what I have so far:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
}
This does the job alright - I use this writing:
df.explodeStruct("myColumn")
It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.
As for prepending the prefix, my idea is to take the column and find out what its inner columns are. I browsed the documentation and could not find any method on the Column class that does that.
Then I changed my approach: take the schema of the DataFrame, filter the result by the name of the column, and extract the matching entry from the resulting array. The problem is that this element has type StructField, which again presents no option to extract its inner fields. What I would really like to get hold of is a StructType, which exposes its fields (which is exactly what I want: the names of the inner columns, so I can iterate over them and use them in my select, prepending the prefix I want). I know of no way to convert a StructField to a StructType.
My last attempt would be to parse the output of StructField.toString, which contains all the names and types of the inner columns - but that feels really dirty, and I'd rather avoid such a lowly approach.
Any elegant solution to this problem?
Well, after reading my own question again, I figured out an elegant solution: I just needed to select all the columns the way I was doing, and then compare the result back to the original dataframe to figure out which columns were new. Here is the final result. I also arranged it so that the exploded columns show up in the same place as the original struct column, so as not to break the flow of information:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = {
    val prefix = column + "_"
    val originalPosition = df.columns.indexOf(column)
    // select everything plus the struct's inner fields, then diff to find the new columns
    val dfWithAllColumns = df.select("*", column + ".*")
    val explodedColumns = dfWithAllColumns.columns diff df.columns
    // re-select each inner field with the prefix prepended
    val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)
    // splice the prefixed columns into the original position of the struct column
    val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)
    df.select(finalColumnsList: _*)
  }
}
Of course, you can customize the prefix, the separator, and so on - but that is simple; anyone can tweak the parameters. The usage remains the same:
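As a hypothetical illustration (this schema is made up, not from the original post): given a DataFrame with columns id and myColumn, where myColumn is struct<a: int, b: int>,

df.explodeStruct("myColumn")
// before: id, myColumn
// after:  id, myColumn_a, myColumn_b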
In case anyone is interested, here is something similar for PySpark:
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def explode_struct(df: DataFrame, column: str) -> DataFrame:
    original_position = df.columns.index(column)
    # inner field names of the struct column
    new_columns = df.select(column + ".*").columns
    exploded_columns = [F.col(column + "." + c).alias(column + "_" + c) for c in new_columns]
    # rebuild the column list, replacing the struct column with its prefixed inner fields
    col_list = [F.col(c) for c in df.columns]
    col_list.pop(original_position)
    col_list[original_position:original_position] = exploded_columns
    return df.select(col_list)
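On a side note (my addition, not from the thread): the StructField-to-StructType dead end described in the question can be escaped through the field's dataType member, which can be pattern-matched as a StructType. A minimal sketch, assuming the struct column is called myColumn:

import org.apache.spark.sql.types.StructType

val innerNames: Array[String] = df.schema("myColumn").dataType match {
  case st: StructType => st.fieldNames // names of the inner columns
  case _ => Array.empty
}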

Display words whose length is more than 8

I am trying to take the list below and get all the words that contain an 'r' and have length greater than 8, converted to uppercase:
val names = List("sachinramesh", "rahuldravid", "viratkohli", "mayank")
I tried the following, but it is not giving anything; it throws an error:
names.map(s => s.toUpperCase.contains("r").size(8))
Can someone tell me how to resolve this issue?
Regards,
Kumar
If you are doing a combination of filter and map, think about using the collect method, which does both in one call. This is how to do what is described in the question:
names.collect {
  case s if s.lengthCompare(8) > 0 && s.contains('r') => s.toUpperCase
}
collect works like filter because it only returns values that match a case statement. It works like map because you can make changes to the matching values before returning them.
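For the sample list above, this yields (my evaluation, not output quoted from the thread):
// List(SACHINRAMESH, RAHULDRAVID, VIRATKOHLI) -- "mayank" is dropped (no 'r', too short)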
You can try this:
names.filter(str => str.contains('r') && str.length > 8) // keep words that contain an `r` and have length > 8
  .map(_.toUpperCase) // map the result to uppercase
The names.filter(...).map(...) approach solves the problem, but it requires iterating through the list twice. For a solution that goes through the list only once, consider @Tim's suggestion regarding collect, or perhaps a lazy Iterator approach like so:
names
  .iterator
  .filter(_.size > 8)
  .filter(_.contains('r'))
  .map(_.toUpperCase)
  .toList
You can also try this:
val result = for (x <- names if x.contains('r') && x.length > 8) yield x.toUpperCase
result.foreach(println)
cheers
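For what it's worth, a for-comprehension with a guard desugars to withFilter followed by map, so this is the filter/map solution above in different syntax:

names.withFilter(x => x.contains('r') && x.length > 8).map(_.toUpperCase)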

Map word ngrams to counts in scala

I'm trying to create a map which goes through all the ngrams in a document and counts how often they appear. Ngrams are sets of n consecutive words in a sentence (so in the last sentence, (Ngrams, are) is a 2-gram, (are, sets) is the next 2-gram, and so on). I already have code that creates a document from a file and parses it into sentences. I also have a function to count the ngrams in a sentence, ngramsInSentence, which returns Seq[Ngram].
I'm getting stuck syntactically on how to create my counts map. I am iterating through all the ngrams in the document in the for loop, but don't know how to map the ngrams to the count of how often they occur. I'm fairly new to Scala and the syntax is evading me, although I'm clear conceptually on what I need!
def getNGramCounts(document: Document, n: Int): Counts = {
  for (sentence <- document.sentences; ngram <- nGramsInSentence(sentence, n))
    // I need code here to map each ngram -> how many times it appears in the document
}
The types Counts and NGram above are defined as:
type Counts = Map[NGram, Double]
type NGram = Seq[String]
Does anyone know the syntax to map the ngrams from the for loop to a count of how often they occur? Please let me know if you'd like more details on the problem.
If I'm correctly interpreting your code, this is a fairly common task.
def getNGramCounts(document: Document, n: Int): Counts = {
  val allNGrams: Seq[NGram] = for {
    sentence <- document.sentences
    ngram <- nGramsInSentence(sentence, n)
  } yield ngram
  allNGrams.groupBy(identity).mapValues(_.size.toDouble)
}
The allNGrams variable collects a list of all the NGrams appearing in the document.
You should eventually turn to Streams if the document is big and you can't hold the whole sequence in memory.
The following groupBy creates a Map[NGram, List[NGram]]: it groups your values by their identity (the argument to the method defines the criterion for aggregation) and collects the corresponding values in a list.
You then only need to map each value (a List[NGram]) to its size to get how many occurrences there were of each NGram.
I took for granted that:
- NGram has a correct implementation of equals + hashCode
- document.sentences returns a Seq[...]; if not, expect allNGrams to be of the corresponding collection type
UPDATED based on the comments:
- I wrongly assumed that groupBy(_) would pass each input value through unchanged; use the identity function instead.
- I converted the count to a Double to match the Counts type.
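A tiny self-contained demonstration of the counting step (the sample data is mine):

val grams: Seq[Seq[String]] = Seq(Seq("the", "cat"), Seq("cat", "sat"), Seq("the", "cat"))
grams.groupBy(identity).mapValues(_.size.toDouble)
// Map(List(the, cat) -> 2.0, List(cat, sat) -> 1.0)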
Appreciate the help - I now have correct code using the suggestions above. The following returns the desired result:
def getNGramCounts(document: Document, n: Int): Counts = {
  val allNGrams: Seq[NGram] =
    for (sentence <- document.sentences; ngram <- ngramsInSentence(sentence, n))
      yield ngram
  allNGrams.groupBy(l => l).map(t => (t._1, t._2.length.toDouble))
}
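An alternative not mentioned in the thread: the same counts can be accumulated in a single pass with foldLeft, avoiding the intermediate groups. A sketch against the same type aliases and ngramsInSentence helper:

def getNGramCountsSinglePass(document: Document, n: Int): Counts =
  document.sentences
    .flatMap(sentence => ngramsInSentence(sentence, n))
    .foldLeft(Map.empty[NGram, Double]) { (counts, ngram) =>
      // bump this ngram's count, starting from 0.0 if unseen
      counts + (ngram -> (counts.getOrElse(ngram, 0.0) + 1.0))
    }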

.pop() equivalent in scala

I have worked with Python.
In Python there is a function .pop() which deletes the last value in a list and returns that deleted value.
For example: x = [1, 2, 3, 4]
x.pop() will return 4
I was wondering: is there a Scala equivalent for this function?
If you just wish to retrieve the last value, you can call x.last. This won't remove the last element from the list, however, which is immutable. Instead, you can call x.init to obtain a list consisting of all elements in x except the last one - again, without actually changing x. So:
val lastEl = x.last
val rest = x.init
will give you the last element (lastEl), the list of all bar the last element (rest), and you still also have the original list (x).
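Concretely, with the list from the question (my evaluation):

val x = List(1, 2, 3, 4)
val lastEl = x.last // 4
val rest = x.init // List(1, 2, 3)
// x is still List(1, 2, 3, 4)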
There are a lot of different collection types in Scala, each with its own set of supported and/or well-performing operations.
In Scala, a List is an immutable cons-cell sequence, as in Lisp. Getting the last element is not an optimised operation (getting the head element is fast). Similarly, Queue and Stack are optimised for retrieving an element, and the rest of the structure, from one particular end. You could use either of them if your order is reversed.
Otherwise, Vector is a good performing general structure which is fast both for head and last calls:
val v = Vector(1, 2, 3, 4)
val init :+ last = v // uses pattern matching extractor `:+` to get both init and last
Where last would be the equivalent of your pop operation, and init is the sequence with the last element removed (you can also use dropRight(1) as suggested in the other answers). To just retrieve the last element, use v.last.
I tend to use
val popped :: newList = list
which assigns the first element of the list to popped and the remaining list to newList
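Note that, unlike Python's pop(), this takes the element from the front of the list, and the match throws a MatchError if the list is empty:

val list = List(1, 2, 3, 4)
val popped :: newList = list
// popped = 1, newList = List(2, 3, 4)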
The first answer is correct, but you can achieve the same by doing:
val last = x.last
val rest = x.dropRight(1)
If you're willing to relax your need for immutable structures, there's always Stack and Queue:
val poppable = scala.collection.mutable.Stack[String]("hi", "ho")
val popped = poppable.pop
Similar to Python's ability to pop multiple elements, Queue handles that:
val multiPoppable = scala.collection.mutable.Queue[String]("hi", "ho")
val allPopped = multiPoppable.dequeueAll(_ => true)
If it is mutable.Queue, use the dequeue function:
/** Returns the first element in the queue, and removes this element
  * from the queue.
  *
  * @throws java.util.NoSuchElementException
  * @return the first element of the queue.
  */
def dequeue(): A =
  if (isEmpty)
    throw new NoSuchElementException("queue empty")
  else {
    val res = first0.elem
    first0 = first0.next
    decrementLength()
    res
  }
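A quick usage example (mine):

val q = scala.collection.mutable.Queue(1, 2, 3, 4)
val first = q.dequeue() // first = 1, q is now Queue(2, 3, 4)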

Scala count/access preceding item in a list from inside the list?

I'm looking for a way to access a preceding item in a list. The goal is to count a nested list and then append that nested list's length to the next list item. Preferably a way to do this inline.
Code example:
List(List[items], this.preceding.size)
To output:
List(List(item1,item2,item3), 3)
Thanks for your help!
This sort of thing is, in general, what folds and scans are good at. I'm not sure exactly what form you want, but here is something you can work off of:
val xs = List("salmon","cod","halibut")
xs.scanLeft((0,"")){ (prev, item) => (prev._2.length, item) }.tail
// List((0,salmon), (6,cod), (3,halibut))
You can substitute other lists for the strings.
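Adapting that to the nested-list shape from the question (my sketch, with made-up data): pair each inner list with the size of the list preceding it:

val xss = List(List("item1", "item2", "item3"), List("a", "b"))
xss.scanLeft((List.empty[String], 0)) { (prev, items) => (items, prev._1.size) }.tail
// List((List(item1, item2, item3),0), (List(a, b),3))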