Get one random letter from each tuple then return them all as a string - Scala

I have 3 tuples in a list:
val l = List(("a","b"),("c","d"),("e","f"))
I want to choose one element from each tuple and return the resulting 3-letter word each time,
for example: fca or afd or cbf ...
How can I do this in Scala?
It should behave the same as this shell pipeline:
echo {a,b}{c,d}{e,f}|xargs -n1|shuf -n1|sed 's/\B/\n/g'|shuf|paste -sd ''

Working with tuples can be a bit of a pain. You can't easily index them and tuples of different sizes are considered different types in the type system.
val ts = List(("a", "b"), ("c", "d"), ("e", "f"))
val str = ts.map { t =>
  t.productElement(util.Random.nextInt(t.productArity))
}.mkString("")
Every time I run this I get a different result: bde, acf, bdf, etc.
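If every element of your list is known to be a pair, a slightly more type-safe sketch (assuming you only ever have Tuple2s) is to pattern match and flip a coin per tuple:
import scala.util.Random

val l = List(("a", "b"), ("c", "d"), ("e", "f"))
val word = l.map { case (left, right) =>
  if (Random.nextBoolean()) left else right
}.mkString
This avoids productElement's return type of Any, at the cost of only working for pairs.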

Filling in desired lines in Scala

I currently have a value result that is a String representing cycles in a graph:
scala> result
String =
0:0->52->22;
5:5->70->77;
8:8->66->24;8->42->32;
. //
. // trimmed to get my point across
. //
71:71->40->45;
77:77->34->28;77->5->70;
84:84->22->29
However, I want the output to also include the numbers in between, up to a certain value. In this example that value would be 90:
0:0->52->22;
1:
2:
3:
4:
5:5->70->77;
6:
7:
8:8->66->24;8->42->32;
. //
. // trimmed
. //
83:
84:84->22->29;
85:
86:
87:
88:
89:
90:
If it helps or makes any difference, this value is later changed to a list, like so:
val list_result = result.split("\n").toList
List[String] = List(0:0->52->22;, 5:5->70->77;, 8:8->66->24;8->42->32;, 11:11->26->66;11->17->66;
My initial thought was to insert the missing numbers into the list and then sort it, but I had trouble with the sorting, so I looked here for a better method.
Turn your list_result into a Map with default values. Then walk through the desired number range, exchanging each for its Map value.
val map_result: Map[String, List[String]] =
  list_result
    .groupBy("\\d+:".r.findFirstIn(_).getOrElse("bad"))
    .withDefault(List(_))

val full_result: String =
  (0 to 90).flatMap(n => map_result(s"$n:")).mkString("\n")
Here's a Scastie session to see it in action.
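For reference, here is a minimal self-contained version of the same idea, using the question's sample lines and a smaller range purely for illustration:
val list_result = List("0:0->52->22;", "5:5->70->77;", "8:8->66->24;8->42->32;")

val map_result: Map[String, List[String]] =
  list_result
    .groupBy("\\d+:".r.findFirstIn(_).getOrElse("bad"))
    .withDefault(List(_))

// prints 0:0->52->22;, then 1: through 4:, 5:5->70->77;, 6:, 7:, 8:8->66->24;8->42->32;, 9:, 10:
println((0 to 10).flatMap(n => map_result(s"$n:")).mkString("\n"))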
One option would be to use a Map as an intermediate data structure:
val l: List[String] = List("0:0->52->22;", "5:5->70->77;", "8:8->66->24;8->42->32;", "11:11->26->66;11->17->66;")
val byKey: List[Array[String]] = l.map(_.split(":"))
val stop = 90
val mapOfValues = (0 to stop).map(_ -> "").toMap
val output = byKey.foldLeft(mapOfValues)((acc, nxt) => acc + (nxt.head.toInt -> nxt.tail.head))
output.toList.sorted.foreach { case (key, value) => println(s"$key:$value") }
This will give you the output you are after. It breaks your input strings into pseudo key-value pairs, creates a map to hold the results, inserts the elements of byKey into the map, then returns a sorted list of the results.
Note: If you are using this in anything like production code you'd need to properly check that each Array in byKey does have two elements, to prevent exceptions from the later calls to head and tail.head.
The provided solutions are fine, but I would like to suggest one that can process the data lazily and doesn't need to keep all data in memory at once.
It uses a nice function called unfold, which allows you to "unfold" a collection from a starting state, up to a point where you deem the collection to be over (docs).
It's not perfectly polished but I hope it may help:
def readLines(s: String): Iterator[String] =
  util.Using.resource(io.Source.fromString(s))(_.getLines())

def emptyLines(from: Int, until: Int): Iterator[String] =
  Iterator.range(from, until).map(n => s"$n:")

def indexOf(line: String): Int =
  Integer.parseInt(line.substring(0, line.indexOf(':')))

def withDefaults(from: Int, to: Int, it: Iterator[String]): Iterator[String] = {
  Iterator.unfold((from, it)) { case (n, lines) =>
    if (lines.hasNext) {
      val next = lines.next()
      val i = indexOf(next)
      Some((emptyLines(n, i) ++ Iterator.single(next), (i + 1, lines)))
    } else if (n < to) {
      Some((emptyLines(n, to + 1), (to, lines)))
    } else {
      None
    }
  }.flatten
}
You can see this in action here on Scastie.
What unfold does is start from a state (in this case, the line number from and the iterator with the lines) and, at every iteration:
- if there are still elements in the iterator, it gets the next item, identifies its index and returns:
  - as the next item, an Iterator with empty lines up to the latest line number, followed by the actual line
    (e.g. when 5 is reached, the empty lines between 1 and 4 are emitted, terminated by the line starting with 5)
  - as the next state, the index of the line after the last one in the emitted item, and the iterator itself (which, being stateful, is consumed by the repeated calls to unfold at each iteration)
    (e.g. after processing 5, the next state is 6 and the iterator)
- if there are no elements in the iterator anymore but the to index has not been reached, it emits another Iterator with the remaining items to be printed (in your example, those after 84)
- if both conditions are false, we don't need to emit anything anymore and we can close the "unfolding" collection, signalling this by returning None instead of Some[(Item, State)]
This returns an Iterator[Iterator[String]] where every nested iterator is a range of values from one line to the next, with the default empty lines "sandwiched" in between. The call to flatten turns it into the desired result.
I used an Iterator to make sure that only the essential state is kept in memory at any time and only when it's actually used.
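A quick way to try it (assuming the definitions above are in scope) is to feed the question's sample lines in via linesIterator rather than through readLines:
val sample =
  """0:0->52->22;
    |5:5->70->77;
    |8:8->66->24;8->42->32;""".stripMargin

// prints 0:0->52->22;, then 1: through 4:, 5:5->70->77;, 6:, 7:, 8:8->66->24;8->42->32;, 9:, 10:
withDefaults(0, 10, sample.linesIterator).foreach(println)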

Scala - Not enough arguments for method count

I am fairly new to Scala and Spark RDD programming. The dataset I am working with is a CSV file containing a list of movies (one row per movie) and their associated user ratings (a comma-delimited list of ratings). Each column in the CSV represents a distinct user and what rating he/she gave the movie. Thus, user 1's ratings for each movie are represented in the 2nd column from the left:
Sample Input:
Spiderman,1,2,,3,3
Dr.Sleep, 4,4,,,1
I am getting the following error:
Task4.scala:18: error: not enough arguments for method count: (p: ((Int, Int)) => Boolean)Int.
Unspecified value parameter p.
var moviePairCounts = movieRatings.reduce((movieRating1, movieRating2) => (movieRating1, movieRating2, movieRating1._2.intersect(movieRating2._2).count()
when I execute the lines below. The second line of code splits all the values delimited by "," and produces this:
( Spiderman, [[1,0],[2,1],[-1,2],[3,3],[3,4]] )
( Dr.Sleep, [[4,0],[4,1],[-1,2],[-1,3],[1,4]] )
On the third line, taking the count() throws an error. For each movie (row), I am trying to get the number of common elements. In the above example, [-1, 2] is clearly a common element shared by both Spiderman and Dr.Sleep.
val textFile = sc.textFile(args(0))
var movieRatings = textFile.map(line => line.split(","))
  .map(movingRatingList => (movingRatingList(0), movingRatingList.drop(1)
    .map(ranking => if (ranking.isEmpty) -1 else ranking.toInt).zipWithIndex));
var moviePairCounts = movieRatings.reduce((movieRating1, movieRating2) =>
  (movieRating1, movieRating2, movieRating1._2.intersect(movieRating2._2).count())).saveAsTextFile(args(1));
My target output of line 3 is as follows:
( Spiderman, Dr.Sleep, 1 ) --> Between these 2 movies, there is 1 common entry.
Can somebody please advise?
To get the number of elements in a collection, use length or size. count(p) returns the number of elements that satisfy a predicate p.
Or you could avoid building the complete intersection by using count to count the elements of the first collection which the second contains:
movieRating1._2.count(movieRating2._2.contains(_))
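As a small illustration of the difference, taking the question's two sample rows as (rating, index) pairs:
val spiderman = Seq((1, 0), (2, 1), (-1, 2), (3, 3), (3, 4))
val drSleep   = Seq((4, 0), (4, 1), (-1, 2), (-1, 3), (1, 4))

spiderman.intersect(drSleep).size      // 1 -- size/length take no arguments
spiderman.count(drSleep.contains(_))   // 1 -- count takes a predicate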
The error message seems pretty clear: count takes one argument, but in your call, you are passing an empty argument list, i.e. zero arguments. You need to pass one argument to count.

How to explode a struct column with a prefix?

My goal is to explode a Spark struct column (i.e., take the fields from inside the struct and expose them as regular columns of the dataset), which I have already done, but changing the inner field names by prepending an arbitrary string. One of the motivations is that my struct can contain columns that have the same name as columns outside of it - therefore, I need a way to differentiate them easily. Of course, I do not know beforehand what the columns inside my struct are.
Here is what I have so far:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
}
This does the job alright - I use this writing:
df.explodeStruct("myColumn")
It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.
As for prepending the prefix, my idea is to take the column and find out what its inner columns are. I browsed the documentation and could not find any method on the Column class that does that. I then changed my approach to taking the schema of the DataFrame, filtering the result by the name of the column, and extracting the matching entry from the resulting array. The problem is that this element has the type StructField - which, again, offers no way to extract its inner fields - whereas what I would really like is to get hold of a StructType element, which has the .getFields method that does exactly what I want (that is, showing me the names of the inner columns, so I can iterate over them and use them in my select, prepending the prefix I want to them). I know of no way to convert a StructField to a StructType.
My last attempt would be to parse the output of StructField.toString - which contains all the names and types of the inner columns, although that feels really dirty, and I'd rather avoid that lowly approach.
Any elegant solution to this problem?
Well, after reading my own question again, I figured out an elegant solution to the problem - I just needed to select all the columns the way I was doing, and then compare the result back to the original dataframe to figure out which columns were new. Here is the final result - I also made it so that the exploded columns show up in the same place as the original struct column, so as not to break the flow of information:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = {
    val prefix = column + "_"
    val originalPosition = df.columns.indexOf(column)
    val dfWithAllColumns = df.select("*", column + ".*")
    val explodedColumns = dfWithAllColumns.columns diff df.columns
    val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)
    val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)
    df.select(finalColumnsList: _*)
  }
}
Of course, you can customize the prefix, the separator, and so on - but that is simple; anyone could tweak the parameters. The usage remains the same.
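For illustration, a hypothetical end-to-end use (the data, column names and session setup here are made up, and the implicit class above is assumed to be in scope):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

val spark = SparkSession.builder().master("local[*]").appName("explodeStructDemo").getOrCreate()
import spark.implicits._

// "myColumn" is a struct whose inner names could clash with top-level ones
val df = Seq((1, "a", 10), (2, "b", 20))
  .toDF("id", "name", "value")
  .select($"value", struct($"id", $"name").as("myColumn"))

df.explodeStruct("myColumn").columns
// Array(value, myColumn_id, myColumn_name)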
In case anyone is interested, here is something similar for PySpark:
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def explode_struct(df: DataFrame, column: str) -> DataFrame:
    original_position = df.columns.index(column)
    new_columns = df.select(column + ".*").columns
    exploded_columns = [F.col(column + "." + c).alias(column + "_" + c) for c in new_columns]
    col_list = [F.col(c) for c in df.columns]
    col_list.pop(original_position)
    col_list[original_position:original_position] = exploded_columns
    return df.select(col_list)

Scala: Splitting a line and counting the number of words

I am new to Scala and still learning...
val pair = ("99", "ABC", 88)
pair.toString().split(",").foreach { x => println(x)}
This prints the split parts. But how do I count the number of split words?
I am trying as below:
pair.toString().split(",").count { x => ??? }
I am not sure how I can get the count for the split line, i.e. 3.
Any help appreciated....
Tuples are equipped with product functions such as productElement, productPrefix, productArity and productIterator for processing their elements.
Note that
pair.productArity
res0: Int = 3
and that
pair.productIterator foreach println
99
ABC
88
pair.toString().split(",").size will give you the number of elements. On the other hand, you have a Tuple3, so its size will only ever be three. Asking for a size function on a tuple is rather redundant; their sizes are fixed by their type.
Plus, if any of the elements contain a comma, your function will break.
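A quick sketch of that caveat (the element "9,9" below is contrived just to show it):
val pair = ("9,9", "ABC", 88)

pair.toString.split(",").size   // 4 -- the comma inside "9,9" inflates the count
pair.productArity               // 3 -- always the real number of elements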

Map word ngrams to counts in Scala

I'm trying to create a map which goes through all the ngrams in a document and counts how often they appear. Ngrams are sets of n consecutive words in a sentence (so in the last sentence, (Ngrams, are) is a 2-gram, (are, sets) is the next 2-gram, and so on). I already have code that creates a document from a file and parses it into sentences. I also have a function to count the ngrams in a sentence, ngramsInSentence, which returns Seq[Ngram].
I'm getting stuck syntactically on how to create my counts map. I am iterating through all the ngrams in the document in the for loop, but don't know how to map the ngrams to the count of how often they occur. I'm fairly new to Scala and the syntax is evading me, although I'm clear conceptually on what I need!
def getNGramCounts(document: Document, n: Int): Counts = {
  for (sentence <- document.sentences; ngram <- nGramsInSentence(sentence, n))
    // I need code here to map ngram -> count how many times ngram appears in document
}
The type Counts above, as well as NGram, are defined as:
type Counts = Map[NGram, Double]
type NGram = Seq[String]
Does anyone know the syntax to map the ngrams from the for loop to a count of how often they occur? Please let me know if you'd like more details on the problem.
If I'm correctly interpreting your code, this is a fairly common task.
def getNGramCounts(document: Document, n: Int): Counts = {
  val allNGrams: Seq[NGram] = for {
    sentence <- document.sentences
    ngram <- nGramsInSentence(sentence, n)
  } yield ngram
  allNGrams.groupBy(identity).mapValues(_.size.toDouble)
}
The allNGrams variable collects a list of all the NGrams appearing in the document.
You should eventually turn to Streams if the document is big and you can't hold the whole sequence in memory.
The following groupBy creates a Map[NGram, List[NGram]] which groups your values by their identity (the argument to the method defines the criterion for "aggregate identification") and collects the corresponding values in a list.
You then only need to map the values (each List[NGram]) to their size to get how many occurrences of each NGram there were.
I took for granted that:
NGram has the expected correct implementation of equals + hashcode
document.sentences returns a Seq[...]. If not you should expect allNGrams to be of the corresponding collection type.
UPDATED based on the comments
I wrongly assumed that the groupBy(_) would shortcut the input value. Use the identity function instead.
I converted the count to a Double
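As a self-contained sketch of that counting idiom, here it is applied to a handful of literal 2-grams (recall NGram is just Seq[String]):
val allNGrams: Seq[Seq[String]] = Seq(
  Seq("to", "be"), Seq("be", "or"), Seq("or", "not"),
  Seq("not", "to"), Seq("to", "be")
)

val counts: Map[Seq[String], Double] =
  allNGrams.groupBy(identity).map { case (ngram, occurrences) => ngram -> occurrences.size.toDouble }

// counts(Seq("to", "be")) == 2.0; every other 2-gram maps to 1.0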
Appreciate the help - I have the correct code now using the suggestions above. The following returns the desired result:
def getNGramCounts(document: Document, n: Int): Counts = {
  val allNGrams: Seq[NGram] =
    for (sentence <- document.sentences; ngram <- ngramsInSentence(sentence, n))
      yield ngram
  allNGrams.groupBy(l => l).map(t => (t._1, t._2.length.toDouble))
}