I want my first piece of code to add 42 to each number from 1 to 100, and my second piece of code to convert every lowercase name to upper case.
However, the first one only prints the numbers from 1 to 100 unchanged, and the second one prints the names with only the first letter in upper case, as they were.
import scala.collection.parallel.CollectionConverters._

object parallelpro1 extends App {
  val list = (1 to 100).toList
  list.par.map(_ + 42)
  println(list)

  // map
  val lastNames = List("Smith", "Jones", "Frankenstein", "Bach", "Jackson", "Rodin").par
  lastNames.map(_.toUpperCase())
  println(lastNames)
}
You need to understand that Scala collections like List are immutable: list.par.map(_ + 42) does not change list, it creates a new collection, so you need to assign the result to a new val. The same applies to lastNames.map(_.toUpperCase()).
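A minimal sketch of the corrected program, with each result bound to a new val before printing:

import scala.collection.parallel.CollectionConverters._

object parallelpro1 extends App {
  val list = (1 to 100).toList
  val shifted = list.par.map(_ + 42) // a new parallel collection with 42 added to each element
  println(shifted)

  val lastNames = List("Smith", "Jones", "Frankenstein", "Bach", "Jackson", "Rodin").par
  val upperNames = lastNames.map(_.toUpperCase) // a new parallel collection with the names upper-cased
  println(upperNames)
}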
I have a map function which I would like to apply to specific columns in the dataframe. Say I have a dataframe ABC as follows:
A    B
foo  1
bar  4
biz  3
I want to multiply all the elements of the column B by 2 while retaining column A to get the output as follows:
A    B
foo  2
bar  8
biz  6
I am aware of how I can select the column B in the dataframe and use map to transform the elements as shown in the code below:
ABC.select("B").columns.map(c => c*2)
But my problem is that I am unable to get column A as well.
I first tried using something like this:
ABC.select("A", ABC.select("B").columns.map(c => c*2):_*)
However, this throws an error that 'select' method cannot be overloaded - which is fair since it does not accept arguments of the type Array[Column]. I then tried this:
val arrayB = Array("B")
ABC.select(ABC.columns.map(c => if (arrayB.contains(c)) col(c) * 2 else col(c)): _*)
This does work and returns me the result but I wanted to know if there is a better way of doing this. Thanks in advance!
Try this:
import org.apache.spark.sql.functions.{col, expr}
ABC.select(col("A"), expr("B * 2").as("B"))
You can also use selectExpr("A", "B * 2 as B") (or even col("B") * 2, maybe?), but I'd prefer the first one; it's cleaner and easier to maintain.
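For illustration, a small sketch of those alternatives side by side (assuming ABC is the DataFrame from the question):

import org.apache.spark.sql.functions.col

ABC.selectExpr("A", "B * 2 as B")            // SQL-expression style
ABC.select(col("A"), (col("B") * 2).as("B")) // Column-API style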
UPDATE (because of latest comments):
As far as I understood, you want to apply a specific function to all the columns inside your df. If I'm correct, then I would suggest you do this:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._
// This is the base function which will be curried //
def applyColumnTransformation(column: Column)(transformation: Column => Column): Column = transformation(column)
val myTransformation: Column => Column = inputColumn => {
  // your logic here, for instance:
  inputColumn * 2 // you can pass a name alias here, like (inputColumn * 2).as(inputColumn.toString)
}
def applyMyTransformation(column: Column): Column = applyColumnTransformation(column)(myTransformation)
Now you have all you need; simply do this:
val allColumns: Seq[Column] = myDf.columns.map(myDf.col).toSeq
val transformedColumns: Seq[Column] = allColumns.map(applyMyTransformation)
myDf.select(transformedColumns: _*)
By the way, make sure to take a look at the functions package; it actually has lots of useful functions.
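For instance, a quick sketch (with a made-up transformation) of how helpers from the functions package plug into the same Column => Column shape:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{trim, upper}

// Hypothetical transformation: trim and upper-case a string column,
// keeping the column's string representation as the alias.
val normalizeText: Column => Column = c => upper(trim(c)).as(c.toString)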
I am trying to add an element to a range in scala. Why does the following code snippet fail? What's the right way to do it?
import scala.collection.mutable.ListBuffer
val range = Range(1, 10)
val buffer = ListBuffer()
buffer.appendAll(range)
You haven't informed the compiler what type of elements buffer will hold.
val buffer = ListBuffer[Int]()
After that the appendAll() should work fine. But there's nothing in your code that will "add an element to a range" (or a list, as the question title falsely indicates). That's a different operation.
You can prepend or append a new element, but you get an IndexedSeq[Int] back.
0 +: range
range :+ 14
If you want a real Range you can build a new one.
val biggerRange = Range(range.start - 1, range.end + 2, range.step)
ListBuffer's appendAll needs a traversable object:
https://www.scala-lang.org/api/current/scala/collection/TraversableOnce.html
A workaround is to use:
val buffer = ListBuffer[Int]()
for (i <- range) buffer.append(i)
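As an additional sketch, once the element type is specified you can also append the whole range in one call:

import scala.collection.mutable.ListBuffer

val range = Range(1, 10)
val buffer = ListBuffer[Int]()
buffer ++= range // appends every element of the range
buffer += 42     // appending a single element works too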
Is it possible to convert an Int to a tuple identifier (in Scala)? So for a working example, suppose I had this:
val testTuple = ("Hector", "Jonas", "Javi")
val id = 2
println(testTuple._id) // does not work, as the compiler looks for a member literally named '_id'
I can see that tuple elements can be accessed by the order in which they appear, much like an index (except the first value is 1 rather than 0); e.g. testTuple._1 // is Hector works, as described here among other places.
So how can this be done? Many thanks
You can use testTuple.productElement(id - 1). But note that this returns Any.
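For example, a small sketch (productElement is 0-based, hence the id - 1, and returns Any, hence the cast):

val testTuple = ("Hector", "Jonas", "Javi")
val id = 2

val name = testTuple.productElement(id - 1).asInstanceOf[String]
println(name) // Jonas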
No, you cannot do this. The _n accessors are members of TupleN, where N automatically equals the size of the tuple. According to the docs:
For each TupleN type, where 1 <= N <= 22, Scala defines a number of
element-access methods.
Ex:
val data = (4,3,2)
val sum = data._1 + data._2 + data._3
For more information, you can see Scala Tuples.
Thanks.
I am working on creating a k-mer frequency counter (similar to word count in Hadoop) written in Scala. I'm fairly new to Scala, but I have some programming experience.
The input is a text file containing a gene sequence and my task is to get the frequency of each k-mer where k is some specified length of the sequence.
Therefore, the sequence AGCTTTC has three 5-mers (AGCTT, GCTTT, CTTTC).
I've parsed through the input and created one huge string containing the entire sequence, because newlines would otherwise throw off the k-mer counting: the end of one line's sequence should still form a k-mer with the beginning of the next line's sequence.
Now I am trying to write a function that will generate a list of maps, List[Map[String, Int]], with which it should be easy to use Scala's groupBy function to get the count of the common k-mers.
import scala.io.Source

object Main {
  def main(args: Array[String]) {
    // Get all of the lines from the input file
    val input = Source.fromFile("input.txt").getLines.toArray

    // Create one huge string which contains all the lines but the first
    val lines = input.tail.mkString.replace("\n", "")

    val mappedKmers: List[Map[String, Int]] = getMappedKmers(5, lines)
  }

  def getMappedKmers(k: Int, seq: String): List[Map[String, Int]] = {
    for (i <- 0 until seq.length - k) {
      Map(seq.substring(i, i + k), 1) // Map the k-mer to a count of 1
    }
  }
}
Couple of questions:
How to create/generate List[Map[String,Int]]?
How would you do it?
Any help and/or advice is definitely appreciated!
You're pretty close—there are three fairly minor problems with your code.
The first is that for (i <- whatever) foo(i) is syntactic sugar for whatever.foreach(i => foo(i)), which means you're not actually doing anything with the contents of whatever. What you want is for (i <- whatever) yield foo(i), which is sugar for whatever.map(i => foo(i)) and returns the transformed collection.
The second issue is that 0 until seq.length - k is a Range, not a List, so even once you've added the yield, the result still won't line up with the declared return type.
The third issue is that Map(k, v) tries to create a map with two key-value pairs, k and v. You want Map(k -> v) or Map((k, v)), either of which is explicit about the fact that you have a single argument pair.
So the following should work (note the switch from until to to, so that the final k-mer is included):
def getMappedKmers(k: Int, seq: String): IndexedSeq[Map[String, Int]] = {
  for (i <- 0 to seq.length - k) yield {
    Map(seq.substring(i, i + k) -> 1) // Map the k-mer to a count of 1
  }
}
You could also convert either the range or the entire result to a list with .toList if you'd prefer a list at the end.
It's worth noting, by the way, that the sliding method on Seq does exactly what you want:
scala> "AGCTTTC".sliding(5).foreach(println)
AGCTT
GCTTT
CTTTC
I'd definitely suggest something like "AGCTTTC".sliding(5).toList.groupBy(identity) for real code.
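For instance, a small sketch of the full frequency count built on sliding (the input string here is made up):

val seq = "AGCTTTCAGCTT"
val k = 5

// group the sliding windows by the k-mer itself, then count each group
val counts: Map[String, Int] =
  seq.sliding(k).toList.groupBy(identity).map { case (kmer, occurrences) => kmer -> occurrences.size }

println(counts) // e.g. AGCTT -> 2, GCTTT -> 1, CTTTC -> 1, ...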
I would like to remove duplicates from my data in my CSV file.
The first column is the year, and the second is the sentence. I would like to remove any duplicates of a sentence, regardless of the year information.
Is there a command that I can insert in val text = { } to remove these dupes?
My script is:
val source = CSVFile("science.csv");
val text = {
  source ~>
    Column(2) ~>
    TokenizeWith(tokenizer) ~>
    TermCounter() ~>
    TermMinimumDocumentCountFilter(30) ~>
    TermDynamicStopListFilter(10) ~>
    DocumentMinimumLengthFilter(5)
}
Thank you!
Essentially you want a version of distinct where you can specify what makes an object (row) unique (the second column).
Given the following code (modified from SeqLike.distinct):
import scala.collection.mutable

type Row = (Int, String)

def distinct(rows: Seq[Row], f: Row => AnyRef): Seq[Row] = {
  val b = Seq.newBuilder[Row]
  val seen = mutable.HashSet[AnyRef]()
  for (x <- rows) {
    val key = f(x)
    if (!seen(key)) {
      b += x
      seen += key
    }
  }
  b.result()
}
If you had a list of rows (where a row is a tuple) you could get the filtered/unique ones based on the second column with
distinct(rows, (_._2))
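For example, a quick sketch with some made-up rows:

// (year, sentence) pairs; the second row repeats the sentence of the first
val rows: Seq[Row] = Seq(
  (1999, "the cell divides"),
  (2001, "the cell divides"),
  (2003, "genes are expressed")
)

val unique = distinct(rows, _._2)
println(unique) // List((1999,the cell divides), (2003,genes are expressed))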
Do you need to have your code reproducible? If not, then in Excel, click on the "Data" tab, click the little box directly above "1" and to the left of "A" to highlight everything, and click "Remove Duplicates". Make sure "My data has headers" is selected if you have headers, and then untick the column that has the years, keeping only the column with the sentences checked. This will remove duplicate sentences but keep the first occurring instance of each, along with its year.
As sets naturally eliminate duplicates, a simple approach would be to fill the rows into a TreeSet, using a custom ordering which only takes into account the text part of each row.
Update
Here is a sample script to demonstrate the above:
import collection.immutable.TreeSet
import scala.io.Source
val lines = Source.fromFile("science.csv").getLines()
val uniques = lines.foldLeft(TreeSet[String]()(Ordering.by(_.split(',')(1)))) {
  (s, l) =>
    if (s contains l) s
    else s + l
}
uniques.toList.sorted foreach println
The script folds the sequence of lines into a TreeSet with a custom ordering based on the 2nd part of each comma-separated line. The simplest fold function would be (s, l) => s + l; however, that would result in lines with a later year overwriting lines with the same text from earlier years, which is why I test for containment first.
Now we are almost ready; we just need to reorder the collection by year again before printing (this assumes the input was ordered by year).