Scala: expanding Array of arguments into List gives error - scala

I am attempting to pass a list of parameters to a function.
scala> val a = Array("col1", "col2")
a: Array[String] = Array(col1, col2)
I'm trying to use the :_* notation, but it's not working, and I cannot for the life of me work out why!
val edges = all_edges.select(a:_*)
<console>:27: error: overloaded method value select with alternatives:
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (String)
This, however, does work:
val edges = all_edges.select("col1", "col2")
Not sure if it is relevant, but all_edges is a Spark DataFrame from which I am attempting to keep only the columns specified in a list.
scala> all_edges
res4: org.apache.spark.sql.DataFrame
Any ideas? I've been trying to work out the syntax from e.g. Passing elements of a List as parameters to a function with variable arguments, but I don't seem to be getting far.
Edit: Just found How to "negative select" columns in spark's dataframe - but I am confused as to why the syntax twocol.select(selectedCols.head, selectedCols.tail: _*) is necessary?

If you want to pass strings, the signature of the function indicates that you have to pass at least one:
(col: String,cols: String*)org.apache.spark.sql.DataFrame
So you have to single out the first argument of your list: Spark cannot determine, from the type of a Traversable alone, that it is not empty.
val edges = all_edges.select(a.head, a.tail: _*)
Now, that's the dirty version of it. If you want to do this rigorously, you should check that the list is not empty yourself:
val edges = a.headOption.map(fst => all_edges.select(fst, a.drop(1): _*))
(Note that this gives you an Option[DataFrame].)
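If the array may be empty, another option (a sketch; it targets the (cols: Column*) overload shown in the error message, using org.apache.spark.sql.functions.col) is to convert the names to Columns first, which avoids splitting off the head:
import org.apache.spark.sql.functions.col
// The Column* overload accepts zero or more columns, so an empty array is fine:
val edges = all_edges.select(a.map(col): _*)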

Related

How to implement GetResult[List[String]] in scala slick?

I'm using SQLActionBuilder, e.g. sql"""select ...""", to build a generic/wide SQL query, and I don't care how many result columns it has.
The documented examples use as[TupleX] to decide the result type; in my case, I want to use List[String] in place of the TupleX type.
I have attempted sQLActionBuilder.as[List[String]], but a compile error is encountered:
Error:(8, 187) could not find implicit value for parameter rconv: slick.jdbc.GetResult[List[String]]
val x = reportUtilRepository.find(List())("td_report_region")(1469635200000L, 1475251200000L)("region" :: Nil, "clicks" :: "CPC" :: Nil)(List("1", "2"), List("regionType" -> "1"))(_.as[List[String]]).map(x => println(x.toString))
and sQLActionBuilder.as[List[(String, String, String)]] works well. So how can I use List[String] to match an arbitrary result?
I think a straightforward way is to implement a GetResult[List[String]], as the compiler suggests, but I don't know how to do it. Other approaches are also welcome.
Thanks.
First of all, querying a database always returns a list of tuples, so the result type will be List[TupleX]: each row is represented as a record, and the columns in each row become the respective tuple elements.
Therefore, your data will look like List((1,2,3),(3,4,5)), with type List[(Int, Int, Int)]. To produce a List[Int] you might do the following:
val a = List((1, 2, 3), (3, 4, 5))
a.flatMap(x => List(x._1, x._2, x._3))
res0: List[Int] = List(1, 2, 3, 3, 4, 5)
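As for the GetResult[List[String]] the compiler asks for, it can be written by hand. A minimal sketch, assuming Slick's PositionedResult API (numColumns and nextString read the current row positionally):
import slick.jdbc.GetResult

// Read every column of the current row as a String, whatever the column count:
implicit val getListString: GetResult[List[String]] =
  GetResult(r => (1 to r.numColumns).map(_ => r.nextString()).toList)
With this implicit in scope, sQLActionBuilder.as[List[String]] should compile.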

(Scala) How to convert List into a Seq

I have a function like
def myFunction(i:Int*) = i.map(a => println(a))
but I have a List of Int's.
val myList:List[Int] = List(1,2,3,4)
Desired output:
1
2
3
4
How can I programmatically convert myList so it can be inserted into myFunction?
Looking at your desired input and output, you want to pass a List where a varargs argument is expected.
A varargs method can receive zero or more arguments of the same type.
Inside the method, the varargs parameter is seen as a Seq[T] (unlike Java, where it is an array).
You can pass any Seq where varargs are expected by ascribing it with : _* (often called varargs expansion):
myFunction(myList: _*)
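Putting it together as a self-contained sketch (using foreach instead of map here, since the result of the mapping is discarded):
def myFunction(i: Int*): Unit = i.foreach(println)

val myList: List[Int] = List(1, 2, 3, 4)

myFunction(myList: _*) // prints 1, 2, 3 and 4 on separate lines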

The argument types of an anonymous function must be fully known. (SLS 8.5) when word2vec applied for dataframe

I am applying Spark's Word2Vec to a DataFrame. Here is my code:
val df2 = df.groupBy("LABEL").agg(collect_list("TERM").alias("TERM"))
val word2Vec = new Word2Vec()
.setInputCol("TERM")
.setOutputCol("result")
.setMinCount(0)
val model = word2Vec.fit(df2)
val result = model.transform(df2)
val synonyms = model.findSynonyms("4", 10)
//synonyms.foreach(println)
for((synonym, cosineSimilarity) <- synonyms) {
println(s"$synonym $cosineSimilarity")
}
When I use synonyms.foreach(println) the code works; however, the returned results are not ordered by their similarity scores. Instead, I tried the for loop seen at the bottom of the code. When applying it, the following error is thrown:
Error:(52, 40) missing parameter type for expanded function
The argument types of an anonymous function must be fully known. (SLS 8.5)
Expected type was: ?
for((synonym, cosineSimilarity) <- synonyms) {
^
From other similar Stack Overflow questions and the error itself, it seems the exact argument types are needed. In the for loop, synonyms is a DataFrame, and the returned values have types String and Double, respectively. All my attempts so far have failed. How can I remedy this?
The result of findSynonyms is a non-materialized, Spark-internal DataFrame. You cannot simply iterate over the result.
def findSynonyms(word: Vector, num: Int): DataFrame = {
..
sc.parallelize(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
}
Note that the reason foreach worked is that it is a materialization method defined directly on DataFrames.
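If you want typed access to each row in similarity order, a workaround (a sketch; it assumes the column names "word" and "similarity" from the source above, and that the handful of rows fits on the driver) is to materialize the DataFrame with collect() and read each Row positionally:
import org.apache.spark.sql.functions.desc

// Materialize the internal DataFrame on the driver, ordered by score:
val rows = synonyms.orderBy(desc("similarity")).collect()

for (row <- rows) {
  val synonym = row.getString(0)          // column "word"
  val cosineSimilarity = row.getDouble(1) // column "similarity"
  println(s"$synonym $cosineSimilarity")
}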

How to generate tab separated output using saveastextfile in Spark?

I'm using Scala, and I want saveAsTextFile to directly save the result as tab separated, like, for example:
a 1
b 4
c 5
(space is tab)
I just want to use saveAsTextFile (not print), and when I have, say, an RDD[(String, Double)], I cannot use
ranks = ranks.map( f => f._1 +"\t"+f._2)
It says the types do not match; I guess it's because f._1 is a String and f._2 is a Double?
The only mistake in your code is trying to re-assign the result of the mapping to the same ranks variable. I'm assuming ranks has type RDD[(String, Double)], so indeed you can't assign a value of type RDD[String] to it. Simply use a separate variable:
val ranks: RDD[(String, Double)] = sc.parallelize(Seq(("a", 1D), ("b", 4D)))
val tabSeparated: RDD[String] = ranks.map(f => f._1 +"\t"+f._2)
tabSeparated.saveAsTextFile("./test.tsv")
In general, it's almost always better to use vals and not vars to prevent such mistakes.
NOTE: a perhaps cleaner way to convert a tuple (of any size) into a tab-delimited string:
ranks.map(_.productIterator.mkString("\t"))
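For example, a small end-to-end sketch of that variant (reusing the sample ranks value from above; the output directory name is hypothetical):
ranks
  .map(_.productIterator.mkString("\t")) // "a\t1.0", "b\t4.0", ...
  .saveAsTextFile("./test2.tsv")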

How to unpack a map/list in scala to tuples for a variadic function?

I'm trying to create a PairRDD in spark. For that I need a tuple2 RDD, like RDD[(String, String)]. However, I have an RDD[Map[String, String]].
I can't work out how to get rid of the iterable so I'm just left with RDD[(String, String)] rather than e.g. RDD[List[(String, String)]].
A simple demo of what I'm trying to make work is this broken code:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The last line doesn't work because pairs is an RDD[Map[String, Int]] when it needs to be an RDD[(String, Int)].
So how can I get rid of the iterable in pairs above to convert the Map to just a tuple2?
You can actually just run:
val counts = pairs.flatMap(identity).reduceByKey(_ + _)
Note the usage of the identity function, which replicates the functionality of flatten on an RDD (RDDs have no flatten method), and that reduceByKey(_ + _) uses the underscore notation for conciseness.
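For completeness, a minimal end-to-end sketch of the fix (assuming data.txt holds one key per line); you could also skip the Map entirely and emit tuples directly:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))                  // RDD[Map[String, Int]]
val counts = pairs.flatMap(identity).reduceByKey(_ + _)  // RDD[(String, Int)]

// Simpler alternative: build the pairs as tuples from the start.
val counts2 = lines.map(s => (s, 1)).reduceByKey(_ + _)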