How to use Spark's PrefixSpan on real-world data (text file or sql)? - scala

I'm trying to use Spark's PrefixSpan algorithm, but it is comically difficult to get the data into the right shape to feed to it. It feels like a Monty Python skit where the API is actively working to confuse the programmer.
My data is a list of rows, each of which contains a list of text items.
a b c c c d
b c d e
a b
...
I have made this data available in two ways: an SQL table in Hive (where each row contains an array of items) and text files where each line contains the items above.
The official example creates a Seq of Array(Array).
If I use sql, I get the following type back:
org.apache.spark.sql.DataFrame = [seq: array<string>]
If I read in text, I get this type:
org.apache.spark.sql.Dataset[Array[String]] = [value: array<string>]
Here is an example of an error I get (if I feed it data from SQL):
error: overloaded method value run with alternatives:
[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: org.apache.spark.api.java.JavaRDD[Sequence])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] <and>
[Item](data: org.apache.spark.rdd.RDD[Array[Array[Item]]])(implicit evidence$1: scala.reflect.ClassTag[Item])org.apache.spark.mllib.fpm.PrefixSpanModel[Item]
cannot be applied to (org.apache.spark.sql.DataFrame)
new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run( sql("select seq from sequences limit 1000") )
^
Here is an example of the error if I feed it the text files:
error: overloaded method value run with alternatives:
[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: org.apache.spark.api.java.JavaRDD[Sequence])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] <and>
[Item](data: org.apache.spark.rdd.RDD[Array[Array[Item]]])(implicit evidence$1: scala.reflect.ClassTag[Item])org.apache.spark.mllib.fpm.PrefixSpanModel[Item]
cannot be applied to (org.apache.spark.sql.Dataset[Array[String]])
new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run(textfiles.map( x => x.split("\u0002")).limit(3))
^
I've tried to mold the data by using casting and other unnecessarily complicated logic.
This can't be so hard. Given a list of items (in the very reasonable format described above), how the heck do I feed it to PrefixSpan?
Edit:
I'm on Spark 2.2.1.
Resolved:
A column in the table I was querying had collections in each cell, which caused the returned results to come back wrapped in a WrappedArray. I changed my query so the result column contained only a string (using concat_ws). This made it much easier to deal with the type error.
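For anyone hitting the same wall before switching to concat_ws, here is a rough sketch (assuming the spark session of the Spark 2.2 shell and the table/column names from the question) of one way to turn a DataFrame with an array<string> column into the RDD[Array[Array[String]]] that PrefixSpan.run expects, treating each item as a single-item itemset:
import org.apache.spark.mllib.fpm.PrefixSpan

val df = spark.sql("select seq from sequences limit 1000")

// getSeq unwraps the WrappedArray; each item becomes a one-element itemset
val sequences = df.rdd.map(row => row.getSeq[String](0).map(item => Array(item)).toArray)

val model = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .run(sequences)

model.freqSequences.take(10).foreach { fs =>
  println(fs.sequence.map(_.mkString("[", ",", "]")).mkString(" ") + " : " + fs.freq)
}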

Related

How to create a Column expression from collection of column names?

I have a list of strings, which represents the names of various columns I want to add together to make another column:
val myCols = List("col1", "col2", "col3")
I want to convert the list to columns, then add the columns together to make a final column. I've looked for a number of ways to do this, and the closest I can come to the answer is:
df.withColumn("myNewCol", myCols.foldLeft(lit(0))(col(_) + col(_)))
I get a compile error saying it is looking for a String when all I really want is a Column. What's wrong, and how do I fix it?
When I tried it out in spark-shell, the error message said exactly what was wrong and where:
scala> myCols.foldLeft(lit(0))(col(_) + col(_))
<console>:26: error: type mismatch;
found : org.apache.spark.sql.Column
required: String
myCols.foldLeft(lit(0))(col(_) + col(_))
^
Just think about the first pair of arguments given to the function in foldLeft: it is going to be lit(0) of type Column (the accumulator) and "col1" of type String. There is no col function that accepts a Column, hence the mismatch.
Try reduce instead:
myCols.map(col).reduce(_ + _)
From the official documentation of reduce:
Applies a binary operator to all elements of this collection, going right to left.
the result of inserting op between consecutive elements of this collection, going right to left:
op(x_1, op(x_2, ..., op(x_{n-1}, x_n)...))
where x1, ..., xn are the elements of this collection.
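A minimal end-to-end sketch of that approach (the toy DataFrame below is made up for illustration, and spark.implicits._ is assumed to be in scope for toDF):
import org.apache.spark.sql.functions.col

val myCols = List("col1", "col2", "col3")
val df = Seq((1, 2, 3), (4, 5, 6)).toDF("col1", "col2", "col3")

// turn each name into a Column, then sum them pairwise
val summed = myCols.map(col).reduce(_ + _)
df.withColumn("myNewCol", summed).show()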
Here is how you can add columns dynamically based on the column names in a List. When all the columns are numeric, the result is numeric. The first argument to foldLeft (the initial accumulator) has the same type as the result, which is why it must be a Column here; with that in place, foldLeft works just as well as reduce.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

val employees: DataFrame = ??? // a DataFrame with two numeric columns, "salary" and "exp"
val initCol = lit(0)
val cols = Seq("salary", "exp")
val col1 = cols.foldLeft(initCol)((x, y) => x + col(y))
employees.select(col1).show()

Specifying sorting order for MappedColumnType based column in Slick

I have custom MappedColumnType for Java8 LocalDateTime, defined like this:
implicit val localDTtoDate = MappedColumnType.base[LocalDateTime, Timestamp](
  l => Timestamp.valueOf(l),
  d => d.toLocalDateTime
)
Columns of that type are used in table mappings in that way:
def timestamp = column[LocalDateTime]("ts")
Everything looks good, but I'm not able to sort on that column in different directions, because it lacks .asc and .desc (and, in fact, is not a ColumnOrdered type). How can I add sorting functionality for that type?
You can still sort on that column and use .desc and .asc. Just make sure the implicit mapping (the MappedColumnType val) is in scope where the query that uses .desc or .asc is built; otherwise you will get a compilation error.
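A minimal sketch of how this might fit together, assuming Slick 3.x with the H2 profile (the table and query names below are placeholders for illustration):
import java.sql.Timestamp
import java.time.LocalDateTime
import slick.jdbc.H2Profile.api._   // assumed profile; substitute your own

// the mapping must be visible both to the table definition and to the query
implicit val localDTtoDate = MappedColumnType.base[LocalDateTime, Timestamp](
  l => Timestamp.valueOf(l),
  d => d.toLocalDateTime
)

class Events(tag: Tag) extends Table[LocalDateTime](tag, "EVENTS") {
  def timestamp = column[LocalDateTime]("ts")
  def * = timestamp
}
val events = TableQuery[Events]

// with the implicit in scope, .asc and .desc compile as for any other column
val newestFirst = events.sortBy(_.timestamp.desc)
val oldestFirst = events.sortBy(_.timestamp.asc)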

matching keys of a hashmap to entries of a spark RDD in scala and adding value to if match found and writing rdd back to hbase

I am trying to read an HBase table using Scala and then add a new column (tags) based on the content of the rows in the HBase table. I have read the table as a Spark RDD.
I also have a hashmap whose keys are to be matched against the entries of the Spark RDD (generated from the HBase table); when a match is found, the corresponding value from the hashmap is to be added into a new column.
The function that writes the new column to the HBase table is this:
def convert(a: Int, s: String): Tuple2[ImmutableBytesWritable, Put] = {
  val p = new Put(a.toString.getBytes())
  p.add(Bytes.toBytes("columnfamily"), Bytes.toBytes("col_2"), s.toString.getBytes())
  println("the value of a is: " + a)
  new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(Bytes.toBytes(a)), p)
}
new PairRDDFunctions(newrddtohbaseLambda.map(x=>convert(x, ranjan))).saveAsHadoopDataset(jobConfig)
Then, to read the strings from the hashmap, compare them against the RDD, and write the matches back, the code is this:
csvhashmap.keys.foreach{i=> if (arrayRDD.zipWithIndex.foreach{case(a,j) => a.split(" ").exists(i contains _); p = j.toInt}==true){new PairRDDFunctions(convert(p,csvhashmap(i))).saveAsHadoopDataset(jobConfig)}}
Here csvhashmap is the hashmap described above, and arrayRDD is the RDD whose entries we are trying to match the strings against. When the above command is run, I get the following error:
error: type mismatch;
found : (org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Put)
required: org.apache.spark.rdd.RDD[(?, ?)]
How do I get rid of it? I have tried many ways of changing the data types, but each time I get some error. I have also checked the individual functions in the snippet above and they work fine on their own; it is only when I put them together that I get the error above. Any help would be appreciated.
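For reference, the error is saying that saveAsHadoopDataset needs a whole RDD of (ImmutableBytesWritable, Put) pairs, not the single tuple returned by one convert call. A hedged sketch of that shape, reusing convert and jobConfig from above and a hypothetical matchedRDD of (rowKey, tag) pairs produced by the matching step:
import org.apache.spark.rdd.{PairRDDFunctions, RDD}

// hypothetical result of the matching step: (row key, tag from csvhashmap)
val matchedRDD: RDD[(Int, String)] = ???

// map every pair through convert so the whole RDD becomes (ImmutableBytesWritable, Put)
val hbasePuts = matchedRDD.map { case (key, tag) => convert(key, tag) }
new PairRDDFunctions(hbasePuts).saveAsHadoopDataset(jobConfig)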

Aggregate a Spark data frame using an array of column names, retaining the names

I would like to aggregate a Spark data frame using an array of column names as input, and at the same time retain the original names of the columns.
df.groupBy($"id").sum(colNames:_*)
This works but fails to preserve the names. Inspired by the answer found here, I unsuccessfully tried this:
df.groupBy($"id").agg(sum(colNames:_*).alias(colNames:_*))
error: no `: _*' annotation allowed here
It works if I take a single element, like
df.groupBy($"id").agg(sum(colNames(2)).alias(colNames(2)))
How can I make this happen for the entire array?
Just provide a sequence of columns with aliases:
val colNames: Seq[String] = ???
val exprs = colNames.map(c => sum(c).alias(c))
df.groupBy($"id").agg(exprs.head, exprs.tail: _*)

Slick 2 aggregation - how to get a scalar result?

I have a table with an Int column TIME in it:
def time = column[Int]("TIME")
The table is mapped to a custom type. I want to find a maximum time value, i.e. to perform a simple aggregation. The example in the documentation seems easy enough:
val q = coffees.map(_.price)
val q1 = q.min
val q2 = q.max
However, when I do this, the type of q1 and q2 is Column[Option[Int]]. I can call get or getOrElse on this to obtain a result of type Column[Int] (even this seems somewhat surprising to me: is get a member of Column, or is the value converted from Option[Int] to Int and then wrapped in a Column again? Why?). But I am unable to use the scalar value; when I attempt to assign it to an Int, I get an error message saying:
type mismatch;
found : scala.slick.lifted.Column[Int]
required: Int
How can I get the Scala value from the aggregated query?
My guess is that you are not calling the invoker; that is why you get a Column object back instead of a value. Try this:
val q1 = q.min.run
That should return an Option[Int], on which you can then call get or getOrElse.
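A rough sketch of the whole thing, following the answer above and assuming Slick 2.x with the driver's simple._ import, a Database value db, and a TableQuery named records for the table with the TIME column (all of these names are placeholders):
import scala.slick.driver.H2Driver.simple._   // assumed driver; use your own

db.withSession { implicit session =>
  // executing the aggregate yields a plain Scala Option, not a Column
  val maxTime: Option[Int] = records.map(_.time).max.run
  val max: Int = maxTime.getOrElse(0)
}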