How to vectorize a set of columns for a multilabel classification problem in Pyspark - pyspark

I am currently working on a multi-label classification problem where the classes to be predicted are spread among multiple columns.
+--------------------+-------+-------+-------+-------+
|            features|Label_1|Label_2|Label_3|Label_4|
+--------------------+-------+-------+-------+-------+
|(195,[0,1,2,3,4,5...|      0|      1|      0|      0|
|(195,[0,1,2,3,4,5...|      1|      0|      1|      0|
|(195,[0,1,2,5,6,1...|      0|      0|      1|      0|
|(195,[0,1,2,3,4,5...|      0|      0|      0|      0|
|(195,[0,1,2,3,4,5...|      0|      1|      0|      0|
+--------------------+-------+-------+-------+-------+
I am trying to find a method that can vectorize these labels into a single unified array (something like VectorAssembler, though as far as I know it is meant to be used only for feature variables, not target variables), and a classifier model in PySpark that can take such a vector as the target. Any help or guidance will be appreciated.
Thanks
I searched the web but I can't seem to find documentation about it.
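For what it's worth, VectorAssembler does not actually restrict its inputs to feature columns; it will assemble any numeric columns into one vector, including label columns. Below is a minimal sketch of that first step (written in Scala to match the rest of this page; the PySpark VectorAssembler API mirrors it, and df is assumed to be the DataFrame shown above):

import org.apache.spark.ml.feature.VectorAssembler

// Assemble the label columns into a single vector column, just as you would for features.
val labelAssembler = new VectorAssembler()
  .setInputCols(Array("Label_1", "Label_2", "Label_3", "Label_4"))
  .setOutputCol("labels_vec")

val assembled = labelAssembler.transform(df)

Whether a given Spark ML classifier will accept a vector-valued target is a separate question; most built-in classifiers expect a single numeric label column.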

Related

Spark question: if I do not cache a dataframe, will it be run multiple times?

If I do not cache a dataframe which is generated using Spark SQL with a limit option, will I get unstable results whenever I edit the resulting dataframe and show it?
Description.
I have a table like the one below, which is generated using Spark SQL with a limit option:
+---------+---+---+---+---+
|partition| | 0| 1| 2|
+---------+---+---+---+---+
| 0| 0| 0| 10| 18|
| 1| 0| 0| 10| 17|
| 2| 0| 0| 13| 17|
+---------+---+---+---+---+
And if I add a column to get the row sum and call show() again, the dataframe has different values, like below:
+---------+---+---+---+---+-------+-----------+------------+------------------+------------------+
|partition| | 0| 1| 2|row_sum|percent of |percent of 0| percent of 1| percent of 2|
+---------+---+---+---+---+-------+-----------+------------+------------------+------------------+
| 0| 0| 0| 10| 13| 23| 0.0| 0.0| 43.47826086956522| 56.52173913043478|
| 1| 0| 0| 13| 16| 29| 0.0| 0.0|44.827586206896555|55.172413793103445|
| 2| 0| 0| 15| 14| 29| 0.0| 0.0|51.724137931034484|48.275862068965516|
+---------+---+---+---+---+-------+-----------+------------+------------------+------------------+
I suspect that editing the original dataframe obtained from the first Spark SQL query causes that query to be re-run, and the edits are then applied to the new result.
Is this true?
cache() in Spark is lazy: it only marks the dataframe for persistence, and nothing is actually cached until you call an action on that dataframe.
If your query fetches only 10 records using limit, then calling an action such as show materializes the plan and fetches 10 records at that time. If you have not cached the dataframe, and you apply further transformations and then call another action on the newly created dataframe, Spark re-executes the transformations from the root of the lineage graph. That is why you can see different output every time unless you cache that dataframe.
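A minimal sketch of the difference (table and column names here are illustrative, not from the question, and spark is the usual SparkSession):

import org.apache.spark.sql.functions.expr

// Without cache(), every action re-runs the LIMIT query, which may return a different subset.
val sampled = spark.sql("SELECT partition, c0, c1, c2 FROM my_table LIMIT 10")
sampled.cache()   // lazy: nothing is persisted yet
sampled.count()   // first action materializes the 10 rows and caches them

// Later transformations now reuse the cached rows, so the results stay consistent.
val withSum = sampled.withColumn("row_sum", expr("c0 + c1 + c2"))
withSum.show()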

How to create a distributed sparse matrix in Spark from DataFrame in Scala

Question
Please help me find a way to create a distributed matrix from (user, feature, value) records in a DataFrame, where features and their values are stored in a column.
An excerpt of the data is below, but there are a large number of users and features, and not all features are tested for every user. Hence many feature values are null and need to be imputed to 0.
For instance, a blood test may have sugar level, cholesterol level, etc. as features. If those levels are not acceptable, then 1 is set as the value. But not all of the features will be tested for every user (or patient).
+----+-------+-----+
|user|feature|value|
+----+-------+-----+
| 14| 0| 1|
| 14| 222| 1|
| 14| 200| 1|
| 22| 0| 1|
| 22| 32| 1|
| 22| 147| 1|
| 22| 279| 1|
| 22| 330| 1|
| 22| 363| 1|
| 22| 162| 1|
| 22| 811| 1|
| 22| 290| 1|
| 22| 335| 1|
| 22| 681| 1|
| 22| 786| 1|
| 22| 789| 1|
| 22| 842| 1|
| 22| 856| 1|
| 22| 881| 1|
+----+-------+-----+
If the features were already columns, then there are approaches already explained:
Spark - How to create a sparse matrix from item ratings
Calculate Cosine Similarity Spark Dataframe
How to convert a DataFrame to a Vector.dense in scala
But this is not the case here. So one way could be to pivot the dataframe and apply those methods:
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|user| 0| 32|147|162|200|222|279|290|330|335|363|681|786|789|811|842|856|881|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 14| 1| 0| 0| 0| 1| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 22| 1| 1| 1| 1| 0| 0| 1| 1| 1| 1| 1| 1| 1| 1| 1| 1| 1| 1|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
Then use a row-to-vector conversion. I suppose using one of these:
VectorAssembler
org.apache.spark.mllib.linalg.Vectors.fromML
org.apache.spark.mllib.linalg.distributed.MatrixEntry
However, since there will be many null values to be imputed to 0, the pivoted dataframe will consume far more memory. Also, pivoting a large dataframe distributed among multiple nodes would cause heavy shuffling.
Hence I am seeking advice, ideas, and suggestions.
Related
Spark - How to create a sparse matrix from item ratings
Calculate Cosine Similarity Spark Dataframe
How to convert a DataFrame to a Vector.dense in scala
VectorAssembler
Scalable Sparse Matrix Multiplication in Apache Spark
Spark MLlib Data Types | Apache Spark Machine Learning
Linear Algebra and Distributed Machine Learning in Scala using Breeze and MLlib
Environment
Spark 2.4.4
Solution
Create an RDD[(user, feature)] from each input line.
groupByKey to create an RDD[(user, [feature+])].
Create an RDD[IndexedRow] where each IndexedRow represents a row like the one below over all existing features.
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|user| 0| 32|147|162|200|222|279|290|330|335|363|681|786|789|811|842|856|881|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 14| 1| 0| 0| 0| 1| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
Convert the RDD[IndexedRow] into IndexedRowMatrix.
For the product operation, convert the IndexedRowMatrix into a BlockMatrix, which supports the product operation in a distributed manner.
Convert each original record into an IndexedRow
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

def toIndexedRow(userToFeaturesMap: (Int, Iterable[Int]), maxFeatureId: Int): IndexedRow = {
  userToFeaturesMap match {
    case (userId, featureIDs) => {
      val featureCountKV = featureIDs.map(i => (i, 1.0)).toSeq
      new IndexedRow(
        userId,
        Vectors.sparse(maxFeatureId + 1, featureCountKV)
      )
    }
  }
}
val userToFeatureCounters = featureData.rdd
  .map(rowPF => (rowPF.getInt(0), rowPF.getInt(1)))                            // Out from Row -> (userId, featureId)
  .groupByKey()                                                                // (userId, Iterable(featureId))
  .map(userToFeatureIDsMap => toIndexedRow(userToFeatureIDsMap, maxFeatureId)) // IndexedRow(userId, Vector((featureId, 1)))
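The snippet above assumes maxFeatureId is already defined, which the answer never shows. One way to derive it from the same DataFrame (an assumption about the schema, with the feature id as the second Int column, as in the map above):

// Assumption (not shown in the original answer): maxFeatureId is the largest feature id observed.
val maxFeatureId: Int = featureData.rdd
  .map(row => row.getInt(1))
  .max()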
Created IndexedRowMatrix
val userFeatureIndexedMatrix = new IndexedRowMatrix(userToFeatureCounters)
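The transpose and multiply steps below use userFeatureBlockMatrix, which is never constructed explicitly; presumably it is the BlockMatrix form of the IndexedRowMatrix, e.g.:

// Assumed conversion step (not shown in the original answer): IndexedRowMatrix -> BlockMatrix
val userFeatureBlockMatrix = userFeatureIndexedMatrix.toBlockMatrix()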
Transposed the IndexedRowMatrix via BlockMatrix, as IndexedRowMatrix does not support transpose
val userFeatureBlockMatrixTransposed = userFeatureBlockMatrix.transpose
Created the product with BlockMatrix, as IndexedRowMatrix.multiply requires a local DenseMatrix on the right.
val featuresTogetherIndexedMatrix = userFeatureBlockMatrix
.multiply(userFeatureBlockMatrixTransposed)
.toIndexedRowMatrix
Maybe you could transform each row into a JSON representation, e.g.:
{
  "user": 14,
  "features": [
    { "feature": 0, "value": 1 },
    { "feature": 222, "value": 1 }
  ]
}
But it all depends on how you would use your "distributed matrix" later on.

Improve the efficiency of Spark SQL in repeated calls to groupBy/count. Pivot the outcome

I have a Spark DataFrame consisting of columns of integers. I want to tabulate each column and pivot the outcome by the column names.
In the following toy example, I start with this DataFrame df
+---+---+---+---+---+
| a| b| c| d| e|
+---+---+---+---+---+
| 1| 1| 1| 0| 2|
| 1| 1| 1| 1| 1|
| 2| 2| 2| 3| 3|
| 0| 0| 0| 0| 1|
| 1| 1| 1| 0| 0|
| 3| 3| 3| 2| 2|
| 0| 1| 1| 1| 0|
+---+---+---+---+---+
Each cell can only contain one of {0, 1, 2, 3}. Now I want to tabulate the counts in each column. Ideally, I would have a column for each label (0, 1, 2, 3), and a row for each column. I do:
val output = df.columns.map(cs =>
  df.select(cs).groupBy(cs).count().orderBy(cs)
    .withColumnRenamed(cs, "severity")
    .withColumnRenamed("count", "counts")
    .withColumn("window", lit(cs))
)
I get an Array of DataFrames, one for each column of df. Each of these dataframes has 4 rows (one for each outcome). Then I do:
val longOutput = output.reduce(_ union _) // flatten the array to produce one dataframe
longOutput.show()
to collapse the Array.
+--------+------+------+
|severity|counts|window|
+--------+------+------+
| 0| 2| a|
| 1| 3| a|
| 2| 1| a|
| 3| 1| a|
| 0| 1| b|
| 1| 4| b|
| 2| 1| b|
| 3| 1| b|
...
And finally, I pivot on the original column names
longOutput.cache()
val results = longOutput.groupBy("window").pivot("severity").agg(first("counts"))
results.show()
+------+---+---+---+---+
|window| 0| 1| 2| 3|
+------+---+---+---+---+
| e| 2| 2| 2| 1|
| d| 3| 2| 1| 1|
| c| 1| 4| 1| 1|
| b| 1| 4| 1| 1|
| a| 2| 3| 1| 1|
+------+---+---+---+---+
However, the reduce/union step took a full 8 seconds on the toy example. It ran for over 2 hours on my actual data, which has 1000 columns and 400,000 rows, before I terminated it. I am running locally on a machine with 12 cores and 128G of RAM, but clearly what I'm doing is slow even on a small amount of data, so machine size is not in itself the problem. The per-column groupBy/count took only 7 minutes on the full data set, but then I can't do anything with that Array[DataFrame].
I tried several ways of avoiding the union. I tried writing the array out to disk, but that failed due to a memory problem after several hours of effort. I also tried to adjust the memory allowances in Zeppelin.
So I need a way of doing the tabulation that does not give me an Array of DataFrames, but rather a simple data frame.
The problem with your code is that you trigger one Spark job per column and then a big union. In general, it's much faster to keep everything within a single job.
In your case, instead of dividing the work, you could explode the dataframe to do everything in one pass like this:
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
import spark.implicits._ // for the 'a and $"..." column syntax

df
  .select(array(df.columns.map(c => struct(lit(c) as "name", col(c) as "value")): _*) as "a")
  .select(explode('a))
  .select($"col.name" as "name", $"col.value" as "value")
  .groupBy("name")
  .pivot("value")
  .count()
  .show()
The first line is the only one that's a bit tricky: it creates an array of structs in which each column name is paired with its value. Then we explode it (one row per element of the array) and finally compute a basic pivot.
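An equivalent unpivot (my own variant, not part of the original answer) uses the SQL stack() function instead of array + explode; it interleaves column names and values into (name, value) rows and then pivots the same way:

// Build "stack(5, 'a', `a`, 'b', `b`, ...) as (name, value)" from the column list.
val stackExpr = df.columns
  .map(c => s"'$c', `$c`")
  .mkString(s"stack(${df.columns.length}, ", ", ", ") as (name, value)")

df.selectExpr(stackExpr)
  .groupBy("name")
  .pivot("value")
  .count()
  .show()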

Spark Dataframe maximum on Several Columns of a Group

How can I get the maximum value for different (string and numerical) types of columns in a DataFrame in Scala using Spark?
Let's say this is my data:
+----+-----+-------+------+
|name|value1|value2|string|
+----+-----+-------+------+
| A| 7| 9| "a"|
| A| 1| 10| null|
| B| 4| 4| "b"|
| B| 3| 6| null|
+----+-----+-------+------+
and the desired outcome is:
+----+-----+-------+------+
|name|value1|value2|string|
+----+-----+-------+------+
| A| 7| 10| "a"|
| B| 4| 6| "b"|
+----+-----+-------+------+
Is there a function like in pandas with apply(max,axis=0) or do I have to write a UDF?
What I can do is df.groupBy("name").max("value1"), but I cannot perform two max calls in a row, nor does a Sequence work in the max() function.
Any ideas to solve the problem quickly?
Use this
df.groupBy("name").agg(max("value1"), max("value2"))
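If you need the maximum of many columns (as the apply(max, axis=0) comparison suggests), a sketch of a programmatic generalization (my own, not part of the answer) is to build the aggregation expressions from the column list:

import org.apache.spark.sql.functions.max

// Take the max of every column except the grouping key; max() ignores nulls,
// so the non-null string "a" wins over null, as in the desired output.
val aggExprs = df.columns.filterNot(_ == "name").map(c => max(c).as(c))
df.groupBy("name").agg(aggExprs.head, aggExprs.tail: _*).show()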

Counting both occurrences and cooccurrences in a DF

I would like to compute the mutual information (MI) between two variables x and y that I have in a Spark dataframe which looks like this:
scala> df.show()
+---+---+
| x| y|
+---+---+
| 0| DO|
| 1| FR|
| 0| MK|
| 0| FR|
| 0| RU|
| 0| TN|
| 0| TN|
| 0| KW|
| 1| RU|
| 0| JP|
| 0| US|
| 0| CL|
| 0| ES|
| 0| KR|
| 0| US|
| 0| IT|
| 0| SE|
| 0| MX|
| 0| CN|
| 1| EE|
+---+---+
In my case, x happens to be whether an event is occurring (x = 1) or not (x = 0), and y is a country code, but these variables could represent anything. To compute the MI between x and y I would like to have the above dataframe grouped by x, y pairs with the following three additional columns:
The number of occurrences of x
The number of occurrences of y
The number of occurrences of x, y
In the short example above, it would look like
x, y, count_x, count_y, count_xy
0, FR, 17, 2, 1
1, FR, 3, 2, 1
...
Then I would just have to compute the mutual information term for each x, y pair and sum them.
So far, I have been able to group by x, y pairs and aggregate a count(*) column but I couldn't find an efficient way to add the x and y counts. My current solution is to convert the DF into an array and count the occurrences and cooccurrences manually. It works well when y is a country but it takes forever when the cardinality of y gets big. Any suggestions as to how I could do it in a more Sparkish way?
Thanks in advance!
I would go with RDDs: generate a key for each use case, count by key, and join the results. This way I know exactly what the stages are.
import org.apache.spark.rdd.RDD

rdd.cache() // rdd is your data: an RDD[(Int, String)] of (x, y) pairs

// countByKey/countByValue return local Maps, so use reduceByKey to keep everything as RDDs
val xCnt: RDD[(Int, Long)]    = rdd.map { case (x, _) => (x, 1L) }.reduceByKey(_ + _)
val yCnt: RDD[(String, Long)] = rdd.map { case (_, y) => (y, 1L) }.reduceByKey(_ + _)
val xyCnt: RDD[((Int, String), Long)] = rdd.map { case (x, y) => ((x, y), 1L) }.reduceByKey(_ + _)

// every (x, y) combination with its marginal counts, then join in the co-occurrence counts
val tmp = xCnt.cartesian(yCnt).map { case ((x, nX), (y, nY)) => ((x, y), (nX, nY)) }
val miReady = tmp.join(xyCnt).map { case ((x, y), ((nX, nY), nXY)) => ((x, y), nX, nY, nXY) }
Another option would be to use mapPartitions and simply work on the iterables, then merge the results across partitions.
I am also new to Spark, but I have an idea of what to do. I do not know if this is the perfect solution, but I thought sharing it wouldn't harm.
What I would do is probably filter() for the value 1 to create one DataFrame, and filter() for the value 0 to create a second DataFrame.
You would get something like
1st Dataframe
DO 1
DO 1
FR 1
In the next step I would groupBy(y).
So you would get for the 1st Dataframe
DO 1 1
FR 1
As GroupedData https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/GroupedData.html
This also has a count() function, which should count the rows per group. Unfortunately I do not have the time to try this out myself right now, but I wanted to try and help anyway.
Edit: Please let me know if this helped, otherwise I'll delete the answer so other people still take a look at this!
Recently I had the same task of computing probabilities, and here I would like to share my solution based on Spark's window aggregation functions:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{col, sum}

// data is your DataFrame with two columns [x, y]
val cooccurrDF: DataFrame = data
  .groupBy(col("x"), col("y"))
  .count()
  .toDF("x", "y", "count-x-y")

val windowX: WindowSpec = Window.partitionBy("x")
val windowY: WindowSpec = Window.partitionBy("y")

val countsDF: DataFrame = cooccurrDF
  .withColumn("count-x", sum("count-x-y") over windowX)
  .withColumn("count-y", sum("count-x-y") over windowY)

countsDF.show()
First you group every possible combination of the two columns and use count to get the number of co-occurrences. The window specifications windowX and windowY then allow summing over the aggregated rows, so you get the counts for column x and for column y, respectively.
+---+---+---------+-------+-------+
| x| y|count-x-y|count-x|count-y|
+---+---+---------+-------+-------+
| 0| MK| 1| 17| 1|
| 0| MX| 1| 17| 1|
| 1| EE| 1| 3| 1|
| 0| CN| 1| 17| 1|
| 1| RU| 1| 3| 2|
| 0| RU| 1| 17| 2|
| 0| CL| 1| 17| 1|
| 0| ES| 1| 17| 1|
| 0| KR| 1| 17| 1|
| 0| US| 2| 17| 2|
| 1| FR| 1| 3| 2|
| 0| FR| 1| 17| 2|
| 0| TN| 2| 17| 2|
| 0| IT| 1| 17| 1|
| 0| SE| 1| 17| 1|
| 0| DO| 1| 17| 1|
| 0| JP| 1| 17| 1|
| 0| KW| 1| 17| 1|
+---+---+---------+-------+-------+
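From here, one possible continuation (a sketch of my own, not part of the original answer) is to turn the counts into probabilities and sum the pointwise mutual information terms:

import org.apache.spark.sql.functions.{col, log, sum}

// Total number of observed (x, y) pairs.
val total = countsDF.agg(sum("count-x-y")).first().getLong(0).toDouble

// MI = sum over (x, y) of p(x, y) * log( p(x, y) / (p(x) * p(y)) )
val mi = countsDF
  .withColumn("pxy", col("count-x-y") / total)
  .withColumn("px", col("count-x") / total)
  .withColumn("py", col("count-y") / total)
  .withColumn("term", col("pxy") * log(col("pxy") / (col("px") * col("py"))))
  .agg(sum("term"))
  .first()
  .getDouble(0)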