Spark 2.0 - How to obtain Cluster ID associated with Cluster Center - scala

I want to know what is the ID associated with the Cluster Centers. model.transform(dataset) will assign a predicted cluster ID to my data points, and model.clusterCenters.foreach(println) will print these cluster centers, but I cannot figure out how to associate the cluster centers with their ID.
import org.apache.spark.ml.clustering.KMeans
// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
val prediction = model.transform(dataset)
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
Ideally, I want an output such as:
|I.D |cluster center |
==========================
|0 |[0.0,...,0.3] |
|2 |[1.0,...,1.3] |
|1 |[2.0,...,1.3] |
|3 |[3.0,...,1.3] |
It does not seem to me that the println order is sorted by ID. I tried converting model.clusterCenters into a DF to transform() on it, but I couldn't figure out how to convert Array[org.apache.spark.ml.linalg.Vector] to org.apache.spark.sql.Dataset[_]

Once you save the model, it writes out the cluster ID and the cluster center for each cluster. You can then read the saved file back to see the desired output:
// ml.KMeansModel.save takes only a path; the save(sc, path) form belongs to the older mllib API
model.save("/user/hadoop/kmeanModel")
val parq = spark.read.parquet("/user/hadoop/kmeanModel/data/*")
parq.collect.foreach(println)
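To my knowledge, the predicted cluster ID is simply the position of the center in the clusterCenters array, so in Scala something like model.clusterCenters.zipWithIndex should give the same pairing without saving the model. The plain-Python sketch below only illustrates that idea; the centers, the predict helper and the distance function are made up for illustration and are not Spark APIs.

```python
# Hypothetical cluster centers, standing in for model.clusterCenters.
centers = [[0.1, 0.1, 0.1], [9.1, 9.1, 9.1]]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(point, centers):
    """Return the index of the nearest center; that index is the cluster ID."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


# Pair each center with its position, giving (cluster ID, center) rows.
id_to_center = list(enumerate(centers))
for cluster_id, center in id_to_center:
    print(cluster_id, center)
```

Under that assumption, the order in which the centers are printed is already the order of their IDs.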


Spark UDF not giving rolling counts properly

I have a Spark UDF to calculate rolling counts of a column, precisely wrt time. If I need to calculate a rolling count for 24 hours, for example for entry with time 2020-10-02 09:04:00, I need to look back until 2020-10-01 09:04:00 (very precise).
The rolling count UDF works fine and gives correct counts if I run it locally, but when I run it on a cluster it gives incorrect results. Here is the sample input and output:
Input
+---------+-----------------------+
|OrderName|Time |
+---------+-----------------------+
|a |2020-07-11 23:58:45.538|
|a |2020-07-12 00:00:07.307|
|a |2020-07-12 00:01:08.817|
|a |2020-07-12 00:02:15.675|
|a |2020-07-12 00:05:48.277|
+---------+-----------------------+
Expected Output
+---------+-----------------------+-----+
|OrderName|Time |Count|
+---------+-----------------------+-----+
|a |2020-07-11 23:58:45.538|1 |
|a |2020-07-12 00:00:07.307|2 |
|a |2020-07-12 00:01:08.817|3 |
|a |2020-07-12 00:02:15.675|1 |
|a |2020-07-12 00:05:48.277|1 |
+---------+-----------------------+-----+
The last two values are 4 and 5 when I run locally, but on the cluster they are incorrect. My best guess is that the data is being distributed across executors and the UDF is being called in parallel on each executor. Since one of the parameters to the UDF is a column (the partition key, OrderName in this example), how can I control or correct this behavior on the cluster, if that is the case, so that it calculates the proper counts for each partition? Any suggestions, please.
As per your comment, you want to count, for every record, the total number of records in the preceding 24 hours:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._ // concat, unix_timestamp, split, count, sum
import org.apache.spark.sql.types.LongType
//A sample data (Guessing from your question)
val df = Seq(("a","2020-07-10 23:58:45.438","1"),("a","2020-07-11 23:58:45.538","1"),("a","2020-07-11 23:58:45.638","1")).toDF("OrderName","Time","Count")
// Extract the UNIX TIMESTAMP for your time column
val df2 = df.withColumn("unix_time",concat(unix_timestamp($"Time"),split($"Time","\\.")(1)).cast(LongType))
val noOfMilisecondsDay : Long = 24*60*60*1000
//Create a window per `OrderName` and select rows from `current time - 24 hours` to `current time`
val winSpec = Window.partitionBy("OrderName").orderBy("unix_time").rangeBetween(Window.currentRow - noOfMilisecondsDay, Window.currentRow)
// Finally, perform your COUNT or SUM(COUNT) as per your need
val finalDf = df2.withColumn("tot_count", count("OrderName").over(winSpec))
//or val finalDf = df2.withColumn("tot_count", sum("Count").over(winSpec))
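The window logic above (count the rows whose time falls between current time - 24 hours and the current time, inclusive) can be checked against the question's sample data with a small plain-Python sketch. This is not Spark code; rolling_counts is a hypothetical helper that assumes all rows belong to a single OrderName.

```python
from collections import deque
from datetime import datetime, timedelta


def rolling_counts(times, window=timedelta(hours=24)):
    """For each timestamp, count the rows that fall in [t - window, t]."""
    in_window = deque()
    counts = []
    for t in sorted(times):
        in_window.append(t)
        # Drop timestamps that have fallen out of the look-back window.
        while in_window[0] < t - window:
            in_window.popleft()
        counts.append(len(in_window))
    return counts


fmt = "%Y-%m-%d %H:%M:%S.%f"
times = [datetime.strptime(s, fmt) for s in [
    "2020-07-11 23:58:45.538",
    "2020-07-12 00:00:07.307",
    "2020-07-12 00:01:08.817",
    "2020-07-12 00:02:15.675",
    "2020-07-12 00:05:48.277",
]]
print(rolling_counts(times))  # [1, 2, 3, 4, 5]
```

All five sample rows lie within the same 24 hours, so the counts grow from 1 to 5, matching the values the question reports for the local run.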

How to find membership of vertices using Graphframes or igraph or networkx in pyspark

my input dataframe is df
valx valy
1: 600060 09283744
2: 600131 96733110
3: 600194 01700001
I want to create a graph, treating the two columns above as an edge list, and my output should be a list of all vertices of the graph with their membership.
I have tried Graphframes in pyspark and the networkx library too, but I am not getting the desired results.
My output should look like the one below (it is basically all valx and valy values under V1 (as vertices) and their membership info under V2):
V1 V2
600060 1
96733110 1
01700001 3
I tried below
import networkx as nx
import pandas as pd
filelocation = r'Pathtodataframe df csv'
Panda_edgelist = pd.read_csv(filelocation)
g = nx.from_pandas_edgelist(Panda_edgelist,'valx','valy')
g2 = g.to_undirected(g)
list(g.nodes)
I'm not sure if you are violating any rules here by asking the same question two times.
To detect communities with graphframes, you first have to create a GraphFrame object. Given your example dataframe, the following code snippet shows you the necessary transformations:
from graphframes import *
sc.setCheckpointDir("/tmp/connectedComponents")
l = [
( '600060' , '09283744'),
( '600131' , '96733110'),
( '600194' , '01700001')
]
columns = ['valx', 'valy']
#this is your input dataframe
edges = spark.createDataFrame(l, columns)
#graphframes requires two dataframes: an edge dataframe and a vertex dataframe.
#the edge dataframe has to have at least two columns labeled src and dst.
edges = edges.withColumnRenamed('valx', 'src').withColumnRenamed('valy', 'dst')
edges.show()
#the vertex dataframe requires at least one column labeled id
vertices = edges.select('src').union(edges.select('dst')).withColumnRenamed('src', 'id')
vertices.show()
g = GraphFrame(vertices, edges)
Output:
+------+--------+
| src| dst|
+------+--------+
|600060|09283744|
|600131|96733110|
|600194|01700001|
+------+--------+
+--------+
| id|
+--------+
| 600060|
| 600131|
| 600194|
|09283744|
|96733110|
|01700001|
+--------+
You wrote in the comments of your other question that the choice of community detection algorithm doesn't matter to you at the moment. Therefore I will pick connected components:
result = g.connectedComponents()
result.show()
Output:
+--------+------------+
| id| component|
+--------+------------+
| 600060|163208757248|
| 600131| 34359738368|
| 600194|884763262976|
|09283744|163208757248|
|96733110| 34359738368|
|01700001|884763262976|
+--------+------------+
Other community detection algorithms (like LPA) can be found in the user guide.
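For a small edge list that fits in memory, the membership can also be worked out without Spark. The sketch below is a plain-Python union-find, not the algorithm graphframes uses internally; connected_components is a hypothetical helper, and the component label is simply one representative vertex of each component.

```python
def connected_components(edges):
    """Map every vertex to a representative vertex of its component."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src  # merge the two components

    return {v: find(v) for v in list(parent)}


edges = [
    ("600060", "09283744"),
    ("600131", "96733110"),
    ("600194", "01700001"),
]
membership = connected_components(edges)
for vertex, component in membership.items():
    print(vertex, component)
```

As in the graphframes output, the two endpoints of each edge end up with the same component label, and the three edges form three separate components.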

Filling blank field in a DataFrame with previous field value

I am working with Scala and Spark and I am relatively new to programming in Scala, so maybe my question has a simple solution.
I have a DataFrame that keeps information about the active and deactivated clients in some promotion. The DataFrame shows the client ID, the action the client took (a client can activate or deactivate the promotion at any time) and the date on which the action was taken. Here is an example of that format:
Example of how the DataFrame works
I want a daily monitoring of the clients that are active and wish to see how this number varies through the days, but I am not able to code anything that works like that.
My idea was to make a crossJoin of two DataFrames, one that has only the client IDs and another with only the dates, so that I would have every date paired with every client ID and would only need to look up the client status on each date (whether the client is active or deactivated). After that I made a left join of this new DataFrame with the DataFrame that relates the client ID and the events, but the result has a lot of dates with a "null" status and I don't know how to fill them with the correct status. Here's the example:
Example of the final DataFrame
I have already tried to use the lag function, but it did not solve my problem. Does anyone have any idea that could help me?
Thank You!
This is a slightly expensive operation, because Spark SQL has restrictions on correlated sub-queries with <, <=, >, >=.
Starting from your second dataframe with NULLs, and assuming that the system is large enough and the volume of data is manageable:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// My sample input
val df = Seq(
(1,"2018-03-12", "activate"),
(1,"2018-03-13", null),
(1,"2018-03-14", null),
(1,"2018-03-15", "deactivate"),
(1,"2018-03-16", null),
(1,"2018-03-17", null),
(1,"2018-03-18", "activate"),
(2,"2018-03-13", "activate"),
(2,"2018-03-14", "deactivate"),
(2,"2018-03-15", "activate")
).toDF("ID", "dt", "act")
//df.show(false)
val w = Window.partitionBy("ID").orderBy(col("dt").asc)
val df2 = df.withColumn("rank", dense_rank().over(w)).select("ID", "dt","act", "rank") //.where("rank == 1")
//df2.show(false)
val df3 = df2.filter($"act".isNull)
//df3.show(false)
val df4 = df2.filter(!($"act".isNull)).toDF("ID2", "dt2", "act2", "rank2")
//df4.show(false)
val df5 = df3.join(df4, (df3("ID") === df4("ID2")) && (df4("rank2") < df3("rank")),"inner")
//df5.show(false)
val w2 = Window.partitionBy("ID", "rank").orderBy(col("rank2").desc)
val df6 = df5.withColumn("rank_final", dense_rank().over(w2)).where("rank_final == 1").select("ID", "dt","act2").toDF("ID", "dt", "act")
//df6.show
val df7 = df.filter(!($"act".isNull))
val dfFinal = df6.union(df7)
dfFinal.show(false)
returns:
+---+----------+----------+
|ID |dt |act |
+---+----------+----------+
|1 |2018-03-13|activate |
|1 |2018-03-14|activate |
|1 |2018-03-16|deactivate|
|1 |2018-03-17|deactivate|
|1 |2018-03-12|activate |
|1 |2018-03-15|deactivate|
|1 |2018-03-18|activate |
|2 |2018-03-13|activate |
|2 |2018-03-14|deactivate|
|2 |2018-03-15|activate |
+---+----------+----------+
I solved this step by step and in a rush, so the approach may not be very apparent.
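The intended result (every null takes the most recent non-null action of the same ID) can be written down compactly in plain Python as a check on the logic. This is only a sketch, not Spark code; forward_fill is a hypothetical helper. In Spark, the same idea is often expressed with last(col, ignoreNulls = true) over a window partitioned by ID and ordered by date, which may be worth trying instead of the join.

```python
def forward_fill(rows):
    """rows: (id, date, action-or-None) tuples.

    Returns the rows sorted by (id, date), with each None replaced by the
    most recent non-null action seen for the same id.
    """
    filled = []
    last_seen = {}
    for client_id, dt, act in sorted(rows, key=lambda r: (r[0], r[1])):
        if act is not None:
            last_seen[client_id] = act
        filled.append((client_id, dt, last_seen.get(client_id)))
    return filled


rows = [
    (1, "2018-03-12", "activate"),
    (1, "2018-03-13", None),
    (1, "2018-03-14", None),
    (1, "2018-03-15", "deactivate"),
    (1, "2018-03-16", None),
    (1, "2018-03-17", None),
    (1, "2018-03-18", "activate"),
    (2, "2018-03-13", "activate"),
    (2, "2018-03-14", "deactivate"),
    (2, "2018-03-15", "activate"),
]
for row in forward_fill(rows):
    print(row)
```

The dates are ISO-formatted strings, so sorting them as text gives chronological order.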

Flip each bit in Spark dataframe calling a custom function

I have a spark Dataframe that looks like
ID |col1|col2|col3|col4.....
A |0 |1 |0 |0....
C |1 |0 |0 |0.....
E |1 |0 |1 |1......
ID is a unique key and the other columns have binary values 0/1.
Now I want to iterate over each row and, wherever a column value is 0, apply some function, passing this single row as a DataFrame to that function.
For example, col1 == 0 in the DataFrame above for ID A,
so the DataFrame for that row should look like
newDF.show()
ID |col1|col2|col3|col4.....
A |1 |1 |0 |0....
myfunc(newDF)
The next 0 is encountered at col3 for ID A, so the new DataFrame looks like
newDF.show()
ID |col1|col2|col3|col4.....
A |0 |1 |1 |0....
val max=myfunc(newDF) //function returns a double.
so on...
Note: each 0 bit is flipped once at row level for the function call, resetting the effect of the previously flipped bit.
P.S.: I tried using withColumn with a UDF, but ran into serialization issues from using a DataFrame inside a DataFrame.
The myfunc I'm calling actually sends the row for scoring to an ML model I have, which returns the probability for that user if a particular bit is flipped. So I have to iterate through each column that is set to 0 and set it to 1 for that particular instance.
I'm not sure you need anything particularly complex for this. Given that you have imported the SQL functions and the session implicits
import org.apache.spark.sql.{Column, SparkSession}
val spark: SparkSession = ??? // your session
import spark.implicits._
import org.apache.spark.sql.functions._
you should be able to "flip the bits" (although I'm assuming those are actually encoded as numbers) by applying the following function
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
as in this example
df.select($"ID", flip($"col1") as "col1", flip($"col2") as "col2")
You can easily rewrite the flip function to deal with edge cases or use different type (if, for example, the "bits" are encoded with booleans or strings).
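The question also asks for each 0 bit to be flipped one at a time, with the previous flip undone before the next one. A plain-Python sketch of that enumeration for a single row is shown below; single_flip_variants is a hypothetical helper, and the row is a plain dict rather than a Spark Row.

```python
def single_flip_variants(row, bit_columns):
    """Yield (column, new_row) for every column whose value is 0.

    Each new_row is a copy of the original row with only that one bit
    set to 1, so earlier flips never carry over to later variants.
    """
    for col in bit_columns:
        if row[col] == 0:
            variant = dict(row)  # copy, leaving the original row untouched
            variant[col] = 1
            yield col, variant


row = {"ID": "A", "col1": 0, "col2": 1, "col3": 0, "col4": 0}
variants = list(single_flip_variants(row, ["col1", "col2", "col3", "col4"]))
for col, variant in variants:
    print(col, variant)
```

Each variant could then be passed to the scoring function (myfunc in the question) to get one score per flipped column.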

Fitting pipeline and processing the data

I've got a file that contains text. What I want to do is to use a pipeline for tokenising the text, removing the stop-words and producing 2-grams.
What I've done so far:
Step 1: Read the file
val data = sparkSession.read.text("data.txt").toDF("text")
Step 2: Build the pipeline
val pipe1 = new Tokenizer().setInputCol("text").setOutputCol("words")
val pipe2 = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val pipe3 = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")
val pipeline = new Pipeline().setStages(Array(pipe1, pipe2, pipe3))
val model = pipeline.fit(data)
I know that pipeline.fit(data) produces a PipelineModel; however, I don't know how to use a PipelineModel.
Any help would be much appreciated.
When you run val model = pipeline.fit(data), all Estimator stages (i.e., machine learning tasks such as classification, regression, clustering, etc.) are fit to the data and a Transformer stage is created for each. Your pipeline contains only Transformer stages, since you're only doing feature creation.
In order to execute your model, now consisting of just Transformer stages, you need to run val results = model.transform(data). This will execute each Transformer stage against your dataframe. Thus at the end of the model.transform(data) process, you will have a dataframe consisting of the original lines, the Tokenizer output, the StopWordsRemover output, and finally the NGram results.
Discovering the top 5 n-grams after the feature creation is completed can be done with a Spark SQL query. First explode the ngrams column, then group by n-gram and count, order by the count column in descending order, and finally call show(5). Alternatively, you could use LIMIT 5 instead of show(5).
As an aside, you should probably change your object name to something that isn't a standard class name. Otherwise you're going to get an ambiguous scope error.
CODE:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession._
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.NGram
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.ml.{Pipeline, PipelineModel}
object NGramPipeline {
def main(): Unit = {
val sparkSession = SparkSession.builder.appName("NGram Pipeline").getOrCreate()
// Needed for the $"..." column syntax used below
import sparkSession.implicits._
val sc = sparkSession.sparkContext
val data = sparkSession.read.text("quangle.txt").toDF("text")
val pipe1 = new Tokenizer().setInputCol("text").setOutputCol("words")
val pipe2 = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val pipe3 = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")
val pipeline = new Pipeline().setStages(Array(pipe1, pipe2, pipe3))
val model = pipeline.fit(data)
val results = model.transform(data)
val explodedNGrams = results.withColumn("explNGrams", explode($"ngrams"))
explodedNGrams.groupBy("explNGrams").agg(count("*") as "ngramCount").orderBy(desc("ngramCount")).show(10,false)
}
}
NGramPipeline.main()
OUTPUT:
+-----------------+----------+
|explNGrams |ngramCount|
+-----------------+----------+
|quangle wangle |9 |
|wangle quee. |4 |
|'mr. quangle |3 |
|said, -- |2 |
|wangle said |2 |
|crumpetty tree |2 |
|crumpetty tree, |2 |
|quangle wangle, |2 |
|crumpetty tree,--|2 |
|blue babboon, |2 |
+-----------------+----------+
only showing top 10 rows
Notice that there is punctuation (commas, dashes, etc.) which causes the same n-gram to show up as several separate entries. When building n-grams, it's often a good idea to filter out the punctuation first. You can typically do this with a regex.
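The effect of stripping punctuation before building the 2-grams can be seen in a small plain-Python sketch. It is not Spark code: the stop-word list and the sample lines are made up for illustration, and bigram_counts is a hypothetical helper. In a Spark pipeline, swapping Tokenizer for RegexTokenizer from org.apache.spark.ml.feature would be one way to try the same idea.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; Spark's StopWordsRemover uses a longer one.
STOP_WORDS = {"the", "a", "on", "of", "and"}


def bigram_counts(lines):
    """Lowercase, strip punctuation, drop stop words, then count 2-grams."""
    counts = Counter()
    for line in lines:
        # Replace anything that is not a letter, digit or whitespace.
        cleaned = re.sub(r"[^a-z0-9\s]", " ", line.lower())
        words = [w for w in cleaned.split() if w not in STOP_WORDS]
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts


# Made-up sample lines in the style of the input text.
lines = [
    "On the top of the Crumpetty Tree,",
    "The Quangle Wangle sat,",
    "said the Quangle Wangle Quee.",
]
counts = bigram_counts(lines)
for ngram, n in counts.most_common(3):
    print(ngram, n)
```

With the punctuation removed, variants such as "crumpetty tree" and "crumpetty tree," collapse into a single n-gram.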