Fitting a pipeline and processing the data - Scala

I've got a file that contains text. What I want to do is to use a pipeline for tokenising the text, removing the stop-words and producing 2-grams.
What I've done so far:
Step 1: Read the file
val data = sparkSession.read.text("data.txt").toDF("text")
Step 2: Build the pipeline
val pipe1 = new Tokenizer().setInputCol("text").setOutputCol("words")
val pipe2 = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val pipe3 = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")
val pipeline = new Pipeline().setStages(Array(pipe1, pipe2, pipe3))
val model = pipeline.fit(data)
I know that pipeline.fit(data) produces a PipelineModel; however, I don't know how to use a PipelineModel.
Any help would be much appreciated.

When you run val model = pipeline.fit(data), every Estimator stage (i.e. machine-learning tasks such as classification, regression, clustering, etc.) is fit to the data and a corresponding Transformer stage is created. In this pipeline you only have Transformer stages, since you're just doing feature creation.
In order to execute your model, now consisting of just Transformer stages, you need to run val results = model.transform(data). This will execute each Transformer stage against your dataframe. Thus at the end of the model.transform(data) process, you will have a dataframe consisting of the original lines, the Tokenizer output, the StopWordsRemover output, and finally the NGram results.
Discovering the top 5 ngrams once feature creation is complete can be done with a SparkSQL query: first explode the ngrams column, then group by ngram and count, order by the count column in descending order, and finally call show(5). Alternatively, you could use a LIMIT 5 clause instead of show(5).
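For example, a sketch of that SparkSQL variant, assuming the transformed dataframe results from model.transform(data) is registered as a temp view:
results.createOrReplaceTempView("ngram_results")
sparkSession.sql("""
  SELECT ngram, COUNT(*) AS ngramCount
  FROM (SELECT explode(ngrams) AS ngram FROM ngram_results) t
  GROUP BY ngram
  ORDER BY ngramCount DESC
  LIMIT 5
""").show(false)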
As an aside, you should probably change your object name to something that isn't a standard class name, otherwise you're going to get an ambiguous-scope error.
CODE:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.{NGram, StopWordsRemover, Tokenizer}
import org.apache.spark.ml.{Pipeline, PipelineModel}

object NGramPipeline {
  def main(): Unit = {
    val sparkSession = SparkSession.builder.appName("NGram Pipeline").getOrCreate()
    import sparkSession.implicits._ // needed for the $"column" syntax below

    val data = sparkSession.read.text("quangle.txt").toDF("text")

    val pipe1 = new Tokenizer().setInputCol("text").setOutputCol("words")
    val pipe2 = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    val pipe3 = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")
    val pipeline = new Pipeline().setStages(Array(pipe1, pipe2, pipe3))

    val model = pipeline.fit(data)
    val results = model.transform(data)

    val explodedNGrams = results.withColumn("explNGrams", explode($"ngrams"))
    explodedNGrams
      .groupBy("explNGrams")
      .agg(count("*") as "ngramCount")
      .orderBy(desc("ngramCount"))
      .show(10, false)
  }
}
NGramPipeline.main()
OUTPUT:
+-----------------+----------+
|explNGrams |ngramCount|
+-----------------+----------+
|quangle wangle |9 |
|wangle quee. |4 |
|'mr. quangle |3 |
|said, -- |2 |
|wangle said |2 |
|crumpetty tree |2 |
|crumpetty tree, |2 |
|quangle wangle, |2 |
|crumpetty tree,--|2 |
|blue babboon, |2 |
+-----------------+----------+
only showing top 10 rows
Notice that there is punctuation (commas, dashes, etc.) causing near-duplicate lines in the counts. When building ngrams, it's often a good idea to filter out that punctuation first; you can typically do this with a regex.
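For instance, a hedged sketch of that cleanup, assuming data and pipeline are the values defined above:
import org.apache.spark.sql.functions.{col, lower, regexp_replace}
// Lower-case the text and drop anything that is not a letter, digit or whitespace,
// so punctuation no longer splits the ngram counts.
val cleaned = data.withColumn("text", regexp_replace(lower(col("text")), "[^a-z0-9\\s]", ""))
val cleanedResults = pipeline.fit(cleaned).transform(cleaned)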

Related

Operating in parallel on Spark DataFrame rows

Environment: Scala, Spark, Structured Streaming, Kafka
I have a DF coming from a Kafka stream with the following schema:
DF:
BATCH ID: 0
+-----------------------+-----+---------+------+
| value|topic|partition|offset|
+-----------------------+-----+---------+------+
|{"big and nested json"}| A | 0| 0|
|{"big and nested json"}| B | 0| 0|
+-----------------------+-----+---------+------+
I want to process each row in parallel using Spark, and I managed to split them across my executors using
DF.repartition(Number).foreach(row=> processRow(row))
I need to extract the value from the value column into its own dataframe to process it.
I'm having difficulty working with the DataFrame's generic Row object.
Is there a way to turn the single row in each executor into its very own DataFrame (using a fixed schema?) and write it to a fixed location?
Is there a better approach to solve my problem?
EDIT + Clarification:
The DF I'm receiving comes as a batch via the foreachBatch function of the writeStream functionality that has existed since Spark 2.4.
Currently, splitting the DF into rows means the rows are distributed evenly across all my executors; I would like to turn a single GenericRow object into a DataFrame so I can process it with a function I wrote.
For example, I would send the row to the function
processRow(row: Row)
which takes the value and the topic and turns them back into a single-row DF
+-----------------------+-----+
| value|topic|
+-----------------------+-----+
|{"big and nested json"}| A |
+-----------------------+-----+
for further processing
I guess you are consuming data from multiple Kafka topics at a time.
First you need to prepare schemas for all the Kafka topics; here, for example, I have used two different JSON payloads in the value column.
scala> val df = Seq(("""{"name":"Srinivas"}""","A"),("""{"age":20}""","B")).toDF("value","topic")
scala> df.show(false)
+-------------------+-----+
|value |topic|
+-------------------+-----+
|{"name":"Srinivas"}|A |
|{"age":20} |B |
+-------------------+-----+
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.functions._ // for from_json used below
Schema for topic A
scala> val topicASchema = DataType.fromJson("""{"type":"struct","fields":[{"name":"name","type":"string","nullable":true,"metadata":{}}]}""").asInstanceOf[StructType]
Schema for topic B
scala> val topicBSchema = DataType.fromJson("""{"type":"struct","fields":[{"name":"age","type":"long","nullable":true,"metadata":{}}]}""").asInstanceOf[StructType]
Combining topic & its schema.
scala> val topicSchema = Seq(("A",topicASchema),("B",topicBSchema)) // Adding Topic & Its Schema.
Processing DataFrame
scala> topicSchema
.par
.map(d => df.filter($"topic" === d._1).withColumn("value",from_json($"value",d._2)))
.foreach(_.show(false)) // Using .par & filtering dataframe based on topic & then applying schema to value column.
+----------+-----+
|value |topic|
+----------+-----+
|[Srinivas]|A |
+----------+-----+
+-----+-----+
|value|topic|
+-----+-----+
|[20] |B |
+-----+-----+
Writing to hdfs
scala> topicSchema
.par
.map(d => df.filter($"topic" === d._1).withColumn("value",from_json($"value",d._2)).write.format("json").save(s"/tmp/kafka_data/${d._1}"))
Final Data stored in hdfs
scala> import sys.process._
import sys.process._
scala> "tree /tmp/kafka_data".!
/tmp/kafka_data
├── A
│   ├── part-00000-1e854106-49de-44b3-ab18-6c98a126c8ca-c000.json
│   └── _SUCCESS
└── B
├── part-00000-1bd51ad7-cfb6-4187-a374-4e2d4ce9cc50-c000.json
└── _SUCCESS
2 directories, 4 files
In this case it's better to use .map instead of .foreach. The reason is that map returns a new dataset, while foreach is just an action and doesn't return anything.
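A sketch of that difference, reusing the df and topicSchema values from the answer above (and assuming spark.implicits._ and org.apache.spark.sql.functions._ are in scope):
import org.apache.spark.sql.DataFrame
// .map keeps the per-topic DataFrames so they can be reused later,
// whereas .foreach only runs a side effect and returns Unit.
val perTopicDfs: Seq[DataFrame] = topicSchema.map { case (topic, schema) =>
  df.filter($"topic" === topic).withColumn("value", from_json($"value", schema))
}
perTopicDfs.foreach(_.show(false))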
One other thing that can help you is to parse the schema located in JSON.
I had a similar requirement recently.
My JSON object has a "similar" schema for both topic A and B. If that is not the case for you, you might need to create multiple dataframes in the solution below by grouping them by topic.
val sanitiseJson: String => String = value => value
.replace("\\\"", "\"")
.replace("\\\\", "\\")
.replace("\n", "")
.replace("\"{", "{")
.replace("}\"", "}")
val parsed = df.toJSON
.map(sanitiseJson)
This will give you something like:
{
"value": { ... },
"topic": "A"
}
Then you can pass that into a new read function:
val dfWithSchema = spark.read.json(parsed)
At this point you would access the value in the nested JSON:
dfWithSchema.select($"value.propertyInJson")
There are some optimizations you can do when it comes to sanitiseJson if needed.

Filling blank field in a DataFrame with previous field value

I am working with Scala and Spark and I am relatively new to programming in Scala, so maybe my question has a simple solution.
I have one DataFrame that keeps information about the active and deactivated clients in some promotion. That DataFrame shows the Client ID, the action the client took (they can activate or deactivate from the promotion at any time) and the Date they took this action. Here is an example of that format:
Example of how the DataFrame works
I want to monitor daily which clients are active and see how this number varies over the days, but I have not been able to code anything that works like that.
My idea was to do a crossJoin of two DataFrames: one with only the Client IDs and another with only the dates, so I would have every Date paired with every Client ID and would only need to determine the Client's status on each Date (active or deactivated). After that I did a left join of this new DataFrame with the DataFrame relating Client IDs to events, but the result is a lot of dates with a "null" status and I don't know how to fill them with the correct status. Here's the example:
Example of the final DataFrame
I have already tried to use the lag function, but it did not solve my problem. Does anyone have any idea that could help me?
Thank You!
This is a slightly expensive operation, because Spark SQL has restrictions on correlated sub-queries with <, <=, >, >=.
Starting from your second dataframe with NULLs, and assuming the system is large enough and the volume of data manageable:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // for $ and toDF; already in scope in spark-shell

// My sample input
val df = Seq(
  (1, "2018-03-12", "activate"),
  (1, "2018-03-13", null),
  (1, "2018-03-14", null),
  (1, "2018-03-15", "deactivate"),
  (1, "2018-03-16", null),
  (1, "2018-03-17", null),
  (1, "2018-03-18", "activate"),
  (2, "2018-03-13", "activate"),
  (2, "2018-03-14", "deactivate"),
  (2, "2018-03-15", "activate")
).toDF("ID", "dt", "act")
//df.show(false)
val w = Window.partitionBy("ID").orderBy(col("dt").asc)
val df2 = df.withColumn("rank", dense_rank().over(w)).select("ID", "dt","act", "rank") //.where("rank == 1")
//df2.show(false)
val df3 = df2.filter($"act".isNull)
//df3.show(false)
val df4 = df2.filter(!($"act".isNull)).toDF("ID2", "dt2", "act2", "rank2")
//df4.show(false)
val df5 = df3.join(df4, (df3("ID") === df4("ID2")) && (df4("rank2") < df3("rank")),"inner")
//df5.show(false)
val w2 = Window.partitionBy("ID", "rank").orderBy(col("rank2").desc)
val df6 = df5.withColumn("rank_final", dense_rank().over(w2)).where("rank_final == 1").select("ID", "dt","act2").toDF("ID", "dt", "act")
//df6.show
val df7 = df.filter(!($"act".isNull))
val dfFinal = df6.union(df7)
dfFinal.show(false)
returns:
+---+----------+----------+
|ID |dt |act |
+---+----------+----------+
|1 |2018-03-13|activate |
|1 |2018-03-14|activate |
|1 |2018-03-16|deactivate|
|1 |2018-03-17|deactivate|
|1 |2018-03-12|activate |
|1 |2018-03-15|deactivate|
|1 |2018-03-18|activate |
|2 |2018-03-13|activate |
|2 |2018-03-14|deactivate|
|2 |2018-03-15|activate |
+---+----------+----------+
I solved this step-wise and in a rush, so it is not the most concise approach.
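A more compact alternative, sketched here under the assumption that df is the NULL-status sample dataframe built above, is a window forward-fill with last and ignoreNulls:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}
// Carry the most recent non-null action forward within each ID, ordered by date.
val fillWindow = Window
  .partitionBy("ID")
  .orderBy(col("dt").asc)
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val filled = df.withColumn("act", last(col("act"), true).over(fillWindow)) // true = ignoreNulls
filled.show(false)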

StringBuilder to RDD

I have a StringBuilder (sb) with data as below in the Scala IDE:
CellId,Date,Time,MeasType,MeasResult
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.emergency,0
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.highPriorityAccess,0
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.mt-Access,4
Now I want to convert this string into an RDD using Scala. Please help me.
I am using this code, but with no luck. Thanks in advance.
val headerFile = sc.parallelize(sb)
headerFile.collect()
StringBuilder is used to build strings from a mutable sequence of characters, so whatever you add to the builder is appended into one single string.
You need to separate the strings you added so they can be used as a list of strings with the SparkContext.
Assuming the strings were added with a trailing line feed, you can split the StringBuilder contents on line feeds and turn the result into an RDD:
val headerFile = sc.parallelize(sb.toString.split("\n"))
headerFile.collect()
To visualize the data, you would have to print it or save it to a file.
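For instance, a minimal sketch (the output path is just a placeholder):
// Print the lines on the driver
headerFile.collect().foreach(println)
// Or persist the RDD as plain text files
headerFile.saveAsTextFile("/tmp/header_rdd")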
Now if you want to convert it to a DataFrame before saving, you can do the following:
val data = sb.toString.split("\n")
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = StructType(data.head.split(",").map(StructField(_, StringType, true)))
val rdd = sc.parallelize(data.tail.map(line => Row.fromSeq(line.split(","))))
spark.createDataFrame(rdd, schema).show(false)
which should give you
+---------+----------+--------+-----------------------------------+----------+
|CellId |Date |Time |MeasType |MeasResult|
+---------+----------+--------+-----------------------------------+----------+
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.emergency |0 |
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.highPriorityAccess|0 |
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.mt-Access |4 |
+---------+----------+--------+-----------------------------------+----------+
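If a DataFrame is the end goal and you are on Spark 2.2+, a hedged alternative is to hand the lines straight to the CSV reader, which accepts a Dataset[String]:
import spark.implicits._
// Let Spark parse the header and infer the column types instead of building the schema by hand.
val lines = sb.toString.split("\n").toSeq.toDS()
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(lines)
csvDf.show(false)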

Spark 2.0 - How to obtain Cluster ID associated with Cluster Center

I want to know what is the ID associated with the Cluster Centers. model.transform(dataset) will assign a predicted cluster ID to my data points, and model.clusterCenters.foreach(println) will print these cluster centers, but I cannot figure out how to associate the cluster centers with their ID.
import org.apache.spark.ml.clustering.KMeans
// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
val prediction = model.transform(dataset)
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
Ideally, I want an output such as:
|I.D |cluster center |
==========================
|0 |[0.0,...,0.3] |
|2 |[1.0,...,1.3] |
|1 |[2.0,...,1.3] |
|3 |[3.0,...,1.3] |
It does not seem to me that the println order is sorted by ID. I tried converting model.clusterCenters into a DF to transform() on it, but I couldn't figure out how to convert Array[org.apache.spark.ml.linalg.Vector] to org.apache.spark.sql.Dataset[_]
Once you save the model, the cluster IDs and cluster centers are written out. You can read the saved data back to see the desired output:
model.save("/user/hadoop/kmeanModel")
val parq = spark.read.parquet("/user/hadoop/kmeanModel/data/*")
parq.collect.foreach(println)
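Alternatively, since the ID that transform assigns is simply the index of the corresponding center in model.clusterCenters, a minimal sketch (assuming spark.implicits._ is in scope) is:
import spark.implicits._
// Pair each cluster center with its position in the array, which is its cluster ID.
val centersDF = model.clusterCenters
  .zipWithIndex
  .map { case (center, id) => (id, center.toString) }
  .toSeq
  .toDF("id", "clusterCenter")
centersDF.show(false)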

How to normalize or standardize the data having multiple columns/variables in spark using scala?

I am new to Apache Spark and Scala. I have a data set like this, which I am reading from a CSV file and converting into an RDD using Scala:
+-----------+-----------+----------+
| recent | Freq | Monitor |
+-----------+-----------+----------+
| 1 | 1234 | 199090|
| 4 | 2553| 198613|
| 6 | 3232 | 199090|
| 1 | 8823 | 498831|
| 7 | 2902 | 890000|
| 8 | 7991 | 081097|
| 9 | 7391 | 432370|
| 12 | 6138 | 864981|
| 7 | 6812 | 749821|
+-----------+-----------+----------+
I want to calculate the z-score values, i.e. standardize the data. So I am calculating the z-score for each column and then trying to combine them so I get a standard scale.
Here is my code for calculating the z-score for the first column:
val scores = sorted.map(_.split(",")(0).toDouble).cache
val count = scores.count
val mean = scores.sum / count
val devs = scores.map(score => (score - mean) * (score - mean))
val stddev = Math.sqrt(devs.sum / count)
val zscore = scores.map(x => math.round((x - mean) / stddev))
How do I calculate this for each column? Or is there another way to normalize or standardize the data?
My requirement is to assign a rank (or scale).
Thanks
If you want to standardize the columns, you can use the StandardScaler class from Spark MLlib. Data should be in the form of RDD[Vector], where Vector is part of the MLlib linalg package. You can choose to use the mean, the standard deviation, or both to standardize your data.
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
val data = sc.parallelize(Array(
Array(1.0,2.0,3.0),
Array(4.0,5.0,6.0),
Array(7.0,8.0,9.0),
Array(10.0,11.0,12.0)))
// Converting RDD[Array] to RDD[Vectors]
val features = data.map(a => Vectors.dense(a))
// Creating a Scaler model that standardizes with both mean and SD
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
// Scale features using the scaler model
val scaledFeatures = scaler.transform(features)
This scaledFeatures RDD contains the Z-score of all columns.
Hope this answer helps. Check the Documentation for more info.
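If you prefer to compute the z-scores by hand for every column, a sketch using MLlib's column statistics (reusing the features RDD[Vector] from above) is:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
// colStats returns per-column mean and variance in a single pass over the data.
val summary = Statistics.colStats(features)
val means = summary.mean.toArray
val stddevs = summary.variance.toArray.map(math.sqrt)
// Standardize every column of every vector: (x - mean) / stddev.
val zscores = features.map { v =>
  Vectors.dense(v.toArray.zipWithIndex.map { case (x, i) => (x - means(i)) / stddevs(i) })
}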
You may want to use the code below to perform standard scaling on the required columns. VectorAssembler is used to select the columns that need to be transformed, and the StandardScaler also gives you the option of scaling with the mean and/or the standard deviation.
Code:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature
import org.apache.spark.ml.feature.StandardScaler
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/user/hadoop/data/your_dataset.csv")
df.show(Int.MaxValue)
val assembler = new VectorAssembler().setInputCols(Array("recent","Freq","Monitor")).setOutputCol("features")
val transformVector = assembler.transform(df)
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").setWithStd(true).setWithMean(false)
val scalerModel = scaler.fit(transformVector)
val scaledData = scalerModel.transform(transformVector)
scaledData.show(20, false)