Overlap in the data - Spark Scala

I want to find the overlap between each pair of brands: when I compare HORLICKS vs BOOST, I have to find the HORLICKS-only percentage, the BOOST-only percentage, and their intersection. It is basically a Venn diagram problem. I have computed this for one combination, but I want to compute it for all combinations, e.g. HORLICKS vs BOOST, HORLICKS vs NESTLE, HORLICKS vs BOURNVITA, and so on.
Can somebody help me? I am new to Spark.
Below is my code:
import org.apache.spark.sql.functions._
// Distinct store ids that sell brand A (item1) and brand B (item2)
val A_storeList = sourceDf.where(col("CATEGORY").equalTo(item1(0)) and col("SUBCATEGORY").equalTo(item1(1))
  and col("product_char_name").equalTo(item1(2)) and col("product_charval_dsc").equalTo(item1(3)))
  .select("store_id").collect().map(_(0)).distinct
val B_storeList = sourceDf.where(col("CATEGORY").equalTo(item2(0)) and col("SUBCATEGORY").equalTo(item2(1))
  and col("product_char_name").equalTo(item2(2)) and col("product_charval_dsc").equalTo(item2(3)))
  .select("store_id").collect().map(_(0)).distinct
// The Venn regions of store ids, computed on the driver
val aAndBstoreList = A_storeList.intersect(B_storeList)
val AunionB_storeList = A_storeList.union(B_storeList).distinct
val AOnly_storeList = A_storeList.diff(B_storeList)
val Bonly_storeList = B_storeList.diff(A_storeList)
// Tag the rows belonging to each region
val subSetOfSourceDf = sourceDf.withColumn("Versus", lit(item1(3) + "Vs" + item2(3)))
val A = subSetOfSourceDf.where(col("store_id").isin(A_storeList: _*)).withColumn("Venn", lit("A"))
val B = subSetOfSourceDf.where(col("store_id").isin(B_storeList: _*)).withColumn("Venn", lit("B"))
val AandB = subSetOfSourceDf.where(col("store_id").isin(aAndBstoreList: _*))
  .withColumn("product_charval_dsc", when(col("product_charval_dsc").equalTo(item1(3)), item1(3) + " and " + item2(3))
    .when(col("product_charval_dsc").equalTo(item2(3)), item1(3) + " and " + item2(3))
    .otherwise(col("product_charval_dsc")))
  .withColumn("Venn", lit("AintersectB"))
val AunionB = subSetOfSourceDf.where(col("store_id").isin(AunionB_storeList: _*))
  .withColumn("product_charval_dsc", when(col("product_charval_dsc").equalTo(item2(3)), item1(3) + " and " + item2(3))
    .when(col("product_charval_dsc").equalTo(item1(3)), item1(3) + " and " + item2(3))
    .otherwise(col("product_charval_dsc")))
  .withColumn("Venn", lit("AunionB"))
val AOnly = subSetOfSourceDf.where(col("store_id").isin(AOnly_storeList: _*)).withColumn("Venn", lit("AOnly"))
val BOnly = subSetOfSourceDf.where(col("store_id").isin(Bonly_storeList: _*)).withColumn("Venn", lit("BOnly"))
val allInOne = A.union(B).union(AandB).union(AunionB).union(AOnly).union(BOnly)
// Denominator: total sales of A in region A plus total sales of B in region B
val divisor = allInOne.where((col("Venn").equalTo("A").and(col("product_charval_dsc").equalTo(item1(3))))
    or (col("Venn").equalTo("B").and(col("product_charval_dsc").equalTo(item2(3)))))
  .groupBy("CATEGORY", "SUBCATEGORY", "product_char_name").agg(sum("SALVAL") as "TOTAL")
val finalDf1 = allInOne.groupBy("CATEGORY", "SUBCATEGORY", "product_char_name", "product_charval_dsc", "Venn")
  .agg(sum("SALVAL") as "SALVAL")
  .where(col("product_charval_dsc").equalTo(item1(3)) or col("product_charval_dsc").equalTo(item2(3))
    or col("product_charval_dsc").equalTo(item1(3) + " and " + item2(3)))
val outputDf = finalDf1.join(divisor, Seq("CATEGORY", "SUBCATEGORY", "product_char_name"))
  .withColumn("SALE_PERCENT", col("SALVAL").divide(col("TOTAL")).multiply(100))
  .withColumn("Versus", lit(item1(3) + " Vs " + item2(3)))
With this code I have generated the output for one combination (the generated result is omitted here), but I want to know how to do this for every combination of brands.
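One way to extend this to every combination (a sketch, not from the original post) is to wrap the code above in a function, collect the distinct brand tuples, and loop over every two-element combination. compareBrands below is a hypothetical helper containing the pairwise logic above, and the getString calls assume all four columns are strings:
import org.apache.spark.sql.DataFrame

// All distinct (CATEGORY, SUBCATEGORY, product_char_name, product_charval_dsc) tuples
val allItems: Seq[Seq[String]] = sourceDf
  .select("CATEGORY", "SUBCATEGORY", "product_char_name", "product_charval_dsc")
  .distinct()
  .collect()
  .toSeq
  .map(r => Seq(r.getString(0), r.getString(1), r.getString(2), r.getString(3)))

// Every unordered pair of brands: HORLICKS vs BOOST, HORLICKS vs BOURNVITA, ...
val allPairs = allItems.combinations(2).toSeq

// Run the pairwise comparison for each pair and stack the results
val allResults: DataFrame = allPairs
  .map { case Seq(item1, item2) => compareBrands(item1, item2) }
  .reduce(_ union _)
Because each pair triggers its own collect and set of Spark jobs, this can get slow for many brands; computing the Venn regions with joins instead of driver-side lists would scale better.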

Related

Load a csv file into a Breeze DenseMatrix[Double]

I have a CSV file and I want to load it into a Breeze DenseMatrix[Double].
This code eventually works, but I don't think it's the Scala way of doing things:
import scala.io.Source
import breeze.linalg.DenseMatrix

val resource = Source.fromResource("data/houses.txt")
val lines: Iterator[String] = resource.getLines
val tmp = lines.toArray
val numRows: Int = tmp.size
val numCols: Int = tmp(0).split(",").size
val m = DenseMatrix.zeros[Double](numRows, numCols)
// Now do some for loops and fill the matrix
Is there a more elegant and functional way of doing this?
val resource = Source.fromResource("data/houses.txt")
val lines: Iterator[String] = resource.getLines
val tmp = lines.map(l => l.split(",").map(str => str.toDouble)).toList
val m = DenseMatrix(tmp:_*)
much better
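As an aside (not part of the original exchange): when the file is on the local filesystem rather than on the classpath, Breeze's csvread helper does the whole job in one call; the path below is only a placeholder.
import java.io.File
import breeze.linalg.{DenseMatrix, csvread}

// Parses a comma-separated file of numbers directly into a DenseMatrix[Double]
val m: DenseMatrix[Double] = csvread(new File("data/houses.txt"))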

How to iterate over files and perform action on them - Scala Spark

I am reading about 1000 .eml files (email message files) one by one from a directory, parsing them and extracting values using the javax.mail APIs, and finally storing them in a DataFrame. Sample code below:
import java.io.{File, FileInputStream}
import java.nio.file.Paths
import java.util.Properties
import javax.mail.Session
import javax.mail.internet.MimeMessage
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.sql.DataFrame
import spark.implicits._

var x = Seq[DataFrame]()
val emlFiles = getListOfFiles("tmp/sample")
val fileCount = emlFiles.length
val fs = FileSystem.get(sc.hadoopConfiguration)
for (i <- 0 until fileCount) {
  var emlData = spark.emptyDataFrame
  val f = new File(emlFiles(i))
  val fileName = f.getName()
  val path = Paths.get(emlFiles(i))
  val session = Session.getInstance(new Properties())
  val messageIn = new FileInputStream(path.toFile())
  val mimeJournal = new MimeMessage(session, messageIn)
  // Extracting metadata from the message headers
  val Receivers = mimeJournal.getHeader("From")(0)
  val Senders = mimeJournal.getHeader("To")(0)
  val Date = mimeJournal.getHeader("Date")(0)
  val Subject = mimeJournal.getHeader("Subject")(0)
  val Size = mimeJournal.getSize
  // One single-row DataFrame per file, prepended to the accumulator
  emlData = Seq((fileName, Receivers, Senders, Date, Subject, Size)).toDF("fileName", "Receivers", "Senders", "Date", "Subject", "Size")
  x = emlData +: x
}
The problem is that I am using a for loop to do this and it's taking a lot of time. Is there a way to get rid of the loop and read the files more efficiently?
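One common pattern (a sketch, not from the original thread) is to stop building a single-row DataFrame per file: parse each file into a plain tuple and create one DataFrame at the end. The parseEml helper below is hypothetical, and getListOfFiles is assumed to return the file paths as strings, as in the snippet above.
import java.io.{File, FileInputStream}
import java.util.Properties
import javax.mail.Session
import javax.mail.internet.MimeMessage
import spark.implicits._

// Parse one .eml file into a plain tuple of its metadata
def parseEml(pathStr: String): (String, String, String, String, String, Int) = {
  val file = new File(pathStr)
  val session = Session.getInstance(new Properties())
  val in = new FileInputStream(file)
  try {
    val msg = new MimeMessage(session, in)
    (file.getName, msg.getHeader("From")(0), msg.getHeader("To")(0),
      msg.getHeader("Date")(0), msg.getHeader("Subject")(0), msg.getSize)
  } finally in.close()
}

val emlFiles = getListOfFiles("tmp/sample")

// Build ONE DataFrame from all the parsed tuples instead of a thousand single-row ones
val emlDf = emlFiles.map(parseEml)
  .toDF("fileName", "Receivers", "Senders", "Date", "Subject", "Size")
If the files sit on a filesystem the executors can also read (e.g. HDFS), the same parseEml function can be mapped over sc.parallelize(emlFiles) so that the parsing itself is distributed instead of running on the driver.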

SparkML - Creating a df(feature, feature_importance) of a RandomForestRegressionModel

I am training a Random Forest model in the following way:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Indexer
val stringIndexers = categoricalColumns.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "Idx")
    .setHandleInvalid("keep")
    .fit(training)
}
// HotEncoder
val encoders = featuresEnconding.map { colName =>
  new OneHotEncoderEstimator()
    .setInputCols(Array(colName + "Idx"))
    .setOutputCols(Array(colName + "Enc"))
    .setHandleInvalid("keep")
}
// Adding features into a feature vector column
val assembler = new VectorAssembler()
  .setInputCols(featureColumns)
  .setOutputCol("features")
val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
val stepsRF = stringIndexers ++ encoders ++ Array(assembler, rf)
val pipelineRF = new Pipeline()
  .setStages(stepsRF)
val paramGridRF = new ParamGridBuilder()
  .addGrid(rf.maxBins, Array(800))
  .addGrid(rf.featureSubsetStrategy, Array("all"))
  .addGrid(rf.minInfoGain, Array(0.05))
  .addGrid(rf.minInstancesPerNode, Array(1))
  .addGrid(rf.maxDepth, Array(28, 29, 30))
  .addGrid(rf.numTrees, Array(20))
  .build()
// Defining the evaluator
val evaluatorRF = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
// Using cross validation to train the model
// Start with TrainSplit - cross validation is taking too long so far
val cvRF = new CrossValidator()
  .setEstimator(pipelineRF)
  .setEvaluator(evaluatorRF)
  .setEstimatorParamMaps(paramGridRF)
  .setNumFolds(10)
  .setParallelism(3)
// Fitting the model with our training dataset
val cvRFModel = cvRF.fit(training)
What I would like now is to get the importance of each feature in the model after training.
I am able to get the importances as an Array[Double] like this:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel

val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]
val size = bestModel.stages.size - 1
val ftrImp = bestModel.stages(size).asInstanceOf[RandomForestRegressionModel].featureImportances.toArray
But I only get the importance values with a numerical index; I don't know which feature name in my model corresponds to each importance value.
I would also like to mention that, since I am using a one-hot encoder, the final number of features is much larger than the original featureColumns array.
How can I extract the feature names used during the training of my model?
I found this possible solution:
import org.apache.spark.ml.attribute._
import org.apache.spark.sql.functions.desc
import spark.implicits._

val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]
val lstModel = bestModel.stages.last.asInstanceOf[RandomForestRegressionModel]

// The features column carries ML attribute metadata with the expanded feature names
val schema = predictions.schema
val featureAttrs = AttributeGroup.fromStructField(schema(lstModel.getFeaturesCol)).attributes.get
val mfeatures = featureAttrs.map(_.name.get)

// Pair each name with its importance and sort descending
val mdf = sc.parallelize(mfeatures zip ftrImp).toDF("featureName", "Importance")
  .orderBy(desc("Importance"))
display(mdf)
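As a side note (added here, not part of the original post): this works because VectorAssembler stores ML attribute metadata on the features column, including the names of the one-hot-expanded columns, and the order of featureAttrs matches the indices of featureImportances. If you would rather not go through an RDD, the same DataFrame can be built from the zipped arrays directly (assuming spark.implicits._ is in scope):
// Equivalent result without sc.parallelize (a variant, not from the original post)
val mdf = (mfeatures zip ftrImp).toSeq
  .toDF("featureName", "Importance")
  .orderBy(desc("Importance"))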

Spark Task not serializable (Array[Vector])

I am new to Spark, and I'm studying the "Advanced Analytics with Spark" book. The code is from the examples in the book. When I try to run the following code, I get a Spark "Task not serializable" exception.
val kMeansModel = pipelineModel.stages.last.asInstanceOf[KMeansModel]
val centroids: Array[Vector] = kMeansModel.clusterCenters
val clustered = pipelineModel.transform(data)
val threshold = clustered.
select("cluster", "scaledFeatureVector").as[(Int, Vector)].
map { case (cluster, vec) => Vectors.sqdist(centroids(cluster), vec) }.
orderBy($"value".desc).take(100).last
Also, this is how I build the model:
import scala.util.Random
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}

def oneHotPipeline(inputCol: String): (Pipeline, String) = {
  val indexer = new StringIndexer()
    .setInputCol(inputCol)
    .setOutputCol(inputCol + "_indexed")
  val encoder = new OneHotEncoder()
    .setInputCol(inputCol + "_indexed")
    .setOutputCol(inputCol + "_vec")
  val pipeline = new Pipeline()
    .setStages(Array(indexer, encoder))
  (pipeline, inputCol + "_vec")
}

val k = 180
val (protoTypeEncoder, protoTypeVecCol) = oneHotPipeline("protocol_type")
val (serviceEncoder, serviceVecCol) = oneHotPipeline("service")
val (flagEncoder, flagVecCol) = oneHotPipeline("flag")

// Original columns, without label / string columns, but with new vector encoded cols
val assembleCols = Set(data.columns: _*) --
  Seq("label", "protocol_type", "service", "flag") ++
  Seq(protoTypeVecCol, serviceVecCol, flagVecCol)
val assembler = new VectorAssembler().
  setInputCols(assembleCols.toArray).
  setOutputCol("featureVector")
val scaler = new StandardScaler()
  .setInputCol("featureVector")
  .setOutputCol("scaledFeatureVector")
  .setWithStd(true)
  .setWithMean(false)
val kmeans = new KMeans().
  setSeed(Random.nextLong()).
  setK(k).
  setPredictionCol("cluster").
  setFeaturesCol("scaledFeatureVector").
  setMaxIter(40).
  setTol(1.0e-5)
val pipeline = new Pipeline().setStages(
  Array(protoTypeEncoder, serviceEncoder, flagEncoder, assembler, scaler, kmeans))
val pipelineModel = pipeline.fit(data)
I am assuming the problem is with the line Vectors.sqdist(centroids(cluster), vec). For some reason, I cannot use centroids in my Spark calculations. I have done some Googling, and I know this error happens when "I initialize a variable on the master, but then try to use it on the workers", which in my case is centroids. However, I do not know how to address this problem.
In case you're interested, here is the entire code for this tutorial in the book, and here is the link to the dataset that the tutorial uses.
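A common workaround for this family of errors (a general pattern, not necessarily the book's or the thread's accepted fix) is to make sure the closure passed to map captures only small, serializable values instead of the enclosing class or notebook object, for example by moving the computation into a function whose parameters hold everything it needs, or by broadcasting the centroids. A minimal sketch, reusing the column names above; anomalyThreshold is a hypothetical helper name:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}

// The closure below captures only the `centroids` parameter (a serializable
// Array[Vector]) rather than whatever object the surrounding code lives in.
def anomalyThreshold(spark: SparkSession, clustered: DataFrame, centroids: Array[Vector]): Double = {
  import spark.implicits._
  clustered
    .select("cluster", "scaledFeatureVector").as[(Int, Vector)]
    .map { case (cluster, vec) => Vectors.sqdist(centroids(cluster), vec) }
    .orderBy($"value".desc)
    .take(100)
    .last
}

val threshold = anomalyThreshold(spark, clustered, kMeansModel.clusterCenters)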

Convert RDF4J stream filter (lambda?) from Java to Scala

A follow-up to Are typed literals "tricky" in RDF4J?
I have some triples about the weight of dump trucks, using literal objects with different datatypes. I'm only interested in the integer values, so I want to filter based on the datatype. Jeen Broekstra sent a Java solution about a week ago, and I'm having trouble converting it into Scala, my team's preferred language.
This is what I have so far. Eclipse is complaining:
not found: value l
import java.util.stream.Collectors
import org.eclipse.rdf4j.model.util.Models
import org.eclipse.rdf4j.model.vocabulary.XMLSchema
import org.eclipse.rdf4j.query.QueryResults
import org.eclipse.rdf4j.repository.http.HTTPRepository

val rdf4jServer = "http://host.domain:7200"
val repositoryID = "trucks"
val MyRepo = new HTTPRepository(rdf4jServer, repositoryID)
MyRepo.initialize()
var con = MyRepo.getConnection()
val f = MyRepo.getValueFactory()
val DumpTruck = f.createIRI("http://example.com/dumpTruck")
val Weight = f.createIRI("http://example.com/weight")
val m = QueryResults.asModel(con.getStatements(DumpTruck, Weight, null))
val intValuesStream = Models.objectLiterals(m).stream()
// OK up to here
// errors start below
val intValuesFiltered =
  intValuesStream.filter(l -> l.getDatatype().equals(XMLSchema.INTEGER))
val intValues = intValuesFiltered.collect(Collectors.toList())
Replace the -> with =>:
val intValuesFiltered = intValuesStream.filter(l => l.getDatatype().equals(XMLSchema.INTEGER))
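One extra note (not part of the original answer): passing a Scala lambda to java.util.stream.Stream.filter relies on SAM conversion, which Scala 2.12 and later perform automatically. On Scala 2.11 you may have to spell out the java.util.function.Predicate yourself, roughly like this:
import java.util.function.Predicate
import org.eclipse.rdf4j.model.Literal
import org.eclipse.rdf4j.model.vocabulary.XMLSchema

// Explicit Predicate for Scala versions without SAM conversion (pre-2.12)
val isXsdInteger = new Predicate[Literal] {
  def test(l: Literal): Boolean = l.getDatatype().equals(XMLSchema.INTEGER)
}
val intValuesFiltered = intValuesStream.filter(isXsdInteger)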