Write a Scala filter on a DataFrame where a column should have more than two words

val tempDf = Df.filter(Df("column_1")==="200")
Now I want to filter tempDf on one column (column_2), which should contain more than 2 words.
val extractedDf = tempDf.filter(*)
How can we write the filter in Scala in place of *?

You can use the size and split functions from org.apache.spark.sql.functions (the $ column syntax also needs import spark.implicits._).
val extractedDf = tempDf.filter(size(split($"column_2"," ")) > 2)
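One caveat with splitting on a single space: consecutive spaces produce empty tokens, so a value with two words separated by a double space ends up with size 3. Below is a minimal sketch of the same check in plain Scala (no Spark needed; WordFilter and its method names are made up for illustration) that splits on runs of whitespace instead.

```scala
// Plain-Scala sketch; WordFilter and its methods are hypothetical names.
object WordFilter {
  // Count words by splitting on runs of whitespace,
  // ignoring leading and trailing blanks.
  def wordCount(s: String): Int = {
    val trimmed = s.trim
    if (trimmed.isEmpty) 0 else trimmed.split("\\s+").length
  }

  // The predicate from the question: more than two words.
  def hasMoreThanTwoWords(s: String): Boolean = wordCount(s) > 2
}
```

Since Spark's split also takes a regular expression, the DataFrame version of this idea would presumably be size(split(trim($"column_2"), "\\s+")) > 2. That line is an untested assumption, not part of the original answer.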

Related

How to filter an rdd by data type?

I have an RDD that I am trying to filter for only float values. Do Spark RDDs provide any way of doing this?
I have a CSV from which I need only the float values greater than 40 in a new RDD. To achieve this, I check whether the field is an instance of type Float and filter on that. When I filter with a !, all the strings are still in the output, and when I don't use !, the output is empty.
val airports1 = airports.filter(line => !line.split(",")(6).isInstanceOf[Float])
val airports2 = airports1.filter(line => line.split(",")(6).toFloat > 40)
At the .toFloat, I run into a NumberFormatException, which I've tried to handle in a try/catch block.
Since you have a plain string and are trying to get float values from it, you are not actually filtering by type: every field returned by split is a String, so isInstanceOf[Float] is always false. What you want is to keep the fields that can be parsed as a float.
You can accomplish that using flatMap together with Option.
import org.apache.spark.sql.SparkSession
import scala.util.Try
val spark = SparkSession.builder.master("local[*]").appName("Float caster").getOrCreate()
val sc = spark.sparkContext
val data = List("x,10", "y,3.3", "z,a")
val rdd = sc.parallelize(data) // rdd: RDD[String]
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption) // filtered: RDD[Float]
filtered.collect() // res0: Array[Float] = Array(10.0, 3.3)
For the > 40 part you can either perform another filter afterwards or filter the inner Option.
(Both should perform more or less the same because Spark evaluates lazily and pipelines the two steps, so choose whichever is clearer to you.)
// Option 1 - Another filter.
val filtered2 = filtered.filter(x => x > 40)
// Option 2 - Filter the inner option in one step.
val filteredOneStep = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption.filter(x => x > 40))
Let me know if you have any question.
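To see why the original isInstanceOf filter could never work, and how the parsing approach behaves on malformed lines, here is a minimal sketch on a plain Scala Seq instead of an RDD. The object name, the sample lines in the usage note, and the index and threshold parameters are assumptions for illustration.

```scala
import scala.util.Try

// Plain-Scala sketch; FloatFields is a hypothetical name.
object FloatFields {
  // A field produced by split is always a String,
  // so this type test is false for every such field.
  def isFloatInstance(field: Any): Boolean = field.isInstanceOf[Float]

  // Parse one comma-separated field; None if the field is missing
  // or cannot be parsed as a Float.
  def parseField(line: String, index: Int): Option[Float] =
    Try(line.split(",")(index).toFloat).toOption

  // Keep only the parsed values above the threshold.
  def floatsAbove(lines: Seq[String], index: Int, threshold: Float): Seq[Float] =
    lines.flatMap(line => parseField(line, index).filter(_ > threshold))
}
```

For example, FloatFields.floatsAbove(Seq("x,10", "y,41.5", "z,a", "w"), 1, 40f) keeps only 41.5: the unparsable "a" and the line with no second field are both dropped by Try instead of throwing.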

How to select several element from an RDD file line using Spark in Scala

I'm new to Spark and Scala and I would like to select several columns from a dataset.
I loaded my data file into an RDD using:
val dataset = sc.textFile(args(0))
Then I split each line:
val resu = dataset.map(line => line.split("\001"))
But my dataset has a lot of features and I just want to keep some of them (columns 2 and 3).
I tried this (which works with PySpark), but it doesn't work:
val resu = dataset.map(line => line.split("\001")[2,3])
I know this is a newbie question, but can someone help me? Thanks.
"I just want to keep some of them (columns 2 and 3)"
If you want columns 2 and 3 as a tuple, you can do
val resu = dataset.map(line => {
  val array = line.split("\001")
  (array(2), array(3))
})
If you want columns 2 and 3 as an array instead, you can do
val resu = dataset.map(line => {
  val array = line.split("\001")
  Array(array(2), array(3))
})
In Scala, you access a specific element of an array or list with parentheses, not square brackets.
In your case you want a sub-array, so you can use the slice(i, j) function, which extracts the elements from index i up to j - 1. So you could use:
val resu = dataset.map(line => line.split("\001").slice(2,4))
Hope it helps.
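The tuple and slice approaches differ on short lines: array(3) throws if a line has fewer than four fields, while slice simply returns fewer elements. A minimal plain-Scala sketch of both, without Spark (ColumnPick and its method names are hypothetical). As far as I know, newer Scala versions reject the octal escape "\001", so the sketch spells the same separator as "\u0001".

```scala
// Plain-Scala sketch; ColumnPick is a hypothetical name.
object ColumnPick {
  // Same character as "\001" in the question, written as a Unicode escape.
  val Sep = "\u0001"

  // Columns 2 and 3 as a tuple, or None when the line is too short.
  def asTuple(line: String): Option[(String, String)] = {
    val cols = line.split(Sep)
    if (cols.length > 3) Some((cols(2), cols(3))) else None
  }

  // Columns 2 and 3 as an array; slice never throws on short input.
  def asArray(line: String): Array[String] =
    line.split(Sep).slice(2, 4)
}
```

With an RDD this would be used as dataset.map(ColumnPick.asArray), or dataset.flatMap(ColumnPick.asTuple) to silently skip short lines. Note that split drops trailing empty fields, so a line ending in separators may come out shorter than expected.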

Counting all the characters in the file using Spark/scala?

How can I count all the characters in a file using Spark/Scala? Here is what I am doing in the Spark shell:
scala> val logFile = sc.textFile("ClasspathLength.txt")
scala> val counts = logFile.flatMap(line=>line.split("").map(char=>(char,1))).reduceByKey(_ + _)
scala> println(counts.count())
62
I am getting an incorrect count here. Could someone help me fix this?
What you're doing here is:
1. Counting the number of times each unique character appears in the input:
val counts = logFile.flatMap(line=>line.split("").map(char=>(char,1))).reduceByKey(_ + _)
2. Counting the number of records in this result (using counts.count()), which ignores the actual values you calculated in the previous step. The 62 you see is therefore the number of distinct characters, not the total.
If you're interested in the total number of characters in the file, there's no need for grouping at all: you can map each line to its length and then use the implicit conversion into DoubleRDDFunctions to call sum():
logFile.map(_.length).sum()
Alternatively, you can flatMap into a separate record per character and then use count:
logFile.flatMap(_.toList).count
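The difference between the two numbers can be sketched on a plain Scala Seq of lines standing in for the RDD (CharCounts and its method names are hypothetical):

```scala
// Plain-Scala sketch; CharCounts is a hypothetical name.
object CharCounts {
  // What the question computed: one record per distinct character.
  def distinctChars(lines: Seq[String]): Int =
    lines.flatMap(_.toList).distinct.size

  // What was actually wanted: the total number of characters.
  def totalChars(lines: Seq[String]): Long =
    lines.map(_.length.toLong).sum
}
```

For the lines "aab" and "bc", distinctChars gives 3 (a, b, c) while totalChars gives 5. Either way, line terminators are not counted, because textFile returns the lines without them.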
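import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .master("local[4]")
  .appName("Nos of Word Count")
  .getOrCreate()
val sparkConfig = spark.sparkContext // note: this is the SparkContext, despite the name
sparkConfig.setLogLevel("ERROR")
val rdd1 = sparkConfig.textFile("data/mini.txt")
println(rdd1.count()) // number of lines
val rdd2 = rdd1.flatMap(f => f.split(" ")) // split each line into space-separated tokens
println(rdd2.count()) // number of tokens
val rdd3 = rdd2.map(w => w.length) // characters per token (the spaces themselves are not counted)
println(rdd3.sum().round) // total characters across all tokens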
All you need here is flatMap + count
logFile.flatMap(identity).count

Spark Dataframe Scala: groupby doesn't work after UnionAll

I used unionAll to combine the source DF (with negative weights) and the target DF (with positive weights) into a node DF. Then I perform a groupBy to sum the weights of nodes with the same name, but I don't know why groupBy doesn't seem to work on the unioned DF at all. Has anyone faced the same problem?
val src = file.map(_.split("\t")).map(p => node(p(0), (0-p(2).trim.toInt))).toDF()
val target = file.map(_.split("\t")).map(p => node(p(1), p(2).trim.toInt)).toDF()
val srcfl = src.filter(src("weight") != -1)
val targetfl = target.filter(target("weight") != 1)
val nodes = srcfl.unionAll(targetfl)
nodes.groupBy("name").sum()
nodes.map(x => x.mkString("\t")).saveAsTextFile("hdfs://localhost:8020" + args(1))
You're simply ignoring the result of the groupBy operation: just like all DataFrame transformations, .groupBy(...).sum() doesn't mutate the original DataFrame (nodes), it produces a new one. I suspect that if you actually use the return value from sum() you'll see the result you're looking for:
val result = nodes.groupBy("name").sum()
result.map(x => x.mkString("\t")).saveAsTextFile("hdfs://localhost:8020" + args(1))
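The same principle holds for ordinary immutable Scala collections, which makes it easy to try out without a cluster. A minimal sketch, with a plain Seq standing in for the unioned DataFrame (NodeWeights, Node and sumByName are hypothetical names, not Spark API):

```scala
// Plain-Scala sketch; NodeWeights and Node are hypothetical names.
object NodeWeights {
  final case class Node(name: String, weight: Int)

  // Group by name and sum the weights. The input is left untouched;
  // the aggregated result is a new value that the caller must use.
  def sumByName(nodes: Seq[Node]): Map[String, Int] =
    nodes.groupBy(_.name).map { case (name, group) =>
      name -> group.map(_.weight).sum
    }
}
```

Calling NodeWeights.sumByName(nodes) and discarding the result leaves nodes exactly as it was, which mirrors what happened with nodes.groupBy("name").sum() in the question.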

Spark ML VectorAssembler() dealing with thousands of columns in dataframe

I was using a Spark ML pipeline to set up classification models on a really wide table. This means that I have to automatically generate all the code that deals with columns instead of literally typing each of them. I am pretty much a beginner in Scala and Spark. I got stuck at the VectorAssembler() part when I was trying to do something like the following:
val featureHeaders = featureHeader.collect.mkString(" ")
//convert the header RDD into a string
val featureArray = featureHeaders.split(",").toArray
val quote = "\""
val featureSIArray = featureArray.map(x => (s"$quote$x$quote"))
//count the element in headers
val featureHeader_cnt = featureHeaders.split(",").toList.length
// Fit on whole dataset to include all labels in index.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
val labelIndexer = new StringIndexer().
  setInputCol("target").
  setOutputCol("indexedLabel")
val featureAssembler = new VectorAssembler().
  setInputCols(featureSIArray).
  setOutputCol("features")
val convpipeline = new Pipeline().
  setStages(Array(labelIndexer, featureAssembler))
val myFeatureTransfer = convpipeline.fit(df)
Apparently it didn't work. I am not sure what I should do to make the whole thing more automatic, or whether the ML pipeline cannot take that many columns at this moment (which I doubt).
I finally figured out one way, which is not very pretty: create a dense vector (Vectors.dense) for the features, and then create a data frame out of this.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val myDataRDDLP = inputData.map { line =>
  val indexed = line.split('\t').zipWithIndex
  // Feature values: every column after index 1770
  val myValues = indexed.filter(x => x._2 > 1770).map(x => x._1).map(_.toDouble)
  // Label: column 3, shifted down by 1
  val mykey = indexed.filter(x => x._2 == 3).map(x => (x._1.toDouble - 1)).mkString.toDouble
  LabeledPoint(mykey, Vectors.dense(myValues))
}
val training = sqlContext.createDataFrame(myDataRDDLP).toDF("label", "features")
You shouldn't use quotes (s"$quote$x$quote") unless the column names contain quotes. Try:
val featureAssembler = new VectorAssembler().
  setInputCols(featureArray).
  setOutputCol("features")
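Why the quoting breaks things can be checked without Spark: the assembler looks columns up by name, and a name wrapped in literal quote characters is a different string. A minimal plain-Scala sketch (FeatureNames, its methods, and the example column names in the usage note are hypothetical) that parses a comma-separated header and reports which requested names are not among the available columns:

```scala
// Plain-Scala sketch; FeatureNames is a hypothetical name.
object FeatureNames {
  // Split a comma-separated header into trimmed, non-empty column names.
  def parseHeader(header: String): Array[String] =
    header.split(",").map(_.trim).filter(_.nonEmpty)

  // What the question did: wrap every name in literal double quotes.
  def quoted(names: Array[String]): Array[String] =
    names.map(n => "\"" + n + "\"")

  // Names that do not match any available column.
  def missingFrom(names: Seq[String], columns: Set[String]): Seq[String] =
    names.filterNot(columns.contains)
}
```

With available columns f1, f2 and target, the names parsed from the header "f1, f2" all match, while their quoted versions are all reported as missing. Running a check like this against df.columns before building the assembler is one way to catch such mismatches early.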
For PySpark, you can first create a list of the column names:
df_colnames = df.columns
Then you can use that in VectorAssembler:
from pyspark.ml.feature import VectorAssembler
assemble = VectorAssembler(inputCols=df_colnames, outputCol='features')
df_vectorized = assemble.transform(df)