Pyspark equivalent of Scala Spark - scala

I have the following code in Scala:
val checkedValues = inputDf.rdd.map(row => {
  val size = row.length
  val items = for (i <- 0 until size) yield {
    val fieldName = row.schema.fieldNames(i)
    val sourceField = sourceFields(fieldName) // sourceField is a map which returns another object
    val value = Option(row.get(i))
    sourceField.checkType(value)
  }
  items
})
Basically, the above snippet takes a Spark DataFrame, converts it into an RDD, and applies a map function to return an RDD that is just a collection of objects carrying the datatype and other information for each of the values in the DataFrame.
How would I go about writing something equivalent in PySpark, given that schema is not an attribute of Row in PySpark, among other things?

Related

Recursively calculate columns and add to Spark Dataframe in Scala

I am new to Scala and Apache Spark. I am trying to calculate mean and standard deviation for a few columns in a Spark dataframe and append the result to the source dataframe. I am trying to do this recursively. Following is my function.
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{mean, stddev_pop}

def get_meanstd_data(mergedDF: DataFrame, grpByList: Seq[String]): DataFrame = {
  val normFactors = Iterator("factor_1", "factor_2", "factor_3", "factor_4")
  def meanStdCalc(df: DataFrame, column: String): DataFrame = {
    val meanDF = df.select("column_1", column).groupBy(grpByList.head, grpByList.tail: _*).
      agg(mean(column).as("mean_" + column))
    val stdDF = df.select("column_1", column).groupBy(grpByList.head, grpByList.tail: _*).
      agg(stddev_pop(column).as("stddev_" + column))
    val finalDF = meanDF.join(stdDF, usingColumns = grpByList, joinType = "left")
    finalDF
  }
  def recursorFunc(df: DataFrame): DataFrame = {
    @tailrec
    def recursorHelper(acc: DataFrame): DataFrame = {
      if (!normFactors.hasNext) acc
      else recursorHelper(meanStdCalc(acc, normFactors.next()))
    }
    recursorHelper(df)
  }
  val finalDF = recursorFunc(mergedDF)
  finalDF
}
But when I call the function, the resulting dataframe only contains the mean and standard deviation of "factor_4". How do I get a dataframe with the mean and standard deviation of all the factors appended to the original dataframe?
Any help is much appreciated.
You probably don't need a custom recursive method; you could use a fold instead.
Something like creating normFactors as a List and using foldLeft:
val normFactors = List("factor_1", "factor_2", "factor_3", "factor_4")
normFactors.foldLeft(mergedDF)((df, column) => meanStdCalc(df, column))
foldLeft allows you to use the DataFrame as an accumulator
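For illustration, here is a self-contained sketch of the fold-based approach, with the assumption (beyond the answer's snippet) that each step joins the aggregates back onto the accumulator so the columns for earlier factors are kept:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{mean, stddev_pop}

// Hypothetical wrapper; grpByList and the factor names mirror the question.
def appendMeanStd(mergedDF: DataFrame, grpByList: Seq[String]): DataFrame = {
  def meanStdCalc(df: DataFrame, column: String): DataFrame = {
    val stats = df.groupBy(grpByList.head, grpByList.tail: _*)
      .agg(mean(column).as("mean_" + column), stddev_pop(column).as("stddev_" + column))
    df.join(stats, grpByList, "left") // append the stats instead of replacing the DataFrame
  }
  val normFactors = List("factor_1", "factor_2", "factor_3", "factor_4")
  normFactors.foldLeft(mergedDF)(meanStdCalc) // the DataFrame is threaded through as the accumulator
}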

How to filter RDD with sequence list in scala

Hi, I am trying to filter an RDD against a dynamic sequence of values in a single statement so that the filter can be applied.
I have the RDD and the list of restrictions below:
val contain_string = ("keymustexist1,alsokeyexmple2").split(",");
var rdd2 = contain_string.map(each_value=>
rdd.filter(l=>l.rdd.contains(each_value))
)
You should do the filter directly, with the contains check inside the filter, rather than doing a map and then a filter.
Here is the code:
val contain_string = ("keymustexist1,alsokeyexmple2").split(",");
val lst = List("alsokeyexmple2","Hey","Hi")
val inputrdd = sc.parallelize(lst)
val op = inputrdd.filter(x => contain_string.contains(x)).collect
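If contain_string grows large, a common variant (an assumption beyond the answer, not something it shows) is to turn it into a Set and broadcast it so each executor gets one copy and lookups are constant-time:
val keys = sc.broadcast(contain_string.toSet) // ship the lookup set to the executors once
val op = inputrdd.filter(x => keys.value.contains(x)).collect()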

How can I deal with Tuples in Spark Streaming?

I have a problem with Spark Scala: I want to multiply tuple elements in Spark Streaming. I get data from Kafka into a DStream, and my RDD data looks like this:
(2,[2,3,4,6,5])
(4,[2,3,4,6,5])
(7,[2,3,4,6,5])
(9,[2,3,4,6,5])
I want to perform a multiplication on it like this:
(2,[2*2,3*2,4*2,6*2,5*2])
(4,[2*4,3*4,4*4,6*4,5*4])
(7,[2*7,3*7,4*7,6*7,5*7])
(9,[2*9,3*9,4*9,6*9,5*9])
Then I get an RDD like this:
(2,[4,6,8,12,10])
(4,[8,12,16,24,20])
(7,[14,21,28,42,35])
(9,[18,27,36,54,45])
Finally, I take the smallest of the multiplied values as the second element of each tuple, like this:
(2,4)
(4,8)
(7,14)
(9,18)
How can I do this in Scala from a DStream? I use Spark version 1.6.
Here is a demo in Scala:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

val conf = new SparkConf().setAppName("ttt").setMaster("local")
val sc = new SparkContext(conf)
val data = Array("2,2,3,4,6,5", "4,2,3,4,6,5", "7,2,3,4,6,5", "9,2,3,4,6,5")
val lines = sc.parallelize(data) // change to your data (each RDD in streaming)
lines.map(x => (x.split(",")(0).toInt, List(x.split(",")(1).toInt, x.split(",")(2).toInt, x.split(",")(3).toInt, x.split(",")(4).toInt, x.split(",")(5).toInt)))
  .map(x => (x._1, x._2.min))    // take the smallest value in the list
  .map(x => (x._1, x._2 * x._1)) // multiply it by the key
  .foreach(x => println(x))
Here is the result:
(2,4)
(4,8)
(7,14)
(9,18)
Each RDD in a DStream contains the data for a specific time interval, and you can manipulate each RDD as you want.
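To apply the same logic per batch directly on the DStream rather than on a test RDD, a minimal sketch could look like the following; dstream is an assumed DStream[String] of the same comma-separated records, not a name from the question:
import org.apache.spark.streaming.dstream.DStream

// dstream: DStream[String], each record like "2,2,3,4,6,5"
def process(dstream: DStream[String]): DStream[(Int, Int)] =
  dstream
    .map { line =>
      val parts = line.split(",").map(_.toInt)
      (parts.head, parts.tail.toList) // (key, values)
    }
    .map { case (key, values) => (key, values.min * key) } // smallest value times the key

// process(dstream).print() // prints e.g. (2,4), (4,8), ... for each batch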
Let's say you are getting the tuple RDD in a variable input:
import scala.collection.mutable.ListBuffer

val result = input
  .map(x => { // for each (key, list) element
    val l = new ListBuffer[Int]() // a new list for storing the multiplication results
    for (i <- x._2) { // for each value in the list
      l += x._1 * i // append the value multiplied by the key
    }
    (x._1, l.toList) // return the new tuple
  })
  .map(x => {
    (x._1, x._2.min) // return a new tuple with the minimum element of the list
  })
result.foreach(println) should result in:
(2,4)
(4,8)
(7,14)
(9,18)

Get Json Key value from List[Row] with Scala

Let's say that I have a List[Row] such as {"name":"abc","salary":"somenumber","id":"1"},{"name":"xyz","salary":"some_number_2","id":"2"}
How do I get the JSON key-value pair with Scala? Let's assume that I want to get the value of the key "salary". Is the code below right?
val rows = List[Row] //Assuming that rows has the list of rows
for (row <- rows) {
  row.get(0).+("salary")
}
If you have a List[Row], I assume that you had a DataFrame and you called collectAsList. If you collect/collectAsList, that means that you:
can no longer use Spark SQL operations on that data, and
cannot run your calculations in parallel on the nodes in your cluster; at this point everything is executed in your driver.
I would recommend keeping it as a DataFrame and then doing:
val salaries = df.select("salary")
Then you can do further calculations on the salaries, show them or collect or persist them somewhere.
If you choose to use a Dataset (which is like a typed DataFrame), then you could do:
val salaries = dataSet.map(_.salary)
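For illustration, the Dataset route end to end might look like this; Employee and spark (a SparkSession) are assumed names that are not in the original:
// Hypothetical case class matching the JSON records
case class Employee(name: String, salary: String, id: String)

import spark.implicits._             // encoders needed by .as[...] and .map
val dataSet = df.as[Employee]        // typed view of the DataFrame
val salaries = dataSet.map(_.salary) // Dataset[String] with only the salary values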
Using Spray Json:
import spray.json._
import DefaultJsonProtocol._
object sprayApp extends App {
  val list = List("""{"name":"abc","salary":"somenumber","id":"1"}""", """{"name":"xyz","salary":"some_number_2","id":"2"}""")
  val jsonAst = list.map(_.parseJson)
  for (l <- jsonAst) {
    println(l.asJsObject.getFields("salary")(0))
  }
}

Spark Scala: Generating list of DataFrame based on values in RDD

I have an RDD containing values; each of those values will be passed to a function generate_df(num: Int) to create a DataFrame. So essentially, in the end, we will have a list of DataFrames stored in a ListBuffer, like this: var df_list_example = new ListBuffer[org.apache.spark.sql.DataFrame]().
First I will show the code and the result of doing it using a list instead of an RDD:
var df_list = new ListBuffer[org.apache.spark.sql.DataFrame]()
for (i <- list_values) { // list_values contains values
  df_list += generate_df(i)
}
Result:
df_list:
scala.collection.mutable.ListBuffer[org.apache.spark.sql.DataFrame] =
ListBuffer([value: int], [value: int], [value: int])
However, when I use an RDD, which is essential for my use case, I run into an issue:
var df_rdd_list = new ListBuffer[org.apache.spark.sql.DataFrame]()
//rdd_values contains values
rdd_values.map( i => df_rdd_list += generate_df(i) )
Result:
df_rdd_list:
scala.collection.mutable.ListBuffer[org.apache.spark.sql.DataFrame] =
ListBuffer()
Basically, the ListBuffer remains empty and does not store any DataFrames, unlike when I use a list of values instead of an RDD of values. Mapping over the RDD is essential for my original use case.
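For what it's worth, the likely explanation is that rdd_values.map is a lazy transformation whose closure runs on the executors, so additions to the driver-side ListBuffer are never visible on the driver, and DataFrames cannot be created inside executor-side code anyway. A minimal driver-side sketch, assuming the rdd_values and generate_df names from the question:
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.DataFrame

// Bring the values back to the driver first, then build the DataFrames there;
// DataFrames can only be constructed on the driver, not inside RDD transformations.
var df_rdd_list = new ListBuffer[DataFrame]()
for (i <- rdd_values.collect()) {
  df_rdd_list += generate_df(i)
}

// or, without a mutable buffer:
val dfs: List[DataFrame] = rdd_values.collect().toList.map(generate_df)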