Spark Scala: Generating a list of DataFrames based on values in an RDD

I have an RDD containing values; each of those values will be passed to a function generate_df(num: Int) to create a DataFrame. So essentially, in the end we will have a list of DataFrames stored in a list buffer like this: var df_list_example = new ListBuffer[org.apache.spark.sql.DataFrame]().
First I will show the code and result of doing it using a list instead of an RDD:
var df_list = new ListBuffer[org.apache.spark.sql.DataFrame]()
for (i <- list_values) //list_values contains values
{
df_list += generate_df(i)
}
Result:
df_list:
scala.collection.mutable.ListBuffer[org.apache.spark.sql.DataFrame] =
ListBuffer([value: int], [value: int], [value: int])
However, when I use an RDD, which is essential for my use case, I run into an issue:
var df_rdd_list = new ListBuffer[org.apache.spark.sql.DataFrame]()
//rdd_values contains values
rdd_values.map( i => df_rdd_list += generate_df(i) )
Result:
df_rdd_list:
scala.collection.mutable.ListBuffer[org.apache.spark.sql.DataFrame] =
ListBuffer()
Basically, the list buffer remains empty and cannot store DataFrames, unlike when I use a list of values instead of an RDD of values. Mapping over the RDD is essential for my original use case.
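A likely explanation, plus a minimal workaround sketch (assuming the values are small enough to collect to the driver): the closure passed to rdd_values.map runs on the executors against serialized copies of df_rdd_list, so the driver-side buffer never sees the additions. One option is to bring the values back to the driver first and build the DataFrames there; collected_values and df_list_from_rdd below are hypothetical names.
// Bring the values to the driver, then build the DataFrames locally
val collected_values = rdd_values.collect()
val df_list_from_rdd = collected_values.map(i => generate_df(i)).toList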

Related

Pyspark equivalent of Scala Spark

I have the following code in Scala:
val checkedValues = inputDf.rdd.map(row => {
val size = row.length
val items = for (i <- 0 until size) yield {
val fieldName = row.schema.fieldNames(i)
val sourceField = sourceFields(fieldName) // sourceField is a map which returns another object
val value = Option(row.get(i))
sourceField.checkType(value)
}
items
})
Basically, the above snippet takes a Spark DataFrame, converts it into an RDD and applies the map function to return an RDD, which is just a collection of objects with the data type and other information for each of the values in the DataFrame.
How would I go about writing something equivalent in PySpark, given that, among other things, schema is not an attribute of Row in PySpark?

Get Json Key value from List[Row] with Scala

Let's say that I have a List[Row] such as {"name":"abc","salary":"somenumber","id":"1"},{"name":"xyz","salary":"some_number_2","id":"2"}
How do I get the JSON key-value pair with Scala? Let's assume that I want to get the value of the key "salary". Is the code below right?
val rows = List[Row] //Assuming that rows has the list of rows
for(row <- rows){
row.get(0).+("salary")
}
If you have a List[Row], I assume that you had a DataFrame and you called collectAsList. If you collect/collectAsList, that means that you
Can no longer use Spark SQL operations on that data
Cannot run your calculations in parallel on the nodes in your cluster. At this point everything is executed in your driver.
I would recommend keeping it as a DataFrame and then doing:
val salaries = df.select("salary")
Then you can do further calculations on the salaries, show them or collect or persist them somewhere.
If you choose to use a Dataset (which is like a typed DataFrame), then you could do
val salaries = dataSet.map(_.salary)
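For completeness, a minimal sketch of the Dataset variant, assuming a SparkSession named spark and a hypothetical Employee case class whose fields match the JSON columns:
import spark.implicits._
// Hypothetical case class matching the columns name, salary, id
case class Employee(name: String, salary: String, id: String)
val dataSet = df.as[Employee]          // typed view over the same data
val salaries = dataSet.map(_.salary)   // Dataset[String]
salaries.show()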
Using Spray Json:
import spray.json._
import DefaultJsonProtocol._
object sprayApp extends App {
val list = List("""{"name":"abc","salary":"somenumber","id":"1"}""", """{"name":"xyz","salary":"some_number_2","id":"2"}""")
val jsonAst = list.map(_.parseJson)
for(l <- jsonAst) {
println(l.asJsObject.getFields("salary")(0))
}
}

How can I map a function on dataFrame column values which returns a dataFrame?

I have a Spark DataFrame, df1, which contains several columns, one of which holds patient IDs. I want to take this column and apply a function that sends an HTTP request for information regarding every ID, say a medical test. This information is then parsed from JSON and returned by the function as a DataFrame of multiple tests. I want to do this for all the IDs so that I have a second DataFrame, df2, with all the medical test information for the IDs in df1.
I tried the following code, which I think is not optimal, especially for a large number of patients. My problem is that I cannot handle the results in the form of Array[org.apache.spark.sql.DataFrame]. Note this is sample code; in real life I might have 100 tests for one ID and only 3 for another.
import scala.util.Random._
val df1 = Seq(
("9031x", 32),
("1102z", 12),
("3048o", 54)
).toDF("ID", "age")
// a function that takes the string and returns a DataFrame
def getPatientInfo(ID: String): org.apache.spark.sql.DataFrame = {
val r = scala.util.Random
val df2 = Seq(
("test1", r.nextInt(100), r.nextInt(40)+1980, r.nextString(4)),
("test2", r.nextInt(100), r.nextInt(40)+1980, r.nextString(3)),
("test3", r.nextInt(100), r.nextInt(40)+1980, r.nextString(5))
).toDF("testName", "year", "result", "Notes")
df2
}
// convert the ID to Array[String]
val ID = df1.collect().map(row => row.getString(0))
// apply the function foreach ID
val medicalRecords = for (i <- ID) yield {getPatientInfo(i)}
Are there any other optimal approaches?
TL;DR:
It is not possible: DataFrame.map (or an equivalent method) cannot use SparkSession or distributed data structures.
If you want to make it work, use your favorite JSON parser instead and redefine getPatientInfo as either:
def getPatientInfo(ID: String): Seq[Row]
or
def getPatientInfo(ID: String): T
where T is a case class and replace:
df1.flatMap(row => getPatientInfo(row.getString(0)))
(adding Encoder if necessary).
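A minimal sketch of the case-class variant, assuming a SparkSession named spark and a hypothetical TestRecord class; the real getPatientInfo would send the HTTP request and parse the JSON on the executors:
import spark.implicits._
// Hypothetical record type for a single medical test
case class TestRecord(id: String, testName: String, year: Int, result: Int, notes: String)
// Hypothetical reimplementation: returns plain objects instead of DataFrames,
// so it can safely run inside executors
def getPatientInfo(id: String): Seq[TestRecord] = {
  // ... send the HTTP request and parse the JSON response here ...
  Seq.empty
}
// An Encoder for TestRecord is derived via spark.implicits._
val df2 = df1.flatMap(row => getPatientInfo(row.getString(0))).toDF()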

Scala - Converting RDD to map

I am a beginner in Scala.
I have a class User containing a userId as one of the attributes.
I would like to convert RDD of users to a map with the userId as key and user as value.
Thanks!
Let's suppose you have the RDD myUsers: RDD[User]. Each record of the RDD contains the attribute userId. You can transform it into a newRdd this way:
val newRdd = myUsers.map(x => (x.userId, x))
If you want to convert newRdd to a Map (note that this collects the data to the driver):
val myMap = newRdd.collectAsMap()
You can do these two computations in one line; I split them just for explanation.
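For reference, a minimal one-line sketch combining the two steps (the collect still happens on the driver), assuming the User class from the question:
// Build the (userId -> user) pairs and collect them to the driver as a Map
val myMap = myUsers.map(x => (x.userId, x)).collectAsMap()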

How to add source file name to each row in Spark?

I'm new to Spark and am trying to add a column to each input row with the name of the file it comes from.
I've seen others ask a similar question, but all their answers used wholeTextFile, whereas I'm trying to do this for larger CSV files (read using the Spark-CSV library), JSON files, and Parquet files (not just small text files).
I can use the spark-shell to get a list of the filenames:
val df = sqlContext.read.parquet("/blah/dir")
val names = df.select(inputFileName())
names.show
but that's a dataframe.
I am not sure how to add it as a column to each row (or whether that result is ordered the same as the initial data, though I assume it always is), nor how to do this as a general solution for all input types.
Another solution I just found adds the file name as one of the columns in the DataFrame:
import org.apache.spark.sql.functions.input_file_name
val df = sqlContext.read.parquet("/blah/dir")
val dfWithCol = df.withColumn("filename", input_file_name())
Ref:
spark load data and add filename as dataframe column
When you create an RDD from a text file, you probably want to map the data into a case class, so you can add the input source at that stage:
case class Person(inputPath: String, name: String, age: Int)
val inputPath = "hdfs://localhost:9000/tmp/demo-input-data/persons.txt"
val rdd = sc.textFile(inputPath).map {
l =>
val tokens = l.split(",")
Person(inputPath, tokens(0), tokens(1).trim().toInt)
}
rdd.collect().foreach(println)
If you do not want to mix "business data" with meta data:
case class InputSourceMetaData(path: String, size: Long)
case class PersonWithMd(name: String, age: Int, metaData: InputSourceMetaData)
// Fake the size, for demo purposes only
val md = InputSourceMetaData(inputPath, size = -1L)
val rdd = sc.textFile(inputPath).map {
l =>
val tokens = l.split(",")
PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
rdd.collect().foreach(println)
and if you promote the RDD to a DataFrame:
import sqlContext.implicits._
val df = rdd.toDF()
df.registerTempTable("x")
you can query it like
sqlContext.sql("select name, metadata from x").show()
sqlContext.sql("select name, metadata.path from x").show()
sqlContext.sql("select name, metadata.path, metadata.size from x").show()
Update
You can read the files in HDFS using org.apache.hadoop.fs.FileSystem.listFiles() recursively.
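A minimal sketch of building that file list, assuming the Hadoop configuration from the SparkContext and a hypothetical input directory:
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}
import scala.collection.mutable.ListBuffer
// Hypothetical input directory; adjust to your cluster
val inputDir = new Path("hdfs://localhost:9000/tmp/demo-input-data")
val fs = FileSystem.get(sc.hadoopConfiguration)
val it = fs.listFiles(inputDir, true)   // true = recursive
// Drain the RemoteIterator into a standard Scala collection
val buffer = ListBuffer[LocatedFileStatus]()
while (it.hasNext) buffer += it.next()
val files = buffer.toList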
Given a list of file names in a value files (standard Scala collection containing org.apache.hadoop.fs.LocatedFileStatus), you can create one RDD for each file:
val rdds = files.map { f =>
val md = InputSourceMetaData(f.getPath.toString, f.getLen)
sc.textFile(md.path).map {
l =>
val tokens = l.split(",")
PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
}
Now you can reduce the list of RDDs into a single one; the reduce function concatenates all the RDDs:
val rdd = rdds.reduce(_ ++ _)
rdd.collect().foreach(println)
This works, but I cannot test whether it distributes/performs well with large files.