How can I map a function on dataFrame column values which returns a dataFrame? - scala

I have a spark DataFrame, df1, which contains several columns, one of them is with IDs of patients. I want to take this column and perform a function that sends http request for information regarding every ID, say medical test. This information is then parsed from json and returned by the function as DataFrame of multiple tests. I want to do this for all the IDs so that I have a second DataFrame, df2, with all medical tests information for the IDs in df1.
I tried the following code, which I think is not optimal especially for large number of patients. My problem is that I cannot handle the results in the form of Array[org.apache.spark.sql.DataFrame]. Note this is a sample code, in real life I might have 100 tests for one ID and only 3 for another.
import scala.util.Random._
val df1 = Seq(
("9031x", 32),
("1102z", 12),
("3048o", 54)
).toDF("ID", "age")
// a function that takes the string and returns a DataFrame
def getPatientInfo(ID: String): org.apache.spark.sql.DataFrame = {
val r = scala.util.Random
val df2 = Seq(
("test1", r.nextInt(100), r.nextInt(40)+1980, r.nextString(4)),
("test2", r.nextInt(100), r.nextInt(40)+1980, r.nextString(3)),
("test3", r.nextInt(100), r.nextInt(40)+1980, r.nextString(5))
).toDF("testName", "year", "result", "Notes")
df2
}
// convert the ID to Array[String]
val ID = df1.collect().map(row => row.getString(0))
// apply the function foreach ID
val medicalRecords = for (i <- ID) yield {getPatientInfo(i)}
Are there any other optimal approaches?

TL;DR;
It is not possible DataFrame.map (or equivalent method) cannot use SparkSession or distributed data structures.
If you want make it work, use your favorite JSON parser instead and redefine getPatient as either:
def getPatientInfo(ID: String): Seq[Row]
or
def getPatientInfo(ID: String): T
where T is a case class and replace:
df1.flatMap(row => getPatientInfo(row.getString(0)))
(adding Encoder if necessary).

Related

Spark Scala: build a new column using a function using another dataframe

Here's my issue: I have a first dataframe which is basically a list of cities, and the country they reside in. I have a second dataframe, with a list of users, and the cities they reside in.
I'd like to add a "country" column to the second dataframe, where its value would be based on the "city" column of course, but the city names can me typed differently (for example Washington and washington would both have to give me USA).
I though the best way to do that would be to create a foo(country: String) : String which would return the country by parsing the first dataframe, but I can't find a way to use this function while creating my new column.
first put in lower case the city column of both dataframes, since you go to join on key city and after, effect the capitalize of the first letter, this code should do what you are looking for:
object Main {
def main(args: Array[String]): Unit = {
val sparkSession: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
import sparkSession.implicits._
val citiesDF = Seq(
("London", "England"), ("Washington", "USA")
)
.toDF("city", "country")
.withColumn("city", lower(col("city")))
val usersDF = Seq(
("Andy", "London"), ("Mark", "Washington"), ("Bob", "washington")
)
.toDF("name", "city")
.withColumn("city", lower(col("city")))
val resultDF = citiesDF.join(usersDF, Seq("city"))
.withColumn("city", initcap(col("city")))
resultDF.show()
}
}

How to pass DataSet(s) to a function that accepts DataFrame(s) as arguments in Apache Spark using Scala?

I have a library in Scala for Spark which contains many functions.
One example is the following function to unite two dataframes that have different columns:
def appendDF(df2: DataFrame): DataFrame = {
val cols1 = df.columns.toSeq
val cols2 = df2.columns.toSeq
def expr(sourceCols: Seq[String], targetCols: Seq[String]): Seq[Column] = {
targetCols.map({
case x if sourceCols.contains(x) => col(x)
case y => lit(null).as(y)
})
}
// both df's need to pass through `expr` to guarantee the same order, as needed for correct unions.
df.select(expr(cols1, cols1): _*).union(df2.select(expr(cols2, cols1): _*))
}
I would like to use this function (and many more) to Dataset[CleanRow] and not DataFrames. CleanRow is a simple class here that defines the names and types of the columns.
My educated guess is to convert the Dataset into Dataframe using .toDF() method. However, I would like to know whether there are better ways to do it.
From my understanding, there shouldn't be many differences between Dataset and Dataframe since Dataset are just Dataframe[Row]. Plus, I think that from Spark 2.x the APIs for DF and DS have been unified, so I was thinking that I could pass either of them interchangeably, but that's not the case.
If changing signature is possible:
import spark.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
def f[T](d: Dataset[T]): Dataset[T] = {d}
// You are able to pass a dataframe:
f(Seq(0,1).toDF()).show
// res1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
// You are also able to pass a dataset:
f(spark.createDataset(Seq(0,1)))
// res2: org.apache.spark.sql.Dataset[Int] = [value: int]

Using map() and filter() in Spark instead of spark.sql

I have two datasets that I want to INNER JOIN to give me a whole new table with the desired data. I used SQL and manage to get it. But now I want to try it with map() and filter(), is it possible?
This is my code using the SPARK SQL:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object hello {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local")
.setAppName("quest9")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()
val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")
census.createOrReplaceTempView("census")
zip_codes.createOrReplaceTempView("zip")
//val query = spark.sql("SELECT * FROM census")
val query = spark.sql("SELECT DISTINCT census.Total_Males AS male, census.Total_Females AS female FROM census INNER JOIN zip ON census.Zip_Code=zip.Zip_Code WHERE zip.City = 'Inglewood' AND zip.County = 'Los Angeles'")
query.show()
query.write.parquet("/home/hdfs/Documents/population/census/IDE/census.parquet")
sc.stop()
}
}
The only sensible way, in general to do this would be to use the join() method of `Dataset̀€. I would urge you to question the need to use only map/filter to do this, as this is not intuitive, and will probably confuse any experienced spark developer (or simply put, make him roll his eyes). It may also lead to scalability issues should the dataset grow.
That said, in your use case, it is pretty simple to avoid using join. Another possibility would be to issue two separate jobs to spark :
fetch the zip code(s) that interests you
filter on the census data on that (those) zip code(s)
Step 1 collect the zip codes of interest (not sure of the exact syntax as I do not have a spark shell at hand, but it should be trivial to find the right one).
var codes: Seq[String] = zip_codes
// filter on the city
.filter(row => row.getAs[String]("City").equals("Inglewood"))
// filter on the county
.filter(row => row.getAs[String]("County").equals("Los Angeles"))
// map to zip code as a String
.map(row => row.getAs[String]("Zip_Code"))
.as[String]
// Collect on the driver side
.collect()
Then again, writing it this way instead of using select/where is pretty strange to anyone being used to spark.
Yet, the reason this will work is because we can be sure that zip codes matching a given town and county will be really small. So it is safe to perform driver side collcetion of the result.
Now on to step 2 :
census.filter(row => codes.contains(row.getAs[String]("Zip_Code")))
.map( /* whatever to get your data out */ )
What you need is a join, your query roughly translates to :
census.as("census")
.join(
broadcast(zip_codes
.where($"City"==="Inglewood")
.where($"County"==="Los Angeles")
.as("zip"))
,Seq("Zip_Code"),
"inner" // "leftsemi" would also be sufficient
)
.select(
$"census.Total_Males".as("male"),
$"census.Total_Females".as("female")
).distinct()

How to handle missing columns in spark sql

We are dealing with schema free JSON data and sometimes the spark jobs are failing as some of the columns we refer in spark SQL are not available for certain hours in the day. During these hours the spark job fails as the column being referred is not available in the data frame. How to handle this scenario? I have tried UDF but we have too many columns missing so can't really check each and every column for availability. I have also tried inferring a schema on a larger data set and applied it on the data frame expecting that missing columns will be filled with null but the schema application fails with weird errors.
Please suggest
This worked for me. Created a function to check all expected columns and add columns to dataframe if it is missing
def checkAvailableColumns(df: DataFrame, expectedColumnsInput: List[String]) : DataFrame = {
expectedColumnsInput.foldLeft(df) {
(df,column) => {
if(df.columns.contains(column) == false) {
df.withColumn(column,lit(null).cast(StringType))
}
else (df)
}
}
}
val expectedColumns = List("newcol1","newcol2","newcol3")
val finalDf = checkAvailableColumns(castedDateSessions,expectedColumns)
Here is an improved version of the answer #rads provided
#tailrec
def addMissingFields(fields: List[String])(df: DataFrame): DataFrame = {
def addMissingField(field: String)(df: DataFrame): DataFrame =
df.withColumn(field, lit(null).cast(StringType))
fields match {
case Nil =>
df
case c :: cs if c.contains(".") && !df.columns.contains(c.split('.')(0)) =>
val fields = c.split('.')
// it just supports one level of nested, but it can extend
val schema = StructType(Array(StructField(fields(1), StringType)))
addMissingFields(cs)(addMissingField(fields(0), schema)(df))
case ::(c, cs) if !df.columns.contains(c.split('.')(0)) =>
addMissingFields(cs)(addMissingField(c)(df))
case ::(_, cs) =>
addMissingFields(cs)(df)
}
}
Now you can use it as a transformation:
val df = ...
val expectedColumns = List("newcol1","newcol2","newcol3")
df.transform(addMissingFields(expectedColumns))
I haven't tested it in production yet to see if there is any performance issue. I doubt it. But if there was any, I'll update my post.
Here are the steps to add missing columns:
val spark = SparkSession
.builder()
.appName("Spark SQL json example")
.master("local[1]")
.getOrCreate()
import spark.implicits._
val df = spark.read.json
val schema = df.schema
val columns = df.columns // enough for flat tables
You can traverse the auto generated schema. If it is flat table just do
df.columns.
Compare the found columns to the expected columns and add the missing fields like this:
val dataframe2 = df.withColumn("MissingString1", lit(null).cast(StringType) )
.withColumn("MissingString2", lit(null).cast(StringType) )
.withColumn("MissingDouble1", lit(0.0).cast(DoubleType) )
Maybe there is a faster way to add the missing columns in one operation, instead of one by one, but the with withColumns() method which does that is private.
Here's a pyspark solution based on this answer which checks for a list of names (from a configDf - transformed into a list of columns it should have - parameterColumnsToKeepList) - this assumes all missing columns are ints but you could look this up in configdDf dynamically too. My default is null but you could also use 0.
from pyspark.sql.types import IntegerType
for column in parameterColumnsToKeepList:
if column not in processedAllParametersDf.columns:
print('Json missing column: {0}' .format(column))
processedAllParametersDf = processedAllParametersDf.withColumn(column, lit(None).cast(IntegerType()))

How to add source file name to each row in Spark?

I'm new to Spark and am trying to insert a column to each input row with the file name that it comes from.
I've seen others ask a similar question, but all their answers used wholeTextFile, but I'm trying to do this for larger CSV files (read using the Spark-CSV library), JSON files, and Parquet files (not just small text files).
I can use the spark-shell to get a list of the filenames:
val df = sqlContext.read.parquet("/blah/dir")
val names = df.select(inputFileName())
names.show
but that's a dataframe.
I am not sure how to add it as a column to each row (and if that result is ordered the same as the initial data either, though I assume it always is) and how to do this as a general solution for all input types.
Another solution I just found to add file name as one of the columns in DataFrame
val df = sqlContext.read.parquet("/blah/dir")
val dfWithCol = df.withColumn("filename",input_file_name())
Ref:
spark load data and add filename as dataframe column
When you create a RDD from a text file, you probably want to map the data into a case class, so you could add the input source in that stage:
case class Person(inputPath: String, name: String, age: Int)
val inputPath = "hdfs://localhost:9000/tmp/demo-input-data/persons.txt"
val rdd = sc.textFile(inputPath).map {
l =>
val tokens = l.split(",")
Person(inputPath, tokens(0), tokens(1).trim().toInt)
}
rdd.collect().foreach(println)
If you do not want to mix "business data" with meta data:
case class InputSourceMetaData(path: String, size: Long)
case class PersonWithMd(name: String, age: Int, metaData: InputSourceMetaData)
// Fake the size, for demo purposes only
val md = InputSourceMetaData(inputPath, size = -1L)
val rdd = sc.textFile(inputPath).map {
l =>
val tokens = l.split(",")
PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
rdd.collect().foreach(println)
and if you promote the RDD to a DataFrame:
import sqlContext.implicits._
val df = rdd.toDF()
df.registerTempTable("x")
you can query it like
sqlContext.sql("select name, metadata from x").show()
sqlContext.sql("select name, metadata.path from x").show()
sqlContext.sql("select name, metadata.path, metadata.size from x").show()
Update
You can read the files in HDFS using org.apache.hadoop.fs.FileSystem.listFiles() recursively.
Given a list of file names in a value files (standard Scala collection containing org.apache.hadoop.fs.LocatedFileStatus), you can create one RDD for each file:
val rdds = files.map { f =>
val md = InputSourceMetaData(f.getPath.toString, f.getLen)
sc.textFile(md.path).map {
l =>
val tokens = l.split(",")
PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
}
Now you can reduce the list of RDDs into a single one: The function for reduce concats all RDDs into a single one:
val rdd = rdds.reduce(_ ++ _)
rdd.collect().foreach(println)
This works, but I cannot test if this distributes/performs well with large files.