Getting datatype of all columns in a dataframe using scala - scala

I have a data frame in which following data is loaded:
enter image description here
I am trying to develop a code which will read data from any source, load the data into data frame and return the following o/p:
enter image description here

You can use the schema property and then iterate over the fields.
Example:
Seq(("A", 1))
.toDF("Field1", "Field2")
.schema
.fields
.foreach(field => println(s"${field.name}, ${field.dataType}"))
Results:
Make sure to take a look at the Spark ScalaDoc.

Thats the closest I could get to the output.
Create schema from case class and and there after created DF from the list of schema columns and mapped them to a Dataframe
import java.sql.Date
object GetColumnDf {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
val map = Map("Emp_ID" -> "Dimension","Cust_Name" -> "Dimension","Cust_Age" -> "Measure",
"Salary" -> "Measure","DoJ" -> "Dimension")
import spark.implicits._
val lsit = Seq(Bean123("C-1001","Jack",25,3000,new Date(2000000))).toDF().schema.fields
.map( col => (col.name,col.dataType.toString,map.get(col.name))).toList.toDF("Headers","Data_Type","Type")
lsit.show()
}
}
case class Bean123(Emp_ID: String,Cust_Name: String,Cust_Age: Int, Salary : Int,DoJ: Date)

Related

Aggregating all Column values within a Map after groupBy in Apache Spark

I've been trying this all day long with a Dataframe but no luck so far. Already did it with a RDD but it isn't really readable, so this approach would be much better when it comes to code readability.
Take this initial and result DF, both the starting DF and what I would like to obtain after peforming .groupBy().
case class SampleRow(name:String, surname:String, age:Int, city:String)
case class ResultRow(name: String, surnamesAndAges: Map[String, (Int, String)])
val df = List(
SampleRow("Rick", "Fake", 17, "NY"),
SampleRow("Rick", "Jordan", 18, "NY"),
SampleRow("Sandy", "Sample", 19, "NY")
).toDF()
val resultDf = List(
ResultRow("Rick", Map("Fake" -> (17, "NY"), "Jordan" -> (18, "NY"))),
ResultRow("Sandy", Map("Sample" -> (19, "NY")))
).toDF()
What I've tried so far is performing the following .groupBy...
val resultDf = df
.groupBy(
Name
)
.agg(
functions.map(
selectColumn(Surname),
functions.array(
selectColumn(Age),
selectColumn(City)
)
)
)
However, the following is prompt into console.
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression '`surname`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
However, doing that would result in a single entry per surname and I would like to accumulate those in a single Map as you can see in resultDf. Is there an easy way to achieve this using DFs?
you can achieve it with a single UDF to convert your data to map:
val toMap = udf((keys: Seq[String], values1: Seq[String], values2: Seq[String]) => {
keys.zip(values1.zip(values2)).toMap
})
val myResultDF = df.groupBy("name").agg(collect_list("surname") as "surname", collect_list("age") as "age", collect_list("city") as "city").withColumn("surnamesAndAges", toMap($"surname", $"age", $"city")).drop("age", "city", "surname").show(false)
+-----+--------------------------------------+
|name |surnamesAndAges |
+-----+--------------------------------------+
|Sandy|[Sample -> [19, NY]] |
|Rick |[Fake -> [17, NY], Jordan -> [18, NY]]|
+-----+--------------------------------------+
If you are not concerned about typecasting the Dataframe to DataSet (In this case ResultRow you could do something like this
val grouped =df.withColumn("surnameAndAge",struct($"surname",$"age"))
.groupBy($"name")
.agg(collect_list("surnameAndAge").alias("surnamesAndAges"))
Then you could create a User defined function which would look like
import org.apache.spark.sql._
val arrayToMap = udf[Map[String, String], Seq[Row]] {
array => array.map {
case Row(key: String, value: String) => (key, value) }.toMap
}
Now you could use a .withColumn and call this udf
val finalData = grouped.withColumn("surnamesAndAges",arrayToMap($"surnamesAndAges"))
The Dataframe would look something like this
finalData: org.apache.spark.sql.DataFrame = [name: string, surnamesAndAges: map<string,string>]
Since Spark 2.4, you don't need to use a Spark user-defined function:
import org.apache.spark.sql.functions.{col, collect_set, map_from_entries, struct}
df.withColumn("mapEntry", struct(col("surname"), struct(col("age"), col("city"))))
.groupBy("name")
.agg(map_from_entries(collect_set("mapEntry")).as("surnameAndAges"))
Explanation
You first add a column containing a Map entry from desired columns. a Map entry is merely a struct containing two columns: first column is the key and the second column is the value. You can put another struct as the value. So here your Map entry will use column surname as key, and a struct of columns age and city as value:
struct(col("surname"), struct(col("age"), col("city")))
Then, you collect all the Map entries grouped by your groupBy key, which is column name using function collect_set, and you convert this list of Map entries to a Map using function map_from_entries

Order by key from a Map in a Dataset

I want to order by timestamp some avro files that I retrieve from HDFS.
The schema of my avro files is :
headers : Map[String,String], body : String
Now the tricky part is that the timestamp is one of the key/value from the map. So I have the timestamp contained in the map like this :
key_1 -> value_1, key_2 -> value_2, timestamp -> 1234567, key_n ->
value_n
Note that the type of the values is String.
I created a case class to create my dataset with this schema :
case class Root(headers : Map[String,String], body: String)
Creation of my dataset :
val ds = spark
.read
.format("com.databricks.spark.avro")
.load(pathToHDFS)
.as[Root]
I don't really know how to begin with this problem since I can only get the columns headers and body. How can I get the nested values to finally sort by timestamp ?
I would like to do something like this :
ds.select("headers").doSomethingToGetTheMapStructure.doSomeConversionStringToTimeStampForTheColumnTimeStamp("timestamp").orderBy("timestamp")
A little precision : I don't want to loose any data from my initial dataset, just a sorting operation.
I use Spark 2.3.0.
You can use Scala's sortBy, which takes a function. I would advise you to explicitly declare the val ds as a Vector (or other collection), that way you will see the applicable functions in IntelliJ (if you're using IntelliJ) and it will definitely compile.
See my example below based on your code :
case class Root(headers : Map[String,String], body: String)
val ds: Vector[Root] = spark
.read
.format("com.databricks.spark.avro")
.load(pathToHDFS)
.as[Root]
val sorted = ds.sortBy(r => r.headers.get("timestamp").map(PROCESSING) ).reverse
Edit: added reverse (assuming you want it descending). Inside the function that you pass as argument, you would also put the processing to timestamp.
The loaded Dataset should look something similar to the sample dataset below:
case class Root(headers : Map[String, String], body: String)
val ds = Seq(
Root(Map("k11"->"v11", "timestamp"->"1554231600", "k12"->"v12"), "body1"),
Root(Map("k21"->"v21", "timestamp"->"1554134400", "k22"->"v22"), "body2")
).toDS
You can simply look up the Map by the timestamp key, cast the value to Long, and perform an orderBy as follows:
ds.
withColumn("ts", $"headers"("timestamp").cast("Long")).
orderBy("ts").
show(false)
// +-------------------------------------------------+-----+----------+
// |headers |body |ts |
// +-------------------------------------------------+-----+----------+
// |[k21 -> v21, timestamp -> 1554134400, k22 -> v22]|body2|1554134400|
// |[k11 -> v11, timestamp -> 1554231600, k12 -> v12]|body1|1554231600|
// +-------------------------------------------------+-----+----------+
Note that $"headers"("timestamp") is just the same as using the apply column method (i.e. $"headers".apply("timestamp")).
Alternatively, you could also use getItem to access the Map by key, like:
$"headers".getItem("timestamp")
import org.apache.spark.sql.{Encoders, Encoder, Dataset}
import org.apache.spark.sql.functions.{col, desc}
import java.sql.Timestamp
case class Nested(key_1: String,key_2: String,timestamp: Timestamp,key_n: String)
case class Root(headers:Nested,body:String)
implicit val rootCodec: Encoder[Root] = Encoders.product[Root]
val avroDS:Dataset[Root] = spark.read
.format("com.databricks.spark.avro")
.load(pathToHDFS)
.as[Root]
val sortedDF: DataFrame = avroDS.orderBy(desc(col("timestamp")))
This code snippet would directly cast your Avro data to Dataset[Root]. You wont have to rely on importing sparksession.implicits and would eliminate the step of casting your timestamp field to TimestampType. Internally, Spark's Timestamp datatype is implemented using java.sql.Timestamp.

Spark SQL convert dataset to dataframe

How do I convert a dataset obj to a dataframe? In my example, I am converting a JSON file to dataframe and converting to DataSet. In dataset, I have added some additional attribute(newColumn) and convert it back to a dataframe. Here is my example code:
val empData = sparkSession.read.option("header", "true").option("inferSchema", "true").option("multiline", "true").json(filePath)
.....
import sparkSession.implicits._
val res = empData.as[Emp]
//for (i <- res.take(4)) println(i.name + " ->" + i.newColumn)
val s = res.toDF();
s.printSchema()
}
case class Emp(name: String, gender: String, company: String, address: String) {
val newColumn = if (gender == "male") "Not-allowed" else "Allowed"
}
But I am expected the new column name newColumn added in s.printschema(). output result. But it is not happening? Why? Any reason? How can I achieve this?
The schema of the output with Product Encoder is solely determined based on it's constructor signature. Therefore anything that happens in the body is simply discarded.
You can
empData.map(x => (x, x.newColumn)).toDF("value", "newColumn")

KMeansModel.clusterCenters returns NULL

I am using AWS glue to execute Kmeans clustering on my dataset. I wish to find not only the cluster labels but also the cluster centers. I am failing to find the later.
In the code below model.clusterCenters returns NULL. KMeans clustering works fine, and it returns the cluster label i.e. clusterInstance variable.
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
object Clustering {
case class ObjectDay(realnumber: Double, bnumber : Double, blockednumber: Double,
creationdate : String, fname : String, uniqueid : Long, registrationdate : String,
plusnumber : Double, cvalue : Double, hvalue : Double)
case class ClusterInfo( instance: Int, centers: String)
def main(args: Array[String]): Unit = {
val sc: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(sc)
val spark: SparkSession = glueContext.getSparkSession
import spark.implicits._
// write your code here - start
// Data Catalog: database and table name
val dbName = "dbname"
val tblName = "raw"
val sqlText = "SELECT <columns removed> FROM viewname WHERE `creation_date` ="
// S3 location for output
val outputDir = "s3://blucket/path/"
// Read data into a DynamicFrame using the Data Catalog metadata
val rawDyf: DynamicFrame = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()
// get only single day data with only numbers
// Spark SQL on a Spark dataframe
val numberDf = rawDyf.toDF()
numberDf.createOrReplaceTempView("viewname")
def getDataViaSql(runDate : LocalDate): RDD[ObjectDay] ={
val data = spark.sql(s"${sqlText} '${runDate.toString}'")
data.as[ObjectDay].rdd
}
def getDenseVector(rddnumbers: RDD[ObjectDay]): RDD[linalg.Vector]={
rddnumbers.map(s => Vectors.dense(Array(s.realnumber, s.bnumber, s.blockednumber))).cache()
}
def getClusters( numbers: RDD[linalg.Vector] ): RDD[ClusterInfo] = {
// Trains a k-means model
val model: KMeansModel = KMeans.train(numbers, 2, 20)
val centers: Array[linalg.Vector] = model.clusterCenters
//put together unique_ids with cluster predictions
val clusters: RDD[Int] = model.predict(numbers)
clusters.map{ clusterInstance =>
ClusterInfo(clusterInstance.toInt, centers(clusterInstance).toJson)
}
}
def combineDataAndClusterInstances(rddnumbers : RDD[ObjectDay], clusterCenters: RDD[ClusterInfo]): DataFrame ={
val numbersWithCluster = rddnumbers.zip(clusterCenters)
numbersWithCluster.map(
x =>
(x._1.realnumber, x._1.bnumber, x._1.blockednumber, x._1.creationdate, x._1.fname,
x._1.uniqueid, x._1.registrationdate, x._1.plusnumber, x._1.cvalue, x._1.hvalue,
x._2.instance, x._2.centers)
)
.toDF("realnumber", "bnumber", "blockednumber", "creationdate",
"fname","uniqueid", "registrationdate", "plusnumber", "cvalue", "hvalue",
"clusterInstance", "clusterCenter")
}
def process(runDate : LocalDate): DataFrame = {
val rddnumbers = getDataViaSql( runDate)
val dense = getDenseVector(rddnumbers)
val clusterCenters = getClusters(dense)
combineDataAndClusterInstances(rddnumbers, clusterCenters)
}
val startdt = LocalDate.parse("2018-01-01", DateTimeFormatter.ofPattern("yyyy-MM-dd"))
val dfByDates = (0 to 240)
.map(days => startdt.plusDays(days))
.map(process(_))
val result = dfByDates.tail.fold(dfByDates.head)((accDF, newDF) => accDF.union(newDF))
val output = DynamicFrame(result, glueContext).withName(name="prediction")
// write your code here - end
glueContext.getSinkWithFormat(connectionType = "s3",
options = JsonOptions(Map("path" -> outputDir)), format = "csv").writeDynamicFrame(output)
}
}
I can successfully find the cluster centres using Python sklearn library on the same data.
UPDATED: Showing the complete Scala code which runs as Glue job. Also I am not getting any error while running the job. I just dont get any cluster centres.
What am I missing ?
Nevermind. It is generating cluster centres.
I didnt see the S3 output files until now.
I was running Glue Crawler and looking at the results in AWS Athena.
The crawler created a struct or array column datatype for clustercenter column and Athena failed to parse and read the JSON stored as string in the CSV output.
Sorry to bother.

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read input file and compare with set "123,200,300" if match found, gives matching data
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use UDF in withCoulmn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For simple answer, why we are making it so complex? In this case we don't require UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.