Save updates into a dataframe and reuse the saved dataframe in Spark Scala

I get multiple incoming files and I have to compare each incoming file with the source file, then merge and replace the old rows with the new rows and append any extra rows present in the source file. Afterwards I have to use the updated source file to compare with another incoming file, update it, and so the process goes on.
So far I have created a dataframe for each file and compared and merged them using a join. I want to save all the updates done to the source file and use the updated source file again to compare and update incoming files.
val merge = df.union(dfSource.join(df, Seq("EmployeeID"), joinType = "left_anti").orderBy("EmployeeID"))
merge.write.mode("append").format("text").insertInto("dfSource")
merge.show()
I tried it this way but it doesn't update my dfSource dataframe. Could somebody please help?
Thanks

This is not possible that way. You need to use tables and then save to a file as the final part of the process.
I suggest you align your approach as follows, which allows parallel loading, though I suspect that is not really of benefit here.
1. Load all files in order of delivery, with each record tagged with a timestamp or some ordering sequence derived from the sequence number of the files, along with the type of record. E.g. file X at, say, position 2 in the sequence gets its records loaded with seqnum = 2. You can use the DF approach on the file being processed and append to an Impala / Hive KUDU table if performing it all within the Spark domain.
2. For records in the same file, apply monotonically_increasing_id() to get an ordering within the file if the same key can exist in the same file. See DataFrame-ified zipWithIndex, or zipWithIndex via RDD conversion and back to a DF.
3. Then issue a select statement to take, per key, the record with the maximum timestamp / seq_num. E.g. if the current run has, say, 3 recs for key=1, only one needs to be processed, presumably the one with the highest value.
4. Save as a new file.
5. Process this new file accordingly.
OR:
Bypass step 3, read in ascending order, and process the data accordingly.
Comment to make:
Typically I load such data with LOAD to Hive / Impala, with the partitioning key set by extracting a timestamp from the file name. That requires some Linux scripting / processing; it's a question of style and should not be a real Big Data bottleneck.
Here is a snippet with simulated input showing how some aspects can be done so as to allow a MAX select against a key for upserts. The operation column (DEL, ALT, whatever you need) is yours to add, although from what I have seen I think you can actually do this yourself:
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
def dfSchema(columnNames: List[String]): StructType =
  StructType(
    Seq(
      StructField(name = "key", dataType = StringType, nullable = false),
      StructField(name = "file", dataType = StringType, nullable = false),
      StructField(name = "ts", dataType = StringType, nullable = false),
      StructField(name = "val", dataType = StringType, nullable = false),
      StructField(name = "seq_val", dataType = LongType, nullable = false)
    )
  )

val newSchema = dfSchema(List("key", "file", "ts", "val", "seq_val"))

val df1 = Seq(
  ("A", "F1", "ts1", "1"),
  ("B", "F1", "ts1", "10"),
  ("A", "F1", "ts2", "2"),
  ("C", "F2", "ts3", "8"),
  ("A", "F2", "ts3", "3"),
  ("A", "F0", "ts0", "0")
).toDF("key", "file", "ts", "val")

val rddWithId = df1.sort($"key", $"ts".asc).rdd.zipWithIndex
val dfZippedWithId = spark.createDataFrame(
  rddWithId.map { case (row, index) => Row.fromSeq(row.toSeq ++ Array(index)) },
  newSchema
)
dfZippedWithId.show
returns:
+---+----+---+---+-------+
|key|file| ts|val|seq_val|
+---+----+---+---+-------+
| A| F0|ts0| 0| 0|
| A| F1|ts1| 1| 1|
| A| F1|ts2| 2| 2|
| A| F2|ts3| 3| 3|
| B| F1|ts1| 10| 4|
| C| F2|ts3| 8| 5|
+---+----+---+---+-------+
ready for subsequent processing.
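As a sketch of that subsequent processing (my own illustration, not part of the original answer, and assuming the dfZippedWithId from the snippet above), the latest record per key can be selected with a window over seq_val:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Keep only the record with the highest seq_val per key - the candidate row
// for the upsert. The window partitions by key and orders newest first.
val w = Window.partitionBy("key").orderBy($"seq_val".desc)
val latestPerKey = dfZippedWithId
  .withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")
latestPerKey.show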

Related

Selecting a column from a dataset's row

I'd like to loop over a Spark dataset and save specific values in a Map depending on the characteristics of each row. I'm new to Spark and Scala, so I've attached a simple example of what I'm trying to do in Python.
Minimal working example in Python:
mydict = dict()
for row in data:
    if row['name'] == "Yan":
        mydict[row['id']] = row['surname']
    else:
        mydict[row['id']] = "Random lad"
Where data is a (big) Spark dataset of type org.apache.spark.sql.Dataset[org.apache.spark.sql.Row].
Do you know the Spark or Scala way of doing it?
You cannot loop over the contents of a Dataset because they are not accessible on the machine running this code; instead they are scattered over (possibly many) different worker nodes. That is a fundamental concept of distributed execution engines like Spark.
Instead you have to manipulate your data in a functional way (where map, filter, reduce, ... operations are spread across the workers) or a declarative way (SQL queries that are performed on the workers).
To achieve your goal you could run a map over your data which checks whether the name equals "Yan" and go on from there. After this transformation you can collect your dataframe and transform it into a dict.
You should also reconsider your approach to using Spark and the map: it seems you want to create an entry in mydict for each element of data. This means your data is either small enough that you don't really need Spark, or the collect will probably fail because it does not fit in the driver's memory.
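As a rough sketch of that map-then-collect approach (my own illustration, not part of the original answer; it assumes data has String columns id, name and surname, and that a SparkSession named spark is in scope):

import spark.implicits._   // provides the tuple Encoder needed by Dataset.map
import org.apache.spark.sql.Row

// Map each row to an (id, value) pair on the workers, then collect the
// (hopefully small) result to the driver and turn it into a Scala Map.
val myMap: Map[String, String] = data
  .map { row =>
    val value =
      if (row.getAs[String]("name") == "Yan") row.getAs[String]("surname")
      else "Random lad"
    (row.getAs[String]("id"), value)
  }
  .collect()
  .toMap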
I think you are looking for something like this. If your final df is not big, you can collect it and store it as a map.
scala> df.show()
+---+----+--------+
| id|name|surrname|
+---+----+--------+
| 1| Yan| abc123|
| 2| Abc| def123|
+---+----+--------+
scala> df.select('id, when('name === "Yan", 'surrname).otherwise("Random lad")).toDF("K","V").show()
+---+----------+
| K| V|
+---+----------+
| 1| abc123|
| 2|Random lad|
+---+----------+
Here is a simple way to do it, but be careful with collect(), since it collects the data on the driver; the data must be able to fit in the driver's memory.
I don't recommend doing this.
var df: DataFrame = Seq(
  ("1", "Yan", "surname1"),
  ("2", "Yan1", "surname2"),
  ("3", "Yan", "surname3"),
  ("4", "Yan2", "surname4")
).toDF("id", "name", "surname")

val myDict = df.withColumn("newName", when($"name" === "Yan", $"surname").otherwise("RandomeName"))
  .rdd.map(row => (row.getAs[String]("id"), row.getAs[String]("newName")))
  .collectAsMap()
myDict.foreach(println)
Output:
(2,RandomeName)
(1,surname1)
(4,RandomeName)
(3,surname3)

I need to skip three rows from the dataframe while loading from a CSV file in Scala

I am loading my CSV file into a data frame and I can do that, but I need to skip the first three lines of the file.
I tried the .option() command by giving header as true, but it ignores only the first line.
val df = spark.sqlContext.read
  .schema(Myschema)
  .option("header", true)
  .option("delimiter", "|")
  .csv(path)
I thought of giving header as 3 lines but I couldn't find a way to do that.
Alternative thought: skip those 3 lines from the data frame.
Please help me with this. Thanks in advance.
A generic way to handle your problem would be to index the dataframe and filter the indices that are greater than 2.
Straightforward approach:
As suggested in another answer, you may try adding an index with monotonically_increasing_id.
df.withColumn("Index",monotonically_increasing_id)
.filter('Index > 2)
.drop("Index")
Yet, that is only going to work if the first 3 rows are in the first partition. Moreover, as mentioned in the comments, this is the case today but this code may break completely with future versions of Spark, and that would be very hard to debug. Indeed, the contract in the API is just "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive". It is therefore not very wise to assume that they will always start from zero. There might even be other cases in the current version in which that does not work (I'm not sure though).
To illustrate my first concern, have a look at this:
scala> spark.range(4).withColumn("Index",monotonically_increasing_id()).show()
+---+----------+
| id| Index|
+---+----------+
| 0| 0|
| 1| 1|
| 2|8589934592|
| 3|8589934593|
+---+----------+
We would only remove two rows...
Safe approach:
The previous approach will work most of the time, but to be safe you can use zipWithIndex from the RDD API to get consecutive indices.
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField}

def zipWithIndex(df: DataFrame, name: String): DataFrame = {
  val rdd = df.rdd.zipWithIndex
    .map { case (row, i) => Row.fromSeq(row.toSeq :+ i) }
  val newSchema = df.schema
    .add(StructField(name, LongType, false))
  df.sparkSession.createDataFrame(rdd, newSchema)
}
zipWithIndex(df, "index").where('index > 2).drop("index")
We can check that it's safer:
scala> zipWithIndex(spark.range(4).toDF("id"), "index").show()
+---+-----+
| id|index|
+---+-----+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
+---+-----+
You can try this option
df.withColumn("Index",monotonically_increasing_id())
.filter(col("Index") > 2)
.drop("Index")
You may try this, adapting it with respect to your schema.
import org.apache.spark.sql.Row
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Read the CSV as a plain text file
val file = sc.textFile("csvfilelocation")
//Remove the first 3 lines
val data = file.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(3) else iter }
//Create a RowRDD by splitting each line on your delimiter ("|" in the question) and mapping it to the required fields
val rowRdd = data.map(_.split("\\|")).map(x => Row(x(0), x(1)))
//Create the dataframe by calling sqlContext.createDataFrame with rowRdd and your schema
val df = sqlContext.createDataFrame(rowRdd, schema)

Flag faulty Row in spark dataframe with boolean values

I was trying a hands-on with Spark dataframes. I have previous knowledge of the Cascading framework, which has a trap mechanism to filter out faulty rows (rows with null values) into a separate tap called a Trap. For those who are unaware, let me make that clear: when a faulty row is read from a text file, the framework either scraps the bad row from the data or stops the execution. Now in Apache Spark, I observed that bad rows didn't hinder the execution. That is good, but when it comes to getting business insights from data, the quality of the data does matter!
So, I have a text file with a bunch of rows in it (you may pick any dataset you like), in which a few records contain null values. Now I load the text file into a Dataframe with spark.read.csv. What I want to do is analyze the Dataframe and dynamically create a column named "isMyRowBad": the logic analyzes one row at a time, and if it finds that the row has a null value, it flags the isMyRowBad column for that row as true; for rows that have no null values, isMyRowBad should be false, since that row is clean.
Giving you the overview of the incoming and outgoing datasets
INCOMING DATAFRAME
fname,lname,age
will,smith,40
Dwayne,Nunn,36
Aniruddha,Sinha,
Maria,,22
OUTGOING DATAFRAME
fname,lname,age,isMyRowBad
will,smith,40,false
Dwayne,Nunn,36,false
Aniruddha,Sinha,,true
Maria,,22,true
The above method of classifying good and bad rows might seem a little foolish, but it does make sense, since I will not need to run the filter operation multiple times. Let us take a look at how.
Suppose I have a Dataframe named inputDf as the input and analyzedDf: (DataFrame, DataFrame) as the output tuple.
Now, I did try this piece of code:
val analyzedDf: (DataFrame, DataFrame) = (inputDf.filter(_.anyNull),inputDf.filter(!_.anyNull))
This code segregates good and bad rows. I agree! But this has a performance setback, as filter runs twice, which means filter will iterate over the whole dataset twice! (You may counter this point if you feel running filter twice makes sense when considering 50 fields and at least 584000 rows, that is 250 MB of data!)
and this as well
val analyzedDf: DataFrame = inputDf.select("*").withColumn("isMyRowBad", <this point, I am not able to analyze row>
The above snippet shows where I am not able to figure out how to sweep the entire row and mark the row as bad with a boolean value.
I hope you all understand what I am aiming to achieve. Please ignore any syntactical errors you find in the snippets, since I typed them here right away (I will correct them in future edits).
Please give me a hint (a little code snippet or pseudo code will be enough) on how to proceed with this challenge. Please reach out to me if you didn't understand what I intend to do.
Any help will be greatly appreciated. Thanks in advance!
P.S.: There are brilliant people out here in BigData/Spark/Hadoop/Scala etc. I request you to kindly correct me on any point which I might have written wrongly (conceptually).
The below code gives me a solution, by the way. Please have a look:
package aniruddha.data.quality

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._

/**
  * Created by aniruddha on 8/4/17.
  */
object DataQualityCheck extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val schema: StructType = StructType(List(
    StructField("fname", StringType, nullable = true),
    StructField("lname", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true),
    StructField("pan", StringType, nullable = true),
    StructField("married", StringType, nullable = true)
  ))

  val inputDataFrame: DataFrame = spark
    .read
    .schema(schema)
    .option("header", true)
    .option("delimiter", ",")
    .csv("inputData/infile")
  //inputDataFrame.show()

  val analysedDataFrame: DataFrame = inputDataFrame.select("*")
    .withColumn("isRowBad", when($"pan".isNull || $"lname".isNull || $"married".isNull, true).otherwise(false))

  analysedDataFrame.show
}
input
fname,lname,age,pan,married
aniruddha,sinha,23,0AA22,no
balajee,venkatesh,23,0b96,no
warren,shannon,72,,
wes,borland,63,0b22,yes
Rohan,,32,0a96,no
james,bond,66,007,no
output
+---------+---------+---+-----+-------+--------+
| fname| lname|age| pan|married|isRowBad|
+---------+---------+---+-----+-------+--------+
|aniruddha| sinha| 23|0AA22| no| false|
| balajee|venkatesh| 23| 0b96| no| false|
| warren| shannon| 72| null| null| true|
| wes| borland| 63| 0b22| yes| false|
| Rohan| null| 32| 0a96| no| true|
| james| bond| 66| 007| no| false|
+---------+---------+---+-----+-------+--------+
The code works fine, but I have a problem with the when function: can't we just check all the columns without hardcoding them?
As far as I know, you can't do this with the inbuilt csv parser. You can get the parser to stop if it hits an error (failFast mode), but not annotate.
However, you could do this with a custom csv parser, that can process the data in a single pass. Unless we want to do some clever type introspection, it is easiest if we create a helper class to annotate the structure of the file:
case class CSVColumnDef(colPos: Int, colName: String, colType: String)
val columns = List(CSVColumnDef(0,"fname","String"),CSVColumnDef(1,"lname","String"),CSVColumnDef(2,"age", "Int"))
Next, we need some functions to a) split the input, b) extract data from split data, c) check if row is bad:
import scala.util.Try
def splitToSeq(delimiter: String) = udf[Seq[String],String](_.split(delimiter))
def extractColumnStr(i: Int) = udf[Option[String],Seq[String]](s => Try(Some(s(i))).getOrElse(None))
def extractColumnInt(i: Int) = udf[Option[Int],Seq[String]](s => Try(Some(s(i).toInt)).getOrElse(None))
def isRowBad(delimiter: String) = udf[Boolean,String](s => {
(s.split(delimiter).length != columns.length) || (s.split(delimiter).exists(_.length==0))
})
To use these, we first need to read in the text file (since I don't have it, and to allow people to replicate this answer, I will create an rdd):
val input = sc.parallelize(List(("will,smith,40"),("Dwayne,Nunn,36"),("Aniruddha,Sinha,"),("Maria,,22")))
input.take(5).foreach(println)
Given this input, we can create a dataframe with a single column, the raw line, and add our split column to it:
val delimiter = ","
val raw = "raw"
val delimited = "delimited"
val compDF = input.toDF(raw).withColumn(delimited, splitToSeq(delimiter)(col(raw)))
Finally, we can extract all the columns we previously defined, and check if the rows are bad:
val df = columns.foldLeft(compDF) { case (acc, column) =>
    column.colType match {
      case "Int" => acc.withColumn(column.colName, extractColumnInt(column.colPos)(col(delimited)))
      case _     => acc.withColumn(column.colName, extractColumnStr(column.colPos)(col(delimited)))
    }
  }
  .withColumn("isMyRowBad", isRowBad(delimiter)(col(raw)))
  .drop(raw).drop(delimited)
df.show
df.printSchema
The nice thing about this solution is that the Spark execution planner is smart enough to build all of those .withColumn operations into a single pass (map) over the data, with zero shuffling. The annoying thing is that it is a lot more dev work than using a nice shiny csv library, and we need to define the columns somehow. If you wanted to be a bit more clever, you could get the column names from the first line of the file (hint: .mapPartitionsWithIndex), and just parse everything as a string, as sketched below. We also can't define a case class to describe the entire DF, since you have too many columns to approach the solution that way. Hope this helps...
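As a rough sketch of that hint (my own illustration, not part of the answer above; it assumes the raw file actually starts with a header line, which the simulated input here does not):

// Grab the header line from the first partition and use it as the column names,
// parsing every field as a String; drop the header line from the data itself.
val headerLine = input
  .mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.take(1) else Iterator.empty }
  .first()
val inferredColumns = headerLine.split(delimiter).zipWithIndex
  .map { case (name, pos) => CSVColumnDef(pos, name, "String") }
  .toList
val body = input.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }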
This can be done using a udf. Although the answer given by Ben Horsburgh is definitely brilliant, we can do this without getting much into the internal architecture behind Dataframes. The following code can give you an idea:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

/**
  * Created by vaijnath on 10/4/17.
  */
object DataQualityCheck extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val schema: StructType = StructType(List(
    StructField("fname", StringType, nullable = true),
    StructField("lname", StringType, nullable = true),
    StructField("married", StringType, nullable = true)
  ))

  val inputDataFrame: DataFrame = spark
    .read
    .schema(schema)
    .option("header", false)
    .option("delimiter", ",")
    .csv("hydrograph.engine.spark/testData/inputFiles/delimitedInputFile.txt")
  //inputDataFrame.show()

  def isBad(row: Row): Boolean = {
    row.anyNull
  }

  val simplefun = udf(isBad(_: Row))
  val cols = struct(inputDataFrame.schema.fieldNames.map(e => col(e)): _*)
  // println(cols + "******************") //for debugging

  val analysedDataFrame: DataFrame = inputDataFrame.withColumn("isRowBad", simplefun(cols))
  analysedDataFrame.show
}
Please get back to me if you face any issues. I believe this solution is appropriate, since you seem to be looking for code that uses dataframes.
Thanks.
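A further sketch (my own note, not from either answer above): the "no hardcoding" concern can also be addressed without a UDF by folding the null checks over the column list of the dataframe:

// Build the "is any column null" predicate directly from the column list,
// so no column names are hardcoded in when(...).
val anyNullCond = inputDataFrame.columns
  .map(c => col(c).isNull)
  .reduce(_ || _)
val flagged = inputDataFrame.withColumn("isRowBad", anyNullCond)
flagged.show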

Adding StringType column to existing Spark DataFrame and then applying default values

Scala 2.10 here, using Spark 1.6.2. I have a similar (but not the same) question as this one; however, the accepted answer is not an SSCCE and assumes a certain amount of "upfront knowledge" about Spark, and therefore I can't reproduce it or make sense of it. More importantly, that question is also just limited to adding a new column to an existing dataframe, whereas I need to add a column as well as a value for all existing rows in the dataframe.
So I want to add a column to an existing Spark DataFrame, and then apply an initial ('default') value for that new column to all rows.
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
jsonDF.show()
When I run that, I get the following output (via .show()):
+----+--------+
| x| y|
+----+--------+
|true|not true|
+----+--------+
Now I want to add a new field to jsonDF, after it's created and without modifying the json string, such that the resultant DF would look like this:
+----+--------+----+
| x| y| z|
+----+--------+----+
|true|not true| red|
+----+--------+----+
Meaning, I want to add a new "z" column to the DF, of type StringType, and then default all rows to contain a z-value of "red".
From that other question I have pieced the following pseudo-code together:
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
//jsonDF.show()
val newDF = jsonDF.withColumn("z", jsonDF("col") + 1)
newDF.show()
But when I run this, I get an error on that .withColumn(...) method:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "col" among (x, y);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)
I also don't see any API methods that would allow me to set "red" as the default value. Any ideas as to where I'm going awry?
You can use the lit function. First you have to import it
import org.apache.spark.sql.functions.lit
and use it as shown below
jsonDF.withColumn("z", lit("red"))
Type of the column will be inferred automatically.
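A quick usage sketch (my own addition), producing the dataframe described in the question; the cast is optional, in case you want to force the column type explicitly:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Add the "z" column with a default value of "red" for every existing row.
val newDF = jsonDF.withColumn("z", lit("red").cast(StringType))
newDF.show()
// +----+--------+---+
// |   x|       y|  z|
// +----+--------+---+
// |true|not true|red|
// +----+--------+---+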

Reference column by id in Spark Dataframe

I have multiple duplicate columns (due to joins). If I try to call them by alias, I get an ambiguous reference error:
Reference 'customers_id' is ambiguous, could be: customers_id#13, customers_id#85, customers_id#130
Is there a way to reference a column in a Scala Spark Dataframe by its order in the Dataframe or by numeric ID, not by an alias? The sanitized names suggest that columns do have an id assigned (13, 85, 130 in the example above).
LATER EDIT:
I found out that I can reference a specific column by the original dataframe it was in. But, while I can use OriginalDataframe.customer_id in the select function, the withColumnRenamed function only accepts a string alias, so I cannot rename the duplicate column in the final dataframe.
So, I guess the end question is:
Is there a way to reference a column that has a duplicate alias, that works with all functions that require a string alias as argument?
LATER EDIT 2:
Renaming seemed to have worked via adding a new column and dropping one of the current ones:
joined_dataframe = joined_dataframe.withColumn("renamed_customers_id", original_dataframe("customers_id")).drop(original_dataframe("customers_id"))
But, I'd like to keep my question open:
Is there a way to reference a column that has a duplicate alias (so, using something other than alias) in a way that all functions which expect a string alias accept it?
One way to get out of such a situation would be to create a new Dataframe using the old one's rdd, but with a new schema, where you can name each column as you'd like. This, of course, requires you to explicitly describe the entire schema, including the type of each column. As long as the new schema you provide matches the number of columns and the column types of the old Dataframe, this should work.
For example, starting with a Dataframe with two columns named type, we can rename them type1 and type2:
df.show()
// +---+----+----+
// | id|type|type|
// +---+----+----+
// | 1| AAA| aaa|
// | 1| BBB| bbb|
// +---+----+----+
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val newDF = sqlContext.createDataFrame(df.rdd, new StructType()
  .add("id", IntegerType)
  .add("type1", StringType)
  .add("type2", StringType)
)
newDF.show()
// +---+-----+-----+
// | id|type1|type2|
// +---+-----+-----+
// | 1| AAA| aaa|
// | 1| BBB| bbb|
// +---+-----+-----+
The main problem is the join; I use Python.
h1.createOrReplaceTempView("h1")
h2.createOrReplaceTempView("h2")
h3.createOrReplaceTempView("h3")
joined1 = h1.join(h2, (h1.A == h2.A) & (h1.B == h2.B) & (h1.C == h2.C), 'inner')
Result dataframe columns:
A B Column1 Column2 A B Column3 ...
I don't like this, but the join must be implemented like this:
joined1 = h1.join(h2, [*argv], 'inner')
We assume argv = ["A", "B", "C"]
Result columns:
A B column1 column2 column3 ...
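For reference (my own note, not from the original thread), the Scala equivalent is to join on a Seq of column names, assuming h1 and h2 are the corresponding Scala DataFrames. This keeps only one copy of the join columns and so avoids the ambiguous-reference problem for later lookups:

// Joining on a Seq of column names yields a single A, B, C in the result,
// so references like col("A") are no longer ambiguous afterwards.
val joined1 = h1.join(h2, Seq("A", "B", "C"), "inner")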