Union Dataframes based on condition in spark(scala) - scala

I have a folder that consists of 4 subfolders containing parquet files:
Folder -> A.parquet, B.parquet, C.parquet, D.parquet (subfolders). My requirement is to union the data frames based on the file names I provide to the method.
I am doing it with this code:
val df = listDirectoriesGetWantedFile(folderPath, sqlContext, "A", "B")
def listDirectoriesGetWantedFile(folderPath: String, sqlContext: SQLContext, str1: String, str2: String): DataFrame = {
  var df: DataFrame = null
  val sb = new StringBuilder
  sb.setLength(0)
  var done = false
  val path = new Path(folderPath)
  if (fileSystem.isDirectory(path)) {
    val files = fileSystem.listStatus(path)
    for (file <- files) {
      if (file.getPath.getName.contains(str1) && !done) {
        sb.append(file.getPath.toString())
        sb.append(",")
        done = true
      } else if (file.getPath.getName.contains(str2)) {
        sb.append(file.getPath.toString())
      }
    }
  }
  // ... here I still need to split sb and union the resulting data frames
}
But then I need to split sb and union the data frames, and I cannot find a solution for that. How can I approach and solve this?

If I understand your question, you could simply do something like this:
def listDirectoriesGetWantedFile(path: String,
                                 sqlContext: SQLContext,
                                 folder1: String,
                                 folder2: String): DataFrame = {
  val df1 = sqlContext.read.parquet(s"$path/$folder1")
  val df2 = sqlContext.read.parquet(s"$path/$folder2")
  df1.union(df2)
}
EDIT
By using the Hadoop FileSystem API, you can check whether the folders exist before reading them. So you may try something like this:
def listDirectoriesGetWantedFile(path: String, sqlContext: SQLContext, folders: Seq[String]): DataFrame = {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val existingFolders = folders
    .map(folder => new Path(s"$path/$folder"))
    .filter(fs.exists(_))
    .map(_.toString)
  if (existingFolders.isEmpty) {
    sqlContext.emptyDataFrame
  } else {
    sqlContext.read.parquet(existingFolders: _*)
  }
}
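For example (hypothetical folder names, assuming the same folderPath and sqlContext as in the question):
val df = listDirectoriesGetWantedFile(folderPath, sqlContext, Seq("A", "B"))
df.show(5)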

Related

Store Schema of Read File Into csv file in spark scala

I am reading a csv file with the inferSchema option enabled into a data frame, using the command below.
df2 = spark.read.options(Map("inferSchema"->"true","header"->"true")).csv("s3://Bucket-Name/Fun/Map/file.csv")
df2.printSchema()
Output:
root
|-- CC|Fun|Head|Country|SendType: string (nullable = true)
Now I would like to store only the column names and the datatype of each column from the above output into a csv file, like below.
column_name,datatype
CC,string
Fun,string
Head,string
Country,string
SendType,string
I tried writing this to a csv using the option below, but it writes the file with the entire data.
df2.coalesce(1).write.format("csv").mode("append").save("schema.csv")
Use df.schema.fields to get the fields and their datatypes. Check the code below.
scala> val schema = df.schema.fields.map(field => (field.name,field.dataType.typeName)).toList.toDF("column_name","datatype")
schema: org.apache.spark.sql.DataFrame = [column_name: string, datatype: string]
scala> schema.show(false)
+---------------+--------+
|column_name |datatype|
+---------------+--------+
|applicationName|string |
|id |string |
|requestId |string |
|version |long |
+---------------+--------+
scala> schema.write.format("csv").save("/tmp/schema")
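To get a single output file with a header row, a minimal variation (just a sketch) is to coalesce to one partition and enable the header option:
schema.coalesce(1).write.option("header", "true").csv("/tmp/schema")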
Alternatively, try something like the code below, which writes the header and schema rows manually with a FileWriter locally and also pushes a copy to S3 through the Hadoop FileSystem API:
import java.io.FileWriter

object SparkSchema {

  def main(args: Array[String]): Unit = {
    val fw = new FileWriter("src/main/resources/csv.schema", true)
    fw.write("column_name,datatype\n")
    val spark = Constant.getSparkSess // the answer's own helper that returns a SparkSession
    import spark.implicits._
    val df = List(("", "", "", 1L)).toDF("applicationName", "id", "requestId", "version")
    val columnList: List[(String, String)] =
      df.schema.fields.map(field => (field.name, field.dataType.typeName)).toList
    try {
      val outString = columnList.map(col => col._1 + "," + col._2).mkString("\n")
      fw.write(outString)
    } finally fw.close()
    val newColumnList: List[(String, String)] = List(("newColumn", "integer"))
    val finalColList = columnList ++ newColumnList
    writeToS3("s3://bucket/newFileName.csv", finalColList)
  }

  def writeToS3(s3FileNameWithpath: String, finalColList: List[(String, String)]): Unit = {
    val outString = finalColList.map(col => col._1 + "," + col._2).mkString("\n")
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs._
    val conf = new Configuration()
    conf.set("fs.s3a.access.key", "YOUR ACCESS KEY")
    conf.set("fs.s3a.secret.key", "YOUR SECRET KEY")
    val dest = new Path(s3FileNameWithpath)
    val fs = dest.getFileSystem(conf)
    val out = fs.create(dest, true)
    out.write(outString.getBytes)
    out.close()
  }
}
An alternative to #QuickSilver's and #Srinivas' solutions, which should both work, is to use the DDL representation of the schema. With df.schema.toDDL you get:
CC STRING, fun STRING, Head STRING, Country STRING, SendType STRING
which is the string representation of the schema; you can then split it and replace the separators as shown next:
import java.io.PrintWriter
val schema = df.schema.toDDL.split(",")
// Array[String] = Array(`CC` STRING, `fun` STRING, `Head` STRING, `Country` STRING, `SendType` STRING)
val writer = new PrintWriter("/tmp/schema.csv")
writer.write("column_name,datatype\n")
schema.foreach{ r => writer.write(r.replace(" ", ",") + "\n") }
writer.close()
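Depending on the Spark version, toDDL may wrap the column names in backticks (as in the Array shown above); in that case you may want to strip them and any leading spaces as well, for example by replacing the foreach line above with:
schema.foreach{ r => writer.write(r.trim.replace("`", "").replace(" ", ",") + "\n") }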
To write to S3 you can use the Hadoop API, as QuickSilver already implemented, or a third-party library such as MinIO:
import io.minio.MinioClient
val minioClient = new MinioClient("https://play.min.io", "ACCESS_KEY", "SECRET_KEY")
minioClient.putObject("YOUR_BUCKET","schema.csv", "/tmp/schema.csv", null)
Or, even better, generate the string, store it in a buffer, and then send it to S3 via an InputStream:
import java.io.ByteArrayInputStream
import io.minio.{MinioClient, PutObjectOptions}
val minioClient = new MinioClient("https://play.min.io", "ACCESS_KEY", "SECRET_KEY")
val schema = df.schema.toDDL.split(",")
val schemaBuffer = new StringBuilder
schemaBuffer ++= "column_name,datatype\n"
schema.foreach{ r => schemaBuffer ++= r.replace(" ", ",") + "\n" }
val inputStream = new ByteArrayInputStream(schemaBuffer.toString.getBytes("UTF-8"))
minioClient.putObject("YOUR_BUCKET", "schema.csv", inputStream, new PutObjectOptions(inputStream.available(), -1))
inputStream.close()
#PySpark
df_schema = spark.createDataFrame([(i.name, str(i.dataType)) for i in df.schema.fields], ['column_name', 'datatype'])
df_schema.show()
This will create a new DataFrame holding the schema of the existing DataFrame.
Use case:
Useful when you want to create a table with the schema of the DataFrame and you cannot use the code below, because the PySpark user may not be authorized to execute DDL commands on the database.
df.createOrReplaceTempView("tmp_output_table")
spark.sql("""drop table if exists schema.output_table""")
spark.sql("""create table schema.output_table as select * from tmp_output_table""")
In PySpark you can find all column names and data types (DataType) of a DataFrame by using df.dtypes. Follow this link for more details: pyspark.sql.DataFrame.dtypes.
Having said that, try the code below:
data = df.dtypes
cols = ["col_name", "datatype"]
df = spark.createDataFrame(data=data,schema=cols)
df.show()

How to perform UPSERT or MERGE operation in Apache Spark?

I am trying to update and insert records into an old DataFrame using the unique column "ID" with Apache Spark.
To update a DataFrame, you can perform a "left_anti" join on the unique columns and then union the result with the DataFrame that contains the new records:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, lit}

def refreshUnion(oldDS: Dataset[_], newDS: Dataset[_], usingColumns: Seq[String]): Dataset[_] = {
  val filteredNewDS = selectAndCastColumns(newDS, oldDS)
  oldDS.join(
      filteredNewDS,
      usingColumns,
      "left_anti")
    .select(oldDS.columns.map(columnName => col(columnName)): _*)
    .union(filteredNewDS.toDF)
}

def selectAndCastColumns(ds: Dataset[_], refDS: Dataset[_]): Dataset[_] = {
  val columns = ds.columns.toSet
  ds.select(refDS.columns.map(c => {
    if (!columns.contains(c)) {
      lit(null).cast(refDS.schema(c).dataType) as c
    } else {
      ds(c).cast(refDS.schema(c).dataType) as c
    }
  }): _*)
}
val df = refreshUnion(oldDS, newDS, Seq("ID"))
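To see the upsert semantics, here is a small sketch with hypothetical data (assuming a SparkSession named spark):
import spark.implicits._
val oldDS = Seq((1, "a"), (2, "b")).toDF("ID", "value")
val newDS = Seq((2, "b2"), (3, "c")).toDF("ID", "value")
refreshUnion(oldDS, newDS, Seq("ID")).show()
// keeps (1, a), replaces (2, b) with the new (2, b2), and inserts (3, c)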
Spark DataFrames are immutable structures, so you can't do an in-place update based on the ID.
The way to update a dataframe is to merge the older dataframe with the newer dataframe and save the merged dataframe on HDFS. To pick the latest record for an ID you need some de-duplication key (a timestamp, for example).
I am adding sample code for this in Scala. You need to call the merge function with the unique-id and timestamp column names. The timestamp should be a Long.
case class DedupableDF(unique_id: String, ts: Long)

def merge(snapshot: DataFrame)(delta: DataFrame)(uniqueId: String, timeStampStr: String): DataFrame = {
  val mergedDf = snapshot.union(delta)
  dedupeData(mergedDf)(uniqueId, timeStampStr)
}

def dedupeData(dataFrameToDedupe: DataFrame)(
    uniqueId: String,
    timeStampStr: String): DataFrame = {
  import sqlContext.implicits._

  def removeDuplicates(duplicatedDataFrame: DataFrame): Dataset[DedupableDF] = {
    val dedupableDF = duplicatedDataFrame.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
    val mappedPairRdd =
      dedupableDF.map(row => (row.unique_id, (row.unique_id, row.ts))).rdd
    val reduceByKeyRDD = mappedPairRdd
      .reduceByKey((row1, row2) => {
        if (row1._2 > row2._2) row1 else row2
      })
      .values
    reduceByKeyRDD.toDF.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
  }

  /** get distinct unique_id, timestamp combinations **/
  val filteredData =
    dataFrameToDedupe.select(uniqueId, timeStampStr).distinct
  val dedupedData = removeDuplicates(filteredData)

  dataFrameToDedupe.createOrReplaceTempView("duplicatedDataFrame")
  dedupedData.createOrReplaceTempView("dedupedDataFrame")

  val dedupedDataFrame =
    sqlContext.sql(s""" select distinct duplicatedDataFrame.*
                        from duplicatedDataFrame
                        join dedupedDataFrame on
                        (duplicatedDataFrame.${uniqueId} = dedupedDataFrame.unique_id
                        and duplicatedDataFrame.${timeStampStr} = dedupedDataFrame.ts)""")
  dedupedDataFrame
}
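If you are on Spark 2.x, the same "keep the latest record per ID" idea can be expressed more compactly with a window function. This is not from the answer above, just a sketch assuming the merged frame has columns named id and ts:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val w = Window.partitionBy("id").orderBy(col("ts").desc)
val deduped = snapshot.union(delta)
  .withColumn("rn", row_number().over(w))  // rank records per id by descending timestamp
  .filter(col("rn") === 1)                 // keep only the most recent record
  .drop("rn")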

How to deal with contexts in Spark/Scala when using map()

I'm not very familiar with Scala, nor with Spark, and I'm trying to develop a basic test to understand how DataFrames actually work. My objective is to update my myDF based on the values of some records of another table.
Well, on the one hand, I have my App:
object TestApp {
  def main(args: Array[String]) {
    val conf: SparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
    val sc = new SparkContext(conf)
    implicit val hiveContext: SQLContext = new HiveContext(sc)
    val test: Test = new Test()
    test.test
  }
}
On the other hand, I have my Test class:
class Test(implicit sqlContext: SQLContext) extends Serializable {
  val hiveContext: SQLContext = sqlContext
  import hiveContext.implicits._

  def test(): Unit = {
    val myDF = hiveContext.read.table("myDB.Customers").sort($"cod_a", $"start_date".desc)
    myDF.map(myMap).take(1)
  }

  def myMap(row: Row): Row = {
    def _myMap: (String, String) = {
      val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment")
      var target: (String, String) = casoX(investmentDF, row.getAs[String]("cod_a"), row.getAs[String]("cod_p"))
      target
    }
    def casoX(df: DataFrame, codA: String, codP: String)(implicit hiveContext: SQLContext): (String, String) = {
      var rows: Array[Row] = null
      if (codP != null) {
        println(df)
        rows = df.filter($"cod_a" === codA && $"cod_p" === codP).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      } else {
        rows = df.filter($"cod_a" === codA).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      }
      if (rows.length > 0) (row(0).asInstanceOf[String], row(1).asInstanceOf[String]) else null
    }
    val target: (String, String) = _myMap
    Row(row(0), row(1), row(2), row(3), row(4), row(5), row(6), target._1, target._2, row(9))
  }
}
Well, when I execute it, I get a NullPointerException on the instruction val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment"), and more precisely on hiveContext.read.
If I inspect hiveContext in the test function, I can access its SparkContext and load my DF without any problem.
Nevertheless, if I inspect my hiveContext object just before getting the NullPointerException, its sparkContext is null. I suppose this is because SparkContext is not serializable (and since I am inside a map function, I'm losing part of my hiveContext object, am I right?).
Anyway, I don't know what exactly is wrong with my code, and how should I change it to get my investmentDF without any NullPointerException?
Thanks!
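The usual way to avoid using the context inside map is to express the per-row lookup as a join built on the driver instead. A minimal sketch (not from the thread, assuming the same tables and that cod_a/cod_p are the join columns):
// Read the lookup table once on the driver and join,
// instead of calling hiveContext.read inside map (the context's sparkContext is null on executors).
val investmentDF = hiveContext.read.table("myDB.Investment")
val joined = myDF.join(investmentDF, Seq("cod_a", "cod_p"), "left_outer")
joined.take(1)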

why remove header from csv file doesn't work

import java.net.URL
import java.nio.charset.Charset
import org.apache.commons.io.IOUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object test {
  case class Caserne(x: String, y: String, Name: String, Description: String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BankDataAnalysis").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // load caserne data
    val caserneTxt = sc.parallelize(
      IOUtils.toString(
        new URL("http://donnees.ville.montreal.qc.ca/dataset/c69e78c6-e454-4bd9-9778-e4b0eaf8105b/resource/f6542ad1-31f5-458e-b33d-1a028fab3e98/download/casernessim.csv"),
        Charset.forName("utf8")).split("\n"))

    val header = caserneTxt.first()
    val caserne = caserneTxt.map(s => s.split(",")).filter(s => s != header).map(
      s => Caserne(s(0),
        s(1),
        s(2).replaceAll("[^\\d]", "").trim(),
        s(3).replaceAll("""<(?!\/?a(?=>|\s.*>))\/?.*?>""", " ").trim()
      )).toDF()

    caserne.registerTempTable("caserne")
    sqlContext.sql("Select * from caserne").show()
  }
}
I have to remove the csv file header. I used filter(s => s != header) but it didn't work. Thank you for your help.
Try using:
val rows = data.filter(s => header(s, "X") != "X")
Reference: How do I convert csv file to rdd
I found this convenient method
val header = caserneTxt.first()
val no_header = caserneTxt.filter(_(0) != header(0))
One way would be to use one of the header values and filter it out of the dataframe, something like below:
dataFrame.filter(row => row.getAs[String]("description") != "description").show
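For what it's worth, the filter in the question compares an Array[String] (the result of split) with the header String, so it can never match. A minimal sketch of filtering the header line out before splitting (same caserneTxt, header and Caserne as in the question):
val caserne = caserneTxt
  .filter(line => line != header) // drop the header while each element is still a full line
  .map(_.split(","))
  .map(s => Caserne(s(0), s(1),
    s(2).replaceAll("[^\\d]", "").trim(),
    s(3).replaceAll("""<(?!\/?a(?=>|\s.*>))\/?.*?>""", " ").trim()))
  .toDF()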

Generate keywords using Apache Spark and mllib

I wrote code like this:
val hashingTF = new HashingTF()
val tfv: RDD[Vector] = sparkContext.parallelize(articlesList.map { t => hashingTF.transform(t.words) })
tfv.cache()
val idf = new IDF().fit(tfv)
val rate: RDD[Vector] = idf.transform(tfv)
How to get top 5 keywords from the "rate" RDD for each articlesList item?
ADD:
articlesList contains objects:
case class ArticleInfo (val url: String, val author: String, val date: String, val keyWords: List[String], val words: List[String])
words contains all the words from the article.
I do not understand the structure of rate; the documentation says:
@return an RDD of TF-IDF vectors
My solution is:
(articlesList, rate.collect()).zipped.foreach { (art, tfidf) =>
  val keywords = new mutable.TreeSet[(String, Double)]
  art.words.foreach { word =>
    val wordHash = hashingTF.indexOf(word)
    val wordTFIDF = tfidf.apply(wordHash)
    if (keywords.size == KEYWORD_COUNT) {
      val minimum = keywords.minBy(_._2)
      if (minimum._2 < wordTFIDF) { // compare against the TF-IDF score, not the hash index
        keywords.remove(minimum)
        keywords.add((word, wordTFIDF))
      }
    } else {
      keywords.add((word, wordTFIDF))
    }
  }
  art.keyWords = keywords.toList.map(_._1) // requires keyWords to be declared as a var in ArticleInfo
}