How to export multiple maps to csv in Scala? - scala

How can I export all these maps to a single csv file using scala? mapKey contains all keys from all the maps, but not all of the (language) maps may have a value for a certain key.
The head row should contain "key", "default", "de", "fr", "it", "en"
val mapDefault: Map[String, String] = messages.getOrElse("default",Map())
val mapDe: Map[String, String] = messages.getOrElse("de", Map())
val mapFr: Map[String, String] = messages.getOrElse("fr", Map())
val mapEn: Map[String, String] = messages.getOrElse("en", Map())
val mapIt: Map[String, String] = messages.getOrElse("it", Map())
var mapKey: Set[String] = mapDefault.keySet ++ mapDe.keySet ++
mapFr.keySet ++ mapEn.keySet ++ mapIt.keySet

As mentioned in comment - it's best to use some library that constructs the CSV for you, especially to deal with special characters (comma or newline) in input (which would break your CSV if you're not careful).
Either way - you first have to transform your data into a sequence of records with a constant order and number of columns.
Below is an implementation that does not use any library, just to show the gist of how it's done in Scala - feel free to replace the actual CSV creation with a proper library:
// create headers row:
val headers = Seq("key", "default", "de", "fr", "it", "en")
// data rows: for each key in mapKey, create a Seq[String] with the values for this
// key (in correct order - matching the headers), or with empty String for
// missing values
val records: Seq[Seq[String]] = mapKey.toSeq.map(key => Seq(
key,
mapDefault.getOrElse(key, ""),
mapDe.getOrElse(key, ""),
mapFr.getOrElse(key, ""),
mapIt.getOrElse(key, ""),
mapEn.getOrElse(key, "")
))
// add the headers as the first record
val allRows: Seq[Seq[String]] = headers +: records
// create CSV (naive implementation - assumes no special chars in input!)
val csv: String = allRows.map(_.mkString(",")).mkString("\n")
// write to file:
new PrintWriter("filename.csv") { write(csv); close() }

Related

Load constraints from csv-file (amazon deequ)

I'm checking out Deequ which seems like a really nice library. I was wondering if it is possible to load constraints from a csv file or an orc-table in HDFS?
Lets say I have a table with theese types
case class Item(
id: Long,
productName: String,
description: String,
priority: String,
numViews: Long
)
and I want to put constraints like:
val checks = Check(CheckLevel.Error, "unit testing my data")
.isComplete("id") // should never be NULL
.isUnique("id") // should not contain duplicates
But I want to load the ".isComplete("id")", ".isUnique("id")" from a csv file so the business can add the constraints and we can run te tests based on their input
val verificationResult = VerificationSuite()
.onData(data)
.addChecks(Seq(checks))
.run()
I've managed to get the constraints from suggestionResult.constraintSuggestion
val allConstraints = suggestionResult.constraintSuggestions
.flatMap { case (_, suggestions) => suggestions.map { _.constraint }}
.toSeq
which gives a List like for example:
allConstraints = List(CompletenessConstraint(Completeness(id,None)), ComplianceConstraint(Compliance('id' has no negative values,id >= 0,None))
But it gets generated from suggestionResult.constraintSuggestions. But I want to be able to create a List like that based on the inputs from a csv file, can anyone help me?
To sum things up:
Basically I just want to add:
val checks = Check(CheckLevel.Error, "unit testing my data")
.isComplete("columnName1")
.isUnique("columnName1")
.isComplete("columnName2")
dynamically based on a file where the file has for example:
columnName;isUnique;isComplete (header)
columnName1;true;true
columnName2;false;true
I chose to store the CSV in src/main/resources as it's very easy to read from there, and easy to maintain in parallel with the code being QA'ed.
def readCSV(spark: SparkSession, filename: String): DataFrame = {
import spark.implicits._
val inputFileStream = Try {
this.getClass.getResourceAsStream("/" + filename)
}
.getOrElse(
throw new Exception("Cannot find" + filename + "in src/main/resources")
)
val readlines =
scala.io.Source.fromInputStream(inputFileStream).getLines.toList
val csvData: Dataset[String] =
spark.sparkContext.parallelize(readlines).toDS
spark.read.option("header", true).option("inferSchema", true).csv(csvData)
}
This loads it as a DataFrame; this can easily be passed to code like gavincruick's example on GitHub, copied here for convenience:
//code to build verifier from DF that has a 'Constraint' column
type Verifier = DataFrame => VerificationResult
def generateVerifier(df: DataFrame, columnName: String): Try[Verifier] = {
val constraintCheckCodes: Seq[String] = df.select(columnName).collect().map(_(0).toString).toSeq
def checkSrcCode(checkCodeMethod: String, id: Int): String = s"""com.amazon.deequ.checks.Check(com.amazon.deequ.checks.CheckLevel.Error, "$id")$checkCodeMethod"""
val verifierSrcCode = s"""{
|import com.amazon.deequ.constraints.ConstrainableDataTypes
|import com.amazon.deequ.{VerificationResult, VerificationSuite}
|import org.apache.spark.sql.DataFrame
|
|val checks = Seq(
| ${constraintCheckCodes.zipWithIndex
.map { (checkSrcCode _).tupled }
.mkString(",\n ")}
|)
|
|(data: DataFrame) => VerificationSuite().onData(data).addChecks(checks).run()
|}
""".stripMargin.trim
println(s"Verification function source code:\n$verifierSrcCode\n")
compile[Verifier](verifierSrcCode)
}
/** Compiles the scala source code that, when evaluated, produces a value of type T. */
def compile[T](source: String): Try[T] =
Try {
val toolbox = currentMirror.mkToolBox()
val tree = toolbox.parse(source)
val compiledCode = toolbox.compile(tree)
compiledCode().asInstanceOf[T]
}
//example usage...
//sample test data
val testDataDF = Seq(
("2020-02-12", "England", "E10000034", "Worcestershire", 1),
("2020-02-12", "Wales", "W11000024", "Powys", 0),
("2020-02-12", "Wales", null, "Unknown", 1),
("2020-02-12", "Canada", "MADEUP", "Ontario", 1)
).toDF("Date", "Country", "AreaCode", "Area", "TotalCases")
//constraints in a DF
val constraintsDF = Seq(
(".isComplete(\"Area\")"),
(".isComplete(\"Country\")"),
(".isComplete(\"TotalCases\")"),
(".isComplete(\"Date\")"),
(".hasCompleteness(\"AreaCode\", _ >= 0.80, Some(\"It should be above 0.80!\"))"),
(".isContainedIn(\"Country\", Array(\"England\", \"Scotland\", \"Wales\", \"Northern Ireland\"))")
).toDF("Constraint")
//Build Verifier from constraints DF
val verifier = generateVerifier(constraintsDF, "Constraint").get
//Run verifier against a sample DF
val result = verifier(testDataDF)
//display results
VerificationResult.checkResultsAsDataFrame(spark, result).show()
It depends on how complicated you want to allow the constraints to be. In general, deequ allows you to use arbitrary scala code for the validation function of a constraint, so its difficult (and dangerous from a security perspective) to load that from a file.
I think you would have to come up with your own schema and semantics for the CSV file, at least it is not directly supported in deequ.

I want to save Map[String, String] to disk and later read it back as same type. Somehow I cannot find collectAsMap method with my sparkContext

I am working on Spark Scala and there is a requirement to save Map[String, String] to the disk so that a different Spark application can read it.
(x,1),(y,2)...
To Save:
sc.parallelize(itemMap.toSeq).coalesce(1).saveAsTextFile(fileName)
I am doing a coalesce as the data is only 450 rows.
But to read it back, I am not able to convert it back to Map[String, String]
val myMap = sc.textFile(fileName).zipWithUniqueId().collect.toMap
the data comes as
((x,1),0),((y,2),1)...
What is the possible solution?
Thanks.
Loading a text file results in RDD[String], so you will have to deserialize your string representations of the tuples.
You can change your Save operation to add a delimiter between tuple value 1 and tuple value 2, or parse the string (:v1, :v2).
val d = spark.sparkContext.textFile(fileName)
val myMap = d.map(s => {
val parsedVals = s.substring(1, s.length-1).split(",")
(parsedVals(0), parsedVals(1))
}).collect.toMap
Alternatively, you can change your save operation to create a delimiter (like a comma) and parse the structure that way:
itemMap.toSeq.map(kv => kv._1 + "," + kv._2).saveAsTextFile(fileName)
val myMap = spark.sparkContext.textFile("trash3.txt")
.map(_.split(","))
.map(d => (d(0), d(1)))
.collect.toMap
Method "collectAsMap" exists in "PairRDDFunctions" class, means, applicable only for RDD with two values RDD[(K, V)].
If this function call is required, can be organized with code below. Dataframe is used for store in csv format ant avoid hand-made parsing
val originalMap = Map("x" -> 1, "y" -> 2)
// write
sparkContext.parallelize(originalMap.toSeq).coalesce(1).toDF("k", "v").write.csv(path)
// read
val restoredDF = spark.read.csv(path)
val restoredMap = restoredDF.rdd.map(r => (r.getString(0), r.getString(1))).collectAsMap()
println("restored map: " + restoredMap)
Output:
restored map: Map(y -> 2, x -> 1)

What is happening here in Spark Code

val jsonString = sc.sequenceFile[Long,String](paths).map(x => {
x._2
})
The paths variable has csv of locations for some text files.
What exactly is the code mapping here?
can some explain me what is happening here. i am bit confused.
sc.sequenceFile reads Hadoop Sequence Files which are flat files storing key-value pairs (see https://wiki.apache.org/hadoop/SequenceFile). It produces a RDD[(Long, String)] - an RDD where each record is a key-value pair (of type Tuple2[Long, String]). Then, the mapping simply maps each pair to the value only (using Tuple2._2).
The resulting jsonString is an RDD[String].
The map operation can be replaced with a few equivalent calls that may or may not be clearer:
// Using the implicit PairRDDFunctions.values:
val jsonString = sc.sequenceFile[Long,String](paths).values
// Using map with anonymous argument:
val rdd = sc.parallelize(Seq((1, 2))).map(_._2)
// Using map with pattern matching:
val rdd = sc.parallelize(Seq((1, 2))).map {
case (key, value) => value
}

Streaming CSV Source with AKKA-HTTP

I am trying to stream data from Mongodb using reactivemongo-akkastream 0.12.1 and return the result into a CSV stream in one of the routes (using Akka-http).
I did implement that following the exemple here:
http://doc.akka.io/docs/akka-http/10.0.0/scala/http/routing-dsl/source-streaming-support.html#simple-csv-streaming-example
and it looks working fine.
The only problem I am facing now is how to add the headers to the output CSV file. Any ideas?
Thanks
Aside from the fact that that example isn't really a robust method of generating CSV (doesn't provide proper escaping) you'll need to rework it a bit to add headers. Here's what I would do:
make a Flow to convert a Source[Tweet] to a source of CSV rows, e.g. a Source[List[String]]
concatenate it to a source containing your headers as a single List[String]
adapt the marshaller to render a source of rows rather than tweets
Here's some example code:
case class Tweet(uid: String, txt: String)
def getTweets: Source[Tweet, NotUsed] = ???
val tweetToRow: Flow[Tweet, List[String], NotUsed] =
Flow[Tweet].map { t =>
List(
t.uid,
t.txt.replaceAll(",", "."))
}
// provide a marshaller from a row (List[String]) to a ByteString
implicit val tweetAsCsv = Marshaller.strict[List[String], ByteString] { row =>
Marshalling.WithFixedContentType(ContentTypes.`text/csv(UTF-8)`, () =>
ByteString(row.mkString(","))
)
}
// enable csv streaming
implicit val csvStreaming = EntityStreamingSupport.csv()
val route = path("tweets") {
val headers = Source.single(List("uid", "text"))
val tweets: Source[List[String], NotUsed] = getTweets.via(tweetToRow)
complete(headers.concat(tweets))
}
Update: if your getTweets method returns a Future you can just map over its source value and prepend the headers that way, e.g:
val route = path("tweets") {
val headers = Source.single(List("uid", "text"))
val rows: Future[Source[List[String], NotUsed]] = getTweets
.map(tweets => headers.concat(tweets.via(tweetToRow)))
complete(rows)
}

How do I detect if a Spark DataFrame has a column

When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column exists before calling .select
Example JSON schema:
{
"a": {
"b": 1,
"c": 2
}
}
This is what I want to do:
potential_columns = Seq("b", "c", "d")
df = sqlContext.read.json(filename)
potential_columns.map(column => if(df.hasColumn(column)) df.select(s"a.$column"))
but I can't find a good function for hasColumn. The closest I've gotten is to test if the column is in this somewhat awkward array:
scala> df.select("a.*").columns
res17: Array[String] = Array(b, c)
Just assume it exists and let it fail with Try. Plain and simple and supports an arbitrary nesting:
import scala.util.Try
import org.apache.spark.sql.DataFrame
def hasColumn(df: DataFrame, path: String) = Try(df(path)).isSuccess
val df = sqlContext.read.json(sc.parallelize(
"""{"foo": [{"bar": {"foobar": 3}}]}""" :: Nil))
hasColumn(df, "foobar")
// Boolean = false
hasColumn(df, "foo")
// Boolean = true
hasColumn(df, "foo.bar")
// Boolean = true
hasColumn(df, "foo.bar.foobar")
// Boolean = true
hasColumn(df, "foo.bar.foobaz")
// Boolean = false
Or even simpler:
val columns = Seq(
"foobar", "foo", "foo.bar", "foo.bar.foobar", "foo.bar.foobaz")
columns.flatMap(c => Try(df(c)).toOption)
// Seq[org.apache.spark.sql.Column] = List(
// foo, foo.bar AS bar#12, foo.bar.foobar AS foobar#13)
Python equivalent:
from pyspark.sql.utils import AnalysisException
from pyspark.sql import Row
def has_column(df, col):
try:
df[col]
return True
except AnalysisException:
return False
df = sc.parallelize([Row(foo=[Row(bar=Row(foobar=3))])]).toDF()
has_column(df, "foobar")
## False
has_column(df, "foo")
## True
has_column(df, "foo.bar")
## True
has_column(df, "foo.bar.foobar")
## True
has_column(df, "foo.bar.foobaz")
## False
Another option which I normally use is
df.columns.contains("column-name-to-check")
This returns a boolean
Actually you don't even need to call select in order to use columns, you can just call it on the dataframe itself
// define test data
case class Test(a: Int, b: Int)
val testList = List(Test(1,2), Test(3,4))
val testDF = sqlContext.createDataFrame(testList)
// define the hasColumn function
def hasColumn(df: org.apache.spark.sql.DataFrame, colName: String) = df.columns.contains(colName)
// then you can just use it on the DF with a given column name
hasColumn(testDF, "a") // <-- true
hasColumn(testDF, "c") // <-- false
Alternatively you can define an implicit class using the pimp my library pattern so that the hasColumn method is available on your dataframes directly
implicit class DataFrameImprovements(df: org.apache.spark.sql.DataFrame) {
def hasColumn(colName: String) = df.columns.contains(colName)
}
Then you can use it as:
testDF.hasColumn("a") // <-- true
testDF.hasColumn("c") // <-- false
Try is not optimal as the it will evaluate the expression inside Try before it takes the decision.
For large data sets, use the below in Scala:
df.schema.fieldNames.contains("column_name")
For those who stumble across this looking for a Python solution, I use:
if 'column_name_to_check' in df.columns:
# do something
When I tried #Jai Prakash's answer of df.columns.contains('column-name-to-check') using Python, I got AttributeError: 'list' object has no attribute 'contains'.
Your other option for this would be to do some array manipulation (in this case an intersect) on the df.columns and your potential_columns.
// Loading some data (so you can just copy & paste right into spark-shell)
case class Document( a: String, b: String, c: String)
val df = sc.parallelize(Seq(Document("a", "b", "c")), 2).toDF
// The columns we want to extract
val potential_columns = Seq("b", "c", "d")
// Get the intersect of the potential columns and the actual columns,
// we turn the array of strings into column objects
// Finally turn the result into a vararg (: _*)
df.select(potential_columns.intersect(df.columns).map(df(_)): _*).show
Alas this will not work for you inner object scenario above. You will need to look at the schema for that.
I'm going to change your potential_columns to fully qualified column names
val potential_columns = Seq("a.b", "a.c", "a.d")
// Our object model
case class Document( a: String, b: String, c: String)
case class Document2( a: Document, b: String, c: String)
// And some data...
val df = sc.parallelize(Seq(Document2(Document("a", "b", "c"), "c2")), 2).toDF
// We go through each of the fields in the schema.
// For StructTypes we return an array of parentName.fieldName
// For everything else we return an array containing just the field name
// We then flatten the complete list of field names
// Then we intersect that with our potential_columns leaving us just a list of column we want
// we turn the array of strings into column objects
// Finally turn the result into a vararg (: _*)
df.select(df.schema.map(a => a.dataType match { case s : org.apache.spark.sql.types.StructType => s.fieldNames.map(x => a.name + "." + x) case _ => Array(a.name) }).flatMap(x => x).intersect(potential_columns).map(df(_)) : _*).show
This only goes one level deep, so to make it generic you would have to do more work.
in pyspark you can simply run
'field' in df.columns
If you shred your json using a schema definition when you load it then you don't need to check for the column. if it's not in the json source it will appear as a null column.
val schemaJson = """
{
"type": "struct",
"fields": [
{
"name": field1
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": field2
"type": "string",
"nullable": true,
"metadata": {}
}
]
}
"""
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val djson = sqlContext.read
.schema(schema )
.option("badRecordsPath", readExceptionPath)
.json(dataPath)
def hasColumn(df: org.apache.spark.sql.DataFrame, colName: String) =
Try(df.select(colName)).isSuccess
Use the above mentioned function to check the existence of column including nested column name.
In PySpark, df.columns gives you a list of columns in the dataframe, so
"colName" in df.columns
would return a True or False. Give a try on that. Good luck!
For nested columns you can use
df.schema.simpleString().find('column_name')