Streaming CSV Source with Akka HTTP - MongoDB

I am trying to stream data from MongoDB using reactivemongo-akkastream 0.12.1 and return the result as a CSV stream in one of the routes (using Akka HTTP).
I implemented it following the example here:
http://doc.akka.io/docs/akka-http/10.0.0/scala/http/routing-dsl/source-streaming-support.html#simple-csv-streaming-example
and it appears to be working fine.
The only problem I am facing now is how to add a header row to the output CSV. Any ideas?
Thanks

Aside from the fact that this example isn't a robust way of generating CSV (it doesn't do proper escaping; see the note at the end of this answer), you'll need to rework it a bit to add headers. Here's what I would do:
- make a Flow that converts a Source[Tweet] into a source of CSV rows, e.g. a Source[List[String]]
- concatenate it to a source containing your headers as a single List[String]
- adapt the marshaller to render a source of rows rather than tweets
Here's some example code:
import akka.NotUsed
import akka.http.scaladsl.common.EntityStreamingSupport
import akka.http.scaladsl.marshalling.{Marshaller, Marshalling}
import akka.http.scaladsl.model.ContentTypes
import akka.http.scaladsl.server.Directives._
import akka.stream.scaladsl.{Flow, Source}
import akka.util.ByteString

case class Tweet(uid: String, txt: String)

def getTweets: Source[Tweet, NotUsed] = ???

// convert each Tweet into a CSV row (List[String])
val tweetToRow: Flow[Tweet, List[String], NotUsed] =
  Flow[Tweet].map { t =>
    List(
      t.uid,
      t.txt.replaceAll(",", "."))
  }

// provide a marshaller from a row (List[String]) to a ByteString
implicit val tweetAsCsv = Marshaller.strict[List[String], ByteString] { row =>
  Marshalling.WithFixedContentType(ContentTypes.`text/csv(UTF-8)`, () =>
    ByteString(row.mkString(","))
  )
}

// enable csv streaming
implicit val csvStreaming = EntityStreamingSupport.csv()

val route = path("tweets") {
  val headers = Source.single(List("uid", "text"))
  val tweets: Source[List[String], NotUsed] = getTweets.via(tweetToRow)
  complete(headers.concat(tweets))
}
Update: if your getTweets method returns a Future, you can just map over the future's Source value and prepend the headers that way, e.g.:
val route = path("tweets") {
  val headers = Source.single(List("uid", "text"))
  val rows: Future[Source[List[String], NotUsed]] = getTweets
    .map(tweets => headers.concat(tweets.via(tweetToRow)))
  complete(rows)
}
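As for the escaping caveat mentioned above: a minimal quoting helper (my own sketch, not a full RFC 4180 implementation) that could replace the replaceAll trick would look roughly like this:
// hypothetical helper, not part of the original answer: quote a field if it contains
// a comma, a double quote, or a line break, doubling any embedded quotes
def escapeCsv(field: String): String =
  if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
    "\"" + field.replace("\"", "\"\"") + "\""
  else field

val tweetToEscapedRow: Flow[Tweet, List[String], NotUsed] =
  Flow[Tweet].map(t => List(escapeCsv(t.uid), escapeCsv(t.txt)))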

Related

Convert a Source to Flow in Scala

How do I convert a Source to a Flow?
Input: Source[ByteString, NotUsed]
Intermediate step: call an API which returns an InputStream
Output: Flow[ByteString, ByteString, NotUsed]
I am doing it as:
// type of the input: Source[ByteString, NotUsed]
val sink: Sink[ByteString, InputStream] = StreamConverters.asInputStream()
val output: InputStream = <API CALL>
val mySource: Source[ByteString, Future[IOResult]] = StreamConverters.fromInputStream(() => output)
val myFlow: Flow[ByteString, ByteString, NotUsed] = Flow.fromSinkAndSource(sink, mySource)
When I use the above Flow with the source, it returns an empty result. Can someone help me figure out if I am doing it right?
I'm not sure I fully grasp what you want to achieve, but maybe this is a use case for flatMapConcat:
def readInputstream(bs: ByteString): Source[ByteString, Future[IOResult]] =
  // get some InputStream from the ByteString
  StreamConverters.fromInputStream(() => ???)

val myFlow: Flow[ByteString, ByteString, NotUsed] =
  Flow[ByteString].flatMapConcat(bs => readInputstream(bs))

// And use it like this:
val source: Source[ByteString, NotUsed] = ???
source
  .via(myFlow)
  .to(???)
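For a concrete, runnable illustration of the flatMapConcat approach, here is a self-contained sketch where callApi is a hypothetical stand-in for the real API that simply wraps the incoming bytes in a ByteArrayInputStream (on Akka versions before 2.6 you would also need an explicit ActorMaterializer):
import java.io.{ByteArrayInputStream, InputStream}
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source, StreamConverters}
import akka.util.ByteString

implicit val system: ActorSystem = ActorSystem("flow-demo")

// hypothetical stand-in for the real API call that returns an InputStream
def callApi(bs: ByteString): InputStream = new ByteArrayInputStream(bs.toArray)

val myFlow: Flow[ByteString, ByteString, NotUsed] =
  Flow[ByteString].flatMapConcat(bs => StreamConverters.fromInputStream(() => callApi(bs)))

Source.single(ByteString("hello"))
  .via(myFlow)
  .runWith(Sink.foreach(chunk => println(chunk.utf8String)))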

Load constraints from csv-file (amazon deequ)

I'm checking out Deequ, which seems like a really nice library. I was wondering if it is possible to load constraints from a CSV file or an ORC table in HDFS?
Let's say I have a table with these types:
case class Item(
  id: Long,
  productName: String,
  description: String,
  priority: String,
  numViews: Long
)
and I want to put constraints like:
val checks = Check(CheckLevel.Error, "unit testing my data")
  .isComplete("id") // should never be NULL
  .isUnique("id")   // should not contain duplicates
But I want to load the .isComplete("id") and .isUnique("id") parts from a CSV file, so the business can add the constraints and we can run the tests based on their input:
val verificationResult = VerificationSuite()
  .onData(data)
  .addChecks(Seq(checks))
  .run()
I've managed to get the constraints from suggestionResult.constraintSuggestions:
val allConstraints = suggestionResult.constraintSuggestions
  .flatMap { case (_, suggestions) => suggestions.map { _.constraint } }
  .toSeq
which gives a List like, for example:
allConstraints = List(CompletenessConstraint(Completeness(id,None)), ComplianceConstraint(Compliance('id' has no negative values,id >= 0,None)))
But that list gets generated from suggestionResult.constraintSuggestions, and I want to be able to create a list like that based on the input from a CSV file. Can anyone help me?
To sum things up:
Basically I just want to add:
val checks = Check(CheckLevel.Error, "unit testing my data")
  .isComplete("columnName1")
  .isUnique("columnName1")
  .isComplete("columnName2")
dynamically, based on a file that contains, for example:
columnName;isUnique;isComplete (header)
columnName1;true;true
columnName2;false;true
I chose to store the CSV in src/main/resources as it's very easy to read from there, and easy to maintain in parallel with the code being QA'ed.
def readCSV(spark: SparkSession, filename: String): DataFrame = {
  import spark.implicits._
  val inputFileStream = Try {
    this.getClass.getResourceAsStream("/" + filename)
  }
    .getOrElse(
      throw new Exception("Cannot find " + filename + " in src/main/resources")
    )
  val readlines =
    scala.io.Source.fromInputStream(inputFileStream).getLines.toList
  val csvData: Dataset[String] =
    spark.sparkContext.parallelize(readlines).toDS
  spark.read.option("header", true).option("inferSchema", true).csv(csvData)
}
This loads the file as a DataFrame, which can easily be passed to code like gavincruick's example on GitHub, copied here for convenience:
import com.amazon.deequ.VerificationResult
import org.apache.spark.sql.DataFrame
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
import scala.util.Try

// code to build a verifier from a DF that has a 'Constraint' column
type Verifier = DataFrame => VerificationResult

def generateVerifier(df: DataFrame, columnName: String): Try[Verifier] = {
  val constraintCheckCodes: Seq[String] = df.select(columnName).collect().map(_(0).toString).toSeq

  def checkSrcCode(checkCodeMethod: String, id: Int): String =
    s"""com.amazon.deequ.checks.Check(com.amazon.deequ.checks.CheckLevel.Error, "$id")$checkCodeMethod"""

  val verifierSrcCode = s"""{
    |import com.amazon.deequ.constraints.ConstrainableDataTypes
    |import com.amazon.deequ.{VerificationResult, VerificationSuite}
    |import org.apache.spark.sql.DataFrame
    |
    |val checks = Seq(
    | ${constraintCheckCodes.zipWithIndex
         .map { (checkSrcCode _).tupled }
         .mkString(",\n ")}
    |)
    |
    |(data: DataFrame) => VerificationSuite().onData(data).addChecks(checks).run()
    |}
    """.stripMargin.trim

  println(s"Verification function source code:\n$verifierSrcCode\n")

  compile[Verifier](verifierSrcCode)
}

/** Compiles the scala source code that, when evaluated, produces a value of type T. */
def compile[T](source: String): Try[T] =
  Try {
    val toolbox = currentMirror.mkToolBox()
    val tree = toolbox.parse(source)
    val compiledCode = toolbox.compile(tree)
    compiledCode().asInstanceOf[T]
  }
// example usage...
// sample test data
val testDataDF = Seq(
  ("2020-02-12", "England", "E10000034", "Worcestershire", 1),
  ("2020-02-12", "Wales", "W11000024", "Powys", 0),
  ("2020-02-12", "Wales", null, "Unknown", 1),
  ("2020-02-12", "Canada", "MADEUP", "Ontario", 1)
).toDF("Date", "Country", "AreaCode", "Area", "TotalCases")

// constraints in a DF
val constraintsDF = Seq(
  (".isComplete(\"Area\")"),
  (".isComplete(\"Country\")"),
  (".isComplete(\"TotalCases\")"),
  (".isComplete(\"Date\")"),
  (".hasCompleteness(\"AreaCode\", _ >= 0.80, Some(\"It should be above 0.80!\"))"),
  (".isContainedIn(\"Country\", Array(\"England\", \"Scotland\", \"Wales\", \"Northern Ireland\"))")
).toDF("Constraint")

// build Verifier from the constraints DF
val verifier = generateVerifier(constraintsDF, "Constraint").get

// run the verifier against a sample DF
val result = verifier(testDataDF)

// display results
VerificationResult.checkResultsAsDataFrame(spark, result).show()
It depends on how complicated you want to allow the constraints to be. In general, deequ allows you to use arbitrary Scala code for the validation function of a constraint, so it's difficult (and dangerous from a security perspective) to load that from a file.
I think you would have to come up with your own schema and semantics for the CSV file; at least, this is not directly supported in deequ.
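As a rough sketch of that do-it-yourself approach (my own illustration, not a deequ feature), the columnName;isUnique;isComplete file from the question could be parsed and folded into a Check like this:
import com.amazon.deequ.checks.{Check, CheckLevel}

// one parsed row of the columnName;isUnique;isComplete file
final case class ConstraintRow(columnName: String, isUnique: Boolean, isComplete: Boolean)

def parseRows(lines: Seq[String]): Seq[ConstraintRow] =
  lines.drop(1).map { line => // drop the header line
    val Array(col, unique, complete) = line.split(";").map(_.trim)
    ConstraintRow(col, unique.toBoolean, complete.toBoolean)
  }

def buildCheck(rows: Seq[ConstraintRow]): Check =
  rows.foldLeft(Check(CheckLevel.Error, "unit testing my data")) { (check, row) =>
    val withCompleteness = if (row.isComplete) check.isComplete(row.columnName) else check
    if (row.isUnique) withCompleteness.isUnique(row.columnName) else withCompleteness
  }
The resulting Check can be passed to VerificationSuite().onData(data).addChecks(Seq(check)).run() exactly as in the question; the trade-off versus the toolbox-compilation answer above is that only the constraint types you explicitly map (isComplete, isUnique, ...) are supported.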

Scala flattening embedded list of lists

I have created a Twitter DStream that displays hashtags, the author, and mentioned users in the format below.
(List(timetofly, hellocake),Shera_Eyra,List(blxcknicotine, kimtheskimm))
I can't do analysis on this format because of the embedded lists. How can I create another DStream that displays the data in this format?
timetofly, Shera_Eyra, blxcknicotine
timetofly, Shera_Eyra, kimtheskimm
hellocake, Shera_Eyra, blxcknicotine
hellocake, Shera_Eyra, kimtheskimm
Here is my code to produce the data:
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
val ssc = new StreamingContext(sparkConf, Seconds(sampleInterval))
val stream = TwitterUtils.createStream(ssc, None)
val data = stream.map { line =>
  (line.getHashtagEntities.map(_.getText),
   line.getUser().getScreenName(),
   line.getUserMentionEntities.map(_.getScreenName).toList)
}
In your code snippet, data is a DStream[(Array[String], String, List[String])]. To get a DStream[String] in your desired format, you can use flatMap and map:
val data = stream.map { line =>
  (line.getHashtagEntities.map(_.getText),
   line.getUser().getScreenName(),
   line.getUserMentionEntities.map(_.getScreenName).toList)
}

val data2 = data
  .flatMap(a => a._1.flatMap(b => a._3.map(c => (b, a._2, c))))
  .map { case (hash, user, mention) => s"$hash, $user, $mention" }
The flatMap results in a DStream[(String, String, String)] in which each tuple consists of a hash tag entity, user, and mention entity. The subsequent call to map with the pattern matching creates a DStream[String] in which each String consists of the elements in each tuple, separated by a comma and space.
I would use a for comprehension for this:
val data = (List("timetofly", "hellocake"), "Shera_Eyra", List("blxcknicotine", "kimtheskimm"))

val result = for {
  hashtag <- data._1
  user = data._2
  mentionedUser <- data._3
} yield (hashtag, user, mentionedUser)

result.foreach(println)
Output:
(timetofly,Shera_Eyra,blxcknicotine)
(timetofly,Shera_Eyra,kimtheskimm)
(hellocake,Shera_Eyra,blxcknicotine)
(hellocake,Shera_Eyra,kimtheskimm)
If you would prefer a seq of lists of strings, rather than a seq of tuples of strings, then change the yield to give you a list instead: yield List(hashtag, user, mentionedUser)
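Applied to the DStream from the question, the same for comprehension can be used inside a flatMap to produce the comma-separated lines the question asks for (a sketch based on the data stream defined there):
// data is the DStream[(Array[String], String, List[String])] from the question
val lines = data.flatMap { case (hashtags, user, mentions) =>
  for {
    hashtag <- hashtags.toList
    mention <- mentions
  } yield s"$hashtag, $user, $mention"
}

lines.print() // e.g. "timetofly, Shera_Eyra, blxcknicotine"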

spark wholeTextFiles - many small files

I want to ingest many small text files via Spark into Parquet. Currently, I use wholeTextFiles and perform some additional parsing.
To be more precise: these small text files are ESRI ASCII Grid files, each with a maximum size of around 400 kB. GeoTools is used to parse them as outlined below.
Do you see any optimization possibilities? Maybe something to avoid the creation of unnecessary objects? Or something to better handle the small files? I wonder if it is better to only get the paths of the files and manually read them instead of going String -> ByteArrayInputStream.
case class RawRecords(path: String, content: String)
case class GeometryId(idPath: String, value: Double, geo: String)

@transient lazy val extractor = new PolygonExtractionProcess()
@transient lazy val writer = new WKTWriter()

def readRawFiles(path: String, parallelism: Int, spark: SparkSession) = {
  import spark.implicits._
  spark.sparkContext
    .wholeTextFiles(path, parallelism)
    .toDF("path", "content")
    .as[RawRecords]
    .mapPartitions(mapToSimpleTypes)
}

def mapToSimpleTypes(iterator: Iterator[RawRecords]): Iterator[GeometryId] = iterator.flatMap(r => {
  val extractor = new PolygonExtractionProcess()

  // http://docs.geotools.org/latest/userguide/library/coverage/arcgrid.html
  val readRaster = new ArcGridReader(new ByteArrayInputStream(r.content.getBytes(StandardCharsets.UTF_8))).read(null)

  // TODO maybe consider optimization of known size instead of using growable data structure
  val vectorizedFeatures = extractor.execute(readRaster, 0, true, null, null, null, null).features
  val result: collection.Seq[GeometryId] with Growable[GeometryId] = mutable.Buffer[GeometryId]()

  while (vectorizedFeatures.hasNext) {
    val vectorizedFeature = vectorizedFeatures.next()
    val geomWKTLineString = vectorizedFeature.getDefaultGeometry match {
      case g: Geometry => writer.write(g)
    }
    val geomUserdata = vectorizedFeature.getAttribute(1).asInstanceOf[Double]
    result += GeometryId(r.path, geomUserdata, geomWKTLineString)
  }
  result
})
I have some suggestions:
- Use wholeTextFiles -> mapPartitions -> convert to Dataset. Why? If you do mapPartitions on a Dataset, all rows are converted from the internal format to objects, which causes additional serialization.
- Run Java Mission Control and sample your application. It will show all compilations and method execution times.
- Maybe you can use binaryFiles; it gives you a PortableDataStream per file, so you can parse the content in mapPartitions without first reading each file into a String (see the sketch below).
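A minimal sketch of the binaryFiles suggestion, reusing the GeometryId case class from the question (the function name is mine, the extraction body is elided, and the exact GeoTools wiring is an assumption):
import org.apache.spark.sql.{Dataset, SparkSession}

// sketch only: same shape as readRawFiles, but without materializing each file as a String
def readRawFilesBinary(path: String, parallelism: Int, spark: SparkSession): Dataset[GeometryId] = {
  import spark.implicits._
  spark.sparkContext
    .binaryFiles(path, parallelism) // RDD[(path, PortableDataStream)]
    .flatMap { case (filePath, stream) =>
      // feed the stream to the reader directly, avoiding the String -> getBytes -> ByteArrayInputStream copy
      val raster = new ArcGridReader(stream.open()).read(null)
      // ... run PolygonExtractionProcess on `raster` and build GeometryId values
      //     exactly as in mapToSimpleTypes above ...
      Seq.empty[GeometryId] // placeholder for the extracted geometries
    }
    .toDS()
}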

Convert csv to RDD

I tried the accepted solution in "How do I convert csv file to rdd". I want to print out all the users except "om":
val csv = sc.textFile("file.csv") // original file
val data = csv.map(line => line.split(",").map(elem => elem.trim)) // lines in rows
val header = new SimpleCSVHeader(data.take(1)(0)) // we build our header with the first line
val rows = data.filter(line => header(line, "user") != "om") // filter the header out
val users = rows.map(row => header(row, "user"))
users.collect().map(user => println(user))
but I got an error:
java.util.NoSuchElementException: key not found: user
I tried to debug it and looked at the index attribute in header. Since I'm new to Spark and Scala, does this mean that user is already in a Map? Then why the "key not found" error?
I found my mistake. It's not related to Spark/Scala. When I created the example CSV, I used this command in R:
df <- data.frame(user=c('om','daniel','3754978'),topic=c('scala','spark','spark'),hits=c(120,80,1))
write.csv(df, "df.csv",row.names=FALSE)
but write.csv adds quotes (") around factors by default, which is why the map can't find the key user: the real key is "user" (with the quotes). Using
write.csv(df, "df.csv",quote=FALSE, row.names=FALSE)
will solve this problem.
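As an alternative (my own note, not part of the original answer), the quotes could also be stripped on the Spark side when splitting, so that the quoted CSV works as-is:
// strip one pair of surrounding double quotes from each field, if present
val data = csv.map(line =>
  line.split(",").map(elem => elem.trim.stripPrefix("\"").stripSuffix("\"")))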
I've rewritten the sample code to remove the header method.
IMO, this example provides a step by step walkthrough that is easier to follow. Here is a more detailed explanation.
def main(args: Array[String]): Unit = {
  val csv = sc.textFile("/path/to/your/file.csv")
  // split / clean data
  val headerAndRows = csv.map(line => line.split(",").map(_.trim))
  // get header
  val header = headerAndRows.first
  // filter out header
  val data = headerAndRows.filter(_(0) != header(0))
  // splits to map (header/value pairs)
  val maps = data.map(splits => header.zip(splits).toMap)
  // filter out the 'om' user
  val result = maps.filter(map => map("user") != "om")
  // print result
  result.foreach(println)
}
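With the sample data from the question (and the om row filtered out), each element of maps is a plain header-to-value Map, so result.foreach(println) prints something like (order may vary):
Map(user -> daniel, topic -> spark, hits -> 80)
Map(user -> 3754978, topic -> spark, hits -> 1)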