Save Dataset elements to files with specified file path - scala

I have a Dataset of an Event case class, and I would like to save the JSON string inside each element to a file on S3 with a path like bucketName/service/yyyy/mm/dd/hh/[SomeGuid].gz
So, for example, the Event case class looks like this:
case class Event(
  hourPath: String, // e.g. bucketName/service/yyyy/mm/dd/hh/
  json: String,     // The JSON line that represents this particular event.
  ...               // Other properties used in earlier transformations.
)
Is there a way to save the Dataset so that the events belonging to a particular hour end up in a file on S3 under that hour's path?
Calling partitionBy on the DataFrameWriter is the closest I can get, but the resulting file paths aren't exactly what I want.
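For reference, this is roughly what the partitionBy approach looks like (the staging path is hypothetical); it produces hourPath=<escaped value> directories with Spark-generated part-file names rather than the desired bucketName/service/yyyy/mm/dd/hh/[SomeGuid].gz layout:
// Illustration of the partitionBy behaviour: the hourPath value becomes a
// key=value directory (with '/' percent-encoded) and the files get Spark's
// part-xxxxx names, e.g. .../hourPath=bucketName%2Fservice%2F.../part-00000-<uuid>.json.gz
eventsDS
  .select("hourPath", "json")
  .write
  .partitionBy("hourPath")
  .option("compression", "gzip")
  .json("s3a://some-staging-bucket/events") // hypothetical output location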

You can iterate over the items and write each one to a file in S3 yourself. Note that the snippet below first collects the Dataset to the driver, so the uploads run there sequentially; to have Spark execute the writes in parallel on the executors, run them with foreach/foreachPartition instead (see the sketch after the import list below).
This code is working for me:
val tempDS = eventsDS.rdd.collect.map(x => saveJSONtoS3(x.hourPath, x.json))

def saveJSONtoS3(path: String, jsonString: String): Unit = {
  val bucketName = path.substring(0, path.indexOf('/'))
  val file = path.substring(bucketName.length() + 1)
  val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
  val amazonS3Client = new AmazonS3Client(creds)
  val meta = new ObjectMetadata()
  amazonS3Client.putObject(bucketName, file, new ByteArrayInputStream(jsonString.getBytes), meta)
}
You need to import:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.model.ObjectMetadata
You also need to include the aws-java-sdk library.
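If you want the writes to run on the executors and match the bucketName/service/yyyy/mm/dd/hh/[SomeGuid].gz layout from the question, a rough sketch (not the code above; it assumes AWS_ACCESS_KEY and AWS_SECRET_KEY are plain string constants visible to the executors) could open one client per partition and gzip each JSON line under a random GUID key:
// In addition to the imports above:
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.UUID
import java.util.zip.GZIPOutputStream

// Sketch only: one S3 client per partition, one gzip-compressed object per event,
// named <hourPath><random GUID>.gz where hourPath is "bucket/service/yyyy/mm/dd/hh/".
eventsDS.rdd.foreachPartition { events =>
  val client = new AmazonS3Client(new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY))
  events.foreach { event =>
    val bucket = event.hourPath.substring(0, event.hourPath.indexOf('/'))
    val keyPrefix = event.hourPath.substring(bucket.length + 1)
    val key = s"$keyPrefix${UUID.randomUUID()}.gz"
    // gzip the JSON line in memory before uploading
    val bytes = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(bytes)
    gzip.write(event.json.getBytes("UTF-8"))
    gzip.close()
    val payload = bytes.toByteArray
    val meta = new ObjectMetadata()
    meta.setContentLength(payload.length.toLong)
    client.putObject(bucket, key, new ByteArrayInputStream(payload), meta)
  }
}
For very large numbers of tiny events it is usually better to group many JSON lines into each object, but that goes beyond what the answer above does.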

Related

How to efficiently read/parse loads of .gz files in a s3 folder with spark on EMR

I'm trying to read all the files in a directory on S3 via a Spark app that's executing on EMR.
The data is stored in a typical format like "s3a://Some/path/yyyy/mm/dd/hh/blah.gz".
If I use deeply nested wildcards (e.g. "s3a://SomeBucket/SomeFolder/*/*/*/*/*.gz"), the performance is terrible and it takes about 40 minutes to read a few tens of thousands of small gzipped JSON files.
It works, but losing 40 minutes to test some code is really bad.
I have two other approaches that my research has told me are much more performant.
Using the hadoop.fs library (2.8.5) I try to read each file path I provide it.
private def getEventDataHadoop(
    eventsFilePaths: RDD[String]
)(implicit sqlContext: SQLContext): Try[RDD[String]] =
  Try {
    val conf = sqlContext.sparkContext.hadoopConfiguration
    eventsFilePaths.map(eventsFilePath => {
      val p = new Path(eventsFilePath)
      val fs = p.getFileSystem(conf)
      val eventData: FSDataInputStream = fs.open(p)
      IOUtils.toString(eventData)
    })
  }
These file paths are generated by the below code:
private[disneystreaming] def generateInputBucketPaths(
    s3Protocol: String,
    bucketName: String,
    service: String,
    region: String,
    yearsMonths: Map[String, Set[String]]
): Try[Set[String]] =
  Try {
    val days = 1 to 31
    val hours = 0 to 23
    val dateFormatter: Int => String = buildDateFormat("00")
    yearsMonths.flatMap { yearMonth: (String, Set[String]) =>
      for {
        month: String <- yearMonth._2
        day: Int <- days
        hour: Int <- hours
      } yield
        s"$s3Protocol$bucketName/$service/$region/${dateFormatter(yearMonth._1.toInt)}/${dateFormatter(month.toInt)}/" +
          s"${dateFormatter(day)}/${dateFormatter(hour)}/*.gz"
    }.toSet
  }
The hadoop.fs code fails because the Path class is not serializable. I can't think of how I can get around that.
So this led me to another approach using AmazonS3Client, where I just ask the client to give me all the file paths in a folder (or prefix), then parse the files to a string, which will likely fail due to them being compressed:
private def getEventDataS3(bucketName: String, prefix: String)(
    implicit sqlContext: SQLContext
): Try[RDD[String]] =
  Try {
    import com.amazonaws.services.s3._, model._
    import scala.collection.JavaConverters._

    val request = new ListObjectsRequest()
    request.setBucketName(bucketName)
    request.setPrefix(prefix)
    request.setMaxKeys(Integer.MAX_VALUE)
    val s3 = new AmazonS3Client(new ProfileCredentialsProvider("default"))

    val objs: ObjectListing = s3.listObjects(request) // Note that this method returns truncated data if longer than the "pageLength" above. You might need to deal with that.
    sqlContext.sparkContext
      .parallelize(objs.getObjectSummaries.asScala.map(_.getKey).toList)
      .flatMap { key =>
        Source
          .fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream)
          .getLines()
      }
  }
This code produces an exception because the profile cannot be null ("java.lang.IllegalArgumentException: profile file cannot be null").
Remember this code is running on EMR within AWS, so how do I provide the credentials it wants? How are other people running Spark jobs on EMR using this client?
Any help with getting any of these approaches working is much appreciated.
Path is serializable in later Hadoop releases, because it is useful to be able to use in Spark RDDs. Until then, convert the path to a URI, marshall that, and create a new path from that URI inside your closure.
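A minimal sketch of that idea (not the original code): ship plain path strings into the RDD and build the Path, FileSystem and an optional decompressor inside the closure, so nothing non-serializable is captured from the driver. It assumes commons-io is on the classpath and that the executors' default Hadoop configuration carries the s3a settings on EMR:
import java.net.URI
import java.nio.charset.StandardCharsets

import org.apache.commons.io.IOUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory
import org.apache.spark.rdd.RDD

// Only Strings cross the closure boundary; Path, FileSystem and the Hadoop
// Configuration are created per partition on the executor side.
def readEventFiles(eventsFilePaths: RDD[String]): RDD[String] =
  eventsFilePaths.mapPartitions { paths =>
    val conf = new Configuration() // sparkContext.hadoopConfiguration is not serializable either
    val codecs = new CompressionCodecFactory(conf)
    paths.map { pathString =>
      val path = new Path(new URI(pathString))
      val fs = path.getFileSystem(conf)
      val raw = fs.open(path)
      // decompress .gz (or any codec Hadoop recognises from the extension)
      val in: java.io.InputStream =
        Option(codecs.getCodec(path)).map(_.createInputStream(raw)).getOrElse(raw)
      try IOUtils.toString(in, StandardCharsets.UTF_8)
      finally in.close()
    }
  }
As for the credentials error in the AmazonS3Client approach: on EMR the instance-profile credentials are normally picked up automatically if the client is built with the default provider chain (for example new AmazonS3Client(new DefaultAWSCredentialsProviderChain()) or AmazonS3ClientBuilder.defaultClient()) instead of a named ProfileCredentialsProvider.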

Continuously process Parquet files as DataStreams in Flink's DataStream API

I have a parquet file on HDFS. It is overwritten daily with a new one. My goal is to emit this parquet file continuously - when it changes - as a DataStream in a Flink Job using the DataStream API.
The end goal is to use the file content in a Broadcast State, but this is out of scope for this question.
To process a file continuously, there is this very useful API: Data-sources. More specifically, FileProcessingMode.PROCESS_CONTINUOUSLY is exactly what I need. This works for reading/monitoring text files, no problem, but not for parquet files:
// Partial version 1: the raw file is processed continuously
val path: String = "hdfs://hostname/path_to_file_dir/"
val textInputFormat: TextInputFormat = new TextInputFormat(new Path(path))
// monitor the file continuously every minute
val stream: DataStream[String] = streamExecutionEnvironment.readFile(textInputFormat, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 60000)
To process parquet files, I can use Hadoop input formats via this API:
using-hadoop-inputformats. However, there is no FileProcessingMode parameter in this API, and it processes the file only once:
// Partial version 2: the parquet file is only processed once
val parquetPath: String = "/path_to_file_dir/parquet_0000"
// raw text format
val hadoopInputFormat: HadoopInputFormat[Void, ArrayWritable] = HadoopInputs.readHadoopFile(new MapredParquetInputFormat(), classOf[Void], classOf[ArrayWritable], parquetPath)
val stream: DataStream[(Void, ArrayWritable)] = streamExecutionEnvironment.createInput(hadoopInputFormat).map { record =>
// process the record here ...
}
I would like to somehow combine the two APIs, so that I can continuously process Parquet files via the DataStream API. Has any of you tried something like this?
After browsing Flink's code, it looks like the two APIs are rather different, and it does not seem possible to merge them.
The other approach, which I will detail here, is to define your own SourceFunction that will periodically read the file:
// Imports assumed by the snippet (Hadoop plus parquet-mr's org.apache.parquet packages):
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.column.page.PageReadStore
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.io.ColumnIOFactory

class ParquetSourceFunction extends SourceFunction[Int] {
  private var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
    while (isRunning) {
      val path = new Path("path_to_parquet_file")
      val conf = new Configuration()
      val readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER)
      val metadata = readFooter.getFileMetaData
      val schema = metadata.getSchema
      val parquetFileReader = new ParquetFileReader(conf, metadata, path, readFooter.getBlocks, schema.getColumns)
      var pages: PageReadStore = null
      try {
        while ({ pages = parquetFileReader.readNextRowGroup; pages != null }) {
          val rows = pages.getRowCount
          val columnIO = new ColumnIOFactory().getColumnIO(schema)
          val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema))
          (0L until rows).foreach { _ =>
            val group = recordReader.read()
            val my_integer = group.getInteger("field_name", 0)
            ctx.collect(my_integer)
          }
        }
      } finally parquetFileReader.close()
      // do whatever logic suits you to stop "watching" the file
      Thread.sleep(60000)
    }
  }

  override def cancel(): Unit = isRunning = false
}
Then, use the streamExecutionEnvironment to register this source:
val dataStream: DataStream[Int] = streamExecutionEnvironment.addSource(new ParquetSourceFunction)
// do what you want with your new datastream
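The "do whatever logic suits you" comment above is left open; one hedged way to fill it in (reusing the Hadoop Path and Configuration types already used by the source function) is to track the file's modification time and only re-read it when that changes:
// Inside ParquetSourceFunction: remember the last modification time seen and
// skip the read when the file has not changed since then.
private var lastModified = Long.MinValue

private def hasChanged(path: Path, conf: Configuration): Boolean = {
  val status = path.getFileSystem(conf).getFileStatus(path)
  val changed = status.getModificationTime > lastModified
  if (changed) lastModified = status.getModificationTime
  changed
}

// ... and in run(), wrap the row-group reading block with:
// if (hasChanged(path, conf)) { /* read the row groups and ctx.collect(...) */ }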

Spark: How to get String value while generating output file

I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
The StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced, I need to generate a new CSV with the complete details, like:
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map { student =>
  val name = getName(student.StudentId)
  val course = getCourse(student.StudentId)
  Array(name, course, student.City)
}.mapPartitions { data =>
  val stringWriter = new StringWriter()
  val csvWriter = new CSVWriter(stringWriter)
  csvWriter.writeAll(data.toList)
  Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)

//Functions defined to get details
def getName(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.StudentName }
}

def getCourse(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.Course }
}
Problem
The file gets generated, but the values are object representations instead of String values.
How can I get the string values instead of objects?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, it looks like the main issue with your code is the lookup functions: getName and getCourse effectively return nothing, because their return type is Unit; using an if without an else means that for some inputs there is no return value, which makes the entire function return Unit.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, would make lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
    val details = studentB.value(student.StudentId)
    Array(details.StudentName, details.Course, student.City)
  }
  .map(_.mkString(",")) // naive CSV writing with no escaping etc., you can also use CSVWriter like you did

// save as text file
resultStrings.saveAsTextFile(outputPath)
Spark has great support for joining and for writing to files. The join only takes one line of code, and so does the write.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
val df1 = Seq((101, "NDLS"),
  (102, "Mumbai")
).toDF("id", "city")

val df2 = Seq((101, "ABC", "C001"),
  (102, "XYZ", "C002")
).toDF("id", "name", "course")

val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created, containing a single file, which is the final result.

Spark wholeTextFiles - many small files

I want to ingest many small text files via Spark into Parquet. Currently, I use wholeTextFiles and additionally perform some parsing.
To be more precise: these small text files are ESRI ASCII Grid files, each with a maximum size of around 400 kB. GeoTools is used to parse them, as outlined below.
Do you see any optimization possibilities? Maybe something to avoid the creation of unnecessary objects? Or something to handle the small files better? I wonder if it is better to only get the paths of the files and read them manually, instead of going through String -> ByteArrayInputStream.
case class RawRecords(path: String, content: String)
case class GeometryId(idPath: String, value: Double, geo: String)
@transient lazy val extractor = new PolygonExtractionProcess()
@transient lazy val writer = new WKTWriter()
def readRawFiles(path: String, parallelism: Int, spark: SparkSession) = {
  import spark.implicits._
  spark.sparkContext
    .wholeTextFiles(path, parallelism)
    .toDF("path", "content")
    .as[RawRecords]
    .mapPartitions(mapToSimpleTypes)
}
def mapToSimpleTypes(iterator: Iterator[RawRecords]): Iterator[GeometryId] = iterator.flatMap(r => {
  val extractor = new PolygonExtractionProcess()

  // http://docs.geotools.org/latest/userguide/library/coverage/arcgrid.html
  val readRaster = new ArcGridReader(new ByteArrayInputStream(r.content.getBytes(StandardCharsets.UTF_8))).read(null)

  // TODO maybe consider optimization of known size instead of using growable data structure
  val vectorizedFeatures = extractor.execute(readRaster, 0, true, null, null, null, null).features
  val result: collection.Seq[GeometryId] with Growable[GeometryId] = mutable.Buffer[GeometryId]()
  while (vectorizedFeatures.hasNext) {
    val vectorizedFeature = vectorizedFeatures.next()
    val geomWKTLineString = vectorizedFeature.getDefaultGeometry match {
      case g: Geometry => writer.write(g)
    }
    val geomUserdata = vectorizedFeature.getAttribute(1).asInstanceOf[Double]
    result += GeometryId(r.path, geomUserdata, geomWKTLineString)
  }
  result
})
I have a few suggestions:
Use wholeTextFiles -> mapPartitions -> convert to a Dataset. Why? If you call mapPartitions on a Dataset, all rows are converted from the internal format to objects, which causes additional serialization.
Run Java Mission Control and sample your application. It will show all the compilations and the execution times of your methods.
Maybe you can use binaryFiles; it gives you a stream per file, so you can parse it in mapPartitions without first materializing the whole content as a String (see the sketch below).
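A minimal sketch of what suggestions 1 and 3 could look like together: stay at the RDD level until the very end and hand the parser a stream instead of a String. parseArcGrid is a hypothetical placeholder for the existing GeoTools extraction logic, adapted to take the file path and a PortableDataStream:
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.{Dataset, SparkSession}

// Sketch only: read the files as streams, do the heavy parsing at the RDD level,
// and convert to a Dataset of GeometryId (defined in the question) at the end.
def readRawFilesAsStreams(path: String, parallelism: Int, spark: SparkSession)(
    parseArcGrid: (String, PortableDataStream) => Iterator[GeometryId] // hypothetical parser
): Dataset[GeometryId] = {
  import spark.implicits._
  spark.sparkContext
    .binaryFiles(path, parallelism)                                    // RDD[(path, PortableDataStream)]
    .mapPartitions(_.flatMap { case (p, stream) => parseArcGrid(p, stream) })
    .toDS()
}
Inside parseArcGrid you could pass stream.open() to ArcGridReader directly (assuming the reader accepts an InputStream source), avoiding the String -> getBytes -> ByteArrayInputStream round trip mentioned in the question.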

Iteration over Typesafe config files

I have read this topic:
Iterate over fields in typesafe config
and made some changes, but I still don't know how to iterate over conf files in the Play framework.
Providers=[{1234 : "CProduct"},
{12345 : "ChProduct"},
{123 : "SProduct"}]
This is my conf file, called providers.conf. The question is how I can iterate over the entries and create a dropdown box from them. I would like to get them as a Map[Int, String] if possible.
I know I have to load them like this:
val config = ConfigFactory.load("providers.conf").getConfigList("Providers")
I can read the conf file like that, but I need to use it from a template, and to do that I have to convert it to a HashMap, a List, or something similarly usable.
Cheers,
I'm not sure if this is the most efficient way to do this, but this works:
1) Let's get our config:
val config = ConfigFactory.load()
scala> config.getConfigList("providers")
res23: java.util.List[_ <: com.typesafe.config.Config] = [Config(SimpleConfigObject({"id":"1234","name":" Product2"})), Config(SimpleConfigObject({"id":"4523","name":"Product1"})), Config(SimpleConfigObject({"id":"432","name":" Product3"}))]
2) For this example introduce Provider entity:
case class Provider(id: String, name: String)
3) Now let's convert the list of configs to Provider objects:
import scala.collection.JavaConversions._
config.getConfigList("providers").map(conf => Provider(conf.getString("id"), conf.getString("name"))).toList
res27: List[Provider] = List(Provider(1234, Product2), Provider(4523, Product1), Provider(432, Product3))
We need to explicitly convert it with toList, because by default the Java List converts to a Buffer.
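Since the question asks for a Map[Int, String], and providers.conf actually holds a list of single-entry objects like { 1234 : "CProduct" }, a hedged sketch of that conversion could look like this:
import com.typesafe.config.ConfigFactory
import scala.collection.JavaConverters._

// Sketch: flatten each single-entry object into an (Int, String) pair and
// collect the pairs into a Map for the dropdown.
val providerMap: Map[Int, String] =
  ConfigFactory.load("providers.conf")
    .getConfigList("Providers").asScala
    .flatMap(_.root().unwrapped().asScala.map { case (k, v) => k.toInt -> v.toString })
    .toMap
providerMap can then be passed to the Play template to render the dropdown options.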
Here is my solution for that:
val config = ConfigFactory.load("providers.conf").getConfigList("Providers")
var providerlist = new java.util.ArrayList[model.Provider]
val providers = (0 until config.size())
providers foreach { count =>
  val iterator = config.get(count).entrySet().iterator()
  while (iterator.hasNext()) {
    val entry = iterator.next()
    val p = new Provider(entry.getKey(), entry.getValue().render())
    providerlist.add(p)
  }
}
println(providerlist.get(0).providerId + providerlist.get(0).providerName)
println(providerlist.get(33).providerId + providerlist.get(33).providerName)
And my Provider class:
package model
case class Provider(providerId: String, providerName: String)