Write a Parquet file with Scalavro and parquet-avro - Scala

I need to write a file in Parquet format, to read later with Spark.
I'm using Scala with Scalavro and parquet-avro.
In my test I wrote a file in Avro format and it works fine:
import java.io._
import com.gensler.scalavro.types.AvroType
import scala.util.{ Success, Failure }

// object structure
case class defMyList(mydata: String)
case class objectTest(name: String, desc: String, myList: Seq[defMyList])

def test(): Unit = {
  // create object data
  val objectList = objectTest(
    name = "object name",
    desc = "object desc",
    myList = Seq(
      defMyList("asdfasdfasfsafsdfasdfasdf"),
      defMyList("asdfasdfasfsafsdfasdfasdf")
    )
  )

  val objectListType = AvroType[objectTest]
  println("schema: " + objectListType.schema)

  val filestream = new File("C:\\avrofile.avro")
  val outStream = new FileOutputStream(filestream)
  objectListType.io.write(objectList, outStream)

  val inStream: java.io.InputStream = new FileInputStream(filestream)
  objectListType.io.read(inStream) match {
    case Success(readResult) => println("Successfully deserialized: " + readResult)
    case Failure(cause) => println("Failure")
  }
}
How can I change this code to write in Parquet format?
Thank you.
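For reference, here is a minimal sketch of one way to do this with parquet-avro's AvroParquetWriter. It assumes the Scalavro schema can be rendered as standard Avro JSON and parsed by Avro's Schema.Parser; the output path and the GenericRecord field handling are illustrative, not taken from the original code:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Parse the Avro schema produced by Scalavro into an org.apache.avro.Schema
val avroSchema: Schema = new Schema.Parser().parse(objectListType.schema.toString)

// Build a writer that writes Avro GenericRecords as Parquet
val writer: ParquetWriter[GenericRecord] =
  AvroParquetWriter.builder[GenericRecord](new Path("objecttest.parquet"))
    .withSchema(avroSchema)
    .withCompressionCodec(CompressionCodecName.SNAPPY)
    .build()

// Populate a GenericRecord following the generated schema (field names assumed
// to match the case class fields; nested records also go through GenericData)
val record = new GenericData.Record(avroSchema)
record.put("name", "object name")
record.put("desc", "object desc")
val listSchema = avroSchema.getField("myList").schema()
val elem = new GenericData.Record(listSchema.getElementType)
elem.put("mydata", "asdfasdfasfsafsdfasdfasdf")
record.put("myList", new GenericData.Array(listSchema, java.util.Arrays.asList(elem)))

writer.write(record)
writer.close()

// Spark can then read the result back with read.parquet("objecttest.parquet")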

Related

Migrate code Scala to databricks notebook

I'm working to get this code running in Databricks notebooks (it's already tested and working in an IDE), but I can't get it working when I change the structure of the code.
import java.io.{BufferedReader, File, InputStreamReader, PrintWriter}
import java.text.SimpleDateFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object TestUnit {
  val dateFormat = new SimpleDateFormat("yyyyMMdd")

  case class Averages(cust: String, Num: String, date: String, credit: Double)

  def main(args: Array[String]): Unit = {
    val inputFile = "s3a://tfsdl-ghd-wb/raidnd/Cleartablet.csv"
    val outputFile = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv"
    val fileSystem = getFileSystem(inputFile)
    val inputData = readCSVFileLines(fileSystem, inputFile, skipHeader = true)
      .toSeq
    val filtinp = inputData.filter(x => x.nonEmpty)
      .map(x => x.split(","))
      .map(x => Averages(x(6), x(5), x(0), x(8).toDouble))

    // Create output writer
    val writer = new PrintWriter(new File(outputFile))
    // Header for output CSV file
    writer.write("Date,customer,number,Credit,Average Credit/SKU\n")

    filtinp.foreach { x =>
      val (com1, avg1) = com1Average(filtinp, x)
      val (com2, avg2) = com2Average(filtinp, x)
      // Write row to output csv file
      writer.write(s"${x.date},${x.cust},${x.Num},${x.credit},$avg1,$avg2\n")
    }

    writer.close() // close the writer
  }
}
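A minimal sketch of how this might be restructured for a Databricks notebook, assuming the notebook's built-in spark session: the driver-side PrintWriter (which cannot write to s3a:// paths) is replaced with Spark's CSV reader/writer, and the aggregation step is left as a placeholder:

// Hypothetical notebook cell: Databricks provides `spark` at top level,
// so no object/main wrapper is needed.
val inputFile  = "s3a://tfsdl-ghd-wb/raidnd/Cleartablet.csv"
val outputPath = "s3a://tfsdl-ghd-wb/raidnd/Incte_19_20"   // Spark writes a directory of part files

// Read the CSV with Spark instead of java.io
val df = spark.read.option("header", "true").csv(inputFile)

// ... compute the per-customer averages here, e.g. with groupBy/agg ...
val result = df   // placeholder for the aggregated DataFrame

// Write back to S3 with a header row
result.write.mode("overwrite").option("header", "true").csv(outputPath)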

Passing functions in Spark

This is my idea
import java.io.File
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object pizD {
  def filePath = {
    new File(this.getClass.getClassLoader.getResource("wikipedia/wikipedia.dat").toURI).getPath
  }

  def regex(line: String): pichA = {
    ......
    ......
    pichA(t1, t2)
  }
}

case class pichA(t1: String, t2: String)

object dushP {
  val conf = new SparkConf()
  val sc = new SparkContext(conf)
  val mirdd: RDD[pichA] = ???
How can I integrate sc.textFile with my methods filePath and regex? I want to combine them in order to get a new RDD.
val baseRDD = sc.textFile(pizD.filePath).filter(line => {
  val value = pizD.regex(line)
  if (value != null)
    true
  else
    false
})
Assuming pizD.filePath gives you the file name as a string, and regex() returns a null value when the regex doesn't match. If that understanding is correct, then the above code should do the trick.
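Along the same lines, a small sketch (assuming, as above, that regex() returns null for non-matching lines) of how to go one step further and fill in the RDD[pichA] that mirdd is declared as:

val mirdd: RDD[pichA] =
  sc.textFile(pizD.filePath)
    .map(line => pizD.regex(line))   // null for lines that do not match
    .filter(_ != null)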

spark collect method taking too much time when processing records stored in RDD[String]

I have a requirement where I have to pull a Parquet file from S3, process it, convert it into another object format, and store it in S3 in both JSON and Parquet format.
I have made the changes below for this problem statement, but the Spark job takes too much time when the collect statement is called. Please let me know how this can be optimized. Below is the complete code, which reads a Parquet file from S3, processes it, and stores it back to S3. I am very new to Spark and big data technology.
package com.expedia.www.lambda
import java.io._
import com.amazonaws.ClientConfiguration
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing}
import com.expedia.hendrix.lambda.HotelInfosite
import com.expedia.www.hendrix.signals.definition.local.HotelInfoSignal
import com.expedia.www.options.HendrixHistoricalOfflineProcessorOptions
import com.expedia.www.user.interaction.v1.UserInteraction
import com.expedia.www.util._
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.commons.lang.exception.ExceptionUtils
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.slf4j.{Logger, LoggerFactory}
import scala.collection.JavaConverters._
import scala.io.Source
import scala.util.Random
object GenericLambdaMapper{
private def currentTimeMillis: Long = System.currentTimeMillis
/** The below Generic mapper object is built for creating json similar to the Signal pushed by hendrix */
def populateSignalRecord( genericRecord: GenericRecord, uisMessage: UserInteraction, signalType: String): HotelInfoSignal ={
val objectMapper:ObjectMapper = new ObjectMapper
objectMapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
objectMapper.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true)
val hotelInfoObject = objectMapper.readValue( genericRecord.toString, classOf[com.expedia.www.hendrix.signals.definition.local.HotelInfosite])
val userKey = UserKeyUtil.createUserKey(uisMessage)
val hotelInfoSignal:HotelInfoSignal = new HotelInfoSignal
hotelInfoSignal.setSignalType(signalType)
hotelInfoSignal.setData(hotelInfoObject)
hotelInfoSignal.setUserKey(userKey)
hotelInfoSignal.setGeneratedAtTimestamp(currentTimeMillis)
return hotelInfoSignal
}
}
class GenericLambdaMapper extends Serializable{
var LOGGER:Logger = LoggerFactory.getLogger("GenericLambdaMapper")
var bw : BufferedWriter = null
var fw :FileWriter = null
val random: Random = new Random
var counter: Int = 0
var fileName: String= null
val s3Util = new S3Util
/** Object Mapper function for serializing and deserializing objects**/
def objectMapper : ObjectMapper= {
val mapper = new ObjectMapper
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
mapper.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true)
}
def process(sparkContext: SparkContext, options: HendrixHistoricalOfflineProcessorOptions ): Unit = { //ObjectListing
try {
LOGGER.info("Start Date : "+options.startDate)
LOGGER.info("END Date : "+options.endDate)
val listOfFilePath: List[String] = DateTimeUtil.getDateRangeStrFromInput(options.startDate, options.endDate)
/**Looping through each folder based on start and end date **/
listOfFilePath.map(
path => applyLambdaForGivenPathAndPushToS3Signal( sparkContext, path, options )
)
}catch {
case ex: Exception => {
LOGGER.error( "Exception in downloading data :" + options.rawBucketName + options.rawS3UploadRootFolder + options.startDate)
LOGGER.error("Stack Trace :"+ExceptionUtils.getFullStackTrace(ex))
}
}
}
// TODO: Currently the Lambda is hardcoded only to HotelInfoSite to be made generic
def prepareUisObjectAndApplyLambda(uisMessage: UserInteraction, options: HendrixHistoricalOfflineProcessorOptions): List[GenericRecord] = {
try {
val schemaDefinition = Source.fromInputStream(getClass.getResourceAsStream("/"+options.avroSchemaName)).getLines.mkString("\n")
val schemaHotelInfo = new Schema.Parser().parse(schemaDefinition)
HotelInfosite.apply(uisMessage, schemaHotelInfo).toList
}catch {
case ex: Exception => LOGGER.error("Exception while preparing UIS Object" + ex.toString)
List.empty
}
}
/** Below method is used to extract userInteraction Data from Raw file **/
private def constructUisObject(uisMessageRaw: String): UserInteraction = objectMapper.readValue( uisMessageRaw, classOf[UserInteraction])
/** Below function contains logic to apply the lambda for the given range of dates and push to signals folder in S3 **/
def applyLambdaForGivenPathAndPushToS3Signal(sparkContext: SparkContext, dateFolderPath: String, options: HendrixHistoricalOfflineProcessorOptions ): Unit ={
var awsS3Client: AmazonS3Client = null;
try {
if ("sandbox".equals(options.environment)) {
val clientConfiguration = new ClientConfiguration()
.withConnectionTimeout(options.awsConnectionTimeout)
.withSocketTimeout(options.awsSocketTimeout)
.withTcpKeepAlive(true)
awsS3Client = S3Client.getAWSConnection(options.awsS3AccessKey, options.awsS3SecretKey, clientConfiguration)
} else {
awsS3Client = S3Client.getAWSConnection
}
/** Validate if destination path has any gzip file if so then just skip that date and process next record **/
LOGGER.info("Validating if the destination folder path is empty: " + dateFolderPath)
var objectListing: ObjectListing = null
var listObjectsRequest: ListObjectsRequest = new ListObjectsRequest().withBucketName(options.destinationBucketName).withPrefix(options.s3SignalRootFolder + options.signalType + "/" + dateFolderPath.toString)
objectListing = awsS3Client.listObjects(listObjectsRequest)
if (objectListing.getObjectSummaries.size > 0) {
LOGGER.warn("Record already present at the below location, so skipping the processing of record for the folder path :" + dateFolderPath.toString)
LOGGER.warn("s3n://" + options.destinationBucketName + "/" + options.s3SignalRootFolder + options.signalType + "/" + dateFolderPath.toString)
return
}
LOGGER.info("Validated the destination folder path :" + dateFolderPath + " and found it to be empty ")
/** End of validation **/
/*Selecting all the files under the source path and iterating*/
counter = 0
listObjectsRequest = new ListObjectsRequest().withBucketName(options.rawBucketName).withPrefix(options.rawS3UploadRootFolder + dateFolderPath.toString)
objectListing = awsS3Client.listObjects(listObjectsRequest)
val rddListOfParquetFileNames = objectListing.getObjectSummaries.asScala.map(_.getKey).toList
rddListOfParquetFileNames.flatMap{key => { processIndividualParquetFileAndUploadToS3(sparkContext, awsS3Client, options, key, dateFolderPath)
"COMPLETED Processing=>"+key;
}}
}catch{
case ex: Exception =>
LOGGER.error("Exception occured while processing records for the path " + dateFolderPath)
LOGGER.error("Exception in Apply Lambda method Message :" + ex.getMessage + "\n Stack Trace :" + ex.getStackTrace)
}finally {
awsS3Client.shutdown
LOGGER.info("JOB Complete ")
}
}
def processIndividualParquetFileAndUploadToS3(sparkContext: SparkContext, awsS3Client: AmazonS3Client, options: HendrixHistoricalOfflineProcessorOptions, parquetFilePath:String, dateFolderPath:String ):Unit ={
try{
LOGGER.info("Currently Processing the Parquet file: "+parquetFilePath)
LOGGER.info("Starting to reading Parquet File Start Time: "+System.currentTimeMillis)
val dataSetString: RDD[String] = ParquetHelper.readParquetData(sparkContext, options, parquetFilePath)
LOGGER.info("Data Set returned from Parquet file Successful Time: "+System.currentTimeMillis)
val lambdaSignalRecords: Array[HotelInfoSignal] = dataSetString.map(x => constructUisObject(x))
.filter(_ != null)
.map(userInteraction => processIndividualRecords(userInteraction, options))
.filter(_ != null)
.collect
LOGGER.info("Successfully Generated "+lambdaSignalRecords.length+" Signal Records")
if(lambdaSignalRecords.length > 0) {
//Write to Parquet file: start
val parquetFileName: String = getFileNameForParquet(dateFolderPath, counter)
val parquetWriter = ParquetHelper.newParquetWriter(HotelInfoSignal.getClassSchema, dateFolderPath, parquetFileName, options)
LOGGER.info("Initialized Parquet Writer")
lambdaSignalRecords.map(signalRecord => parquetWriter.write(signalRecord))
LOGGER.info("Completed writing the data in Parquet format")
parquetWriter.close
//Parquet Write Complete
/*val avroSignalString = lambdaSignalRecords.mkString("\n")
val sparkSession = SparkSession.builder.getOrCreate
uploadProceessedDataToS3(sparkSession, awsS3Client, dateFolderPath, avroSignalString, options)
*/ }
}catch {case ex:Exception =>
LOGGER.error("Skipping processing of record :"+parquetFilePath+" because of Exception: "+ExceptionUtils.getFullStackTrace(ex))
}
LOGGER.info("Completed data processing for file :" + options.rawBucketName + options.rawS3UploadRootFolder + parquetFilePath)
}
def uploadProceessedDataToS3(sparkSession:SparkSession, awsS3Client: AmazonS3Client, filePath: String, genericSignalRecords: String, options: HendrixHistoricalOfflineProcessorOptions):Unit ={
var jsonFile: File = null
var gzFile: File = null
try {
//Building the file name based on the folder accessed
fileName = getFileName (filePath, counter)
jsonFile = IOUtil.createS3JsonFile (genericSignalRecords, fileName)
gzFile = IOUtil.gzipIt (jsonFile)
s3Util.uploadToS3(awsS3Client, options.destinationBucketName, options.s3SignalRootFolder + options.signalType + "/" + filePath, gzFile)
counter += 1 //Increment counter
} catch {
case ex: RuntimeException => LOGGER.error ("Exception while uploading file to path :" + options.s3SignalRootFolder + options.signalType + "/" + filePath + "/" + fileName)
LOGGER.error ("Stack Trace for S3 Upload :" + ExceptionUtils.getFullStackTrace(ex))
} finally {
//Cleaning the temp file created after upload to s3, we can create a temp dir if required.
jsonFile.delete
gzFile.delete
}
}
def processIndividualRecords(userInteraction: UserInteraction, options: HendrixHistoricalOfflineProcessorOptions): HotelInfoSignal ={
try {
//Applying lambda for the individual UserInteraction
val list: List[GenericRecord] = prepareUisObjectAndApplyLambda (userInteraction, options)
if (list.nonEmpty) return GenericLambdaMapper.populateSignalRecord (list.head, userInteraction, options.signalType)
} catch { case ex: Exception => LOGGER.error ("Error while creating signal record from UserInteraction for Signal Type :"+ options.signalType +" For Interaction "+userInteraction.toString)
LOGGER.error ("Stack Trace while processIndividualRecords :" + ExceptionUtils.getFullStackTrace(ex))}
null
}
/** This method is used to prepare the exact file name which has processed date and the no of files counter **/
def getFileName(filePath : String, counter : Int): String = {
filePath.replace("/","-")+"_"+counter+"_"+random.alphanumeric.take(5).mkString+".json"
}
/** This method is used to prepare the exact file name which has processed date and the no of files counter **/
def getFileNameForParquet(filePath : String, counter : Int): String = {
filePath.replace("/","-")+"_"+counter+"_"+random.alphanumeric.take(5).mkString+".parquet"
}
}
package com.expedia.www.util
import com.expedia.www.options.HendrixHistoricalOfflineProcessorOptions
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetWriter, AvroSchemaConverter}
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.hadoop.{ParquetFileWriter, ParquetWriter}
import org.apache.parquet.schema.MessageType
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.slf4j.{Logger, LoggerFactory}
/**
* Created by prasubra on 2/17/17.
*/
object ParquetHelper {
val LOGGER:Logger = LoggerFactory.getLogger("ParquetHelper")
def newParquetWriter(signalSchema: Schema, folderPath:String, fileName:String, options:HendrixHistoricalOfflineProcessorOptions): ParquetWriter[GenericRecord] = {
val blockSize: Int = 256 * 1024 * 1024
val pageSize: Int = 64 * 1024
val compressionCodec = if (options.parquetCompressionToGzip) CompressionCodecName.GZIP else CompressionCodecName.UNCOMPRESSED
val path: Path = new Path("s3n://" + options.destinationBucketName + "/" + options.parquetSignalFolderName + options.signalType + "/" + folderPath + "/" + fileName);
val parquetSchema: MessageType = new AvroSchemaConverter().convert(signalSchema);
// var writeSupport:WriteSupport = new AvroWriteSupport(parquetSchema, signalSchema);
//(path, writeSupport, compressionCodec, blockSize, pageSize)
//var parquetWriter:ParquetWriter[GenericRecord] = new ParquetWriter(path, writeSupport, compressionCodec, blockSize, pageSize);
if ("sandbox".equals(options.environment)) {
val hadoopConf = new Configuration
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", options.awsS3AccessKey)
hadoopConf.set("fs.s3n.awsSecretAccessKey", options.awsS3SecretKey)
hadoopConf.set("fs.s3n.maxRetries", options.awsFileReaderRetry)
AvroParquetWriter.builder(path)
.withSchema(signalSchema)
.withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
.withCompressionCodec(compressionCodec)
.withConf(hadoopConf)
.build()
} else {
AvroParquetWriter.builder(path)
.withSchema(signalSchema)
.withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
.withCompressionCodec(compressionCodec)
.withPageSize(pageSize)
.build()
}
}
def readParquetData(sc: SparkContext, options: HendrixHistoricalOfflineProcessorOptions, filePath: String): RDD[String] = {
val filePathOfParquet = "s3n://"+options.rawBucketName+"/"+ filePath
LOGGER.info("Reading Parquet file from path :"+filePathOfParquet)
val sparkSession = SparkSession.builder.getOrCreate
val dataFrame = sparkSession.sqlContext.read.parquet(filePathOfParquet)
//dataFrame.printSchema()
dataFrame.toJSON.rdd
}
}
First, you really should improve your question with a minimal code example; it's really hard to see what's going on in your code...
collect retrieves all elements of your RDD into a single array on the driver. If your RDD is large, then this will of course take a lot of time (and may cause an OutOfMemoryError if the content does not fit into the driver's main memory).
You can directly write the content of a DataFrame/Dataset as Parquet. This will surely be much faster and more scalable.
Use s3a:// URLs. s3n:// has a bug which really kills ORC/Parquet performance, and it has been superseded by s3a now.
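To make the second point concrete, here is a rough sketch, reusing the option and path names from the question (it is not a drop-in replacement), of writing the transformed data directly from a DataFrame instead of collecting it to the driver and feeding a hand-rolled ParquetWriter:

val spark = SparkSession.builder.getOrCreate()

// Read with s3a instead of s3n
val df = spark.read.parquet("s3a://" + options.rawBucketName + "/" + parquetFilePath)

// ... apply the per-record transformation with DataFrame/Dataset operations
//     (e.g. map on a Dataset) instead of collect() ...

// Let Spark write the output in parallel; no explicit ParquetWriter or collect needed
df.write
  .mode("overwrite")
  .option("compression", "gzip")
  .parquet("s3a://" + options.destinationBucketName + "/" + options.parquetSignalFolderName + options.signalType + "/" + dateFolderPath)

// The JSON copy can be produced the same way
df.write
  .mode("overwrite")
  .json("s3a://" + options.destinationBucketName + "/" + options.s3SignalRootFolder + options.signalType + "/" + dateFolderPath)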

Creating Dataframe from XML parsed by scalaxb

I can successfully parse XML data dropped into a directory using the Spark Streaming fileStream method, and I can write the resulting RDDs out to a text file just fine:
val fStream = {
  ssc.fileStream[LongWritable, Text, XmlInputFormat](
    WATCHDIR, xmlFilter _, newFilesOnly = false, conf = hadoopConf)
}
fStream.foreachRDD(rdd =>
  if (rdd.count() == 0) {
    logger.info("No files..")
  })
val dStream = fStream.map { case (x, y) =>
  logger.info("Hello from the dStream")
  logger.info(y.toString)
  scalaxb.fromXML[Music](scala.xml.XML.loadString(y.toString))
}
dStream.foreachRDD(rdd => rdd.saveAsTextFile("file:///tmp/xmlout"))
The trouble comes when I want to convert the RDDs to DataFrames in order to either register them as a temp table or call saveAsParquetFile.
This code:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
dStream.foreachRDD(rdd => rdd.distinct().toDF().printSchema())
Results in this error:
java.lang.UnsupportedOperationException: Schema for type scalaxb.DataRecord[scala.Any] is not supported
I would have thought that since scalaxb generates case classes for my records, it would be simple for Spark to infer the schema using reflection, and I can see that this is what it's trying to do, except that Spark doesn't support the scalaxb.DataRecord type. Are there any Spark or scalaxb experts with ideas on how to make the case classes generated by scalaxb compatible with Spark?
BTW, here are the generated classes from scalaxb:
package generated

case class Song(attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
  lazy val title = attributes.get("#title") map { _.as[String] }
  lazy val length = attributes.get("#length") map { _.as[String] }
}

case class Album(song: Seq[generated.Song] = Nil,
                 description: String,
                 attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
  lazy val title = attributes.get("#title") map { _.as[String] }
}

case class Artist(album: Seq[generated.Album] = Nil,
                  attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
  lazy val name = attributes.get("#name") map { _.as[String] }
}

case class Music(artist: Seq[generated.Artist] = Nil)
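One possible workaround, which is my own suggestion rather than something from the post: map the scalaxb types into plain case classes before calling toDF, using the lazy vals that already unwrap the DataRecord attributes, so Spark's reflection-based schema inference only sees types it supports:

// Flat case class containing only Spark-supported types
case class FlatSong(artist: Option[String], album: Option[String], title: Option[String], length: Option[String])

val flatStream = dStream.flatMap { music =>
  for {
    artist <- music.artist
    album  <- artist.album
    song   <- album.song
  } yield FlatSong(artist.name, album.title, song.title, song.length)
}

flatStream.foreachRDD { rdd =>
  rdd.toDF().printSchema()   // relies on the same `import sqlContext.implicits._` as above
}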

Writing a test case for file uploads in Play 2.1 and Scala

I found the following question/answer:
Test MultipartFormData in Play 2.0 FakeRequest
But it seems things have changed in Play 2.1. I've tried adapting the example like so:
"Application" should {
"Upload Photo" in {
running(FakeApplication()) {
val data = new MultipartFormData(Map(), List(
FilePart("qqfile", "message", Some("Content-Type: multipart/form-data"),
TemporaryFile(getClass().getResource("/test/photos/DSC03024.JPG").getFile()))
), List())
val Some(result) = routeAndCall(FakeRequest(POST, "/admin/photo/upload", FakeHeaders(), data))
status(result) must equalTo(CREATED)
headers(result) must contain(LOCATION)
contentType(result) must beSome("application/json")
However, whenever I attempt to run the request, I get a NullPointerException:
[error] ! Upload Photo
[error] NullPointerException: null (PhotoManagementSpec.scala:25)
[error] test.PhotoManagementSpec$$anonfun$1$$anonfun$apply$3$$anonfun$apply$4.apply(PhotoManagementSpec.scala:28)
[error] test.PhotoManagementSpec$$anonfun$1$$anonfun$apply$3$$anonfun$apply$4.apply(PhotoManagementSpec.scala:25)
[error] play.api.test.Helpers$.running(Helpers.scala:40)
[error] test.PhotoManagementSpec$$anonfun$1$$anonfun$apply$3.apply(PhotoManagementSpec.scala:25)
[error] test.PhotoManagementSpec$$anonfun$1$$anonfun$apply$3.apply(PhotoManagementSpec.scala:25)
If I try to replace the deprecated routeAndCall with just route (and remove the Option around result), I get a compile error stating that it can't write an instance of MultipartFormData[TemporaryFile] to the HTTP response.
What's the right way to design this test in Play 2.1 with Scala?
Edit: I tried modifying the code to test just the controller:
"Application" should {
"Upload Photo" in {
val data = new MultipartFormData(Map(), List(
FilePart("qqfile", "message", Some("Content-Type: multipart/form-data"),
TemporaryFile(getClass().getResource("/test/photos/DSC03024.JPG").getFile()))
), List())
val result = controllers.Photo.upload()(FakeRequest(POST, "/admin/photo/upload",FakeHeaders(),data))
status(result) must equalTo(OK)
contentType(result) must beSome("text/html")
charset(result) must beSome("utf-8")
contentAsString(result) must contain("Hello Bob")
}
But now I get a type error on all the test conditions around the result, like so:
[error] found : play.api.libs.iteratee.Iteratee[Array[Byte],play.api.mvc.Result]
[error] required: play.api.mvc.Result
I don't understand why I'm getting an Iteratee of byte arrays mapped to Results. Could this have something to do with how I'm using a custom body parser? My controller's definition looks like this:
def upload = Action(CustomParsers.multipartFormDataAsBytes) { request =>
  request.body.file("qqfile").map { upload =>
Using the form parser from this post: Pulling files from MultipartFormData in memory in Play2 / Scala
Play 2.3 includes a newer version of httpmime.jar, requiring some minor corrections. Building on Marcus's solution using Play's Writeable mechanism, while retaining some of the syntactic sugar from my Play 2.1 solution, this is what I've come up with:
import scala.language.implicitConversions
import java.io.{ByteArrayOutputStream, File}
import org.apache.http.entity.ContentType
import org.apache.http.entity.mime.MultipartEntityBuilder
import org.apache.http.entity.mime.content._
import org.specs2.mutable.Specification
import play.api.http._
import play.api.libs.Files.TemporaryFile
import play.api.mvc.MultipartFormData.FilePart
import play.api.mvc.{Codec, MultipartFormData}
import play.api.test.Helpers._
import play.api.test.{FakeApplication, FakeRequest}
trait FakeMultipartUpload {
implicit def writeableOf_multiPartFormData(implicit codec: Codec): Writeable[MultipartFormData[TemporaryFile]] = {
val builder = MultipartEntityBuilder.create().setBoundary("12345678")
def transform(multipart: MultipartFormData[TemporaryFile]): Array[Byte] = {
multipart.dataParts.foreach { part =>
part._2.foreach { p2 =>
builder.addPart(part._1, new StringBody(p2, ContentType.create("text/plain", "UTF-8")))
}
}
multipart.files.foreach { file =>
val part = new FileBody(file.ref.file, ContentType.create(file.contentType.getOrElse("application/octet-stream")), file.filename)
builder.addPart(file.key, part)
}
val outputStream = new ByteArrayOutputStream
builder.build.writeTo(outputStream)
outputStream.toByteArray
}
new Writeable[MultipartFormData[TemporaryFile]](transform, Some(builder.build.getContentType.getValue))
}
/** shortcut for generating a MultipartFormData with one file part which more fields can be added to */
def fileUpload(key: String, file: File, contentType: String): MultipartFormData[TemporaryFile] = {
MultipartFormData(
dataParts = Map(),
files = Seq(FilePart[TemporaryFile](key, file.getName, Some(contentType), TemporaryFile(file))),
badParts = Seq(),
missingFileParts = Seq())
}
/** shortcut for a request body containing a single file attachment */
case class WrappedFakeRequest[A](fr: FakeRequest[A]) {
def withFileUpload(key: String, file: File, contentType: String) = {
fr.withBody(fileUpload(key, file, contentType))
}
}
implicit def toWrappedFakeRequest[A](fr: FakeRequest[A]) = WrappedFakeRequest(fr)
}
class MyTest extends Specification with FakeMultipartUpload {
"uploading" should {
"be easier than this" in {
running(FakeApplication()) {
val uploadFile = new File("/tmp/file.txt")
val req = FakeRequest(POST, "/upload/path").
withFileUpload("image", uploadFile, "image/gif")
val response = route(req).get
status(response) must equalTo(OK)
}
}
}
}
I managed to get this working with Play 2.1 based on various mailing list suggestions. Here's how I do it:
import scala.language.implicitConversions
import java.io.{ ByteArrayOutputStream, File }
import org.apache.http.entity.mime.MultipartEntity
import org.apache.http.entity.mime.content.{ ContentBody, FileBody }
import org.specs2.mutable.Specification
import play.api.http.Writeable
import play.api.test.{ FakeApplication, FakeRequest }
import play.api.test.Helpers._
trait FakeMultipartUpload {
case class WrappedFakeRequest[A](fr: FakeRequest[A]) {
def withMultipart(parts: (String, ContentBody)*) = {
// create a multipart form
val entity = new MultipartEntity()
parts.foreach { part =>
entity.addPart(part._1, part._2)
}
// serialize the form
val outputStream = new ByteArrayOutputStream
entity.writeTo(outputStream)
val bytes = outputStream.toByteArray
// inject the form into our request
val headerContentType = entity.getContentType.getValue
fr.withBody(bytes).withHeaders(CONTENT_TYPE -> headerContentType)
}
def withFileUpload(fileParam: String, file: File, contentType: String) = {
withMultipart(fileParam -> new FileBody(file, contentType))
}
}
implicit def toWrappedFakeRequest[A](fr: FakeRequest[A]) = WrappedFakeRequest(fr)
// override Play's equivalent Writeable so that the content-type header from the FakeRequest is used instead of application/octet-stream
implicit val wBytes: Writeable[Array[Byte]] = Writeable(identity, None)
}
class MyTest extends Specification with FakeMultipartUpload {
"uploading" should {
"be easier than this" in {
running(FakeApplication()) {
val uploadFile = new File("/tmp/file.txt")
val req = FakeRequest(POST, "/upload/path").
withFileUpload("image", uploadFile, "image/gif")
val response = route(req).get
status(response) must equalTo(OK)
}
}
}
}
I've modified Alex's code to act as a Writeable, which integrates better with Play 2.2.2:
package test
import play.api.http._
import play.api.mvc.MultipartFormData.FilePart
import play.api.libs.iteratee._
import play.api.libs.Files.TemporaryFile
import play.api.mvc.{Codec, MultipartFormData }
import java.io.{FileInputStream, ByteArrayOutputStream}
import org.apache.commons.io.IOUtils
import org.apache.http.entity.mime.MultipartEntity
import org.apache.http.entity.mime.content._
object MultipartWriteable {
/**
* `Writeable` for multipart/form-data.
*
*/
implicit def writeableOf_multiPartFormData(implicit codec: Codec): Writeable[MultipartFormData[TemporaryFile]] = {
val entity = new MultipartEntity()
def transform(multipart: MultipartFormData[TemporaryFile]):Array[Byte] = {
multipart.dataParts.foreach { part =>
part._2.foreach { p2 =>
entity.addPart(part._1, new StringBody(p2))
}
}
multipart.files.foreach { file =>
val part = new FileBody(file.ref.file, file.filename, file.contentType.getOrElse("application/octet-stream"), null)
entity.addPart(file.key, part)
}
val outputStream = new ByteArrayOutputStream
entity.writeTo(outputStream)
val bytes = outputStream.toByteArray
outputStream.close
bytes
}
new Writeable[MultipartFormData[TemporaryFile]](transform, Some(entity.getContentType.getValue))
}
}
This way it is possible to write something like this:
val filePart:MultipartFormData.FilePart[TemporaryFile] = MultipartFormData.FilePart(...)
val fileParts:Seq[MultipartFormData.FilePart[TemporaryFile]] = Seq(filePart)
val dataParts:Map[String, Seq[String]] = ...
val multipart = new MultipartFormData[TemporaryFile](dataParts, fileParts, List(), List())
val request = FakeRequest(POST, "/url", FakeHeaders(), multipart)
var result = route(request).get
Following EEColor's suggestion, I got the following to work:
"Upload Photo" in {
val file = scala.io.Source.fromFile(getClass().getResource("/photos/DSC03024.JPG").getFile())(scala.io.Codec.ISO8859).map(_.toByte).toArray
val data = new MultipartFormData(Map(), List(
FilePart("qqfile", "DSC03024.JPG", Some("image/jpeg"),
file)
), List())
val result = controllers.Photo.upload()(FakeRequest(POST, "/admin/photos/upload",FakeHeaders(),data))
status(result) must equalTo(CREATED)
headers(result) must haveKeys(LOCATION)
contentType(result) must beSome("application/json")
}
Here's my version of Writeable[AnyContentAsMultipartFormData]:
import java.io.File
import java.nio.file.{Files, Paths}
import play.api.http.{HeaderNames, Writeable}
import play.api.libs.Files.TemporaryFile
import play.api.mvc.MultipartFormData.FilePart
import play.api.mvc.{AnyContentAsMultipartFormData, Codec, MultipartFormData}
object MultipartFormDataWritable {
val boundary = "--------ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
def formatDataParts(data: Map[String, Seq[String]]) = {
val dataParts = data.flatMap { case (key, values) =>
values.map { value =>
val name = s""""$key""""
s"--$boundary\r\n${HeaderNames.CONTENT_DISPOSITION}: form-data; name=$name\r\n\r\n$value\r\n"
}
}.mkString("")
Codec.utf_8.encode(dataParts)
}
def filePartHeader(file: FilePart[TemporaryFile]) = {
val name = s""""${file.key}""""
val filename = s""""${file.filename}""""
val contentType = file.contentType.map { ct =>
s"${HeaderNames.CONTENT_TYPE}: $ct\r\n"
}.getOrElse("")
Codec.utf_8.encode(s"--$boundary\r\n${HeaderNames.CONTENT_DISPOSITION}: form-data; name=$name; filename=$filename\r\n$contentType\r\n")
}
val singleton = Writeable[MultipartFormData[TemporaryFile]](
transform = { form: MultipartFormData[TemporaryFile] =>
formatDataParts(form.dataParts) ++
form.files.flatMap { file =>
val fileBytes = Files.readAllBytes(Paths.get(file.ref.file.getAbsolutePath))
filePartHeader(file) ++ fileBytes ++ Codec.utf_8.encode("\r\n")
} ++
Codec.utf_8.encode(s"--$boundary--")
},
contentType = Some(s"multipart/form-data; boundary=$boundary")
)
}
implicit val anyContentAsMultipartFormWritable: Writeable[AnyContentAsMultipartFormData] = {
MultipartFormDataWritable.singleton.map(_.mdf)
}
It's adapted from (and some bugs fixed): https://github.com/jroper/playframework/blob/multpart-form-data-writeable/framework/src/play/src/main/scala/play/api/http/Writeable.scala#L108
See the whole post here, if you are interested: http://tech.fongmun.com/post/125479939452/test-multipartformdata-in-play
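A hypothetical usage sketch (the route "/upload" and the form value are illustrative): with anyContentAsMultipartFormWritable in implicit scope, route() can serialize the multipart body of a FakeRequest. This assumes a WithApplication scope for app, a Play version where route(app, request) is available, and a MultipartFormData[TemporaryFile] built as in the other answers:

// `form` is a MultipartFormData[TemporaryFile] built elsewhere in the test
val request = FakeRequest(POST, "/upload").withMultipartFormDataBody(form)
val result = route(app, request).get

status(result) must equalTo(OK)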
For me, the best solution to this problem is Alex Varju's.
Here is a version updated for Play 2.5:
object FakeMultipartUpload {
  implicit def writeableOf_multiPartFormData(implicit codec: Codec): Writeable[AnyContentAsMultipartFormData] = {
    val builder = MultipartEntityBuilder.create().setBoundary("12345678")

    def transform(multipart: AnyContentAsMultipartFormData): ByteString = {
      multipart.mdf.dataParts.foreach { part =>
        part._2.foreach { p2 =>
          builder.addPart(part._1, new StringBody(p2, ContentType.create("text/plain", "UTF-8")))
        }
      }
      multipart.mdf.files.foreach { file =>
        val part = new FileBody(file.ref.file, ContentType.create(file.contentType.getOrElse("application/octet-stream")), file.filename)
        builder.addPart(file.key, part)
      }
      val outputStream = new ByteArrayOutputStream
      builder.build.writeTo(outputStream)
      ByteString(outputStream.toByteArray)
    }

    new Writeable(transform, Some(builder.build.getContentType.getValue))
  }
}
In Play 2.6.x you can write test cases in the following way to test a file upload API:
class HDFSControllerTest extends Specification {
  "HDFSController" should {
    "return 200 Status for file Upload" in new WithApplication {
      val tempFile = SingletonTemporaryFileCreator.create("txt", "csv")
      tempFile.deleteOnExit()
      val data = new MultipartFormData[TemporaryFile](Map(),
        List(FilePart("metadata", "text1.csv", Some("text/plain"), tempFile)), List())
      val res: Option[Future[Result]] = route(app, FakeRequest(POST, "/api/hdfs").withMultipartFormDataBody(data))
      print(contentAsString(res.get))
      res must beSome.which(status(_) == OK)
    }
  }
}
I made Alex's version compatible with Play 2.8:
import akka.util.ByteString
import java.io.ByteArrayOutputStream
import org.apache.http.entity.mime.content.StringBody
import org.apache.http.entity.ContentType
import org.apache.http.entity.mime.content.FileBody
import org.apache.http.entity.mime.MultipartEntityBuilder
import play.api.http.Writeable
import play.api.libs.Files.TemporaryFile
import play.api.mvc.Codec
import play.api.mvc.MultipartFormData
import play.api.mvc.MultipartFormData.FilePart
import play.api.test.FakeRequest
trait FakeMultipartUpload {
implicit def writeableOf_multiPartFormData(
implicit codec: Codec
): Writeable[MultipartFormData[TemporaryFile]] = {
val builder = MultipartEntityBuilder.create().setBoundary("12345678")
def transform(multipart: MultipartFormData[TemporaryFile]): ByteString = {
multipart.dataParts.foreach { part =>
part._2.foreach { p2 =>
builder.addPart(part._1, new StringBody(p2, ContentType.create("text/plain", "UTF-8")))
}
}
multipart.files.foreach { file =>
val part = new FileBody(
file.ref.file,
ContentType.create(file.contentType.getOrElse("application/octet-stream")),
file.filename
)
builder.addPart(file.key, part)
}
val outputStream = new ByteArrayOutputStream
builder.build.writeTo(outputStream)
ByteString(outputStream.toByteArray)
}
new Writeable(transform, Some(builder.build.getContentType.getValue))
}
/** shortcut for generating a MultipartFormData with one file part which more fields can be added to */
def fileUpload(
key: String,
file: TemporaryFile,
contentType: String
): MultipartFormData[TemporaryFile] = {
MultipartFormData(
dataParts = Map(),
files = Seq(FilePart[TemporaryFile](key, file.file.getName, Some(contentType), file)),
badParts = Seq()
)
}
/** shortcut for a request body containing a single file attachment */
case class WrappedFakeRequest[A](fr: FakeRequest[A]) {
def withFileUpload(key: String, file: TemporaryFile, contentType: String) = {
fr.withBody(fileUpload(key, file, contentType))
}
}
implicit def toWrappedFakeRequest[A](fr: FakeRequest[A]) = WrappedFakeRequest(fr)
}