Spark: How to write org.apache.spark.rdd.RDD[java.io.ByteArrayOutputStream] - scala

I have an RDD that has the signature
org.apache.spark.rdd.RDD[java.io.ByteArrayOutputStream]
In this RDD, each row has its own partition.
This ByteArrayOutputStream is zip output. I am applying some processing to the data in each partition, and I want to export the processed data from each partition as a single zip file. What is the best way to export each row in the final RDD as one file per row on HDFS?
In case you are interested, this is how I ended up with such an RDD:
val npyData = transformedTopData.select("tokenIDF", "topLevelId").rdd.repartition(2).mapPartitions(x => {
val vectors = for {
row <- x
} yield {
row.getAs[Vector](0)
}
Seq(ml2npyCSR(vectors.toSeq).zipOut)
}.iterator)
EDIT: Count works perfectly fine
scala> npyData.count()
res9: Long = 2

Spark has very little support for file system operations. You'll need to use the Hadoop FileSystem API to create the individual files:
// This method is needed as Hadoop conf object is not serializable
def createFileStream(pathStr:String) = {
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
val hadoopconf = new Configuration();
val fs = FileSystem.get(hadoopconf);
val outFileStream = fs.create(new Path(pathStr));
outFileStream
}
// Method writes to individual files.
// Needs a unique id along with object for output file naming
def writeToFile(x: (java.io.ByteArrayOutputStream, Long)): Unit = {
val (dataStream, id) = x
val output_dir = "/tmp/del_a/"
val outFileStream = createFileStream(output_dir+id)
dataStream.writeTo(outFileStream)
outFileStream.close()
}
// zipWithIndex used for creating unique id for each item in rdd
npyData.zipWithIndex().foreach(writeToFile)
Reference:
Hadoop FileSystem example
ByteArrayOutputStream.writeTo(java.io.OutputStream)
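The per-record write above comes down to calling ByteArrayOutputStream.writeTo on a freshly opened output stream. Here is a minimal local sketch of that step, assuming a local temp directory stands in for HDFS; writeEach and the part-N.zip naming are hypothetical, and on a cluster the stream would come from Hadoop's FileSystem.create instead of java.nio:

```scala
import java.io.{ByteArrayOutputStream, OutputStream}
import java.nio.file.{Files, Path}

// Hypothetical helper: writes each (buffer, id) pair to its own file under `dir`.
// Assumption: a local directory is used here instead of an HDFS path.
def writeEach(items: Iterator[(ByteArrayOutputStream, Long)], dir: Path): Unit =
  items.foreach { case (buffer, id) =>
    val out: OutputStream = Files.newOutputStream(dir.resolve(s"part-$id.zip"))
    try buffer.writeTo(out) // copies the buffered bytes to the target stream
    finally out.close()     // close the stream even if the write fails
  }

// Two small in-memory buffers stand in for the zipped partitions.
val dir = Files.createTempDirectory("zip-out")
val buffers = Seq("first", "second").map { s =>
  val b = new ByteArrayOutputStream()
  b.write(s.getBytes("UTF-8"))
  b
}
writeEach(buffers.iterator.zipWithIndex.map { case (b, i) => (b, i.toLong) }, dir)
```

The try/finally matters more on a cluster than locally: a stream left open after a failed write can leave a partial file behind.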

I figured out that I should represent my data as a PairRDD and implement a custom FileOutputFormat. I looked into the implementation of SequenceFileOutputFormat for inspiration and managed to write my own version based on that.
My custom FileOutputFormat is available here

Related

How to convert RDD[GenericRecord] to dataframe in scala?

I get tweets from a Kafka topic with Avro (serializer and deserializer).
Then I create a Spark consumer which extracts the tweets as a DStream of RDD[GenericRecord].
Now I want to convert each RDD to a DataFrame to analyse these tweets via SQL.
Is there a way to convert an RDD[GenericRecord] to a DataFrame?
I spent some time trying to make this work (especially working out how to deserialize the data properly, but it looks like you have already covered that). UPDATED:
// Define a function to convert from GenericRecord to Row
def genericRecordToRow(record: GenericRecord, sqlType: SchemaConverters.SchemaType): Row = {
  import scala.collection.JavaConverters._
  val fields = record.getSchema.getFields.asScala
  val objectArray = new Array[Any](fields.size)
  for (field <- fields) {
    objectArray(field.pos) = record.get(field.pos)
  }
  new GenericRowWithSchema(objectArray, sqlType.dataType.asInstanceOf[StructType])
}
// Inside your stream's foreachRDD
val yourGenericRecordRDD = ...
val schema = new Schema.Parser().parse(...) // your schema
val sqlType = SchemaConverters.toSqlType(schema)
val rowRDD = yourGenericRecordRDD.map(record => genericRecordToRow(record, sqlType))
val df = sqlContext.createDataFrame(rowRDD, sqlType.dataType.asInstanceOf[StructType])
As you can see, I am using SchemaConverters to get the DataFrame structure from the schema that you used to deserialize (this could be more painful with a schema registry). For this you need the following dependency:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>3.2.0</version>
</dependency>
You will need to change the version to match your Spark version.
UPDATE: the code above only works for flat avro schemas.
For nested structures I used something different. You can copy the class SchemaConverters, it has to be inside of com.databricks.spark.avro (it uses some protected classes from the databricks package) or you can try to use the spark-bigquery dependency. The class will not be accessible by default, so you will need to create a class inside a package com.databricks.spark.avro to access the factory method.
package com.databricks.spark.avro
import com.databricks.spark.avro.SchemaConverters.createConverterToSQL
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
object SchemaConverterUtils {
def converterSql(schema : Schema, sqlType : StructType) = {
createConverterToSQL(schema, sqlType)
}
}
After that, you should be able to convert the data like this (Try comes from scala.util):
val schema = .. // your schema
val sqlType = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
....
//inside foreach RDD
var genericRecordRDD = deserializeAvroData(rdd)
///
var converter = SchemaConverterUtils.converterSql(schema, sqlType)
...
val rowRdd = genericRecordRDD.flatMap(record => {
Try(converter(record).asInstanceOf[Row]).toOption
})
//To DataFrame
val df = sqlContext.createDataFrame(rowRdd, sqlType)
A combination of https://stackoverflow.com/a/48828303/5957143 and https://stackoverflow.com/a/47267060/5957143 works for me.
I used the following to create MySchemaConversions
package com.databricks.spark.avro
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.DataType
object MySchemaConversions {
def createConverterToSQL(avroSchema: Schema, sparkSchema: DataType): (GenericRecord) => Row =
SchemaConverters.createConverterToSQL(avroSchema, sparkSchema).asInstanceOf[(GenericRecord) => Row]
}
And then I used
val myAvroType = SchemaConverters.toSqlType(schema).dataType
val myAvroRecordConverter = MySchemaConversions.createConverterToSQL(schema, myAvroType)
// unionedResultRdd is unionRDD[GenericRecord]
var rowRDD = unionedResultRdd.map(record => MyObject.myConverter(record, myAvroRecordConverter))
val df = sparkSession.createDataFrame(rowRDD , myAvroType.asInstanceOf[StructType])
The advantage of having myConverter in the object MyObject is that you will not encounter serialization issues (java.io.NotSerializableException).
object MyObject{
def myConverter(record: GenericRecord,
myAvroRecordConverter: (GenericRecord) => Row): Row =
myAvroRecordConverter.apply(record)
}
Something like this may help you:
val stream = ...
val dfStream = stream.transform((rdd: RDD[GenericRecord]) => {
val df = rdd.map(_.toSeq)
.map(seq=> Row.fromSeq(seq))
.toDF(col1,col2, ....)
df
})
I'd like to suggest an alternative approach. With Spark 2.x you can skip the whole process of creating DStreams. Instead, you can do something like this with Structured Streaming:
val df = ss.readStream
.format("com.databricks.spark.avro")
.load("/path/to/files")
This will give you a single dataframe which you can directly query. Here, ss is the instance of spark session. /path/to/files is the place where all your avro files are being dumped from kafka.
PS: You may need to import spark-avro
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
Hope this helped. Cheers
You can use createDataFrame(rowRDD: RDD[Row], schema: StructType), which is available in the SQLContext object. Example for converting an RDD of an old DataFrame:
import sqlContext.implicits._
val rdd = oldDF.rdd
val newDF = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)
Note that there is no need to explicitly set any schema column. We reuse the old DF's schema, which is of StructType class and can be easily extended. However, this approach sometimes is not possible, and in some cases can be less efficient than the first one.

How to write Iterator[String] result from mapPartitions into one file?

I am new to Spark and Scala, which is why I am having quite a hard time getting through this.
What I intend to do is pre-process my data with Stanford CoreNLP using Spark. I understand that I have to use mapPartitions in order to have one StanfordCoreNLP instance per partition, as suggested in this thread. However, I lack the knowledge to proceed from here.
In the end I want to train word vectors on this data but for now I would be happy to find out how I can get my processed data from here and write it into another file.
This is what I got so far:
import java.util.Properties
import com.google.gson.Gson
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.util.CoreMap
import masterthesis.code.wordvectors.Review
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConversions._
object ReviewPreprocessing {
def main(args: Array[String]) {
val resourceUrl = getClass.getResource("amazon-reviews/reviews_Electronics.json")
val file = sc.textFile(resourceUrl.getPath)
val linesPerPartition = file.mapPartitions( lineIterator => {
val props = new Properties()
props.put("annotators", "tokenize, ssplit, pos, lemma")
val sentencesAsTextList : List[String] = List()
val pipeline = new StanfordCoreNLP(props)
val gson = new Gson()
while(lineIterator.hasNext) {
val line = lineIterator.next
val review = gson.fromJson(line, classOf[Review])
val doc = new Annotation(review.getReviewText)
pipeline.annotate(doc)
val sentences : java.util.List[CoreMap] = doc.get(classOf[SentencesAnnotation])
val sb = new StringBuilder();
sentences.foreach( sentence => {
val tokens = sentence.get(classOf[TokensAnnotation])
tokens.foreach( token => {
sb.append(token.get(classOf[LemmaAnnotation]))
sb.append(" ")
})
})
sb.setLength(sb.length - 1)
sentencesAsTextList.add(sb.toString)
}
sentencesAsTextList.iterator
})
System.exit(0)
}
}
How would I e.g. write this result into one single file? The ordering does not matter here - I guess the ordering is lost at this point anyway.
If you use saveAsTextFile directly on your RDD, you end up with as many output files as you have partitions. To get just one, you can either coalesce everything into one partition, like
sc.textFile("/path/to/file")
.mapPartitions(someFunc())
.coalesce(1)
.saveAsTextFile("/path/to/another/file")
Or (just for fun) you could get all partitions to driver one by one and save all data yourself.
val it = sc.textFile("/path/to/file")
.mapPartitions(someFunc())
.toLocalIterator
while(it.hasNext) {
writeToFile(it.next())
}
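The writeToFile call in the loop above is left undefined. One way to fill it in is a helper that takes the whole iterator and streams it into one file. This is only a sketch: writeLines and the temp file are hypothetical, and with Spark you would pass in the result of toLocalIterator rather than a literal Iterator:

```scala
import java.io.PrintWriter
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper: streams lines into a single file, one line at a time,
// so the full result never has to fit in driver memory at once.
def writeLines(lines: Iterator[String], target: Path): Long = {
  val writer = new PrintWriter(Files.newBufferedWriter(target, StandardCharsets.UTF_8))
  var count = 0L
  try lines.foreach { line => writer.println(line); count += 1 }
  finally writer.close() // flushes and closes even if iteration fails
  count
}

// A literal iterator stands in for rdd.toLocalIterator here.
val target = Files.createTempFile("lemmas", ".txt")
val written = writeLines(Iterator("first review", "second review"), target)
```

Opening the file once and keeping the writer open for the whole iteration avoids reopening the file for every line.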

How to add source file name to each row in Spark?

I'm new to Spark and am trying to insert a column to each input row with the file name that it comes from.
I've seen others ask a similar question, but all the answers used wholeTextFiles, and I'm trying to do this for larger CSV files (read using the Spark-CSV library), JSON files, and Parquet files (not just small text files).
I can use the spark-shell to get a list of the filenames:
val df = sqlContext.read.parquet("/blah/dir")
val names = df.select(inputFileName())
names.show
but that's a dataframe.
I am not sure how to add it as a column to each row (or whether that result is ordered the same as the initial data, though I assume it always is), or how to do this as a general solution for all input types.
Another solution I just found adds the file name as one of the columns in the DataFrame:
import org.apache.spark.sql.functions.input_file_name
val df = sqlContext.read.parquet("/blah/dir")
val dfWithCol = df.withColumn("filename", input_file_name())
Ref:
spark load data and add filename as dataframe column
When you create a RDD from a text file, you probably want to map the data into a case class, so you could add the input source in that stage:
case class Person(inputPath: String, name: String, age: Int)
val inputPath = "hdfs://localhost:9000/tmp/demo-input-data/persons.txt"
val rdd = sc.textFile(inputPath).map {
l =>
val tokens = l.split(",")
Person(inputPath, tokens(0), tokens(1).trim().toInt)
}
rdd.collect().foreach(println)
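The map above throws on any line that does not split into a name and a numeric age. A slightly more defensive variant, as a sketch only: parsePerson is a hypothetical helper, shown on a local Seq so it can be tried without a cluster. With Spark it could be used as sc.textFile(inputPath).flatMap(parsePerson(inputPath, _)).

```scala
import scala.util.Try

case class Person(inputPath: String, name: String, age: Int)

// Hypothetical helper: parses one "name,age" line and tags it with its source path.
// Returns None for lines without exactly two fields or with a non-numeric age.
def parsePerson(inputPath: String, line: String): Option[Person] =
  line.split(",").map(_.trim) match {
    case Array(name, age) => Try(age.toInt).toOption.map(Person(inputPath, name, _))
    case _                => None
  }

val path = "hdfs://localhost:9000/tmp/demo-input-data/persons.txt"
// The malformed middle line is dropped instead of failing the whole job.
val parsed = Seq("Ann, 31", "broken line", "Bob,45").flatMap(parsePerson(path, _))
```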
If you do not want to mix "business data" with meta data:
case class InputSourceMetaData(path: String, size: Long)
case class PersonWithMd(name: String, age: Int, metaData: InputSourceMetaData)
// Fake the size, for demo purposes only
val md = InputSourceMetaData(inputPath, size = -1L)
val rdd = sc.textFile(inputPath).map {
l =>
val tokens = l.split(",")
PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
rdd.collect().foreach(println)
and if you promote the RDD to a DataFrame:
import sqlContext.implicits._
val df = rdd.toDF()
df.registerTempTable("x")
you can query it like
sqlContext.sql("select name, metadata from x").show()
sqlContext.sql("select name, metadata.path from x").show()
sqlContext.sql("select name, metadata.path, metadata.size from x").show()
Update
You can read the files in HDFS using org.apache.hadoop.fs.FileSystem.listFiles() recursively.
Given a list of file names in a value files (standard Scala collection containing org.apache.hadoop.fs.LocatedFileStatus), you can create one RDD for each file:
val rdds = files.map { f =>
val md = InputSourceMetaData(f.getPath.toString, f.getLen)
sc.textFile(md.path).map {
l =>
val tokens = l.split(",")
PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
}
Now you can reduce the list of RDDs into a single one. The function passed to reduce concatenates the RDDs pairwise:
val rdd = rdds.reduce(_ ++ _)
rdd.collect().foreach(println)
This works, but I cannot test if this distributes/performs well with large files.

How can one list all csv files in an HDFS location within the Spark Scala shell?

The purpose of this is to manipulate each data file and save a copy of it in a second location in HDFS. I will be using
RddName.coalesce(1).saveAsTextFile(pathName)
to save the result to HDFS.
This is why I want to do each file separately even though I am sure the performance will not be as efficient. However, I have yet to determine how to store the list of CSV file paths into an array of strings and then loop through each one with a separate RDD.
Let us use the following anonymous example as the HDFS source locations:
/data/email/click/date=2015-01-01/sent_20150101.csv
/data/email/click/date=2015-01-02/sent_20150102.csv
/data/email/click/date=2015-01-03/sent_20150103.csv
I know how to list the file paths using Hadoop FS Shell:
hdfs dfs -ls /data/email/click/*/*.csv
I know how to create one RDD for all the data:
val sentRdd = sc.textFile( "/data/email/click/*/*.csv" )
I haven't tested it thoroughly but something like this seems to work:
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.hadoop.fs.{FileSystem, Path, LocatedFileStatus, RemoteIterator}
import java.net.URI
val path: String = ???
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hconf)
val iter = hdfs.listFiles(new Path(path), false)
def listFiles(iter: RemoteIterator[LocatedFileStatus]) = {
def go(iter: RemoteIterator[LocatedFileStatus], acc: List[URI]): List[URI] = {
if (iter.hasNext) {
val uri = iter.next.getPath.toUri
go(iter, uri :: acc)
} else {
acc
}
}
go(iter, List.empty[java.net.URI])
}
listFiles(iter).filter(_.toString.endsWith(".csv"))
This is what ultimately worked for me:
import org.apache.hadoop.fs._
import org.apache.spark.deploy.SparkHadoopUtil
import java.net.URI
val hdfs_conf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hdfs_conf)
// source data in HDFS
val sourcePath = new Path("/<source_location>/<filename_pattern>")
hdfs.globStatus( sourcePath ).foreach{ fileStatus =>
val filePathName = fileStatus.getPath().toString()
val fileName = fileStatus.getPath().getName()
// < DO STUFF HERE>
} // end foreach loop
sc.wholeTextFiles(path) should help. It gives an rdd of (filepath, filecontent).

Find size of data stored in rdd from a text file in apache spark

I am new to Apache Spark (version 1.4.1). I wrote a small program that reads a text file and stores its data in an RDD.
Is there a way to get the size of the data in the RDD?
This is my code:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator
import org.apache.spark.sql.Row
object RddSize {
def main(args: Array[String]) {
val sc = new SparkContext("local", "data size")
val FILE_LOCATION = "src/main/resources/employees.csv"
val peopleRdd = sc.textFile(FILE_LOCATION)
val newRdd = peopleRdd.filter(str => str.contains(",M,"))
//Here I want to find whats the size remaining data
}
}
I want to get size of data before filter Transformation (peopleRdd) and after it (newRdd).
There are multiple ways to get the RDD size:
1. Add a Spark listener to your Spark context:
sc.addSparkListener(new SparkListener() {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val rddInfos = stageCompleted.stageInfo.rddInfos
    rddInfos.foreach(info => {
      println("rdd memSize " + info.memSize)
      println("rdd diskSize " + info.diskSize)
    })
  }
})
2. Save your RDD as a text file
myRDD.saveAsTextFile("person.txt")
and call the Apache Spark REST API:
/applications/[app-id]/stages
3. You can also try SizeEstimator (it estimates the JVM heap size of the object passed to it, here the RDD object on the driver):
val rddSize = SizeEstimator.estimate(myRDD)
I'm not sure you need to do this. You could cache the RDD and check its size in the Spark UI. But let's say that you do want to do this programmatically; here is a solution.
def calcRDDSize(rdd: RDD[String]): Long = {
//map to the size of each string, UTF-8 is the default
rdd.map(_.getBytes("UTF-8").length.toLong)
.reduce(_+_) //add the sizes together
}
You can then call this function for your two RDDs:
println(s"peopleRdd is [${calcRDDSize(peopleRdd)}] bytes in size")
println(s"newRdd is [${calcRDDSize(newRdd)}] bytes in size")
This solution should work even if the file size is larger than the memory available in the cluster.
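To see what calcRDDSize actually counts, here is the same per-line computation on a local collection. This is a sketch with made-up sample rows; utf8Bytes is a hypothetical helper. Two things to keep in mind: textFile strips line terminators, so newline bytes are not included, and RDD.reduce throws on an empty RDD, so if the filter can remove every row, fold(0L)(_ + _) may be the safer choice.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper: the same per-line measure that calcRDDSize maps over the RDD.
def utf8Bytes(line: String): Long =
  line.getBytes(StandardCharsets.UTF_8).length.toLong

// Made-up sample rows; the third contains one non-ASCII character (2 bytes in UTF-8).
val lines = Seq("1,Ann,F,", "2,Bob,M,", "3,Zo\u00eb,F,")

// Size before and after the same filter as in the question.
val total = lines.map(utf8Bytes).sum
val filtered = lines.filter(_.contains(",M,")).map(utf8Bytes).sum
```

The character count and the byte count differ as soon as the data contains non-ASCII text, which is why the function measures encoded bytes rather than String.length.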
The Spark API doc says that:
You can get info about your RDDs from the Spark context: sc.getRDDStorageInfo
The RDD info includes memory and disk size: RDDInfo doc