I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
The StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced, I need to generate a new CSV with the complete details, like:
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map { student =>
  val name = getName(student.StudentId)
  val course = getCourse(student.StudentId)
  Array(name, course, student.City)
}.mapPartitions { data =>
  val stringWriter = new StringWriter()
  val csvWriter = new CSVWriter(stringWriter)
  csvWriter.writeAll(data.toList)
  Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)
//Functions defined to get details
def getName(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.StudentName }
}
def getCourse(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.Course }
}
Problem
The file gets generated, but the values are object representations instead of String values.
How can I get the string values instead of objects?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames and writing CSV files.
However, if you insist on staying with the RDD API, the main issue with your code is the lookup functions: getName and getCourse effectively return nothing useful, because they are declared with procedure syntax (no = before the body), so their return type is Unit; on top of that, using an if without an else means that for some inputs there is no value to return at all.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, would make lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
    val details = studentB.value(student.StudentId)
    Array(details.StudentName, details.Course, student.City)
  }
  .map(_.mkString(",")) // naive CSV writing with no escaping etc., you can also use CSVWriter like you did

// save as text file
resultStrings.saveAsTextFile(outputPath)
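For completeness, here is a minimal end-to-end sketch of this approach under the assumption that the raw text lines are first parsed into simple case classes (Student and StudentDetails are hypothetical names introduced here, the file path variables are placeholders, and the CSV parsing is a naive split that just skips the header row):

case class Student(studentId: String, city: String)
case class StudentDetails(studentId: String, name: String, course: String)

// parse the raw CSV lines (naive split, no quoting support), dropping the header row
val students = sc.textFile(studentPath)
  .filter(!_.startsWith("StudentId"))
  .map(_.split(","))
  .map(a => Student(a(0), a(1)))

val details = sc.textFile(studentDetailsPath)
  .filter(!_.startsWith("StudentId"))
  .map(_.split(","))
  .map(a => StudentDetails(a(0), a(1), a(2)))

// broadcast a Map keyed by StudentId so lookups on the executors are cheap
val detailsB = sc.broadcast(details.map(d => d.studentId -> d).collectAsMap())

students.map { s =>
  val d = detailsB.value(s.studentId)
  Seq(d.name, d.course, s.city).mkString(",")
}.saveAsTextFile(outputPath)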
Spark has great support for joining and writing to files: the join takes one line of code, and so does the write.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
val df1 = Seq((101, "NDLS"),
              (102, "Mumbai")
         ).toDF("id", "city")

val df2 = Seq((101, "ABC", "C001"),
              (102, "XYZ", "C002")
         ).toDF("id", "name", "course")

val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
A directory named hello.csv will be created; inside it there is a single part file, which holds the final result.
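If you would rather read the original files instead of hard-coding the rows, a sketch along these lines should work (spark is the SparkSession and the paths are placeholders):

// read both CSVs, using the header row for column names
val students = spark.read.option("header", "true").csv("Student.csv")
val details = spark.read.option("header", "true").csv("StudentDetails.csv")

// join on StudentId and keep only the requested columns, in the requested order
val result = students.join(details, "StudentId")
  .select("StudentName", "Course", "City")

// write a single CSV part file (no header, to match the expected output)
result.repartition(1).write.csv(outputPath)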
Related
I am trying to write data to a single file using Spark and Scala:
while (loop > 0) {
  val getReq = new HttpGet(ww.url.com)
  val httpResponse = client.execute(getReq)
  val data = Source.fromInputStream(httpResponse.getEntity.getContent()).getLines.mkString
  val parser = JSON.parseFull(data)
  val globalMap = parser.get.asInstanceOf[Map[String, Any]]
  val reviewMap = globalMap.get("payload").get.asInstanceOf[Map[String, Any]]
  val df = context.sparkContext.parallelize(Seq(reviewMap.get("records").get.toString())).toDF()
  if (startIndex == 0) {
    df.coalesce(1).write.mode(SaveMode.Overwrite).json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")
  } else {
    df.coalesce(1).write.mode(SaveMode.Append).json("C:\\Users\\kh\\Desktop\\spark\\raw\\data\\final")
  }
  startIndex = startIndex + limit
  loop = loop - 1
  httpResponse.close()
}
The number of files created equals the number of loop iterations, but I want to create only one file.
It is also creating CRC files, which I want to remove.
I tried the configuration below, but it only stops the creation of the _SUCCESS files:
.config("dfs.client.read.shortcircuit.skip.checksum", "true")
.config("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
.config("fs.file.impl.disable.cache", true)
Any ideas on how to create only one file, without the CRC and _SUCCESS files?
Re: "The number of file created is the number of loops"
Even though you are using df.coalesce(1) in your code, it is still being executed as many number of times as you run the while loop.
I want to create one file only
From your code it seems that you are trying to invoke HTTP GET requests to some URL and save the content after parsing.
If this understanding is right then I believe you should not be using a while loop to do this task. There is map transformation that you could use in the following manner.
Please find below the Psuedo-Code for your reference.
val urls = List("a.com", "b.com", "c.com")
val sourcedf = sparkContext.parallelize(urls).toDF

// this could be map or flatMap based on your requirement
val yourprocessedDF = sourcedf.map(<<do your parsing here and emit data>>)

yourprocessedDF.repartition(1).write(<<whichever format you need>>)
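A minimal sketch of that idea, assuming the paged requests can be listed up front (baseUrl, loop, limit and outputPath are placeholders/variables from the question, and the HTTP client is created per partition so that nothing non-serializable is captured by the closure):

import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.HttpClients
import org.apache.spark.sql.SaveMode
import scala.io.Source
import scala.util.parsing.json.JSON

// build the list of page URLs up front instead of looping on the driver
val urls = (0 until loop).map(i => s"${baseUrl}?start=${i * limit}&limit=${limit}")

val records = spark.sparkContext.parallelize(urls)
  .mapPartitions { it =>
    // one HTTP client per partition, created on the executor rather than the driver
    val client = HttpClients.createDefault()
    val results = it.map { url =>
      val response = client.execute(new HttpGet(url))
      val body = Source.fromInputStream(response.getEntity.getContent).mkString
      response.close()
      // extract the "records" payload, as in the original loop
      val parsed = JSON.parseFull(body).get.asInstanceOf[Map[String, Any]]
      parsed("payload").asInstanceOf[Map[String, Any]]("records").toString
    }.toList
    client.close()
    results.iterator
  }

import spark.implicits._
// a single write produces one output directory (no per-iteration appends)
records.toDF("records").coalesce(1).write.mode(SaveMode.Overwrite).json(outputPath)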
I have a code for tokenizing a string.
But that tokenization method uses some data which is loaded when my application starts.
val stopwords = getStopwords();
val tokens = tokenize("hello i am good",stopwords)
def tokenize(string: String, stopwords: List[String]): List[String] = {
  val splitted = string.split(" ")
  // I use the stopwords to filter the splitted array,
  // then return the remaining items, e.g.:
  splitted.filterNot(stopwords.contains).toList
}
Now I want to make the tokenize method a UDF for Spark, so I can use it to create a new column in DataFrame transformations.
I have created simple UDFs before, but they had no dependencies such as data that needs to be read from a text file.
Can someone tell me how to do this kind of operation?
This is what I have tried, and it is working.
val moviesDF = Seq(
  ("kingdomofheaven"),
  ("enemyatthegates"),
  ("salesinfointheyearofdecember")
).toDF("column_name")

val tokenizeUDF: UserDefinedFunction = udf(tokenize(_: String): List[String])

moviesDF.withColumn("tokenized", tokenizeUDF(col("column_name"))).show(100, false)
def tokenize(name: String): List[String] = {
  val wordFreqMap: Map[String, Double] = DataProviderUtil.getWordFreqMap()
  val stopWords: Set[String] = DataProviderUtil.getStopWordSet()
  val maxLengthWord: Int = wordFreqMap.keys.maxBy(_.length).length
  .................
  .................
}
It is giving me the expected output:
+----------------------------+--------------------------+
|column_name                 |tokenized                 |
+----------------------------+--------------------------+
|kingdomofheaven             |[kingdom, heaven]         |
|enemyatthegates             |[enemi, gate]             |
|salesinfointheyearofdecember|[sale, info, year, decemb]|
+----------------------------+--------------------------+
Now my question is: will it work when it is deployed? Currently I am running it locally. My main concern is that this function reads from a file to get information like the stopwords, word frequencies etc. needed for the tokenization. Will registering it like this work properly?
If you deploy this code as it is, Spark will try to serialize your DataProviderUtil, so you would need to mark that class as Serializable. Another possibility is to declare your logic inside an object: functions inside objects behave like static functions, and the object is not shipped as part of the serialized closure (a sketch of that approach follows below).
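A minimal sketch of the object-based approach (DataProviderUtil, getStopWordSet and getWordFreqMap are the names from the question; the tokenization body shown here is only a placeholder). The lazy vals are initialized once per JVM, the first time the UDF runs on a given executor:

import org.apache.spark.sql.functions.{col, udf}

object Tokenizer {
  // loaded lazily, once per JVM (driver and each executor), not shipped with the closure
  private lazy val stopWords: Set[String] = DataProviderUtil.getStopWordSet()
  private lazy val wordFreqMap: Map[String, Double] = DataProviderUtil.getWordFreqMap()

  def tokenize(name: String): List[String] = {
    // ... your existing splitting logic, using stopWords and wordFreqMap ...
    name.split(" ").filterNot(stopWords.contains).toList
  }

  val tokenizeUDF = udf(tokenize _)
}

moviesDF.withColumn("tokenized", Tokenizer.tokenizeUDF(col("column_name"))).show(false)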
I am building a dataframe which is going to have a String column that is actually a structure turned into a JSON string.
val df = df_helper.select(
  lit("some data").as("id"),
  to_json(
    struct(
      col("id"),
      col("type"),
      col("path")
    )
  ).as("content")
)
I also built a function, that takes the identifier id:String as a parameter and spits out a string list.
def buildHierarchy(id: String): List[String] = {
  val check_df = hierarchy_df.select(
    $"parent_id",
    $"id"
  ).where($"id" === id)

  val pathArray = List(id)
  val parentString = check_df.select($"parent_id").first.getString(0)

  if (parentString == null) {
    pathArray
  } else {
    val pathList = buildHierarchy(parentString)
    val finalList: List[String] = pathList ++ pathArray
    finalList
  }
}
I want to call this function and replace the path column with the result of the function. Is this possible, or is there a workaround?
Thank you in advance!
With the help of the following blog post I built up the hierarchy:
https://www.qubole.com/blog/processing-hierarchical-data-using-spark-graphx-pregel-api/
I included in my final dataframe all the information I need to build up the details for each element of the structure, along with the hierarchy path for each element. Then, as a new column, I created a structure with the details for each element.
val content_df = hierarchy_df
  // keep id and type as well: the struct below and the later joins need them
  .select($"id", $"type", $"path")
  .withColumn("content",
    struct(
      col("id"),
      col("type"),
      col("path")
    ))
I exploded the path to reach each identifier in it, while preserving the position order, so that I can join the content for each level of the path.
val exploded_df = content_df
  .select($"*", posexplode($"path"))
  // posexplode produces the columns "pos" and "col"; alias them in a second step
  .withColumn("exploded_level", $"pos")
  .withColumn("exploded_id", $"col")
Then, finally, join the content with the path and aggregate the content into a list that holds the content for every level of the path.
val level_content_df = exploded_df.as("e")
  .join(content_df.as("ouc"), col("e.col") === col("ouc.id"), "left")
  .select($"e.*", $"ouc.content".as("content"))

val pathFull_df = level_content_df
  .groupBy(col("id").as("id"))
  .agg(sort_array(collect_list("content")).as("pathFull"))
And then finally I just put the whole thing together again :)
val content_with_path_df = content_df.as("ou")
  .join(pathFull_df.as("opf"), col("ou.id") === col("opf.id"), "left")
  .select($"ou.*", $"opf.pathFull".as("pathFull"))
Feel free to reach out if it doesn't make sense, it took a while for me to make it happen! :D
I want to ingest many small text files via spark to parquet. Currently, I use wholeTextFiles and perform some parsing additionally.
To be more precise: these small text files are ESRI ASCII Grid files, each with a maximum size of around 400 KB. GeoTools is used to parse them, as outlined below.
Do you see any optimization possibilities? Maybe something to avoid the creation of unnecessary objects? Or something to better handle the small files. I wonder if it is better to only get the paths of the files and manually read them instead of using String -> ByteArrayInputStream.
case class RawRecords(path: String, content: String)
case class GeometryId(idPath: String, value: Double, geo: String)

@transient lazy val extractor = new PolygonExtractionProcess()
@transient lazy val writer = new WKTWriter()

def readRawFiles(path: String, parallelism: Int, spark: SparkSession) = {
  import spark.implicits._
  spark.sparkContext
    .wholeTextFiles(path, parallelism)
    .toDF("path", "content")
    .as[RawRecords]
    .mapPartitions(mapToSimpleTypes)
}
def mapToSimpleTypes(iterator: Iterator[RawRecords]): Iterator[GeometryId] = iterator.flatMap(r => {
  val extractor = new PolygonExtractionProcess()

  // http://docs.geotools.org/latest/userguide/library/coverage/arcgrid.html
  val readRaster = new ArcGridReader(new ByteArrayInputStream(r.content.getBytes(StandardCharsets.UTF_8))).read(null)

  // TODO maybe consider optimization of known size instead of using growable data structure
  val vectorizedFeatures = extractor.execute(readRaster, 0, true, null, null, null, null).features
  val result: collection.Seq[GeometryId] with Growable[GeometryId] = mutable.Buffer[GeometryId]()

  while (vectorizedFeatures.hasNext) {
    val vectorizedFeature = vectorizedFeatures.next()
    val geomWKTLineString = vectorizedFeature.getDefaultGeometry match {
      case g: Geometry => writer.write(g)
    }
    val geomUserdata = vectorizedFeature.getAttribute(1).asInstanceOf[Double]
    result += GeometryId(r.path, geomUserdata, geomWKTLineString)
  }
  result
})
I have a few suggestions:
Use wholeTextFiles -> mapPartitions -> convert to Dataset, rather than converting to a Dataset first and calling mapPartitions on it. Why? When you call mapPartitions on a Dataset, every row is converted from the internal format to an object, which causes additional serialization.
Run Java Mission Control and sample your application; it will show you the compilations and the execution times of your methods.
Maybe you can use binaryFiles: it gives you a stream per file, so you can parse the content directly in mapPartitions without the additional read (see the sketch below).
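A minimal sketch of the binaryFiles variant, reusing the parsing from the question (it assumes ArcGridReader accepts an InputStream, and extractAll is a hypothetical helper standing in for the vectorization while-loop above; GeoTools/JTS imports as in the original snippet):

def readRawFilesBinary(path: String, parallelism: Int, spark: SparkSession) = {
  import spark.implicits._
  spark.sparkContext
    .binaryFiles(path, parallelism)              // RDD[(String, PortableDataStream)]
    .mapPartitions { iter =>
      // per-partition instances, so nothing non-serializable is captured from the driver
      val extractor = new PolygonExtractionProcess()
      val writer = new WKTWriter()
      iter.flatMap { case (filePath, stream) =>
        // read the grid straight from the stream, no intermediate String / ByteArrayInputStream
        val raster = new ArcGridReader(stream.open()).read(null)
        extractAll(extractor, writer, filePath, raster) // hypothetical helper: the while loop above
      }
    }
    .toDS()
}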
I have a function that I want to apply to every row of a .csv file:
def convert(inString: Array[String]): String = {
  val country = inString(0)
  val sellerId = inString(1)
  val itemID = inString(2)
  try {
    val minidf = sqlContext.read.json(sc.makeRDD(inString(3) :: Nil))
      .withColumn("country", lit(country))
      .withColumn("seller_id", lit(sellerId))
      .withColumn("item_id", lit(itemID))
    val finalString = minidf.toJSON.collect().mkString(",")
    finalString
  } catch {
    case e: Exception =>
      println("AN EXCEPTION " + inString.mkString(","))
      "this is an exception " + e + " " + inString.mkString(",")
  }
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
Where I have 4 columns, the 4th being a json blob, into
[{"country":"CA", "seller":112578240", "product":112578240, "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the json object where the first 3 columns have been inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throws a java.lang.NullPointerException.
I included a try/catch clause to see exactly where this is failing, and it is failing for every single row.
What am I doing wrong here?
You cannot use sqlContext or sparkContext inside a Spark map, since those objects can only exist on the driver node: essentially, they are in charge of distributing your tasks, while the function you pass to map runs on the executors.
You could rewrite the JSON-parsing bit in pure Scala using one of the libraries listed here: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/
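As an illustration only, here is a sketch of a driver-free convert using circe as the JSON library (the dependency, and merging the three columns into every object of the array, are assumptions on my part):

import io.circe.Json
import io.circe.parser.parse

def convert(inString: Array[String]): String = {
  val country = inString(0)
  val sellerId = inString(1)
  val itemID = inString(2)

  // the three leading columns, to be merged into each JSON object of the 4th column
  val extras = Json.obj(
    "country" -> Json.fromString(country),
    "seller_id" -> Json.fromString(sellerId),
    "item_id" -> Json.fromString(itemID)
  )

  parse(inString(3)) match {
    case Right(json) =>
      val objects = json.asArray.getOrElse(Vector(json)).map(_.deepMerge(extras))
      Json.arr(objects: _*).noSpaces
    case Left(err) =>
      "this is an exception " + err + " " + inString.mkString(",")
  }
}

// now the map runs entirely on the executors, without touching sqlContext
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(convert)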