I have an instance of scala.io.BufferedSource (retrieved from scala.io.Source) and want to get raw bytes out of it. Most of the answers found on the internet use getLines method which disregards new-line delimiters. I need to retrieve the contents as-is and the API seems rather complicated. What is the easiest way to do that?
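You can do something like this. Note that Source decodes bytes into characters, so this only returns the original bytes when a single-byte encoding such as ISO-8859-1 is used:
import java.net.URI
import scala.io.{BufferedSource, Source}
val bs: BufferedSource = Source.fromURL(new URI("https://google.com").toURL, "ISO-8859-1")
val result: Array[Byte] = bs.map(_.toByte).toArray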
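To illustrate the answer above, here is a minimal sketch that assumes the data lives in a local file rather than behind a URL. The object name, helper names, and sample bytes are made up for the example. It compares reading the bytes directly with java.nio against going through Source with ISO-8859-1, where each byte maps to exactly one character:

```scala
import java.nio.file.{Files, Path}
import scala.io.Source

object RawBytesSketch {
  // Option 1: skip Source entirely and read the bytes as-is.
  def readDirect(path: Path): Array[Byte] = Files.readAllBytes(path)

  // Option 2: go through Source, but with ISO-8859-1 so that every byte
  // becomes exactly one Char and _.toByte recovers it unchanged.
  def readViaSource(path: Path): Array[Byte] = {
    val src = Source.fromFile(path.toFile, "ISO-8859-1")
    try src.map(_.toByte).toArray finally src.close()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample: text with \n and \r\n plus two non-ASCII bytes.
    val original: Array[Byte] =
      "line one\nline two\r\n".getBytes("UTF-8") ++ Array(0xE9.toByte, 0xFF.toByte)
    val tmp = Files.createTempFile("raw-bytes", ".bin")
    Files.write(tmp, original)

    // Both reads should give back exactly what was written, newlines included.
    println(readDirect(tmp).sameElements(original))
    println(readViaSource(tmp).sameElements(original))
    Files.delete(tmp)
  }
}
```

If you control how the data is opened, the direct read is probably the simpler choice, since it never involves a character encoding at all.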
I have a very large file that contains individual JSONs which I would like to iterate through, turning each one into a Map using the Jackson library:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.ScalaObjectMapper
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val lines = sc.textFile(fileName)
On a single JSON string, I can run the following without issues:
mapper.readValue[Map[String, Object]](JSONString)
to get my map.
However, if I try the same thing while iterating through an RDD[String], like so, I get the following error:
lines.foreach(line => mapper.readValue[Map[String, Object]](line))
org.apache.spark.SparkException: Task not serializable
I can do lines.take(10000) or so and then work on that but this file is so huge I can't "take" or "collect" the whole file in one go and I want to be able to use the same solution across files of all different sizes.
After the string becomes a Map, I need to perform functions on it and write to a string, so any solution that allows me to do that without going over my allocated memory will help. Thank you!
Managed to solve this with the below:
import scala.util.parsing.json._
val myMap = JSON.parseFull(jsonString).get.asInstanceOf[Map[String, Object]]
The above works when applied to each element of an RDD[String].
I am new to Spark and Scala as well, so this might be a very basic question.
I created a text file with 4 lines of some words. The rest of the code is as below:
val data = sc.textFile("file:///home//test.txt").map(x=> x.split(" "))
println(data.collect)
println(data.take(2))
println(data.collect.foreach(println))
All of the above println calls produce output like: [Ljava.lang.String;@1ebec410
Any idea how I can display the actual contents of the RDD? I have even tried saveAsTextFile, and it also saves the same line as java...
I am using the IntelliJ IDE for Spark with Scala, and yes, I have gone through other related posts, but they did not help. Thanks in advance.
The type of data is RDD[Array[String]]. Previously you were printing each Array[String] directly, which prints something like [Ljava.lang.String;@1ebec410. Arrays do not override toString(), so you only get the type tag and the object's hash code.
You can convert the Array[String] to a List[String] with toList. You will then see the contents, because List overrides toString() in Scala to show its elements.
That means if you try
data.collect.foreach(arr => println(arr.toList))
this will show you the content. Or, as @Raphael has suggested:
data.collect().foreach(arr => println(arr.mkString(", ")))
this will also work, because arr.mkString(", ") converts the array into a single String with the elements separated by ", ".
Hope this clears up your doubt.
Thanks
data is of type RDD[Array[String]]; what you print is the toString of the Array[String] ([Ljava.lang.String;@1ebec410). Try this:
data.collect().foreach(arr => println(arr.mkString(", ")))
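The difference described in the answers above can be reproduced without Spark. The sketch below uses a plain Array[Array[String]] as a stand-in for the result of data.collect(); the object name, the render helper, and the sample sentences are made up for illustration:

```scala
object ArrayPrintingSketch {
  // Turn one split line into readable text.
  def render(words: Array[String]): String = words.mkString(", ")

  def main(args: Array[String]): Unit = {
    // Stand-in for data.collect(): one Array[String] per input line.
    val collected: Array[Array[String]] =
      Array("hello spark world", "scala is fun").map(_.split(" "))

    // JVM default toString for arrays: type tag plus hash, not the contents.
    collected.foreach(arr => println(arr.toString))

    // mkString joins the elements into a single String.
    collected.foreach(arr => println(render(arr)))

    // List overrides toString, so its elements are shown.
    collected.foreach(arr => println(arr.toList))
  }
}
```

With Spark, the same mkString or toList call goes inside the function passed to foreach, exactly as in the answers above.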
I have these two lines (among others):
import scala.io.Source
val source = Source.fromFile(filename)
As I understand it, this is a way to read file content. I have read
http://www.scala-lang.org/api/2.12.x/scala/io/Source.html#iter:Iterator[Char]
I still do not get what Source.fromFile represents. Is it one of the Type Members, or something else?
According to the Scala API, fromFile is a method defined on the Source companion object. It is a curried method: the first parameter list takes a single String, the path of the file to be read, and the second parameter list takes a single implicit codec argument of type scala.io.Codec. The method returns a BufferedSource object.
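The two parameter lists described above can be seen in a small sketch. The object name, helper names, and temporary file are hypothetical. The first helper passes the codec explicitly in the second parameter list; the second lets the compiler fill it in from an implicit value in scope:

```scala
import java.nio.file.Files
import scala.io.{BufferedSource, Codec, Source}

object FromFileSketch {
  // Second parameter list supplied explicitly.
  def readExplicit(path: String): String = {
    val source: BufferedSource = Source.fromFile(path)(Codec.UTF8)
    try source.mkString finally source.close()
  }

  // Second parameter list filled in from the implicit Codec in scope.
  def readImplicit(path: String): String = {
    implicit val codec: Codec = Codec.UTF8
    val source: BufferedSource = Source.fromFile(path)
    try source.mkString finally source.close()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample file created only for this demonstration.
    val tmp = Files.createTempFile("from-file", ".txt")
    Files.write(tmp, "first line\nsecond line\n".getBytes("UTF-8"))
    println(readExplicit(tmp.toString) == readImplicit(tmp.toString))
    Files.delete(tmp)
  }
}
```

When no implicit Codec is defined by the caller, the standard library supplies a default one, which is why the plain Source.fromFile(filename) call in the question compiles.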
Basic question: I want to set the standard input to a specific string. Currently I am trying it with this:
import java.nio.charset.StandardCharsets
import java.io.ByteArrayInputStream
// Let's say we are inside a method now
val str = "textinputgoeshere"
System.setIn(new ByteArrayInputStream(str.getBytes(StandardCharsets.UTF_8)))
I wrote it this way because that's similar to how I'd do it in Java. However, str.getBytes seems to work differently in Scala, as System.in shows up as what looks like a memory address when I check it with println.
I've looked at the Scala API: http://www.scala-lang.org/api/current/scala/Console$.html#setIn(in:java.io.InputStream):Unit
and I've found
def withIn[T](in: InputStream)(thunk: ⇒ T): T
But this seems to only set the input stream for a specific chunk of code, I'd like this to be a feature in a Setup method in my JUnit tests.
My problem ended up being something related to my code, not this specific concept. The correct way to override Standard In / System In to a String in Scala is the following:
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
val str = "your string here"
val in: InputStream = new ByteArrayInputStream(str.getBytes(StandardCharsets.UTF_8))
Console.withIn(in)(yourMethod())
My tests run correctly now.
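A self-contained sketch of the approach above follows. The names readGreeting and withInput are hypothetical, and readGreeting stands in for whatever code under test reads from standard input through scala.io.StdIn:

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import scala.io.StdIn

object StdinSketch {
  // Hypothetical code under test: reads one line from standard input.
  def readGreeting(): String = "Hello, " + StdIn.readLine()

  // Runs the given block with `text` as the contents of standard input.
  def withInput[T](text: String)(block: => T): T = {
    val in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8))
    Console.withIn(in)(block)
  }

  def main(args: Array[String]): Unit = {
    // The redirected input is only visible inside the block.
    println(withInput("world\n")(readGreeting()))
  }
}
```

Because Console.withIn only applies to the block it is given, one option in a JUnit test is to wrap the body of each test in a helper like withInput instead of setting the input once in a setup method.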
I want to write a Scala function to read all of the lines from a file lazily (i.e., returning an Iterator[String]) which also closes the file afterwards. I know about the idiom io.Source.fromFile("something.txt").getLines; however, as noted here, this will not close the file afterwards. Surely there is a simple way to do this?
Currently I'm using this, with the scala-arm library:
import resource.managed
import io.{Source, BufferedSource}
def lines(filename: String): Iterator[String] = {
  val reader = managed(Source.fromFile(filename, "UTF-8"))
  reader.map(_.getLines).toTraversable.toIterator
}
but this seems to read the whole file into memory as far as I can tell.
I come from a Python background, where this is laughably trivial:
def lines(filename):
    with open(filename) as f:
        for line in f:
            yield line
Surely there is a reasonably straightforward Scala equivalent which I just haven't managed to work out yet.
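One possible shape for such a function, using only the standard library, is sketched below. It is not taken from scala-arm or any other library. The wrapper iterator is hand-written for illustration, and it only closes the file once the last line has been consumed, so an iterator that is abandoned halfway leaves the file open:

```scala
import scala.io.Source

object LazyLinesSketch {
  // Lazily yields the lines of a file and closes the underlying Source
  // as soon as the iterator reports that no lines are left.
  def lines(filename: String): Iterator[String] = {
    val source = Source.fromFile(filename, "UTF-8")
    val underlying = source.getLines()

    new Iterator[String] {
      private var closed = false

      def hasNext: Boolean =
        if (closed) false
        else {
          val more = underlying.hasNext
          if (!more) {
            // Exhausted: release the file handle exactly once.
            source.close()
            closed = true
          }
          more
        }

      def next(): String = {
        if (!hasNext) throw new NoSuchElementException("no more lines")
        underlying.next()
      }
    }
  }
}
```

Usage would look like LazyLinesSketch.lines("something.txt").foreach(println), which reads one line at a time and closes the file after the last one. Unlike the Python generator, nothing here handles the case where the consumer stops early, so callers who may stop early would need an explicit close hook.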