Given a gzipped file, say t.gz, I want to be able to read its contents line by line.
I've been able to read the contents using:
Source.fromInputStream(new GZIPInputStream(new BufferedInputStream(new FileInputStream(s))))
However, I'm looking for a way to process these files in a functional paradigm instead, which has brought me to fs2.
If I unzip the file, I can do something like this:
import cats.effect._
import fs2.io.file.{Files, Path}
import fs2.{Stream, text, compression, io}

object Main extends IOApp.Simple {

  def doThing(inPath: Path): Stream[IO, Unit] = {
    Files[IO]
      .readAll(inPath)
      .through(text.utf8.decode)
      .through(text.lines)
      .map(line => line)
      .intersperse("\n")
      .through(text.utf8.encode)
      .through(io.stdout)
  }

  val run = doThing(Path("t")).compile.drain
}
where we just write to the console at the end for simplicity.
If instead I leave the file gzipped, I can't quite find anything that shows how these operations would fit together to provide the same Stream.
fs2 has a Compression capability (https://www.javadoc.io/doc/co.fs2/fs2-docs_2.13/latest/fs2/compression/Compression.html) that looks like it should do what I want, but if it does, I haven't figured out how to integrate it.
As such, the question is this: How do I read a zipped file into a stream to work with fs2 in a functional paradigm?
You probably want this:
import cats.effect._
import fs2.compression.Compression // so that Compression[IO] resolves
import fs2.io.file.{Files, Path}
import fs2.{Stream, text, io}

object Main extends IOApp.Simple {

  def doThing(inPath: Path): Stream[IO, Unit] = {
    Files[IO]
      .readAll(inPath)
      .through(Compression[IO].gunzip())
      .flatMap(_.content)
      .through(text.utf8.decode)
      .through(text.lines)
      .map(line => line)
      .intersperse("\n")
      .through(text.utf8.encode)
      .through(io.stdout)
  }

  override final val run =
    doThing(Path("t")).compile.drain
}
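Here gunzip() emits a GunzipResult whose content field carries the decompressed bytes, which is what the flatMap(_.content) step unwraps before the UTF-8 decode; everything after that is identical to the uncompressed version in the question.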
Related
I have a Dataset of an Event case class, and I would like to save the JSON string element inside each event to a file on S3 with a path like bucketName/service/yyyy/mm/dd/hh/[SomeGuid].gz.
So, for example, the Event case class looks like this:
case class Event(
  hourPath: String, // e.g. bucketName/service/yyyy/mm/dd/hh/
  json: String,     // The JSON line that represents this particular event.
  ...               // Other properties used in earlier transformations.
)
Is there a way to save the dataset so that the events belonging to a particular hour are written to a file on S3 under that hour's path?
Calling partitionBy on the DataFrameWriter is the closest I can get, but the file path isn't exactly what I want.
You can iterate over each item and write it to a file in S3. Note that the snippet below first collects everything to the driver; to have Spark execute the writes in parallel on the executors, see the foreachPartition sketch after the import list.
This code is working for me:
val tempDS = eventsDS.rdd.collect.map(x => saveJSONtoS3(x.hourPath, x.json))

def saveJSONtoS3(path: String, jsonString: String): Unit = {
  val bucketName = path.substring(0, path.indexOf('/'))
  val file = path.substring(bucketName.length() + 1)

  val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
  val amazonS3Client = new AmazonS3Client(creds)
  val meta = new ObjectMetadata()
  amazonS3Client.putObject(bucketName, file, new ByteArrayInputStream(jsonString.getBytes), meta)
}
You need to import:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.model.ObjectMetadata
import java.io.ByteArrayInputStream
You need to include the aws-java-sdk library.
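If you want the writes to actually run in parallel on the executors instead of on the driver, a minimal sketch using foreachPartition could look like the following (same SDK v1 client as above; AWS_ACCESS_KEY / AWS_SECRET_KEY are assumed to be available where the tasks run):

import java.io.ByteArrayInputStream
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectMetadata

// Sketch: distribute the S3 writes instead of collecting to the driver.
eventsDS.rdd.foreachPartition { events =>
  val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
  val s3 = new AmazonS3Client(creds) // one client per partition
  events.foreach { e =>
    val bucketName = e.hourPath.substring(0, e.hourPath.indexOf('/'))
    // Append your own [SomeGuid].gz file name here if needed, as in the question.
    val key = e.hourPath.substring(bucketName.length + 1)
    s3.putObject(bucketName, key, new ByteArrayInputStream(e.json.getBytes("UTF-8")), new ObjectMetadata())
  }
}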
I am working on a project for finding water depth and extent using a digital ground model (DGM). I have multiple TIFF files covering the area of interest, and I want to combine them into a single TIFF file for quicker processing. How can I combine them, using my own code below or any other methodology?
I have tried to concatenate the tiles by reading them in one by one and then combining them, but it throws a GC error, probably because something is wrong with the code itself. The code is provided below.
import geotrellis.proj4._
import geotrellis.raster._
import geotrellis.raster.io.geotiff._

object waterdepth {

  val directories = List("data")

  // constants to differentiate which bands to use
  val R_BAND = 0
  val G_BAND = 1
  val NIR_BAND = 2

  // Path to our landsat band geotiffs.
  def bandPath(directory: String) = s"../biggis-landuse/radar_data/${directory}"

  def main(args: Array[String]): Unit = {
    directories.map(directory => generateMultibandGeoTiffFile(directory))
  }

  def generateMultibandGeoTiffFile(directory: String) = {
    val tiffFiles = new java.io.File(bandPath(directory)).listFiles.map(_.toString)

    val singleBandGeoTiffArray = tiffFiles.foldLeft(Array[SinglebandGeoTiff]())((acc, el: String) => {
      acc :+ SinglebandGeoTiff(el)
    })

    val tileArray = ArrayMultibandTile(singleBandGeoTiffArray.map(_.tile))

    println(s"Writing out $directory multispectral tif")
    MultibandGeoTiff(tileArray, singleBandGeoTiffArray(0).extent, singleBandGeoTiffArray(0).crs).write(s"data/$directory.tif")
  }
}
It should be able to create a single TIFF file from all the separate files, but it throws a memory error.
The idea you follow is correct; the OOM probably happens because you're loading lots of TIFFs into memory, so it is not surprising. One solution is to allocate more memory to the JVM. However, you can also try this small optimization (which will probably work):
import geotrellis.proj4._
import geotrellis.raster._
import geotrellis.raster.io.geotiff._
import geotrellis.raster.io.geotiff.reader._
import java.io.File

def generateMultibandGeoTiffFile(directory: String) = {
  val tiffs =
    new File(bandPath(directory))
      .listFiles
      .map(_.toString)
      // streaming = true won't force all bytes to load into memory
      // only tiff metadata is fetched here
      .map(GeoTiffReader.readSingleband(_, streaming = true))

  val (extent, crs) = {
    val tiff = tiffs.head
    tiff.extent -> tiff.crs
  }

  // TIFF segments bytes fetch will start only during the write
  MultibandGeoTiff(
    MultibandTile(tiffs.map(_.tile)),
    extent, crs
  ).write(s"data/$directory.tif")
}
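If it still runs out of memory, giving the JVM a larger heap is the other half of the fix; for example, in an sbt project (the 8G figure below is just illustrative):

// build.sbt: run the application in a forked JVM with a bigger heap
fork := true
javaOptions += "-Xmx8G"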
Alpakka provides a great way to access dozens of different data sources. File-oriented sources such as HDFS and FTP sources are delivered as Source[ByteString, Future[IOResult]]. However, HTTP requests via Akka HTTP are delivered as entity streams of Source[ByteString, NotUsed]. In my use case, I would like to retrieve content from HTTP sources as Source[ByteString, Future[IOResult]] so I can build a unified resource fetcher that works with multiple schemes (hdfs, file, ftp and S3 in this case).
In particular, I would like to convert the Source[ByteString, NotUsed] source to Source[ByteString, Future[IOResult]], where I am able to calculate the IOResult from the incoming byte stream. There are plenty of methods like flatMapConcat and viaMat, but none seem to be able to extract details from the input stream (such as the number of bytes read) or initialise the IOResult structure properly. Ideally, I am looking for a method with the following signature that will update the IOResult as the stream comes in.
def matCalc(src: Source[ByteString, Any]): Source[ByteString, Future[IOResult]] = {
  src.someMatFoldMagic[ByteString, IOResult](IOResult.createSuccessful(0))((m, b) => m.withCount(m.count + b.length))
}
I can't recall any existing functionality that can do this out of the box, but you can use the alsoToMat flow function (surprisingly I didn't find it in the Akka Streams docs, although you can look it up in the source code documentation and the Java API) together with Sink.fold to accumulate some value and emit it at the very end. E.g.:
def magic(source: Source[Int, Any]): Source[Int, Future[Int]] =
  source.alsoToMat(Sink.fold(0)((acc, _) => acc + 1))((_, f) => f)
The thing is that alsoToMat combines the input materialized value with the one provided by alsoToMat; at the same time, the values produced by the source are not affected by the sink in alsoToMat:
def alsoToMat[Mat2, Mat3](that: Graph[SinkShape[Out], Mat2])(matF: (Mat, Mat2) ⇒ Mat3): ReprMat[Out, Mat3] =
  viaMat(alsoToGraph(that))(matF)
It's not that hard to adapt this function to return an IOResult, which according to the source code is the following (see the sketch right after it):
final case class IOResult(count: Long, status: Try[Done]) { ... }
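For instance, a minimal sketch of that adaptation (the name withIoResult is mine; it counts bytes with Sink.fold and wraps the total via IOResult.createSuccessful):

import akka.stream.IOResult
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.{ExecutionContext, Future}

// Count the bytes flowing through and expose the total as a Future[IOResult],
// without affecting the elements themselves.
def withIoResult(src: Source[ByteString, Any])(
    implicit ec: ExecutionContext): Source[ByteString, Future[IOResult]] =
  src.alsoToMat(
    Sink.fold(0L)((acc, bytes: ByteString) => acc + bytes.length)
  )((_, totalF) => totalF.map(IOResult.createSuccessful))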
One last thing you need to pay attention to: you want your source to be like:
Source[ByteString, Future[IOResult]]
But if you want to carry this mat value until the very end of the stream definition and then do something based on the completion of that future, that might be an error-prone approach. E.g., in this example I finish the work based on that future, so the last value will not be processed:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object App extends App {
  private implicit val sys: ActorSystem = ActorSystem()
  private implicit val mat: ActorMaterializer = ActorMaterializer()
  private implicit val ec: ExecutionContext = sys.dispatcher

  val source: Source[Int, Any] = Source((1 to 5).toList)

  def magic(source: Source[Int, Any]): Source[Int, Future[Int]] =
    source.alsoToMat(Sink.fold(0)((acc, _) => acc + 1))((_, f) => f)

  val f = magic(source).throttle(1, 1.second).toMat(Sink.foreach(println))(Keep.left).run()

  f.onComplete(t => println(s"f1 completed - $t"))

  Await.ready(f, 5.minutes)

  mat.shutdown()
  sys.terminate()
}
This can be done by using a Promise for the materialized value propagation.
val completion = Promise[IOResult]()
val httpWithIoResult = http.mapMaterializedValue(_ => completion.future)
What is left now is to complete the completion promise when the relevant data becomes available.
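As a minimal sketch of that last step (the names asIoResultSource and httpBytes are mine; a byte-counting side sink attached with alsoTo completes the promise when the stream finishes):

import akka.stream.IOResult
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.{ExecutionContext, Future, Promise}

def asIoResultSource(httpBytes: Source[ByteString, Any])(
    implicit ec: ExecutionContext): Source[ByteString, Future[IOResult]] = {
  val completion = Promise[IOResult]()
  httpBytes
    .mapMaterializedValue(_ => completion.future) // expose the promise's future as the mat value
    .alsoTo(
      Sink
        .fold(0L)((acc, bytes: ByteString) => acc + bytes.length)
        .mapMaterializedValue(_.onComplete(result =>
          completion.complete(result.map(IOResult.createSuccessful))))
    )
}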
An alternative approach would be to drop down to the GraphStage API, where you get lower-level control of materialized value propagation. But even there, using Promises is often the chosen implementation for materialized value propagation. Take a look at built-in operator implementations like Ignore.
I am a beginner in Scala, trying to read a file but getting a java.io.FileNotFoundException. Can someone help?
package standardscala

case class TempData(day: Int, doy: Int, month: Int, year: Int, precip: Double, snow: Double, tave: Double, tmax: Double, tmin: Double)

object TempData {
  def main(args: Array[String]): Unit = {
    val source = scala.io.Source.fromFile("DATA/MN212.csv")
    val lines = source.getLines().drop(1) // get the lines of the file, drop(1) to skip the header
    val data = lines.map { line =>
      val p = line.split(",")
      TempData(p(0).toInt, p(1).toInt, p(2).toInt, p(4).toInt, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble, p(9).toDouble)
    }.toArray
    source.close() // Closing the connection
    data.take(5) foreach println
  }
}
Try using an absolute path, and the problem will disappear.
One option would be to move your csv file into a resources folder and load it as a resource like:
val f = new File(getClass.getClassLoader.getResource("your/csv/file.csv").getPath)
Or you could try loading it from an absolute path!
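If you are on Scala 2.12 or later, a slightly shorter variant of the resource approach (a sketch; the path is relative to the resources root) is:

// Reads a classpath resource without going through java.io.File.
val lines = scala.io.Source.fromResource("your/csv/file.csv").getLines().toList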
Please read this post about reading CSV files by Alvin Alexander, author of the Scala Cookbook:
object CSVDemo extends App {
  println("Month, Income, Expenses, Profit")
  val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
  for (line <- bufferedSource.getLines) {
    val cols = line.split(",").map(_.trim)
    // do whatever you want with the columns here
    println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
  }
  bufferedSource.close
}
As Silvio Manolo pointed out, you should not use fromFile with an absolute path, as your code would require the same file hierarchy to run. In a first draft this is acceptable, so you can move on and test the real job!
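Separately, if you are on Scala 2.13, a resource-safe variant of the same loop is worth considering (a sketch; Using.resource closes the source even if something throws):

import scala.util.Using

Using.resource(io.Source.fromFile("/tmp/finance.csv")) { src =>
  for (line <- src.getLines()) {
    val cols = line.split(",").map(_.trim)
    println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
  }
}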
In Scala, how does one uncompress the text contained in file.gz so that it can be processed? I would be happy either with having the contents of the file stored in a variable, or with saving it as a local file so that it can be read by the program afterwards.
Specifically, I am using Scalding to process compressed log data, but Scalding does not define a way to read them in FileSource.scala.
Here's my version:
import java.io.BufferedReader
import java.io.InputStreamReader
import java.util.zip.GZIPInputStream
import java.io.FileInputStream

class BufferedReaderIterator(reader: BufferedReader) extends Iterator[String] {
  override def hasNext: Boolean = reader.ready
  override def next(): String = reader.readLine()
}

object GzFileIterator {
  def apply(file: java.io.File, encoding: String) = {
    new BufferedReaderIterator(
      new BufferedReader(
        new InputStreamReader(
          new GZIPInputStream(
            new FileInputStream(file)), encoding)))
  }
}
Then do:
val iterator = GzFileIterator(new java.io.File("test.txt.gz"), "UTF-8")
iterator.foreach(println)
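Alternatively, a minimal sketch of the same thing using scala.io.Source directly (it handles the decoding and line splitting for you):

import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

val source = Source.fromInputStream(
  new GZIPInputStream(new FileInputStream("test.txt.gz")), "UTF-8")
try source.getLines().foreach(println)
finally source.close()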