How to stream downloads using Scalaj-Http and Hadoop HttpFs - scala

My question is: how do I use a buffered stream with Scalaj-Http?
I have written the following code, which is a complete working example that downloads a file from Hadoop HDFS using HttpFS. My goal is to handle very large files, and that will require a buffered approach with multiple I/O writes to a local file.
I have not been able to find documentation on how to use a stream with the ScalaJ-Http interface. I am interested in an example for both download and upload that can handle large multi-GB files. My code below uses in-memory buffering, which is appropriate only for prototyping.
import scalaj.http._
import ujson.Js
import java.text.SimpleDateFormat
import java.net.SocketTimeoutException
import java.io.InputStream
import java.io.BufferedOutputStream
import java.io.FileOutputStream
import java.io.FileNotFoundException
object CopyFileFromHdfs {
  def main(args: Array[String]): Unit = {
    val host = "hadoop.example.com"
    val user = "root"
    var dstFile = ""
    var srcFile = ""
    val operation = "OPEN"
    val port = 14000

    System.setProperty("sun.net.http.allowRestrictedHeaders", "true")

    if (args.length != 2) {
      println("Error: Missing or too many arguments")
      println("Usage: CopyFileFromHdfs <srcfile> <dstfile>")
      System.exit(1)
    }

    srcFile = args(0)
    dstFile = args(1)

    // ********************************************************************************
    // Create the URL string that we will use to connect to Hadoop HttpFS
    //
    // The string will look like this:
    // http://root@123.456.789.012:14000/webhdfs/v1/?user.name=root&op=OPEN
    // ********************************************************************************
    val url = makeHttpfsUrl(host, user, srcFile, operation, port)

    // ********************************************************************************
    // Using HTTP, call the HttpFS server
    //
    // Exceptions:
    //   java.net.SocketTimeoutException
    //   java.net.UnknownHostException
    //   java.lang.IllegalArgumentException
    // Remote Exceptions:
    //   java.io.FileNotFoundException
    //   com.sun.jersey.api.NotFoundException
    // ********************************************************************************
    try {
      val response = Http(url)
        .timeout(connTimeoutMs = 1000, readTimeoutMs = 5000)
        .asBytes

      // ********************************************************************************
      // Check for an error. We are expecting an HTTP 2xx response
      // ********************************************************************************
      if (response.code < 200 || response.code > 299) {
        val data = ujson.read(response.body)
        printf("Error: Cannot download file: %s\n", dstFile)
        println(removeQuotes(data("RemoteException")("message").str))
        println(removeQuotes(data("RemoteException")("exception").str))
        System.exit(1)
      }

      // The entire response body is held in memory here; see the streaming
      // variant sketched below for a constant-memory alternative.
      val fos = new FileOutputStream(dstFile)
      val bos = new BufferedOutputStream(fos)
      bos.write(response.body, 0, response.body.length)
      bos.close()
      fos.close()
    } catch {
      case e: SocketTimeoutException =>
        printf("Error: Cannot connect to host %s on port %d\n", host, port)
        println(e)
        System.exit(1)
      case e: Exception =>
        printf("Error (other): Cannot download file %s\n", srcFile)
        println(e)
        System.exit(1)
    }

    printf("Success: File downloaded. %s -> %s\n", srcFile, dstFile)
    System.exit(0)
  }

  // ********************************************************************************
  // The Json strings are surrounded by quotes.
  // This function will remove them (only at the start and the end).
  // ********************************************************************************
  def removeQuotes(str: String): String = {
    // This expression deletes quotes at the beginning and the end of a string
    str.replaceAll("^\"|\"$", "")
  }

  // ********************************************************************************
  // Create the URL string that we will use to connect to Hadoop HttpFS
  //
  // The string will look like this:
  // http://root@123.456.789.012:14000/webhdfs/v1/?user.name=root&op=LISTSTATUS
  // ********************************************************************************
  def makeHttpfsUrl(
      host: String,
      user: String,
      hdfsPath: String,
      operation: String,
      port: Integer): String = {
    var url = "http://" + user + "@" + host + ":" + port.toString + "/webhdfs/v1"

    if (hdfsPath(0) == '/')
      url += hdfsPath
    else
      url += "/" + hdfsPath

    url += "?user.name=" + user + "&op=" + operation
    url
  }
}
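For the streaming part of the question: scalaj-http can hand the raw InputStream to a parser function via execute, so the body never has to be materialized in memory. A minimal sketch of the download side, assuming scalaj-http 2.x (the 64 KB buffer size is an arbitrary choice):

// Stream the response body straight to disk in fixed-size chunks
// instead of buffering the whole file as one Array[Byte].
val response = Http(url)
  .timeout(connTimeoutMs = 1000, readTimeoutMs = 5000)
  .execute(parser = { inputStream =>
    val out = new BufferedOutputStream(new FileOutputStream(dstFile))
    try {
      val buffer = new Array[Byte](64 * 1024)
      Iterator
        .continually(inputStream.read(buffer))
        .takeWhile(_ != -1)
        .foreach(bytesRead => out.write(buffer, 0, bytesRead))
    } finally {
      out.close()
    }
  })

Two caveats: the parser also sees non-2xx bodies, so either check response.code afterwards and delete the partial file, or use the exec variant (whose parser also receives the status code and headers) to decide before writing. For the upload direction, MultiPart has an InputStream-based overload (name, filename, mime type, stream, byte count, progress callback) that can be passed to postMulti, which likewise avoids loading the source file into memory; whether HttpFS accepts a multipart body for OP=CREATE is a separate question.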

Related

akka.http.scaladsl.model.ParsingException: Unexpected end of multipart entity while uploading a large file to S3 using akka http

I am trying to upload a large file (90 MB for now) to S3 using Akka HTTP with the Alpakka S3 connector. It works fine for small files (25 MB), but when I try to upload a large file (90 MB), I get the following error:
akka.http.scaladsl.model.ParsingException: Unexpected end of multipart entity
at akka.http.scaladsl.unmarshalling.MultipartUnmarshallers$$anonfun$1.applyOrElse(MultipartUnmarshallers.scala:108)
at akka.http.scaladsl.unmarshalling.MultipartUnmarshallers$$anonfun$1.applyOrElse(MultipartUnmarshallers.scala:103)
at akka.stream.impl.fusing.Collect$$anon$6.$anonfun$wrappedPf$1(Ops.scala:227)
at akka.stream.impl.fusing.SupervisedGraphStageLogic.withSupervision(Ops.scala:186)
at akka.stream.impl.fusing.Collect$$anon$6.onPush(Ops.scala:229)
at akka.stream.impl.fusing.GraphInterpreter.processPush(GraphInterpreter.scala:523)
at akka.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:510)
at akka.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:376)
at akka.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:606)
at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:485)
at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:581)
at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:749)
at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$shortCircuitBatch(ActorGraphInterpreter.scala:739)
at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:765)
at akka.actor.Actor.aroundReceive(Actor.scala:539)
at akka.actor.Actor.aroundReceive$(Actor.scala:537)
at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:671)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:614)
at akka.actor.ActorCell.invoke(ActorCell.scala:583)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
at akka.dispatch.Mailbox.run(Mailbox.scala:229)
at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Although I get the success message at the end, the file is not uploaded completely; only 45-50 MB of it make it to S3.
I am using the code below:
S3Utility.scala
class S3Utility(implicit as: ActorSystem, m: Materializer) {
private val bucketName = "test"
def sink(fileInfo: FileInfo): Sink[ByteString, Future[MultipartUploadResult]] = {
val fileName = fileInfo.fileName
S3.multipartUpload(bucketName, fileName)
}
}
Routes:
def uploadLargeFile: Route =
post {
path("import" / "file") {
extractMaterializer { implicit materializer =>
withoutSizeLimit {
fileUpload("file") {
case (metadata, byteSource) =>
logger.info(s"Request received to import large file: ${metadata.fileName}")
val uploadFuture = byteSource.runWith(s3Utility.sink(metadata))
onComplete(uploadFuture) {
case Success(result) =>
logger.info(s"Successfully uploaded file")
complete(StatusCodes.OK)
case Failure(ex) =>
println(ex, "Error in uploading file")
complete(StatusCodes.FailedDependency, ex.getMessage)
}
}
}
}
}
}
Any help would be appreciated. Thanks
Strategy 1
Can you break the file into smaller chunks and retry? Here is some sample code:
AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration("some-kind-of-endpoint"))
        .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("user", "pass")))
        .disableChunkedEncoding()
        .withPathStyleAccessEnabled(true)
        .build();

// Create a list of PartETag objects. You get one of these for each part upload.
List<PartETag> partETags = new ArrayList<PartETag>();

// Step 1: Initialize.
InitiateMultipartUploadRequest initRequest =
        new InitiateMultipartUploadRequest("bucket", "key");
InitiateMultipartUploadResult initResponse =
        s3Client.initiateMultipartUpload(initRequest);

File file = new File("filepath");
long contentLength = file.length();
long partSize = 5242880; // Set part size to 5 MB.

try {
    // Step 2: Upload parts.
    long filePosition = 0;
    for (int i = 1; filePosition < contentLength; i++) {
        // The last part can be smaller than 5 MB. Adjust the part size.
        partSize = Math.min(partSize, (contentLength - filePosition));

        // Create a request to upload a part.
        UploadPartRequest uploadRequest = new UploadPartRequest()
                .withBucketName("bucket").withKey("key")
                .withUploadId(initResponse.getUploadId()).withPartNumber(i)
                .withFileOffset(filePosition)
                .withFile(file)
                .withPartSize(partSize);

        // Upload the part and add the response's ETag to our list.
        partETags.add(s3Client.uploadPart(uploadRequest).getPartETag());

        filePosition += partSize;
    }

    // Step 3: Complete the multipart upload.
    CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(
            "bucket", "key", initResponse.getUploadId(), partETags);
    s3Client.completeMultipartUpload(compRequest);
} catch (Exception e) {
    // On any failure, abort so S3 does not keep the incomplete parts around.
    s3Client.abortMultipartUpload(new AbortMultipartUploadRequest(
            "bucket", "key", initResponse.getUploadId()));
}
Strategy 2
Increase the idle-timeout of the Akka HTTP server (or just set it to infinite), like the following:
akka.http.server.idle-timeout = infinite
This increases the time period for which the server tolerates an idle connection. By default its value is 60 seconds, and if the upload cannot complete within that period, the server closes the connection and the "Unexpected end of multipart entity" error is thrown.
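If infinite feels too permissive, a large finite value works the same way; the 10-minute figure below is only an illustrative choice:
akka.http.server.idle-timeout = 600s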

java.lang.NoClassDefFoundError: org/apache/flink/streaming/api/scala/StreamExecutionEnvironment

package com.knoldus
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
object SocketWindowWordCount {

  def main(args: Array[String]): Unit = {

    var hostname: String = "localhost"
    var port: Int = 9000

    try {
      val params = ParameterTool.fromArgs(args)
      hostname = if (params.has("hostname")) params.get("hostname") else "localhost"
      port = params.getInt("port")
    } catch {
      case e: Exception =>
        System.err.println("No port specified. Please run 'SocketWindowWordCount " +
          "--hostname <hostname> --port <port>', where hostname (localhost by default) and port " +
          "is the address of the text server")
        System.err.println("To start a simple text server, run 'netcat -l <port>' " +
          "and type the input text into the command line")
        return
    }

    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    // get input data by connecting to the socket
    val text: DataStream[String] = env.socketTextStream(hostname, port, '\n')

    // parse the data, group it, window it, and aggregate the counts
    val windowCounts = text
      .flatMap { w => w.split("\\s") }
      .map { w => WordWithCount(w, 1) }
      .keyBy("word")
      .timeWindow(Time.seconds(5))
      .sum("count")

    // print the results with a single thread, rather than in parallel
    windowCounts.print().setParallelism(1)

    env.execute("Socket Window WordCount")
  }

  /** Data type for words with count */
  case class WordWithCount(word: String, count: Long)
}
When I run this code in IntelliJ, I get this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/flink/streaming/api/scala/StreamExecutionEnvironment$
at com.knoldus.SocketWindowWordCount$.main(SocketWindowWordCount.scala:43)
at com.knoldus.SocketWindowWordCount.main(SocketWindowWordCount.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.flink.streaming.api.scala.StreamExecutionEnvironment$
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
... 2 more
Select the menu item "Run" => "Edit Configurations...", then in the "Build and run" section select "Modify options" => Java => "Add dependencies with 'provided' scope to classpath" in your local run configuration.
This way you don't have to remove the <scope>provided</scope>.
I resolved this by removing the <scope>provided</scope> that was present in the Maven import.
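For reference, the dependency in question typically looks like the following in pom.xml; the coordinates and version shown here are illustrative, not taken from the question:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.12</artifactId>
    <version>1.13.2</version>
    <!-- "provided" means the Flink cluster supplies this jar at runtime;
         remove this line, or use the IntelliJ option above, to run locally -->
    <scope>provided</scope>
</dependency>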

SSLHandshakeException happens during file upload to AWS S3 via Alpakka

I'm trying to set up Alpakka S3 for file upload purposes. Here are my configs:
alpakka s3 dependency:
...
"com.lightbend.akka" %% "akka-stream-alpakka-s3" % "0.20"
...
Here is application.conf:
akka.stream.alpakka.s3 {
  buffer = "memory"
  proxy {
    host = ""
    port = 8000
    secure = true
  }
  aws {
    credentials {
      provider = default
    }
  }
  path-style-access = false
  list-bucket-api-version = 2
}
File upload code example:
private val awsCredentials = new BasicAWSCredentials("my_key", "my_secret_key")
private val awsCredentialsProvider = new AWSStaticCredentialsProvider(awsCredentials)
private val regionProvider = new AwsRegionProvider { def getRegion: String = "us-east-1" }
private val settings = new S3Settings(MemoryBufferType, None, awsCredentialsProvider, regionProvider, false, None, ListBucketVersion2)
private val s3Client = new S3Client(settings)(system, materializer)

val fileSource = Source.single(ByteString("ololo blabla bla"))
val fileName = UUID.randomUUID().toString
val s3Sink: Sink[ByteString, Future[MultipartUploadResult]] = s3Client.multipartUpload("my_basket", fileName)

fileSource.runWith(s3Sink)
  .map { result =>
    println(s"${result.location}")
  } recover {
    case ex: Exception => println(s"$ex")
  }
When I run this code I get:
javax.net.ssl.SSLHandshakeException: General SSLEngine problem
What could be the reason?
The certificate problem arises for bucket names containing dots.
You may switch to
akka.stream.alpakka.s3.path-style-access = true to get rid of this.
We're considering making it the default: https://github.com/akka/alpakka/issues/1152
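At the code level, the fifth constructor argument in the snippet above appears to be pathStyleAccess (matching the Alpakka 0.20 S3Settings signature as used in the question), so the equivalent change would presumably be:

// Sketch, assuming the fifth S3Settings argument is pathStyleAccess:
// with path-style access the bucket name moves from the TLS hostname
// into the request path, avoiding the wildcard-certificate mismatch.
private val settings = new S3Settings(MemoryBufferType, None, awsCredentialsProvider,
  regionProvider, true, None, ListBucketVersion2)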

How to use http4s server and client library as a proxy?

I want to use http4s as a proxy (like nginx). How can I forward all data from my http4s server to another HTTP server?
What I really want to do is run a verification function on every request before forwarding it. Hopefully something like this:
HttpService[IO] {
  case request =>
    val httpClient: Client[IO] = Http1Client[IO]().unsafeRunSync
    if (verifySuccess(request)) {
      // forward all http data to host2 and get a http response.
      val result = httpClient.forward(request, "http://host2")
      result
    } else {
      Forbidden // 403
    }
}
How can I do this with http4s and its client?
Thanks
Updated
With the help of @TheInnerLight, I gave it a try with this snippet:
val httpClient = Http1Client[IO]()

val service: HttpService[IO] = HttpService[IO] {
  case req =>
    if (true) {
      for {
        client <- httpClient
        newAuthority = req.uri.authority.map(_.copy(host = RegName("scala-lang.org"), port = Some(80)))
        proxiedReq = req.withUri(req.uri.copy(authority = newAuthority))
        response <- client.fetch(proxiedReq)(IO.pure(_))
      } yield response
    } else {
      Forbidden("Some forbidden message...")
    }
}
With a request to http://localhost:28080 (the http4s server listens on 28080), the following error occurred:
[ERROR] org.http4s.client.PoolManager:102 - Error establishing client connection for key RequestKey(Scheme(http),localhost)
java.net.ConnectException: Connection refused
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.KQueuePort$EventHandlerTask.run(KQueuePort.java:301)
at java.lang.Thread.run(Thread.java:748)
[ERROR] org.http4s.server.service-errors:88 - Error servicing request: GET / from 0:0:0:0:0:0:0:1
java.net.ConnectException: Connection refused
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.KQueuePort$EventHandlerTask.run(KQueuePort.java:301)
at java.lang.Thread.run(Thread.java:748)
Latest Version (the getOrElse below handles the case where the incoming request has no authority set, which is presumably what made the previous snippet call back into localhost):
val httpClient: IO[Client[IO]] = Http1Client[IO]()

override val service: HttpService[IO] = HttpService[IO] {
  case req =>
    val hostName = "scala-lang.org"
    val myPort = 80
    if (true) {
      val newHeaders = {
        val filteredHeaders = req.headers.filterNot { h =>
          h.name == CaseInsensitiveString("Connection") ||
          h.name == CaseInsensitiveString("Keep-Alive") ||
          h.name == CaseInsensitiveString("Proxy-Authenticate") ||
          h.name == CaseInsensitiveString("Proxy-Authorization") ||
          h.name == CaseInsensitiveString("TE") ||
          h.name == CaseInsensitiveString("Trailer") ||
          h.name == CaseInsensitiveString("Transfer-Encoding") ||
          h.name == CaseInsensitiveString("Upgrade")
        }
        filteredHeaders.put(Header("host", hostName))
      }
      for {
        client <- httpClient
        newAuthority = req.uri.authority
          .map(_.copy(host = RegName(hostName), port = Some(myPort)))
          .getOrElse(Authority(host = RegName(hostName), port = Some(myPort)))
        proxiedReq = req.withUri(req.uri.copy(authority = Some(newAuthority)))
          .withHeaders(newHeaders)
        response <- client.fetch(proxiedReq)(IO.pure(_))
      } yield response
    } else {
      Forbidden("Some forbidden message...")
    }
}
It works well enough for my REST API web server.
There are still some errors when proxying scala-lang.org as a test:
[ERROR] org.http4s.blaze.pipeline.Stage:226 - Error writing body
org.http4s.InvalidBodyException: Received premature EOF.
How about something like this:
HttpService[IO] {
case req =>
if(verifyRequest(req)) {
for {
client <- Http1Client[IO]()
newHost = "host2"
newAuthority = Authority(host = RegName("host2"), port = Some(80))
proxiedReq =
req.withUri(req.uri.copy(authority = Some(newAuthority)))
.withHeaders(req.headers.put(Header("host", newHost)))
response <- client.fetch(proxiedReq)(IO.pure(_))
} yield response
} else {
Forbidden("Some forbidden message...")
}
}
Note that you should definitely avoid littering your code with calls to unsafeRunSync. You should generally use it at most once in your program (in Main). In other circumstances, you should focus on lifting the effects into the monad you're working in.
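For instance, one way to lift the effect is to create the client once at startup and close over it in the service; a sketch against the http4s 0.18-era API used above, with verifyRequest standing in for whatever check you need:

def proxyService(client: Client[IO]): HttpService[IO] = HttpService[IO] {
  case req if verifyRequest(req) =>
    val newAuthority = Authority(host = RegName("host2"), port = Some(80))
    val proxiedReq = req.withUri(req.uri.copy(authority = Some(newAuthority)))
      .withHeaders(req.headers.put(Header("host", "host2")))
    client.fetch(proxiedReq)(IO.pure(_))
  case _ =>
    Forbidden("Some forbidden message...")
}

// Run the client-construction effect a single time when wiring the server,
// instead of calling unsafeRunSync inside every request:
val serviceIO: IO[HttpService[IO]] = Http1Client[IO]().map(proxyService)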

Gatling: Web Socket Open and Initialization Index page is giving an Error

When I initialize my index page (POST INDEX PAGE), it gives me the following error:
KO bodyString.find.transform.exists failed, could not extract: transform crashed: Unexpected character ('2' (code 50)): was expecting comma to separate OBJECT entries
Code:
.exec(http("POST INDEX PAGE")
  .post("/?v-1471231389581")
  // .headers(Map("Content-Type" -> "application/json; charset=UTF-8"))
  .formParam("v-browserDetails", "1")
  .formParam("theme", "mytheme")
  .formParam("v-appId", appver)
  .formParam("v-sh", "1200")
  .formParam("v-sw", "1920")
  .formParam("v-cw", "147")
  .formParam("v-ch", "1047")
  .formParam("v-curdate", "1470999686031")
  .formParam("v-tzo", "-330")
  .formParam("v-dstd", "0")
  .formParam("v-rtzo", "-330")
  .formParam("v-dston", "false")
  .formParam("v-vw", "147")
  .formParam("v-vh", "0")
  .formParam("v-loc", baseurl + "/")
  .formParam("v-wn", appver + "-0.7179318188297512")
  .check(Checker.httpChecker)
).pause(1 seconds)
Checker:
val httpChecker = bodyString.transform { (resp, session) =>
  val state = new VaadinState
  println("\n resp :" + resp + "\n")
  println("\n session :" + session + "\n")
  println("Started user " + session.get("userName").as[String] + " " + session.get("password").as[String])
  state.userName = session.get("userName").as[String]
  HttpRequestCreator.userStates += (session.get("userName").as[String] -> state)

  // Find and store the JSESSIONID cookie
  val url = new URL(new AppConfig().getBaseURL())
  val jsessionCookie = session("gatling.http.cookies").as[CookieJar]
    .get(Uri.create(url.getProtocol + "://" + url.getHost + url.getPath + "/"))
    .find(_.getName == "JSESSIONID")
  state.jsessionid = jsessionCookie.getOrElse(null).getValue
  println("\n jsession id" + state.jsessionid + "\n")
  println("\n resp :" + resp + "\n")

  state.readJsonState(state.httpResponseToValidJsonString(resp))
}
After that, I tried without posting those parameters. Then the syncId of the response comes back as -1 even though it says the web socket is open:
10:41:22.143 [DEBUG] i.g.h.a.w.WsActor - Received text message on websocket 'gatling.http.webSocket':237|for(;;);[{"changes":{},"resources":{},"locales":{},"meta":{"appError":{"caption":"Communication problem","url":null,"message":"Take note of any unsaved data, and <u>click here</u> or press ESC to continue.","details":null}},"syncId":-1}
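For what it's worth, the parse failure is consistent with feeding the raw Vaadin payload to a JSON parser: as the log line above shows, Vaadin prefixes its JSON with a byte count (237|) and a for(;;); guard, and a leading digit is exactly what the "Unexpected character ('2' (code 50))" message complains about. A hypothetical httpResponseToValidJsonString would therefore need to strip that prefix, e.g.:

// Hypothetical sketch: drop everything up to and including the "for(;;);"
// guard so that only the JSON payload remains for parsing.
def httpResponseToValidJsonString(resp: String): String = {
  val guard = "for(;;);"
  val idx = resp.indexOf(guard)
  if (idx >= 0) resp.substring(idx + guard.length) else resp
}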