Failing to write to S3 - Scala

I am trying to write a file to Amazon S3.
val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
val amazonS3Client = new AmazonS3Client(creds)
val filePath = "/service2/2019/06/30/21"
val fileContent = "{\"key\":\"value\"}"
val meta = new ObjectMetadata();
amazonS3Client.putObject(bucketName, filePath, new ByteArrayInputStream(fileContent.getBytes), meta)
The program finishes with no errors, but no file is written to the bucket.

The key argument seems to have a stray leading slash. Try it without the initial forward slash:
val filePath = "service2/2019/06/30/21"
instead of
val filePath = "/service2/2019/06/30/21"
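As a small illustration of the fix (the helper name is hypothetical, not part of the AWS SDK), the leading slash can be stripped defensively before every upload:

```scala
// Hypothetical helper: S3 object keys are plain strings, and a key
// beginning with "/" ends up under an empty-named prefix, so the
// object does not appear where you expect it.
def normalizeKey(key: String): String = key.stripPrefix("/")

// Sketch of the corrected call from the question (AWS client elided):
// amazonS3Client.putObject(bucketName, normalizeKey(filePath),
//   new ByteArrayInputStream(fileContent.getBytes), meta)
println(normalizeKey("/service2/2019/06/30/21")) // prints "service2/2019/06/30/21"
```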

Spark properties file read

I am trying to read a properties file in Spark. The file exists at the given location, but running the job produces the error below.
The code is:
object runEmpJob {
  def main(args: Array[String]): Unit = {
    println("starting emp job")
    val props = ConfigFactory.load()
    val envProps = props.getConfig("C:\\Users\\mmishra092815\\IdeaProjects\\use_case_1\\src\\main\\Resource\\filepath.properties")
    System.setProperty("hadoop.home.directory", "D:\\SHARED\\winutils-master\\hadoop-2.6.3\\bin")
    val spark = SparkSession.builder()
      .appName("emp dept operation")
      .master(envProps.getString("Dev.executionMode"))
      .getOrCreate()
    val empObj = new EmpOperation
    empObj.runEmpOperation(spark, "String", fileType = "csv")
    val inPutPath = args(1)
    val outPutPath = args(2)
  }
}
I am getting this error:
Exception in thread "main"
com.typesafe.config.ConfigException$BadPath: path parameter: Invalid path C:\Users\mmishra092815\IdeaProjects\use_case_1\src\main\Resource\filepath.properties':
Token not allowed in path expression: ':' (you can double-quote this token if you really want it here)
at com.typesafe.config.impl.PathParser.parsePathExpression(PathParser.java:155)
at com.typesafe.config.impl.PathParser.parsePathExpression(PathParser.java:74)
at com.typesafe.config.impl.PathParser.parsePath(PathParser.java:61)
at com.typesafe.config.impl.Path.newPath(Path.java:230)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:192)
at com.typesafe.config.impl.SimpleConfig.getObject(SimpleConfig.java:268)
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:274)
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:41)
at executor.runEmpJob$.main(runEmpJob.scala:12)
at executor.runEmpJob.main(runEmpJob.scala)
Process finished with exit code 1
Loading happens in ConfigFactory.load(), which reads application.conf (and related resources) from the classpath; the string you can pass it is a classpath resource name, not a filesystem path. If you want to load configuration from a specific file, parse the file instead:
val props = ConfigFactory.parseFile(new File("C:\\Users\\mmishra092815\\IdeaProjects\\use_case_1\\src\\main\\Resource\\filepath.properties"))
As described in the API documentation, the getConfig method does not load configuration from a file - it returns a Config object for a given config path (not a filesystem path!), which is why the ':' in your Windows path triggers the BadPath exception.
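If all you need is key/value pairs from a .properties file, the JDK's java.util.Properties is a dependency-free alternative to Typesafe Config. A self-contained sketch (it writes its own throwaway file rather than using the asker's actual path):

```scala
import java.io.{FileInputStream, FileOutputStream}
import java.util.Properties

// Create a throwaway .properties file so the sketch is runnable as-is.
val tmp = java.io.File.createTempFile("filepath", ".properties")
val out = new FileOutputStream(tmp)
out.write("Dev.executionMode=local\n".getBytes("UTF-8"))
out.close()

// Load it the plain-JDK way: keys are looked up verbatim, so there is
// no "config path" parsing and no BadPath exception for ':' or '\'.
val props = new Properties()
val in = new FileInputStream(tmp)
props.load(in)
in.close()
println(props.getProperty("Dev.executionMode")) // prints "local"
```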

Data download in parallel with spark

I have a URL for a file that is encrypted with "AES/CBC/PKCS5Padding", and I have the cipher key and the IV string. I am using Amazon EC2 to download the data and store it in S3. What I currently do is create an input stream using HttpClient and GetMethod, decrypt it with the javax.crypto library, and finally put that input stream to an S3 location. This works for one file, but I need to scale up and do the same thing for many such URLs. How can I parallelize the download?
The download time for, say, n files is the same whether I use 1 node or 4 nodes (1 master and 3 slaves) if I don't parallelize the array of URLs, and if I use .par I get an out-of-memory error. Here is the code for the download and upload (the function downloadAndProcessDataFunc):
val client = new HttpClient()
val s3Client = AmazonS3ClientBuilder.standard().withRegion(Regions.EU_WEST_1).build()
val method = new GetMethod(url)
val metadata = new ObjectMetadata()
val responseCode = client.executeMethod(method)
println("FROM SCRIPT -> ", destFile, responseCode)
val istream = method.getResponseBodyAsStream()
val base64 = new Base64()
val key = base64.decode(cipherKey.getBytes())
val ivKey = base64.decode(iv.getBytes())
val secretKey = new SecretKeySpec(key, "AES")
val ivSpec = new IvParameterSpec(ivKey)
val cipher = Cipher.getInstance(algorithm)
cipher.init(Cipher.DECRYPT_MODE, secretKey, ivSpec)
val cistream = new CipherInputStream(istream, cipher)
s3Client.putObject(<bucket>, destFile, cistream, metadata)
This is the call I make, but it isn't parallelized:
manifestArray.foreach { row => downloadAndProcessDataFunc(row.split(",")(0), row.split(",")(1), row.split(",")(2), row.split(",")(3), row.split(",")(4), row.split(",")(5).toLong) }
This version runs out of memory:
manifest.par.foreach { row => downloadAndProcessDataFunc(row.mkString(",").split(",")(0), row.mkString(",").split(",")(1), row.mkString(",").split(",")(2), row.mkString(",").split(",")(3), element.replace("s3://testbucketforemrupload/manifest/", "data/") + "/" + row.mkString(",").split(",")(4).split("/").last) }
Now, if I change the downloadAndProcessDataFunc function to the following, it downloads only around 30 of the 1536 URLs and kills the rest of the executors with out-of-memory errors:
val client = new HttpClient()
val s3Client = AmazonS3ClientBuilder.standard().withRegion(Regions.EU_WEST_1).build()
val method = new GetMethod(url)
val metadata = new ObjectMetadata()
val responseCode = client.executeMethod(method)
println("FROM SCRIPT -> ", destFile, responseCode)
val istream = method.getResponseBodyAsStream()
val base64 = new Base64()
val key = base64.decode(cipherKey.getBytes())
val ivKey = base64.decode(iv.getBytes())
val secretKey = new SecretKeySpec(key, "AES")
val ivSpec = new IvParameterSpec(ivKey)
val cipher = Cipher.getInstance(algorithm)
cipher.init(Cipher.DECRYPT_MODE, secretKey, ivSpec)
val cistream = new CipherInputStream(istream, cipher)
metadata.setContentLength(sizeInBytes)
s3Client.putObject(<bucket>, destFile, cistream, metadata)
The variables are self-explanatory.
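One common way to avoid the out-of-memory behaviour of .par is to bound the parallelism yourself with a fixed thread pool, so only a few downloads are in flight at once. A minimal stdlib-only sketch (process is a hypothetical stand-in for downloadAndProcessDataFunc; the real body would stream the decrypted bytes to S3):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// A fixed pool caps how many "downloads" run concurrently, so memory
// use stays proportional to the pool size, not the number of URLs.
val pool = Executors.newFixedThreadPool(4)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// Stand-in for downloadAndProcessDataFunc.
def process(url: String): String = s"done: $url"

val urls = (1 to 16).map(i => s"https://example.com/file$i")
val results = Await.result(Future.sequence(urls.map(u => Future(process(u)))), 1.minute)
pool.shutdown()
println(results.size) // 16 tasks completed, at most 4 at a time
```

Setting metadata.setContentLength, as the second snippet already does, also matters here: when the content length is unknown, the S3 client may buffer the whole stream in memory before uploading, which compounds the out-of-memory problem.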

Download a gz file from a URL and convert it to CSV using Scala

I am really struggling with the syntax here and need help.
I have a URL; hitting it downloads a sample.csv.gz file.
Can someone please help me fill in the syntactic gaps below?
val outputFile = "C:\\sampleNew" + ".csv"
val inputFile = "C:\\sample.csv.gz"
val fileUrl = "someSamplehttpUrl"
// On hitting this URL, sample.csv.gz should be downloaded to 'inputFile'
val in = new URL(fileUrl).openStream()
Files.copy(in, Paths.get(inputFile), StandardCopyOption.REPLACE_EXISTING)
val filePath = new File(outputFile)
if (filePath.exists()) filePath.delete()
val fw = new FileWriter(outputFile, true)
val bf = new BufferedReader(new InputStreamReader(new GZIPInputStream(new FileInputStream(inputFile)), "UTF-8"))
while (bf.ready()) fw.append(bf.readLine() + "\n")
fw.close()
bf.close()
I have been getting several syntax errors. Any corrections here? I basically have an HTTP GET request that returns a URL, which I must open to download this gz file.
Thanks!
Here are a few possible solutions:
import java.io.{File, PrintWriter}
import scala.io.Source
val outputFile = "out.csv"
val inputFile = "/tmp/marks.csv"
val fileUrl = s"file:///$inputFile"
// Method 1, a traditional copy from the input to the output.
val in = Source.fromURL(fileUrl)
val out = new PrintWriter(outputFile)
for (line <- in.getLines)
out.println(line)
out.close
in.close
Here is a one-liner that pipes the data from the input to the output.
import sys.process._
import java.net.URL
val outputFile = "out.csv"
val inputFile = "/tmp/marks.csv"
val fileUrl = s"file:///$inputFile"
// Method 2, pipe the content of the URL to the output file.
new URL(fileUrl) #> new File(outputFile) !!
Here is a version using Files.copy
val outputFile = "out.csv"
val inputFile = "/tmp/marks.csv"
val fileUrl = s"file:///$inputFile"
import java.nio.file.{Files, Paths, StandardCopyOption}
import java.net.URL
val in = new URL(fileUrl).openStream
val out = Paths.get(outputFile)
Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING)
Hopefully one (or more) of the above will address your needs.
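Note that the question's file is gzip-compressed, and none of the snippets above decompress it. Wrapping the URL stream in a GZIPInputStream handles that; here is a self-contained sketch that builds its own gzipped "download" in memory instead of hitting a real URL:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.file.{Files, Paths, StandardCopyOption}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Fabricate a small gzipped payload so the sketch is runnable as-is.
val buf = new ByteArrayOutputStream()
val gz = new GZIPOutputStream(buf)
gz.write("a,b,c\n1,2,3\n".getBytes("UTF-8"))
gz.close()

// In the real code this would be:
//   new GZIPInputStream(new URL(fileUrl).openStream)
val in = new GZIPInputStream(new ByteArrayInputStream(buf.toByteArray))
val outputFile = Paths.get("out.csv")
// Files.copy reads the *decompressed* bytes, so out.csv is plain CSV.
Files.copy(in, outputFile, StandardCopyOption.REPLACE_EXISTING)
in.close()
println(new String(Files.readAllBytes(outputFile), "UTF-8"))
```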

Read a file from HDFS and assign the contents to string

In Scala, how do I read a file from HDFS and assign its contents to a variable? I know how to read a file, and I am able to print it. But if I try to assign the contents to a string, the output is Unit (()). Below is the code I tried.
val dfs = org.apache.hadoop.fs.FileSystem.get(config);
val snapshot_file = "/path/to/file/test.txt"
val stream = dfs.open(new Path(snapshot_file))
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
readLines.takeWhile(_ != null).foreach(line => println(line))
The above code prints the output properly. But if I try to assign the output to a string, I get:
val snapshot_id = readLines.takeWhile(_ != null).foreach(line => println(line))
snapshot_id: Unit = ()
What is the correct way to assign the contents to a variable?
You need to use mkString: println returns Unit, which is what gets stored in your variable when you call println on your stream.
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://namenode:port/"), new org.apache.hadoop.conf.Configuration())
val path = new org.apache.hadoop.fs.Path("/user/cloudera/file.txt")
val stream = hdfs.open(path)
def readLines = scala.io.Source.fromInputStream(stream)
val snapshot_id: String = readLines.getLines().mkString("\n")
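To see why this works independently of HDFS, the same pattern can be exercised against any InputStream (an in-memory one here): getLines() iterates the lines and mkString joins them back into a single String.

```scala
import java.io.ByteArrayInputStream
import scala.io.Source

// Any InputStream behaves the same way the HDFS stream does above.
val stream = new ByteArrayInputStream("snap-001\nsnap-002\n".getBytes("UTF-8"))
val snapshot_id: String = Source.fromInputStream(stream).getLines().mkString("\n")
println(snapshot_id) // prints the two lines joined by "\n"
```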
I used org.apache.commons.io.IOUtils.toString to convert the stream into a string:
def getfileAsString(file: String): String = {
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, LocalFileSystem, Path}
  import org.apache.hadoop.hdfs.DistributedFileSystem
  val config: Configuration = new Configuration()
  config.set("fs.hdfs.impl", classOf[DistributedFileSystem].getName)
  config.set("fs.file.impl", classOf[LocalFileSystem].getName)
  val dfs = FileSystem.get(config)
  val filePath: FSDataInputStream = dfs.open(new Path(file))
  logInfo("file.available " + filePath.available)
  val outputxmlAsString: String = org.apache.commons.io.IOUtils.toString(filePath, "UTF-8")
  outputxmlAsString
}

How to use scalatest to test file upload in Play Framework?

I am writing tests for an application created using Scala/Play Framework. There is a route which takes a file to upload. This is what I have written so far.
val dataFile: File = new File("../TestCSV/product.csv")
val tempFile = TemporaryFile(dataFile)
val part = FilePart[TemporaryFile](key = "dataFile", filename = "product.csv", contentType = Some("application/vnd.ms-excel"), ref = tempFile)
val formData: MultipartFormData[TemporaryFile] = MultipartFormData[TemporaryFile](dataParts = Map(), files = Seq(part), badParts = Seq(), missingFileParts = Seq())
val request: FakeRequest[MultipartFormData[TemporaryFile]] = FakeRequest[MultipartFormData[TemporaryFile]]("POST", "/api/core/v0.1/data-import/uploads/%s/product".format(sandboxId), headers, formData)
val response = route(request).get
status(response) mustBe OK
I am getting this error:
Cannot write an instance of play.api.mvc.MultipartFormData[play.api.libs.Files.TemporaryFile] to HTTP response. Try to define a Writeable[play.api.mvc.MultipartFormData[play.api.libs.Files.TemporaryFile]]
How do I make this class writable?