Load CSV file in S3 with AWS Glue in Scala - scala

This should be easy...
For my AWS Glue job, I want to load my configuration settings from a CSV file on S3. This way, my lambda function can trigger the job and send the file name as a parameter. In Python, I can do this easily:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket(<my bucket name>)
obj = s3.Object(<my bucket name>, <file location>)
data = obj.get()['Body'].read().decode('utf-8')
In Scala, I can't find anything equivalent to the boto3 library. I've tried the getSourceWithFormat function like this:
var datasource = glueContext.getSourceWithFormat("s3",
    JsonOptions(Map("paths" -> Set(<file folder name>), "exclusions" -> <file patterns to exclude>)),
    format = "csv",
    formatOptions = JsonOptions(Map("separator" -> "\t", "header" -> true)))
  .getDynamicFrame()
but I'd like to just load a single file and manipulate it like an array of strings.
Thank you!

It should go like this:
Write Python code in the Lambda function to read the file.
Create your Glue job with Scala code.
Make sure a trigger is enabled that calls the Glue job with the file name as a parameter.
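To make the hand-off concrete, here is a minimal sketch of the Glue Scala side reading the file name passed in by the Lambda trigger. The argument name file_name is hypothetical, and the skeleton follows the standard Glue Scala job template:
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object ConfigDrivenJob {
  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(new SparkContext())
    // "file_name" is a hypothetical job argument set by the Lambda when it starts the job
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME", "file_name").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val fileName = args("file_name")
    // ... load and parse the CSV at fileName here ...

    Job.commit()
  }
}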

How about converting your datasource to a DataFrame and then calling the collect method on it?
val myArray = datasource.toDF().collect
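If the goal is simply to read one S3 object as an array of strings (the boto3 equivalent), a minimal sketch using the AWS SDK for Java, which a Glue Scala job can call directly, might look like this; the bucket and key below are placeholders:
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.io.Source

// Uses the default credential provider chain and region
val s3Client = AmazonS3ClientBuilder.defaultClient()
val s3Object = s3Client.getObject("my-config-bucket", "config/settings.csv")

// Read the whole object and expose it as an array of lines
val lines: Array[String] =
  Source.fromInputStream(s3Object.getObjectContent, "UTF-8").getLines().toArray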

Related

get GCS file metadata using scala

I want to get the creation time of files in GCS, so I used the code below:
println(Files
.getFileAttributeView(Paths.get("gs://datalake-dev/mu/tpu/file.0450138"), classOf[BasicFileAttributeView])
.readAttributes.creationTime)
The problem is that the Paths.get function replaces // with /, so I get gs:/datalake-dev/mu/tpu/file.0450138 instead of gs://datalake-dev/mu/tpu/file.0450138.
Can anyone help me with this?
Thanks a lot!
I solved the problem by adding the following Java code and then calling the Java function from Scala.
import com.google.cloud.storage.*;
import java.sql.Timestamp;

public class ExtractDate {
    public static String getTime(String fileName) {
        String bucketName = "bucket-data";
        String blobName = "doc/files/" + fileName;

        // Instantiates a client
        Storage storage_client = StorageOptions.getDefaultInstance().getService();
        Bucket bucket = storage_client.get(bucketName);

        BlobId blobId = BlobId.of(bucketName, blobName);
        Blob blob = storage_client.get(blobId);

        Timestamp tmp = new Timestamp(bucket.get(blobName).getCreateTime());
        System.out.print(bucket.get(blobName).getContent());

        // return the year of the file creation date
        return tmp.toString().substring(0, 4);
    }
}
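For reference, the same google-cloud-storage client can be called from Scala directly, which avoids the extra Java class. This is only a sketch; the bucket and object names are placeholders taken from the snippets above:
import java.sql.Timestamp
import com.google.cloud.storage.{BlobId, StorageOptions}

val storage = StorageOptions.getDefaultInstance.getService
val blob = storage.get(BlobId.of("bucket-data", "doc/files/file.0450138"))

// getCreateTime returns epoch milliseconds
val created = new Timestamp(blob.getCreateTime)
println(created)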
You can use the file_get_contents method to read the contents of the path. From the documentation on Reading and Writing Files:
You can read object contents using PHP to fetch an object's custom metadata from Google Cloud Storage. An App Engine PHP 5 app must use the Cloud Storage stream wrapper to write files at runtime. However, if an app needs to read files, and these files are static, you can optionally read static files uploaded with your app using PHP filesystem functions such as file_get_contents.
$fileContents = file_get_contents($filePath);
where the path specified must be a path relative to the script accessing them.
You must upload the file or files in an application subdirectory when you deploy your app to App Engine, and must configure the app.yaml file so your app can access those files. For complete details, see PHP 5 Application Configuration with app.yaml.
In the app.yaml configuration, notice that if you use a static file or directory handler (static_files or static_dir) you must specify application_readable set to true or your app won't be able to read the files. However, if the files are served by a script handler, this isn't necessary, because these files are readable by script handlers by default.

How to convert csv file in S3 bucket to RDD

I'm pretty new with this topic so any help will be much appreciated.
I'm trying to read a CSV file stored in an S3 bucket and convert its data to an RDD to work with it directly, without needing to create a file locally.
So far I've been able to load the file using AmazonS3ClientBuilder, but all I've got is the file content in an S3ObjectInputStream, and I don't know how to work with it.
val bucketName = "bucket-name"
val credentials = new BasicAWSCredentials(
  "accessKey",
  "secretKey"
)

val s3client = AmazonS3ClientBuilder
  .standard()
  .withCredentials(new AWSStaticCredentialsProvider(credentials))
  .withRegion(Regions.US_EAST_2)
  .build()

val s3object = s3client.getObject(bucketName, "file-name.csv")
val inputStream = s3object.getObjectContent()
....
I have also tried to use a BufferedSource to work with it, but once done, I don't know how to convert it to a DataFrame or an RDD.
val myData = Source.fromInputStream(inputStream)
....
You can do it with the S3A file system provided in the Hadoop-AWS module:
Add this dependency: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
Either define <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3a.S3AFileSystem</value></property> in core-site.xml, or add .config("fs.s3.impl", classOf[S3AFileSystem].getName) to the SparkSession builder
Access S3 using spark.read.csv("s3://bucket/key"). If you want the RDD that was asked for, use spark.read.csv("s3://bucket/key").rdd
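A minimal sketch putting these steps together; the bucket name, key and credential handling (here the default provider chain) are assumptions, not part of the original answer:
import org.apache.hadoop.fs.s3a.S3AFileSystem
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-s3-csv")
  .config("fs.s3.impl", classOf[S3AFileSystem].getName)
  .getOrCreate()

// Credentials come from the default AWS provider chain (env vars, instance profile, ...)
val df  = spark.read.option("header", "true").csv("s3://bucket-name/file-name.csv")
val rdd = df.rdd   // RDD[Row], if an RDD is really needed instead of a DataFrame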
In the end I was able to get the results I was looking for by taking a look at https://gist.github.com/snowindy/d438cb5256f9331f5eec
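For completeness, and not necessarily what the linked gist does, here is a sketch of turning the S3ObjectInputStream from the question into an RDD (or DataFrame) without writing a local file, assuming a SparkSession named spark and the inputStream above are in scope:
import scala.io.Source

// Read the object fully on the driver; fine for small config-style files
val lines = Source.fromInputStream(inputStream, "UTF-8").getLines().toList

val rdd = spark.sparkContext.parallelize(lines)   // RDD[String]
val df  = spark.read.csv(spark.createDataset(lines)(org.apache.spark.sql.Encoders.STRING))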

How to save RDD data into json files, not folders

I am receiving the streaming data myDStream (DStream[String]) that I want to save in S3 (basically, for this question, it doesn't matter where exactly do I want to save the outputs, but I am mentioning it just in case).
The following code works well, but it saves folders with the names like jsonFile-19-45-46.json, and then inside the folders it saves files _SUCCESS and part-00000.
Is it possible to save each RDD[String] (these are JSON strings) into JSON files rather than folders? I thought that repartition(1) would do the trick, but it didn't.
myDStream.foreachRDD { rdd =>
  // datetimeString = ....
  rdd.repartition(1).saveAsTextFile("s3n://mybucket/keys/jsonFile-" + datetimeString + ".json")
}
AFAIK there is no option to save it as a single file, because Spark is a distributed processing framework and it's not good practice to write to a single file; rather, each partition writes its own file under the specified path.
We can only pass the output directory where we want the data saved. The output writer will create one or more files (depending on the number of partitions) inside the specified path, with a part- file name prefix.
As an alternative to rdd.collect.mkString("\n"), you can use the Hadoop FileSystem library to clean up the output by moving the part-00000 file into its place. The code below works perfectly on the local filesystem and HDFS, but I'm unable to test it with S3:
val outputPath = "path/to/some/file.json"
rdd.saveAsTextFile(outputPath + "-tmp")
import org.apache.hadoop.fs.Path
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
fs.delete(new Path(outputPath + "-tmp"), true)
For Java I implemented this one. Hope it helps:
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
File dir = new File(System.getProperty("user.dir") + "/my.csv/");
File[] files = dir.listFiles((d, name) -> name.endsWith(".csv"));
fs.rename(new Path(files[0].toURI()), new Path(System.getProperty("user.dir") + "/csvDirectory/newData.csv"));
fs.delete(new Path(System.getProperty("user.dir") + "/my.csv/"), true);
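Since the snippet above could not be tested against S3, here is a sketch (also untested against a real bucket) of the same rename trick in Scala where the FileSystem is obtained from the S3A URI itself, so it targets the bucket rather than the local/default filesystem. The bucket name and paths are placeholders, and spark and rdd are assumed to be in scope:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val outputPath = "s3a://mybucket/keys/jsonFile.json"
val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(new URI(outputPath), conf)

rdd.repartition(1).saveAsTextFile(outputPath + "-tmp")
fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
fs.delete(new Path(outputPath + "-tmp"), true)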

How to write streaming data to S3?

I want to write RDD[String] to Amazon S3 in Spark Streaming using Scala. These are basically JSON strings. Not sure how to do it more efficiently.
I found this post, in which the library spark-s3 is used. The idea is to create SparkContext and then SQLContext. After this the author of the post does something like this:
myDstream.foreachRDD { rdd =>
  rdd.toDF().write
    .format("com.knoldus.spark.s3")
    .option("accessKey", "s3_access_key")
    .option("secretKey", "s3_secret_key")
    .option("bucket", "bucket_name")
    .option("fileType", "json")
    .save("sample.json")
}
What other options are there besides spark-s3? Is it possible to append to a file on S3 with the streaming data?
Files on S3 cannot be appended. In S3, an "append" means replacing the existing object with a new object that contains the additional data.
You should take a look at the mode method of DataFrameWriter in the Spark documentation:
public DataFrameWriter mode(SaveMode saveMode)
Specifies the behavior when data or table already exists. Options include:
- SaveMode.Overwrite: overwrite the existing data.
- SaveMode.Append: append the data.
- SaveMode.Ignore: ignore the operation (i.e. no-op).
- SaveMode.ErrorIfExists: default option, throw an exception at runtime.
You can try something like this with the Append save mode:
rdd.toDF().write
  .format("json")
  .mode(SaveMode.Append)
  .save("s3://iiiii/ttttt.json")
Spark Append:
Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
Basically you can choose the output format you want by passing it to the format method:
public DataFrameWriter format(java.lang.String source)
Specifies the underlying output data source. Built-in options include "parquet", "json", etc.
e.g. as Parquet:
df.write().format("parquet").save("yourfile.parquet")
or as JSON:
df.write().format("json").save("yourfile.json")
Edit: Added details about s3 credentials:
There are two different ways to set the credentials, and we can see this in SparkHadoopUtil.scala:
with environment variables (System.getenv("AWS_ACCESS_KEY_ID")) or with spark.hadoop.* properties:
SparkHadoopUtil.scala:
if (key.startsWith("spark.hadoop.")) {
  hadoopConf.set(key.substring("spark.hadoop.".length), value)
}
So you need to get the Hadoop configuration via javaSparkContext.hadoopConfiguration() or scalaSparkContext.hadoopConfiguration and set
hadoopConfiguration.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConfiguration.set("fs.s3.awsSecretAccessKey", mySecretKey)

Sending email with attachment using scala and Liftweb

This is the first time I am integrating an email service with Liftweb.
I want to send email with attachments (like documents, images, PDFs).
My code looks like this:
case class CSVFile(bytes: Array[Byte],
  filename: String = "file.csv",
  mime: String = "text/csv; charset=utf8; header=present")

val attach = CSVFile(fileupload.mkString.getBytes("utf8"))
val body = <p>Please research the enclosed.</p>
val msg = XHTMLPlusImages(body,
  PlusImageHolder(attach.filename, attach.mime, attach.bytes))

Mailer.sendMail(
  From("vyz#gmail.com"),
  Subject(subject(0)),
  To(to(0)),
  msg)
This code is taken from the Lift Cookbook, but it isn't doing what I need: the mail is sent, but only the attached file name comes through (file.csv) with no data in it (I uploaded a gsy.docx file).
Best regards,
GSY
You don't specify what type fileupload is, but assuming it is of type net.liftweb.http.FileParamHolder, the issue is that you can't just call mkString and expect it to return any data, since there is no data in the object itself, just a fileStream method for retrieving it (either from disk or memory).
The easiest way to accomplish what you want would be to use a ByteArrayOutputStream and copy the data to it. I haven't tested it, but the code below should solve your issue. For brevity, it uses Apache Commons IO to copy the streams, but you could just as easily do it natively.
import java.io.ByteArrayOutputStream
import org.apache.commons.io.IOUtils

val data = {
  val os = new ByteArrayOutputStream()
  IOUtils.copy(fileupload.fileStream, os)
  os.toByteArray
}
val attach = CSVFile(data)
BTW, you say you are uploading a Word (DOCX) file and expecting it to automatically be CSV when the extension is changed? You will just get a DOCX file with a csv extension unless you actually do some conversion.
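If the intent is just to mail back whatever was uploaded rather than a CSV, a sketch along these lines (assuming fileupload is a FileParamHolder, data is the byte array built above, and keeping the question's placeholder addresses) would reuse the upload's own file name and MIME type:
import net.liftweb.util.Mailer
import net.liftweb.util.Mailer._

val body = <p>Please research the enclosed.</p>

// Attach the uploaded bytes under their original name and MIME type
val msg = XHTMLPlusImages(body,
  PlusImageHolder(fileupload.fileName, fileupload.mimeType, data))

Mailer.sendMail(
  From("vyz#gmail.com"),
  Subject(subject(0)),
  To(to(0)),
  msg)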