How to write streaming data to S3? - scala

I want to write an RDD[String] to Amazon S3 in Spark Streaming using Scala. These are basically JSON strings. I'm not sure how to do this efficiently.
I found this post, in which the spark-s3 library is used. The idea is to create a SparkContext and then a SQLContext. After that, the author of the post does something like this:
myDstream.foreachRDD { rdd =>
  rdd.toDF().write
    .format("com.knoldus.spark.s3")
    .option("accessKey", "s3_access_key")
    .option("secretKey", "s3_secret_key")
    .option("bucket", "bucket_name")
    .option("fileType", "json")
    .save("sample.json")
}
What other options are there besides spark-s3? Is it possible to append to a file on S3 with the streaming data?

Files on S3 cannot be appended. In S3, an "append" means replacing the existing object with a new object that contains the additional data.

You should take a look at the mode method of DataFrameWriter in the Spark documentation:
public DataFrameWriter mode(SaveMode saveMode)
Specifies the behavior when data or table already exists. Options include:
- SaveMode.Overwrite: overwrite the existing data.
- SaveMode.Append: append the data.
- SaveMode.Ignore: ignore the operation (i.e. no-op).
- SaveMode.ErrorIfExists: default option, throw an exception at runtime.
You can try something like this with the Append save mode:
// requires: import org.apache.spark.sql.SaveMode
rdd.toDF().write
  .format("json")
  .mode(SaveMode.Append)
  .save("s3://iiiii/ttttt.json")
Spark Append:
Append mode means that when saving a DataFrame to a data source, if
data/table already exists, contents of the DataFrame are expected to
be appended to existing data.
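Putting the pieces together for the streaming case, a rough sketch could look like the following; the SQLContext handling, bucket name, and output prefix are assumptions rather than something from the original post:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.dstream.DStream

// Sketch only: each micro-batch adds new part files under the prefix;
// the existing S3 objects themselves are never modified.
def writeBatchesToS3(myDstream: DStream[String]): Unit = {
  myDstream.foreachRDD { rdd: RDD[String] =>
    if (!rdd.isEmpty()) {
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      val df = sqlContext.read.json(rdd)   // parse the JSON strings into a DataFrame
      df.write
        .mode(SaveMode.Append)
        .json("s3://my-bucket/streaming-output/")
    }
  }
}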
Basically, you can choose which output format you want by passing it to the format method:
public DataFrameWriter format(java.lang.String source)
Specifies the underlying output data source. Built-in options include "parquet", "json", etc.
e.g. as Parquet:
df.write.format("parquet").save("yourfile.parquet")
or as JSON:
df.write.format("json").save("yourfile.json")
Edit: added details about S3 credentials.
There are two different ways to set the credentials, and we can see this in SparkHadoopUtil.scala: with environment variables (System.getenv("AWS_ACCESS_KEY_ID")) or with spark.hadoop.* properties:
SparkHadoopUtil.scala:
if (key.startsWith("spark.hadoop.")) {
  hadoopConf.set(key.substring("spark.hadoop.".length), value)
}
So you need to get the Hadoop configuration via javaSparkContext.hadoopConfiguration() or scalaSparkContext.hadoopConfiguration and set:
hadoopConfiguration.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConfiguration.set("fs.s3.awsSecretAccessKey", mySecretKey)

Related

Remove all files with a given extension using scala spark

I have some .csv.crc files generated when I write a DataFrame to a CSV file using Spark. I want to delete all files with the .csv.crc extension:
val fs = FileSystem.get(existingSparkSession.sparkContext.hadoopConfiguration)
val srcPath=new Path("./src/main/resources/myDirectory/*.csv.crc")
println(fs.exists(srcPath))
println(fs.isFile(srcPath))
if(fs.exists(srcPath) && fs.isFile(srcPath)) {
fs.delete(srcPath,true)
}
Both println lines print false, so it never even enters the if condition. How can I delete all .csv.crc files using Scala and Spark?
You can use the option below to avoid .crc files while writing (note: you are eliminating the checksum):
fs.setVerifyChecksum(false)
Otherwise, you can skip checksum verification while reading with:
existingSparkSession.sparkContext.hadoopConfiguration.set("dfs.client.read.shortcircuit.skip.checksum", "true")
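For context, here is a minimal sketch of where those calls sit; existingSparkSession is the session from the question, while df and the output directory are placeholders:
import org.apache.hadoop.fs.FileSystem

// Sketch: switch checksums off on the FileSystem before Spark writes the CSV,
// so no .csv.crc companion files are produced for the new output.
val fs = FileSystem.get(existingSparkSession.sparkContext.hadoopConfiguration)
fs.setVerifyChecksum(false)
fs.setWriteChecksum(false)   // companion switch for the write side on checksum-backed file systems

df.write
  .option("header", "true")
  .csv("./src/main/resources/myDirectory")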

Load CSV file in S3 with AWS Glue in Scala

This should be easy...
For my AWS Glue job, I want to load my configuration settings from a CSV file on S3. This way, my lambda function can trigger the job and send the file name as a parameter. In Python, I can do this easily:
s3 = boto3.resource('s3')
bucket = s3.Bucket(<my bucket name>)
obj = s3.Object(<my bucket name>,<file location>)
data = obj.get()['Body'].read().decode('utf-8')
In Scala, I can't find anything equivalent to the boto3 library. I've tried the getSourceWithFormat function like this:
var datasource = glueContext.getSourceWithFormat("s3", JsonOptions(Map("paths" -> Set(<file folder name>)),
Map("exclusions" -> <file patterns to exclude>)),
format = "csv", formatOptions = JsonOptions(Map("separator" -> "\t"),Map("header" -> true)))
.getDynamicFrame()
but I'd like to just load a single file and manipulate it like an array of strings.
Thank you!
It should go like this:
- Write Python code in Lambda to read the file.
- Create your Glue job with Scala code.
- Make sure you have a trigger enabled which will call the Glue job with the file names.
How about converting your datasource to a DataFrame and then calling the collect method on it?
val myArray = datasource.toDF().collect
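If you then want something like an array of strings, one option (the tab delimiter is just an assumption) is to flatten each collected Row yourself:
// Sketch: turn each collected Row into a single tab-delimited string.
val myArray: Array[String] = datasource.toDF()
  .collect()
  .map(row => row.mkString("\t"))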

How to pass schema file as Macros to BigQuery sink in data fusion

I am creating a Data Fusion pipeline to load CSV data from GCS to BigQuery. For my use case I need to create a property macro and provide its value at runtime. I need to understand how we can pass a schema file as a macro to the BigQuery sink.
If I simply pass the JSON schema file path as the macro value, I get the following error:
java.lang.IllegalArgumentException: Invalid schema: Use JsonReader.setLenient(true) to accept malformed JSON at line 1 column 1
There is currently no way to use the contents of a file as a macro value, though there is a jira open for something like this (https://issues.cask.co/browse/CDAP-15424). It is expected that the schema contents should be set as macro value. The UI currently doesn't handle these types of macro values very well (https://issues.cask.co/browse/CDAP-15423), so I would suggest setting it through the REST endpoint (https://docs.cdap.io/cdap/6.0.0/en/reference-manual/http-restful-api/preferences.html#H2290), where the app name is the pipeline name.
Alternatively, you can make your pipeline a little more generic by writing an Action plugin that looks something like:
@Override
public void run(ActionContext context) throws Exception {
  String schema = readFileContents();
  context.getArguments().setArgument(key, schema);
}
The plugin would be the first stage in your pipeline, and would allow subsequent stages in your pipeline to use ${key} as a macro that would be replaced with the actual schema.
If you are using a BatchSink, you can read the argument in prepareRun with something like:
@Override
public void prepareRun(BatchSinkContext context) {
  String token =
      Objects.requireNonNull(
          context.getArguments().get("token"),
          "Argument Setter has failed in initializing the \"token\" argument.");
  HTTPSinkConfig.setToken(token);
}

Converting a sequence of Json Object to An Rdd

I currently have a JSON file, say student.json. The structure looks something like this:
{"serialNo":"1","name":"Rahul"}
{"serialNo":"2","name":"Rakshith"}
case class Student(serialNo:Int,name:String)
student.json is a huge file which I am planning to parse with a Spark job. And the snippet:
import play.api.libs.json.{ Json, JsObject, JsString }
.....
.....
for {
  jsonLine <- sc.textFile("student.json")
  student  <- Json.parse(jsonLine).asOpt[Student]
} yield (student.serialNo -> student.name)
Is there a better way to do this??
If student.json is a huge file, and each line is just a valid json object, you should do:
val myRdd = sc.textFile("student.json").map(l => Json.parse(l).asOpt[Student])
If you want to get the RDD's contents back to your local driver, you can:
val students = myRdd.collect() // then you can operate on it in the old-fashioned way
I see you are importing play.api.libs.json, which is from the Play Framework. I don't think running a Spark program inside a web application is a good idea...
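For completeness, a hedged sketch of the flatMap variant (it assumes the Student case class from the question and that the JSON field types actually match it, i.e. serialNo is a number rather than a string):
import play.api.libs.json.{Json, Reads}

// Sketch: parse each line and silently drop lines that fail to deserialize.
val pairs = sc.textFile("student.json").mapPartitions { lines =>
  // Build the Reads inside each partition so it is not captured in the task closure.
  implicit val studentReads: Reads[Student] = Json.reads[Student]
  lines.flatMap(line => Json.parse(line).asOpt[Student].map(s => s.serialNo -> s.name))
}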

How to change binary file into RDD or Dataframe?

http://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds
The link shows how to turn a text file into an RDD and then into a DataFrame.
So how do I deal with a binary file?
Could you give an example? Thank you very much.
There is a similar question without an answer here: reading binary data into (py) spark DataFrame
To give more detail: I don't know how to parse the binary file. For example, I can parse a text file into lines or words like this:
JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
    new Function<String, Person>() {
      public Person call(String line) throws Exception {
        String[] parts = line.split(",");
        Person person = new Person();
        person.setName(parts[0]);
        person.setAge(Integer.parseInt(parts[1].trim()));
        return person;
      }
    });
It seems that I just need an API that can parse a binary file or binary stream in a similar way:
JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.bin").map(
    new Function<String, Person>() {
      public Person call(/* stream or binary file */) throws Exception {
        /* code to construct every row */
        return person;
      }
    });
EDIT:
The binary file contains structured data (a table from a relational database; the database is home-made) and I know the metadata of that structure. I plan to turn the structured data into RDD[Row].
I can control everything about the binary file, since I use the FileSystem API (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html) to write the binary stream into HDFS, and the binary file is splittable. I just don't have any idea how to parse the binary file the way the example code above does, so I can't try anything so far.
There is a binary record reader already available for Spark (I believe it is available in 1.3.1, at least in the Scala API):
sc.binaryRecords(path: String, recordLength: Int, conf: Configuration)
It's on you, though, to convert those binary records into a format acceptable for processing.
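A rough sketch of what that could look like for fixed-length records (the 8-byte layout of an int id followed by an int age, the HDFS path, and the sqlContext name are all assumptions made for illustration):
import java.nio.ByteBuffer

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Assumption: every record is exactly 8 bytes, an int id followed by an int age (big-endian).
val recordLength = 8
val records = sc.binaryRecords("hdfs:///data/people.bin", recordLength)

// Decode each fixed-length byte array into a Row.
val rows = records.map { bytes =>
  val buf = ByteBuffer.wrap(bytes)   // ByteBuffer defaults to big-endian
  Row(buf.getInt(), buf.getInt())
}

// Attach the schema you already know from the metadata of your home-made database.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("age", IntegerType, nullable = false)))

val df = sqlContext.createDataFrame(rows, schema)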