How to convert a binary file into an RDD or DataFrame? - scala

http://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds
The link above shows how to load a text file into an RDD and then convert it to a DataFrame.
How can I do the same with a binary file?
An example would be much appreciated. Thank you.
There is a similar unanswered question here: reading binary data into (py) spark DataFrame
To be more specific, I don't know how to parse the binary file. For example, I can parse a text file into lines or words like this:
JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
    new Function<String, Person>() {
        public Person call(String line) throws Exception {
            String[] parts = line.split(",");
            Person person = new Person();
            person.setName(parts[0]);
            person.setAge(Integer.parseInt(parts[1].trim()));
            return person;
        }
    });
It seems that what I need is an API that can parse a binary file or binary stream in the same way:
JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.bin").map(
new Function<String, Person>() {
public Person call(/*stream or binary file*/) throws Exception {
/*code to construct every row*/
return person;
}
});
EDIT:
The binary file contains structured data (a table from a relational database; the database is home-grown), and I know the metadata of the structured data. I plan to convert the structured data into an RDD[Row].
I can change anything about the binary file, because I write the binary stream into HDFS myself using the FileSystem API (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html). The binary file is also splittable. I have no idea how to parse the binary file the way the example code above parses text, so I haven't been able to try anything so far.

A binary record reader is already available in Spark (I believe as of 1.3.1, at least in the Scala API):
sc.binaryRecords(path: String, recordLength: Int, conf: Configuration)
It returns an RDD[Array[Byte]] with one element per fixed-length record. It is up to you, though, to convert those bytes into a format suitable for processing.
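To illustrate the "convert those bytes" step, here is a minimal Java sketch that decodes one fixed-length record with `ByteBuffer`. The record layout (a 16-byte zero-padded UTF-8 name followed by a 4-byte big-endian age), the class name `PersonRecordParser`, and the `encode` helper are all assumptions made up for illustration; the real layout depends on how your database writes its rows.

```java
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class PersonRecordParser {
    // Hypothetical layout: 16-byte name (UTF-8, zero-padded) + 4-byte big-endian int age.
    static final int NAME_BYTES = 16;
    static final int RECORD_LENGTH = NAME_BYTES + Integer.BYTES;

    static final class Person implements Serializable {
        final String name;
        final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Decodes exactly one record of RECORD_LENGTH bytes.
    static Person parse(byte[] record) {
        if (record.length != RECORD_LENGTH) {
            throw new IllegalArgumentException("expected " + RECORD_LENGTH + " bytes, got " + record.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(record).order(ByteOrder.BIG_ENDIAN);
        byte[] nameBytes = new byte[NAME_BYTES];
        buf.get(nameBytes);
        // The name ends at the first zero byte (or fills the whole field).
        int end = 0;
        while (end < NAME_BYTES && nameBytes[end] != 0) {
            end++;
        }
        String name = new String(nameBytes, 0, end, StandardCharsets.UTF_8);
        int age = buf.getInt();
        return new Person(name, age);
    }

    // Writes a record in the same hypothetical layout; names longer than the field are truncated.
    static byte[] encode(String name, int age) {
        ByteBuffer buf = ByteBuffer.allocate(RECORD_LENGTH).order(ByteOrder.BIG_ENDIAN);
        byte[] n = name.getBytes(StandardCharsets.UTF_8);
        buf.put(n, 0, Math.min(n.length, NAME_BYTES));
        buf.position(NAME_BYTES);
        buf.putInt(age);
        return buf.array();
    }
}
```

With a layout like this, each byte array produced by binaryRecords (called with RECORD_LENGTH as the record length) could be passed through `parse` inside a `map`, much like the text example in the question.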

Is there a way to read an Excel file using Dataflow

Is there a way to read an Excel file stored in a GCS bucket using Dataflow?
And I would also like to know if we can access the metadata of an object in GCS using Dataflow. If yes then how?
Excel data is often exchanged as CSV files. These files can be split and read line by line, so they are ideal for Dataflow. You can use TextIO.Read to pull in each line of the file and then parse each one as a CSV line.
If you want to use a binary Excel format instead, then I believe you would need to read in the entire file and use a library to parse it. I recommend using CSV files if you can.
As for reading the GCS metadata: I don't think you can do this with TextIO, but you could call the GCS API directly to access the metadata. If you only do this for a few files at the start of your program, it will work and not be too expensive. If you need to read many files this way, you'll be adding an extra RPC for each file.
Be careful not to read the same file multiple times. I suggest reading each file's metadata once and then writing the metadata out to a side input. Then, in one of your ParDos, you can access the side input for each file.
Useful links:
ETL & Parsing CSV files in Cloud Dataflow
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/TextIO.Read
https://cloud.google.com/dataflow/model/par-do#side-inputs
import com.google.cloud.ReadChannel;
import com.google.cloud.storage.Storage;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ExcelToCsv {
    private static final int BUFFER_SIZE = 64 * 1024;

    // Reads an Excel workbook from GCS and writes each sheet to a local CSV file.
    private static void printBlob(Storage storage, String bucketName, String blobPath)
            throws IOException, InvalidFormatException {
        try (ReadChannel reader = storage.reader(bucketName, blobPath)) {
            InputStream inputStream = Channels.newInputStream(reader);
            Workbook wb = WorkbookFactory.create(inputStream);
            StringBuilder data = new StringBuilder();
            for (int i = 0; i < wb.getNumberOfSheets(); i++) {
                Sheet sheet = wb.getSheetAt(i);
                File outputFile = new File("D:\\excel\\" + sheet.getSheetName() + ".csv");
                data.setLength(0);
                for (Row row : sheet) {
                    data.append('\n');
                    for (Cell cell : row) {
                        // Append the cell value according to its type
                        // (the CELL_TYPE_* int constants belong to the POI 3.x API).
                        switch (cell.getCellType()) {
                            case Cell.CELL_TYPE_NUMERIC:
                                data.append(cell.getNumericCellValue()).append(',');
                                break;
                            case Cell.CELL_TYPE_STRING:
                                data.append(cell.getStringCellValue()).append(',');
                                break;
                            case Cell.CELL_TYPE_BOOLEAN:
                                data.append(cell.getBooleanCellValue()).append(',');
                                break;
                            case Cell.CELL_TYPE_BLANK:
                                data.append(',');
                                break;
                            default:
                                data.append(cell).append(',');
                        }
                    }
                }
                // try-with-resources closes the output file even if the write fails.
                try (FileOutputStream fos = new FileOutputStream(outputFile)) {
                    fos.write(data.toString().getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }
}
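One thing to watch in a conversion like the one above: cell values are written verbatim, so a string cell that itself contains a comma, a double quote, or a line break would corrupt the CSV row. Below is a minimal sketch of RFC 4180-style quoting that could be applied to each value before appending it. The class and method names (`CsvField`, `escape`, `joinRow`) are hypothetical helpers, not part of POI or the GCS client.

```java
import java.util.List;

final class CsvField {
    // Quotes a value only when it contains a comma, a double quote, or a line break.
    static String escape(String value) {
        boolean needsQuotes = value.indexOf(',') >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return value;
        }
        // Inside a quoted field, each double quote is doubled.
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Joins already-stringified cells into one CSV line (without a trailing comma or newline).
    static String joinRow(List<String> cells) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(escape(cells.get(i)));
        }
        return sb.toString();
    }
}
```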
You should be able to read the metadata of a GCS file by using the GCS API. However, you would need the filenames. You can do this with a ParDo or other transform over a PCollection<String> that holds the filenames.
We don't have any default readers for Excel files. You can parse a CSV file by using a text input (see: ETL & Parsing CSV files in Cloud Dataflow).
I'm not very knowledgeable about Excel or how its file format is stored. If you want to process one file at a time, you can use a PCollection<String> of file names and then use a library to parse each Excel file in turn.
If an Excel file can be split into easily parallelizable parts, I'd suggest you take a look at this doc (https://beam.apache.org/documentation/io/authoring-overview/). (If you are still using the Dataflow SDK, it should be similar.) It may be worth splitting the file into smaller chunks before reading to get more parallelization out of your pipeline. In that case you could use IOChannelFactory to read from the file.

How to write streaming data to S3?

I want to write an RDD[String] to Amazon S3 in Spark Streaming using Scala. The strings are basically JSON. I'm not sure what the most efficient way to do this is.
I found this post, in which the library spark-s3 is used. The idea is to create SparkContext and then SQLContext. After this the author of the post does something like this:
myDstream.foreachRDD { rdd =>
  rdd.toDF().write
    .format("com.knoldus.spark.s3")
    .option("accessKey", "s3_access_key")
    .option("secretKey", "s3_secret_key")
    .option("bucket", "bucket_name")
    .option("fileType", "json")
    .save("sample.json")
}
What other options are there besides spark-s3? Is it possible to append the streaming data to an existing file on S3?
Objects on S3 cannot be appended to. In S3, an "append" means replacing the existing object with a new object that contains the additional data.
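Because an object cannot grow in place, a common workaround is to write every streaming batch to its own object key instead of trying to extend one file. The following Java sketch only shows how such keys might be derived from the batch time; the `BatchKeys` class and the `dt=.../batch-...` naming scheme are assumptions chosen for illustration, not a Spark or S3 convention you are required to follow.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class BatchKeys {
    // Formats the batch time as a UTC calendar day, e.g. 1970-01-01.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    // Builds a distinct object key per batch so that no existing object has to be rewritten.
    static String keyFor(String prefix, long batchTimeMillis) {
        String day = DAY.format(Instant.ofEpochMilli(batchTimeMillis));
        return prefix + "/dt=" + day + "/batch-" + batchTimeMillis + ".json";
    }
}
```

In a streaming job, the batch time is available through the two-argument form of foreachRDD, which passes the RDD together with its batch Time, so a key like this could be computed once per batch.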
You should take a look at the mode method of DataFrameWriter in the Spark documentation:
public DataFrameWriter mode(SaveMode saveMode)
Specifies the behavior when data or table already exists. Options
include: - SaveMode.Overwrite: overwrite the existing data. -
SaveMode.Append: append the data. - SaveMode.Ignore: ignore the
operation (i.e. no-op). - SaveMode.ErrorIfExists: default option,
throw an exception at runtime.
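You can try something like this with the Append save mode. Note that for file-based sources the path is treated as a directory, and Append adds new part files under it rather than modifying an existing object.
rdd.toDF.write
  .format("json")
  .mode(SaveMode.Append)
  .save("s3://iiiii/ttttt.json")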
Spark Append:
Append mode means that when saving a DataFrame to a data source, if
data/table already exists, contents of the DataFrame are expected to
be appended to existing data.
Basically, you can choose the output format by passing its name to the format method:
public DataFrameWriter format(java.lang.String source)
Specifies the underlying output data source. Built-in options include "parquet", "json", etc.
e.g. as Parquet:
df.write().format("parquet").save("yourfile.parquet")
or as json:
df.write().format("json").save("yourfile.json")
Edit: added details about S3 credentials.
There are two ways to set the credentials, as can be seen in SparkHadoopUtil.scala:
either through environment variables, e.g. System.getenv("AWS_ACCESS_KEY_ID"), or through a spark.hadoop.foo property:
SparkHadoopUtil.scala:
if (key.startsWith("spark.hadoop.")) {
hadoopConf.set(key.substring("spark.hadoop.".length), value)
}
So you need to get the Hadoop configuration via javaSparkContext.hadoopConfiguration() or scalaSparkContext.hadoopConfiguration and set:
hadoopConfiguration.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConfiguration.set("fs.s3.awsSecretAccessKey", mySecretKey)

Replace strings and save file in Scala

I have the following code, which reads a file and replaces any occurrence of a "*.tar.gz" file name with the new file name provided. Everything works fine and I can see the replaced lines in the console, but I am not able to write a new file containing all the changes.
def modifyFile(newFileName: String, filename: String) = {
  Source.fromFile(filename).getLines.foreach { line =>
    println(line.replaceAll(".+\\.tar\\.gz", newFileName.concat(".tar.gz")))
  }
}
You forgot to write your modified lines to the new file:
def modifyFile(newFileName: String, sourceFilePath: String, targetFilePath: String): Unit = {
  scala.tools.nsc.io.File(targetFilePath).printlnAll(
    Source.fromFile(sourceFilePath).getLines().map {
      _.replaceAll(".+\\.tar\\.gz", newFileName.concat(".tar.gz"))
    }.toSeq: _*)
}
Please note that this approach is not the most efficient in terms of performance, as the content of the source file is read fully into memory, processed, and then written back. A more efficient approach would be more verbose and would use Java's FileReader/FileWriter.
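For reference, a streaming version along those lines might look like the sketch below. It is written in plain Java (which can be called from Scala unchanged) and processes one line at a time, so memory use does not grow with the file size. The class name `TarNameRewriter` is made up for this example, and it uses the buffered reader/writer helpers from java.nio.file.Files rather than FileReader/FileWriter directly, so the charset is explicit (UTF-8 is an assumption here).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;

final class TarNameRewriter {
    // Copies source to target line by line, replacing "<anything>.tar.gz" with "<newFileName>.tar.gz".
    static void rewrite(Path source, Path target, String newFileName) throws IOException {
        // quoteReplacement keeps '$' or '\' in the new name from being read as regex group references.
        String replacement = Matcher.quoteReplacement(newFileName + ".tar.gz");
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line.replaceAll(".+\\.tar\\.gz", replacement));
                out.newLine();
            }
        }
    }
}
```

The regular expression is the same one used in the question, so it still replaces everything from the start of the line up to the last ".tar.gz" on that line.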
Update
As rightly pointed out in the comments, you have to choose a suitable way to write the result to a file depending on the tools and dependencies you have.

Combine two PDF/A documents using iTextSharp

Hoping that someone can see the flaw in my code to merge two PDF/A documents using iTextSharp. Currently it complains about missing metadata, which PDF/A requires.
Document document = new Document();
MemoryStream ms = new MemoryStream();
using (PdfACopy pdfaCopy = new PdfACopy(document, ms, PdfAConformanceLevel.PDF_A_1A))
{
    document.Open();
    using (PdfReader reader = new PdfReader("Doc1.pdf"))
    {
        pdfaCopy.AddDocument(reader);
    }
    using (PdfReader reader = new PdfReader("doc2.pdf"))
    {
        pdfaCopy.AddDocument(reader);
    }
}
The exact error received is
Unhandled Exception: iTextSharp.text.pdf.PdfAConformanceException: The document catalog dictionary of a PDF/A conforming file shall contain
the Metadata key
I was hoping that the 'document catalog dictionary' would be copied as well, but I guess the 'new Document()' creates an empty non-conforming document or something.
Thanks! Hope you can help
Wouter
You need to add this line:
pdfaCopy.CreateXmpMetadata();
This creates some default XMP metadata. If you would rather supply your own XMP containing information about the documents you're about to merge, you can instead use:
pdfaCopy.XmpMetadata = myMetaData;
where myMetaData is a byte array containing a valid XMP stream.
Keep in mind that iText can't automatically create the correct metadata for the merged documents. Providing metadata is something that needs human attention.

Sending email with attachment using scala and Liftweb

This is the first time I am integrating an email service with Liftweb.
I want to send email with attachments (e.g. documents, images, PDFs).
My code looks like this:
case class CSVFile(bytes: Array[Byte], filename: String = "file.csv",
                   mime: String = "text/csv; charset=utf8; header=present")
val attach = CSVFile(fileupload.mkString.getBytes("utf8"))
val body = <p>Please research the enclosed.</p>
val msg = XHTMLPlusImages(body,
  PlusImageHolder(attach.filename, attach.mime, attach.bytes))
Mailer.sendMail(
  From("vyz#gmail.com"),
  Subject(subject(0)),
  To(to(0)),
  msg)
This code is taken from the Lift Cookbook, but it does not do what I need.
It works, but only the attachment's file name (file.csv) comes through; there is no data in it (I uploaded the file gsy.docx).
Best Regards
GSY
You don't specify what type fileupload is, but assuming it is of type net.liftweb.http.FileParamHolder, the issue is that you can't just call mkString and expect to get any data: the string form of the object does not contain the data, and the object instead offers a fileStream method for retrieving it (either from disk or from memory).
The easiest way to accomplish what you want would be to use a ByteArrayOutputStream and copy the data into it. I haven't tested it, but the code below should solve your issue. For brevity, it uses Apache Commons IO to copy the streams, but you could just as easily do it natively.
val data = {
val os = new ByteArrayOutputStream()
IOUtils.copy(fileupload.fileStream, os)
os.toByteArray
}
val attach = CSVFile(data)
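If you would rather not depend on Commons IO, the "native" copy mentioned above is only a short loop. Here is a minimal sketch in plain Java, which can be called from Scala; the class name `StreamBytes` and the 8 KB buffer size are arbitrary choices for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamBytes {
    // Reads the stream until end-of-stream and returns everything that was read.
    // The caller remains responsible for closing the input stream.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Under that assumption, the byte array for the attachment could be obtained by passing fileupload.fileStream to `readAll` in place of the IOUtils.copy call.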
BTW, you say you are uploading a Word (DOCX) file and expecting it to automatically be CSV when the extension is changed? You will just get a DOCX file with a csv extension unless you actually do some conversion.