Read .doc file using Scala

I want to read a .doc file in Scala. I tried using the Apache POI library for this, but the constructor HWPFDocument(java.io.InputStream istream) accepts a Java IO stream.
If anyone can shed some light on this, that would be great!

So, here is a teaser to get you started:
import java.io.FileInputStream
import org.apache.poi.hwpf.HWPFDocument
import org.apache.poi.hwpf.extractor.WordExtractor

val fis = new FileInputStream("/path/to/file/doc.doc")
val doc = new HWPFDocument(fis)
val we = new WordExtractor(doc)
val paras = we.getParagraphText() // Array[String], one entry per paragraph

You can use InputStream in Scala, just as any other Java class/interface.
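If you also want the stream closed reliably, here is a sketch using scala.util.Using (this assumes Scala 2.13+; on older versions a try/finally does the same job):

import java.io.FileInputStream
import scala.util.Using
import org.apache.poi.hwpf.HWPFDocument
import org.apache.poi.hwpf.extractor.WordExtractor

// Using.resource closes the stream even if extraction throws.
val paras: Seq[String] =
  Using.resource(new FileInputStream("/path/to/file/doc.doc")) { fis =>
    new WordExtractor(new HWPFDocument(fis)).getParagraphText().toSeq
  }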

Related

How to convert a CSV file in an S3 bucket to an RDD

I'm pretty new to this topic, so any help will be much appreciated.
I'm trying to read a CSV file stored in an S3 bucket and convert its data to an RDD so I can work with it directly, without needing to create a file locally.
So far I've been able to load the file using AmazonS3ClientBuilder, but the only thing I've got is the file content in an S3ObjectInputStream, and I'm not able to work with its content.
val bucketName = "bucket-name"
val credentials = new BasicAWSCredentials(
  "accessKey",
  "secretKey"
)
val s3client = AmazonS3ClientBuilder
  .standard()
  .withCredentials(new AWSStaticCredentialsProvider(credentials))
  .withRegion(Regions.US_EAST_2)
  .build()
val s3object = s3client.getObject(bucketName, "file-name.csv")
val inputStream = s3object.getObjectContent()
....
I have also tried using a BufferedSource to work with it, but once that's done, I don't know how to convert it to a DataFrame or an RDD.
val myData = Source.fromInputStream(inputStream)
....
You can do it with the S3A file system provided in the Hadoop-AWS module:
Add this dependency: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
Either define <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3a.S3AFileSystem</value></property> in core-site.xml, or add .config("fs.s3.impl", classOf[S3AFileSystem].getName) to the SparkSession builder.
Access S3 using spark.read.csv("s3://bucket/key"). If you want the RDD that was asked for, use spark.read.csv("s3://bucket/key").rdd, as sketched below.
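Putting those steps together, a minimal sketch (the app name, credentials, bucket, and key are placeholders; credentials can also come from the default AWS provider chain instead):

import org.apache.hadoop.fs.s3a.S3AFileSystem
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-csv-to-rdd")                  // hypothetical app name
  .master("local[*]")                        // assumption: running locally
  .config("fs.s3.impl", classOf[S3AFileSystem].getName)
  .config("fs.s3a.access.key", "accessKey")  // placeholder
  .config("fs.s3a.secret.key", "secretKey")  // placeholder
  .getOrCreate()

// Read as a DataFrame first; drop down to an RDD[Row] only if you really need one.
val df = spark.read.option("header", "true").csv("s3://bucket-name/file-name.csv")
val rdd = df.rdd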
In the end I was able to get the results I was looking for by taking a look at https://gist.github.com/snowindy/d438cb5256f9331f5eec

Read Parquet into Scala without Spark

I have a Parquet file which I would like to read into my Scala program without using Spark or other big data technologies.
I found the projects
https://github.com/apache/parquet-mr
https://github.com/51zero/eel-sdk
but no examples detailed enough to get them working.
Parquet-MR
https://stackoverflow.com/a/35594368/4533188 mentions this, but the examples given are not complete. For example, it is not clear what path is supposed to be. It is supposed to implement InputFile; how is this supposed to be done? Also, from the post it seems to me that parquet-mr does not directly turn the Parquet data into standard Scala classes.
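(For reference, a hedged sketch of how the parquet-mr route can be wired up, assuming a recent parquet-avro binding is on the classpath; HadoopInputFile is parquet-mr's bundled InputFile implementation, and the path below is a placeholder. Records come back as Avro GenericRecords, not Scala case classes:)

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// HadoopInputFile implements parquet-mr's InputFile interface.
val inputFile = HadoopInputFile.fromPath(
  new Path("file:///tmp/example.parquet"),  // placeholder path
  new Configuration())

val reader = AvroParquetReader.builder[GenericRecord](inputFile).build()
try {
  // read() returns null once the file is exhausted.
  Iterator.continually(reader.read()).takeWhile(_ != null).foreach(println)
} finally reader.close()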
Eel
Here I tried
import io.eels.component.parquet.ParquetSource
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val parquetFilePath = new Path("file://home/raeg/Datatroniq/Projekte/14. Witzenmann/Teilprojekt Strom und Spannung/python_witzenmann/src/data/1.parquet")
implicit val hadoopConfiguration = new Configuration()
implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // This is required
ParquetSource(parquetFilePath)
  .toDataStream()
  .collect
  .foreach(row => println(row))
but I get the error
java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(ParquetReaderTesting.sc:2582)
at org.apache.hadoop.fs.FileSystem.createFileSystem(ParquetReaderTesting.sc:2589)
at org.apache.hadoop.fs.FileSystem.access$200(ParquetReaderTesting.sc:87)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(ParquetReaderTesting.sc:2628)
at org.apache.hadoop.fs.FileSystem$Cache.get(ParquetReaderTesting.sc:2610)
at org.apache.hadoop.fs.FileSystem.get(ParquetReaderTesting.sc:366)
at org.apache.hadoop.fs.FileSystem.get(ParquetReaderTesting.sc:165)
at dataReading.A$A6$A$A6.hadoopFileSystem$lzycompute(ParquetReaderTesting.sc:7)
at dataReading.A$A6$A$A6.hadoopFileSystem(ParquetReaderTesting.sc:7)
at dataReading.A$A6$A$A6.get$$instance$$hadoopFileSystem(ParquetReaderTesting.sc:7)
at #worksheet#.#worksheet#(ParquetReaderTesting.sc:30)
in my worksheet.
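(One hedged guess: this particular error usually means Hadoop's FileSystem service registrations were not picked up, which is common in worksheets and fat jars. Registering the local filesystem explicitly on the Configuration may get past it:)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, LocalFileSystem}

implicit val hadoopConfiguration: Configuration = new Configuration()
// Bypass the ServiceLoader lookup that fails with "No FileSystem for scheme: file".
hadoopConfiguration.set("fs.file.impl", classOf[LocalFileSystem].getName)
implicit val hadoopFileSystem: FileSystem = FileSystem.get(hadoopConfiguration)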

Scala: creating temp directory and file

I am trying to create a temp directory and file underneath it. Here is my code snippet:
var tempPath = System.getProperty("java.io.tmpdir")
val myDir = new File(tempPath.concat(scala.util.Random.nextString(10).toString))
myDir.mkdir()
val tempFile = new File(myDir.toString+"/temp.log")
This code is working fine. However, I am wondering if there is a better way of doing this; please share your comments.
Java has existing methods that can do this for you, such as Files.createTempFile and Files.createTempDirectory (plus their overloads).
You can find some examples in this blog post.
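For instance, a minimal sketch using java.nio.file (the name prefixes are placeholders):

import java.nio.file.{Files, Path}

// Creates a uniquely named directory under java.io.tmpdir, then a file inside it.
val tempDir: Path = Files.createTempDirectory("myapp-")
val tempFile: Path = Files.createTempFile(tempDir, "temp-", ".log")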

Sending email with attachment using Scala and Liftweb

This is the first time I am integrating an email service with Liftweb.
I want to send email with attachments (like documents, images, PDFs).
My code looks like this:
case class CSVFile(
  bytes: Array[Byte],
  filename: String = "file.csv",
  mime: String = "text/csv; charset=utf8; header=present"
)

val attach = CSVFile(fileupload.mkString.getBytes("utf8"))
val body = <p>Please research the enclosed.</p>
val msg = XHTMLPlusImages(body,
  PlusImageHolder(attach.filename, attach.mime, attach.bytes))

Mailer.sendMail(
  From("vyz#gmail.com"),
  Subject(subject(0)),
  To(to(0)),
  msg
)
This code is taken from the Lift Cookbook; it is not working as I need.
It works, but only the attached file name comes through (file.csv) with no data in it (I uploaded the file gsy.docx).
Best Regards
GSY
You don't specify what type fileupload is, but assuming it is of type net.liftweb.http.FileParamHolder, the issue is that you can't just call mkString and expect it to have any data, since there is no data in the object, just a fileStream method for retrieving it (either from disk or memory).
The easiest way to accomplish what you want would be to use a ByteArrayOutputStream and copy the data to it. I haven't tested it, but the code below should solve your issue. For brevity it uses Apache Commons IO to copy the streams, but you could just as easily do it natively (see the sketch after it).
import java.io.ByteArrayOutputStream
import org.apache.commons.io.IOUtils

val data = {
  val os = new ByteArrayOutputStream()
  IOUtils.copy(fileupload.fileStream, os)
  os.toByteArray
}
val attach = CSVFile(data)
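And the dependency-free variant mentioned above, in case you would rather not pull in Commons IO (also untested):

import java.io.ByteArrayOutputStream

val data = {
  val os = new ByteArrayOutputStream()
  val buf = new Array[Byte](8192)
  val in = fileupload.fileStream
  try {
    var n = in.read(buf)
    while (n != -1) { os.write(buf, 0, n); n = in.read(buf) }
  } finally in.close()
  os.toByteArray
}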
BTW, you say you are uploading a Word (DOCX) file and expecting it to automatically become CSV when the extension is changed? You will just get a DOCX file with a .csv extension unless you actually do some conversion.

Read property file under classpath using Scala

I am trying to read a property file from the classpath using Scala, but it doesn't work; it behaves differently from Java. Of the following two code snippets, the first is Java (working) and the second is Scala (not working). I don't understand what the difference is.
// working
BufferedReader reader = new BufferedReader(new InputStreamReader(
Test.class.getResourceAsStream("conf/fp.properties")));
// not working
val reader = new BufferedReader(new InputStreamReader(
getClass.getResourceAsStream("conf/fp.properties")));
Exception in thread "main" java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader.<init>(InputStreamReader.java:72)
at com.ebay.searchscience.searchmetrics.fp.conf.FPConf$.main(FPConf.scala:31)
at com.ebay.searchscience.searchmetrics.fp.conf.FPConf.main(FPConf.scala)
This code finally worked for me:
import java.util.Properties
import scala.io.Source

// ... somewhere inside the module
var properties: Properties = null
val url = getClass.getResource("/my.properties")
if (url != null) {
  val source = Source.fromURL(url)
  properties = new Properties()
  properties.load(source.bufferedReader())
}
And now you have a plain old java.util.Properties to handle, which is what my legacy code actually needed to receive.
I am guessing that your BufferedReader is a java.io.BufferedReader.
In that case you could simply do the following:
import scala.io.Source.fromURL
val reader = fromURL(getClass.getResource("conf/fp.properties")).bufferedReader()
However, this leaves open the question of what you plan to do with the reader afterwards. scala.io.Source already has some useful methods that might make lots of your code superfluous; see the ScalaDoc.
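For example, if all you actually need is the file's lines, something like this (untested) skips the reader entirely:

import scala.io.Source.fromURL

// getLines() replaces the manual BufferedReader loop.
val lines = fromURL(getClass.getResource("conf/fp.properties")).getLines().toList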
My preferred solution is Typesafe Config (com.typesafe.config). I put an application.conf file in the src/main/resources folder, with content like:
services {
  mongo-db {
    retrieve = """http://xxxxxxxxxxxx""",
    base = """http://xxxxxx"""
  }
}
and then, to use it in a class, first load the config via ConfigFactory and then just use it:
val conf = com.typesafe.config.ConfigFactory.load()
conf.getString("services.mongo-db.base")
Hope it helps!
P.S. Note that ConfigFactory.load() picks up application.conf (and any reference.conf on the classpath) by default, not every file with a .conf extension.
For reading a properties file, I'd recommend java.util.ResourceBundle.getBundle("conf/fp"); it makes life a little easier.
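For instance, a minimal sketch (the key name is hypothetical):

import java.util.ResourceBundle

// The bundle name is the resource path without the .properties extension.
val bundle = ResourceBundle.getBundle("conf/fp")
val value = bundle.getString("some.key") // hypothetical key in fp.properties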
The NullPointerException you are seeing is thrown because getResourceAsStream returned null and the reader constructors do not guard against that. It is most often caused by a mistyped file name.
Sometimes you also get this error if you're trying to load the resource with the wrong classloader.
Check the resource URL carefully against your classpath.
Try Source.fromInputStream(getClass.getResourceAsStream(...))
Try Source.fromInputStream(getClass.getClassLoader.getResourceAsStream(...))
Maybe there are other classloaders you can try?
The same story goes for Source.fromURL(...)
If you're trying to load configuration files and you control their format, you should have a look at Typesafe's Config utility.
The NullPointerException you are getting is from getResourceAsStream returning null. The following JUnit snippet in Scala shows the difference between Class and ClassLoader resource lookup; see What is the difference between Class.getResource() and ClassLoader.getResource()?. Here I assume fileName is the name of a file residing at the root of the classpath, but not next to the class running the test.
assertTrue(getClass.getClassLoader().getResourceAsStream(fileName) != null)
assertTrue(getClass.getClassLoader().getResourceAsStream("/" + fileName) == null)
assertTrue(getClass.getResourceAsStream(fileName) == null)
assertTrue(getClass.getResourceAsStream("/" + fileName) != null)