Reading files from S3 using HadoopInputFile yields FileNotFoundException - scala

I am trying to read parquet files from a directory on S3:
val bucketKey = "s3a://foo/direcoty_to_retrieve/"
val conf: Configuration = new Configuration()
conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, true)
val inputFile = HadoopInputFile.fromPath(new Path(bucketKey), conf)
val reader: ParquetReader[GenericRecord] = AvroParquetReader.builder[GenericRecord](inputFile).withConf(conf).build()
However, I am getting:
Exception in thread "main" java.io.FileNotFoundException: No such file or directory: s3a://foo/direcoty_to_retrieve
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3356)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3053)
at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
EDIT:
When I use the AvroParquetReader.builder with a file Path instead, e.g.:
val reader: ParquetReader[GenericRecord] = AvroParquetReader.builder[GenericRecord](new Path(bucketKey)).withConf(conf).build()
it works. However, this option is deprecated and I would rather not use it.
It works on a local directory. My environment variables for AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY are set correctly. What can be the problem?

I faced the same issue when reading a parquet file from Amazon S3 via the Alpakka Avro Parquet library (which depends on parquet-hadoop-1.10.1.jar). After some debugging I found the cause in the ParquetReader.build method:
public ParquetReader<T> build() throws IOException {
  ParquetReadOptions options = optionsBuilder.build();
  if (path != null) {
    FileSystem fs = path.getFileSystem(conf);
    FileStatus stat = fs.getFileStatus(path);
    if (stat.isFile()) {
      return new ParquetReader<>(
          Collections.singletonList((InputFile) HadoopInputFile.fromStatus(stat, conf)),
          options,
          getReadSupport());
    } else {
      List<InputFile> files = new ArrayList<>();
      for (FileStatus fileStatus : fs.listStatus(path, HiddenFileFilter.INSTANCE)) {
        files.add(HadoopInputFile.fromStatus(fileStatus, conf));
      }
      return new ParquetReader<T>(files, options, getReadSupport());
    }
  } else {
    return new ParquetReader<>(Collections.singletonList(file), options, getReadSupport());
  }
}
When a HadoopInputFile is used as input, the builder's path property is set to null and the reader is initialized in the else block. Since the parquet data is represented as a directory in the filesystem, this leads to the java.io.FileNotFoundException.
The solution for now is to use the deprecated method:
AvroParquetReader.builder[GenericRecord](new Path(bucketKey)).withConf(conf).build()
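If you want to avoid the deprecated builder, another option is to do yourself what the path-based builder does internally: list the files in the directory and open each one through HadoopInputFile. A rough, untested sketch (reusing bucketKey and conf from the question):
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// List the files inside the S3 "directory" (skipping hidden entries such as _SUCCESS),
// then build one reader per file, mirroring what ParquetReader.build does for a Path.
val dirPath = new Path(bucketKey)
val fs: FileSystem = dirPath.getFileSystem(conf)
val readers: Seq[ParquetReader[GenericRecord]] =
  fs.listStatus(dirPath).toSeq
    .filter(s => s.isFile && !s.getPath.getName.startsWith("_") && !s.getPath.getName.startsWith("."))
    .map { status =>
      AvroParquetReader.builder[GenericRecord](HadoopInputFile.fromStatus(status, conf))
        .withConf(conf)
        .build()
    }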

Related

Intellij not able to identify root folder when SBT build is used

I'm trying to read my properties file pgmproperties.properties in the program below, but I am getting a "file not found" error. Please tell me where I should place the file. I'm using the SBT build tool.
My code:
def main(args: Array[String]) {
  // This is used to read Properties files and get the values
  val is: InputStream = ClassLoader.getSystemResourceAsStream("pgmproperties.properties")
  val prop: Properties = new Properties()
  if (is != null) {
    prop.load(is)
  } else {
    throw new FileNotFoundException("Properties file cannot be loaded")
  }
  val sndrmailid = prop.getProperty("SndrMailId")    // Sender mail id
  val psswd = prop.getProperty("SndrMailPsswd")      // Sender Gmail mailbox password
  val tomailid = prop.getProperty("RcvrMailId")      // Receiver email id
  println(sndrmailid)
}
The error I am getting is the FileNotFoundException ("Properties file cannot be loaded") thrown by the code above.
The right place to put resources is src/main/resources (as stated in the official SBT documentation).
Try moving your file there and it should work :)
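Once the file is under src/main/resources it is on the classpath at runtime, so the lookup from the question will find it. A minimal sketch of that lookup (same file name as in the question):
import java.io.InputStream
import java.util.Properties

// pgmproperties.properties is expected at src/main/resources/pgmproperties.properties;
// SBT copies it onto the classpath during the build.
val is: InputStream = getClass.getResourceAsStream("/pgmproperties.properties")
val prop = new Properties()
if (is != null) prop.load(is)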

How in Scala/Spark copy file from Hadoop (hdfs) to remote SFTP server?

In the Hadoop file system I have an Excel file.
My task is to copy that file from Hadoop to a remote SFTP server in my Scala/Spark application.
I have formed the opinion that this will not work directly. If my fears are correct, I need to take the following steps:
1) Copy the Excel file from Hadoop to a local directory. For example, I can do it with the Scala process DSL:
import scala.sys.process._
s"hdfs dfs -copyToLocal /hadoop_path/file_name.xlsx /local_path/" !
2) Send the file from the local directory to the remote SFTP server. What library can you recommend for this task?
Is my reasoning correct? What is the best way to solve my problem?
As mentioned in the comment, spark-sftp is a good choice.
If not, you can try the sample code below, based on the Apache Commons Net FTP client (org.apache.commons.net.ftp), which lists all remote files; similarly, you can delete files as well. It is untested, so please try it.
Option 1:
import java.io.IOException
import org.apache.commons.net.ftp.FTPClient
// remove if not needed
import scala.collection.JavaConversions._

object MyFTPClass {
  def main(args: Array[String]): Unit = {
    // Create an instance of FTPClient
    val ftp: FTPClient = new FTPClient()
    try {
      // Establish a connection with the FTP URL
      ftp.connect("ftp.test.com")
      // Enter user details: user name and password
      val isSuccess: Boolean = ftp.login("user", "password")
      if (isSuccess) {
        // Fetch the list of names of the files; in case of no files an empty array is returned
        val filesFTP: Array[String] = ftp.listNames()
        var count: Int = 1
        // Iterate on the returned list to obtain the name of each file
        for (file <- filesFTP) {
          println("File " + count + " :" + file)
          count += 1
        }
      }
      // Terminate the login session
      ftp.logout()
    } catch {
      case e: IOException => e.printStackTrace()
    } finally {
      try ftp.disconnect()
      catch {
        case e: IOException => e.printStackTrace()
      }
    }
  }
}
Option 2:
There is a library called JSch; you can see this question and an example snippet from SO.
Well, I finally found a way to solve the task. I decided to use the JSch library.
build.sbt:
libraryDependencies += "com.jcraft" % "jsch" % "0.1.55"
.scala:
import scala.sys.process._
import com.jcraft.jsch._
// Copy Excel file from Hadoop file system to local directory with Scala DSL.
s"hdfs dfs -copyToLocal /hadoop_path/excel.xlsx /local_path/" !
val jsch = new JSch()
val session = jsch.getSession("XXX", "XXX.XXX.XXX.XXX") // Set your username and host
session.setPassword("XXX") // Set your password
val config = new java.util.Properties()
config.put("StrictHostKeyChecking", "no")
session.setConfig(config)
session.connect()
val channelSftp = session.openChannel("sftp").asInstanceOf[ChannelSftp]
channelSftp.connect()
channelSftp.put("excel.xlsx", "sftp_path/") // set your path in remote sftp server
channelSftp.disconnect()
session.disconnect()
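One caveat with the snippet above: if put fails, the channel and session are never disconnected. A small, untested variation that releases them in a finally block:
try {
  channelSftp.put("excel.xlsx", "sftp_path/") // set your path on the remote SFTP server
} finally {
  channelSftp.disconnect()
  session.disconnect()
}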

How to set Spark configuration properties using Apache Livy?

I don't know how to pass SparkSession parameters programmatically when submitting a Spark job to Apache Livy.
This is the Test Spark job:
class Test extends Job[Int] {
  override def call(jc: JobContext): Int = {
    val spark = jc.sparkSession()
    // ...
  }
}
This is how this Spark job is submitted to Livy:
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .build()

try {
  client.uploadJar(new File(testJarPath)).get()
  client.submit(new Test())
} finally {
  client.stop(true)
}
How can I pass the following configuration parameters to SparkSession?
.config("es.nodes","1localhost")
.config("es.port",9200)
.config("es.nodes.wan.only","true")
.config("es.index.auto.create","true")
You can do that easily through the LivyClientBuilder like this:
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .setConf("es.nodes", "1localhost")
  .setConf("key", "value")
  .build()
Configuration parameters can be set on the LivyClientBuilder using
public LivyClientBuilder setConf(String key, String value)
so that your code starts with:
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .setConf("es.nodes", "1localhost")
  .setConf("es.port", "9200")
  .setConf("es.nodes.wan.only", "true")
  .setConf("es.index.auto.create", "true")
  .build()
LivyClientBuilder.setConf will not work, I think, because Livy modifies all configs that do not start with spark., and Spark cannot read the modified config. See here:
private static File writeConfToFile(RSCConf conf) throws IOException {
  Properties confView = new Properties();
  for (Map.Entry<String, String> e : conf) {
    String key = e.getKey();
    if (!key.startsWith(RSCConf.SPARK_CONF_PREFIX)) {
      key = RSCConf.LIVY_SPARK_PREFIX + key;
    }
    confView.setProperty(key, e.getValue());
  }
  ...
}
So the answer is quite simple: add the spark. prefix to all es configs, like this:
.config("spark.es.nodes","1localhost")
.config("spark.es.port",9200)
.config("spark.es.nodes.wan.only","true")
.config("spark.es.index.auto.create","true")
I don't know whether it is elasticsearch-spark or Spark itself that does the compatibility job; it just works.
PS: I've tried this with the REST API and it works, but not with the Programmatic API.

Hdfs Snappy corrupted data. Could not decompress data. Input is invalid. When does this occur and can it be prevented?

I am writing data into hdfs with the following code:
def createOrAppendHadoopSnappy(path: Path, hdfs: FileSystem): CompressionOutputStream = {
  val compressionCodecFactory = new CompressionCodecFactory(hdfs.getConf)
  val snappyCodec = compressionCodecFactory.getCodecByClassName(classOf[org.apache.hadoop.io.compress.SnappyCodec].getName)
  snappyCodec.createOutputStream(createOrAppend(path, hdfs))
}

def createOrAppend(path: Path, hdfs: FileSystem): FSDataOutputStream = {
  if (hdfs.exists(path)) {
    hdfs.append(path)
  } else {
    hdfs.create(path)
  }
}
and the code calling this function is roughly:
...
val outputStream = new BufferedOutputStream(HdfsUtils.createOrAppendHadoopSnappy(filePath, fileSystem))
...
for (managedOutputStream <- managed(outputStream)) {
  IOUtils.writeLines(lines.asJavaCollection, "\n", managedOutputStream, "UTF-8")
}
...
At one point I had a few files that were corrupted, giving the following message when I read them with 'hadoop fs -text':
java.lang.InternalError: Could not decompress data. Input is invalid.
Note that this code runs with Spark on YARN; due to some changes in the code the job was killed by YARN once, and I also manually killed the Spark job once in a later test.
I then wanted to reproduce the scenario in which the corrupt files are generated, but so far I have not succeeded. I tried different ways of interrupting the writing (System.exit(0), an exception, manual Ctrl-C). The file was not fully written, but it did not give the java.lang.InternalError.
Does anyone know in which cases the corrupted files can happen and how/if they can be prevented?

How can I access a resource when running an SBT runTask?

I've got an XML file that I need to read from the classpath in order to load some test data for my project with DBUnit when running a custom runTask in SBT.
The XML file is located in /src/main/resources and is copied properly to the /target/scala_2.8.1/classes during the build, but I get a MalformedURLException when trying to access it.
The weird thing is, I could access the file when this data-loading functionality was part of my Scala specs unit tests.
Any ideas?
In my case the problem was that I used getClass.getResourceAsStream() in an early initializer. I had to specify the class explicitly with Class.forName() to solve it: Class.forName(<class name>).getResourceAsStream("/data.xml")
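For illustration, a minimal sketch of that fix; com.example.DataLoader is a hypothetical class name standing in for <class name>, and data.xml is the resource from the snippet above:
// Resolve the class by name explicitly (hypothetical class com.example.DataLoader),
// then load the resource from the classpath root it sees.
val is: java.io.InputStream =
  Class.forName("com.example.DataLoader").getResourceAsStream("/data.xml")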
If the error says the URL is malformed, it's probably true.
Here's code I use to grab a file from a resource during tests:
def copyFileFromResource(source: String, dest: java.io.File) {
  val in = getClass.getResourceAsStream(source)
  val reader = new java.io.BufferedReader(new java.io.InputStreamReader(in))
  val out = new java.io.PrintWriter(new java.io.FileWriter(dest))
  var line: String = reader.readLine
  while (line != null) {
    out.println(line)
    line = reader.readLine
  }
  in.close()
  out.flush()
  out.close()
}