java.lang.IllegalArgumentException when HDFS file creating - scala

I have HDFS and some text, and I want to create a file containing that text. I tried to use the HDFS API and FSDataOutputStream, but I got an exception. Could you please help me resolve it?
The exception is:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/user1, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:513)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:499)
at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:594)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:448)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:776)
at com.example.FileBuilder$.buildFile(FileBuilder.scala:23)
The code is:
val fs = FileSystem.get(new Configuration())
val path = new Path(s"hdfs:////user/" + "fileName.sql")
val fsDataOutputStream = fs.create(path)
val outputStreamWriter = new OutputStreamWriter(fsDataOutputStream, "UTF-8")
val bufferedWriter = new BufferedWriter(outputStreamWriter)
bufferedWriter.write(data)
bufferedWriter.close()
outputStreamWriter.close()
fsDataOutputStream.close()

I think there is a problem with the file path. Can you test by replacing the portion below in your code?
val configuration = new Configuration()
val fs = FileSystem.get(new URI("hdfs://<url:port>"), configuration)
val filePath = new Path("/user/fileName.sql")
val fsDataOutputStream = fs.create(filePath)
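For reference, a minimal end-to-end sketch built on that idea (the NameNode address hdfs://namenode:8020 and the target path are placeholders, not values from the question):

import java.io.{BufferedWriter, OutputStreamWriter}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder NameNode address and target file; replace with your own values.
val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf)
val path = new Path("/user/user1/fileName.sql")

val out = fs.create(path)
val writer = new BufferedWriter(new OutputStreamWriter(out, "UTF-8"))
try {
  writer.write(data) // data is the text from the question
} finally {
  writer.close()     // closing the writer also closes the underlying HDFS stream
}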

Related

Writing file using FileSystem to S3 (Scala)

I'm using Scala and trying to write a file with string content to S3.
I tried to do that with FileSystem, but I'm getting an error:
"Wrong FS: s3a"
val content = "blabla"
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val s3Path: Path = new Path("s3a://bucket/ha/fileTest.txt")
val localPath= new Path("/tmp/fileTest.txt")
val os = fs.create(localPath)
os.write(content.getBytes)
fs.copyFromLocalFile(localPath,s3Path)
and I'm getting an error:
java.lang.IllegalArgumentException: Wrong FS: s3a://...txt, expected: file:///
What is wrong?
Thanks!!
You need to ask for the specific filesystem for that scheme; then you can create a text file directly on the remote system.
val s3Path: Path = new Path("s3a://bucket/ha/fileTest.txt")
val fs = s3Path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val os = fs.create(s3Path, true)
os.write("hi".getBytes)
os.close()
There's no need to write locally and upload; the s3a connector will buffer and upload as needed.
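As a follow-up, writing the question's content string with that filesystem looks like this (a sketch, with explicit resource handling added):

val out = fs.create(s3Path, true)
try {
  out.write(content.getBytes("UTF-8")) // write the string straight to the S3 object
} finally {
  out.close() // the s3a connector completes the upload when the stream is closed
}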

Out of memory issue while using Multipart upload API of AWS s3

I am trying to do an AWS multipart upload using the AWS SDK and Spark. The file size is around 14 GB, but I am getting an out-of-memory error. It fails at this line: val bytes: Array[Byte] = IOUtils.toByteArray(is)
I have tried bumping the driver memory and executor memory up to 100 GB and tried a few other Spark optimizations.
Below is the code I am trying:
val tm = TransferManagerBuilder.standard.withS3Client(s3Client).build
val fs = FileSystem.get(new Configuration())
val filePath = new Path(hdfsFilePath)
val is:InputStream = fs.open(filePath)
val om = new ObjectMetadata()
val bytes: Array[Byte] = IOUtils.toByteArray(is)
om.setContentLength(bytes.length)
val byteArrayInputStream: ByteArrayInputStream = new ByteArrayInputStream(bytes)
val request = new PutObjectRequest(bucketName, keyName, byteArrayInputStream, om).withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey)).withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)
val upload = tm.upload(request)
And this is the exception I am getting:
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at com.amazonaws.util.IOUtils.toByteArray(IOUtils.java:45)
PutObjectRequest accepts File:
public PutObjectRequest(String bucketName, String key, File file)
Something like the following should work (I haven't checked though):
val result = TransferManagerBuilder.standard.withS3Client(s3Client)
  .build
  .upload(
    new PutObjectRequest(
      bucketName,
      keyName,
      new File(hdfsFilePath) // java.io.File takes a String path, not a Hadoop Path; assumes the path is reachable as a local file
    )
      .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey))
      .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)
  )
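If a plain java.io.File handle is not actually available for the HDFS path, another option (a sketch only, not verified against the question's setup; bucketName, keyName, kmsKey, s3Client and hdfsFilePath are assumed to be defined as in the question) is to stream the HDFS input directly and give the SDK the known content length, so nothing is materialized as a byte array:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import com.amazonaws.services.s3.model.{CannedAccessControlList, ObjectMetadata, PutObjectRequest, SSEAwsKeyManagementParams}
import com.amazonaws.services.s3.transfer.TransferManagerBuilder

val fs = FileSystem.get(new Configuration())
val filePath = new Path(hdfsFilePath)
val is = fs.open(filePath)

// Take the length from HDFS metadata instead of reading the whole file into memory.
val om = new ObjectMetadata()
om.setContentLength(fs.getFileStatus(filePath).getLen)

val request = new PutObjectRequest(bucketName, keyName, is, om)
  .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey))
  .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)

val tm = TransferManagerBuilder.standard.withS3Client(s3Client).build
val upload = tm.upload(request)
upload.waitForCompletion() // blocks until the multipart upload finishes
is.close()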

Cannot write a string to hdfs file using scala

I wrote some code to create a file in HDFS and write bytes to it. This is the code:
def write(uri: String, filePath: String, data: String): Unit = {
  System.setProperty("HADOOP_USER_NAME", "hibou")
  val path = new Path(filePath + "/hello.txt")
  val conf = new Configuration()
  conf.set("fs.defaultFS", uri)
  val fs = FileSystem.get(conf)
  val os = fs.create(path)
  os.writeBytes(data)
  os.flush()
  fs.close()
}
The code succeeds without error, and I can see that the file was created. But when I examine the content of the file with hdfs dfs -cat /.../hello.txt, I don't see any content.
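A likely cause (an assumption from the snippet alone, not a confirmed diagnosis): the FSDataOutputStream returned by fs.create is never closed, only the FileSystem is, so the written bytes may never be flushed to the file. A minimal sketch of the same method with the stream closed before the filesystem:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def write(uri: String, filePath: String, data: String): Unit = {
  System.setProperty("HADOOP_USER_NAME", "hibou")
  val path = new Path(filePath + "/hello.txt")
  val conf = new Configuration()
  conf.set("fs.defaultFS", uri)
  val fs = FileSystem.get(conf)
  val os = fs.create(path)
  try {
    os.writeBytes(data) // writes the low-order byte of each character
  } finally {
    os.close()          // closing the stream persists the data in HDFS
  }
  fs.close()
}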

Read data from HDFS

I'm using FSDataInputStream to access data in HDFS.
The following is the snippet I'm using:
val fs = FileSystem.get(new java.net.URI(#HDFS_URI),new Configuration())
val stream = fs.open(new Path(#PATH))
val reader = new BufferedReader(new InputStreamReader(stream))
val offset: String = reader.readLine() // reads the string "5432" stored in the file
The expected output is "5432".
But the actual output is "^#5^#4^#3^#2".
I'm not able to trim the "^#" since they are not treated as regular characters. Please help with an appropriate solution.
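One hedged guess: the extra bytes look like a zero byte in front of every character, which is what you get when a file is written two bytes per character (for example with DataOutputStream.writeChars). If that is how this file was produced, matching the charset when reading recovers the text; a sketch reusing the open stream from the snippet above:

import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets

// Assumption: the file was written as two-byte big-endian characters (UTF-16BE).
val reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_16BE))
val offset: String = reader.readLine() // yields "5432" instead of "^#5^#4^#3^#2"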

Hdfs file list in scala

I am trying to list the files in an HDFS directory, but the code expects a file as the input when I try to run the code below.
val TestPath2="hdfs://localhost:8020/user/hdfs/QERESULTS1.csv"
val hdfs: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.FileSystem.get(sc.hadoopConfiguration)
val hadoopPath = new org.apache.hadoop.fs.Path(TestPath2)
val recursive = true
// val ri = hdfs.listFiles(hadoopPath, recursive)()
//println(hdfs.getChildFileSystems)
//hdfs.get(sc
val ri=hdfs.listFiles(hadoopPath, true)
println(ri)
You should set your default filesystem to hdfs:// first; it seems like your default filesystem is file://.
val conf = sc.hadoopConfiguration
conf.set("fs.defaultFS", "hdfs://some-path")
val hdfs: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.FileSystem.get(conf)
...
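Also note that listFiles returns a RemoteIterator[LocatedFileStatus], so println(ri) only prints the iterator object itself. A small sketch of walking it (the directory path is an assumption based on the question's file path):

val it = hdfs.listFiles(new org.apache.hadoop.fs.Path("/user/hdfs"), true)
while (it.hasNext) {
  val status = it.next()
  println(status.getPath) // full HDFS path of each file under the directory
}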