Scala Spark - Overwrite parquet File on HDFS - scala

I was trying to append a data frame to an existing parquet file and found the option to set the save mode to append. But when I try to append, it throws an error saying the path is not a directory.
data.coalesce(1).write.mode(SaveMode.Append).parquet("/user/root/AppendTest");
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=EXECUTE, inode="/user/root/AppendTest":root:root:-rw-r--r-- (Ancestor /user/root/AppendTest is not a directory).
P.S.: When I originally created the file, it was written out as a folder, which I then renamed to the desired file name.
I have checked How to overwrite the output directory in spark, but that doesn't solve my problem here. I have tried the approaches mentioned in that question (the issue described there is also different).
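A minimal sketch of the append pattern this error points at, assuming the target path is left as the directory Spark creates rather than renamed to a single file (moreData below is a hypothetical second data frame):

import org.apache.spark.sql.SaveMode

// First write creates the directory /user/root/AppendTest with part files inside it
data.coalesce(1).write.mode(SaveMode.Overwrite).parquet("/user/root/AppendTest")

// Later writes add more part files to the same directory (moreData is hypothetical)
moreData.coalesce(1).write.mode(SaveMode.Append).parquet("/user/root/AppendTest")

// Reading the directory returns the original and appended rows together
val appended = sqlContext.read.parquet("/user/root/AppendTest")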

Related

Check if directory contains json files using org.apache.hadoop.fs.Path in HDFS

I'm following the steps indicated here: Avoid "Path does not exist" in dir based spark load, to filter which directories in an array contain JSON files before sending them to the spark.read method.
When I use
inputPaths.filter(f => fs.exists(new org.apache.hadoop.fs.Path(f + "/*.json*")))
It returns empty even though JSON files exist in one of the paths. One of the comments says this doesn't work with HDFS; is there a way to make this work?
I'm running this in a Databricks notebook.
There is a method for listing the files in a directory:
fs.listStatus(dir)
Something like:
inputPaths.filter(f => fs.listStatus(f).exists(file => file.getPath.getName.endsWith(".json")))
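A self-contained sketch of that filter, assuming the FileSystem is taken from the active Hadoop configuration and the base directories below are placeholders:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical list of candidate directories
val inputPaths = Seq("/mnt/data/day1", "/mnt/data/day2")

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Keep only directories that exist and contain at least one .json file
val jsonPaths = inputPaths.filter { dir =>
  val p = new Path(dir)
  fs.exists(p) && fs.listStatus(p).exists(_.getPath.getName.endsWith(".json"))
}

val df = spark.read.json(jsonPaths: _*)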

Error while loading parquet format file into Amazon Redshift using copy command and manifest file

I'm trying to load parquet file using manifest file and getting below error.
query: 124138 failed due to an internal error. File 'https://s3.amazonaws.com/sbredshift-east/data/000002_0 has an invalid version number: )
Here is my copy command
copy testtable from 's3://sbredshift-east/manifest/supplier.manifest'
IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123'
FORMAT AS PARQUET
manifest;
here is my manifest file
{
  "entries": [
    {
      "url": "s3://sbredshift-east/data/000002_0",
      "mandatory": true,
      "meta": {
        "content_length": 1000
      }
    }
  ]
}
I'm able to load the same file using copy command by specifying the file name.
copy testtable from 's3://sbredshift-east/data/000002_0' IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123' FORMAT AS PARQUET;
INFO: Load into table 'supplier' completed, 800000 record(s) loaded successfully.
COPY
What could be wrong in my copy statement?
This error happens when the content_length value is wrong. You have to specify the correct content_length, which you can check by running an s3 ls command.
aws s3 ls s3://sbredshift-east/data/
2019-12-27 11:15:19 539 sbredshift-east/data/000002_0
The 539 (the file size in bytes) should be the same as the content_length value in your manifest file.
I don't know why this meta value is needed here when you don't need it in the direct copy command.
¯\_(ツ)_/¯
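For illustration, a corrected manifest for this example would carry the size reported by s3 ls (539 bytes) in content_length:

{
  "entries": [
    {
      "url": "s3://sbredshift-east/data/000002_0",
      "mandatory": true,
      "meta": {
        "content_length": 539
      }
    }
  ]
}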
The only way I've gotten a parquet COPY to work with a manifest file is to add the meta key with the content_length.
From what I can gather in my error logs, the COPY command for parquet (with a manifest) might first be reading the files using Redshift Spectrum as an external table. If that's the case, this hidden step does require the content_length, which contradicts their initial statement about COPY commands.
https://docs.amazonaws.cn/en_us/redshift/latest/dg/loading-data-files-using-manifest.html

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is written to the current worker node directory, and I know this because when I execute "ls -a".!! through Scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the sql context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran val locationPt = "pwd".!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value, but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column "x" exists because I've downloaded some of the files from HDFS, decompressed them locally, and run some tests.
I need to decompress the files one by one because I have 1.6 TB of data, so I cannot decompress it all at once and access the files later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS and then retrieving it through HDFS using the SQL context. I overwrite "out.las" in HDFS each time so that I don't take up too much space.
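A rough sketch of that loop body, assuming the same .las reader as above and a hypothetical HDFS target directory /user/me/dataForHDFS/:

import org.apache.hadoop.fs.{FileSystem, Path}

val hdfsDir = "/user/me/dataForHDFS/"   // hypothetical HDFS target directory
val fs = FileSystem.get(sc.hadoopConfiguration)

// Push the locally decompressed file to HDFS, overwriting the previous copy
fs.copyFromLocalFile(false, true, new Path("out.las"), new Path(hdfsDir + "out.las"))

// Read it back through the SQL context from HDFS instead of the worker's local directory
val dataFrame = sqlContext.read.las(hdfsDir + "out.las")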
I have used the Hadoop API before to get to files; I don't know if it will help you here.
import org.apache.hadoop.fs.{FileSystem, FSDataInputStream, Path}

val filePath = "/user/me/dataForHDFS/"
val fs: FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
I've not tested the code below, and I'm pretty sure I was passing the Java array to Scala incorrectly, but it gives an idea of what to do afterward:
val file = new Path(filePath + "out.las")
val fileIn: FSDataInputStream = fs.open(file)
// Size the buffer from the file length, then read the whole file into it
val readIn = new Array[Byte](fs.getFileStatus(file).getLen.toInt)
fileIn.readFully(0, readIn)
fileIn.close()

Problems with non-"UTF-8" file collection using flume - Spooldir type

My flume spool directory contains non-"UTF-8" files.
So I get a java.nio.charset.MalformedInputException error when I try to collect them.
Changing the encoding option of a .conf file will also cause an error.
And I have to use spooldir type.
How can I collect non-"UTF-8" files?
The encoding of our log files was Latin-5 (which is Turkish).
I fixed it by adding the line below to the conf file:
AGENTNAME.sources.SOURCENAME.inputCharset = ISO-8859-9
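For context, a minimal spooldir source configuration with that property might look like the following (agent, source, and directory names are placeholders):

AGENTNAME.sources.SOURCENAME.type = spooldir
AGENTNAME.sources.SOURCENAME.spoolDir = /var/log/flume-spool
AGENTNAME.sources.SOURCENAME.inputCharset = ISO-8859-9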

Reading Avro container files in Spark

I am working on a scenario where I need to read Avro container files from HDFS and do analysis using Spark.
Input Files Directory: hdfs:///user/learner/20151223/*.lzo
Note: the input Avro files are LZO compressed.
val df = sqlContext.read.avro("/user/learner/20151223/*.lzo");
When I run the above command, it throws an error:
java.io.FileNotFoundException: No avro files present at file:/user/learner/20151223/*.lzo
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at scala.Option.getOrElse(Option.scala:120)
at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:225)
This makes sense, because the read.avro() method expects files with the .avro extension as input. So I extracted the input .lzo files and renamed them to .avro, and I am able to read the data in the Avro files properly.
Is there any way to read LZO-compressed Avro files in Spark?
A solution that worked, but!
I have found a way to work around this issue. I created a shell wrapper in which I decompress the .lzo files into the .avro format as follows:
hadoop fs -text <file_path>*.lzo | hadoop fs -put - <file_path>.avro
I am successful in decompressing the lzo files, but the problem is that I have at least 5000 files in compressed format, and uncompressing and converting them one by one takes over an hour to run this job.
Is there any way to do this decompression in bulk?
Thanks again!
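One possible way to speed this up, untested and with the directory and parallelism below as placeholders, is to drive the same -text | -put pipeline in parallel with xargs:

# List the .lzo files, then run up to 8 decompress-and-upload pipelines at a time
hadoop fs -ls /user/learner/20151223/*.lzo | awk '{print $NF}' \
  | xargs -n 1 -P 8 -I {} sh -c 'hadoop fs -text "$1" | hadoop fs -put - "${1%.lzo}.avro"' _ {}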