Problems with non-"UTF-8" file collection using flume - Spooldir type - encoding

My flume spool directory contains non-"UTF-8" files.
So I get a java.nio.charset.MalformedInputException error when I try to collect them.
Changing the encoding option in the .conf file also causes an error.
And I have to use the spooldir type.
How can I collect non-"UTF-8" files?

The encoding of our log files was Latin5 (which is Turkish).
I fixed it by adding the line below to the conf file:
AGENTNAME.sources.SOURCENAME.inputCharset = ISO-8859-9
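For context, a minimal sketch of where that property sits in a spooldir source definition (the agent name, source name, and spool directory below are placeholders, not taken from the original config):
# Hypothetical agent/source names, for illustration only
agent1.sources = src1
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/log/incoming
# Charset of the spooled files (Latin5 / Turkish in this case)
agent1.sources.src1.inputCharset = ISO-8859-9
# Optional: ignore undecodable bytes instead of failing the whole file
agent1.sources.src1.decodeErrorPolicy = IGNORE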

Related

Error while loading parquet format file into Amazon Redshift using copy command and manifest file

I'm trying to load a parquet file using a manifest file and I'm getting the error below.
query: 124138 failed due to an internal error. File 'https://s3.amazonaws.com/sbredshift-east/data/000002_0 has an invalid version number: )
Here is my copy command
copy testtable from 's3://sbredshift-east/manifest/supplier.manifest'
IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123'
FORMAT AS PARQUET
manifest;
Here is my manifest file:
{
  "entries": [
    {
      "url": "s3://sbredshift-east/data/000002_0",
      "mandatory": true,
      "meta": {
        "content_length": 1000
      }
    }
  ]
}
I'm able to load the same file using the copy command when I specify the file name directly.
copy testtable from 's3://sbredshift-east/data/000002_0' IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123' FORMAT AS PARQUET;
INFO: Load into table 'supplier' completed, 800000 record(s) loaded successfully.
COPY
What could be wrong in my copy statement?
This error happens when the content_length value is wrong. You have to specify the correct content_length; you can check it by executing an s3 ls command:
aws s3 ls s3://sbredshift-east/data/
2019-12-27 11:15:19 539 sbredshift-east/data/000002_0
The 539 (file size) should be the same as the content_length value in your manifest file.
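As a sketch, here is the manifest from the question with content_length corrected to the size reported by s3 ls (539 bytes in this example):
{
  "entries": [
    {
      "url": "s3://sbredshift-east/data/000002_0",
      "mandatory": true,
      "meta": {
        "content_length": 539
      }
    }
  ]
}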
I don't know why they are using this meta value when you don't need it in the direct copy command.
¯\_(ツ)_/¯
The only way I've gotten parquet COPY to work with a manifest file is to add the meta key with the content_length.
From what I can gather in my error logs, the COPY command for parquet (with a manifest) might first be reading the files using Redshift Spectrum as an external table. If that's the case, this hidden step does require the content_length, which contradicts their initial statement about COPY commands.
https://docs.amazonaws.cn/en_us/redshift/latest/dg/loading-data-files-using-manifest.html

Scala Spark - Overwrite parquet File on HDFS

I was trying to append a data frame to an existing parquet file and found the option of setting the save mode to append. But when I try to append, it throws an error that the target is not a directory.
data.coalesce(1).write.mode(SaveMode.Append).parquet("/user/root/AppendTest");
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=EXECUTE, inode="/user/root/AppendTest":root:root:-rw-r--r-- (Ancestor /user/root/AppendTest is not a directory).
P.S.: When creating the new file, it was generated into a folder and I then renamed it to the desired file.
I have checked How to overwrite the output directory in spark, but that doesn't solve my problem here. I have tried the ways mentioned in that question (the issue mentioned there is also different).
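For what it's worth, a minimal Scala sketch of the usual pattern: keep the target path as a directory that Spark owns and append new part files into it, rather than renaming the output to a single file (the Spark 2.x session setup and the staging path are illustrative assumptions, not from the question):
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("AppendParquetSketch").getOrCreate()
// Hypothetical new batch of data to append
val newData = spark.read.parquet("/user/root/staging_batch")

// Write to a directory and let Spark manage the part-*.parquet files inside it.
// Appending fails with "Ancestor ... is not a directory" if a regular file
// (e.g. a manually renamed output) already sits at the target path.
newData.coalesce(1)
  .write
  .mode(SaveMode.Append)
  .parquet("/user/root/AppendTest")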

citrus waitFor().file fails to read a file

I'm trying to use waitFor() in my Citrus test to wait for an output file on disk to be written by the process I'm testing. I've used this code:
File outputFile = new File("/esbfiles/blesbt/bl03orders.99160221.14289.xml");
waitFor().file(outputFile).seconds(65L).interval(1000L);
After a few seconds, the file appears in the folder as expected, and the user I'm running the test code as has permission to read the file. The waitFor(), however, ends in a timeout.
09:46:44 09:46:44,818 DEBUG dition.FileCondition| Checking file path '/esbfiles/blesbt/bl03orders.99160221.14289.xml'
09:46:44 09:46:44,818 WARN dition.FileCondition| Failed to access file resource 'class path resource [esbfiles/blesbt/bl03orders.99160221.14289.xml] cannot be resolved to URL because it does not exist'
What could be the problem? Can’t I check for files outside the classpath?
This is actually a bug in Citrus. Citrus works with the file path instead of the file object, and in combination with Spring's PathMatchingResourcePatternResolver this causes Citrus to search for a classpath resource instead of using the absolute file path as an external file system resource.
You can fix this by providing the absolute file path instead of the file object like this:
waitFor().file("file:/esbfiles/blesbt/bl03orders.99160221.14289.xml")
.seconds(65L)
.interval(1000L);
An issue for the broken file object conversion has been opened: https://github.com/christophd/citrus/issues/303
Thanks for pointing to it!

Reading Avro container files in Spark

I am working on a scenario where I need to read Avro container files from HDFS and do analysis using Spark.
Input Files Directory: hdfs:///user/learner/20151223/.lzo*
Note : The Input Avro Files are lzo compressed.
val df = sqlContext.read.avro("/user/learner/20151223/*.lzo");
When I run the above command, it throws an error:
java.io.FileNotFoundException: No avro files present at file:/user/learner/20151223/*.lzo
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at scala.Option.getOrElse(Option.scala:120)
at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:225)
This makes sense, because the method read.avro() expects files with a .avro extension as input. So I extracted the input .lzo file, renamed it to .avro, and was able to read the data in the avro file properly.
Is there any way to read lzo-compressed Avro files in Spark?
Solution worked, but!
I have found a way to solve this issue. I created a shell wrapper in which I decompress the .lzo files into .avro file format in the following way:
hadoop fs -text <file_path>*.lzo | hadoop fs -put - <file_path>.avro
I am successful in decompressing the lzo files, but the problem is that I have at least 5000 files in compressed format. Uncompressing and converting them one by one takes over an hour to run this job.
Is there any way to do this decompression in bulk?
Thanks again !
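One hedged idea for the bulk case, building on the same hadoop fs -text pipeline: feed the file listing to xargs and run several conversions in parallel (the directory, the suffix handling, and the parallelism level are assumptions, and this relies on GNU xargs supporting -P):
#!/bin/bash
# Convert every .lzo file under the input directory to an .avro copy,
# running up to 8 decompressions at a time.
hadoop fs -ls /user/learner/20151223/ | awk '{print $NF}' | grep '\.lzo$' |
  xargs -P 8 -I{} bash -c 'hadoop fs -text "$1" | hadoop fs -put - "${1%.lzo}.avro"' _ {}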

Spark Shell unable to read file at valid path

I am trying to read a file in the Spark Shell that comes with the CentOS distribution of Cloudera on my local machine. Following are the commands I have entered in the Spark Shell.
spark-shell
val fileData = sc.textFile("hdfs://user/home/cloudera/cm_api.py");
fileData.count
I also tried this statement for reading the file:
val fileData = sc.textFile("user/home/cloudera/cm_api.py");
However, I am getting
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera/cm_api.py
I haven't changed any settings or configurations. What am I doing wrong?
You are missing the leading slash in your url, so the path is relative. To make it absolute, use
val fileData = sc.textFile("hdfs:///user/home/cloudera/cm_api.py")
or
val fileData = sc.textFile("/user/home/cloudera/cm_api.py")
I think you need to put the file in HDFS first with hadoop fs -put, then check the file with hadoop fs -ls, then in spark-shell run val fileData = sc.textFile("cm_api.py").
In "hdfs://user/home/cloudera/cm_api.py", you are missing the hostname of the URI. You should have pass something like "hdfs://<host>:<port>/user/home/cloudera/cm_api.py", where <host> is Hadoop NameNode host and the <port> is, well, port number of Hadoop NameNode, which is 50070 by default.
The error message says hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera/cm_api.py does not exist. The path looks suspicious! The file you mean is probably at hdfs://quickstart.cloudera:8020/user/cloudera/cm_api.py.
If it is, you can access it by using that full path. Or, if the default file system is configured as hdfs://quickstart.cloudera:8020/user/cloudera/, you can use simply cm_api.py.
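To make that concrete, a small sketch of both forms (the host and port are taken from the error message in the question; the relative form assumes the default file system is that cluster and the HDFS home directory is /user/cloudera):
// Fully qualified HDFS URI
val byUri = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/cm_api.py")

// Relative path, resolved against the HDFS home directory /user/cloudera
val byName = sc.textFile("cm_api.py")

byUri.count()
byName.count()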
You may be confused between HDFS file paths and local file paths. By specifying
hdfs://quickstart.cloudera:8020/user/home/cloudera/cm_api.py
you are saying two things:
1) there is a computer by the name "quickstart.cloudera" reachable via the network (try ping to ensure that is the case), and it is running HDFS.
2) the HDFS file system contains a file at /user/home/cloudera/cm_api.py (try 'hdfs dfs -ls /user/home/cloudera/' to verify this).
If you are trying to access a file on the local file system you have to use a different URI:
file:///user/home/cloudera/cm_api.py