Spark Streaming textFileStream COPYING - scala

I'm trying to monitor a directory in HDFS and read and process the data in files copied to it (to copy files from the local system to HDFS I use hdfs dfs -put). Sometimes this raises the problem: Spark Streaming: java.io.FileNotFoundException: File does not exist: .COPYING, so I read about the problem in forums and in this question: Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_
According to what I read, the problem is that Spark Streaming reads the file before it has finished being copied into HDFS. On GitHub:
https://github.com/maji2014/spark/blob/b5af1bdc3e35c53564926dcbc5c06217884598bb/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala , they say they corrected the problem, but as far as I can see only for FileInputDStream, and I'm using textFileStream.
When I tried to use FileInputDStream directly, the IDE throws an error: the Symbol is not accessible from this place.
Does anyone know how to filter out the files that are still _COPYING_? Because I tried:
var lines = ssc.textFileStream(arg(0)).filter(!_.contains("_COPYING_"))
but that didn't work, and that's expected, because I guess the filter should be applied to the name of the file being processed, which I can't access.
As you can see, I did plenty of research before asking the question but didn't get lucky.
Any help please?

So I had a look: -put is the wrong method. Look at the final comment there: you have to copy the file somewhere else first and then rename it (e.g. hdfs dfs -mv) into the monitored directory in your shell script, so that it appears in HDFS as an atomic operation.
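For illustration, here is a minimal sketch (my own, not from the linked change) of that pattern using the Hadoop FileSystem API: upload to a staging path that the stream does not watch, then rename into the monitored directory. The paths below are hypothetical.
// Upload to a staging directory first, then rename into the watched directory.
// rename() is a metadata operation inside HDFS, so the file appears atomically.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val staging = new Path("/user/me/staging/data.txt")  // hypothetical path, not watched by the stream
val target  = new Path("/user/me/watched/data.txt")  // hypothetical path inside the directory passed to textFileStream

fs.copyFromLocalFile(new Path("file:///tmp/data.txt"), staging)  // the slow copy happens here
fs.rename(staging, target)                                       // atomic move inside HDFS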

Related

Unable to load multiple json files with pyspark

I am fairly new to pyspark and am trying to load data from a folder which contains multiple JSON files. However, the load fails. Here is the code that I am using:
spark = SparkSession.builder.master("local[1]") \
.appName('SparkByExamples.com') \
.getOrCreate()
spark.read.json('file_directory/*')
I am getting this error:
Exception in thread "globPath-ForkJoinPool-1-worker-57" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
I tried setting the path variables for Hadoop and Spark as well, but still no use.
However, if I load a single file from the directory, it loads perfectly.
Can someone please tell me what is going wrong in this case?
I can successfully read all the CSV files under a directory without adding the asterisk. I think you should try
spark.read.json('file_directory/')

Kafka Connect - Missing Text

Kafka Version : 2.12-2.1.1
I created a very simple example with a source and a sink connector by using the following command:
bin\windows\connect-standalone.bat config\connect-standalone.properties config\connect-file-source.properties config\connect-file-sink.properties
Source file name : test_2.txt
Sink file name : test.sink_2.txt
A topic named "connect-test-2" is used and I created a consumer in PowerShell to show the result.
It worked perfectly the first time. However, after I rebooted my machine and started everything again, I found that some text is missing.
For example, when I type the lines below into the test_2.txt file and save it:
HAHAHA..
missing again
some text are missing
I am able to enter text
first letter is missing
testing testing.
The result window (consumer) and the sink file show that some of this text is missing, and I cannot figure out why this happens. Any advice?
[Added information below]
connect-file-source.properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test_2.txt
topic=connect-test-2
connect-file-sink.properties
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink_2.txt
topics=connect-test-2
I think the strange behaviour is caused by the way you are modifying the source file (test_2.txt).
How you might have applied changes after stopping the connector:
Using some editor <-- I think you used that method
Appending only new characters to the end of the file
FileStreamSource tracks changes based on the position in the file. You are using Kafka Connect in standalone mode, so the current position is written to the /tmp/connect.offsets file.
If you modify the source file using an editor, the whole content of the file is rewritten. However, FileStreamSource only checks whether the size has changed, and it polls the characters whose offsets in the file are greater than the last offset processed by the connector.
You should modify the source file only by appending new characters to the end of the file.
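For example, a minimal sketch of appending a line instead of rewriting the file (file name taken from the config above):
// Open the connector's source file in append mode so that only new bytes
// past the connector's stored offset are added.
import java.io.{FileWriter, PrintWriter}

val out = new PrintWriter(new FileWriter("test_2.txt", true)) // true = append mode
try out.println("another line for the connector to pick up")
finally out.close()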

Spark Scala S3 storage: permission denied

I've read a lot of topics on the Internet about how to get Spark working with S3, but nothing works properly yet.
I've downloaded Spark 2.3.2 prebuilt for Hadoop 2.7 and later.
I've copied only some libraries from Hadoop 2.7.7 (which matches the Spark/Hadoop version) into the Spark jars folder:
hadoop-aws-2.7.7.jar
hadoop-auth-2.7.7.jar
aws-java-sdk-1.7.4.jar
Still, I can't use either S3N or S3A to get my file read by Spark.
For S3A I have this exception:
sc.hadoopConfiguration.set("fs.s3a.access.key","myaccesskey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecretkey")
val file = sc.textFile("s3a://my.domain:8080/test_bucket/test_file.txt")
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: AE203E7293ZZA3ED, AWS Error Code: null, AWS Error Message: Forbidden
Using a piece of Python (and some more code, not shown here), I can list my buckets, list my files, download files, read files from my computer and get a file URL.
This code gives me the following file URL:
https://my.domain:8080/test_bucket/test_file.txt?Signature=%2Fg3jv96Hdmq2450VTrl4M%2Be%2FI%3D&Expires=1539595614&AWSAccessKeyId=myaccesskey
What should I install / set up / download to get Spark able to read and write from my S3 server?
Edit 3:
Using the debug tool suggested in the comments, here's the result.
It seems like the issue is something to do with the signature, but I'm not sure what that means.
First you will need to download the hadoop-aws and aws-java-sdk jars that match your Spark/Hadoop release and add them to the jars folder inside the Spark folder.
Then you will need to specify the server you will use and enable path-style access if your S3 server does not support dynamic DNS:
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
// I had to change the signature version because I have an old S3 API implementation:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
Here's my final code:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val tmp = sc.textFile("s3a://test_bucket/test_file.txt")
sc.hadoopConfiguration.set("fs.s3a.access.key","mykey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecret")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","true")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
tmp.count()
I would recommend putting most of the settings inside spark-defaults.conf:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.endpoint my.domain:8080
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.signing-algorithm S3SignerType
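If you prefer to keep everything in code, the same spark.hadoop.* keys can also be set on the SparkConf before the SparkContext is created (Spark copies spark.hadoop.* entries into the Hadoop configuration); here is a sketch mirroring the settings above:
import org.apache.spark.{SparkConf, SparkContext}

// spark.hadoop.* entries are propagated into sc.hadoopConfiguration
val conf = new SparkConf()
  .setAppName("s3a-read")
  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.fs.s3a.endpoint", "my.domain:8080")
  .set("spark.hadoop.fs.s3a.path.style.access", "true")
  .set("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")
  .set("spark.hadoop.fs.s3a.signing-algorithm", "S3SignerType")
val sc = new SparkContext(conf)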
One issue I had was setting spark.hadoop.fs.s3a.connection.timeout to 10: prior to Hadoop 3 this value is in milliseconds, and it gives you a very long wait before failing; the error message would appear about 1.5 minutes after the attempt to read a file.
PS:
Special thanks to Steve Loughran.
Thank you very much for the precious help.

Url for HDFS file system

I have some data in HDFS at /user/Cloudera/Test/*. I can see the records just fine by running hdfs dfs -cat Test/*.
Now I need to read the same files as an RDD in Scala.
I have tried the following in the Scala shell:
val file = sc.textFile("hdfs://quickstart.cloudera:8020/user/Cloudera/Test")
Then I wrote a filter and a for loop to read the words. But when I use println at the end, it says the file is not found.
Can anyone please help me figure out what the HDFS URL should be in this case?
Note: I am using Cloudera CDH5.0 VM
If you are trying to access your file in a Spark job then you can simply use the URL
val file = sc.textFile("/user/Cloudera/Test")
Spark will automatically detect this file. You do not need to add localhost as a prefix, because Spark jobs read from the HDFS directory by default.
Hope this solves your query.
Instead of using "quickstart.cloudera" and the port, use just the IP address:
val file = sc.textFile("hdfs://<ip>/user/Cloudera/Test")
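As a quick sanity check (a sketch assuming fs.defaultFS is hdfs://quickstart.cloudera:8020 in core-site.xml, which is the usual setup on the quickstart VM), both forms should point at the same data:
// Paths without a scheme are resolved against fs.defaultFS, so both RDDs read the same files.
val viaDefaultFs = sc.textFile("/user/Cloudera/Test")
val viaFullUri   = sc.textFile("hdfs://quickstart.cloudera:8020/user/Cloudera/Test")
println(viaDefaultFs.count() == viaFullUri.count())  // should print true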

Read many files from Kafka Connect FileStreamSourceTask

I am reading one log file into Kafka and creating a topic. This is successful. To read this file, I am editing the file config/connect-file-source.properties for this purpose, following Step 7 of the Kafka quickstart (http://kafka.apache.org/quickstart#quickstart_kafkaconnect).
But now I would like to read a lot of files. In config/connect-file-source.properties I have set the file variable to a pattern, for instance:
file=/etc/logs/archive.log*
because I want to read all the files in the logs directory matching the pattern archive*.log. But this line doesn't work.
What is the best way to implement reading files matching a pattern, using config/connect-file-source.properties?
In config/connect-file-source.properties, the source class is FileStreamSource and its task class is FileStreamSourceTask.
It reads a single file using a FileInputStream, so it cannot open multiple files at once (e.g. by passing a directory name or a regex pattern).
You should implement your own SourceConnector & SourceTask classes, or use an existing connector that supports this feature, such as kafka-connect-spooldir.
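For reference, here is a rough sketch (not the built-in FileStreamSourceTask, and with offset tracking deliberately omitted) of what a custom SourceTask that reads every file matching a glob could look like; the property names input.dir, file.pattern and topic are hypothetical:
import java.nio.file.{FileSystems, Files, Paths}
import java.util.{Collections, List => JList, Map => JMap}
import scala.collection.JavaConverters._
import org.apache.kafka.connect.data.Schema
import org.apache.kafka.connect.source.{SourceRecord, SourceTask}

class GlobFileSourceTask extends SourceTask {
  private var dir: String = _
  private var pattern: String = _
  private var topic: String = _

  override def version(): String = "0.1"

  override def start(props: JMap[String, String]): Unit = {
    dir = props.get("input.dir")        // hypothetical property names
    pattern = props.get("file.pattern")
    topic = props.get("topic")
  }

  // Reads every matching file on each poll; a real task would remember
  // per-file offsets (the second map below) and only emit new lines.
  override def poll(): JList[SourceRecord] = {
    val matcher = FileSystems.getDefault.getPathMatcher(s"glob:$pattern")
    val stream = Files.list(Paths.get(dir))
    try {
      stream.iterator().asScala
        .filter(p => matcher.matches(p.getFileName))
        .flatMap { p =>
          Files.readAllLines(p).asScala.map { line =>
            new SourceRecord(
              Collections.singletonMap("file", p.toString),                      // source partition
              Collections.singletonMap("position", java.lang.Long.valueOf(0L)),  // offset (not tracked here)
              topic, Schema.STRING_SCHEMA, line)
          }
        }
        .toList.asJava
    } finally stream.close()
  }

  override def stop(): Unit = ()
}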