Hi, my requirement is to create analytics from http://10.3.9.34:9900/messages: that is, pull the data from http://10.3.9.34:9900/messages, put it in the HDFS location /user/cloudera/flume, and then build an analytics report on top of HDFS using Tableau or the HUE UI. I tried the code below in the Scala console of spark-shell on CDH 5.5, but I am unable to fetch data from the HTTP link:
import org.apache.spark.SparkContext
val dataRDD = sc.textFile("http://10.3.9.34:9900/messages")
dataRDD.collect().foreach(println)
dataRDD.count()
dataRDD.saveAsTextFile("/user/cloudera/flume")
I get the following error in the Scala console:
java.io.IOException: No FileSystem for scheme: http
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2623)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2637)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2680)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2662)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:379)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
You can't use an HTTP endpoint as input; it needs to be a file system such as HDFS, S3, or the local file system.
You would need a separate process that pulls data from this endpoint, perhaps something like Apache NiFi, to land the data on a filesystem where you can then use it as input to Spark.
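If you only need a quick, one-off way to land that endpoint's content in HDFS from the same spark-shell session, here is a rough sketch. It assumes the endpoint returns plain text, the payload is small enough to hold on the driver, and the target file name messages.txt is just an illustrative choice:
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source
// Pull the text from the HTTP endpoint on the driver (fine for small payloads only).
val text = Source.fromURL("http://10.3.9.34:9900/messages").mkString
// Write it to HDFS with the Hadoop FileSystem API.
val fs = FileSystem.get(sc.hadoopConfiguration)
val out = fs.create(new Path("/user/cloudera/flume/messages.txt"))
out.write(text.getBytes("UTF-8"))
out.close()
// Now the data is on HDFS and can be read back as an RDD.
val dataRDD = sc.textFile("/user/cloudera/flume/messages.txt")
dataRDD.count()
For anything recurring or larger, a dedicated ingestion tool (Flume, NiFi, or even a cron job with curl plus hdfs dfs -put) is the more robust option.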
Unity Catalog has recently been set up in my Databricks account, and I am trying to stream from an Azure container containing parquet files to a service catalog, using a notebook that ran fine before.
However, I now get the following error:
py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.streaming.DataStreamReader.format(java.lang.String) is not whitelisted on class class org.apache.spark.sql.streaming.DataStreamReader
when trying to run the following Spark command from my notebook:
df = (spark
.readStream
.format("cloudFiles")
.option("cloudFiles.format", "parquet")
.option("cloudFiles.useNotifications", "false") # useNotifications determines if we efficiently scan the new files or if we set up a subscription to listen to new file events
.option("cloudFiles.schemaEvolutionMode", "failOnNewColumns") # schemaEvolutionMode determines what happens when the schema changes
.option("cloudFiles.schemaLocation", schemaPath)
.load(dataPath)
)
where schemaPath and dataPath contain the paths to the parquet schema and data files.
The closest related error I have found is the following pre-Unity Catalog error, suggesting that I should disable table access control on my clusters:
https://kb.databricks.com/en_US/streaming/readstream-is-not-whitelisted
All table access controls are disabled in my Admin Console.
Are there some other settings that should be set to ensure white-listing from Azure files now that Unity Catalog is set up?
------ Edit -----
Using a Single User cluster on Databricks runtime version 11.3 beta, I get the following error instead:
com.databricks.sql.cloudfiles.errors.CloudFilesIOException: Failed to write to the schema log at location
followed by the path to the schema location in my Azure storage. I also get this error message when spawning new job clusters from Azure Data Factory.
I've read a lot of topics on the Internet about getting Spark to work with S3, but nothing works properly.
I've downloaded Spark 2.3.2 prebuilt for Hadoop 2.7 and later.
I've copied only some libraries from Hadoop 2.7.7 (which matches the Spark/Hadoop version) into the Spark jars folder:
hadoop-aws-2.7.7.jar
hadoop-auth-2.7.7.jar
aws-java-sdk-1.7.4.jar
Still, I can't use either S3N or S3A to get my file read by Spark.
For S3A I get this exception:
sc.hadoopConfiguration.set("fs.s3a.access.key","myaccesskey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecretkey")
val file = sc.textFile("s3a://my.domain:8080/test_bucket/test_file.txt")
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: AE203E7293ZZA3ED, AWS Error Code: null, AWS Error Message: Forbidden
Using this piece of Python, plus some more code, I can list my buckets, list my files, download files, read files from my computer, and get a file URL.
This code gives me the following file URL:
https://my.domain:8080/test_bucket/test_file.txt?Signature=%2Fg3jv96Hdmq2450VTrl4M%2Be%2FI%3D&Expires=1539595614&AWSAccessKeyId=myaccesskey
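For comparison, the same kind of sanity check can also be run from the spark-shell against the AWS Java SDK 1.7.4 that is already on the classpath. This is only a sketch of the idea (not the Python script mentioned above), reusing the endpoint and placeholder credentials from this question:
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.S3ClientOptions
import scala.collection.JavaConverters._
// Talk to the S3 server directly, bypassing Hadoop's s3a layer, to confirm the
// credentials, endpoint and path-style access work outside Spark.
val s3 = new AmazonS3Client(new BasicAWSCredentials("myaccesskey", "mysecretkey"))
s3.setEndpoint("https://my.domain:8080")
val opts = new S3ClientOptions()
opts.setPathStyleAccess(true)
s3.setS3ClientOptions(opts)
s3.listBuckets().asScala.foreach(b => println(b.getName))
s3.listObjects("test_bucket").getObjectSummaries.asScala.foreach(o => println(o.getKey))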
What should I install / set up / download so that Spark is able to read and write from my S3 server?
Edit 3:
Using the debug tool mentioned in the comments, here's the result.
It seems the issue is with the signature, but I'm not sure what that means.
First you will need to download the hadoop-aws and aws-java-sdk JARs that match your Spark/Hadoop release and add them to the jars folder inside the Spark folder.
Then you will need to specify the server you will use and enable path-style access if your S3 server does not support dynamic DNS:
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
#I had to change signature version because I have an old S3 api implementation:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
Here's my final code:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val tmp = sc.textFile("s3a://test_bucket/test_file.txt")
sc.hadoopConfiguration.set("fs.s3a.access.key","mykey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecret")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","true")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
tmp.count()
I would recommend putting most of the settings inside spark-defaults.conf:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.endpoint mydomain:8080
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.signing-algorithm S3SignerType
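With those settings in spark-defaults.conf, a spark-shell session then only needs the credentials at runtime (they could also go in the file as spark.hadoop.fs.s3a.access.key / secret.key, but keeping keys out of shared config files is usually preferable). A minimal example:
sc.hadoopConfiguration.set("fs.s3a.access.key", "myaccesskey")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "mysecretkey")
sc.textFile("s3a://test_bucket/test_file.txt").count()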
One of the issues I had was setting spark.hadoop.fs.s3a.connection.timeout to 10: prior to Hadoop 3 this value is in milliseconds, so 10 is an extremely short timeout, and the result is a very long wait, with the error message only appearing about 1.5 minutes after the attempt to read a file.
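If you do need to change that timeout, a hedged example in the same spark-defaults.conf style (200000 ms, which as far as I know is the Hadoop 2.x default):
spark.hadoop.fs.s3a.connection.timeout 200000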
PS:
Special thanks to Steve Loughran.
Thank you very much for the precious help.
The Spring Cloud Data Flow SFTP source starter app states that the file name should be in the headers (mode=contents). However, when I connect this source to a log sink, I see a few headers (like Content-Type) but not the file_name header. I want to use this header to upload the file to S3 under the same name.
spring server: Spring Cloud Data Flow Local Server (v1.2.3.RELEASE)
my apps are all imported from here
stream definition:
stream create --definition "sftp --remote-dir=/incoming --username=myuser --password=mypwd --host=myftp.company.io --mode=contents --filename-pattern=preloaded_file_2017_ --allow-unknown-keys=true | log" --name test_sftp_log
Configuring the log application with --expression=#root --level=debug doesn't make any difference. Also, when I write my own sink that tries to access the file_name header, I get an error message that no such header exists.
Log snippets from the source and the sink are in this gist.
Please follow the link below. You need to code your own source and populate such a header manually downstream of the FileReadingMessageSource, and only after that send the message with the content and the appropriate header to the target destination.
https://github.com/spring-cloud-stream-app-starters/file/issues/9
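For illustration only, the approach described in that issue could look roughly like the following transformer placed right after the FileReadingMessageSource, while the payload is still a java.io.File (written here as a Scala sketch; the channel names "files" and "output" are placeholders, not part of the starter app):
import java.io.File
import java.nio.file.Files
import org.springframework.integration.annotation.Transformer
import org.springframework.integration.file.FileHeaders
import org.springframework.integration.support.MessageBuilder
import org.springframework.messaging.Message
class FileNameEnricher {
  // Capture the file name as a header before converting the payload to its contents,
  // so downstream apps (e.g. an S3 sink) can still see the original name.
  @Transformer(inputChannel = "files", outputChannel = "output")
  def toContentsWithName(message: Message[File]): Message[Array[Byte]] = {
    val file = message.getPayload
    MessageBuilder
      .withPayload(Files.readAllBytes(file.toPath))
      .copyHeaders(message.getHeaders)
      .setHeader(FileHeaders.FILENAME, file.getName)
      .build()
  }
}
The point is simply that the header has to be set while the File payload (and hence its name) is still available.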
I have some data in HDFS at /user/Cloudera/Test/*. I can see the records fine by running hdfs dfs -cat Test/*.
Now I need to read the same files as an RDD in Scala.
I have tried the following in the Scala shell:
val file = sc.textFile("hdfs://quickstart.cloudera:8020/user/Cloudera/Test")
Then I wrote a filter and a for loop to read the words. But when I finally call println, it says the file was not found.
Can anyone please help me figure out what the HDFS URL should be in this case?
Note: I am using the Cloudera CDH 5.0 VM.
If you are trying to access your file in a Spark job, you can simply use the path:
val file = sc.textFile("/user/Cloudera/Test")
Spark will automatically resolve this path. You do not need to add the scheme and host as a prefix, because Spark jobs read from the default HDFS filesystem.
Hope this solves your query.
Instead of using "quickstart.cloudera" and the port, use just the IP address:
val file = sc.textFile("hdfs://<ip>/user/Cloudera/Test")
I have a Jenkins job that runs Selenium tests. The results are stored in a CSV file and then fed into Cassandra. My requirement is to create a JIRA request whenever a test fails, either by analyzing the CSV file or from Cassandra. Please suggest possible approaches.
Use the Jira REST API together with a CSV reader (or the Cassandra API):
https://docs.atlassian.com/jira/REST/latest/
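As a rough sketch of that approach in Scala (the Jira base URL, project key, credentials, and the CSV column layout below are all assumptions to adapt to your setup):
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.util.Base64
import scala.io.Source
// Create a Jira issue for one failed test via the REST API (POST /rest/api/2/issue).
def createJiraIssue(summary: String, description: String): Int = {
  val auth = Base64.getEncoder.encodeToString("jira-user:jira-password".getBytes(StandardCharsets.UTF_8))
  val body =
    s"""{"fields": {"project": {"key": "QA"},
       |"summary": "$summary",
       |"description": "$description",
       |"issuetype": {"name": "Bug"}}}""".stripMargin
  val conn = new URL("https://jira.example.com/rest/api/2/issue").openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Authorization", s"Basic $auth")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
  conn.getResponseCode // 201 means the issue was created
}
// Walk the Selenium results CSV and raise an issue per failed test
// (here column 0 is assumed to be the test name and column 1 its status).
Source.fromFile("selenium_results.csv").getLines().drop(1).foreach { line =>
  val cols = line.split(",")
  if (cols(1).trim.equalsIgnoreCase("FAILED"))
    createJiraIssue(s"Selenium test failed: ${cols(0)}", line)
}
The same loop could read the failures out of Cassandra instead of the CSV file; the Jira call stays the same.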