I am exploring the StreamSets tool. I have a log file that I need to parse with StreamSets. I pass the log file from a Directory origin to the Log Parser, the log format is Common Log Format, and the destination is Local FS. When I start the pipeline it runs, but I do not get any output. Could anyone please help me?
I have created an init script that lets me write custom logs in Databricks. By default the logs are created on the local (driver/worker machine) path log/log4j-active.log, but how can I have them shipped to DBFS or storage?
%sh
ls logs
I get the following output:
lineage.json
log4j-active.log
log4j-mylog4j-active.log
metrics.json
product.json
stderr
stdout
ttyd_logs
usage.json
I want to copy my log file log4j-mylog4j-active.log to DBFS or Blob Storage; either would work.
dbutils.fs.cp("logs/log4j-mylog4j-active.log", "dbfs:/cluster-logs/")
I also tried a filesystem copy, but it fails with:
FileNotFoundException: /logs/log4j-active.log
I have also tried creating a folder and specifying its path in the logging settings (in the cluster's advanced options),
but that did not work either. I don't know why my local logs are not being shipped to that DBFS location.
How can I transfer my local logs to DBFS or storage?
Thanks in advance!
You just need to enable logging in your cluster configuration (expand "Advanced options") and specify where the logs should go. By default it is dbfs:/cluster-logs/ (the cluster ID is appended to it), but you can specify another path.
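For a one-off manual copy instead of cluster log delivery, the following is a minimal sketch rather than a definitive method. It assumes the DBFS FUSE mount is available on the driver at /dbfs, so a plain file copy lands in DBFS. The helper name ship_log and the example paths are hypothetical.

```python
import shutil
from pathlib import Path


def ship_log(local_log, dest_dir):
    """Copy one local log file into dest_dir and return the path of the copy.

    On a Databricks driver, dest_dir could be something like
    "/dbfs/cluster-logs/custom" (assumption: the /dbfs FUSE mount exists).
    """
    dest = Path(dest_dir)
    # Create the target folder if it is missing.
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / Path(local_log).name
    # copy2 also preserves the modification time of the log file.
    shutil.copy2(local_log, target)
    return target


# Example call (paths are assumptions, adjust to your cluster):
# ship_log("logs/log4j-mylog4j-active.log", "/dbfs/cluster-logs/custom")
```

If dbutils is preferred, note that dbutils.fs.cp treats an unprefixed source path as a DBFS path, which would explain the FileNotFoundException above. A local source needs the file:/ scheme and an absolute path, for example file:/databricks/driver/logs/log4j-mylog4j-active.log if the notebook's working directory is /databricks/driver (an assumption worth checking with %sh pwd).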
It seems the Connect file source reads from the beginning of the files when the connector is restarted.
I couldn't find a configuration option that controls this.
How can I make it read only the appended data? (Please note this only happens when Connect is restarted; in that case it reads only the data that was appended.)
In Talend (data integration) I am trying to copy a local directory to a remote directory, but when I run the job only the files are copied, not the folders inside the directory. Please help me with this job.
In my Talend job I am using local connection and remote connection components:
tFileList -> tFileProperties (to store the path and name in a table) -> tMSSqlInput (to extract the path from that table) -> iterate -> tSSH (to create the directory if it does not exist) -> tFTPPut (to connect and copy to the remote directory).
When I store the tFileProperties output in the table, files get a non-zero size, but for a folder the size is zero. I use this condition to create the directory with the tSSH component, but the folders are not created. Please help me.
Do you get an error message?
I believe the output of tMSSqlInput should be row-based rather than an iterate link. That might be the source of the problem.
From the tMSSqlInput docs:
tMSSqlInput executes a DB query with a strictly defined order which
must correspond to the schema definition. Then it passes on the field
list to the next component via a Main row link.
I'm trying to monitor a directory in HDFS and to read and process the data in files copied into it (to copy files from the local system to HDFS I use hdfs dfs -put). Sometimes this fails with: Spark Streaming: java.io.FileNotFoundException: File does not exist: .COPYING. I read about the problem in forums and in the question Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_
According to what I read, the problem is that Spark Streaming reads the file before it has finished being copied into HDFS. On GitHub:
https://github.com/maji2014/spark/blob/b5af1bdc3e35c53564926dcbc5c06217884598bb/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala , they say the problem was corrected, but as far as I can see only for FileInputDStream, and I'm using textFileStream.
When I tried to use FileInputDStream, the IDE reported that the symbol is not accessible from this place.
Does anyone know how to filter out the files that are still COPYING? I tried:
var lines = ssc.textFileStream(arg(0)).filter(!_.contains("_COPYING_"))
but that didn't work, which is expected: this filter is applied to the lines of the file, whereas I guess it should be applied to the name of the file being processed, which I can't access.
As you can see, I did plenty of research before asking the question but didn't get lucky.
Any help, please?
I had a look: -put is the wrong method. See the final comment there: to get an atomic operation on HDFS, your shell script has to upload the file somewhere else first and then rename it into the monitored directory (with hdfs dfs, a rename is done with -mv).
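The upload-then-rename idea can be sketched as follows. This is only an illustration under stated assumptions: the hdfs CLI is on the PATH, the staging directory is on the same HDFS but outside the directory that Spark Streaming monitors, and the function names and example paths are hypothetical.

```python
import subprocess


def staged_put_commands(local_file, staging_dir, watched_dir):
    """Build the two hdfs commands: upload to staging, then rename into place."""
    name = local_file.rsplit("/", 1)[-1]
    staged = f"{staging_dir.rstrip('/')}/{name}"
    final = f"{watched_dir.rstrip('/')}/{name}"
    return [
        # The slow copy happens outside the monitored directory.
        ["hdfs", "dfs", "-put", local_file, staged],
        # The rename makes the complete file appear in one step.
        ["hdfs", "dfs", "-mv", staged, final],
    ]


def staged_put(local_file, staging_dir, watched_dir):
    """Run both commands in order, stopping if either one fails."""
    for cmd in staged_put_commands(local_file, staging_dir, watched_dir):
        subprocess.run(cmd, check=True)


# Example call (paths are assumptions):
# staged_put("/tmp/data.txt", "/user/me/staging", "/user/me/incoming")
```

Because the streaming job only ever sees the file after the rename, it never picks up a half-written ._COPYING_ file.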
When using the COPY command to load Amazon Redshift with a manifest, suppose one of the files contains an error.
Is there a way to just log the error for that file, but continue loading the other files?
The manifest file indicates whether a file is mandatory and whether an error should be generated if a file is not found. (Using a Manifest to Specify Data Files)
The COPY command will retry if it cannot read a file. (Errors When Reading Multiple Files)
The COPY command can specify a MAXERROR parameter that permits a certain number of errors before the COPY command fails. (MAXERROR)
When loading data from files, Amazon Redshift will report any errors in the STL_LOAD_ERRORS table. (STL_LOAD_ERRORS)
As said above, the MAXERROR parameter should satisfy this requirement.
In addition, the NOLOAD parameter of COPY checks the validity of the data without loading it. Running with NOLOAD is much faster because it only parses the files.
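To make the options above concrete, here is a minimal sketch that assembles a COPY statement with MANIFEST and MAXERROR (and optionally NOLOAD), plus a query against STL_LOAD_ERRORS. The table name, S3 URL, and IAM role in the example are placeholders, the helper name build_copy is hypothetical, and the values are interpolated directly into the SQL, so they should come from trusted configuration only.

```python
def build_copy(table, manifest_url, iam_role, max_errors=10, noload=False):
    """Return a Redshift COPY statement that tolerates up to max_errors bad rows."""
    parts = [
        f"COPY {table}",
        f"FROM '{manifest_url}'",
        f"IAM_ROLE '{iam_role}'",
        "MANIFEST",
        f"MAXERROR {max_errors}",
    ]
    if noload:
        # Validate the files without loading any rows.
        parts.append("NOLOAD")
    return "\n".join(parts) + ";"


# Rows rejected by COPY are recorded here; inspect them after the load.
LOAD_ERRORS_SQL = (
    "SELECT query, filename, line_number, colname, err_reason\n"
    "FROM stl_load_errors\n"
    "ORDER BY starttime DESC\n"
    "LIMIT 20;"
)

# Example (all identifiers are placeholders):
# print(build_copy("my_table", "s3://my-bucket/load.manifest",
#                  "arn:aws:iam::123456789012:role/MyRedshiftRole"))
```

A possible workflow is to run the statement once with noload=True to check the files, then run it again without NOLOAD and query STL_LOAD_ERRORS for any rows that were skipped.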