SparkJob generate file on remote server - scala

I need some advice on the following problem:
I have a Spark cluster with Cassandra.
I need to write a Spark job (using Scala) to extract some information from Cassandra. I need to generate a file with the result and put it on another server (where there is no Spark).
My question is: what is the best solution for that?
1. Generate the file on the same server as Spark and then use scp to copy it to my destination server?
2. Is there another way to generate the file directly on my destination server?
Thanks.

A better approach would be to compute the results, store them in a directory in HDFS (on the server with Spark), and NFS-mount that directory at some path on your destination server (the server without Spark).
Let me know if this helped. Cheers.
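For comparison, option 1 from the question (generate the file next to Spark, then copy it with scp) could look roughly like the sketch below. It is written in Python only to keep it short, and it assumes the job's results have already been collected into a small list of rows on the driver. The function names, the user `etl`, the host `dest.example.com` and the remote directory are all hypothetical, and key-based SSH access to the destination server is assumed.

```python
import csv
import subprocess
from pathlib import Path


def write_results_csv(rows, header, local_path):
    """Write already-collected result rows to a local CSV file."""
    path = Path(local_path)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def build_scp_command(local_path, user, host, remote_dir):
    """Build the scp argument list; nothing is executed here."""
    return ["scp", str(local_path), f"{user}@{host}:{remote_dir}/"]


def copy_to_remote(local_path, user, host, remote_dir):
    """Run scp; assumes key-based SSH access to the destination server."""
    subprocess.run(build_scp_command(local_path, user, host, remote_dir), check=True)
```

Calling `copy_to_remote(path, "etl", "dest.example.com", "/data/in")` after `write_results_csv(...)` would then push the file. With `check=True`, a failed copy raises an exception instead of passing silently.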


Executing Batch service in Azure Data factory using python script

Hi, I've been trying to execute a custom activity in ADF that reads a CSV file from a container (A), applies further transformations to the data set, and stores the transformed DataFrame in another CSV file in the same container (A).
I've written the transformation logic in Python and stored the script in the same container (A).
The error occurs when I execute the pipeline: it fails with *can't find the specified file*.
Nothing is wrong with the connections. Is something wrong with the Batch account or the pools?
Can anyone tell me where to place the Python script?
Install Azure Batch Explorer and make sure to choose a proper configuration for the virtual machine (dsvm-windows), which ensures that Python is already installed on the virtual machine where your code runs.
This video explains the steps
https://youtu.be/_3_eiHX3RKE

Export a CSV file from AS400 to my pc through Cl program

I want to export a database file that is created through a query, from the AS400 machine to my pc in the form of a csv file.
Is there a way to create that connection of the AS400 and my pc through a cl program?
An idea of what I want to do can be derived from the following code:
CLRPFM DTABASENAME
RUNQRY QRY(QRYTEST1)
CHGVAR VAR(&PATH) VALUE('C:\TESTS')
CHGVAR VAR(&PATH1) VALUE('C:\TESTS')
CHGVAR VAR(&CMD) VALUE(%TRIM(&PATH) *CAT '/DTABASENAME.CSV' !> &PATH !> &PATH1)
STRPCO PCTA(*YES)
STRPCCMD PCCMD(&CMD) PAUSE(*YES)
where I somehow get my database file, give the path where I want it to be saved on my PC, and lastly run the PC command accordingly.
Take a look at
Copy From Query File (CPYFRMQRYF)
Which will allow you to create a database physical file from the query.
You may also want to look at
Copy To Import File (CPYTOIMPF)
Which will copy data from a database physical file to an Integrated File System (IFS) stream file (such as a .CSV), which is the type of file you'd find on a PC.
ex:
CPYTOIMPF FROMFILE(MYLIB/MYPF) TOSTMF('/home/myuser/DTABASENAME.CSV') RCDDLM(*CRLF) DTAFMT(*DLM) STRDLM(*DBLQUOTE) STRESCCHR(*STRDLM) RMVBLANK(*TRAILING) FLDDLM(',')
However, there's no single command to transfer data to your PC. Well technically, I suppose that's not true. If you configure a (SMB or NFS) file share on your PC and configure the IBM SMB or NFS client; you could in fact CPYTOIMPF directly to that file share or use the Copy Object (CPY) command to copy from the IFS to the network share.
If your PC has an FTP server available, you could send the data via the IBM i's FTP client. Similarly, if you have an SSH server on your PC, OpenSSH is available via PASE and SFTP or SCP could be used. You could also email the file from the i.
Instead of trying to send the file to your PC from the i, an easier solution would be to kick off a process on the PC that runs the download. My preference would be an Access Client Solutions (ACS) data transfer.
You configure the transfer and save it (as a .dtfx file).
Then you can kick it off with:
STRPCCMD cmd('java -jar C:\ACS\acsbundle.jar /plugin=download C:\testacs.dtfx')
More detailed information can be found in the Automating ACS Data Transfer document
The ACS download component is SQL based, so you could probably remove the need to use Query/400 at all.
Assuming that you have your IFS QNTC mapped to your network domain, you could use the CPYTOIMPF command to copy the data directly from an IBM i Db2 file to a network directory.
This sample would result in a CSV file.
CPYTOIMPF FROMFILE(file) TOSTMF('//QNTC/servername or ip/path/filename.csv') STMFCCSID(*PCASCII) RCDDLM(*CRLF) STRDLM(*NONE)
Add the FLDDLM(';') option to produce semicolon-separated values; omit it to use a comma as the value separator.

Some questions about google Data fusion

I am discovering the tool and I have some questions:
- What exactly is meant by the type File in (Source, Sink)?
- Is it also possible to send the result of the pipeline directly to an FTP server?
I checked the documentation, but I did not find this information.
Thank you.
Short answer: File refers to the filesystem where the pipelines run. In the Data Fusion context, if you are using a File sink, the contents will be written to HDFS on the Dataproc cluster.
Data Fusion has SFTP Put actions that can be used to write to SFTP. Here is a simple pipeline showing how to write to SFTP from GCS.
Step 1: GCS Source to File Sink. This writes the content of GCS to HDFS on Dataproc when the pipeline is run.
Step 2: SFTP Put action, which takes the output of the File sink and uploads it to SFTP.
You need to configure the output path of the File sink to be the same as the source path in the SFTP Put action.

Talend: Using tfilelist to access files from a shared network path

I have a Talend job that searches a directory and then uploads the files it finds to our database.
It's something like this: dbconnection>twaitforfile>tfilelist>fileschema>tmap>db
I have a subjobok that then commits the data into the table, iterates through the directory, and moves the files to another folder.
Recently I was instructed to change the directory to a shared network path using the same components as before (I originally thought of changing components to tftpfilelist, etc.)
My question is how to direct it to the shared network path. I was able to get it to go through using double \ but it won't read any of the new files arriving.
Thanks!
I suppose that if you use tWaitForFile on the local filesystem, Talend/Java somehow hooks into the folder and gets a notification when a new file is put into it.
Now, since you are on a network drive, first of all this is out of reach of the component. Second, the OS behind the network drive could be different.
I understand your job is running all the time, listening. You could change the behaviour by putting a tLoop first, which would check the file system for new files and then proceed. There must be some delta check to recognize which files are new.
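The loop-plus-delta-check idea can be sketched outside Talend as follows. This is only an illustration in Python, not Talend code: the function names are made up, and the "delta" is simply the set of file names already seen, which assumes files are not overwritten under the same name.

```python
import os
import time


def find_new_files(directory, seen):
    """Return files in `directory` that are not in `seen`, and record them."""
    current = {
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    }
    new_files = sorted(current - seen)
    seen.update(new_files)
    return new_files


def poll(directory, handle, interval_seconds=30, max_rounds=None):
    """Poll `directory` and call `handle(path)` once for each new file."""
    seen = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for name in find_new_files(directory, seen):
            handle(os.path.join(directory, name))
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval_seconds)
```

Because this lists the directory on every round instead of relying on filesystem notifications, it behaves the same on a local folder and on a network share.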

talend , mongoDB connection

I am facing a problem with a MongoDB connection.
I have successfully imported the tMongo components into my Talend Open Studio 5.1.1 and, after copying the mongo 1.3.jar file to the lib/java folder, my MongoDB jobs run successfully. The problem is that even if I provide a fake server address (IP) and a fake port for MongoDB, my job runs without an error and gives me 1 row with no data. The same happens with the right IP and port.
How do I resolve it?
I think the connection is not working. As you may know, MongoDB only checks whether the connection actually works when you perform a query on it.
(Yeah, it doesn't check for a successful connection when you just connect to it ).
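Since the connection is only exercised lazily, one way to rule out a wrong IP or port before running the job is a plain TCP reachability check. The sketch below is a generic Python illustration, not part of Talend or of any MongoDB driver. The function name is made up, and it only tells you that something is listening on that host and port, not that it is MongoDB.

```python
import socket


def can_connect(host, port, timeout_seconds=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False
```

If `can_connect` returns False for the address configured in the job, the "1 row with no data" result is clearly not coming from a real server.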
I would suggest instead adding the MongoDB components that ship with Talend for Big Data, following the steps below:
Components provided for MongoDB are :
tMongoDBInput, tMongoDBOutput, tMongoDBConnection etc.
Or you can download the components from http://www.talendforge.org/exchange/ (search for Mongo) instead of using Talend for Big Data. But I would suggest using Talend for Big Data.
The components come in zipped format; unzip them. In Talend for Big Data you will find the components in the component folder.
Copy the unzipped components to the installation path of TOS:
C:\Talend\TOS_DI-Win32-r84309-V5.1.1\plugins\org.talend.designer.components.localprovider_5.1.1.r84309\components
Copy the mongo-1.3.jar file from the component folder into C:\Talend\TOS_DI-Win32-r84309-V5.1.1\lib\java
On many systems you might not be able to see this file; in that case, use administrator privileges.
Optional, for a few systems: inside index.xml add
save index.xml
Restart TOS
Then you will be able to use them as normal components.
Cheers!
The job may be running without any error because of the connection / metadata you have used for the Mongo connector. It should not be possible for the job to run without any error when it is given a fake path.
I guess you may have configured (modified) the repository connection while the component is still using built-in metadata.