Prefect: Is is possible to access storage like NAS with multiple machines in Prefect? - deployment

I have set up a Prefect backend server on a remote machine. I was able to connect local agents from different other machines to the server by modifying the config.toml in the .prefect folder:
[server]
endpoint = "http://server_ip:port/graphql"
[server.ui]
apollo_url = "http://server_ip:port/graphql"
As it stands, I can create a local agent on each machine, register flows and run them on the respective machines. Now I would like to have a central computer where I can develop and register my flows.
Unfortunately, when I run a flow on Machine B, registered on Machine A, I get a "Module not Found" error message. I have read that the error comes from machines only looking for the flows in their local storage.
Without using Git, GCS, etc., is it possible to use, for example, a NAS where all flows are stored and which all machines can use to access the flows?
if so, how must flows, agents, and storage be configured? Unfortunately, I have not found any good documentation on this.
Many applications use Docker agents and have similar problems, or use remote storage directly.

There is no native NAS storage interface available in the core library, but we provide recipes and guidance on how you may solve the ModuleNorFoundError - check out this Discourse wiki page which dives into how you may solve that

I was able to find a solution to my answer. The prerequisite is shared storage (e.g. a NAS), which is accessible on all machines under the same path. In this storage, the flows are stored in the form of .py files. Flows and used local Agents do not need any special preparations.
I simply registered my flows with
prefect register --project "PREFECT_PROJECT_NAME" --path "PATH_TO_.py"
in CLI.
I was able to deploy all my flows from machine A and execute them from/schedule them on any other machine

Related

Transfer MongoDB dump on external hard drive to google cloud platform

As a part of my thesis project, I have been given a MongoDB dump of size 240GB which is on my external hard drive. I'll have to use this data to run my python scripts for a short duration. However, since my dataset is huge and I cannot mongoimport on my local mongodb server (since I don't have enough internal memory), my professor gave me a $100 google cloud platform coupon so I can use the google cloud computing resources.
So far I have researched that I can do it this way:
Create a compute engine in GCP and install mongodb on remote engine. Transfer the MongoDB dump to remote instance and run the scripts to get the output.
This method works well but I'm looking for a method to create a remote database server in GCP so I that I can run my scripts locally, which is something like one of the following.
Creating a remote mongodb server on GCP so that I can establish a remote mongo connection to run my scripts locally.
Transferring the mongodb dump to google's datastore so then I can use the datastore API to remotely connect and run my scripts locally.
I have given a thought of using MongoDB atlas but because of the size of the data, I will be billed hugely and I cannot use my GCP coupon.
Any help or suggestions on how of either of the two methods can be implemented is appreciated.
There is 2 parts to your question
First, you can create a compute engine VM with MongoDB installed and load your backup on it. Then, open the right firewall rules for allowing the connexion from your local environment to the Google Compute Engine VM. The connexion will be performed with a simple login/password.
You can use a static IP on your VM. By the way, in case of reboot on the VM you will keep the same IP (and it will be easier for your local connexion).
Second, BE CAREFUL to datastore. It's a good product, serverless NoSQL database, document oriented, but it's absolutely not the MongoDB equivalent. You can't perform aggregate, you are limited in search capabilities,... It's designed for specific use case (I don't know yours, but don't think that is the MongoDB equivalent!).
Anyway, if you use Datastore, you will have to use a service account or to install Google Cloud SDK on your local environment to be authenticated and to be able to request Datastore API. No login/password in this case.

How Do Service Connections Work For On-Prem Agents Connecting To On-Prem Services?

This question is purposefully general because I'm trying to understand things more from an architectural perspective, because that will impact which group I need to contact. My team is using Azure DevOps (cloud) with on-prem build agents. The agents connect to ADO via a proxy.
We use several tools in-house provided by vendors with ADO plugins in the Marketplace that require us to set up service connections. Because the services are installed on-prem, the endpoints we enter are not available via the Web (e.g. https://vendor-product.my-company.com).
If I log into the build machine and open up IE, I am able to connect to the service endpoint URL. However, whenever I try to run a task from ADO, it fails with some kind of connection-related issue ("The underlying connection was closed: An unexpected error occurred on a send", "Task ended with an exception: Error: read ECONNRESET", etc.).
The way I thought it worked, all the work takes place on the build machine itself, so the calls would be going from my-build-server.my-company.com to https://vendor-product.my-company.com. Those error messages though make me wonder if the connection is actually coming from https://dev.azure.com.
So the questions I have are:
For situations like this, is the connection to a service endpoint going to be seen as coming from my on-prem build agent, or from ADO (or does it vary based on how the vendor writes their plugin)?
If the answer to #1 is "it varies", is there any way for me to tell just from the plugin itself without having to contact the vendor? (In my experience some of the vendor reps don't understand how the cloud works.)
and/or
Because my build agent was configured to use a proxy when I set it up, is it going to use that proxy for all connections, even internal ones? I think I can set up a proxy bypass list for the agents but I presently only have read access to the build box. I can request temporary elevated access but I'd need some level of confidence that's what the issue is.
Hope I explained the situation clearly, thanks in advance for any insight.

Is there a way to upload data from FTP server to Amazon S3? [duplicate]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 3 years ago.
Improve this question
Is there a way to connect to an Amazon S3 bucket with FTP or SFTP rather than the built-in Amazon file transfer interface in the AWS console? Seems odd that this isn't a readily available option.
There are three options.
You can use a native Amazon Managed SFTP service (aka AWS Transfer for SFTP), which is easier to set up.
Or you can mount the bucket to a file system on a Linux server and access the files using the SFTP as any other files on the server (which gives you greater control).
Or you can just use a (GUI) client that natively supports S3 protocol (what is free).
Managed SFTP Service
In your Amazon AWS Console, go to AWS Transfer for SFTP and create a new server.
In SFTP server page, add a new SFTP user (or users).
Permissions of users are governed by an associated AWS role in IAM service (for a quick start, you can use AmazonS3FullAccess policy).
The role must have a trust relationship to transfer.amazonaws.com.
For details, see my guide Setting up an SFTP access to Amazon S3.
Mounting Bucket to Linux Server
Just mount the bucket using s3fs file system (or similar) to a Linux server (e.g. Amazon EC2) and use the server's built-in SFTP server to access the bucket.
Install the s3fs
Add your security credentials in a form access-key-id:secret-access-key to /etc/passwd-s3fs
Add a bucket mounting entry to fstab:
<bucket> /mnt/<bucket> fuse.s3fs rw,nosuid,nodev,allow_other 0 0
For details, see my guide Setting up an SFTP access to Amazon S3.
Use S3 Client
Or use any free "FTP/SFTP client", that's also an "S3 client", and you do not have setup anything on server-side. For example, my WinSCP or Cyberduck.
WinSCP has even scripting and .NET/PowerShell interface, if you need to automate the transfers.
Update
S3 now offers a fully-managed SFTP Gateway Service for S3 that integrates with IAM and can be administered using aws-cli.
There are theoretical and practical reasons why this isn't a perfect solution, but it does work...
You can install an FTP/SFTP service (such as proftpd) on a linux server, either in EC2 or in your own data center... then mount a bucket into the filesystem where the ftp server is configured to chroot, using s3fs.
I have a client that serves content out of S3, and the content is provided to them by a 3rd party who only supports ftp pushes... so, with some hesitation (due to the impedance mismatch between S3 and an actual filesystem) but lacking the time to write a proper FTP/S3 gateway server software package (which I still intend to do one of these days), I proposed and deployed this solution for them several months ago and they have not reported any problems with the system.
As a bonus, since proftpd can chroot each user into their own home directory and "pretend" (as far as the user can tell) that files owned by the proftpd user are actually owned by the logged in user, this segregates each ftp user into a "subdirectory" of the bucket, and makes the other users' files inaccessible.
There is a problem with the default configuration, however.
Once you start to get a few tens or hundreds of files, the problem will manifest itself when you pull a directory listing, because ProFTPd will attempt to read the .ftpaccess files over, and over, and over again, and for each file in the directory, .ftpaccess is checked to see if the user should be allowed to view it.
You can disable this behavior in ProFTPd, but I would suggest that the most correct configuration is to configure additional options -o enable_noobj_cache -o stat_cache_expire=30 in s3fs:
-o stat_cache_expire (default is no expire)
specify expire time(seconds) for entries in the stat cache
Without this option, you'll make fewer requests to S3, but you also will not always reliably discover changes made to objects if external processes or other instances of s3fs are also modifying the objects in the bucket. The value "30" in my system was selected somewhat arbitrarily.
-o enable_noobj_cache (default is disable)
enable cache entries for the object which does not exist. s3fs always has to check whether file(or sub directory) exists under object(path) when s3fs does some command, since s3fs has recognized a directory which does not exist and has files or subdirectories under itself. It increases ListBucket request and makes performance bad. You can specify this option for performance, s3fs memorizes in stat cache that the object (file or directory) does not exist.
This option allows s3fs to remember that .ftpaccess wasn't there.
Unrelated to the performance issues that can arise with ProFTPd, which are resolved by the above changes, you also need to enable -o enable_content_md5 in s3fs.
-o enable_content_md5 (default is disable)
verifying uploaded data without multipart by content-md5 header. Enable to send "Content-MD5" header when uploading a object without multipart posting. If this option is enabled, it has some influences on a performance of s3fs when uploading small object. Because s3fs always checks MD5 when uploading large object, this option does not affect on large object.
This is an option which never should have been an option -- it should always be enabled, because not doing this bypasses a critical integrity check for only a negligible performance benefit. When an object is uploaded to S3 with a Content-MD5: header, S3 will validate the checksum and reject the object if it's corrupted in transit. However unlikely that might be, it seems short-sighted to disable this safety check.
Quotes are from the man page of s3fs. Grammatical errors are in the original text.
Answer from 2014 for the people who are down-voting me:
Well, S3 isn't FTP. There are lots and lots of clients that support S3, however.
Pretty much every notable FTP client on OS X has support, including Transmit and Cyberduck.
If you're on Windows, take a look at Cyberduck or CloudBerry.
Updated answer for 2019:
AWS has recently released the AWS Transfer for SFTP service, which may do what you're looking for.
Or spin Linux instance for SFTP Gateway in your AWS infrastructure that saves uploaded files to your Amazon S3 bucket.
Supported by Thorntech
Amazon has released SFTP services for S3, but they only do SFTP (not FTP or FTPES) and they can be cost prohibitive depending on your circumstances.
I'm the Founder of DocEvent.io, and we provide FTP/S Gateways for your S3 bucket without having to spin up servers or worry about infrastructure.
There are also other companies that provide a standalone FTP server that you pay by the month that can connect to an S3 bucket through the software configuration, for example brickftp.com.
Lastly there are also some AWS Marketplace apps that can help, here is a search link. Many of these spin up instances in your own infrastructure - this means you'll have to manage and upgrade the instances yourself which can be difficult to maintain and configure over time.
WinSCp now supports S3 protocol
First, make sure your AWS user with S3 access permissions has an “Access key ID” created. You also have to know the “Secret access key”. Access keys are created and managed on Users page of IAM Management Console.
Make sure New site node is selected.
On the New site node, select Amazon S3 protocol.
Enter your AWS user Access key ID and Secret access key
Save your site settings using the Save button.
Login using the Login button.
Filezilla just released a Pro version of their FTP client. It connects to S3 buckets in a streamlined FTP like experience. I use it myself (no affiliation whatsoever) and it works great.
As other posters have pointed out, there are some limitations with the AWS Transfer for SFTP service. You need to closely align requirements. For example, there are no quotas, whitelists/blacklists, file type limits, and non key based access requires external services. There is also a certain overhead relating to user management and IAM, which can get to be a pain at scale.
We have been running an SFTP S3 Proxy Gateway for about 5 years now for our customers. The core solution is wrapped in a collection of Docker services and deployed in whatever context is needed, even on-premise or local development servers. The use case for us is a little different as our solution is focused data processing and pipelines vs a file share. In a Salesforce example, a customer will use SFTP as the transport method sending email, purchase...data to an SFTP/S3 enpoint. This is mapped an object key on S3. Upon arrival, the data is picked up, processed, routed and loaded to a warehouse. We also have fairly significant auditing requirements for each transfer, something the Cloudwatch logs for AWS do not directly provide.
As other have mentioned, rolling your own is an option too. Using AWS Lightsail you can setup a cluster, say 4, of $10 2GB instances using either Route 53 or an ELB.
In general, it is great to see AWS offer this service and I expect it to mature over time. However, depending on your use case, alternative solutions may be a better fit.

Remote execute Power Shell scripts to collect data

I am looking to collect data snapshot on a random interval from various machines in our network that we don't own, but may get access to install an agent to collect these data.
These machines are either in a domain or work-group and kind of data i get are based on the role they play and information they have. The machines are "Windows Server 2003" and above and I do not want to install anything on those machines before i get started, so thought I can use the PowerShell scripts that I can remote invoke form my server and pass the script it has to run to return the data.
I was wondering if this is possible to do that with the PowerShell scripts and as this is supposed to run in a secure environment, is there any major security implications with this approach. i.e. do I need to do anything on the client machines that can make them vulnerable to security threats.
BTW these machines are not exposed to internet and are behind a firewall.
I would appreciate if you point me to any other alternatives that can be useful for my analysis.
Regards
Kiran

AWS deployment without using SSH

I've read some articles recently on setting up AWS infrastructure w/o enabling SSH on Ec2 instances. My web app requires a binary to run. So how can I deploy my application to an ec2 instance w/o using ssh?
This was the article in question.
http://wblinks.com/notes/aws-tips-i-wish-id-known-before-i-started/
Although doable, like the article says, it requires to think about servers as ephemeral servers. A good example of this is web services that scale up and down depending on demand. If something goes wrong with one of the servers you can just terminate your server and spin up another one.
Generally, you can accomplish this using a pull model. For example at bootup pull your code from a git/mecurial repository and then execute scripts to setup your instance. The script will setup all the monitoring required to determine whether your server and application are up and running appropriately. You would still need an SSH client for this if you want to pull your code using ssh. (Although you could also do it through HTTPS)
You can also use configuration management tools that don't use ssh at all like Puppet or Chef. Essentially your node/server will pull all your application and server configuration from the Puppet master or the Chef server. The Puppet agent or Chef client would then perform all the configuration/deployment/monitoring changes for your application to run.
If you with this model I think one of the most critical components is monitoring. You need to know at all times if there's something wrong with one of your server and in the event something goes wrong discard the server and spin up a new one. (Even better if this whole process is automated)
Hope this helps.