Accessing the StreamSets web UI on a different node of a cluster than where it is installed, which file system does it 'look in'?

I have a cluster of machines hosting Hadoop (MapR) and have installed StreamSets on one of the nodes (say node002) following the RPM documentation. However, I am accessing the web UI for the Data Collector from another node, node001.
My question is: when I specify file paths (e.g. an origin directory), which file system is the web UI going to be referring to? For example, if I set an origin directory of /home/myuser/mydata, will the pipeline created in the web UI be looking for that directory on node001 or node002? I am new to StreamSets, so a more detailed answer would be appreciated. Thanks.
** Ultimately I am asking this because I am currently getting "FileNotFound" and "permission denied" errors while trying to follow the documentation's tutorial, and I am trying to debug the situation.

From the StreamSets community forums: it will be the path on the local file system of the machine running that particular SDC instance.
The FileNotFound and permission-denied errors come from the fact that the sdc service runs by default as a user called sdc. I am still working out the proper fix, but a workable prototype can be produced by granting public read and write access to the directories in question (this still needs refinement, but it answers the posted question).
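A more targeted alternative to opening the directories to everyone is to grant access to the sdc user specifically. A minimal sketch, run on the node where SDC is installed (node002 here), assuming the example directory above and that setfacl is available:
sudo setfacl -R -m u:sdc:rwx /home/myuser/mydata    # let the sdc user read and write the existing files
sudo setfacl -d -m u:sdc:rwx /home/myuser/mydata    # default ACL so files created later are also accessible
Note that the sdc user also needs execute permission on the parent directories (e.g. /home/myuser) to traverse into the origin directory.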

Related

MLflow artifacts: storing artifacts (Google Cloud Storage) but not displaying them in the MLflow UI

I am working in a Docker environment (docker-compose) with a Jupyter notebook image and a Postgres image for running ML models, and I am using Google Cloud Storage to store the model artifacts. Storing the models in cloud storage works fine, but I can't get them to show up in the MLflow UI. I have seen similar problems, but none of the solutions used Google Cloud Storage as the artifact store. The error message says the following: "Unable to list artifacts stored under <gs-location> for the current run. Please contact your tracking server administrator to notify them of this error, which can happen when the tracking server lacks permission to list artifacts under the current run's root artifact directory." What could possibly be causing this problem?
I had exactly the same issue. The keywords are: docker-compose, Google Cloud Storage, success in storing to GCS, but failure in listing artifacts in the UI.
In my case, it turned out that if you assign the environment variables in the docker-compose file by reading from a .env file (e.g. GOOGLE_APPLICATION_CREDENTIALS), the server might start before the assignment takes effect. The quick fix is to assign the variable directly under the environment: key instead of using the env_file: key.
For sensitive data that you still need to keep in a .env file, you can add a wait time for the server, and add depends_on: in the docker-compose file to make sure the database container starts before the MLflow server if you are using a database-backed store; see the sketch below.
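A minimal sketch of the relevant part of a docker-compose.yml, assuming a tracking-server service named mlflow, a backend-store service named postgres, and a key file mounted at /secrets/gcp-key.json (all of these names are illustrative, not from the original post):
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow          # illustrative image name
    environment:
      - GOOGLE_APPLICATION_CREDENTIALS=/secrets/gcp-key.json   # set directly, not via env_file
    volumes:
      - ./gcp-key.json:/secrets/gcp-key.json:ro
    depends_on:
      - postgres                          # backend store container starts first
  postgres:
    image: postgres:15                    # illustrative tag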
I faced the same issue when running MLflow locally. The issue was resolved after adding GOOGLE_APPLICATION_CREDENTIALS to the environment variables.
https://googleapis.dev/python/google-api-core/latest/auth.html
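For example, from the shell that launches the tracking UI (the key path is a placeholder):
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
mlflow ui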

Is there a way to upload data from FTP server to Amazon S3? [duplicate]

Is there a way to connect to an Amazon S3 bucket with FTP or SFTP rather than the built-in Amazon file transfer interface in the AWS console? Seems odd that this isn't a readily available option.
There are three options.
You can use a native Amazon Managed SFTP service (aka AWS Transfer for SFTP), which is easier to set up.
Or you can mount the bucket as a file system on a Linux server and access the files over SFTP like any other files on the server (which gives you greater control).
Or you can just use a (GUI) client that natively supports the S3 protocol (which is free).
Managed SFTP Service
In your Amazon AWS Console, go to AWS Transfer for SFTP and create a new server.
On the SFTP server page, add a new SFTP user (or users).
Users' permissions are governed by an associated AWS role in the IAM service (for a quick start, you can use the AmazonS3FullAccess policy).
The role must have a trust relationship to transfer.amazonaws.com.
For details, see my guide Setting up an SFTP access to Amazon S3.
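If you prefer the CLI to the console, roughly the same setup can be sketched with the AWS CLI (the server ID, role ARN, and SSH public key below are placeholders):
aws transfer create-server --identity-provider-type SERVICE_MANAGED
aws transfer create-user --server-id s-1234567890abcdef0 --user-name myuser --role arn:aws:iam::123456789012:role/my-s3-access-role --ssh-public-key-body "ssh-rsa AAAA..."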
Mounting Bucket to Linux Server
Just mount the bucket using the s3fs file system (or similar) on a Linux server (e.g. Amazon EC2) and use the server's built-in SFTP server to access the bucket:
Install s3fs.
Add your security credentials in the form access-key-id:secret-access-key to /etc/passwd-s3fs.
Add a bucket mounting entry to fstab:
<bucket> /mnt/<bucket> fuse.s3fs rw,nosuid,nodev,allow_other 0 0
For details, see my guide Setting up an SFTP access to Amazon S3.
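A minimal sketch of those steps on a Debian/Ubuntu host (bucket name, mount point, and credentials are placeholders):
sudo apt-get install s3fs
echo 'ACCESS_KEY_ID:SECRET_ACCESS_KEY' | sudo tee /etc/passwd-s3fs
sudo chmod 600 /etc/passwd-s3fs
sudo mkdir -p /mnt/mybucket
echo 'mybucket /mnt/mybucket fuse.s3fs rw,nosuid,nodev,allow_other 0 0' | sudo tee -a /etc/fstab
sudo mount /mnt/mybucket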
Use S3 Client
Or use any free "FTP/SFTP client" that is also an "S3 client", and you do not have to set up anything on the server side. For example, my WinSCP, or Cyberduck.
WinSCP even has scripting and a .NET/PowerShell interface, if you need to automate the transfers.
Update
AWS now offers a fully managed SFTP gateway service for S3 that integrates with IAM and can be administered using aws-cli.
There are theoretical and practical reasons why this isn't a perfect solution, but it does work...
You can install an FTP/SFTP service (such as proftpd) on a Linux server, either in EC2 or in your own data center... then mount a bucket into the filesystem where the FTP server is configured to chroot, using s3fs.
I have a client that serves content out of S3, and the content is provided to them by a 3rd party who only supports ftp pushes... so, with some hesitation (due to the impedance mismatch between S3 and an actual filesystem) but lacking the time to write a proper FTP/S3 gateway server software package (which I still intend to do one of these days), I proposed and deployed this solution for them several months ago and they have not reported any problems with the system.
As a bonus, since proftpd can chroot each user into their own home directory and "pretend" (as far as the user can tell) that files owned by the proftpd user are actually owned by the logged in user, this segregates each ftp user into a "subdirectory" of the bucket, and makes the other users' files inaccessible.
There is a problem with the default configuration, however.
Once you start to get a few tens or hundreds of files, the problem will manifest itself when you pull a directory listing, because ProFTPd will attempt to read the .ftpaccess files over, and over, and over again, and for each file in the directory, .ftpaccess is checked to see if the user should be allowed to view it.
You can disable this behavior in ProFTPd, but I would suggest that the most correct configuration is to configure additional options -o enable_noobj_cache -o stat_cache_expire=30 in s3fs:
-o stat_cache_expire (default is no expire)
specify expire time(seconds) for entries in the stat cache
Without this option, you'll make fewer requests to S3, but you also will not always reliably discover changes made to objects if external processes or other instances of s3fs are also modifying the objects in the bucket. The value "30" in my system was selected somewhat arbitrarily.
-o enable_noobj_cache (default is disable)
enable cache entries for the object which does not exist. s3fs always has to check whether file(or sub directory) exists under object(path) when s3fs does some command, since s3fs has recognized a directory which does not exist and has files or subdirectories under itself. It increases ListBucket request and makes performance bad. You can specify this option for performance, s3fs memorizes in stat cache that the object (file or directory) does not exist.
This option allows s3fs to remember that .ftpaccess wasn't there.
Unrelated to the performance issues that can arise with ProFTPd, which are resolved by the above changes, you also need to enable -o enable_content_md5 in s3fs.
-o enable_content_md5 (default is disable)
verifying uploaded data without multipart by content-md5 header. Enable to send "Content-MD5" header when uploading a object without multipart posting. If this option is enabled, it has some influences on a performance of s3fs when uploading small object. Because s3fs always checks MD5 when uploading large object, this option does not affect on large object.
This is an option which never should have been an option -- it should always be enabled, because not doing this bypasses a critical integrity check for only a negligible performance benefit. When an object is uploaded to S3 with a Content-MD5: header, S3 will validate the checksum and reject the object if it's corrupted in transit. However unlikely that might be, it seems short-sighted to disable this safety check.
Quotes are from the man page of s3fs. Grammatical errors are in the original text.
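Putting the recommended options together, a sketch of the resulting mount command (bucket name and mount point are placeholders):
s3fs mybucket /mnt/mybucket -o allow_other -o enable_noobj_cache -o stat_cache_expire=30 -o enable_content_md5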
Answer from 2014 for the people who are down-voting me:
Well, S3 isn't FTP. There are lots and lots of clients that support S3, however.
Pretty much every notable FTP client on OS X has support, including Transmit and Cyberduck.
If you're on Windows, take a look at Cyberduck or CloudBerry.
Updated answer for 2019:
AWS has recently released the AWS Transfer for SFTP service, which may do what you're looking for.
Or spin up a Linux instance for SFTP Gateway in your AWS infrastructure that saves uploaded files to your Amazon S3 bucket.
Supported by Thorntech
Amazon has released SFTP services for S3, but they only do SFTP (not FTP or FTPES) and they can be cost prohibitive depending on your circumstances.
I'm the Founder of DocEvent.io, and we provide FTP/S Gateways for your S3 bucket without having to spin up servers or worry about infrastructure.
There are also other companies that provide a standalone FTP server that you pay for by the month and that can connect to an S3 bucket through software configuration, for example brickftp.com.
Lastly, there are also some AWS Marketplace apps that can help; here is a search link. Many of these spin up instances in your own infrastructure, which means you will have to manage and upgrade the instances yourself, and that can be difficult to maintain and configure over time.
WinSCP now supports the S3 protocol
First, make sure your AWS user with S3 access permissions has an “Access key ID” created. You also have to know the “Secret access key”. Access keys are created and managed on the Users page of the IAM Management Console.
Make sure New site node is selected.
On the New site node, select Amazon S3 protocol.
Enter your AWS user Access key ID and Secret access key
Save your site settings using the Save button.
Login using the Login button.
FileZilla just released a Pro version of its FTP client. It connects to S3 buckets in a streamlined, FTP-like experience. I use it myself (no affiliation whatsoever) and it works great.
As other posters have pointed out, there are some limitations with the AWS Transfer for SFTP service, and you need to check it closely against your requirements. For example, there are no quotas, whitelists/blacklists, or file-type limits, and non-key-based access requires external services. There is also a certain overhead relating to user management and IAM, which can become a pain at scale.
We have been running an SFTP S3 proxy gateway for about 5 years now for our customers. The core solution is wrapped in a collection of Docker services and deployed in whatever context is needed, even on-premise or local development servers. The use case for us is a little different, as our solution is focused on data processing and pipelines rather than a file share. In a Salesforce example, a customer will use SFTP as the transport method, sending email, purchase... data to an SFTP/S3 endpoint. This is mapped to an object key on S3. Upon arrival, the data is picked up, processed, routed, and loaded into a warehouse. We also have fairly significant auditing requirements for each transfer, something the AWS CloudWatch logs do not directly provide.
As others have mentioned, rolling your own is an option too. Using AWS Lightsail you can set up a cluster of, say, 4 $10 2GB instances behind either Route 53 or an ELB.
In general, it is great to see AWS offer this service and I expect it to mature over time. However, depending on your use case, alternative solutions may be a better fit.

MongoDB replica set in Azure "Waiting for role to start... Calling OnRoleStart()"

I have a problem trying to implement a MongoDB replica set as a worker role instance in Windows Azure. In the Windows Azure portal, one of the instances is shown as busy with the status:
Waiting for role to start... Calling OnRoleStart()
I have checked all the settings and everything seems to be OK; what could the problem be?
Denis Markelov's blog post helped me solve this problem. The solution is mainly his, however I had to take an extra step to get it to work and thought others might find it useful.
Solution from blog:
Windows Azure reuses virtual machines for roles, so after a fresh deployment you can find files on the hard drive that were created during previous sessions. If MongoDB was terminated improperly, there might be a lock file (a "persisted mutex" analogue) because of which MongoDB refuses to start. It is located on the drive with the label "WindowsAzureDrive" (say it is F:), at the path:
F:\data\mongod.lock
In the case of production use this situation might require recovery procedures, but if you are just in the process of initial setup, it is safe to remove this file, letting MongoDB start again.
I was having this problem and did as suggested; however, I was still having the same problem. So I took a look at the log file at
C:\Resources\Directory\.MongoDB.WindowsAzure.MongoDBRole.MongodLogDir\mongod.txt
and saw that another file was also giving an error. To fix the problem, you also have to delete the file local.ns in the same directory as mongod.lock.
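For example, from a command prompt on the role instance, assuming the data directory quoted above:
del F:\data\mongod.lock
del F:\data\local.ns
Then restart the MongoDB role instance so the files are recreated cleanly.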

Umbraco on Azure: can I change hostname?

I've deployed a website made with Umbraco in Windows Azure, using the Windows Azure Accelerator for Umbraco.
For development and testing I used a test hostname. Now it's time to switch to the official DNS hostname.
How can I change the current hostname?
Actually, I configured the hostname at deployment time (the only way I know to do this), but I can't deploy again, since many files have been changed while working on the website on Azure.
EDIT
Let me explain: at the prompt shown in the image (during website deployment) I used "test.mywebsite.com" as the Domain Name, and configured the real DNS accordingly.
Now the website is configured, so I'd like to make mywebsite.com point to that site;
but it isn't enough to just configure the mywebsite.com DNS! Shall I deploy again? And will I lose any of the changes I made?
I'd like to make two comments on your question:
1) In order to host your Azure application under a custom host name, you will need to sign up with a DNS provider that supports CNAME records (most do). I suggest someone like GoDaddy.com, because by default CNAME records can only resolve your "www.domainname.com" records and cannot do anything for queries where "www." is dropped from the URL. DNS providers like GoDaddy also have an option to redirect all traffic destined for "domainname.com" to a URL of your choice. This is a huge deal for Azure apps. Frankly speaking, it is somewhat disappointing that for all the PaaS and IaaS features of Azure, DNS was not included in the overall package.
2) I am a little worried when you say that you can no longer redeploy your app due to the changes made. Can you elaborate on that? Have you made changes to the application's code running on VMs in Azure without going through the redeployment process? If so, this is a huge no-no. Your VMs running in Azure are not "permanent". Microsoft and your redeployment process can (and will) re-stage those VMs to the original package at any given time. Microsoft will re-image your VMs at least once a month during their monthly OS upgrades, but they can also do so when they need to move your VM to another rack, etc. Whatever changes you make to your app must be stored either in source control before deployment or in a permanent storage facility like SQL Azure, Azure Storage, etc.
HTH
Finally, I think the answers to my questions are:
- Shall I deploy again? Yes, I must deploy again.
- Will I lose any of the changes I made? Many changes will be maintained, since they are stored in the DB, but I have to do a lot of work to make the new website function!
This answer confirms my theory:
In my case, I created and uploaded a site with a name, let's say http://www.contoso.com, and then bought a domain from a registrar, let's say http://www.example.com. When I mapped http://MyAcceleratorsService.cloudapp.net/ to my new domain (http://www.example.com) and tried to open that domain, I got the home page of the Accelerator and not the uploaded site.
I had to upload the site to Azure again (using UploadUmbracoSite.cmd from the Accelerator application) and, when uploading, enter the same domain name as the one I registered: http://www.example.com. Then I was able to browse my uploaded site as expected.
As for your question, I would upload the site again using UploadUmbracoSite.cmd (it is in the Setup folder) and enter the new domain name when requested.
Exactly what I was trying to avoid... but the only solution, I suppose.
Well, it was not easy to publish again; I got errors of many types (I suppose tied to some components that I installed after the deployment and that are not installed in the newly deployed website). I'm going to solve them.
Edit
Completed my work:
- loads of different attempts, none worked
- CTP backup of the DB
- deleted the DB and website
- new full deployment of Umbraco
- CTP restore of the DB
Finally:
- all work on content is OK
- all work on styles, pages, and templates is lost
Changing the hostname is hard; don't use a test hostname, use the definitive hostname from the beginning.
If anyone has a suggestion, I'll be pleased to test it, anyway.
This is not really an answer to your question, but it might be a solution to your problem: use a CNAME record to make the production DNS name point to your development name. E.g. www.productionname.com will then point to www.testname.com. I am not sure whether everything will just work out of the box, but it seems worth a try.
This requires that your hosting provider allows you to set up CNAME records.
http://en.wikipedia.org/wiki/CNAME_record
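As a sketch, the record could look like this in a DNS zone file (both names are the placeholders from the example above):
www.productionname.com.  IN  CNAME  www.testname.com.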

OpenLDAP redirect on write

I am currently trying to set up a redirect on write for an installation of OpenLDAP 2.2.
I have two instances running. One is configured to be read-only (read access only, database specified as read-only) and has a redirect configured to point to the second instance. The second instance is configured to allow the desired write permissions.
When I attempt a modify on the first instance it fails as expected but does not send back the referral. Am I missing a piece of the configuration? Am I even on the right path? Any guidance would be greatly appreciated. Thanks.
In the database section of your slapd.conf, do you add the redirection like this?
updateref "ldap://master-host:port/"
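For context, a minimal sketch of what that database section might look like (the backend, suffix, and directory are placeholders, not taken from the question; only updateref comes from the answer above):
database  bdb
suffix    "dc=example,dc=com"
directory /var/lib/ldap
readonly  on
updateref "ldap://master-host:port/"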
So, it turns out the best way to do this is to go ahead and set up replication using slurpd and point all requests at the slave instance. Unfortunately you can't set up the master and slave on the same host (for obvious reasons, but still), so I had to spin up a second VM to get this going.
Honestly, if I was not trying to replicate a redirect problem it wouldn't be worth it, but I have to duplicate a production issue.
For more information on slapd and specifically slurpd, the OpenLDAP documentation is actually crazy helpful: slurpd config for OpenLDAP 2.2