I am able to read MapR files using TextIO.Read when the files are placed in a Hadoop folder, and I am able to write files to a Hadoop folder as well. But I am not sure whether I need to use org.apache.beam.sdk.io.hdfs, since MapR files are based on HDFS.
Thanks.
All Beam file-based IOs work transparently with the various supported filesystems, and this is the recommended way of accessing them from Beam. There is no need to use classes from the io.hdfs package explicitly.
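For illustration, here is a minimal sketch written in Scala against the Beam Java SDK (the object name and paths are placeholders). It assumes the beam-sdks-java-io-hadoop-file-system module is on the classpath, which registers the Hadoop filesystem so that TextIO resolves hdfs:// paths on its own, without any io.hdfs classes being referenced in user code:

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory

object MaprTextIOExample {
  def main(args: Array[String]): Unit = {
    // Hadoop configuration (core-site.xml, etc.) is picked up from the environment
    // or supplied through pipeline options; nothing from io.hdfs appears here.
    val options = PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
    val pipeline = Pipeline.create(options)

    pipeline
      .apply("ReadLines", TextIO.read().from("hdfs:///user/example/input/*.txt"))
      .apply("WriteLines", TextIO.write().to("hdfs:///user/example/output/result"))

    pipeline.run().waitUntilFinish()
  }
}

The same code works for other registered filesystems (local paths, gs://, s3://, and so on), provided the corresponding filesystem module is on the classpath; only the path scheme changes.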
We have multiple Python projects and are considering converting them to use pyproject.toml instead of setup.py.
Is there a simple way to automate this?
While dealing with some pyproject.toml issues, I encountered this project:
https://pypi.org/project/ini2toml/
The project's description:
This project is experimental and under active development. Issue reports and contributions are very welcome.
The original purpose of this project is to help migrating setup.cfg files to PEP 621, but by extension it can also be used to convert any compatible .ini/.cfg file to .toml.
While this only helps turn .cfg/.ini files into PEP 621 .toml, there is another project that turns setup.py files into .cfg files.
https://github.com/gvalkov/setuptools-py2cfg
This script helps convert existing setup.py files to setup.cfg in the format expected by setuptools.
By writing a script that combines these two steps, you could potentially come up with an automated way to transform the files. I have not tried this yet, but I would be interested to hear if you can make this idea work :)
pdm supports importing metadata from various existing and older metadata files, including setup.py.
You can either:
run pdm init on your project root (with the old metadata files) and follow the instructions, or
run pdm import setup.py explicitly.
See Import project metadata from existing project files for more details.
I'm relying on shell calls to 7z (LGPL) for an important part of a project I'm working on: specifically, opening .cbr files. The problem I have is that there is no guarantee that I will be able to find it on a user's computer (assuming it's even on their computer).
Is there some way to keep its binaries inside my compiled tool, so I don't have to worry about calling them externally? (I have the impression that this is what jar files are for, but I'm not sure.)
Or if that's not possible, what is the standard way of going about this?
Generally speaking, this is where you would want to pull in a library dependency to handle the unzipping of files. Some people use Apache Commons Compress, which requires this library dependency in your sbt build definition:
libraryDependencies += "org.apache.commons" % "commons-compress" % "1.5" // Or whatever version you need
Alternatively, you can include the exe file in a resources directory that will get included with your build, assuming the executable doesn't need to be installed at the system level. This can be as simple as creating the src/main/resources directory and putting the file in there. Your jar will then only work on compatible system architectures, though, so think twice before going this route. Unless there is a specific reason that 7-Zip needs to be used to unpack the file, it is better to use a Java- or Scala-compatible library and avoid having to make the shell calls.
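To make the library approach concrete, here is a minimal sketch of unpacking an archive with Commons Compress (the object and method names are mine, and the snippet assumes the sbt dependency above). One caveat: Commons Compress reads zip-based archives such as .cbz, but not RAR, so a genuinely RAR-based .cbr would need a different library.

import java.io.{File, FileInputStream, FileOutputStream}
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream
import org.apache.commons.compress.utils.IOUtils

object Unpack {
  // Extracts every entry of a zip-based archive into targetDir.
  def extractArchive(archive: File, targetDir: File): Unit = {
    val in = new ZipArchiveInputStream(new FileInputStream(archive))
    try {
      Iterator.continually(in.getNextZipEntry).takeWhile(_ != null).foreach { entry =>
        // In production code, validate entry.getName against path traversal before writing.
        val outFile = new File(targetDir, entry.getName)
        if (entry.isDirectory) outFile.mkdirs()
        else {
          outFile.getParentFile.mkdirs()
          val out = new FileOutputStream(outFile)
          try IOUtils.copy(in, out) finally out.close()
        }
      }
    } finally in.close()
  }
}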
When using Perl scripts as the mapper and reducer in Hadoop Streaming, how can we manage Perl module dependencies?
I want to use "Net::RabbitMQ" in my Perl mapper and reducer scripts.
Is there any standard way in Perl/Hadoop Streaming to handle dependencies, similar to the DistributedCache for Hadoop Java MR jobs?
There are a couple of ways to handle dependencies, including specifying a custom library path or creating a packed binary of your Perl application with PAR::Packer. There are some examples of how to accomplish these tasks in the Examples section of the Hadoop::Streaming POD, and the author includes a good description of the process, as well as some considerations for the different ways to handle dependencies. Note that the suggestions provided in the Hadoop::Streaming documentation about handling Perl dependencies are not specific to that module.
Here is an excerpt from the documentation for Hadoop::Streaming (there are detailed examples therein, as previously mentioned):
All perl modules must be installed on each hadoop cluster machine. This proves to be a challenge for large installations. I have a local::lib controlled perl directory that I push out to a fixed location on all of my hadoop boxes (/apps/perl5) that is kept up-to-date and included in my system image. Previously I was producing stand-alone perl files with PAR::Packer (pp), which worked quite well except for the size of the jar with the -file option. The standalone files can be put into hdfs and then included with the jar via the -cacheFile option.
I need to allow users to upload a zip file via a web form. The server is running Linux with an Apache web server. Are there advantages to using a module like Archive::Zip to extract this archive or should I just execute a system call to unzip with backticks?
According to the Archive::Zip documentation you'd be better off using Archive::Extract:
If you are just going to be extracting zips (and/or other archives) you are recommended to look at using Archive::Extract instead, as it is much easier to use and factors out archive-specific functionality.
That's interesting because Archive::Extract will try Archive::Zip first and then fall back to the unzip binary if it fails. So it does seem that Archive::Zip is the preferred option.
Archive::Zip uses Compress::Raw::Zlib, which is a low-level interface to the system zlib library, so it is not a pure-Perl implementation and its performance will be similar to unzip. In other words, from a performance perspective there is no reason to pick unzip over Archive::Zip.
If you execute the unzip binary, your process will fork/exec and:
instantiate a new process
consume more memory (for the duration of the spawned process)
You'll also have to configure the correct path to the unzip binary. Given all this, I would strongly prefer the library approach.
One concern is memory. We found out the hard way (our production web server crashed) that Archive::Tar had a memory leak. So while using a module instead of a system call to an external command is generally a good idea (see the other responses for the reasoning), you need to make sure the module has no gotchas of its own.
Should I check in *.mo translation files into my version control system?
This is a general question. But in particular I'm working on Django projects with git repositories.
The general answer is:
if you do need those files to compile or deploy (in short: to "work" with) your component (the set of files retrieved from your VCS), then yes, they should be stored in it (here: in Git).
The same goes for other kinds of files (project files, for instance).
.mo files are a particular case, though: they are generated by the django-admin.py compilemessages utility.
This tool runs over all available .po files and creates .mo files, which are binary files optimized for use by gettext
Meaning:
you should be able to rebuild them every time you need them (guaranteeing, in effect, that they stay in sync with their .po counterparts)
Git is not great at storing binary files, and leaving them out avoids storing a full new version for every change
So the specific answer is not so clear-cut:
if your .po files are stable and will not evolve too often, you could definitely store the .mo files
you should absolutely store a big README file explaining how to generate the .mo files from the .po files.
The general answer is not to store generated content in version control.
You can include it in a tarball if it requires rare tools to generate, or even keep a separate repository or a disconnected branch with only those generated files (like the 'html' and 'man' branches in the git.git repository).
For the question as asked, Jakub's answer is pretty neat.
But one could ask:
So where should I store such files? Should I generate them every time I deploy my code?
And for that... it depends. You could deploy it in a tarball (as Jakub suggested) or, even better, create a pip package or a system package (RPM for Fedora, DEB for Debian, etc.).