Branching pipeline // fanout-fanin : Process each file in separate branch - apache-beam

The wordcount example is great -- but limited.
Imagine we want to transform each file in the shakespeare folder, and the processing is much more intense than counting words.
Is something like this possible within the same pipeline, without manually specifying the different branches?
Not This
# This processes all the files in the same "branch" in Dataflow.
p | beam.Create(['/shakespeare/*.txt']) | MatchAll() | ...
Like This
          ┌───────────────────────────┐
          │ Start: /shakespeare/*.txt │
          └─────────────┬─────────────┘
                        │
          ┌─────────────▼─────────────┐
          │  Expand glob (MatchAll)   │
          └─────────────┬─────────────┘
                        │
    ┌─────────┬─────────┼─────────┬──────────────┐
    │         │         │         │              │
┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐     ┌────▼────┐
│ File1 │ │ File2 │ │ File3 │ │ File4 │ ... │ File123 │
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘     └────┬────┘
    │         │         │         │              │
┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐     ┌────▼────┐
│ done  │ │ done  │ │ done  │ │ done  │     │  done   │
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘     └────┬────┘
    │         │         │         │              │
    └─────────┴─────────┼─────────┴──────────────┘
                        │
                 ┌──────▼──────┐
                 │Merge Results│
                 └─────────────┘

One solution that comes to mind is to partition your collection into multiple collections.
Partition separates the elements of a collection into multiple output collections. The partitioning function contains the logic that determines which output collection each element of the input collection goes to. Note that the number of partitions has to be fixed when the pipeline graph is constructed, so it cannot depend on how many files the glob matches at run time.
See the Apache Beam documentation for more information and a sample:
https://beam.apache.org/documentation/transforms/python/elementwise/partition/

Complete files in a different directory in zsh

Does the zsh compadd command not offer completions once some characters have been typed?
I have an executable file called 'index_for_test.js', and I added a shell function to .zshrc.
$PATH:
/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Users/hanqing/Development/compadd-test
index_for_test.js is at the root of /Users/hanqing/Development/compadd-test:
./
├── dir1
│ ├── a.js
│ └── b.js
├── dir2
│ ├── a.ts
│ └── b.ts
└── index_for_test.js
total 24
drwxr-xr-x 6 hanqing staff 192 11 4 13:45 .
drwxr-xr-x 10 hanqing staff 320 11 4 13:42 ..
-rw-r--r--@ 1 hanqing staff 6148 11 4 14:04 .DS_Store
drwxr-xr-x 4 hanqing staff 128 11 4 13:43 dir1
drwxr-xr-x 4 hanqing staff 128 11 4 13:43 dir2
-rwxr-xr-x 1 hanqing staff 155 11 4 13:50 index_for_test.js
// index_for_test.js
#! /usr/bin/env node
const fs =require('fs')
const path=require('path')
const files=fs.readdirSync(path.join(process.cwd()))
console.log(files.join('\n'))
The script added to .zshrc:
_index_for_test_completion() {
local abc=(`index_for_test.js`)
echo '\nabc:\n'
echo $abc'\n'
compadd -- $abc
}
compdef _index_for_test_completion index_for_test.js
When I type index_for_test.js followed by a space and press Tab, it works fine. But when I type index_for_test.js ../ and press Tab, no completion list is shown, even though compadd accepts the arguments.
Expect
If this is my mistake, please let me know the reason. Thanks.
In addition, if this behavior is correct, how can I get completion like that of the cd command?
index_for_test.js only prints names of files in the current directory. A file name in the current directory cannot start with ../, so there is no completion starting with ../.
If you want to complete files in a directory, you need to pass this directory to your completion script, and have it work in that directory.
In addition, you should complete directory names inside zsh, if all directories may potentially contain interesting files. If you want to allow only certain directories, have your script complete directories.
In addition, your script is broken when file names contain whitespace. Use a null byte as the separator: file names can't contain null bytes.
Untested code. May need some tweaking.
#!/usr/bin/env node
const fs = require('fs');
const path = require('path');
const files = fs.readdirSync(process.argv[2]);
console.log(files.join('\0'))
_index_for_test_completion() {
local dir=$words[CURRENT] files
if [[ ! -d $dir ]]; then dir=$dir:h; fi
files=(${(ps:\0:)"$(index_for_test.js $dir)"})
print -lr '' abc: $files
if [[ $dir != $words[CURRENT] ]]; then dir+="/"; fi
compadd -- $^dir$files
}
compdef _index_for_test_completion index_for_test.js

Move folders that have more than one file into another directory

I would like to create a batch file that moves all folders that contain more than one file to another directory.
I tried the code below
mkdir "OOOO3_MORE_THAN_ONE"
for dir in *; do
# if the file is a directory
if [ -d "$dir" ]; then
# count number of files
count=$(find "$dir" -type f | wc -l)
# if count=2 then move
if [ "$count" -le 1 ]; then
# move dir
mv -v "$dir" /completepath/"OOOO3_MORE_THAN_ONE"
fi
fi
done
I just get a new folder without any folders inside. The folders with multiple files did not move to the new directory
I also tried the below code, it's a little different, but still resulted in an empty folder
#! /bin/bash -p
shopt -s nullglob # glob patterns that match nothing expand to nothing
shopt -s dotglob # glob patterns expand names that start with '.'
destdir='subset'
[[ -d $destdir ]] || mkdir -- "$destdir"
for dir in * ; do
[[ -L $dir ]] && continue # Skip symbolic links
[[ -d $dir ]] || continue # Skip non-directories
[[ $dir -ef $destdir ]] && continue # Skip the destination dir.
numfiles=$(find "./$dir//." -type f -print | grep -c //)
(( numfiles > 1 )) && mv -v -- "$dir" "$destdir"
done
Alright, you have two problems (as originally posted).
(1) You are attempting to move to /completepath/OOOO3_MORE_THAN_ONE after creating "OOOO3_MORE_THAN_ONE" in the current working directory. Unless you are executing the script in /completepath when the directory ./OOOO3_MORE_THAN_ONE is created, your calls to mv -v "$dir" /completepath/"OOOO3_MORE_THAN_ONE" will fail. (The double-quotes around the literal directory name are superfluous here.)
(2) you have a "chicken-and-the-egg" problem because you:
mkdir "OOOO3_MORE_THAN_ONE"
before you call:
for dir in *; do
(OOOO3_MORE_THAN_ONE is included in '*', but not excluded before your calls to find and mv)
So you will effectively try and move OOOO3_MORE_THAN_ONE below itself when its name is reached in your list.
So How To Correct the Problems?
Rearrange your code. Since your question is tagged [sh] (POSIX shell), you do not have arrays available to pre-store the count and directory names, but you can always use a temporary file created with mktemp. Read through each directory entry identified with for dir in * and write every count and directory name out to the temporary file before you start changing the directory structure. Then loop over the entries in the temporary file, check whether $count -gt 1 (the posted test, -le 1, selects directories with at most one file, the opposite of what you want), and move the directory into your new directory if so, e.g.
#!/bin/sh
## initialize newdir from 1st argument (or default: OOOO3_MORE_THAN_ONE)
newdir="${1:-OOOO3_MORE_THAN_ONE}"
## set your complete path from 2nd arg (or '.' by default)
cmpltpath="${2:-.}"
## now create a temporary file to hold count dirname pairs
tmpfile=$(mktemp)
for dir in *; do ## write count and dirname pairs to temporary file
[ -d "$dir" ] && echo "$(find "$dir" -type f | wc -l) $dir" >> "$tmpfile"
done
## now create the directory to move to using cmpltpath, exit on failure
mkdir -p "$cmpltpath/$newdir" || exit 1
## read count and dirname from tmpfile & move if count > 1
while read -r count dir || [ -n "$dir" ]; do
## if count > 1 then move
if [ "$count" -gt 1 ]; then
## move to dir
mv -v "$dir" "$cmpltpath/$newdir"
fi
done < "$tmpfile"
rm "$tmpfile" ## tidy up and remove tmpfile (or set trap after mktemp)
(note: the script takes the name of the directory to create and move directories into as its first argument (positional parameter), and the path (absolute or relative) that precedes the new directory as its second argument)
(also note: if you have bash (or another advanced shell that supports associative arrays), you can simply save the directory name and count within an associative array keyed on directory name and avoid using a temporary file altogether)
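A minimal sketch of that associative-array variant, assuming bash 4 or later. The function name move_multi_file_dirs is made up for illustration; the defaults mirror the script above, and the -ef check that skips the destination itself is an addition borrowed from your second approach.

```shell
# Sketch only: requires bash 4+ for associative arrays.
# move_multi_file_dirs is a hypothetical helper name, not part of the script above.
move_multi_file_dirs() {
    local newdir="${1:-OOOO3_MORE_THAN_ONE}"   # directory to create (1st argument)
    local cmpltpath="${2:-.}"                  # path that precedes it (2nd argument)
    local -A counts=()                         # directory name -> number of files
    local dir

    # Record every count before the directory structure changes.
    for dir in *; do
        [ -d "$dir" ] || continue
        counts["$dir"]=$(find "./$dir" -type f | wc -l)
    done

    # Create the destination only after the counts are stored.
    mkdir -p "$cmpltpath/$newdir" || return 1

    for dir in "${!counts[@]}"; do
        # Never try to move the destination into itself.
        [ "$dir" -ef "$cmpltpath/$newdir" ] && continue
        if [ "${counts[$dir]}" -gt 1 ]; then
            mv -v "$dir" "$cmpltpath/$newdir"
        fi
    done
}
```

Calling move_multi_file_dirs with no arguments in the example tree shown next should move d2 and d3 and leave d1 in place. The order of the moves is not guaranteed, because bash does not keep associative-array keys in insertion order.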
Original directory
Using a directory tree where each subdirectory d1, d2, d3 has 1, 2 or 3 files below them, e.g.:
$ tree
.
├── d1
│   └── file1
├── d2
│   ├── file1
│   └── file2
├── d3
│   ├── file1
│   ├── file2
│   └── file3
└── mvscript.sh
Example Use/Resulting Directory Structure
Now running the script will move all directories with greater than 1 file below into your new directory:
$ sh mvscript.sh
'd2' -> './OOOO3_MORE_THAN_ONE/d2'
'd3' -> './OOOO3_MORE_THAN_ONE/d3'
$ tree
.
├── OOOO3_MORE_THAN_ONE
│   ├── d2
│   │   ├── file1
│   │   └── file2
│   └── d3
│       ├── file1
│       ├── file2
│       └── file3
├── d1
│   └── file1
└── mvscript.sh
Your Second Approach
Your second approach is not bad, but unless you have some special requirements and need to match dot-files, you may want to adjust GLOBIGNORE instead of setting dotglob as specified in man bash under the Pathname Expansion section. Also note there is no space between the #! and /bin/bash on the first line.
A basic tweak of your second attempt could be:
#!/bin/bash
destdir='subset'
mkdir -p -- "$destdir" || exit 1
for dir in * ; do
[[ -L $dir ]] && continue # Skip symbolic links
[[ -d $dir ]] || continue # Skip non-directories
[[ $dir -ef $destdir ]] && continue # Skip the destination dir.
numfiles=$(find "$dir" -type f -printf ".\n" | wc -l)
(( numfiles > 1 )) && mv -v -- "$dir" "$destdir"
done
Example Use/Output
A similar test would result in:
$ bash mvscript2.sh
'd2' -> 'subset/d2'
'd3' -> 'subset/d3'
$ tree
.
├── d1
│   └── file1
├── mvscript2.sh
└── subset
    ├── d2
    │   ├── file1
    │   └── file2
    └── d3
        ├── file1
        ├── file2
        └── file3

Linux: Recursively find all .txt files that don't have a matching .tif

I am using Debian Linux. I'm a newbie. I'll do my best to ask in the simplest way I know.
I have a pretty deep tree of directories on a drive that contain thousands of .tif files and .txt files. I'd like to recursively find (list) all .txt files that do not have a matching .tif file (basename). The .tif files and .txt files are also located in separate directories throughout the tree.
In simple form it could look like this...
directory1: hf-770.tif, hf-771.tif, hf-772.tif
directory2: hf-770.txt, hf-771.txt, hf-771.txt, hr-001.txt, tb-789.txt
I need to find (list) hr-001.txt and tb-789.txt as they do not have a matching .tif file. Again the directory tree is quite deep with multiple sub-directories throughout.
I researched and experimented with variations of the following commands but cannot seem to make it work. Thank you so much.
find -name "*.tif" -name "*.txt" | ls -1 | sed 's/\([^.]*\).*/\1/' | uniq
You can write a shell script for this:
#!/bin/bash
set -ue
while IFS= read -r -d '' txt
do
tif=$(basename "$txt" | sed 's/\.txt$/.tif/')
found=$(find . -name "$tif")
if [ -z "$found" ]
then
echo "$txt has no tif"
fi
done < <(find . -name \*.txt -print0)
This has a loop over all .txt files it finds in the current directory or below. For each found file, it replaces the .txt extension with .tif, then tries to find that file. If it cannot find it (returned text is empty), it prints the .txt file name.
robert@saaz:$ tree
.
├── bar
│   └── a.txt
├── foo
│   ├── a.tif
│   ├── b.tif
│   ├── c.tif
│   └── d.txt
└── txt-without-tif
2 directories, 6 files
robert@saaz:$ bash txt-without-tif
./foo/d.txt has no tif
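The script above runs one find per .txt file, which can get slow on a deep tree. A possible alternative, offered as a sketch rather than a tested replacement, walks the tree once per extension and lets awk remember the .tif base names. The function name list_txt_without_tif is hypothetical, and the sketch assumes that no file name contains a newline.

```shell
# Sketch only: list_txt_without_tif is a hypothetical helper name.
# Assumes file names contain no newline characters.
list_txt_without_tif() {
    local root="${1:-.}"    # directory to search, current directory by default
    {
        # All .tif paths are emitted first, then all .txt paths.
        find "$root" -type f -name '*.tif'
        find "$root" -type f -name '*.txt'
    } | awk '
        {
            name = $0
            sub(/.*\//, "", name)      # strip the directory part
        }
        /\.tif$/ {
            sub(/\.tif$/, "", name)    # remember this base name
            have[name] = 1
            next
        }
        /\.txt$/ {
            sub(/\.txt$/, "", name)
            if (!(name in have)) print $0   # no .tif with this base name anywhere
        }
    '
}
```

Run as list_txt_without_tif . in the example tree above, this should print only ./foo/d.txt, because a.txt is matched by foo/a.tif even though the two live in different directories.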

MobaXterm Busybox strange setup

I am using MobaXterm portable.
I found a strange setup, summarized here.
External commands in /bin work fine. E.g., with /bin/ssh.exe I can ssh ok.
Internal commands are "redirected" to busybox, as
$ which cat
/bin/cat
$ ll /bin/cat
lrwxrwxrwx 1 USER001 UsersGrp 16 Jul 24 07:42 /bin/cat -> /bin/busybox.exe
at the same time aliased to files that apparently do not exist.
$ type cat
cat is aliased to `/bin/cat.exe'
These aliases apparently take precedence over files in PATH, so the commands do not work.
$ cat myfile
bash: /bin/cat.exe: No such file or directory
If I unalias, cat does not look for /bin/cat.exe but for /bin/busybox.exe, and everything is "back to normal".
$ unalias cat
$ cat myfile
Hello world
...
How can I get normal behaviour (either without aliases or with the presence of the alias targets)?
I would rather not write my own unalias commands in .bashrc; this shouldn't be needed.
Moreover, perhaps I would be breaking something.
Why would MobaXterm set things up like this?
PS: In the initial state, even ls does not work, for the same reason.
But ll works, since
$ type ll
ll is aliased to `_bbf ls -l'
$ type _bbf
_bbf is a function
...
How can I get normal behaviour?
Workarounds:
unaliasing by hand, so /bin/busybox.exe is actually used.
Below I add a script for that.
Copying .exe files from the temporary root dir when it is available, so the external versions are used.
Why would MobaXterm set things up like this?
When not using a Persistent root (/) directory, this is obtained
$ which cat
/bin/cat
$ ll /bin/cat
-rwxr-xr-x 1 RY16205 UsersGrp 49703 jul. 28 07:12 /bin/cat
$ type cat
cat is aliased to `/bin/cat.exe'
$ ll /bin/cat.exe
-rwxr-xr-x 1 USER001 UsersGrp 49703 jul. 28 07:12 /bin/cat.exe
$ cat myfile
Hello world
...
$ unalias cat
$ type cat
cat is hashed (/bin/cat)
$ cat myfile
Hello world
...
So either cat works (the internal busybox version and the external one; I do not know if they are exactly the same).
This is because /bin points to C:\Users\user001\AppData\Local\Temp\Mxt108\bin and cat.exe is there.
But when using a Persistent root (/) directory, /bin points to <Persistent root (/) directory>\bin, and cat.exe is not created there.
The former temporary root dir is removed as soon as MXT is closed.
So this is probably a configuration error from MobaXterm.
If so, the only option seems a workaround, as above.
Script for unaliasing:
#!/bin/bash
export ROOTDIR_WIN=$(cygpath -wl /)
if [[ ${ROOTDIR_WIN##*\\} == "Mxt108" ]] ; then
# Not using a Persistent root dir. Do not need to unalias.
echo "Not using a Persistent root dir. Do not need to unalias."
else
# Using a Persistent root dir. Need to unalias.
exe_linux_list="bash busybox chmod cygstart cygtermd cygwin-console-helper dircolors dwm_w32 echo grep ls MoTTY ssh ssh-pageant test twm_w32 wc xkbcomp_w32 XWin_MobaX"
for exe_linux in ${exe_linux_list} ; do
if [[ $(type -t ${exe_linux}) == "alias" ]] ; then
#type ${exe_linux}
unalias ${exe_linux}
fi
done
fi
On my MobaXterm system, /etc/profile sources /etc/baseprofile which includes aliases for all of these sorts of things, i.e.
alias "cat"="_bbf cat"
and checking that from my command prompt yields what I would expect:
$ type cat
cat is aliased to `_bbf cat'
Have you changed your system somehow so that /etc/baseprofile is not being sourced? Or have you changed /etc/baseprofile?
It also appears that you've installed the regular GNU Coreutils package, as I don't have a /bin/cat.exe.
$ ll /bin/cat.exe
ls: /bin/cat.exe: No such file or directory
Perhaps that's where your problem started but the _bbf function is supposed to handle that. Which again leads me to the belief that you've changed /etc/baseprofile somehow.
Most of the time it works fine. This error may be caused by a wrong path match for cat.exe.
In my case, the same error message appeared when I ran git log. It was due to the PATH variable: two directories both contain a git.exe. One of them is an incomplete, much smaller file, and MobaXterm picked that one. :D
I confirmed this by running which git, which prints the path actually used.
I fixed it with
alias git='/drives/C/Program\ Files/Git/mingw64/bin/git.exe'
The following is my dirs.
├─cmd
│ git-gui.exe
│ git-lfs.exe
│ git.exe # oops
│ gitk.exe
│ start-ssh-agent.cmd
│ start-ssh-pageant.cmd
├─mingw64
│ ├─bin
│ │ git-upload-pack.exe
│ │ git.exe # the right one

Use diff and ignore empty directories

This is my tree
├── test
│   ├── dir1
│   └── dir2
│       ├── file
│       └── file2
└── test2
    └── dir2
        ├── file
        └── file2
I use diff: diff -r test/ test2/
Only in test: dir1
So the only difference is that there is an empty directory (dir1) in test/ which does not exist in test2/.
I want to ignore empty directories as a difference. So I want in this case that diff tells me that the content of test/ is the same as the content of test2/
How can I achieve this?
I found a way of doing it, but I'm not really happy with it.
diff can be told to exclude files matching a pattern. Sadly the pattern only applies to the base name, so my solution may exclude more directories (and files too) than expected.
Here it is, fwiw:
diff $(find test -empty -type d -exec sh -c 'echo -n "-x $(basename "$1") "' _ {} \;) -r test/ test2/
I added the following command
find test -empty -type d -exec sh -c 'echo -n "-x $(basename "$1") "' _ {} \;
which outputs the base name of every empty directory, preceded by diff's exclude option -x. With your example tree, it would output: -x dir1
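For directory names that contain spaces, the same idea could be written with a bash array. This is a sketch under assumptions: the function name diff_ignoring_empty_dirs is made up, it needs bash plus a find that supports -empty and -print0, and unlike the command above it looks for empty directories on both sides. It keeps the same limitation: -x matches on the base name only.

```shell
# Sketch only: diff_ignoring_empty_dirs is a hypothetical helper name.
# Needs bash (arrays, process substitution) and find with -empty and -print0.
diff_ignoring_empty_dirs() {
    local left=$1 right=$2 dir
    local -a excludes=()

    # Collect one "-x <basename>" pair per empty directory on either side.
    while IFS= read -r -d '' dir; do
        excludes+=(-x "$(basename "$dir")")
    done < <(find "$left" "$right" -type d -empty -print0)

    # Caveat: -x excludes every file or directory with that base name.
    diff "${excludes[@]}" -r "$left" "$right"
}
```

With the example tree, diff_ignoring_empty_dirs test test2 should print nothing and return the exit status of diff, which is 0 when no differences remain.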