I don't see an option in yq corresponding to the one jq has:
-r output raw strings, not JSON texts;
My exact usage is:
cat example.xml | yq --input-format xml --output-format json
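For reference, this is the jq behaviour I mean:
$ echo '{"a":"hello"}' | jq '.a'
"hello"
$ echo '{"a":"hello"}' | jq -r '.a'
hello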
The synopsis for xmlstarlet fo says
XMLStarlet Toolkit: Format XML document
Usage: xmlstarlet fo [<options>] <xml-file>
where <options> are
-n or --noindent - do not indent
-t or --indent-tab - indent output with tabulation
-s or --indent-spaces <num> - indent output with <num> spaces
-o or --omit-decl - omit xml declaration <?xml version="1.0"?>
--net - allow network access
-R or --recover - try to recover what is parsable
-D or --dropdtd - remove the DOCTYPE of the input docs
-C or --nocdata - replace cdata section with text nodes
-N or --nsclean - remove redundant namespace declarations
-e or --encode <encoding> - output in the given encoding (utf-8, unicode...)
-H or --html - input is HTML
-h or --help - print help
When I run
cat unformatted.html | xmlstarlet fo -H -R --encode utf-8
I get the error message
failed to load external entity "utf-8"
In my limited experience, xmlstarlet fo in particular needs the stdin dash to work with piped input.
In your example, the contents of unformatted.html are piped to xmlstarlet. But xmlstarlet fo doesn't 'see' the piped input unless you pass a - (dash) as the file argument.
Instead, it assumes that the last argument (utf-8) is the filename (the "external entity") whose contents you're trying to format. Obviously, there's no such file. Just to be on the safe side, I'd also enclose the encoding argument in double quotes, like so: "utf-8".
Altering your statement to
xmlstarlet fo -H -R --encode "utf-8" unformatted.html
should do the trick.
The cat is unnecessary, I'd think.
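Alternatively, if you want to keep the pipe, pass a - so that fo reads from stdin (a sketch of the dash usage described above):
cat unformatted.html | xmlstarlet fo -H -R --encode "utf-8" -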
I am trying to separate out parts of a path as follows. My input path takes the following possible forms:
bucket
bucket/dir1
bucket/dir1/dir2
bucket/dir1/dir2/dir3
...
I want to separate the first part of the path (bucket) from the rest of the string if present (dir1/dir2/dir3/...), and store both in separate variables.
The following gives me something close to what I want:
❯ BUCKET=$(echo "bucket/dir1/dir2" | sed 's#\(^[^\/]*\)[\/]\(.*\)#\1#')
❯ EXTENS=$(echo "bucket/dir1/dir2" | sed 's#\(^[^\/]*\)[\/]\(.*\)#\2#')
❯ echo $BUCKET $EXTENS
bucket dir1/dir2
HOWEVER, it fails if I only have bucket as input (without a slash):
❯ BUCKET=$(echo "bucket" | sed 's#\(^[^\/]*\)[\/]\(.*\)#\1#')
❯ EXTENS=$(echo "bucket" | sed 's#\(^[^\/]*\)[\/]\(.*\)#\2#')
❯ echo $BUCKET $EXTENS
bucket bucket
... because, in the absence of the first '/', the regex doesn't match, so no substitution takes place and the input passes through unchanged. When the input is just 'bucket' I would like $EXTENS to be set to the empty string "".
Thanks!
For something so simple you could use bash's built-in parameter expansion instead of launching sed: ${path%%/*} removes the longest suffix matching /* (leaving the first path component), and ${path#$bucket} then strips that component from the front of the string:
$ path="bucket/dir1/dir2"
$ bucket="${path%%/*}"
$ extens="${path#$bucket}"
$ printf '|%s|%s|\n' "$bucket" "$extens"
|bucket|/dir1/dir2|
$ path="bucket"
$ bucket="${path%%/*}"
$ extens="${path#$bucket}"
$ printf '|%s|%s|\n' "$bucket" "$extens"
|bucket||
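As a side note, quoting the inner expansion keeps $bucket from being treated as a glob pattern, in case the first component ever contains characters like * or ?:
extens="${path#"$bucket"}"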
But if you really want to use sed and capture groups:
$ declare -a bucket_extens
$ mapfile -td '' bucket_extens < <(printf '%s' "bucket/dir1/dir2" | sed -E 's!([^/]*)(.*)!\1\x00\2!')
$ printf '|%s|%s|\n' "${bucket_extens[#]}"
|bucket|/dir1/dir2|
$ mapfile -td '' bucket_extens < <(printf '%s' "bucket" | sed -E 's!([^/]*)(.*)!\1\x00\2!')
$ printf '|%s|%s|\n' "${bucket_extens[#]}"
|bucket||
We use extended regexes (-E) to simplify the expression a bit, and ! as the separator of the substitute command. The first capture group is the longest leading run of characters that are not a slash, and the second is everything else, including nothing if there is nothing else.
In the replacement string we separate the two capture groups with a NUL character (\x00). We then use mapfile to assign the result to the bash array bucket_extens.
The NUL trick is a way to deal with file names containing spaces, newlines and so on: NUL is the only character that cannot appear in a path. The -d '' option of mapfile indicates that the lines to map are separated by NUL instead of the default newline.
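To get the two separate variables the question asks for, you can then index the array:
$ bucket="${bucket_extens[0]}"
$ extens="${bucket_extens[1]}"
$ printf '|%s|%s|\n' "$bucket" "$extens"
|bucket|/dir1/dir2|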
Don't capture anything. Instead, just match what you don't want and replace it with nothing:
BUCKET=$(echo "bucket" | sed 's#/.*##'). # bucket
BUCKET=$(echo "bucket/dir1/dir2" | sed 's#/.*##') # bucket
EXTENS=$(echo "bucket" | sed 's#[^/]*##') # blank
EXTENS=$(echo "bucket/dir1/dir2" | sed 's#[^/]*##') # /dir1/dir2
As you are putting a slash in the regex, a string with no slashes will not
match. Let's make the slash optional with /\?. (The backslash before ?
is required because sed uses BRE, basic regular expressions, by default.) Then would you please try:
#!/bin/bash
#path="bucket/dir1/dir2"
path="bucket"
bucket=$(echo "$path" | sed 's#\(^[^/]*\)/\?\(.*\)#\1#')
extens=$(echo "$path" | sed 's#\(^[^/]*\)/\?\(.*\)#\2#')
echo "$bucket" "$extens"
You don't need to prepend a backslash to a slash.
By convention, it is recommended to use lowercase names for user variables.
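Saving this as, say, split.sh (a hypothetical name) and running it with each of the two path values should print:
$ bash split.sh    # path="bucket"
bucket
$ bash split.sh    # path="bucket/dir1/dir2"
bucket dir1/dir2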
I am currently working with a large .tsv.gz file that contains two columns that look something like this:
xxxyyy 408261
yzlsdf 408260null408261
zlkajd 408258null408259null408260
asfzns 408260
What I'd like to do is find all the rows that contain "null" and replace it with a comma ",". So that the result would look like this:
xxxyyy 408261
yzlsdf 408260,408261
zlkajd 408258,408259,408260
asfzns 408260
I have tried using the following command, but it did not work:
sed -i 's/null/,/g' 46536657_1748327588_combined_copy.tsv.gz
Unzipping the file and trying the same command on the plain .tsv also does not work.
I've also tried opening the unzipped file in a text editor to manually find and replace, but the file is so huge that the editor crashes.
Try:
zcat comb.tsv.gz | sed 's/null/,/g' | gzip >new_comb.tsv.gz && mv new_comb.tsv.gz comb.tsv.gz
Running sed directly on the .gz file cannot work, because the compressed bytes do not contain the literal text null. This pipeline decompresses the data as a stream, edits it, recompresses it to a new file, and renames it over the original only if everything succeeded. Because it never unzips the whole file at once, it should also save on memory.
Example
Let's start with this sample file:
$ zcat comb.tsv.gz
xxxyyy 408261
yzlsdf 408260null408261
zlkajd 408258null408259null408260
asfzns 408260
Next, we run our command:
$ zcat comb.tsv.gz | sed 's/null/,/g' | gzip >new_comb.tsv.gz && mv new_comb.tsv.gz comb.tsv.gz
By looking at the output file, we can see that the substitutions were made:
$ zcat comb.tsv.gz
xxxyyy 408261
yzlsdf 408260,408261
zlkajd 408258,408259,408260
asfzns 408260
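As a quick sanity check, zgrep (where available) can confirm that no literal null remains in the compressed file:
$ zgrep -c null comb.tsv.gz
0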
Suppose I have some json data given below:
{"name":"alon","department":"abc","id":"ss12sd"}
{"name":"kate","department":"xyz","id":"ajsj3" }
{"name":"sam","department":"abc","id":"xx1d2"}
I want to filter the data based on a particular department and save it in a different json file. From the above data, suppose I want to filter all the records whose department is 'abc' and save them in a new json file. How can I do this using jq? I am checking its manual from here but didn't understand much.
jq -s 'map(select(.department == "abc"))' yourfile.json
The -s (slurp) option reads the stream of objects into a single array so that map works; the result is a JSON array of the matching records.
A flexible template might be like this:
jq --arg key department --arg value abc \
  'select(.[$key] == $value)' input_file.json > output_file.json
This way you can change the criteria at the arguments stage rather than the expression.
Implementing that into a shell script might look like this:
myscript.sh
#!/usr/bin/env bash
key="$1"
value="$2"
file="$3"
outfile="$4"
jq --arg key "$1" --arg value "$2" \
'.[] | select(.[$key] == $value)' "$3" > "$4"
Which you would invoke like so:
./myscript.sh department abc input.json output.json
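With the sample data from the question, output.json should then contain the two matching records (jq pretty-prints by default; add -c to the jq call for one object per line):
$ cat output.json
{
  "name": "alon",
  "department": "abc",
  "id": "ss12sd"
}
{
  "name": "sam",
  "department": "abc",
  "id": "xx1d2"
}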
Edit: Changed ."\($key)" to .[$key] - thanks @peak
I need to extract certain json data (objects that have a datalist member) from the log file, but only if the map value is not 200.
Right now I have two sed scripts. One extracts the json data from the log file:
sed -n 's/.*\({\"datalist\".*}\).*/\1/p' full.log > new.log
The other one skips data whose map field has the value 200:
sed -n '/.*\"map\":\"200\".*/!p' new.log > map.log
How can I combine these two into one?
UPD: I have accepted an answer for now, but I wonder why
sed -n 's/.*\({\"datalist\".*\"map\":\"\(?!200\)\".*}\).*/\1/p' full.log > new.log
doesn't work.
This might work for you (GNU sed):
sed -n '/"map":"200"/!s/.*\({"datalist".*}\).*/\1/p' full.log > new.log
Strip out the "map":"200" lines with grep before sending to sed:
grep -v "\"map\":\"200\"" full.log | sed -n 's/.*\({\"datalist\".*}\).*/\1/p' > new.log