Collecting data from web sites

Collecting data from web sites - xidel

I have two web pages
Page 1:
<data>
<item>
<name>Item 1</name>
<url>http://someUrl.html</url>
</item>
</data>
Page 2: http://someUrl.html
<data>
<info>Info 1</info>
<info>Info 2</info>
<info>Info 3</info>
</data>
I want to crawl page 1, follow all the links there, and generate the following output:
Item 1, Info 1
Item 1, Info 2
Item 1, Info 3
...
How can I achieve this using Xidel?

I only recently found Xidel, so I'm no expert, but in my opinion it's an extremely powerful Swiss-army-knife command-line scraping tool that deserves to be far better known.
To answer your question, I think the following (using HTML templates) does what you want:
xidel -q page1.html --extract-exclude=name -e "<name>{name:=text()}</name>*" -f "<url>{link:=text()}</url>*" -e "<info>{string-join(($name, text()), ', ')}</info>*" --hide-variable-names
Or, even shorter with CSS selectors:
xidel -q page1.html --extract-exclude=name -e "name:=css('name')" -f "link:=css('url')" -e "css('info')/string-join(($name,.),', ')" --hide-variable-names
Or, shortest with XPath:
xidel -q page1.html --extract-exclude=name -e name:=//name -f link:=//url -e "//info/string-join(($name,.),', ')" --hide-variable-names
The shortest line possible (but not in CSV format) would be:
xidel -q page1.html -e //name,//info -f //url
The commands above use Windows (cmd) quoting, so swap the single and double quotes when running them on macOS or Linux.
If you need an explanation of the different parts of these lines, just ask. Cheers!

You're talking about "all the links there", so instead of what you posted I'm going to assume as input:
<data>
<item>
<name>Item 1</name>
<url>http://someUrl1.html</url>
</item>
<item>
<name>Item 2</name>
<url>http://someUrl2.html</url>
</item>
<item>
<name>Item 3</name>
<url>http://someUrl3.html</url>
</item>
</data>
Linux:
xidel -s input.html -e 'for $item in //item for $info in doc($item/url)//info return $item/name||", "||$info'
#or
xidel -s input.html -e '
for $item in //item
for $info in doc($item/url)//info
return
$item/name||", "||$info
'
Windows:
xidel -s input.html -e "for $item in //item for $info in doc($item/url)//info return $item/name||', '||$info"
#or
xidel -s input.html -e ^"^
for $item in //item^
for $info in doc($item/url)//info^
return^
$item/name^|^|', '^|^|$info^
"
The first for loop iterates over every <item> node. The second for loop opens the URL found in that item's <url> node and iterates over every <info> node in the fetched document. The return clause is a simple string concatenation.
The output in this case:
Item 1, Info 1
Item 1, Info 2
Item 1, Info 3
Item 2, Info 4
Item 2, Info 5
Item 2, Info 6
Item 3, Info 7
Item 3, Info 8
Item 3, Info 9

Related

Linux command Line find and replace

I have a file.txt with these contents:
2021-12-03;12.20.31;13;00000.00;00000.00;NO LINK
2021-12-03;12.33.31;15;00199.94;00000.00;Status OK
2021-12-03;12.35.33; 2;01962.33;00015.48;;Status OK
2021-12-03;13.05.31;13;00000.00;00000.00;NO LINK
What command would produce the output below?
2021-12-03;12:20:31;13;00000.00;00000.00;NO LINK
2021-12-03;12:33:31;15;00199.94;00000.00;Status OK
2021-12-03;12:35:33; 2;01962.33;00015.48;Status OK
2021-12-03;13:05:31;13;00000.00;00000.00;NO LINK
Note: the time is in bytes 12-19 of each line, i.e. cut -b 12-19 file.txt.
Thanks for your help.
Rido

I assumed that the lines you want to modify are contained in a file (which I called filea.txt). The script should solve your problem.
Contents of the file 'filea.txt':
$> cat filea.txt
2021-12-03;12.20.31;13;00000.00;00000.00;NO LINK
2021-12-03;12.33.31;15;00199.94;00000.00;Status OK
2021-12-03;12.35.33; 2;01962.33;00015.48;;Status OK
2021-12-03;13.05.31;13;00000.00;00000.00;NO LINK
Script File:
$> cat refrm
#!/usr/bin/env bash
# Reformat time fields (HH.MM.SS -> HH:MM:SS) and collapse repeated ';'.
in_file="filea.txt"
while IFS= read -r line || [ -n "$line" ]
do
    # Collapse runs of two or more ';' into a single ';'
    line=$(printf '%s\n' "$line" | sed -E 's/;{2,}/;/g')
    # Split the line into fields on ';'
    IFS=';' read -r -a arr <<< "$line"
    for ((n = 0; n < ${#arr[@]}; n++))
    do
        # Only a field that starts like HH.MM.SS gets its dots replaced
        if [[ ${arr[n]} =~ ^[0-9]{2}\.[0-9]{2}\.[0-9]{2} ]]
        then
            arr[n]=${arr[n]//./:}
        fi
    done
    # Re-join the fields with ';'
    nline=$(IFS=';'; echo "${arr[*]}")
    echo "$nline"
done < "$in_file"
Output:
$> refrm
2021-12-03;12:20:31;13;00000.00;00000.00;NO LINK
2021-12-03;12:33:31;15;00199.94;00000.00;Status OK
2021-12-03;12:35:33; 2;01962.33;00015.48;Status OK
2021-12-03;13:05:31;13;00000.00;00000.00;NO LINK
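For comparison, the same transformation can probably be done with a single sed call. This is only a sketch, assuming the time is always the second field and always looks like HH.MM.SS; the sample lines are copied from the question, and the variable names input and result are made up for the example.

```shell
# Sample lines copied from the question (stand-in for file.txt)
input='2021-12-03;12.20.31;13;00000.00;00000.00;NO LINK
2021-12-03;12.35.33; 2;01962.33;00015.48;;Status OK'

# 1st expression: in the second field, turn HH.MM.SS into HH:MM:SS
# 2nd expression: collapse runs of two or more ';' into one
result=$(printf '%s\n' "$input" |
  sed -E -e 's/^([^;]*;[0-9]{2})\.([0-9]{2})\.([0-9]{2});/\1:\2:\3;/' \
         -e 's/;{2,}/;/g')

printf '%s\n' "$result"
```

Anchoring the first expression to the second field keeps the dots in the numeric fields (such as 00199.94) untouched.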

what is the best way to extract filled data from a static form?

I have some federal PDF forms with data filled in. Take form I-765 as an example: I have the contents of the form available as text, with the details duly filled in. How can I extract the data from this form with minimal parsing? In other words, how can I write a script that identifies the "difference" between the blank form and the filled form, since that difference is exactly the filled-in information?
For example, if a line in the blank form contains
SSN: (whitespace) and the filled-in form has SSN: ABC!##456
then the filled-in information is ABC!##456, which is just the difference between the two strings. Is there a known approach I can follow? Any pointers are much appreciated.

If we are talking about Linux tools, you could try various solutions, such as:
$ join -t"=" -a1 -o 0,2.2 <(sort emptyform) <(sort filledform) # "=" is used as delimiter
Or even awk without sorting requirements:
$ awk 'BEGIN{FS=OFS="="}NR==FNR{a[$1]=$2;next}{if ($1 in a) {print;delete a[$1]}} \
END{print "\n Missing fields:";for (i in a) print i,a[i]}' empty filled
Testing:
cat <<EOF >empty
Name=""
Surname=""
Age=""
Address=""
Kids=""
Married=""
EOF
cat <<EOF >filled
Name="George"
Surname="Vasiliou"
Age="42"
Address="Europe"
EOF
join -t"=" -a1 -o 0,2.2 <(sort empty) <(sort filled)
#Output:
Address="Europe"
Age="42"
Kids=
Married=
Name="George"
Surname="Vasiliou"
awk output
awk 'BEGIN{FS=OFS="="}NR==FNR{a[$1]=$2;next}{if ($1 in a) {print;delete a[$1]}} \
END{print "\nnot completed fields:";for (i in a) print i,a[i]}' empty filled
Name="George"
Surname="Vasiliou"
Age="42"
Address="Europe"
not completed fields:
Married=""
Kids=""
In the awk version, if you remove the print from {if ($1 in a) {print;delete a[$1]}}, only the END section produces output, so you get just the missing fields.
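The variant with the print removed might look like the sketch below. The file names and sample fields are the ones from the test above; the temporary directory, the variable name missing, and the final sort are additions for this example (awk's for-in order is unspecified, so sort makes the result predictable).

```shell
# Recreate the sample files from the test in a temporary directory
tmp=$(mktemp -d)
printf '%s\n' 'Name=""' 'Surname=""' 'Age=""' 'Address=""' 'Kids=""' 'Married=""' > "$tmp/empty"
printf '%s\n' 'Name="George"' 'Surname="Vasiliou"' 'Age="42"' 'Address="Europe"' > "$tmp/filled"

# First file: remember every field name.
# Second file: forget the names that were filled in.
# END: whatever is left was never filled.
missing=$(awk 'BEGIN { FS = "=" }
               NR == FNR { a[$1]; next }
               { delete a[$1] }
               END { for (i in a) print i }' "$tmp/empty" "$tmp/filled" | sort)

printf '%s\n' "$missing"
rm -rf "$tmp"
```

With the sample data this lists Kids and Married, the two fields that appear in the empty form but not in the filled one.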
Another alternative, with a nice side-by-side view, is the diff utility:
$ diff -y <(sort empty) <(sort filled)
Address="" | Address="Europe"
Age="" | Age="42"
Kids="" | Name="George"
Married="" | Surname="Vasiliou"
Name="" <
Surname="" <

How to print some free text in addition to SED extract

This is the well-known sed command to extract the first line and append it to another file:
sed -n '1 p' /p/raw.txt | cat >> /p/001.txt ;
gives an output in /p/001.txt like
John Doe
How can I modify the command above to add some free text, so that the output looks like this, for example?
Name: John Doe
Thanks for any hint to try.

You can do that in a single command (and no sub-shells):
sed 's/^/Name: /;q' /p/raw.txt >> /p/001.txt
This prefixes "Name: " in front of the first line, prints it, then quits so you don't process additional lines. Add a line number before the q to print all lines up to (and including) that number. The output is appended to /p/001.txt just like your original code.
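As a small illustration of putting a line number before the q, the sketch below prefixes and prints the first three lines only. The sample names are hypothetical stand-ins for the contents of /p/raw.txt.

```shell
# Hypothetical sample input standing in for /p/raw.txt
sample=$(printf '%s\n' 'John Doe' 'Jane Roe' 'Max Mustermann' 'Erika Mustermann')

# Prefix every line, then quit after line 3, so only lines 1-3 are printed
first_three=$(printf '%s\n' "$sample" | sed 's/^/Name: /;3q')

printf '%s\n' "$first_three"
```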
If you want a range of lines:
sed -n '3,9{s/^/Name: /;p}9q' /p/raw.txt >> /p/001.txt
This reads from lines 3-9, performs the substitution, prints, then quits after line 9.
If you want specific lines, I recommend awk:
awk 'NR==3 || NR==9 { print "Name: " $0 } NR>=9 { exit }' /p/raw.txt >> /p/001.txt
This has two clauses. The first fires when the record number (line number) is either 3 or 9, in which case we print the prefix and the line. The second stops reading the file once the 9th record has been processed.
Here are two more commands to show how awk can act on just the first line(s) or a given range:
awk '{ print "Name: " $0 } NR >= 1 { exit }' /p/raw.txt >> /p/001.txt
awk '3 <= NR { print "Name: " $0 } NR >= 9 { exit }' /p/raw.txt >> /p/001.txt
It appears you're continuously building one file from the other. Consider:
tail -Fn0 /p/raw.txt |sed 's/^/Name: /' >> /p/001.txt
This will run continuously, adding only new entries (added after the command is run) to /p/001.txt
Perhaps you have lots of duplicates to resolve?
awk 'NR != FNR { $0 = "Name: " $0 } !s[$0]++' \
/p/001.txt /p/raw.txt > /tmp/001.txt && mv /tmp/001.txt /p/001.txt
This folds together the previously saved names with any new names, printing each name only once. !s[$0]++ is true when s[$0] is zero (its default state); after the evaluation it is incremented to one, which makes the test false on the second and later occurrences. When a bare pattern has no action, the line is printed. Because we're reading the output file, we need a temporary output file. Once the command completes successfully, we move it over the target output file.
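To see the de-duplication at work, here is a small sketch that runs the same awk program on two throwaway files. The file contents and the temporary directory are made up for the example; only the awk program is taken from the command above.

```shell
# Throwaway stand-ins for /p/001.txt (already prefixed) and /p/raw.txt (raw names)
tmp=$(mktemp -d)
printf '%s\n' 'Name: John Doe' > "$tmp/001.txt"
printf '%s\n' 'John Doe' 'Jane Roe' 'Jane Roe' > "$tmp/raw.txt"

# Lines from the second file get the prefix; !s[$0]++ lets each distinct line through once
merged=$(awk 'NR != FNR { $0 = "Name: " $0 } !s[$0]++' "$tmp/001.txt" "$tmp/raw.txt")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

John Doe is already in the saved file and Jane Roe appears twice in the raw file, yet each name ends up in the result exactly once.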

printf "Name : %s\n" "$(sed -n '1p;q' /p/raw.txt)" >/p/001.txt
should do it. If printf is not a requirement, you can use echo instead:
echo -e "Name : $(sed -n '1p;q' /p/raw.txt)" >/p/001.txt
Note:
The q command makes sed quit without processing any more commands or input.
The -e option tells echo to interpret backslash escape sequences. This is how the bash builtin echo behaves; other shells may differ.

Delete records in a file with Null value in certain fields through Unix

I have a pipe-delimited file (sample below) and I need to delete records that have a null value in fields 2 (email), 4 (mailing-id), 6 (comm_id). In this sample, rows 2, 3 and 4 should be deleted. The output should be saved to another file. If awk is the best option, please let me know how to achieve this.
id|email|date|mailing-id|seg_id|comm_id|oyb_id|method
|-fabianz-#yahoo.com|2010-06-23 11:47:00|0|1234|INCLO|1000002|unknown
||2010-06-23 11:47:00|0|3984|INCLO|1000002|unknown
|-maddog-#web.md|2010-06-23 11:47:00|0||INCLO|1000002|unknown
|-mse-#hanmail.net|2010-06-23 11:47:00|0||INCLO|1000002|unknown
|-maine-mei#web.md.net|2010-06-23 11:47:00|0|454|INCLO|1000002|unknown

Here is an awk solution that may help. However, to remove rows 2, 3 and 4, it is necessary to check for null values in fields 2 and 5 only (i.e. not fields 2, 4 and 6 as you stated). Am I understanding things correctly? Here is the awk command to do what you want:
awk -F "|" '{ if ($2 == "" || $5 == "") next; print $0 }' file.txt > results.txt
cat results.txt:
id|email|date|mailing-id|seg_id|comm_id|oyb_id|method
|-fabianz-#yahoo.com|2010-06-23 11:47:00|0|1234|INCLO|1000002|unknown
|-maine-mei#web.md.net|2010-06-23 11:47:00|0|454|INCLO|1000002|unknown
HTH

Steve is right: it is fields 2 and 5 that are empty in the sample given. The email is missing in row two, and the seg_id is missing in rows three and four.
This is a slightly simplified version of Steve's solution:
awk -F "|" ' $2!="" && $5!=""' file.txt > results.txt
If columns 2, 4 and 6 are the important ones, the solution would be:
awk -F "|" ' $2!="" && $4!="" && $6!=""' file.txt > results.txt
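Since the set of required fields seems to be in question, a variant that takes the field numbers as a parameter may be handy. This is a sketch only: cols is a hypothetical awk variable introduced here, and the sample rows are the header and the first three records from the question.

```shell
# Header plus the first three records from the question
data='id|email|date|mailing-id|seg_id|comm_id|oyb_id|method
|-fabianz-#yahoo.com|2010-06-23 11:47:00|0|1234|INCLO|1000002|unknown
||2010-06-23 11:47:00|0|3984|INCLO|1000002|unknown
|-maddog-#web.md|2010-06-23 11:47:00|0||INCLO|1000002|unknown'

# cols: comma-separated list of field numbers that must be non-empty
kept=$(printf '%s\n' "$data" |
  awk -F '|' -v cols='2,5' '
    BEGIN { n = split(cols, c, ",") }
    {
      for (i = 1; i <= n; i++)
        if ($(c[i]) == "") next   # drop the record once a listed field is empty
      print
    }')

printf '%s\n' "$kept"
```

Changing cols to 2,4,6 checks the fields named in the question instead, without touching the program itself.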

This might work for you:
sed 'h;s/[^|]*/\n&/2;s/[^|]*/\n&/4;s/[^|]*/\n&/6;/\n|/d;x' file.txt > results.txt

zenity list and for loop

for i in $(seq 1 10); do
echo 'bla bla'
echo 'xxx'
echo $i
done | select=$(zenity --list --title="title" --text="text" --column="X" --column="Y" --column="Z");
I'm trying to create a checklist with zenity; my problem is that $select is always empty.
I have tried a few other ways, like this one:
for i in $(seq 1 10)
do
x="bla bla"
y="xxx"
z="$i"
table="$table '$x' '$y' '$z'"
done
eval zenity --list --title="title" --text="text" --column="X" --column="Y" --column="Z" $table
This way the $select variable isn't empty, but if a variable contains spaces (like $x, for example), zenity splits it into two (or more) columns.
I need another solution, or a fix for my code.
Thanks!

You can try this other approach:
#!/bin/bash
for i in $(seq 1 10)
do
echo "bla bla"
echo "xxx"
echo "$i"
done | zenity --list --title="title" --text="text" --column="X" --column="Y" --column="Z"
Each input line fills one cell. The cells populate the table from the first column to the last, then continue on a new row, until the input stream ends.
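To get the chosen row into $select, as the question wanted, the whole pipeline can be wrapped in a command substitution. In the original attempt the assignment was itself the last stage of the pipeline, which bash normally runs in a subshell, so the value was lost. The sketch below assumes zenity's default of printing the first column of the selected row; the function name rows and the DISPLAY check are additions for illustration.

```shell
#!/bin/bash
# rows is a hypothetical helper: it prints one table cell per line, three cells per row
rows() {
  for i in $(seq 1 3); do
    echo "bla bla"
    echo "xxx"
    echo "$i"
  done
}

# Only open the dialog when zenity and an X display are available
if command -v zenity >/dev/null 2>&1 && [ -n "${DISPLAY:-}" ]; then
  # The command substitution wraps the whole pipeline, so the assignment
  # happens in the current shell and $select keeps its value
  select=$(rows | zenity --list --title="title" --text="text" \
    --column="X" --column="Y" --column="Z")
  echo "selected: $select"
fi
```

Because each cell arrives on its own line, values containing spaces (such as "bla bla") stay in a single column, which avoids the splitting problem of the eval approach.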
