Which block represents a WARC-Block-Digest? - common-crawl

At Line 09 below there is this line: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 01: WARC/1.0
Line 02: WARC-Type: request
Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
Line 04: Content-Type: application/http;msgtype=request
Line 05: WARC-Date: 2018-11-03T17:20:02Z
Line 06: WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
Line 07: WARC-IP-Address: 54.230.195.16
Line 08: WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
Line 09: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 10: Content-Length: 141
Line 11:
Line 12: GET /vital-signs/carbon-dioxide/ HTTP/1.1
Line 13: User-Agent: Wget/1.15 (linux-gnu)
Line 14: Accept: */*
Line 15: Host: climate.nasa.gov
Line 16: Connection: Keep-Alive
WARC's specs say that The WARC-Block-Digest is an optional parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record.
I've been trying to figure out what full block of the record refers to. Is it line 11 to 16? Or Line 12 to 16? Or Line 1 to 16 (without line 9)? I've tried hashing those possibilities but can't get the sha1 (base 32) value above.

A WARC record of a HTTP GET requests has three parts (cf. the WARC spec):
the WARC header
the HTTP request header
the payload which is empty (note: a POST requests would include a non-empty payload)
The payload digest of the record is the base32-encoded SHA-1 of the empty string. A proof using Linux command-line tools:
$> echo -n "" | openssl dgst -binary -sha1 | base32
3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
A WARC record has the form:
warc-record = header CRLF
block CRLF CRLF
(see WARC spec: record model)
The "full" block should include everything up to the trailing \r\n\r\n. This means lines 11 to 17. Note: also the HTTP GET request ends with \r\n\r\n (a trailing blank line):
$> cat request
GET /vital-signs/carbon-dioxide/ HTTP/1.1
User-Agent: Wget/1.15 (linux-gnu)
Accept: */*
Host: climate.nasa.gov
Connection: Keep-Alive
$> tail -n2 request | hexdump -C
00000000 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 4b 65 65 70 |Connection: Keep|
00000010 2d 41 6c 69 76 65 0d 0a 0d 0a |-Alive....|
0000001a
$> cat request | openssl dgst -binary -sha1 | base32
CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ

Related

how to find offset of a pattern from binary file (without grep -b)

I want to get a byte offset of a string pattern from a binary file on embedded linux platform.
If I can use "grep -b" option, It would be best way but It is not supported on my machine.
machine does not support
ADDR=`grep -oba <pattern string> <file path> | cut -d ":" -f1`
Here the manual of grep command on the machine.
root# grep --help
BusyBox v1.29.3 () multi-call binary.
Usage: grep \[-HhnlLoqvsriwFE\] \[-m N\] \[-A/B/C N\] PATTERN/-e PATTERN.../-f FILE \[FILE\]...
Search for PATTERN in FILEs (or stdin)
-H Add 'filename:' prefix
-h Do not add 'filename:' prefix
-n Add 'line_no:' prefix
-l Show only names of files that match
-L Show only names of files that don't match
-c Show only count of matching lines
-o Show only the matching part of line
-q Quiet. Return 0 if PATTERN is found, 1 otherwise
-v Select non-matching lines
-s Suppress open and read errors
-r Recurse
-i Ignore case
-w Match whole words only
-x Match whole lines only
-F PATTERN is a literal (not regexp)
-E PATTERN is an extended regexp
-m N Match up to N times per file
-A N Print N lines of trailing context
-B N Print N lines of leading context
-C N Same as '-A N -B N'
-e PTRN Pattern to match
-f FILE Read pattern from file
Since that option isn't available, I'm looking for an alternative.
the combination of hexdump and grep can be also useful
such as
ADDR=`hexdump <file path> -C | grep <pattern string> | cut -d' ' -f1`
But if pattren spans multiple lines, it will not be found.
Is there a way to find the byte offset of a specific pattern with a Linux command?
Set the pattern as the record separator in awk. The offset of the occurrence is the length of the first record. BusyBox awk treats RS as an extended regular expression, so add backslashes before any of .[]\*+?^$ in the pattern string.
<myfile.bin awk -v RS='pattern' '{print length($0); exit}'
If the pattern contains a null byte, you need a little extra work. Use tr to exchange null bytes with some byte value that doesn't appear in the pattern. For example, if the pattern's hex dump is 00002a61:
<myfile.bin tr '\0!' '!\0' | awk -v RS='!!-A' '{print length($0); exit}'
If the pattern is not found, this prints the length of the whole file. So if you aren't sure whether the pattern is present, you need again some extra work. Append some text that can't be part of a pattern match to the file, so that you know that if there's a match, it won't be at the very end of the file. Then, if the pattern is present, the file will contain at least two records. But if the pattern is not present, the file only contains the first record (without a record separator after it).
{ cat myfile.bin; echo garbage; } |
awk -v RS='pattern' '
NR==1 {n = length($0)}
NR==2 {print n; found = 1; exit}
END {exit !found}
'
Something like this?
hexdump -C "$file" |
awk -v pattern="$pattern" 'residue { matched = ($0 ~ "\\|" residue)
if (matched) print $1; residue = ""; if (matched) next }
$0 ~ pattern { print $1 }
{ for(i=length(pattern)-1; i>0; i--)
if ($0 ~ substr(pattern, 1, i) "\\|$") { residue=substr(pattern, i+1); break } }'
The offset is just the first field from the hexdump output; if you need the precise location of the match, this requires some additional massaging to figure out the offset to add to the address, or subtract if it was wrapped.
Briefly tested in a clean-slate Busybox Docker container where hexdump -C output looks like this:
/ # hexdump -C /etc/resolv.conf
00000000 23 20 44 4e 53 20 72 65 71 75 65 73 74 73 20 61 |# DNS requests a|
00000010 72 65 20 66 6f 72 77 61 72 64 65 64 20 74 6f 20 |re forwarded to |
00000020 74 68 65 20 68 6f 73 74 2e 20 44 48 43 50 20 44 |the host. DHCP D|
00000030 4e 53 20 6f 70 74 69 6f 6e 73 20 61 72 65 20 69 |NS options are i|
00000040 67 6e 6f 72 65 64 2e 0a 6e 61 6d 65 73 65 72 76 |gnored..nameserv|
00000050 65 72 20 31 39 32 2e 31 36 38 2e 36 35 2e 35 0a |er 192.168.65.5.|
00000060 20 | |

Zend FW response object and image data - adding linefeed?

I have run into a problem using php in Zend Framweork to dynamically scale images for return as mime-type image/jpeg.
The problem manifests itself as firefox reporting 'cannot display image because it contains errors'. This is the same problem reported in: return dynamic image zf2
To replicate the problem, I removed any file IO and copied code from a similar stack overflow example verbatim (in a zend FW action controller):
$resp = $this->getRespose();
$myImage = imagecreate(200,200);
$myGray = imagecolorallocate($myImage, 204, 204, 204);
$myBlack = imagecolorallocate($myImage, 0, 0, 0);
imageline($myImage, 15, 35, 120, 60, $myBlack);
ob_start();
imagejpeg($myImage);
$img_string = ob_get_contents();
$scaledSize = ob_get_length();
ob_end_clean();
imagedestroy($myImage);
$resp->setContent($img_string);
$resp->getHeaders()->addHeaders(array(
'Content-Type' => $mediaObj->getMimeType(),
'Content-Transfer-Encoding' => 'binary'));
$resp->setStatusCode(Response::STATUS_CODE_200);
return $resp;
When I use wget to capture the jpeg response, I notice a '0A' as the first byte of the output rather than the field separator 'FF'. There is no such '0A' in the data captured in the buffer, nor in the response's content member. Attempting to open the wget output with GIMP fails, unless I remove the 0A. I am guessing that Zend FW is using the line-feed as a field separator for the response fields vs. the content, but I'm not sure if that is the problem, or if it is, how to fix it.
My response fields look OK:
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Sat, 02 Jun 2018 23:30:09 GMT
Server: Apache/2.4.7 (Ubuntu) OpenSSL/1.0.1f
Set-Cookie: PHPSESSID=nsgk1o5au7ls4p5g6mr9kegoeg; path=/; HttpOnly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Content-Transfer-Encoding: binary
Content-Length: 1887
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: image/jpeg
Here is the dump of the first few bytes of the wget with the jpeg stream that fails:
00000000 0a ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 |.......JFIF.....|
00000010 60 00 60 00 00 ff fe 00 3e 43 52 45 41 54 4f 52 |......>CREATOR|
Any idea where the '0A' is coming from? I am running zend framework 2.5.1, PHP 7.2.2
Thank you Tim Fountain.
I did finally find the offending file buried in some doctrine entities I had created. Sure enough, a stray "?>" with an empty line.
Much appreciated

SonarQube 6.3 LDAP/SSO UTF-8 encoding

We‘re using LDAP/SSO in my company which provides the username in UTF-8 format to SonarQube.
However LDAP/SSO sends the username in UFT-8 but SonarQube requires Latin1/ISO-8859. There is no way to change the encoding on LDAP/SSO or SonarQube.
Result wrong umlauts:
Andrü Tingö = Andr«Ã Ting¼Ã OR äëüö = äëüÃ
Is there any workaround?
I wanted to post this as comment, but I need 50 reputations to write comments.
We are using simplesamlphp for SSO as IdP and SP. IdP takes cn, givenName and sn from LDAP, which has UTF-8 values. Loginnames/Usernames are us-ascii only.
If the user comes to Sonar, the non-us-ascii characters are incorrect - they were converted from ... to utf-8, even they already are in utf-8.
If I use the attributes from IDP in PHP which sends the page in UTF-8, then characters are correct.
I did just now one test. In our Apache Config we set the X-Forwarded-Name to MCAC_ATTR_CN attribute what SP get from IdP. Original configuration is:
RequestHeader set X-Forwarded-Name "expr=%{reqenv:MCAC_ATTR_CN}"
Now I have added fixed string in UTF-8:
RequestHeader set X-Forwarded-Name "expr=%{reqenv:MCAC_ATTR_CN} cäëöüc"
The "c" characters are only separators to see the encoded text better.
The hexdump of this configuration line is:
0000750: 09 0909 5265 7175 6573 7448 6561 ...RequestHea
0000760: 6465 7220 7365 7420 582d 466f 7277 6172 der set X-Forwar
0000770: 6465 642d 4e61 6d65 2022 6578 7072 3d25 ded-Name "expr=%
0000780: 7b72 6571 656e 763a 4d43 4143 5f41 5454 {reqenv:MCAC_ATT
0000790: 525f 434e 7d20 63c3 a4c3 abc3 b6c3 bc63 R_CN} c........c
00007a0: 220a ".
As you can see, there are fixed utf-8 characters "ä" c3a4 "ë" c3ab "ö" c3b6 "ü" c3bc.
From LDAP comes follwing name:
xxxxxx xxxxx xxxx äëüö
In Apache config is appended " cäëöüc", therefore resulting name should be:
xxxxxx xxxxx xxxx äëüö cäëöüc
But in Sonar, the name is displayed as
xxxxxx xxxxx xxxx äëüö cäëöüc
You get similar result if you convert follwing text:
xxxxxx xxxxx xxxx äëüö cäëöüc
from ISO-8859-1 to UTF-8:
echo "xxxxxx xxxxx xxxx äëüö cäëöüc" | iconv -f iso-8859-2 -t utf-8
xxxxxx xxxxx xxxx äÍßÜ cäÍÜßc
The "¤" character is utf-8 char c2 a4:
00000000: c2a4 0a ...
I have made tcpdump on loopback to get communications from apache proxy module to sonarqube and even there you can see correct UTF-8 characters c3a4 c3ab c3bc c3b6 comming from IdP and then between "c"s you can see c3a4 c3ab c3b6 c3bc comming direct from apache.
00000000 47 45 54 20 2f 61 63 63 6f 75 6e 74 20 48 54 54 GET /acc ount HTT
...
00000390 58 2d 46 6f 72 77 61 72 64 65 64 2d 4e 61 6d 65 X-Forwar ded-Name
000003A0 3a 20 72 6f 62 65 72 74 20 74 65 73 74 32 20 77 : xxxxxx xxxxx x
000003B0 6f 6c 66 20 c3 a4 c3 ab c3 bc c3 b6 20 63 c3 a4 xxx .... .... c..
000003C0 c3 ab c3 b6 c3 bc 63 0d 0a ......c. .
...
The system has locales set to en_US.UTF-8, if this matters.
So Sonar gets really UTF-8 Text from Apache (direct config or from IdP) but then something probably converts this utf-8 text as if it was iso-8859 text to utf-8 again and makes nonsense.
Do you have any idea now? Could this be something in sonar or in wrapper or somewhere some options set incorrectly?
Regards,
Robert.

combine one colum from two files into a thrid file

I have two files a.txt and b.txt which contains the following data.
$ cat a.txt
0x5212cb03caa111e0
0x5212cb03caa113c0
0x5212cb03caa115c0
0x5212cb03caa117c0
0x5212cb03caa119e0
0x5212cb03caa11bc0
0x5212cb03caa11dc0
0x5212cb03caa11fc0
0x5212cb03caa121c0
$ cat b.txt
36 65 fb 60 7a 5e
36 65 fb 60 7a 64
36 65 fb 60 7a 6a
36 65 fb 60 7a 70
36 65 fb 60 7a 76
36 65 fb 60 7a 7c
36 65 fb 60 7a 82
36 65 fb 60 7a 88
36 65 fb 60 7a 8e
I want to generate a third file c.txt that contains
0x5212cb03caa111e0 36 65 fb 60 7a 5e
0x5212cb03caa113c0 36 65 fb 60 7a 64
0x5212cb03caa115c0 36 65 fb 60 7a 6a
Can I achieve this using awk? How do I do this?
use paste command:
paste a.txt b.txt
paste is really the shortest solution, however if you're looking for awk solution as stated in question then:
awk 'FNR==NR{a[++i]=$0;next} {print a[FNR] "\t" $0}' a.txt b.txt
Here is an awk solution that only stores two lines in memory at a time:
awk '{ getline b < "b.txt"; print $0, b }' OFS='\t' a.txt
Lines from a.txt are implicitly stored in $0 and for each line in a.txt a line is read from b.txt by getline.

Insert shell code

I got a small question.
Say I have the following code inside a console application :
printf("Enter name: ");
scanf("%s", &name);
I would like to exploit this vulnerability and enter the following shell code (MessageboxA):
6A 00 68 04 21 2F 01 68 0C 21 2F 01 6A 00 FF 15 B0 20 2F 01
How can I enter my shell code (Hex values) through the console ?
If I enter the input as is, it treats the numbers as chars and not as hex values.
Thanks a lot.
You could use as stdin a file with the desired content or use the echo command.
Suppose your shell code is AA BB CC DD (obviously this is not a valid shellcode):
echo -e "\xAA\xBB\xCC\xDD" | prog