Unix shell scripting - delete multiple lines from a file based on a pattern - perl

The format of the file is:
header - like 0001datetime|number of records
0010 some data
0012 Roll number (eg)
0020 some data
.
.
.
0070 some data
0010 some data
0012 Roll number (eg)
0020 some data
.
.
.
0070 some data
trailer - like 0099datetime|numberof records
Requirement - A list of Roll numbers will be given, and the records (0010-0070) need to be removed for those numbers. There might be any number of fields between 0010 and 0070, but the roll number is always at 0012.
Each record always starts with 0010 and ends with 0070.
Could anyone please help with this?

Aside from the header and trailer lines, your file consists of blocks. Each block starts with a 0010 line and ends with a 0070 line.
One straightforward algorithm would be to always read one whole block into memory, and write it out only if the 0012 line (denoting the roll number) is not one of the roll numbers to be deleted.
You tagged this question "shell". However, while a solution using only bash or zsh is technically possible, I think it's much easier to use a more flexible language for this - Ruby, Perl, Python etc. Maybe you can tag the question with the language of your choice?
If you are not yet familiar with any of those languages, my personal recommendation would be Ruby, because I find it quick to learn; but all of them are equally well suited to this problem.
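That said, since the question title mentions Perl, here is a minimal sketch of the block-buffering approach in Perl. The roll numbers in %skip and the 0012 field regex are assumptions - adjust them to your real record layout:
use strict;
use warnings;

# Roll numbers whose 0010..0070 blocks should be dropped (example values).
my %skip = map { $_ => 1 } qw(12345 67890);

my @block;
while (my $line = <STDIN>) {
    if ($line =~ /^0010/) {
        @block = ($line);                      # start buffering a new block
    }
    elsif (@block) {
        push @block, $line;
        if ($line =~ /^0070/) {                # block complete: keep or drop it
            my ($roll) = map { /^0012\s+(\S+)/ ? $1 : () } @block;
            print @block unless defined $roll && $skip{$roll};
            @block = ();
        }
    }
    else {
        print $line;                           # header and trailer pass through
    }
}
Run it as perl filter.pl < input.txt > output.txt; lines outside any block (the header and trailer) are printed untouched.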

Related

Unexpected behavior of HTML::Entities

I'm a newbie user of Perl's HTML::Entities routine decode_entities() to
convert headlines scraped from news media websites.
Here's a good result:
Before: Texas grand jury clears Planned Parenthood, indicts its accusers
After: Texas grand jury clears Planned Parenthood, indicts its accusers
But here's a puzzling result:
Before: Big changes could be coming to Utah&#8217;s criminal justice&nbsp;system
After: Big changes could be coming to Utahâs criminal justice system
Notice that not only was the &#8217; code not converted to a single quote, the &nbsp; wasn't decoded into a space, unlike in the first example.
What's going on?
The difference between your first and second example is that the first one does not contain any code points above 255, while the second one does. So, the first string can be displayed according to the native 8-bit character set of your system (most likely ISO 8859-1/Latin 1), but the second cannot. The reason for this, according to perlunicode, is that "using a code point above 255 implies Unicode for the whole string".
Since you now have Unicode characters in your string, you'll need to properly encode your text for output, otherwise you'll see "strange characters" (just like the ones in your example!). Since you didn't provide a Minimal, Complete, and Verifiable example, I'm not sure what your output method is, but let's just assume STDOUT to make things easy. There are a couple different ways to encode your text into an octet stream:
Manually, using the Encode module
Automatically, using the correct I/O layer
I prefer the second option because it's less tedious. To do that, we'll just call binmode() on STDOUT:
use strict;
use warnings;
use HTML::Entities;
my $str = 'Big changes could be coming to Utah&#8217;s criminal justice&nbsp;system';
my $decoded = decode_entities($str);
binmode(STDOUT, ':encoding(UTF-8)');
printf("%s\n%vx\n", $decoded, $decoded);
Output:
$ perl foo.pl
Big changes could be coming to Utah’s criminal justice system
42.69.67.20.63.68.61.6e.67.65.73.20.63.6f.75.6c.64.20.62.65.20.63.6f.6d.69.6e.67.20.74.6f.20.55.74.61.68.2019.73.20.63.72.69.6d.69.6e.61.6c.20.6a.75.73.74.69.63.65.a0.73.79.73.74.65.6d
You can see that there's code point 2019 (right single quotation mark) between characters 68 and 73 (h and s, respectively), and also an a0 (non-breaking space) between 65 and 73, which would be e and s.
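For completeness, the first option (encoding manually with the Encode module, which ships with Perl) would look something like this; the output is the same:
use strict;
use warnings;
use Encode qw(encode);
use HTML::Entities;

my $str     = 'Big changes could be coming to Utah&#8217;s criminal justice&nbsp;system';
my $decoded = decode_entities($str);

# encode() converts the character string to UTF-8 octets itself, so no
# :encoding() layer is needed on STDOUT.
print encode('UTF-8', $decoded), "\n";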
In addition to the aforementioned perlunicode reference, you should also read perluniintro, perlunitut (short!), and perlunifaq if you're interested in learning more about how Perl handles Unicode and character encoding in general.

How to encode ASCII text in binary opcode instructions?

I do not need a refresher on anything. I am generally asking how one would encode a data string within the data segment of a binary file for execution on bare metal.
Purpose? Say I am writing a bootloader, and I need to define the static data used in the file to represent the string to move to a memory address, for example, with the Genesis TMSS issue.
I assume that the binary encoding of an address is literally the binary equivalent of its hexadecimal representation in the Motorola 68000 memory map, so that's not an issue for now.
The issue is... how do I encode strings/characters/glyphs in binary to embed within an M68000 opcode? I read the manual, references, etc., and none quite touch on this (from what I read through).
Say I want to encode move.l #'SEGA', $A14000. I would get this resulting opcode (without considering how to encode the ASCII characters):
0010 1101 0100 0010 1000 0000 0000 0000
Nibble 1 = MOVE LONG, Nibble 2 = MEMORY ADDRESSING MODE, and the following three bytes are the address.
My question is: do I encode the string, character by character, in literal ASCII as part of the instruction following the MAM nibble?
I am confused at this point, and was hoping somebody might know how to encode data text within an instruction.
Well, I have programmed in four different assembly languages, and Motorola M68HC11 is one of them. In my experience, ASCII is just for display purposes. The CPU at a low level treats everything as binary values; it cannot distinguish ASCII characters from other data. That said, higher-level assembly languages like x86 do support instructions like AAA (ASCII Adjust After Addition), which ensures that after the addition of two ASCII-coded digits the result is still a legal ASCII-coded number.
So mainly it's assembler-dependent: if the assembler supports the instruction move.l #'SEGA', $A14000, this might work. But since you are not using an assembler and are writing opcodes directly, you have to encode the ASCII into binary yourself; for example, the ASCII digit '1' (0x31) is encoded as 0000 0000 0011 0001 in a 16-bit representation. Also, in my experience there is no single machine instruction that moves a whole string: the first character is fetched and copied to the destination address, then the second character is fetched and copied to the second location, and so on.
Assuming the instruction is 32 bits long and immediate addressing is supported, the first two nibbles would select the move instruction and the immediate addressing mode, the next two nibbles would be the binary-encoded character, and the remaining ones would be the address you want to copy it to. Hope this helps.
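To make the original example concrete: as far as I can tell from the M68000 Programmer's Reference Manual, move.l #imm,(abs).l assembles to the opcode word 0x23FC followed by the 32-bit immediate and the 32-bit address as extension words, so the 'SEGA' characters appear in the instruction stream simply as their plain ASCII byte values. A small Perl sketch that emits those bytes:
use strict;
use warnings;

my $opcode    = 0x23FC;               # move.l #imm32,(abs).l
my $immediate = unpack 'N', 'SEGA';   # 0x53454741 - the plain ASCII bytes
my $address   = 0x00A14000;           # Genesis TMSS register

open my $fh, '>:raw', 'tmss.bin' or die "open: $!";
print {$fh} pack 'n N N', $opcode, $immediate, $address;
close $fh;

# Resulting bytes: 23 FC 53 45 47 41 00 A1 40 00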

Using text2pcap (or equivalent) to merge multiple plain text packets into one pcap

I'm trying to merge multiple plain-text packets into one large pcap file. I have been using text2pcap on each individual text file, then using mergecap on all the pcaps to create my final output. However, that's really slow, as it involves writing out each pcap file, merging them all together, and then deleting all the single pcaps. I'm looking to speed that up by sending multiple text files into text2pcap at once.
Unfortunately, from what I understand, text2pcap requires the offsets in the text file to be correct, and since I'm merging multiple different packets, I'm starting over at 0000 multiple times, and I think that's causing my errors.
So, assuming I have a packet that looks like this:
0000 30 00 20
0010 59 23 00
and another packet that looks like this:
0000 23 50 2c
0010 a4 23 f1
How would I best convert the two of them into a single pcap file?
You can also use PDD - Packet Dump Decode.
You can find an example in my article at LoveMyTool.
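Another thought: unless I'm misreading the text2pcap man page, a line with offset zero is treated as the start of a new packet, so restarting at 0000 should be fine, and you may be able to concatenate all the text dumps and run text2pcap once. A rough Perl sketch (the file names are made up; check this behavior against your text2pcap version first):
use strict;
use warnings;

# Concatenate every hex-dump text file into one input for a single
# text2pcap run; each packet's 0000 offset marks a new packet.
open my $out, '>', 'merged.txt' or die "merged.txt: $!";
for my $file (@ARGV) {
    open my $in, '<', $file or die "$file: $!";
    print {$out} $_ while <$in>;
    close $in;
    print {$out} "\n";    # blank line between dumps, just to be safe
}
close $out;

system('text2pcap', 'merged.txt', 'merged.pcap') == 0
    or die "text2pcap failed: $?";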

Doing a hash by hand/mathematically

I want to learn how to do a hash by hand (like with paper and pencil). Is this feasible? Any pointers on where to learn about this would be appreciated.
That depends on the hash you want to do. You can do a really simple hash by hand pretty easily -- for example, one trivial one is to take the ASCII values of the string, and add them together, typically doing something like a left-rotate between characters. So, to hash the string "Hash", we'd start with the ASCII values of the letters (in hex): 48 61 73 68. We'll add those together, rotating our result left 4 bits (in a 16-bit word) between letters:
0048 + 0061 = 00A9
00A9 <<< 4 = 0A90
0A90 + 0073 = 0B03
0B03 <<< 4 = B030
B030 + 0068 = B098
Result: B098
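If you want to check your pencil-and-paper arithmetic, the same add-and-rotate scheme is only a few lines of Perl:
use strict;
use warnings;

# Rotate a 16-bit value left by 4 bits.
sub rotl4 { my $v = shift; (($v << 4) | ($v >> 12)) & 0xFFFF }

sub toy_hash {
    my @codes = map { ord } split //, shift;
    my $h = shift @codes;                     # start with the first character
    while (@codes) {
        $h = ($h + shift @codes) & 0xFFFF;    # add the next character
        $h = rotl4($h) if @codes;             # rotate between characters
    }
    return $h;
}

printf "%04X\n", toy_hash('Hash');            # prints B098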
Doing a cryptographic hash by hand would be a rather different story. It's certainly still possible, but would be extremely tedious, to put it mildly. A cryptographic hash is typically quite a bit more complex, and (more importantly) almost always has a lot of "rounds", meaning that you basically repeat a set of steps a number of times to get from the input to the output. Speaking from experience, just stepping through SHA-1 in a debugger to be sure you've implemented it correctly is a pain -- doing it all by hand would be pretty awful (but as I said, certainly possible anyway).
You can start by looking at the Wikipedia article on hash functions.
I would suggest trying a CRC, since it seems to me to be the easiest to do by hand: https://en.wikipedia.org/wiki/CRC32#Computation
You can use a smaller width than the standard (usually 32 bits) to make things easier.
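As a sketch of what those steps look like in code (here a CRC-8 with polynomial 0x07 rather than the 32-bit standard, since a narrow CRC is easier to follow on paper), each inner loop iteration below is one shift-and-XOR step you would do by hand:
use strict;
use warnings;

# Bitwise CRC-8: polynomial 0x07, initial value 0x00, no reflection.
sub crc8 {
    my ($data) = @_;
    my $crc = 0x00;
    for my $byte (unpack 'C*', $data) {
        $crc ^= $byte;
        for (1 .. 8) {
            $crc = ($crc & 0x80)
                 ? (($crc << 1) ^ 0x07) & 0xFF
                 : ($crc << 1) & 0xFF;
        }
    }
    return $crc;
}

printf "%02X\n", crc8('Hash');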

VerQueryValue and multi codepage Unicode characters

In our application we use the VerQueryValue() API call to fetch version info such as the ProductName. For some applications running on a machine set to Traditional Chinese (code page 950), where the ProductName contains Unicode characters that span multiple code pages, some characters are not translated properly. For instance, in the sequence below,
51 00 51 00 6F 8F F6 4E A1 7B 06 74
Some characters are returned as invalid Unicode 0x003f (question mark)
In the above sequence, the Unicode '8F 6F' is not picked up and converted properly by the WinAPI call and is just filled with the invalid Unicode '00 3F' - since '8F 6F' is present in codepage 936 only (i.e., Simplified Chinese).
The .exe has just one translation table - '\StringFileInfo\080404B0' - which refers to a language ID of '804' for Traditional Chinese only.
How should one handle such cases - where the ProductName refers to Unicode from both 936 and 950 even though the translation table has only one entry? Is there any other API call to use?
Also, if I right-click on the exe and view the 'Details' tab, it shows the ProductName correctly! So it appears Microsoft uses a different API call or somehow handles this correctly. I need to know how this is done.
Thanks in advance,
Venkat
It looks somewhat weird to have content compatible only with codepage 1 inside a block marked as codepage 2. This is the source of your problem.
The best way to handle multi-codepage issues is obviously to turn your app into a Unicode-aware application. There will be no conversion to any codepage anymore, which will make everyone happy.
The LANGID (0804) is only an indication of the language of the contents in the block. If a version info has several blocks, you may program your app to look up the block in the language of your user.
When you call VerQueryValue() from an ANSI application, this LANGID is not taken into account when converting the Unicode contents to ANSI: you're ANSI, so Windows assumes you only understand the machine's default ANSI codepage.
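You can reproduce that substitution in isolation. Here is a sketch using Perl's Encode module (cp950 support comes from Encode's standard Encode::TW tables) with the character sequence from the question:
use strict;
use warnings;
use Encode qw(encode);

# The UTF-16LE bytes from the question decode to these code points.
my $product = "QQ\x{8F6F}\x{4EF6}\x{7BA1}\x{7406}";

# U+8F6F exists in code page 936 but not in 950, so encode()'s default
# substitution turns it into '?' (0x3F) - the same question mark the
# ANSI API returns.
my $ansi = encode('cp950', $product);
printf "%vx\n", $ansi;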
Note about display in console
Beware of the console! It's an old creature that is not fully Unicode-aware; it is based on codepages. Therefore, you should expect display problems that can't be addressed. Even worse, it uses its own codepage (called the OEM codepage), which may be different from the usual ANSI codepage (although for East Asian languages, OEM codepage = ANSI codepage).
HTH.