I am trying to parse large pcap files with libpcap, but there is a file size limitation, so my files are split at 2 GB. I have 10 files of 2 GB each and I want to parse them in one shot. Is there a way to feed this data to an interface sequentially (each file separately) so that libpcap can parse them in the same run?
I am not aware of any tools that will allow you to replay more than one file at a time.
However, if you have the disk space, you can use mergecap to merge the ten files into a single file and then replay that.
Mergecap supports merging the packets either:
- in chronological order of each packet's timestamp in each file, or
- ignoring the timestamps and performing what amounts to a packet-level version of 'cat': writing the contents of the first file to the output, then the next input file, and so on.
Mergecap is part of the Wireshark distribution.
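For example (input and output filenames here are only illustrative), the two modes look like:
mergecap -w merged.pcap part1.pcap part2.pcap part3.pcap      # default: merge by packet timestamp
mergecap -a -w merged.pcap part1.pcap part2.pcap part3.pcap   # -a: simple concatenation, ignoring timestamps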
I had multiple 2 GB pcap files and used the following one-liner to go through each pcap file sequentially, applying a filter. This worked without merging the pcap files (avoiding extra disk space and CPU):
for i in /mnt/tmp1/tmp1-pcap-ens1f1-tcpdump* ; do tcpdump -nn -r "$i" host 8.8.8.8 and tcp ; done
**Explanation:**
for i in /mnt/tmp1/tmp1-pcap-ens1f1-tcpdump*   # loop over every file matching the wildcard path
do tcpdump -nn -r "$i" host 8.8.8.8 and tcp    # read each file in sequence, without resolving IPs or ports, applying the filter
done
Note: Please remember to adjust the file path and filter expression to your needs.
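If you want to keep the matching packets rather than just print them, a variation of the same loop (a sketch, with a hypothetical output naming scheme) is to have tcpdump write each file's matches out with -w:
for i in /mnt/tmp1/tmp1-pcap-ens1f1-tcpdump* ; do tcpdump -nn -r "$i" -w "filtered-$(basename "$i")" host 8.8.8.8 and tcp ; done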
I recorded a pcap file with Wireshark and now I want to get rid of the pcap header.
That is: I want my .pcap file to start directly with my Ethernet frame (source & destination MAC) and get rid of the pcap headers.
libpcap files have only a 24-byte header at the start. The following command does the trick:
cat in.pcap | tail -c +25 > out.pcap
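A quick sanity check (assuming a classic libpcap file rather than pcapng; filenames as above):
xxd -l 4 in.pcap          # shows the libpcap magic number (e.g. d4c3 b2a1) that gets stripped
ls -l in.pcap out.pcap    # out.pcap should be exactly 24 bytes smaller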
I'm running the gsutil cp command in parallel (with the -m option) on a directory with 25 4 GB JSON files (which I am also compressing with the -z option).
gsutil -m cp -z json -R dir_with_4g_chunks gs://my_bucket/
When I run it, it prints to the terminal that it is copying all but one of the files. By this I mean that it prints one of these lines per file:
Copying file://dir_with_4g_chunks/a_4g_chunk [Content-Type=application/octet-stream]...
Once the transfer for one of them is complete, it says that it'll be copying the last file.
The result is that one file starts to copy only when one of the others finishes copying, significantly slowing down the process.
Is there a limit to the number of files I can upload with the -m option? Is this configurable in the boto config file?
I was not able to find the .boto file on my Mac (as per jterrace's answer above); instead, I specified these values using the -o switch:
gsutil -m -o "Boto:parallel_thread_count=4" cp directory1/* gs://my-bucket/
This seemed to control the rate of transfer.
From the description of the -m option:
gsutil performs the specified operation using a combination of
multi-threading and multi-processing, using a number of threads and
processors determined by the parallel_thread_count and
parallel_process_count values set in the boto configuration file. You
might want to experiment with these values, as the best value can vary
based on a number of factors, including network speed, number of CPUs,
and available memory.
If you take a look at your .boto file, you should see this generated comment:
# 'parallel_process_count' and 'parallel_thread_count' specify the number
# of OS processes and Python threads, respectively, to use when executing
# operations in parallel. The default settings should work well as configured,
# however, to enhance performance for transfers involving large numbers of
# files, you may experiment with hand tuning these values to optimize
# performance for your particular system configuration.
# MacOS and Windows users should see
# https://github.com/GoogleCloudPlatform/gsutil/issues/77 before attempting
# to experiment with these values.
#parallel_process_count = 12
#parallel_thread_count = 10
I'm guessing that you're on Windows or Mac, because the default values for non-Linux machines are 24 threads and 1 process. This would result in copying 24 of your files first, then the last file afterward. Try experimenting with increasing these values to transfer all 25 files at once.
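For example, a sketch of overriding both values from the command line, reusing the -o syntax from the answer above (the exact option section and the best values depend on your gsutil version and machine):
gsutil -m -o "Boto:parallel_process_count=1" -o "Boto:parallel_thread_count=25" cp -z json -R dir_with_4g_chunks gs://my_bucket/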
The Objective
I'm trying to achieve the following:
capture network traffic containing a conversation in the FIX protocol
extract the individual FIX messages from the network traffic into a "nice" format, e.g. CSV
do some data analysis on the exported "nice" format data
I have achieved this by:
using pcap to capture the network traffic
using tshark to print the relevant data as a CSV
using Python (pandas) to analyse the data
The Problem
The problem is that some of the captured TCP packets contain more than one FIX message, which means that when I do the export to CSV using tshark I don't get a FIX message per line. This makes consuming the CSV difficult.
This is the tshark command line I'm using to extract the relevant FIX fields as CSV:
tshark -r dump.pcap \
  -R '(fix.MsgType[0]=="G" or fix.MsgType[0]=="D" or fix.MsgType[0]=="8" or fix.MsgType[0]=="F") and fix.ClOrdID != "0"' \
  -Tfields -Eseparator=, -Eoccurrence=l -e frame.time_relative \
  -e fix.MsgType -e fix.SenderCompID \
  -e fix.SenderSubID -e fix.Symbol -e fix.Side \
  -e fix.Price -e fix.OrderQty -e fix.ClOrdID \
  -e fix.OrderID -e fix.OrdStatus
Note that I'm currently using "-Eoccurrence=l" to get just the last occurrence of a named field in the case where there is more than one occurrence of a field in the packet. This is not an acceptable solution as information will get thrown away when there are multiple FIX messages in a packet.
This is what I expect to see per line in the exported CSV file (fields from one FIX message):
16.508949000,D,XXX,XXX,YTZ2,2,97480,34,646427,,
This is what I see when there is more than one FIX message (three in this case) in a TCP packet and the commandline flag "-Eoccurrence=a" is used:
16.515886000,F,F,G,XXX,XXX,XXX,XXX,XXX,XXX,XTZ2,2,97015,22,646429,646430,646431,323180,323175,301151,
The Question
Is there a way (not necessarily using tshark) to extract each individual, protocol specific message from a pcap file?
Better Solution
Using tcpflow allows this to be done properly without leaving the commandline.
My current approach is to use something like:
tshark -nr <input_file> -Y'fix' -w- | tcpdump -r- -l -w- | tcpflow -r- -C -B
tcpflow ensures that the TCP stream is followed, so no FIX messages are missed (in the case where a single TCP packet contains more than 1 FIX message). -C writes to the console and -B ensures binary output. This approach is not unlike following a TCP stream in Wireshark.
The FIX delimiters are preserved which means that I can do some handy grepping on the output, e.g.
... | tcpflow -r- -C -B | grep -P "\x0135=8\x01"
to extract all the execution reports. Note the -P argument to grep which allows the very powerful perl regex.
A (Previous) Solution
I'm using Scapy (see also Scapy Documentation, The Very Unofficial Dummies Guide to Scapy) to read in a pcap file and extract each individual FIX message from the packets.
Below is the basis of the code I'm using:
import re
from scapy.all import *

def ExtractFIX(pcap):
    """A generator that iterates over the packets in a scapy pcap iterable
    and extracts the FIX messages.

    In the case where there are multiple FIX messages in one packet, yield each
    FIX message individually."""
    for packet in pcap:
        if packet.haslayer('Raw'):
            # Only consider TCP packets which contain raw data.
            load = packet.getlayer('Raw').load
            # Ignore raw data that doesn't contain FIX.
            if 'FIX' not in load:
                continue
            # Replace the \x01 field delimiters with '|'.
            load = re.sub(r'\x01', '|', load)
            # Split out each individual FIX message in the packet by putting a
            # ';' between them and then using split(';').
            for subMessage in re.sub(r'\|8=FIX', '|;8=FIX', load).split(';'):
                # Yield each sub-message. More often than not, there will only be one.
                assert subMessage[-1:] == '|'
                yield subMessage
        else:
            continue

pcap = rdpcap('dump.pcap')
for fixMessage in ExtractFIX(pcap):
    print fixMessage
I would still like to be able to get other information from the "frame" layer of the network packet, in particular the relative (or reference) time. Unfortunately, this doesn't seem to be available from the Scapy packet object - its topmost layer is the Ether layer, as shown below.
In [229]: pcap[0]
Out[229]: <Ether dst=00:0f:53:08:14:81 src=24:b6:fd:cd:d5:f7 type=0x800 |<IP version=4L ihl=5L tos=0x0 len=215 id=16214 flags=DF frag=0L ttl=128 proto=tcp chksum=0xa53d src=10.129.0.25 dst=10.129.0.115 options=[] |<TCP sport=2634 dport=54611 seq=3296969378 ack=2383325407 dataofs=8L reserved=0L flags=PA window=65319 chksum=0x4b73 urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (581177, 2013197542))] |<Raw load='8=FIX.4.0\x019=0139\x0135=U\x0149=XXX\x0134=110169\x015006=20\x0150=XXX\x0143=N\x0152=20121210-00:12:13\x01122=20121210-00:12:13\x015001=6\x01100=SFE\x0155=AP\x015009=F3\x015022=45810\x015023=3\x015057=2\x0110=232\x01' |>>>>
In [245]: pcap[0].summary()
Out[245]: 'Ether / IP / TCP 10.129.0.25:2634 > 10.129.0.115:54611 PA / Raw'
I have two pcap files
$ capinfos cap1_stego0.pcap
File name: cap1_stego0.pcap
File type: Wireshark/tcpdump/... - libpcap
File encapsulation: Raw IP
Number of packets: 713
and
$ capinfos cap1_wlan0.pcap
File name: cap1_wlan0.pcap
File type: Wireshark/tcpdump/... - libpcap
File encapsulation: Ethernet
I want to merge them, but the encapsulation is different. If I use
mergecap -v -w asd.pcap cap1_stego0.pcap cap1_wlan0.pcap -T rawip
or
mergecap -v -w asd.pcap cap1_wlan0.pcap cap1_stego0.pcap -T rawip
Wireshark doesn't recognize the second file passed, and shows the packets of cap1_wlan0.pcap or of cap1_stego0.pcap, respectively, as raw packet data. Also, using "tcpslice" to remove the Ethernet layer of cap1_wlan0.pcap (so that both files have Raw IP encapsulation) gives me unrecognized packet data.
What can I do? Is there a way to merge pcap files with different encapsulations, or to convert eth->rawip or rawip->eth? Thank you.
One way to convert a RAW_IP file to an ethernet encapsulated file (which can then be merged with other ethernet-encapsulated files):
Use tshark to get a hex dump of the packets from the RAW_IP file:
tshark -nxr pcap-file-name | grep -vP "^ +\d" > foo.txt
(grep is used to remove the "summary" lines from the tshark output.)
Use text2pcap to convert back to a pcap file while adding dummy
ethernet headers:
text2pcap -e 0x0800 foo.txt foo.pcap
If you want to keep the timestamps, you'll have to play around a bit with the tshark output
to get a text file which contains the timestamps in a format which text2pcap will accept and also contains the hex packet info.
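Putting the two steps together with the filenames from the question, and then merging with the Ethernet capture, might look like this (as noted above, the original timestamps are not preserved this way):
tshark -nxr cap1_stego0.pcap | grep -vP "^ +\d" > stego.txt
text2pcap -e 0x0800 stego.txt cap1_stego0_eth.pcap
mergecap -w merged.pcap cap1_stego0_eth.pcap cap1_wlan0.pcap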
[[
Does tcpslice have an option to remove ethernet headers ?
(Looking at the man page, it appears that tcpslice is used to extract time-ranges from a pcap file).
If you do have a way to remove ethernet headers from a capture file, you must make sure the resulting pcap file has an encapsulation type of RAW_IP before trying to read it with wireshark, mergecap, etc.
Also note that the -T switch to mergecap just forces the encapsulation type specified in the file; the actual encapsulation isn't altered (i.e., no bytes are added/changed/deleted).
]]
To merge pcap files, try an alternative utility, tcpmerge.
sample merge command:
./tcpmerge asd.pcap cap1_wlan0.pcap cap1_stego0.pcap OUTFILEMERGED.pcap
I have a file whose contents are identical from one run to the next. It is passed into gzip and only the compressed form is stored. I'd like to be able to regenerate the gzip and update my copy only if the contents actually differ. As it stands, diffing tools (diff, xdelta, subversion) see the files as having changed.
Background: I'm storing a mysqldump of an important database in a subversion repository. The intention is that a cronjob periodically dumps the db, gzips it, and commits the file. Currently, every time the file is dumped and then gzipped it is considered to have changed. I'd prefer not to have my revision numbers needlessly increase every 15 minutes.
I realize I could store the dump as plain text, but I'd prefer not to, as it's rather large.
The command I am currently using to generate the dumps is:
mysqldump $DB --skip-extended-insert | sed '$d' | gzip -n > $REPO/$DB.sql.gz
The -n instructs gzip to remove the filename/timestamp information. The sed '$d' removes the last line of the file where mysqldump places a timestamp.
At this point, I'm probably going to revert to storing it in a plain text fashion, but I was curious as to what kind of solution there is.
Resolved: Mr. Bright was correct; I had mistakenly used a capital N when the correct argument was a lowercase one.
The -N instructs gzip to remove the
filename/timestamp information.
Actually, that does just the opposite. -n is what tells it to forget the original file name and time stamp.
I think gzip is preserving the original date and timestamp on the file(s), which will cause it to produce a different archive.
-N --name
When compressing, always save the original file
name and time stamp; this is the default. When
decompressing, restore the original file name and
time stamp if present. This option is useful on
systems which have a limit on file name length or
when the time stamp has been lost after a file
transfer.
But watch out: two gzips made at different times of the same unchanged file differ. This is because the gzip output is itself timestamped with the gzip creation date, which is written to the header of the gzip file. Thus, the apparently different gzips can contain exactly the same content.
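A quick way to see this, and to confirm that -n avoids it, is to compress the same data twice from a pipe, a second apart, and compare the results (filenames are only illustrative; exact header behaviour can vary by gzip version):
cat dump.sql | gzip > a.gz;  sleep 1;  cat dump.sql | gzip > b.gz
cmp a.gz b.gz        # typically differ: the header embeds a timestamp when reading from a pipe
cat dump.sql | gzip -n > a.gz;  sleep 1;  cat dump.sql | gzip -n > b.gz
cmp a.gz b.gz        # identical: -n omits the name and timestamp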