I'm having an issue with tagging MP4/M4A files. The tagging operation goes A-OK. Well, I had an issue with the stco atom, but I fixed that. But now, when I play the MP4 file, mplayer gives me an error:
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x29db0a0] wrong sample count
However, the file does play.
Does anybody know what I'm missing? Here's what I do in order to add my tag atoms to the MP4 file. I have a feeling I'm not updating a certain atom just like the stco atom needed to be updated with the new absolute file position references.
1. Read in up to the 'moov' atom.
2. Update the size of 'moov' to include the size of my tags (which do not exist prior to the operation).
3. Write out all data (including the updated 'moov' size) to the new file.
4. Read in up to the 'stco' atom and the 4 bytes following it (version and flag info that doesn't need to be changed).
5. Write out the 'stco' header to the new file.
6. Read in each 4-byte absolute file location, move it up by the size of the 'udta' atom I'm going to be adding, and write each updated 4-byte location to the new file.
7. Write out the 'udta' atom (which directly follows 'stco') to the new file.
8. Copy the remainder of the input file (the 'mdat' atom) to the new file.
Here's an AtomicParsley dump of the file structure:
Atom ftyp @ 0 of size: 36, ends @ 36
Atom moov @ 36 of size: 61886, ends @ 61922
    Atom mvhd @ 44 of size: 108, ends @ 152
    Atom iods @ 152 of size: 33, ends @ 185
    Atom trak @ 185 of size: 32935, ends @ 33120
        Atom tkhd @ 193 of size: 92, ends @ 285
        Atom mdia @ 285 of size: 32835, ends @ 33120
            Atom mdhd @ 293 of size: 32, ends @ 325
            Atom hdlr @ 325 of size: 37, ends @ 362
            Atom minf @ 362 of size: 32758, ends @ 33120
                Atom smhd @ 370 of size: 16, ends @ 386
                Atom dinf @ 386 of size: 36, ends @ 422
                    Atom dref @ 394 of size: 28, ends @ 422
                Atom stbl @ 422 of size: 32698, ends @ 33120
                    Atom stts @ 430 of size: 24, ends @ 454
                    Atom stsd @ 454 of size: 106, ends @ 560
                        Atom mp4a @ 470 of size: 90, ends @ 560
                            Atom esds @ 506 of size: 54, ends @ 560
                    Atom stsz @ 560 of size: 29548, ends @ 30108
                    Atom stsc @ 30108 of size: 40, ends @ 30148
                    Atom stco @ 30148 of size: 2972, ends @ 33120
    Atom udta @ 33120 of size: 28802, ends @ 61922
        Atom meta @ 33128 of size: 28794, ends @ 61922
            Atom hdlr @ 33140 of size: 34, ends @ 33174
            Atom ilst @ 33174 of size: 28748, ends @ 61922
                Atom ©ART @ 33182 of size: 33, ends @ 33215
                    Atom data @ 33190 of size: 25, ends @ 33215
                Atom ©nam @ 33215 of size: 77, ends @ 33292
                    Atom data @ 33223 of size: 69, ends @ 33292
                Atom ©alb @ 33292 of size: 34, ends @ 33326
                    Atom data @ 33300 of size: 26, ends @ 33326
                Atom covr @ 33326 of size: 28596, ends @ 61922
                    Atom data @ 33334 of size: 28588, ends @ 61922
Atom mdat @ 61922 of size: 2742564, ends @ 2804486
Dang, another stupid question, I guess. It was yet another ID-ten-T programming error. When I processed the 'stco' atom, I read in only 12 bytes (size, atom name, version, flags) and forgot to read in the 4-byte 'total entries' field. So I ended up adding the size of the 'udta' atom to the 'total entries' field, which caused the FFmpeg error. I was able to figure this out by looking at the FFmpeg source and then double-checking the structure of 'stco'.
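For anyone hitting the same thing, here is a minimal Python sketch of the corrected 'stco' handling. It is not my actual tagging code: the function name shift_stco_offsets and the two-entry demo atom are made up for illustration, and it assumes a 32-bit 'stco' (not 'co64') that has already been read into memory as one complete atom. The point is only that the 16-byte prefix (size, name, version/flags, entry count) is copied through untouched and only the offsets after it are shifted.

```python
import struct

def shift_stco_offsets(atom: bytes, delta: int) -> bytes:
    """Return a copy of a complete 'stco' atom with every chunk offset moved by delta."""
    size, name = struct.unpack(">I4s", atom[:8])
    if name != b"stco" or size != len(atom):
        raise ValueError("expected one complete 'stco' atom")
    # Bytes 8-11 hold version (1 byte) and flags (3 bytes); bytes 12-15 hold
    # the entry count, which must be copied through unchanged.
    (entry_count,) = struct.unpack(">I", atom[12:16])
    if size != 16 + 4 * entry_count:
        raise ValueError("entry count does not match atom size")
    offsets = struct.unpack(f">{entry_count}I", atom[16:])
    shifted = struct.pack(f">{entry_count}I", *(off + delta for off in offsets))
    return atom[:16] + shifted

# Hypothetical two-entry atom, shifted by the 'udta' size from the dump (28802).
demo = struct.pack(">I4sII2I", 24, b"stco", 0, 2, 1000, 2000)
patched = shift_stco_offsets(demo, 28802)
```

As a sanity check against the dump above, an 'stco' of size 2972 should hold (2972 - 16) / 4 = 739 entries.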
Using J I am trying to do something similar to the following example shown on page 128 of Mastering Dyalog APL by Bernard Legrand (2009). I have not been able to find a direct conversion of this code into J, which is what I want.
Here's the example:
BHCodes ← 83 12 12 83 43 66 50 81 12 83 14 66 etc...
BHAmounts ← 609 727 458 469 463 219 431 602 519 317 663 631...
13.3.2 - First Question
We would like to focus on some selected countries (14, 43, 50, 37, and 66) and
calculate the total amount of their sales. Let’s first identify which
items of BHCodes are relevant:
Selected ← 14 43 50 37 66
BHCodes ∊ Selected
0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 ⇦ Identifies sales in the selected countries only.
Then we can apply this filter to the amounts, and add them up:
(BHCodes ∊ Selected) / BHAmounts
463 219 431 663 631 421
+/ (BHCodes ∊ Selected) / BHAmounts
2828
+/ (BHCodes e. Selected) # BHAmounts
For your purposes here, APL's ∊ is J's e. (Member (In)) and APL's / is J's # (Copy).
Notes:
APL's ∊ and J's e. are not completely equivalent, as APL's ∊ looks for every element of its left argument among the elements of its right argument, while J's e. looks for every major cell of its left argument among the major cells of its right argument.
APL's / and J's # are not completely equivalent as APL's / operates along the trailing axis while J's # operates along the leading axis. APL does have ⌿ though, which operates along the leading axis. There are more nuances, but they are not relevant here.
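For readers who think more easily in Python, the same mask-then-copy-then-sum pattern can be sketched as below. This is only an analogy to the J expression, not a replacement for it. It uses just the twelve items visible in the excerpt (the rest are elided in the question), so the total here is smaller than the book's 2828.

```python
# Only the visible prefix of the book's data; the remaining items are elided.
bh_codes = [83, 12, 12, 83, 43, 66, 50, 81, 12, 83, 14, 66]
bh_amounts = [609, 727, 458, 469, 463, 219, 431, 602, 519, 317, 663, 631]
selected = {14, 43, 50, 37, 66}

# Membership mask, like BHCodes e. Selected
mask = [code in selected for code in bh_codes]

# Copy the amounts where the mask is true, like mask # BHAmounts
kept = [amount for amount, keep in zip(bh_amounts, mask) if keep]

# Sum them, like +/
total = sum(kept)
print(kept, total)
```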
I have read the PNG specification many times and am still confused about how I should interpret the IDAT chunk. I decompressed it using zlib and got all of the bytes that my IDAT chunk contains.
I made an example image using Krita. It's a 3x2 PNG image with a different color in every pixel.
See the 3 by 2 PNG image here
According to the PNG specification's section on filters, when the first byte of the IDAT chunk is 1, the filter method that has been applied is
Filtered(byte) = Original(byte) - Original(previous_byte)
With that formula in mind I decompressed my IDAT chunk (which was 29 bytes in length to store only 6 pixels). The first byte (which is byte number 0) contains the value 1. That is where the formula comes from.
Byte# Value
0 1
1 224
2 215
3 200
4 227
5 241
6 48
7 2
8 36
9 225
10 1
11 253
12 255
13 195
14 245
15 182
16 244
17 232
18 245
19 57
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
The first pixel is supposed to be RGB(224, 215, 200) which I reconstructed with a RGB to color converter. This seems pretty much the same color as the original pixel in the image. Here are my thoughts about all the color pixels.
Pixel 1: RGB(224, 215, 200) [read from byte 1, byte2 and byte3]
Pixel 2: RGB(195, 200, 248) [because byte 4:227 byte5:241 byte6:48]
Pixel 3: RGB(197, 236, 217) [because byte 7:2 byte8:36 byte9:225]
Pixel 4: RGB(198, 233, 217) [because byte10:1 byte11:253 byte12:255]
Pixel 5: RGB(137, 222, 142) [because byte13:195 byte14:245 byte15:182]
Pixel 6: RGB(107, 198, 131) [because byte16:244 byte17:232 byte18:245]
I have used the formula to get all the values from the pixels.
Reconstructing pixels 1, 2 and 3 gives pretty much the same colors as the original, but pixels 4, 5 and 6 are not what I expected. I think I am not reading the IDAT chunk the correct way. That could explain why there are 29 bytes for only 6 RGB pixels. I expected 19 bytes because 3 times 6 is 18, plus 1 byte for the filtering method.
The IHDR says that the bit depth is 8 and the color type is 2. According to the table in the specification, each pixel is an R, G, B triple. Could someone point me in the right direction to read the IDAT chunk and explain its length?
Your decompressed result length of 29 is not correct, which may have led to your confusion.
Your image is 3x2 RGB pixels. That would be 3*3 * 2 = 18 bytes of data, plus 1 extra byte per row; a total of 20 bytes. Somehow you got an extra 9 dummy bytes, not part of the compressed data.
(I reconstructed your tiny image from the larger one and happily got the exact same numbers, else the explanation would necessarily be purely theoretical. For ease, I determined the offset of the zipped data with a hex viewer.)
>>> import zlib
>>> with open ('3x2b.png','rb') as f:
... result = f.seek (0x6a)
... data = f.read()
...
>>> d = zlib.decompress(data)
>>> print ([x for x in d])
[1, 224, 215, 200, 227, 241, 48, 2, 36, 225, 1, 253, 255, 195, 245, 182, 244, 232, 245, 57]
This 'unpacks' to the following two rows, with 3 RGB pixel values each:
filter  RGB            RGB            RGB
1       (224,215,200)  (227,241,48)   (2,36,225)
1       (253,255,195)  (245,182,244)  (232,245,57)
All these values may be relative to an earlier result: the last complete row read before it, or the pixel to its left. For the first row, you must assume a row of all zeroes; the value "left" of the first pixel must be assumed to be 0 as well.
You see the two bytes marked 'filter'? That is where you went wrong. Each row has a filter byte of its own. You used the filter byte itself for the calculation of the second row.
Adding (the inverse of the "Sub" filter, as indicated by filter type 1), with each sum taken modulo 256, yields
; start of row 0, filter is 1 and 'initial pixel' is (0,0,0)
(224,215,200) (224+227,215+241,200+48)
=(195,200,248)
(195+2,200+36,248+225)
=(197,236,217)
; restart for row 1, filter is 1 again and start value (0,0,0):
(253,255,195) (253+245,255+182,195+244)
=(242,181,183)
(242+232,181+245,183+57)
=(218,170,240)
... exactly the colors I started out with.
This is Filter 1 ("Sub") and so uses the values to its left; for Filter 2 ("Up"), you need to use the corresponding byte in the previously decoded row, and for Average and Paeth, you need both.
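The row-by-row reconstruction above can be sketched in Python as follows. This is a minimal sketch that handles only filter type 1 ("Sub") and assumes 8-bit RGB (3 bytes per pixel); the function name undo_sub_filter is just for illustration. It treats the first byte of every row as that row's filter byte and restarts from an implied (0,0,0) pixel on each row.

```python
def undo_sub_filter(raw, width, bytes_per_pixel=3):
    """Reverse PNG filter type 1 ("Sub") on decompressed IDAT data."""
    stride = 1 + width * bytes_per_pixel  # one filter byte, then the pixel bytes
    if len(raw) % stride != 0:
        raise ValueError("data length is not a whole number of rows")
    rows = []
    for start in range(0, len(raw), stride):
        if raw[start] != 1:
            raise ValueError("this sketch only handles filter type 1 (Sub)")
        line = bytearray(raw[start + 1:start + stride])
        # Each byte is relative to the same channel of the pixel to its left;
        # the first pixel of a row has an implied left neighbour of zero.
        for i in range(bytes_per_pixel, len(line)):
            line[i] = (line[i] + line[i - bytes_per_pixel]) % 256
        rows.append([tuple(line[i:i + bytes_per_pixel])
                     for i in range(0, len(line), bytes_per_pixel)])
    return rows

# The 20 decompressed bytes shown above.
raw = bytes([1, 224, 215, 200, 227, 241, 48, 2, 36, 225,
             1, 253, 255, 195, 245, 182, 244, 232, 245, 57])
pixels = undo_sub_filter(raw, width=3)
for row in pixels:
    print(row)
```

A full decoder would dispatch on the filter byte of each row and keep the previously decoded row around for the Up, Average and Paeth filters.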
I want to first apologize for the biological nature of this post. I thought I should post some background first. I have a set of gene files that contain anywhere from one to five DNA sequences from different species. I used a bash shell script to perform blastn with each gene file as a query and a file of all transcriptome sequences (all_transcriptome_seq.fasta) from the five species as the subject. I now want to process these output files (and there are many) so that I can get all subject sequences that hit into one file per gene, with duplicate sequences removed (except to keep one), and ensure I'm getting the length of the sequences that actually hit the query.
Here is what the blastn output looks like for one gene file (columns: qseqid qlen sseqid slen qframe qstart qend sframe sstart send evalue bitscore pident nident length)
Acur_01000750.1_OFAS014956-RA-EXON04 248 Apil_comp17195_c0_seq1 1184 1 1 248 1 824 1072 2e-73 259 85.60 214 250
Acur_01000750.1_OFAS014956-RA-EXON04 248 Atri_comp5613_c0_seq1 1067 1 2 248 1 344 96 8e-97 337 91.16 227 249
Acur_01000750.1_OFAS014956-RA-EXON04 248 Acur_01000750.1 992 1 1 248 1 655 902 1e-133 459 100.00 248 248
Acur_01000750.1_OFAS014956-RA-EXON04 248 Btri_comp17734_c0_seq1 1001 1 1 248 1 656 905 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Atri_comp5613_c0_seq1 1067 1 2 250 1 344 96 1e-60 217 82.33 205 249
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Acur_01000750.1 992 1 1 250 1 655 902 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Btri_comp17734_c0_seq1 1001 1 1 250 1 656 905 1e-134 462 100.00 250 250
I've been working on a perl script that would, in short, take the sseqid column to pull out the corresponding sequences from the all_transcriptome_seq.fasta file, place these into a new file, and trim the transcripts to the sstart and send positions. Here is the script, so far:
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper;
############################################################################
# blastn_post-processing.pl v. 1.0 by Michael F., XXXXXX
############################################################################
my($progname) = $0;
############################################################################
# Initialize variables
############################################################################
my($jter);
my($com);
my($t1);
if ( @ARGV != 2 ) {
print "Usage:\n \$ $progname <infile> <transcriptomes>\n";
print " infile = tab-delimited blastn text file\n";
print " transcriptomes = fasta file of all transcriptomes\n";
print "exiting...\n";
exit;
}
my($infile)=$ARGV[0];
my($transcriptomes)=$ARGV[1];
############################################################################
# Read the input file
############################################################################
print "Reading the input file... ";
open (my $INF, $infile) or die "Unable to open file";
my @data = <$INF>;
print @data;
close($INF) or die "Could not close file $infile.\n";
my($nlines) = $#data + 1;
my($inlines) = $nlines - 1;
print "$nlines blastn hits read\n\n";
############################################################################
# Extract hits and place sequences into new file
############################################################################
my @temparray;
my @templine;
my($seqfname);
open ($INF, $infile) or die "Could not open file $infile for input.\n";
@temparray = <$INF>;
close($INF) or die "Could not close file $infile.\n";
$t1 = $#temparray + 1;
print "$infile\t$t1\n";
$seqfname = "$infile" . ".fasta";
if ( -e $seqfname ) {
print " --> $seqfname exists. overwriting\n";
unlink($seqfname);
}
# iterate through the individual hits
for ($jter=0; $jter<$t1; $jter++) {
(@templine) = split(/\s+/, $temparray[$jter]);
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
# print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
} # end for ($jter=0; $jter<$t1...
# Arguments for "extract_from_genome2"
# // argv[1] = name of genome file
# // argv[2] = gi number for contig
# // argv[3] = start of subsequence
# // argv[4] = end of subsequence
# // argv[5] = name of output sequence
Using this script, here is the output I'm getting:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
As you can see, it's pretty close to what I'm wanting. Here are the two issues I have and cannot seem to figure out how to resolve with my script. The first is that a sequence may occur more than once in the sseqid column, and with the script in its current form, it will print out duplicates of these sequences. I only need one. How can I modify my script to not duplicate sequences (i.e., how do I only retain one but remove the other duplicates)? Expected output:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
The second is that the script is not quite extracting the right base pairs. It's super close, off by one or two, but it's not exact.
For example, take the first subject hit Apil_comp17195_c0_seq1. The sstart and send values are 824 and 1072, respectively. When I go to the all_transcriptome_seq.fasta, I get
AAGATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAAC
at that base pair range, not
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
as output by my script; the sequence from the fasta file is what I'm expecting. You will also notice that the sequence output by my script is slightly shorter than it should be. Does anyone know how I can fix these issues in my script?
Thanks, and sorry for the lengthy post!
Edit 1: a solution was offered that worked for some of the infiles. However, others caused the script to output fewer sequences than expected. Here is one such infile with 9 hits, from which I was expecting only 4 sequences.
Note: this issue has been largely resolved based on the solution provided below the answer section
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Apil_comp16418_c0_seq1 2079 1 1 1587 1 416 2002 0.0 2931 100.00 1587 1587
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Atri_comp13712_c0_seq1 1938 1 1 1587 1 1651 75 0.0 1221 80.73 1286 1593
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Ctom_01003023.1 2162 1 1 1406 1 1403 1 0.0 1430 85.07 1197 1407
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Apil_comp16418_c0_seq1 2079 1 1 1437 1 1866 430 0.0 1170 81.43 1175 1443
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Atri_comp13712_c0_seq1 1938 1 1 1441 1 201 1641 0.0 2662 100.00 1441 1441
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Acur_01000228.1 2415 1 1 1440 1 2231 797 0.0 1906 90.62 1305 1440
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Apil_comp16418_c0_seq1 2079 1 3 1284 1 1714 430 0.0 1351 85.69 1102 1286
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Acur_01000228.1 2415 1 1 1287 1 2084 797 0.0 1219 83.81 1082 1291
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Ctom_01003023.1 2162 1 1 1289 1 106 1394 0.0 2381 100.00 1289 1289
Edit 2: There is still an occasional output with fewer sequences than expected, although not as many after incorporating the modifications suggested for Edit 1 into my script (i.e., accounting for the reverse direction). I cannot figure out why the script would be outputting fewer sequences in these other cases. Below is the infile in question. The output is lacking Btri_comp15171_c0_seq1:
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Apil_comp19456_c0_seq1 3549 1 1 2464 1 761 3224 0.0 4551 100.00 2464 2464
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Btri_comp15171_c0_seq1 3766 1 1 2456 1 3046 591 0.0 1877 80.53 1985 2465
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Apil_comp19456_c0_seq1 3549 1 1 2457 1 3214 758 0.0 1879 80.54 1986 2466
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Atri_comp28646_c0_seq1 1403 1 1256 2454 1 1401 203 0.0 990 81.60 980 1201
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Btri_comp15171_c0_seq1 3766 1 1 2457 1 593 3049 0.0 4538 100.00 2457 2457
You can use a hash to remove duplicates.
The code below removes duplicates based on their subject length (it keeps the rows with the larger subject length).
Just replace your # iterate through the individual hits part with:
# iterate through the individual hits
my %filterhash;
my $subject_length;
for ($jter=0; $jter<$t1; $jter++) {
(@templine) = split(/\s+/, $temparray[$jter]);
$subject_length = $templine[9] -$templine[8];
if(exists $filterhash{$templine[2]} ){
if($filterhash{$templine[2]} < $subject_length){
$filterhash{$templine[2]}= $subject_length;
}
}
else{
$filterhash{$templine[2]}= $subject_length;
}
}
my %printhash;
for ($jter=0; $jter<$t1; $jter++) {
(@templine) = split(/\s+/, $temparray[$jter]);
$subject_length = $templine[9] -$templine[8];
if(not exists $printhash{$templine[2]})
{
$printhash{$templine[2]}=1;
if(exists $filterhash{$templine[2]} and $filterhash{$templine[2]} == $subject_length ){
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
# print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
}
}
else{
if(exists $filterhash{$templine[2]} and $filterhash{$templine[2]} == $subject_length ){
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
#print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
}
}
} # end for ($jter=0; $jter<$t1...
Hope this will help you.
Edit part update
For the negative strand, you need to replace
$subject_length = $templine[9] -$templine[8];
with
if($templine[8] > $templine[9]){
$subject_length = $templine[8] -$templine[9];
}else{
$subject_length = $templine[9] -$templine[8];
}
You also need to update your extract_from_genome2 code for negative strand sequences.
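The same idea (one hit per subject, longest span wins, strand-aware) can be sketched in Python as follows. This is a hedged illustration, not a drop-in replacement for the Perl script or for extract_from_genome2: the function names are made up, and it assumes the column order given in the question, that sstart and send are 1-based inclusive positions, and that a reverse-strand hit (sstart > send) should be returned as the reverse complement. The sample lines are the infile from Edit 2.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def best_hit_per_subject(lines):
    """Map each sseqid to (start, end, strand) of its longest subject span."""
    best = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        sseqid, sstart, send = fields[2], int(fields[8]), int(fields[9])
        # Normalise so start <= end, and remember the direction of the hit.
        start, end = min(sstart, send), max(sstart, send)
        strand = "+" if sstart <= send else "-"
        if sseqid not in best or end - start > best[sseqid][1] - best[sseqid][0]:
            best[sseqid] = (start, end, strand)
    return best

def extract(sequence, start, end, strand):
    """Cut 1-based inclusive positions start..end out of a sequence string."""
    piece = sequence[start - 1:end]  # 1-based inclusive -> 0-based half-open
    if strand == "-":
        piece = piece.translate(COMPLEMENT)[::-1]  # reverse complement
    return piece

hits = """\
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Apil_comp19456_c0_seq1 3549 1 1 2464 1 761 3224 0.0 4551 100.00 2464 2464
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Btri_comp15171_c0_seq1 3766 1 1 2456 1 3046 591 0.0 1877 80.53 1985 2465
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Apil_comp19456_c0_seq1 3549 1 1 2457 1 3214 758 0.0 1879 80.54 1986 2466
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Atri_comp28646_c0_seq1 1403 1 1256 2454 1 1401 203 0.0 990 81.60 980 1201
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Btri_comp15171_c0_seq1 3766 1 1 2457 1 593 3049 0.0 4538 100.00 2457 2457
""".splitlines()

best = best_hit_per_subject(hits)
for sseqid, (start, end, strand) in best.items():
    print(sseqid, start, end, strand)
```

Because each subject gets exactly one entry in the dictionary, every sseqid in the infile is extracted once, whichever direction its longest hit runs in.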
Firstly it is a very simple example:
In a text file ('test1.txt'), the content is:
Formally, the
What I want to get is an array with the ASCII encoding result like:
dat_ascii = [70 111 114 109 97 108 108 121 44 32 116 104 101]
In the result, every character is translated to its ASCII code, even the space and comma.
Now I have a text file of about 10 MB, full of English text. I want to read it, translate every character to its ASCII code, and put the codes into a matrix (with 4096 characters per row, and many rows).
How can I do this in Matlab?
You can easily convert everything to ASCII codes with double: just cast your string to double.
To convert back, use char.
Example :
myStr = 'I have 2 apple.'
myStr =
I have 2 apple.
myASCII = double(myStr)
myASCII =
73 32 104 97 118 101 32 50 32 97 112 112 108 101 46
myChar = char(myASCII)
myChar =
I have 2 apple.
To read a text file in MATLAB, you first need to open it:
>> filePtr = fopen('test1.txt')
and then use the file pointer to read the data and convert to ASCII values:
>> C = textscan(filePtr, '%c'); ASCIIValues = double(C{1})
Note: textscan returns a cell array, so take its first cell before casting to double. Use the appropriate format specifier when you read a text file. In my case, all whitespace is ignored. For documentation, see http://www.mathworks.com/help/matlab/ref/textscan.html
We run a video service streaming movies to smartphones (iOS&Android).
We are encoding in H.264+AAC and using the mp4 container.
We have a problem that long movies (60 minutes+) take a very long time to
start playing and have tracked this down to the large size of moov
atom for these movies.
For 110 minute movies the atom is as large as 4.2Mb which obviously takes a long
time to download to a smart-phone over 3G!
Is there any way to make the moov atom smaller? We can reduce it a bit
by dropping the audio sampling rate, but obviously anything below 22kHz
would not really be acceptable.
We are using ffmpeg as the encoder, and MP4Box to move the metadata
to the front of the file. Is there any way to get it to make
a smaller moov? Any other encoders out there which make a smaller moov?
For example...
A large (280 Mb, 1h 49min) streamable mp4 (H.264, AAC) file has a large header (4.2 Mb). The file was encoded with two-pass ffmpeg and then processed with MP4Box to move the metadata to the beginning of the file:
/usr/bin/ffmpeg -i /var/lib/encoder/incoming/2388 -aspect 320:210 -threads 8 -vcodec libx264 -profile baseline -level 13 -flags +loop+mv4 -cmp 256 -partitions +parti4x4+parti8x8+partp4x4+partp8x8+partb8x8 -me_method hex -subq 7 -trellis 1 -refs 5 -bf 0 -me_range 16 -g 250 -keyint_min 25 -sc_threshold 40 -i_qfactor 0.71 -qmin 10 -qmax 51 -qdiff 4 -b:v 270k -maxrate 270k -bufsize 270k -g 30 -passlogfile /tmp/mediaservice/3100/video-IPH.ffmpeg -an -f rawvideo -pass 1 -y /dev/null
/usr/bin/ffmpeg -i /var/lib/encoder/incoming/2388 -aspect 320:210 -threads 8 -vcodec libx264 -profile baseline -level 13 -flags +loop+mv4 -cmp 256 -partitions +parti4x4+parti8x8+partp4x4+partp8x8+partb8x8 -me_method hex -subq 7 -trellis 1 -refs 5 -bf 0 -me_range 16 -g 250 -keyint_min 25 -sc_threshold 40 -i_qfactor 0.71 -qmin 10 -qmax 51 -qdiff 4 -b:v 270k -maxrate 270k -bufsize 270k -g 30 -passlogfile /tmp/mediaservice/3100/video-IPH.ffmpeg -acodec libfaac -ac 2 -b:a 32k -ar 44100 -f mp4 -pass 2 -y /var/lib/encoder/encoded/3100/video-IPH.mp4
/usr/bin/MP4Box -quiet -tmp /tmp/mediaservice/3100/ -inter 500 /var/lib/encoder/encoded/3100/video-IPH.mp4
Media info (audio sample rate = 44100):
General
Count : 278
Count of stream of this kind : 1
Kind of stream : General
Kind of stream : General
Stream identifier : 0
Count of video streams : 1
Count of audio streams : 1
Video_Format_List : AVC
Video_Format_WithHint_List : AVC
Codecs Video : AVC
Audio_Format_List : AAC
Audio_Format_WithHint_List : AAC
Audio codecs : AAC LC
Complete name : 1348645218_970458_2465.iph.mp4
File name : 1348645218_970458_2465.iph.mp4
File extension : mp4
Format : MPEG-4
Format : MPEG-4
Format/Extensions usually used : mp4 m4v m4a m4b m4p 3gpp 3gp 3gpp2 3g2 k3g jpm jpx mqv ismv isma f4v
Commercial name : MPEG-4
Format profile : Base Media
Internet media type : video/mp4
Codec ID : isom
Codec ID/Url : http://www.apple.com/quicktime/download/standalone.html
Codec : MPEG-4
Codec : MPEG-4
Codec/Extensions usually used : mp4 m4v m4a m4b m4p 3gpp 3gp 3gpp2 3g2 k3g jpm jpx mqv ismv isma f4v
File size : 272703970
File size : 260 MiB
File size : 260 MiB
File size : 260 MiB
File size : 260 MiB
File size : 260.1 MiB
Duration : 6556027
Duration : 1h 49mn
Duration : 1h 49mn 16s 27ms
Duration : 1h 49mn
Duration : 01:49:16.027
Overall bit rate : 332767
Overall bit rate : 333 Kbps
Stream size : 4230761
Stream size : 4.03 MiB (2%)
Stream size : 4 MiB
Stream size : 4.0 MiB
Stream size : 4.03 MiB
Stream size : 4.035 MiB
Stream size : 4.03 MiB (2%)
Proportion of this stream : 0.01551
HeaderSize : 4230683
DataSize : 268473217
FooterSize : 70
IsStreamable : Yes
File last modification date : UTC 2012-09-26 12:38:19
File last modification date (local) : 2012-09-26 21:38:19
Writing application : Lavf54.6.100
Video
Count : 201
Count of stream of this kind : 1
Kind of stream : Video
Kind of stream : Video
Stream identifier : 0
ID : 1
ID : 1
Format : AVC
Format/Info : Advanced Video Codec
Format/Url : http://developers.videolan.org/x264.html
Commercial name : AVC
Format profile : Baseline@L1.3
Format settings : 5 Ref Frames
Format settings, CABAC : No
Format settings, CABAC : No
Format settings, ReFrames : 5
Format settings, ReFrames : 5 frames
Format settings, GOP : M=1, N=30
Internet media type : video/H264
Codec ID : avc1
Codec ID/Info : Advanced Video Coding
Codec ID/Url : http://www.apple.com/quicktime/download/standalone.html
Codec : AVC
Codec : AVC
Codec/Family : AVC
Codec/Info : Advanced Video Codec
Codec/Url : http://developers.videolan.org/x264.html
Codec/CC : avc1
Codec profile : Baseline@L1.3
Codec settings : 5 Ref Frames
Codec settings, CABAC : No
Codec_Settings_RefFrames : 5
Duration : 6556017
Duration : 01:49:16.017
Bit rate : 270000
Bit rate : 270 Kbps
Width : 480
Width : 480 pixels
Height : 270
Height : 270 pixels
Pixel aspect ratio : 1.000
Display aspect ratio : 1.778
Display aspect ratio : 16:9
Rotation : 0.000
Frame rate mode : CFR
Frame rate mode : Constant
FrameRate_Mode_Original : VFR
Frame rate : 29.970
Frame rate : 29.970 fps
Frame count : 196484
Resolution : 8
Resolution : 8 bits
Colorimetry : 4:2:0
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8
Bit depth : 8 bits
Scan type : Progressive
Scan type : Progressive
Interlacement : PPF
Interlacement : Progressive
Bits/(Pixel*Frame) : 0.070
Stream size : 220159060
Stream size : 210 MiB (81%)
Stream size : 210 MiB
Stream size : 210 MiB
Stream size : 210 MiB
Stream size : 210.0 MiB
Stream size : 210 MiB (81%)
Proportion of this stream : 0.80732
Writing library : x264 - core 125
Writing library : x264 core 125
Writing library/Name : x264
Writing library/Version : core 125
Encoding settings : cabac=0 / ref=5 / deblock=1:0:0 / analyse=0x1:0x131 / me=hex / subme=7 / psy=1 / psy_rd=1.00:0.00 / mixed_ref=1 / me_range=16 / chroma_me=1 / trellis=1 / 8x8dct=0 / cqm=0 / deadzone=21,11 / fast_pskip=1 / chroma_qp_offset=-2 / threads=8 / lookahead_threads=1 / sliced_threads=0 / nr=0 / decimate=1 / interlaced=0 / bluray_compat=0 / constrained_intra=0 / bframes=0 / weightp=0 / keyint=30 / keyint_min=16 / scenecut=40 / intra_refresh=0 / rc_lookahead=30 / rc=2pass / mbtree=1 / bitrate=270 / ratetol=1.0 / qcomp=0.60 / qpmin=10 / qpmax=51 / qpstep=4 / cplxblur=20.0 / qblur=0.5 / vbv_maxrate=270 / vbv_bufsize=270 / nal_hrd=none / ip_ratio=1.40 / aq=1:1.00
Tagged date : UTC 2012-09-25 07:21:37
Audio
Count : 169
Count of stream of this kind : 1
Kind of stream : Audio
Kind of stream : Audio
Stream identifier : 0
ID : 2
ID : 2
Format : AAC
Format/Info : Advanced Audio Codec
Commercial name : AAC
Format profile : LC
Codec ID : 40
Codec : AAC LC
Codec : AAC LC
Codec/Family : AAC
Codec/CC : 40
Duration : 6556027
Duration : 1h 49mn
Duration : 1h 49mn 16s 27ms
Duration : 1h 49mn
Duration : 01:49:16.027
Bit rate mode : VBR
Bit rate mode : Variable
Bit rate : 58955
Bit rate : 59.0 Kbps
Maximum bit rate : 270000
Maximum bit rate : 270 Kbps
Channel(s) : 2
Channel(s) : 2 channels
Channel positions : Front: L R
Channel positions : 2/0/0
Sampling rate : 44100
Sampling rate : 44.1 KHz
Samples count : 289120791
Compression mode : Lossy
Compression mode : Lossy
Stream size : 48314149
Stream size : 46.1 MiB (18%)
Stream size : 46 MiB
Stream size : 46 MiB
Stream size : 46.1 MiB
Stream size : 46.08 MiB
Stream size : 46.1 MiB (18%)
Proportion of this stream : 0.17717
Tagged date : UTC 2012-09-25 07:21:37
Moov atom info (/moov/trak[0] - video, /moov/trak[1] - audio) sample rate 44100:
(look stsz and stts nodes in trak)
Atom ftyp @ 0 of size: 32, ends @ 32
Atom moov @ 32 of size: 4230651, ends @ 4230683
    Atom mvhd @ 40 of size: 108, ends @ 148
    Atom trak @ 148 of size: 868970, ends @ 869118
        Atom tkhd @ 156 of size: 92, ends @ 248
        Atom edts @ 248 of size: 36, ends @ 284
            Atom elst @ 256 of size: 28, ends @ 284
        Atom mdia @ 284 of size: 868834, ends @ 869118
            Atom mdhd @ 292 of size: 32, ends @ 324
            Atom hdlr @ 324 of size: 45, ends @ 369
            Atom minf @ 369 of size: 868749, ends @ 869118
                Atom vmhd @ 377 of size: 20, ends @ 397
                Atom dinf @ 397 of size: 36, ends @ 433
                    Atom dref @ 405 of size: 28, ends @ 433
                Atom stbl @ 433 of size: 868685, ends @ 869118
                    Atom stsd @ 441 of size: 149, ends @ 590
                        Atom avc1 @ 457 of size: 133, ends @ 590
                            Atom avcC @ 543 of size: 47, ends @ 590
                    Atom stts @ 590 of size: 24, ends @ 614
                    Atom stss @ 614 of size: 26340, ends @ 26954
                    Atom stsc @ 26954 of size: 52, ends @ 27006
                    Atom stsz @ 27006 of size: 785956, ends @ 812962
                    Atom stco @ 812962 of size: 56156, ends @ 869118
    Atom trak @ 869118 of size: 3361468, ends @ 4230586
        Atom tkhd @ 869126 of size: 92, ends @ 869218
        Atom edts @ 869218 of size: 36, ends @ 869254
            Atom elst @ 869226 of size: 28, ends @ 869254
        Atom mdia @ 869254 of size: 3361332, ends @ 4230586
            Atom mdhd @ 869262 of size: 32, ends @ 869294
            Atom hdlr @ 869294 of size: 45, ends @ 869339
            Atom minf @ 869339 of size: 3361247, ends @ 4230586
                Atom smhd @ 869347 of size: 16, ends @ 869363
                Atom dinf @ 869363 of size: 36, ends @ 869399
                    Atom dref @ 869371 of size: 28, ends @ 869399
                Atom stbl @ 869399 of size: 3361187, ends @ 4230586
                    Atom stsd @ 869407 of size: 91, ends @ 869498
                        Atom mp4a @ 869423 of size: 75, ends @ 869498
                            Atom esds @ 869459 of size: 39, ends @ 869498
                    **Atom stts @ 869498 of size: 2135816, ends @ 3005314**
                    Atom stsc @ 3005314 of size: 39712, ends @ 3045026
                    **Atom stsz @ 3045026 of size: 1129400, ends @ 4174426**
                    Atom stco @ 4174426 of size: 56160, ends @ 4230586
    Atom udta @ 4230586 of size: 97, ends @ 4230683
        Atom meta @ 4230594 of size: 89, ends @ 4230683
            Atom hdlr @ 4230606 of size: 33, ends @ 4230639
            Atom ilst @ 4230639 of size: 44, ends @ 4230683
                Atom ©too @ 4230647 of size: 36, ends @ 4230683
                    Atom data @ 4230655 of size: 28, ends @ 4230683
Atom mdat @ 4230683 of size: 268473217, ends @ 272703900
Atom free @ 272703900 of size: 8, ends @ 272703908
Atom free @ 272703908 of size: 62, ends @ 272703970
------------------------------------------------------
Total size: 272703970 bytes; 50 atoms total. AtomicParsley version: 0.9.0 (utf8)
Media data: 268473217 bytes; 4230753 bytes all other atoms (1.551% atom overhead).
Total free atom space: 70 bytes; 0.000% waste. Padding available: 0 bytes.
------------------------------------------------------
After re-encoding this movie with an audio sample rate of 11025, the header is much smaller:
Media info (audio sample rate = 11025), with duplicate info cropped:
General
***
HeaderSize : 1276359
Video
***
Audio
Count : 169
Count of stream of this kind : 1
Kind of stream : Audio
Kind of stream : Audio
Stream identifier : 0
ID : 2
ID : 2
Format : AAC
Format/Info : Advanced Audio Codec
Commercial name : AAC
Format profile : LC
Codec ID : 40
Codec : AAC LC
Codec : AAC LC
Codec/Family : AAC
Codec/CC : 40
Duration : 6556132
Duration : 1h 49mn
Duration : 1h 49mn 16s 132ms
Duration : 1h 49mn
Duration : 01:49:16.132
Bit rate mode : VBR
Bit rate mode : Variable
Bit rate : 37991
Bit rate : 38.0 Kbps
Maximum bit rate : 128000
Maximum bit rate : 128 Kbps
Channel(s) : 2
Channel(s) : 2 channels
Channel positions : Front: L R
Channel positions : 2/0/0
Sampling rate : 11025
Sampling rate : 11.025 KHz
Samples count : 72281355
Compression mode : Lossy
Compression mode : Lossy
Stream size : 31134257
Stream size : 29.7 MiB (12%)
Stream size : 30 MiB
Stream size : 30 MiB
Stream size : 29.7 MiB
Stream size : 29.69 MiB
Stream size : 29.7 MiB (12%)
Proportion of this stream : 0.12327
Tagged date : UTC 2012-09-25 13:20:28
Moov atom info (/moov/trak[0] - video, /moov/trak[1] - audio) sample rate 11025:
Atom ftyp # 0 of size: 32, ends # 32
Atom moov # 32 of size: 1276327, ends # 1276359
Atom mvhd # 40 of size: 108, ends # 148
Atom trak # 148 of size: 821662, ends # 821810
Atom tkhd # 156 of size: 92, ends # 248
Atom edts # 248 of size: 36, ends # 284
Atom elst # 256 of size: 28, ends # 284
Atom mdia # 284 of size: 821526, ends # 821810
Atom mdhd # 292 of size: 32, ends # 324
Atom hdlr # 324 of size: 45, ends # 369
Atom minf # 369 of size: 821441, ends # 821810
Atom vmhd # 377 of size: 20, ends # 397
Atom dinf # 397 of size: 36, ends # 433
Atom dref # 405 of size: 28, ends # 433
Atom stbl # 433 of size: 821377, ends # 821810
Atom stsd # 441 of size: 149, ends # 590
Atom avc1 # 457 of size: 133, ends # 590
Atom avcC # 543 of size: 47, ends # 590
Atom stts # 590 of size: 24, ends # 614
Atom stss # 614 of size: 26340, ends # 26954
Atom stsc # 26954 of size: 52, ends # 27006
Atom stsz # 27006 of size: 785956, ends # 812962
Atom stco # 812962 of size: 8848, ends # 821810
Atom trak # 821810 of size: 454452, ends # 1276262
Atom tkhd # 821818 of size: 92, ends # 821910
Atom edts # 821910 of size: 36, ends # 821946
Atom elst # 821918 of size: 28, ends # 821946
Atom mdia # 821946 of size: 454316, ends # 1276262
Atom mdhd # 821954 of size: 32, ends # 821986
Atom hdlr # 821986 of size: 45, ends # 822031
Atom minf # 822031 of size: 454231, ends # 1276262
Atom smhd # 822039 of size: 16, ends # 822055
Atom dinf # 822055 of size: 36, ends # 822091
Atom dref # 822063 of size: 28, ends # 822091
Atom stbl # 822091 of size: 454171, ends # 1276262
Atom stsd # 822099 of size: 91, ends # 822190
Atom mp4a # 822115 of size: 75, ends # 822190
Atom esds # 822151 of size: 39, ends # 822190
Atom stts # 822190 of size: 161368, ends # 983558
Atom stsc # 983558 of size: 1480, ends # 985038
Atom stsz # 985038 of size: 282372, ends # 1267410
Atom stco # 1267410 of size: 8852, ends # 1276262
Atom udta # 1276262 of size: 97, ends # 1276359
Atom meta # 1276270 of size: 89, ends # 1276359
Atom hdlr # 1276282 of size: 33, ends # 1276315
Atom ilst # 1276315 of size: 44, ends # 1276359
Atom ©too # 1276323 of size: 36, ends # 1276359
Atom data # 1276331 of size: 28, ends # 1276359
Atom mdat # 1276359 of size: 251293325, ends # 252569684
Atom free # 252569684 of size: 8, ends # 252569692
Atom free # 252569692 of size: 62, ends # 252569754
------------------------------------------------------
Total size: 252569754 bytes; 50 atoms total. AtomicParsley version: 0.9.0 (utf8)
Media data: 251293325 bytes; 1276429 bytes all other atoms (0.505% atom overhead).
Total free atom space: 70 bytes; 0.000% waste. Padding available: 0 bytes.
------------------------------------------------------
On a slow connection this movie starts playing only after 30-40 seconds, once the header info (4.2 MB) has finished downloading. I need the movie to start playing as fast as possible. I have the following questions:
How can I reduce the size of the movie header?
How can I reduce the size of /moov[0]/trak[1]/mdia[0]/minf[0]/stbl[0], and why is it so big when the sample rate is 44100?
It looks like the AAC encoder or ffmpeg is doing a bad job encoding the AAC stream. The sample rate is not the issue here. Have you tried using the other AAC encoder?
-acodec aac -strict experimental
ffmpeg usually uses very small chunk sizes, which consequently leads to larger headers. The same goes for video.
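To get a feel for the chunk sizes involved, the stsz and stco sizes from the two AtomicParsley dumps in the question can be converted into a rough samples-per-chunk figure. This is only a sketch: it assumes the standard ISO base media layout (stco has 16 fixed bytes plus 4 bytes per chunk offset; stsz has 20 fixed bytes plus 4 bytes per sample when sample sizes vary), and the function names are made up for illustration.

```python
# Assumed box layouts (ISO base media file format):
#   stco: 8-byte header + 4 bytes version/flags + 4-byte entry count,
#         then one 4-byte chunk offset per chunk.
#   stsz: 8-byte header + 4 bytes version/flags + 4-byte sample_size
#         + 4-byte sample_count, then one 4-byte size per sample
#         (only when the samples have varying sizes).
STCO_FIXED_BYTES = 16
STSZ_FIXED_BYTES = 20
ENTRY_BYTES = 4


def stco_chunk_count(atom_size):
    """Number of chunk offsets stored in an stco atom of this size."""
    return (atom_size - STCO_FIXED_BYTES) // ENTRY_BYTES


def stsz_sample_count(atom_size):
    """Number of per-sample sizes stored in an stsz atom of this size."""
    return (atom_size - STSZ_FIXED_BYTES) // ENTRY_BYTES


def samples_per_chunk(stsz_size, stco_size):
    """Average number of samples that share one chunk."""
    return stsz_sample_count(stsz_size) / stco_chunk_count(stco_size)


if __name__ == "__main__":
    # Audio track sizes copied from the two dumps in the question.
    for label, stsz_size, stco_size in [
        ("44100 Hz audio", 1129400, 56160),
        ("11025 Hz audio", 282372, 8852),
    ]:
        print(label,
              stco_chunk_count(stco_size), "chunks,",
              round(samples_per_chunk(stsz_size, stco_size), 1),
              "samples per chunk")
```

Each chunk costs only 4 bytes in stco, so the chunk count by itself accounts for a small part of the audio stbl here.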
These two atom sizes from the audio track are quite extreme:
Atom stts # 869498 of size: 2135816, ends # 3005314
Atom stsz # 3045026 of size: 1129400, ends # 4174426
I would try a different encoder if the other AAC encoder ends up looking the same.
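The stts size quoted above can be put in perspective by comparing its entry count with the number of AAC frames in the stream. The sketch below assumes the standard stts layout (16 fixed bytes plus 8 bytes per run-length entry of sample count and duration) and 1024 PCM samples per AAC-LC frame; the "Samples count" values are taken from the MediaInfo output in the question, and the function names are illustrative only.

```python
# Assumption: AAC-LC packs 1024 PCM samples into every frame.
AAC_FRAME_SAMPLES = 1024

# Assumed stts layout: 8-byte header + 4 bytes version/flags + 4-byte
# entry count, then 8 bytes per entry (sample_count, sample_delta).
STTS_FIXED_BYTES = 16
STTS_ENTRY_BYTES = 8


def stts_entry_count(atom_size):
    """Number of (count, delta) runs stored in an stts atom of this size."""
    return (atom_size - STTS_FIXED_BYTES) // STTS_ENTRY_BYTES


def expected_aac_frames(pcm_samples):
    """Frames needed to hold this many PCM samples, rounded up."""
    return (pcm_samples + AAC_FRAME_SAMPLES - 1) // AAC_FRAME_SAMPLES


if __name__ == "__main__":
    # (label, stts atom size, MediaInfo "Samples count") from the question.
    for label, stts_size, pcm_samples in [
        ("44100 Hz audio", 2135816, 289120791),
        ("11025 Hz audio", 161368, 72281355),
    ]:
        entries = stts_entry_count(stts_size)
        frames = expected_aac_frames(pcm_samples)
        print(label, entries, "stts entries for", frames, "frames",
              "(%.2f entries per frame)" % (entries / frames))
```

Because stts is run-length coded, a track whose frames all have the same duration can in principle be described by a handful of entries. An entry count close to the frame count, as in the 44100 Hz file, suggests the muxer wrote almost every frame with its own duration entry.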
Instead of using ffmpeg and then mp4box, I recommend looking at a branch of ffmpeg called ffmbc. It can put the header at the start while transcoding instead of doing it as a post-process. But given that it's an ffmpeg branch, I am not sure whether it will help with your header size issue. It's worth a try, though.
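Whichever tool ends up writing the file, it is easy to check whether the header really sits at the start by listing the top-level atoms and confirming that moov comes before mdat. The following is a minimal sketch that assumes the usual atom header (4-byte big-endian size followed by a 4-byte type, with size 1 meaning a 64-bit size follows and size 0 meaning the atom runs to the end of the file); the helper names are hypothetical.

```python
import io
import struct


def top_level_atoms(stream):
    """Return (type, offset, size) for each top-level atom in a binary stream."""
    atoms = []
    offset = 0
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break  # end of file or truncated header
        size, kind = struct.unpack(">I4s", header)
        header_len = 8
        if size == 1:
            # A 64-bit size follows the type field.
            extended = stream.read(8)
            if len(extended) < 8:
                break
            size = struct.unpack(">Q", extended)[0]
            header_len = 16
        elif size == 0:
            # The atom extends to the end of the file.
            stream.seek(0, io.SEEK_END)
            size = stream.tell() - offset
        if size < header_len:
            break  # malformed atom, stop instead of looping forever
        atoms.append((kind.decode("latin-1"), offset, size))
        offset += size
        stream.seek(offset)
    return atoms


def moov_before_mdat(atoms):
    """True when the moov header precedes the media data."""
    names = [name for name, _, _ in atoms]
    return ("moov" in names and "mdat" in names
            and names.index("moov") < names.index("mdat"))
```

To use it on a real file, open the file in binary mode and pass the file object to top_level_atoms, then pass the result to moov_before_mdat. Both dumps in the question already show moov ahead of mdat, so for those files the remaining problem is the header size rather than its position.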