Value too large for defined data type with the comm command on Solaris

When I use the comm command to compare two files, one 2 GB and one 1.7 GB, I get the following error:
Value too large for defined data type
I tried the following command.
comm -23 file1.txt file2.txt
Solaris Generic_150401-32 i86pc
Kindly help

As Sathiyadasan writes, Solaris 10 comm can't handle large files (>2GB).
I offer 3 options:
1) download the GNU version of comm and use that on Solaris 10
2) move to Solaris 11 and use /usr/gnu/bin/comm
3) write a more complicated script, depending on what you're trying to accomplish:
Reducing your data first might make the problem more manageable; if the files contain many duplicate lines, this works well. If you're trying to find lines that are unique to the first file, but don't care about the order of the lines within the file, you could use the commands below (a further sketch follows them):
sort -o file1.smaller -u file1.txt
sort -o file2.smaller -u file2.txt
comm -23 file1.smaller file2.smaller
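If even the de-duplicated files stay over the 2 GB limit, one shape option 3 could take is a nawk one-liner that avoids comm entirely. This is only a sketch: it assumes the machine has enough memory to hold the unique lines of file2.txt, the file names are simply the ones from the question, and older Solaris awk builds may have large-file limits of their own, so test on a small sample first.
# print every line of file1.txt that never appears in file2.txt
# (loads file2.txt into memory; preserves the original order of file1.txt)
nawk 'NR==FNR { seen[$0]; next } !($0 in seen)' file2.txt file1.txt > only-in-file1.txt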
Really, how you handle this depends on the nature of your data and what you're trying to discover.
Good luck!

Related

How do I run this Bismark Bisulfite Sequencing program?

I am very new to coding, so I'm not really sure how to approach this. I want to look at some data we got and run it through Bismark. I have already used Trim Galore to trim the reads; now I want to get the data into Bismark, but I'm not exactly sure how. The documentation says it requires Perl to run, so I downloaded Perl along with the Bismark zip file from GitHub. I also downloaded the bowtie2 zip file and extracted both zip files into the same directory. I then opened the Perl command prompt and set the directory to the one with my extracted folders.
I put this line in:
> \bismark\bismark_genome_preparation --path_to_bowtie ^
C:\Users\sevro\Documents\Lab_Code\bowtie2-master --verbose ^
C:\Users\sevro\Documents\Lab_Code\genome
The system cannot find the path specified.
I also tried this after changing the directory to the Bismark folder:
> perl bismark
Failed to execute Bowtie 2 porperly (return code of 'bowtie2 --version' was 256).
Please install Bowtie 2 or HISAT2 first and make sure it is in the PATH,
or specify the path to the Bowtie 2 with --path_to_bowtie2 /path/to/bowtie2,
or --path_to_hisat2 /path/to/hisat2
I tried a few other things but all in all I am a bit confused on how exactly to approach this. Things I have downloaded right now:
Bismark zip file- https://github.com/FelixKrueger/Bismark
Bowtie2 zip file- https://github.com/BenLangmead/bowtie2
A genome assembly in .fa format
The data that I want to analyze in fasta format
Any insight would be helpful.
I think Bismark and bowtie2 only support Linux and macOS natively. If you want to use Bismark on Windows, you can try installing it via a *nix emulation layer such as Cygwin or MSYS2, or simply use WSL. I tested this on Windows 11 with WSL running Ubuntu 20.04:
Downloaded bowtie2-2.4.4-linux-x86_64.zip and extracted to ~/bowtie2/bowtie2-2.4.4-linux-x86_64 folder.
Downloaded Bismark-0.23.1.zip and extracted to ~/bismark/Bismark-0.23.1/
Tested installation:
$ perl --version
This is perl 5, version 30, subversion 0 (v5.30.0) built for x86_64-linux-gnu-thread-multi (with 50 registered patches, see perl -V for more detail)
$ perl bismark --path_to_bowtie2 ../../bowtie2/bowtie2-2.4.4-linux-x86_64/
Bowtie 2 seems to be working fine (tested command '../../bowtie2/bowtie2-2.4.4-linux-x86_64/bowtie2 --version' [2.4.4])
Output format is BAM (default)
Did not find Samtools on the system. Alignments will be compressed with GZIP instead (.sam.gz)
Genome folder was not specified!
DESCRIPTION
The following is a brief description of command line options and arguments to control the Bismark
bisulfite mapper and methylation caller. Bismark takes in FastA or FastQ files and aligns the
reads to a specified bisulfite genome. Sequence reads are transformed into a bisulfite converted forward strand
version (C->T conversion) or into a bisulfite treated reverse strand (G->A conversion of the forward strand).
Each of these reads are then aligned to bisulfite treated forward strand index of a reference genome
(C->T converted) and a bisulfite treated reverse strand index of the genome (G->A conversion of the
forward strand, by doing this alignments will produce the same positions). These 4 instances of Bowtie 2 or HISAT2
are run in parallel. The sequence file(s) are then read in again sequence by sequence to pull out the original
sequence from the genome and determine if there were any protected C's present or not.
The final output of Bismark is in BAM/SAM format by default, described in more detail below.
USAGE: bismark [options] <genome_folder> {-1 <mates1> -2 <mates2> | <singles>}
[...]
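To go from the help text above to an actual run under WSL, the next steps would look roughly like the sketch below. The directory layout and file names (~/genome/, ~/reads/sample_trimmed.fq) are only examples, so adjust them to wherever you extracted things, and double-check the option names against your Bismark version.
# inside WSL: make bowtie2 findable, then prepare the genome and align
export PATH="$HOME/bowtie2/bowtie2-2.4.4-linux-x86_64:$PATH"
cd ~/bismark/Bismark-0.23.1
perl bismark_genome_preparation ~/genome/          # folder containing your .fa assembly
perl bismark --path_to_bowtie2 ~/bowtie2/bowtie2-2.4.4-linux-x86_64/ ~/genome/ ~/reads/sample_trimmed.fq
The genome preparation step only has to be done once per genome folder; after that, bismark can be pointed at the same folder for every sample.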

Sorting and removing duplicates from single or multiple large files

I have a 70 GB file with 400 million+ lines (JSON). My end goal is to remove duplicate lines so I have a fully "de-duped" version of the file. I am doing this on a machine with 8 cores and 64 GB of RAM.
I am also expanding on this thread, 'how to sort out duplicates from a massive list'.
Things I have tried:
Neek (JavaScript) - quickly runs out of memory
Using awk (doesn't seem to work for this)
Using Perl (perl -ne 'print unless $dup{$_}++;') - again, runs out of memory
sort -u largefile > targetfile
does not seem to work. I think the file is too large.
Current approach:
Split the files into chunks of 5million lines each.
Sort/Uniq each of the files
for X in *; do sort -u --parallel=6 $X > sorted/s-$X; done
Now I have 80 individually sorted files. I am trying to re-merge/de-dupe them using sort -m. This seems to do nothing, as the file size and line count end up being the same.
Since sort -m does not seem to work, i am currently trying this:
cat *.json | sort > big-sorted.json
then I will try to run uniq with
uniq big-sorted.json > unique-sorted.json
Based on past experience, I do not believe this will work.
What is the best approach here? How do I re-merge the files and remove any duplicate lines at this point?
Update 1
As I suspected, cat * | sort > bigfile did not work. It just copied everything to a single file the way it was previously sorted (in individual files).
Update 2:
I also tried the following code:
cat *.json | sort --parallel=6 -m > big-sorted.json
The result was the same as the previous update.
I am fresh out of ideas.
Thanks!
After some trial and error, I found the solution:
sort -us -o out.json infile.json
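For what it's worth, the already-sorted chunks from the earlier attempt could also have been combined without a full re-sort: sort -m merges files that are each already sorted, and adding -u drops the duplicates during the merge. A sketch, assuming the chunk names produced by the loop in the question:
sort -mu sorted/s-* -o deduped.json
The earlier attempts appeared to "do nothing" because -m on its own only merges; it never removes duplicate lines.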

How to grep over 1 million files?

I need to grep about 1 million files. If there's a better way to do this, let me know. I was thinking there may be a faster way to do it in Perl.
What I'm trying to do is export every line that contains the text httpsfile.
Here's what I'm trying to run:
grep 'httpsfile' * >> grepped.txt
Here's the error I'm getting:
-bash: /bin/grep: Argument list too long
Any help would be appreciated.
You can do it in parallel if you want:
ls > /tmp/files
parallel -a /tmp/files --xargs -s 100 grep 'httpsfile'
Unless you have a lot of RAM and your one million files are already in the buffer cache, parallelizing won't be of any help, given that the operation will be I/O bound. Here is the fastest, still portable (POSIX) way:
find . -exec grep httpsfile {} + > grepped.txt
Note that, unlike the accepted answer's solution, using find won't fail with oddly named files. Have a look at https://unix.stackexchange.com/questions/128985/why-not-parse-ls
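If the million files all sit directly in one directory (as the original * glob suggests), GNU find's -maxdepth option (not strictly POSIX) keeps the search non-recursive. A sketch using the pattern and output file from the question:
find . -maxdepth 1 -type f -exec grep 'httpsfile' {} + > grepped.txt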
Try ls | xargs grep httpsfile.
Just change * to ./ or whatever the root directory that contains the 1 million files is. You might need to add -r as well to make grep recursive and look into nested directories.
* in the shell expands out into all the files.
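Spelled out, changing * to . and adding -r looks something like this, with the pattern and output file taken from the question:
grep -r 'httpsfile' . > grepped.txt
Because only the single argument . is passed to grep, the argument-list limit no longer applies.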

grep command to print follow-up lines after a match

How do I use the "grep" command to find a match and print the 10 lines that follow the match? I need this to pull some error statements from log files (otherwise I would have to download the files, search for the log time, and then copy the content by hand). Instead of downloading bulky files, I want to run a command that returns just those lines.
A default install of Solaris 10 or 11 will have the /usr/sfw/bin file tree. GNU grep - /usr/sfw/bin/ggrep - is there. ggrep supports /usr/sfw/bin/ggrep -A 10 [pattern] [file], which does what you want.
Solaris 9 and older may not have it. Or your system may not have been a default install. Check.
Suppose you have the file /etc/passwd and want to filter for the user "chetan".
Try the command below:
cat /etc/passwd | /usr/sfw/bin/ggrep -A 2 'chetan'
It will print the line containing "chetan" and the next two lines as well.
-- Tested in Solaris 10 --
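Applied to the original log-file question, the same idea would look something like the line below; the pattern and log file name are placeholders:
/usr/sfw/bin/ggrep -A 10 'ERROR' /var/log/mylog.log > matches-with-context.txt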

Why does grep hang when run against the / directory?

My question is in two parts :
1) Why does grep hang when I grep all files under "/" ?
for example :
grep -r 'h' ./
(Note: right before the hang/crash, I see some "no such device or address" messages regarding sockets.)
Of course, I know that grep shouldn't run against a socket, but I would think that since sockets are just files in Unix, it should return a negative result rather than crashing.
2) Now, my follow-up question: in any case, how can I grep the whole filesystem? Are there certain *NIX directories which we should leave out when doing this? In particular, I'm looking for all recently written log files.
As #ninjalj said, if you don't use -D skip, grep will try to read all your device files, socket files, and FIFO files. In particular, on a Linux system (and many Unix systems), it will try to read /dev/zero, which appears to be infinitely long.
You'll be waiting for a while.
If you're looking for a system log, starting from /var/log is probably the best approach.
If you're looking for something that really could be anywhere in your file system, you can do something like this:
find / -xdev -type f -print0 | xargs -0 grep -H pattern
The -xdev argument to find tells it to stay within a single filesystem; this will avoid /proc and /dev (as well as any mounted filesystems). -type f limits the search to ordinary files. -print0 prints the file names separated by null characters rather than newlines; this avoids problems with files having spaces or other funny characters in their names.
xargs reads a list of file names (or anything else) on its standard input and invokes the specified command on everything in the list. The -0 option works with find's -print0.
The -H option to grep tells it to prefix each match with the file name. By default, grep does this only if there are two or more file names on its command line. Since xargs splits its arguments into batches, it's possible that the last batch will have just one file, which would give you inconsistent results.
Consider using find ... -name '*.log' to limit the search to files with names ending in .log (assuming your log files have such names), and/or using grep -I ... to skip binary files.
Note that all this depends on GNU-specific features. Some of these options might not be available on MacOS (which is based on BSD) or on other Unix systems. Consult your local documentation, and consider installing GNU findutils (for find and xargs) and/or GNU grep.
Before trying any of this, use df to see just how big your root filesystem is. Mine is currently 268 gigabytes; searching all of it would probably take several hours. A few minutes spent (a) restricting the files you search and (b) making sure the command is correct will be well worth the time you spend.
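Putting those restrictions together for the stated goal (recently written log files), here is a sketch with example values for the pattern, the age limit, and the starting directory:
# .log files under /var/log modified within the last day, skipping binary files
find /var/log -xdev -type f -name '*.log' -mtime -1 -print0 | xargs -0 grep -I -H 'pattern'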
By default, grep tries to read every file. Use -D skip to skip device files, socket files and FIFO files.
If you keep seeing error messages, then grep is not hanging. Keep iotop open in a second window to see how hard your system is working to pull all the contents off its storage media into main memory, piece by piece. This operation is bound to be slow unless you have a very bare-bones system.
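Applied to the original command, that advice translates into something like the sketch below; 2>/dev/null is optional and just hides the permission-denied and device error messages:
grep -r -D skip 'h' ./ 2>/dev/null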
Now, my follow-up question: in any case, how can I grep the whole filesystem? Are there certain *NIX directories which we should leave out when doing this? In particular, I'm looking for all recently written log files.
Grepping the whole FS is very rarely a good idea. Try grepping the directory where the log files should have been written; likely /var/log. Even better, if you know anything about the names of the files you're looking for (say, they have the extension .log), then do a find or locate and grep the files reported by those programs.
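For instance, with a locate implementation that supports null-separated output (mlocate and plocate do), the locate-then-grep idea could look like this sketch; the pattern is a placeholder and the locate database needs to be reasonably up to date:
locate -0 '*.log' | xargs -0 grep -l 'pattern'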