Trying to merge 6000 small Parquet files into a single Parquet file

I have 6000 Parquet files (5-15 KB each) in HDFS, which makes Spark create that many tasks. I need to merge them into a single file.
I have already tried the two commands below. The problem with the first one is that it generates a text file, and I need a Parquet file as output.
The second one works fine with 300-400 files but fails with a "Too many files open" error when I try it on 6000 files.
1.)
hadoop jar \
hadoop-streaming-3.2.0.jar \
-Dmapred.reduce.tasks=1 \
-Dmapred.job.queue.name=queue \
-Dstream.reduce.output=parquet \
-input "input file" \
-output "output file" \
-mapper cat \
-reducer cat
2.)
hadoop jar parquet-tools-1.9.0.jar merge /inputfile /outputfile
So, any help is appreciated here.

You can raise the open-file limit in your OS to more than 6000, so that all input files can be open at once.
Check the current limit:
ulimit -a | grep open
The limit is configured in:
/etc/security/limits.conf
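As a rough sketch of the check described above, the snippet below compares the soft limit of the current shell with the number of files to be opened and tries to raise it for the session. The function name has_enough_fds and the values 6100 and 8192 are illustrative assumptions, not values from the question; the soft limit can only be raised up to the hard limit, and anything beyond that still requires editing /etc/security/limits.conf.

```shell
#!/bin/sh
# Succeeds if the soft open-file limit of this shell is above the
# number of files we need open at once.
# "needed" is a hypothetical parameter: file count plus some headroom.
has_enough_fds() {
    needed=$1
    soft=$(ulimit -n)            # soft limit for the current shell
    [ "$soft" = "unlimited" ] && return 0
    [ "$soft" -gt "$needed" ]
}

echo "soft limit: $(ulimit -n)"
echo "hard limit: $(ulimit -H -n)"

if has_enough_fds 6100; then
    echo "limit is high enough for 6000 files"
else
    # Raising the soft limit works only up to the hard limit; going
    # beyond that needs a change in /etc/security/limits.conf.
    ulimit -n 8192 2>/dev/null || echo "cannot raise limit: hard limit too low"
fi
```

The new limit applies only to this shell and the processes it starts, so the merge command would have to be run from the same session.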

Loading tables with partitions in ora2pg

I am having trouble importing data selectively. I have a table with 32 partitions and only want to import data for 2 partitions at a time. So I used the directive below in my *.conf file, but when I execute ora2pg it brings in all the partitions. I also tried the -t PARTITION option to exclude the ones I don't need, but then it doesn't bring in any of them.
REPLACE_QUERY PH*** [SELECT * FROM PH** WHERE TIMECREATED between '2021-03-01' and '2021-06-01']
$ ora2pg -t COPY -o data.sql -b ./data -c $HOME/o2p/oracle_service_name/config/ora2pg.conf -l data_ext.log -t PARTITION -e 'PARTITION[INVOICES_Q1 NVOICES_Q10 INVOICES_Q11 INVOICES_Q12 INVOICES_Q13 INVOICES_Q14 INVOICES_Q15 INVOICES_Q16 INVOICES_Q17 INVOICES_Q18 INVOICES_Q19 INVOICES_Q2 INVOICES_Q20 INVOICES_Q21 INVOICES_Q22 INVOICES_Q23 INVOICES_Q26 INVOICES_Q27 INVOICES_Q28 INVOICES_Q29 INVOICES_Q3 INVOICES_Q30 INVOICES_Q31 INVOICES_Q32 INVOICES_Q4 INVOICES_Q5 INVOICES_Q6 INVOICES_Q7 INVOICES_Q8]'
[========================>] 0/0 partitions (100.0%) end of output.

How to exclude snapshots while running tar in Solaris

I'm trying to take a tar of the '/home/store/' directory content.
tar cvf store.tar /home/store/
While doing so, I can see that the .snapshot directories are also included. My understanding is that snapshots are a kind of backup. Can I skip them? If so, how? I tried excluding a test directory with the command below, run from /home/store/:
tar cvfX store.tar <(echo /home/store/test) /home/store/
But this does not exclude the test directory from the resulting archive.
I also tried this:
tar cvf store.tar /home/store/ --exclude-file=exclude.txt
Output:
a /home/store// 0K
a /home/store//.profile 1K
a /home/store//local.profile 1K
a /home/store//.vas_logon_server 1K
a /home/store//.vas_disauthcc_611400381 1K
a /home/store//.bash_history 7K
a /home/store//test/ 0K
a /home/store//test/1.txt 1K
a /home/store//test/migrate-perf3.txt 3958K
a /home/store//test.txt 1K
a /home/store//exclude.txt 1K
a /home/store//.snapshot/hourly.0/d2/dd/d5d/f82-1 59K
a /home/store//.snapshot/hourly.0/d2/dd/d5d/f83-1 58K
.....
tar: --exclude-file=exclude.txt: No such file or directory
/home/store/exclude.txt has the entry 'test'. I also tried the following entries and got the same error.
/home/store/test/
/home/store/test/1.txt
When I gave the full path to 'exclude.txt' like this
`tar cvf store.tar /home/store/ --exclude-file=/home/store/exclude.txt`
it gives the error below:
tar: can't change directories to --exclude-file=/home/store: No such file or directory
tar -h
Usage: tar {c|r|t|u|x}[BDeEFhilmnopPqTvw#[0-7]][bfk][X...] [blocksize] [tarfile] [size] [exclude-file...] {file | -I include-file | -C directory file}...
Thanks in advance!
Van Peer
Try this:
tar cvfX /var/tmp/src.tar /var/tmp/excl.txt /var/tmp/src/
Your exclude file should contain the path:
/home/store//.snapshot
It is best practice not to use the full (absolute) path of the directory you archive. With absolute paths stored in the archive, extracting it later from, say, /var/tmp could overwrite files such as those under /etc.
For example:
sudo tar -zcvpf /backup/farm-backup-$(date +%d-%m-%Y).tar.gz --exclude ".snapshots" --exclude ".cache" farm
Note that the directory is given without a leading slash (farm, not /farm). Run the tar command from the /home directory to back up the "farm" user.
The archive is written to the /backup directory at the filesystem root.
OS: openSUSE 15.1
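To see the effect of relative paths and --exclude without touching real data, here is a small self-contained sketch. It assumes GNU tar (as on openSUSE), not the Solaris tar from the question, and the directory layout under the temporary directory is made up for the demo.

```shell
#!/bin/sh
# Sketch for GNU tar. The directory layout is invented for the demo.
work=$(mktemp -d)
mkdir -p "$work/home/farm/.snapshots/hourly.0" "$work/home/farm/docs"
echo data > "$work/home/farm/docs/a.txt"
echo old  > "$work/home/farm/.snapshots/hourly.0/a.txt"

# -C switches to the parent directory first, so the archive stores the
# relative name "farm/..." instead of an absolute path.
tar -czf "$work/farm-backup.tar.gz" --exclude='.snapshots' -C "$work/home" farm

# List the archive to confirm what went in, then clean up.
listing=$(tar -tzf "$work/farm-backup.tar.gz")
echo "$listing"
rm -rf "$work"
```

Listing the archive with tar -t before relying on it is a cheap way to confirm that the snapshot directory was really skipped and that no member name starts with a slash.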

Merge chromosomes in PLINK

I have downloaded the 1000G dataset in VCF format. Using Plink 2.0, I have converted the files into binary format.
Now I need to merge chromosomes 1-22.
I am using this script:
${BIN}plink2 \
--bfile /mnt/jw01-aruk-home01/projects/jia_mtx_gwas_2016/common_files/data/clean/thousand_genomes/from_1000G_web/chr1_1000Gv3 \
--make-bed \
--merge-list /mnt/jw01-aruk-home01/projects/jia_mtx_gwas_2016/common_files/data/clean/thousand_genomes/from_1000G_web/chromosomes_1000Gv3.txt \
--out /mnt/jw01-aruk-home01/projects/jia_mtx_gwas_2016/common_files/data/clean/thousand_genomes/from_1000G_web/all_chrs_1000G_v3 \
--noweb
But I get this error:
Error: --merge-list only accepts 1 parameter.
The chromosomes_1000Gv3.txt has files related to chromosomes 2-22 in this format:
chr2_1000Gv3.bed chr2_1000Gv3.bim chr2_1000Gv3.fam
chr3_1000Gv3.bed chr3_1000Gv3.bim chr3_1000Gv3.fam
....
Any suggestions what might be the issue?
Thanks
--merge-list cannot be used in combination with --bfile. In a single plink command you can use either --bfile/--bmerge or --merge-list, not both.
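Independently of which flags end up on the command line, the list file described in the question (one line per chromosome, naming the .bed, .bim and .fam files) can be generated with a loop rather than typed by hand. This is only a sketch: it assumes the chrN_1000Gv3 naming from the question and writes the list into the current directory.

```shell
#!/bin/sh
# Write one line per chromosome (2-22) in the three-column form shown in
# the question. Chromosome 1 is left out because the question passes it
# separately. The output name matches the question's list file.
list=chromosomes_1000Gv3.txt
: > "$list"                      # start with an empty file
chr=2
while [ "$chr" -le 22 ]; do
    prefix="chr${chr}_1000Gv3"
    echo "$prefix.bed $prefix.bim $prefix.fam" >> "$list"
    chr=$((chr + 1))
done
wc -l < "$list"                  # number of lines written
```

Generating the file this way also avoids stray characters or trailing whitespace that are easy to introduce when editing the list manually.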

wget --warc-file --recursive, prevent writing individual files

I run wget to create a warc archive as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 http://www.epfl.ch/
$ l -h /tmp/epfl.warc.gz
-rw-r--r-- 1 david wheel 657K Sep 2 15:18 /tmp/epfl.warc.gz
$ find .
./www.epfl.ch/index.html
./www.epfl.ch/public/hp2013/css/homepage.70a623197f74.css
[...]
I only need the epfl.warc.gz file. How do I prevent wget from creating all the individual files?
I tried as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 --output-document=/dev/null http://www.epfl.ch/
ERROR: -k or -r can be used together with -O only if outputting to a regular file.
tl;dr Add the options --delete-after and --no-directories.
Option --delete-after instructs wget to delete each downloaded file immediately after its download is complete. As a consequence, the maximum disk usage during execution will be the size of the WARC file plus the size of the single largest downloaded file.
Option --no-directories prevents wget from leaving behind a useless tree of empty directories. By default wget creates a directory tree that mirrors the one on the host, and downloads each file into the appropriate directory of the mirrored tree. wget does this even when the downloaded file is temporary due to --delete-after. To prevent that, use option --no-directories.
The following demonstrates the result, using your example (slightly altered).
$ cd $(mktemp -d)
$ wget --delete-after --no-directories \
--warc-file=epfl --recursive --level=1 http://www.epfl.ch/
...
Total wall clock time: 12s
Downloaded: 22 files, 1.4M in 5.9s (239 KB/s)
$ ls -lhA
-rw-rw-r--. 1 chadv chadv 1.5M Aug 31 07:55 epfl.warc
If you forget to use --no-directories, you can easily clean up the tree of empty directories with find -type d -delete.
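A cautious variant of that cleanup is sketched below on a throwaway tree. It assumes GNU or BSD find (the -empty, -delete and -mindepth options are not in POSIX), and the directory names only imitate what wget would leave behind.

```shell
#!/bin/sh
# Demo tree standing in for what wget leaves behind (names are made up).
work=$(mktemp -d)
mkdir -p "$work/www.epfl.ch/public/hp2013/css" "$work/www.epfl.ch/img"
touch "$work/epfl.warc.gz"

# -empty limits the match to directories with nothing in them, and
# -delete implies depth-first traversal, so nested empty directories
# are removed bottom-up. -mindepth 1 keeps the start directory itself.
find "$work" -mindepth 1 -type d -empty -delete

# Only the WARC file should be left; record the result, then clean up.
remaining=$(ls -A "$work")
echo "$remaining"
rm -rf "$work"
```

Adding -empty means a directory that still holds a file is left alone instead of producing an error.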
For individual files (without --recursive), the option -O /dev/null keeps wget from creating an output file. For recursive fetches, /dev/null is not accepted (I don't know why). But why not just write all the output, concatenated, into one single file via -O tmpfile and delete that file afterwards?

How to ZIP specific files from a folder using Winzip command line?

With this command, I'm able to ZIP all files from the folders:
wzzip.exe -a -p -r C:\DestinationPath\DataFiles_20130903.zip C:\SourcePath\*.*
But my folder has .dat, .bat, .txt and .xls files. I want to ZIP only the .dat and .bat files. How can I do this?
Thanks.
Use this command (for the particular scenario in the question):
wzzip.exe -a -p -r C:\DestinationPath\DataFiles_20130903.zip C:\SourcePath\*.dat C:\SourcePath\*.bat
For more WinZip command-line options, refer to the following links:
WinZip command line Reference 1
WinZip command line Reference 2
To provide multiple file names you can also use @filename, where filename is a file containing the list of files you want to include in the zip file.
If you are making the command configurable, you can ask the user (or the other program calling your command) to select the file extensions, and then write the selected extensions into that file using Java code or any other language you prefer.
For example, if the user selects bat and dat, write "C:\SourcePath\*.bat" and "C:\SourcePath\*.dat" into the file (assume it is named fileExtensions.txt) and call the command:
wzzip.exe -a -p -r "C:\DestinationPath\DataFiles_20130903.zip" @"C:\SourcePath\fileExtensions.txt"
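Writing the list file can be sketched in a few lines. The example below uses a POSIX shell (for instance Git Bash or WSL on Windows), which is an assumption on my part since the answer leaves the language open; write_list is a hypothetical helper name, and the paths are the ones from the question.

```shell
#!/bin/sh
# write_list LISTFILE SRC_DIR EXT...
# Writes one "SRC_DIR\*.EXT" pattern per line to LISTFILE.
write_list() {
    listfile=$1
    srcdir=$2
    shift 2
    : > "$listfile"              # start with an empty list file
    for ext in "$@"; do
        # In the printf format, \\ prints one backslash; the single
        # quotes keep * from being expanded by the shell.
        printf '%s\\*.%s\n' "$srcdir" "$ext" >> "$listfile"
    done
}

write_list fileExtensions.txt 'C:\SourcePath' bat dat
cat fileExtensions.txt
```

The resulting file would then be passed to wzzip.exe as the list file in place of the individual patterns.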
You can use D7zip.
D7zip.exe is an excellent tool for zipping files and folders.
link to download
https://drive.google.com/file/d/0B4bu9X3c-WZqdlVlZFV4Wl9QWDA/edit?usp=sharing
How to use
compressing files
D7Zip.exe -z "c:\fileout.zip" -f "C:\filein.txt"
compressing files and putting password
D7Zip.exe -z "c:\fileout.zip" -f "C:\filein.txt" -s "123"
compressing folders
D7Zip.exe -z "c:\folderout.zip" -f "C:\folderin\"
unzipping files
D7Zip.exe -u "c:\fileout.zip" -f "c:\folderout\"
unzipping files that have password
D7Zip.exe -u "c:\fileout.zip" -f "c:\folderout\" -s "123"
decompressing files by extension
D7Zip.exe -u "c:\fileout.zip" -f "c:\folderout\*.txt"
decompressing files without asking for confirmation to replace
D7Zip.exe -u "c:\fileout.zip" -f "c:\folderout\" -r
help
D7Zip.exe -?
D7Zip.exe by Delmar Grande.
If the command line given above is right, then give this a go, but check the paths first.
@echo off
pushd "C:\SourcePath"
"c:\program files\winzip\wzzip.exe" -a -p -r "C:\DestinationPath\DataFiles_20130903.zip" *.dat *.bat
popd