How to use sed with HDFS?

I'm trying to change a special character (Þ) to ; in a file on HDFS, but it is not working. The command I used is this:
hdfs dfs -cat path/file.txt | sed -i 's/Þ/;/g' | hadoop fs -put -f - path/file.txt
where:
hdfs dfs -cat gets the HDFS file content
sed -i 's/Þ/;/g' replaces Þ with ;
hadoop fs -put -f - path/file.txt overwrites the original file in HDFS
When I run this command I get this error:
sed: no input files
cat: Unable to write to output stream.
If I execute hdfs dfs -cat path/file.txt I can see the content. What is going on?
Edit 1:
I removed the -i from the sed command and the sed error went away, but the console shows this:
put:`path/file.txt': No such file or directory
cat: cat: Unable to write to output stream.
Thanks!!

hdfs dfs -cat path/file.txt | sed 's/Þ/;/g' | hadoop fs -put -f - path/file.txt
It works for me this way. The original command failed because sed -i edits files in place and therefore expects a filename argument, so it cannot read from a pipe; dropping -i makes sed filter stdin to stdout.
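One caveat worth noting: this pipeline reads and overwrites the same HDFS file in a single pass, so a failure mid-stream could leave the file truncated. A safer variant (a sketch; the temporary path path/file.txt.new is just an example name) writes to a temporary file and swaps it in only on success:
# write the edited stream to a temporary path first
hdfs dfs -cat path/file.txt | sed 's/Þ/;/g' | hdfs dfs -put -f - path/file.txt.new \
  && hdfs dfs -rm path/file.txt \
  && hdfs dfs -mv path/file.txt.new path/file.txt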

Related

Airflow bash operator: Problem using sed in hdfs

I have an Airflow task in which I am trying to use a sed command to replace LF with CRLF:
hdfs dfs -cat /test/file.txt | sed 's/$/\r/g' | hdfs dfs -put -f - /test/file.txt
I get the following error:
error: sed: -e expression #1, char 4: unterminated `s' command
I think the \r is what is causing the conflict. How do I solve this problem?
I found the reason: \ is a special character in Python strings. To solve it I just added an extra \ so it becomes sed 's/$/\\r/g'; another option is to use a raw string prefix (r'...').
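To check the substitution outside Airflow, a quick local test (a sketch; od -c makes the added carriage returns visible as \r) can confirm the sed expression before embedding it in the task:
# append a carriage return to each line, then dump the bytes to verify
printf 'one\ntwo\n' | sed 's/$/\r/' | od -c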

rename batch files in folder using a textfile

I have a folder of files that start with specific strings, and I would like to replace part of each name using the corresponding column from a text file.
Folder with files
ABC_S1_002.txt
ABC_S1_003.html
ABC_S1_007.png
NMC_D1_002.png
NMC_D2_003.html
And I have a text file that has the strings to be replaced as:
ABC ABC_newfiles
NMC NMC_extra
So the folder after renaming will be
ABC_newfiles_S1_002.txt
ABC_newfiles_S1_003.html
ABC_newfiles_S1_007.png
NMC_extra_D1_002.png
NMC_extra_D2_003.html
I tried it one prefix at a time using mv:
for f in ABC*; do mv "$f" "${f/ABC/ABC_newfiles}"; done
How can I read the text file that has the old strings in the first column and replace them with the new strings from the second column? I tried:
IFS=$'\n'; for i in $(cat file_rename);do oldName=$(echo $i | cut -d $'\t' -f1); newName=$(echo $i | cut -d $'\t' -f2); for f in oldName*; do mv "$f" "${f/oldName/newName}"; done ; done
It did not work, though.
This might work for you (GNU parallel and rename):
parallel --colsep ' ' rename -n 's/{1}/{2}/' {1}* :::: textFile
This will list out the rename commands for each line in textFile.
Once the output has been checked, remove the -n option and run for real.
For a sed solution, try:
sed -E 's#(.*) (.*)#ls \1*| sed "h;s/\1/\2/;H;g;s/\\n/ /;s/^/echo mv /e"#e' testFile
Again, this will echo the mv commands out; once checked, remove echo and run for real.
Review the result of
sed -r 's#([^ ]*) (.*)#for f in \1*; do mv "$f" "${f/\1/\2}"; done#' textfile
When that looks right, you can copy-paste the result or wrap it in source:
source <(sed -r 's#([^ ]*) (.*)#for f in \1*; do mv "$f" "${f/\1/\2}"; done#' textfile)
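For a plain-bash alternative: the loop in the question fails because oldName and newName are used without $, so the glob literally matches files starting with "oldName". A minimal corrected sketch (assuming a whitespace-separated rename file called file_rename, as in the question):
while read -r old new; do               # old = current prefix, new = replacement
  for f in "$old"*; do                  # every file starting with the old prefix
    [ -e "$f" ] && mv -- "$f" "${f/"$old"/$new}"
  done
done < file_rename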

Hdfs find files below certain size

Is there a way to list files smaller than a certain size in HDFS, using the command line or even a Spark script?
Scala/Spark would be great, as it may run faster than the command line.
I have looked at the Apache FileSystem documentation but could not find much information.
You can use the command below to show files larger than 1 KB (the fifth column of the ls -R output is the file size in bytes):
hdfs dfs -ls -R / | awk '$5 > 1000'
Similarly, the command below shows files smaller than 1 KB:
hdfs dfs -ls -R / | awk '$5 < 1000'
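If you only need the size and path rather than the full listing, an awk action trims the output (a sketch; column 8 is the path in the default ls -R layout, but the index can shift if the format differs):
hdfs dfs -ls -R / | awk '$5 < 1000 {print $5, $8}'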
Hope that helps.

How to use 'sed' to find and replace values within a tsv file?

I am currently working with a large .tsv.gz file containing two columns that look something like this:
xxxyyy 408261
yzlsdf 408260null408261
zlkajd 408258null408259null408260
asfzns 408260
What I'd like to do is find all the rows that contain "null" and replace it with a comma (","), so that the result would look like this:
xxxyyy 408261
yzlsdf 408260,408261
zlkajd 408258,408259,408260
asfzns 408260
I have tried the following command, but it did not work:
sed -i 's/null/,/g' 46536657_1748327588_combined_copy.tsv.gz
Unzipping the file and trying again on the plain .tsv does not work either.
I've also tried opening the unzipped file in a text editor to find and replace manually, but the file is too large and the editor crashes.
Try:
zcat comb.tsv.gz | sed 's/null/,/g' | gzip >new_comb.tsv.gz && mv new_comb.tsv.gz comb.tsv.gz
sed -i cannot work directly on the .gz file because the bytes on disk are compressed; the literal string null only exists in the decompressed stream. Because this pipeline avoids unzipping your file all at once, it should also save on memory.
Example
Let's start with this sample file:
$ zcat comb.tsv.gz
xxxyyy 408261
yzlsdf 408260null408261
zlkajd 408258null408259null408260
asfzns 408260
Next, we run our command:
$ zcat comb.tsv.gz | sed 's/null/,/g' | gzip >new_comb.tsv.gz && mv new_comb.tsv.gz comb.tsv.gz
By looking at the output file, we can see that the substitutions were made:
$ zcat comb.tsv.gz
xxxyyy 408261
yzlsdf 408260,408261
zlkajd 408258,408259,408260
asfzns 408260
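If the file lives on HDFS, as in the main question, the same streaming pattern composes with the hdfs commands (a sketch; /path/comb.tsv.gz is a hypothetical location):
hdfs dfs -cat /path/comb.tsv.gz | gzip -dc | sed 's/null/,/g' | gzip | hdfs dfs -put -f - /path/comb.tsv.gz.new \
  && hdfs dfs -rm /path/comb.tsv.gz \
  && hdfs dfs -mv /path/comb.tsv.gz.new /path/comb.tsv.gz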

How to replace a text with another text in a file present at HDFS

I have file.txt on a UNIX file system. Its content is below:
{abc}]}
{pqr}]}
I want to convert this file.txt into:
[
{abc}]},
{pqr}]}
]
I am able to do this using the shell commands below:
sed -i 's/}]}/}]},/g' file.txt
sed -i '1i [' file.txt
sed -i '$ s/}]},/}]}]/g' file.txt
My question is: what if this file were present on HDFS at the /test location?
If I use: sed -i 's/}]}/}]},/g' /test/file.txt
it looks for a local /test directory and says the file does not exist.
If I use: sed -i 's/}]}/}]},/g' | hadoop fs -cat /test/file.txt
it says sed: no input files and then prints the content of file.txt as per the cat command.
If I use: hadoop fs -cat /test/file.txt | sed -i 's/}]}/}]},/g'
it says:
sed: no input files
cat: Unable to write to output stream
So, how can I replace strings in my file on HDFS with some other string?
With sed and hdfs commands:
hdfs dfs -cat /test/file.txt | sed 's/$/,/g; $s/,$/\n]/; 1i [' | hadoop fs -put -f - /test/file.txt
where:
hdfs dfs -cat /test/file.txt gets the HDFS file content
s/$/,/g; adds a comma at the end of each line
$s/,$/\n]/; removes the comma at the end of the last line and appends a closing bracket on a new line
1i [ inserts an opening bracket before the first line
hadoop fs -put -f - /test/file.txt overwrites the original file in HDFS
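Before overwriting the HDFS file, it is worth dry-running the sed expression on the sample data locally (a sketch; printf simply recreates the two input lines from the question):
printf '{abc}]}\n{pqr}]}\n' | sed 's/$/,/g; $s/,$/\n]/; 1i ['
which prints the bracketed output shown in the question, confirming the expression before pointing it at HDFS.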