I have a document with this structure (it's large, more than 20000 lines)
@A00627:308:H227VDSX3:1:1363:9616:15013 1:N:0:TTCTGCAG+TTCCGGTA
GATACATCGCCATCCGAATTCCACTCCGGTACAATGGCTTGGTGACGGGTATGAGGGCGAAGGGCATCATTGCGATTTGCTGGGTGCTGTCATT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00627:308:H227VDSX3:1:1568:1362:27430 1:N:0:TTCTGCAG+TTCCGGTA
ACTGGCTTCTGCGCTGCCTGCCATGGCTGCCTCTTCTTCGCCTGCTTTGTCCTGGTCCTCACGCAGAGTTCCATCTTCAGCCTCTTGGCT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF:
@A00627:308:H227VDSX3:1:2375:11198:6511 1:N:0:TTCTGCAG+TTCCGGTA
CTGGCTTCTGCGCTGCCTGCCATGGCTGCCTCTTCTTCGCCTGCTTTGTCCTGGTCCTCACGCAGAGTTCCATCTTCAGCCTCTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
I want to create a new document containing only some of the lines, such as:
>1
GATACATCGCCATCCGAATTCCACTCCGGTACAATGGCTTGGTGACGGGTATGAGGGCGAAGGGCATCATTGCGATTTGCTGGGTGCTGTCATT
>2
ACTGGCTTCTGCGCTGCCTGCCATGGCTGCCTCTTCTTCGCCTGCTTTGTCCTGGTCCTCACGCAGAGTTCCATCTTCAGCCTCTTGGCT
>3
CTGGCTTCTGCGCTGCCTGCCATGGCTGCCTCTTCTTCGCCTGCTTTGTCCTGGTCCTCACGCAGAGTTCCATCTTCAGCCTCTT
....
I think this is easy, but I am very new to coding. How should I do it?
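One way to approach this, as a minimal sketch in Python. It assumes every record is exactly four lines (header, sequence, +, quality), as in the sample above, and simply keeps the second line of each group of four while numbering the records. The names fastq_to_fasta and convert_file are only illustrative.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA-style lines (">1", sequence, ">2", sequence, ...).

    Assumes every record is exactly four lines: header, sequence, "+", quality.
    """
    record_number = 0
    for index, line in enumerate(lines):
        # The sequence is the second line of each four-line record.
        if index % 4 == 1:
            record_number += 1
            yield f">{record_number}"
            yield line.rstrip("\n")


def convert_file(fastq_path: str, fasta_path: str) -> None:
    """Stream the input line by line so a large file is never fully in memory."""
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        for out_line in fastq_to_fasta(src):
            dst.write(out_line + "\n")
```

Calling convert_file("reads.fastq", "reads.fasta") (both file names are placeholders) would write the new document. If a file wraps a sequence across several lines, the four-line assumption breaks and a proper parser would be the safer choice.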
First time poster and quite new to R.
I'm trying to add a new variable to a tibble ("joined") that holds the value of column 22 ("NurseID") from the previous row (nrow-1) whenever the value in column 3 ("AccountID") on the current row matches the one on the previous row.
I can do it with a loop over the sorted data, but this is a large dataset and the loop takes a long time to run. Is there a faster or easier way to do this?
arrange (joined, AccountID, date_day, shift)
tie <- "."
for (i in 2:nrow(joined))
{
ifelse (joined[i,3]==joined[i-1,3], temp<-joined[i-1,22], temp<-".")
tie <- c(tie,temp)
}
temptie <- as.numeric(tie)
joined <- as_tibble(cbind(joined,temptie))
Any help or input is much appreciated. Please let me know if you need more information on the tibble.
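The underlying idea is to compare each row's key with the previous row's key in one pass, rather than growing a result vector one element at a time. Below is a language-neutral sketch of that logic in Python, not R. It assumes the rows are already sorted so that equal AccountID values are adjacent, and the function name previous_in_group is made up for illustration. In R the same idea is usually expressed with a shifted copy of the column, for example with dplyr's lag().

```python
from typing import List, Optional, Sequence


def previous_in_group(keys: Sequence, values: Sequence) -> List[Optional[object]]:
    """Return, per row, the previous row's value if its key matches, else None.

    Assumes rows are already sorted so that equal keys sit next to each other.
    """
    return [
        values[i - 1] if i > 0 and keys[i] == keys[i - 1] else None
        for i in range(len(keys))
    ]


account_ids = [1, 1, 2, 2, 2]
nurse_ids = [10, 11, 12, 13, 14]
print(previous_in_group(account_ids, nurse_ids))
# prints [None, 10, None, 12, 13]
```

None plays the role of the "." placeholder in the loop above, which becomes NA after as.numeric.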
The lines of the document are as follows:
I am 12 year old.
I go to school.
I am playing.
Its 4 pm.
Two lines of the document contain numbers. I want to count how many lines in the document contain a number.
This is to be implemented in Scala Spark.
val lineswithnum=linesRdd.filter(line => (line.contains([^0-9]))).count()
I expect the output to be 2, but I am getting 0.
You can use the exists method:
val lineswithnum=linesRdd.filter(line => line.exists(_.isDigit)).count()
In line with your original approach and not discounting the other answer(s):
val textFileLines = sc.textFile("/FileStore/tables/so99.txt")
val linesWithNumCollect = textFileLines.filter(_.matches(".*[0-9].*")).count
The .* on each side is needed because matches only succeeds when the regex matches the entire line, not just a part of it.
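The difference between matching a whole line and searching inside it can be seen in a small sketch. This uses Python's re module rather than Scala, on the assumption that re.fullmatch is a fair stand-in for the whole-string behaviour of matches. The four sample lines are the ones from the question.

```python
import re

lines = ["I am 12 year old.", "I go to school.", "I am playing.", "Its 4 pm."]

# fullmatch must consume the entire line, so a bare "[0-9]" only matches
# a line that consists of exactly one digit and nothing else.
whole_line_only = sum(1 for line in lines if re.fullmatch(r"[0-9]", line))

# Wrapping the class in ".*" lets the digit sit anywhere in the line.
with_wildcards = sum(1 for line in lines if re.fullmatch(r".*[0-9].*", line))

# search looks for the pattern anywhere, so no wrapping is needed.
with_search = sum(1 for line in lines if re.search(r"[0-9]", line))

print(whole_line_only, with_wildcards, with_search)
# prints 0 2 2
```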
I have a question on key/value pair RDD.
I have five files in the C:/download/input folder, each containing the dialogue of a film:
movie_horror_Conjuring.txt
movie_comedy_eurotrip.txt
movie_horror_insidious.txt
movie_sci-fi_Interstellar.txt
movie_horror_evildead.txt
I am trying to read the files in the input folder using sc.wholeTextFiles(), which gives me key/value pairs as follows:
(C:/download/input/movie_horror_Conjuring.txt,values)
I am trying to group the input files of each genre together using groupByKey(), so that the values of all the horror movies end up together, the comedy movies together, and so on.
Is there any way I can generate the key/value pair as (horror, values) instead of (C:/download/input/movie_horror_Conjuring.txt,values)?
val ipfile = sc.wholeTextFiles("C:/download/input")
val output = ipfile.groupByKey().map(t => (t._1,t._2))
The above code gives me the following output:
(C:/download/input/movie_horror_Conjuring.txt,values)
(C:/download/input/movie_comedy_eurotrip.txt,values)
(C:/download/input/movie_horror_Conjuring.txt,values)
(C:/download/input/movie_sci-fi_Interstellar.txt,values)
(C:/download/input/movie_horror_evildead.txt,values)
whereas I need the output as follows:
(horror, (values1, values2, values3))
(comedy, (values1))
(sci-fi, (values1))
I also tried some map and split operations to strip the folder path from the key and keep only the file name, but I'm not able to attach the corresponding values to the files.
I would also like to know how I can get the line count of values1, values2, values3, etc.
My final output should look like
(horror, 100)
where 100 is the sum of the line counts (values1 = 40 lines, values2 = 30 lines and values3 = 30 lines), and so on.
Try this:
val output = ipfile.map{case (k, v) => (k.split("_")(1),v)}.groupByKey()
output.collect
Let me know if this works for you!
Update:
To get output in the format of (horror, 100):
val output = ipfile.map{case (k, v) => (k.split("_")(1),v.count(_ == '\n'))}.reduceByKey(_ + _)
output.collect
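Two details of the snippets above are worth checking against your data. Splitting the full path on "_" picks the wrong field if any folder name contains an underscore, and counting '\n' characters misses a final line that has no trailing newline. Here is a small sketch of the same map-then-reduce logic in plain Python (not Spark) that guards against both. The helper names genre_of, count_lines and lines_per_genre are made up for illustration, and it assumes file names follow movie_<genre>_<title>.txt.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def genre_of(path: str) -> str:
    """Extract the genre from a path such as .../movie_horror_Conjuring.txt."""
    # Drop the folder part first so underscores in folder names do no harm.
    file_name = path.replace("\\", "/").rsplit("/", 1)[-1]
    # Assumed naming scheme: movie_<genre>_<title>.txt
    return file_name.split("_")[1]


def count_lines(text: str) -> int:
    """Count lines, including a last line without a trailing newline."""
    return len(text.splitlines())


def lines_per_genre(files: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Sum the line counts per genre, mirroring map followed by reduceByKey."""
    totals: Dict[str, int] = defaultdict(int)
    for path, content in files:
        totals[genre_of(path)] += count_lines(content)
    return dict(totals)
```

The input to lines_per_genre is a sequence of (path, content) pairs, the same shape that wholeTextFiles produces.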
I'm a bit new to data import using Matlab.
Basically, I have an ASCII file. It has 13 header lines, along with 765 columns and ~3500 rows of data. I am attempting to import the data into a 3500 x 765 matrix in Matlab. I've tried the following:
fileID = fopen('filename');
formatspec = [repmat('%f ', [1,765])];
raw_data=textscan(fileID,formatspec, 'Headerlines',13,'delimiter','\t');
It successfully skips the 13 header lines. However, it only gives me a 1 x 765 matrix containing only the data from the first row.
Perhaps I have misunderstood just how I am supposed to use textscan, so any help in getting my other ~3499 rows of data would be very well appreciated.
~Thank You
NOTE
The data file itself is formatted as follows. The first 13 lines do not contain the data itself. All lines after that contain sets of data similar to what is pasted below, extending for 700+ columns and 3000+ rows.
Wyko ASCII Data File Format 0 1 1
X Size 3571
Y Size 765
Block Name Type Length Value
Wavelength 7 4 72.482628
Aspect 7 4 1
Pixel_size 7 4 0.00196
StageY 7 4 -0.048055
Magnification 8 8 5.05
StageX 7 4 0.214484
ScannerPosition 7 4 3490.000732
ScannerSpeed 7 4 3.165393
RAW_DATA 3 10927260
-10976.61035 -10977.07324 -10981.07422 -10985.6084 ...
-10967.41309 -10963.31836 -10966.75195 -10980.40723 ...
-10969.08496 -10976.03711 -10976.62988 -10964.23731 ...
-10974.12695 -10976.61133 -10979.2627 -10973.57813 ...
-10969.21094 -10966.56543 -10973.74512 -10983.41797 ...
-10970.18359 -10980.82715 -10968.00195 -10975.58594 ...
-10980.41016 -10982.39356 -10982.74316 -10974.51563 ...
-10972.31641 -10984.00488 -10987.89453 -10976.23633 ...
I think the following should work, but I don't have Matlab on this machine to test it out.
fileID = fopen('filename');
formatspec = [repmat('%f ', [1,765])];
raw_data = textscan(fileID,formatspec, 'Headerlines',13,'delimiter','\t');
while ~feof(fileID)
new_data = textscan(fileID,formatspec,'delimiter','\t');
raw_data = [raw_data; new_data];
end
fclose(fileID);
Note that this is not a particularly efficient way to do it. If your header lines give you the size of your array, you may want to use zeros to create an array of the appropriate size and then read the data into your array.
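The preallocation idea can be sketched as follows, in Python rather than Matlab. The sketch assumes that "X Size" in the header is the number of rows and "Y Size" the number of columns, which the sizes quoted in the question suggest but which I have not verified against the file format. The function name read_data_matrix is hypothetical.

```python
from typing import List


def read_data_matrix(lines: List[str], header_lines: int = 13) -> List[List[float]]:
    """Parse the data block into a rows x columns list of floats.

    Assumption: "X Size" is the row count and "Y Size" the column count.
    """
    n_rows = n_cols = 0
    for line in lines[:header_lines]:
        fields = line.split()
        if fields[:2] == ["X", "Size"]:
            n_rows = int(fields[2])
        elif fields[:2] == ["Y", "Size"]:
            n_cols = int(fields[2])

    # Allocate the full matrix once instead of growing it row by row.
    data = [[0.0] * n_cols for _ in range(n_rows)]
    for r, line in enumerate(lines[header_lines:header_lines + n_rows]):
        for c, field in enumerate(line.split()[:n_cols]):
            data[r][c] = float(field)
    return data
```

Rows that are shorter than expected keep their zero padding, so a truncated file shows up as trailing zeros rather than as an error.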
My data is in following format:
TABLE NUMBER 1
FILE: name_1
name_2
TIME name_3
day name_4
-0.01 0
364.99 35368.4
729.99 29307
1094.99 27309.5
1460.99 26058.8
1825.99 25100.4
2190.99 24364
2555.99 23757.1
2921.99 23240.8
3286.99 22785
3651.99 22376.8
4016.99 22006.1
4382.99 21664.7
4747.99 21348.3
5112.99 21052.5
5477.99 20774.1
5843.99 20509.9
6208.99 20259.7
6573.99 20021.3
6938.99 19793.5
7304.99 19576.6
TABLE NUMBER 2
FILE: name_1
name_5
TIME name_6
day name_7
-0.01 0
364.99 43110.4
729.99 37974.1
1094.99 36175.9
1460.99 34957.9
1825.99 34036.3
2190.99 33293.3
2555.99 32665.8
2921.99 32118.7
3286.99 31626.4
3651.99 31175.1
4016.99 30758
4382.99 30368.5
4747.99 30005.1
5112.99 29663
5477.99 29340
5843.99 29035.2
6208.99 28752.4
6573.99 28489.7
6938.99 28244.2
7304.99 28012.9
TABLE NUMBER 3
Until now I was splitting this data into separate files and reading the variables (TIME and name_i) from each file in the following way:
[TIME(:,j), name_i(:,j)]=textread('filename','%f\t%f','headerlines',5);
But now I am writing the data from those files into one file, as shown at the beginning. For example, I want to read and store the TIME data in vectors TIME1, TIME2, TIME3, TIME4, TIME5 for name_3, name_6, name_9 respectively, and similarly for the others.
First of all, I suggest you don't use variable names such as TIME1,TIME2 etc, since that gets messy quickly. Instead, you can e.g. use a cell array with five rows (one for each well), and one or two columns. In the sample code below, wellData{2,1} is the time for the second well, wellData{2,2} is the corresponding Oil Rate SC - Yearly.
There might be more elegant ways to do the reading; here's something quick:
%# open the file
fid = fopen('Reportq.rwo');
%# read it into one big array, row by row
fileContents = textscan(fid,'%s','Delimiter','\n');
fileContents = fileContents{1};
fclose(fid); %# don't forget to close the file again
%# find rows containing TABLE NUMBER
wellStarts = strmatch('TABLE NUMBER',fileContents);
nWells = length(wellStarts);
%# loop through the wells and read the numeric data
wellData = cell(nWells,2);
wellStarts = [wellStarts;length(fileContents)+1]; %# sentinel one past the last line, so the last row of the final table is kept
for w = 1:nWells
    %# read lines containing numbers
    tmp = fileContents(wellStarts(w)+5:wellStarts(w+1)-1);
    %# convert strings to numbers
    tmp = cellfun(@str2num,tmp,'uniformOutput',false);
    %# catenate array
    tmp = cat(1,tmp{:});
    %# assign output
    wellData(w,:) = mat2cell(tmp,size(tmp,1),[1,1]);
end
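For comparison, here is the same split-at-marker approach as a short Python sketch. It assumes each table begins with five non-numeric lines (counting the TABLE NUMBER line itself) followed by two-column rows, as in the sample data. The name read_tables is only illustrative.

```python
from typing import List, Tuple

Table = Tuple[List[float], List[float]]  # (time values, data values)


def read_tables(lines: List[str], header_lines: int = 5) -> List[Table]:
    """Split the lines at each 'TABLE NUMBER' marker and parse the numbers.

    Assumes every table starts with `header_lines` non-numeric lines,
    counting the 'TABLE NUMBER' line, followed by two-column rows.
    """
    starts = [i for i, line in enumerate(lines) if line.startswith("TABLE NUMBER")]
    # A sentinel one past the last index closes the final table.
    bounds = starts + [len(lines)]

    tables: List[Table] = []
    for begin, end in zip(bounds, bounds[1:]):
        times: List[float] = []
        values: List[float] = []
        for line in lines[begin + header_lines:end]:
            fields = line.split()
            if len(fields) == 2:  # skip blank or malformed rows
                times.append(float(fields[0]))
                values.append(float(fields[1]))
        tables.append((times, values))
    return tables
```

Here tables[1][0] holds the time column of the second table and tables[1][1] its data column, which corresponds to wellData{2,1} and wellData{2,2} above.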