Extract text, Matlab [closed]

I am trying to find a way to extract text in a specific and efficient way, as in this example:
'Hello Mr. Jack Andrew , your number is 894Gfsf , and your Bank ID # 734234'
I want a way to get the name, the number, and the Bank ID number.
I want to write software that deals with different text files and extracts those required values. I may not know the exact order, but the text follows a template, like a bank statement.
Thanks!

It's a bit hard to understand what exactly the problem is. If all you need to do is to split strings, here's a possible way to do it:
str = 'Hello Mr. Jack Andrew , your number is 894Gfsf , and your Bank ID # 734234';
tokenized = strsplit(str, ' ');          % split on spaces
Name    = strjoin(tokenized(3:4), ' ');  % tokens 3-4: 'Jack Andrew'
Number  = tokenized{9};                  % '894Gfsf'
Account = tokenized{end};                % '734234'
Alternatively, for splitting you could use regexp(...,'split') or regexp(...,'tokens');
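For instance, the regexp split form would look like this (same str as above):
tokenized = regexp(str, '\s+', 'split');   % equivalent to splitting on whitespace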

I think you want regular expressions for this. Here's an example:
str = 'Hello Mr. Jack Andrew , your number is 894Gfsf , and your Bank ID # 734234';
matches=regexp(str, 'your number is (\w+).*Bank ID # (\d+)', 'tokens');
matches{1}
ans =
'894Gfsf' '734234'
My suggestion would be to make a whole array of strings with sample patterns that you want to match, then build a set of regular expressions that collectively match all of your samples. Try each regexp in sequence until you find one that matches.
To do this, you will need to learn about regular expressions.
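For instance, here is a minimal sketch of that try-each-pattern idea in MATLAB (the second pattern and the printed field names are made up purely for illustration):
str = 'Hello Mr. Jack Andrew , your number is 894Gfsf , and your Bank ID # 734234';
patterns = {'your number is (\w+).*Bank ID # (\d+)', ...     % template 1 (from above)
            'account no\. (\w+).*customer id (\d+)'};        % template 2 (hypothetical)
for k = 1:numel(patterns)
    tok = regexp(str, patterns{k}, 'tokens', 'once');
    if ~isempty(tok)
        fprintf('Matched template %d: number = %s, ID = %s\n', k, tok{1}, tok{2});
        break
    end
end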

Related

Perl script to remove new line character and move next line data to previous line [closed]

I have input like below
"ID"|"Desc"
"100"|"
The data present in Desc column has new line characters.
So the data came to second line.
Some records of data went to third line. But I need all data to be present in first line."
"101"|"This record desc is correct data which has present in single line. So I need data to present in single line."
I need output like below,
"ID"|"Desc"
"100"|"The data present in Desc column has new line characters.So the data came to second line.Some records of data went to third line. But I need all data to be present in first line."
"101"|"This record desc is correct data which has present in single line. So I need data to present in single line."
Can someone please help with a Perl script that achieves the above requirement?
Use Text::CSV_XS to process the file as it can parse it correctly.
perl -MText::CSV_XS=csv -wE 'csv( in => shift,
always_quote => 1,
sep_char => "|",
eol => "\n",
on_in => sub { $_[1][1] =~ s/\n//g } );
' -- file.csv > newfile.csv
I'm testing this in a Linux shell; you might need a different eol if you're on MS Windows. Also, I don't know what rules PowerShell uses for quoting, so you might need to use a different type of quotes.

How to give an alias name with a space in sparksql [closed]

I have tried the code below.
Trial 1:
val df2=sqlContext.sql("select concat(' ',Id,LabelName) as 'first last' from p1 order by LabelName desc ");
Trial 2:
val df2=sqlContext.sql("select concat(' ',Id,LabelName) from p1 order by LabelName desc ");
val df3=df2.toDF("first last")
Trial 1 throws an error when I try to run it, but trial 2 accepts the command and then throws an error when I perform the action below:
scala> df3.write.parquet("/prashanth/a1")
When a column name contains special characters in a SQL statement, you can quote it with backticks, such as `first last`.
You cannot use a space in a Parquet column name. You can either rename the column or use another file format, such as CSV.
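For example, a minimal sketch reusing the p1 table and the write path from the question (the renamed column first_last is just an illustration):
// Backticks let the alias contain a space:
val df2 = sqlContext.sql("select concat(' ', Id, LabelName) as `first last` from p1 order by LabelName desc")
// Parquet rejects spaces in column names, so rename the column before writing:
val df3 = df2.withColumnRenamed("first last", "first_last")
df3.write.parquet("/prashanth/a1")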

Rename file from xx_02.csv to xx.csv [closed]

I have a folder 'a' with about 200 files named xx_out_02.csv and I want to rename them to xx_out.csv, maybe using Matlab or some script. I tried it in cmd, but I had to run the command for each and every file.
Can someone help me here?
Best Regards
Dilip
You can use the movefile function from Matlab.
Here is an example:
clc
addpath('yourdir')
csvf = dir('yourdir/*.csv');   % list all csv files in the folder
numberOfcsv = numel(csvf);
for ii = 1:numberOfcsv
    file = csvf(ii).name;
    % rename the ii-th file to x001_out.csv, x002_out.csv, ...
    movefile(sprintf('yourdir/%s', file), sprintf('yourdir/x%03d_out.csv', ii), 'f');
end
Your question is unclear. I'm assuming
You want to strip off substrings of the form _ followed by one or more digits right before .csv.
The resulting target names are all different. For example, you have files such as xx_out_02.csv and yy_out_01.csv, but not xx_out_02.csv and xx_out_01.csv.
Operating system? I'm assuming Windows. For other systems you can replace the system line below with the appropriate system command, or better, use movefile as in SamuelNLP's answer.
Code:
files = dir('*.csv');
names = {files.name};
for n = 1:numel(names)
    name = names{n};
    name_new = regexprep(name, '_\d+(?=\.csv$)', '');
    system(['ren ' name ' ' name_new]); %// MS-DOS command to rename file
end
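If you prefer movefile over calling the shell (as suggested above), a minimal platform-independent sketch combining it with the same regexprep pattern:
files = dir('*.csv');
for n = 1:numel(files)
    name = files(n).name;
    name_new = regexprep(name, '_\d+(?=\.csv$)', '');
    if ~strcmp(name, name_new)
        movefile(name, name_new);   % rename in place, no shell needed
    end
end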

Print a substring of an array value [closed]

I have an array, and I know the third element, dataArr[2], contains a 10-digit phone number. I need to read or print only the first 6 digits. For instance, if the phone number is 8329001111, I need to print 832900. I tried substr, but I keep reading or printing the full value. Do I need to dereference?
Try this :
$dataArr[2] =~ s/\s//g;            # ensure there are no spaces
print substr($dataArr[2], 0, 6);   # substr(variable, offset, length) -> first 6 digits

One million records in a log file for an online shopping website. Find distinct IP addresses [closed]

This is a technical test question. You are given 1 million records in a log file; these are records of website hits for an online shopping website. The records are of the form:
TimeStamp:Date,time ; IP address; ProductName
Find the distinct IP addresses and the most popular product. What is the most efficient way to do this? One solution is hashing. If the solution is hashing, please provide an explanation of how to hash this efficiently, since there are one million records.
I recently did something similar for homework. I'm not sure of the total number of lines, but it was quite a lot. The point is that your computer can probably do this very quickly, even if there are a million records.
I agree with you on the hashtable, and I would handle the two questions slightly differently.
For the first one, I would check every IP against the hashtable; if it exists, do nothing. If it does not exist, add it to the hashtable and increment a counter. At the end of the program, the counter will tell you how many unique IPs there were.
For the second, I would hash the product name and put that in the hashtable, incrementing the value associated with the hash key every time I found a match in the table. At the end, loop through all the keys and values of the hashtable and find the highest value. That is the most popular product.
A million log records is really a very small number; just read them in and keep a set of the IP addresses and a dict from product names to number of mentions. You don't mention any specific language constraint, so I assume a language that will (implicitly) do excellent hashing of such strings on your behalf is acceptable (Perl, Python, Ruby, Java, C#, etc. all have fine facilities for the purpose).
E.g., in Python:
import collections
import heapq

ipset = set()
prodcount = collections.defaultdict(int)
numlines = 0
for line in open('logfile.txt', 'r'):
    timestamp, ip, product = line.strip().split(';')
    ipset.add(ip)
    prodcount[product] += 1
    numlines += 1

print "%d distinct IP in %d lines" % (len(ipset), numlines)
print "Top 10 products:"
top10 = heapq.nlargest(10, prodcount, key=prodcount.get)
for product in top10:
    print "%6d %s" % (prodcount[product], product)
Distinct IP addresses:
$ cut -f 2 -d \; | sort | uniq
Most popular product:
$ cut -f 3 -d \; | sort | uniq -c | sort -n
If you can do so, shell script it like that.
First of all, one million lines is not at all a huge file.
A simple Perl script can chew through a 2.7 million line file in 6 seconds, without having to think much about the algorithm.
In any case hashing is the way to go and, as shown, there's no need to bother with hashing over an integer representation.
If we were talking about a really huge file, then I/O would become the bottleneck and thus the hashing method gets less and less relevant as the file grows bigger.
Theoretically in a language like C it would probably be faster to hash over an integer than over a string, but I doubt that in a language suited to this task that would really make a difference. Things like how to read the file efficiently would matter much much more.
Code
vinko@parrot:~$ more hash.pl
use strict;
use warnings;

my %ip_hash;
my %product_hash;

open my $fh, "<", "log2.txt" or die $!;
while (<$fh>) {
    my ($timestamp, $ip, $product) = split(/;/, $_);
    $ip_hash{$ip} = 1 if (!defined $ip_hash{$ip});
    if (!defined $product_hash{$product}) {
        $product_hash{$product} = 1;
    } else {
        $product_hash{$product} = $product_hash{$product} + 1;
    }
}

for my $ip (keys %ip_hash) {
    print "$ip\n";
}

my @pkeys = sort { $product_hash{$b} <=> $product_hash{$a} } keys %product_hash;
print "Top product: $pkeys[0]\n";
Sample
vinko@parrot:~$ wc -l log2.txt
2774720 log2.txt
vinko@parrot:~$ head -1 log2.txt
1;10.0.1.1;DuctTape
vinko@parrot:~$ time perl hash.pl
11.1.3.3
11.1.3.2
10.0.2.2
10.1.2.2
11.1.2.2
10.0.2.1
11.2.3.3
10.0.1.1
Top product: DuctTape
real 0m6.295s
user 0m6.230s
sys 0m0.030s
I also would read the file into a database, and would link it to another table of log filenames and date/time imported.
This is because in the real world you're going to need to do this regularly. The company is going to want to be able to check trends over time, so you're quickly going to be asked questions like "is that more or less unique IP addresses than last month's log file?" and "how are the most popular products changing from week to week".
In my experience the best way to answer these questions in an interview scenario as you describe is to show awareness of real-world situations. A tool to parse the log files (produced daily? weekly? monthly?) and read them into a database where some queries, graphs etc can pull all the data out, especially across multiple log files, will take a bit longer to write but be infinitely more useful and useable.
In the real world, if this is a once-off or occasional thing, I would simply insert the data into a database and run a few basic queries.
Since this is a homework assignment though, that's probably not what the prof is looking for. An IP address is really just a 32-bit number. I could convert each IP to its 32-bit equivalent instead of creating a hash.
Since this is homework, the rest "is left as an exercise to the reader."
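A minimal sketch of just that conversion, not the rest of the exercise (Python here purely as an illustration):
import socket, struct

def ip_to_int(ip):
    # pack the dotted quad into 4 bytes, then read it back as one unsigned 32-bit integer
    return struct.unpack('!I', socket.inet_aton(ip))[0]

print(ip_to_int('10.0.1.1'))   # 167772417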
Like other people wrote, there are just two hashtables: one for IPs and one for products. You can count the occurrences for both, but you only care about the counts for the latter ("product popularity").
The key to hashing is having an efficient hash key, and hash efficiency means that the keys are evenly distributed. Poor key choice means that there will be many collisions, and performance of the hash table will suffer.
Being lazy, I'd be tempted to create a Dictionary<IPAddress,int> and hope that the implementor of the IPAddress class created the hash key appropriately.
Dictionary<IPAddress, int> ipAddresses = new Dictionary<IPAddress, int>();
Dictionary<string, int> products = new Dictionary<string,int>();
For the products list, after you've set up the hash table, just use LINQ to select them out.
var sorted = from o in products
orderby o.Value descending
where o.Value > 1
select new { ProductName = o.Key, Count = o.Value };
Just sort the file by each of the two fields of interest. That avoids any need to worry about hash functions and will work just fine on a million record set.
Sorting the IP addresses this way also makes it easy to extract other interesting information such as accesses from the same subnet.
Unique IPs:
$ awk -F\; '{print $2}' log.file | sort -u
count.awk
{ a[$0]++ }
END {
    for (key in a) {
        print a[key] " " key;
    }
}
Top 10 favorite items:
$ awk -F\; '{print $3}' log.file | awk -f count.awk | sort -r -n | head