Finding line duplicates in text file where lines can be identical to each other - text-processing

I've made a system where the data in the database is populated when the system reads a file. More data may be added to this file at a later stage, which means the same file has to be read again.
Each line of the file is one record, and the tough part is finding which values are actually new, and I'll tell you why.
The file may look like this:
123 20110101 4123 Hello
123 20110101 4123 Hello
124 20110102 6133 Hello again
125 20110103 6425 Yes
The real problem here is that the first two lines aren't duplicates in a logical sense, even though they are identical, so they're both going to get read into the database by the system.
As I said earlier, this file may be added to at a later stage, making it necessary that we read it again. As I was not familiar with how text was appended to the file, I made the assumption that new data would be appended to the end of the file. Therefore I stored the file row number with each row in the database, to make lines unique. However, I was wrong...
As it turns out, data was inserted into the middle of the file as well.
This means we now may have the following file:
123 20110101 4123 Hello
123 20110101 4123 Hello
124 20110102 6133 Hello again
123 20110101 4123 Hello
125 20110103 6425 Yes
And now we come to the second time we read the file. In this case I only want to read the fourth line, as this is the only new line. How can I find the new line and skip the others?

Save the old version of the file, then run a diff on the old version and the new version. That will give you the newly added lines.
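If you would rather do the comparison inside the import job than shell out to diff, here is a minimal Perl sketch of the same old-versus-new idea; the file names old.txt and new.txt are assumptions. Counting occurrences is what keeps identical lines safe: a line is treated as new only when it appears more times in the new file than in the old one.
use strict;
use warnings;

# Count how many times each line occurred in the previously imported copy.
my %seen;
open my $old, '<', 'old.txt' or die "old.txt: $!";
$seen{$_}++ while <$old>;
close $old;

# Walk the refreshed file; consume one old occurrence per match,
# and print (i.e. import) only the occurrences that exceed the old count.
open my $new, '<', 'new.txt' or die "new.txt: $!";
while (my $line = <$new>) {
    if ($seen{$line}) {
        $seen{$line}--;      # this occurrence was already imported last time
    } else {
        print $line;         # genuinely new line: import this one
    }
}
close $new;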

Related

How to generate a 10000 lines test file from original file with 10 lines?

I want to test an application with a file containing 10000 lines of records (plus header and footer lines). I have a test file with 10 lines now, so I want to duplicate those lines 1000 times. I don't want to add C# code to my app to generate that file (it's only for testing), so I am looking for a different, simple way to do it.
What kind of tool can I use to do that? CMD? A Visual Studio/VS Code extension? Any thoughts?
If your data is textual, load the 10 records from your test file into an editor. Select all, copy, paste at the end of the file. Repeat until the file has 10000+ lines.
This procedure takes ceil(log_2(1000)) = 10 copy-paste cycles in your case; in general it takes ceil(log_2(<target_number_of_lines>/<base_number_of_lines>)) cycles.
Alternative (large files)
Modern editors should not have performance problems here. However, the same principle can be applied with the cat command. Assuming you copy the original file into a file named dup0.txt, proceed as follows:
cat dup0.txt dup0.txt >dup1.txt
cat dup1.txt dup1.txt >dup0.txt
leaving you with four times the number of lines in dup0.txt. Repeat the pair of commands until the file is large enough.
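If you prefer to script it instead, a minimal Perl sketch that writes the 10 records 1000 times in one go; the file names test10.txt and test10000.txt are assumptions, and the header and footer lines would still need to be added separately.
use strict;
use warnings;

open my $in,  '<', 'test10.txt'    or die "test10.txt: $!";
my @lines = <$in>;                  # the 10 original records
close $in;

open my $out, '>', 'test10000.txt' or die "test10000.txt: $!";
print {$out} @lines for 1 .. 1000;  # 10 lines x 1000 repetitions = 10000 lines
close $out;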

Reading the last 2 lines of a .log file via Matlab

The problem that I am attempting to solve is as follows:
I have a .log file that is updated every x seconds (an interval that I can change), with updated status information from a piece of test equipment. At each interval, another line is added to the .log file, with the updated information. My goal is to have the most recent status information (the last two lines of the .log file) easily viewable in Matlab.
Here is an example of what each update looks like, in case that is relevant (a single line of text):
What I have tried:
I used the readtable command (shown below) to view the information in the .log file, but that gives me the entire .log file every time the function is called, when I only want/need the last two lines.
data = readtable('FileName.log','FileType','text')
I know that this would be simpler if I were working with a .csv or .xlsx file, but the test equipment only updates the .log file, so I cannot just change the file type, as it would then not get updates.
Any advice would be appreciated.
If the .log file is in plain text format (as I assume based on your code snippet), you can get the last 2 lines of the file by using the following system command in MATLAB:
[status,output] = system(['tail -n 2 ', path]);
Please keep in mind that this requires the tail command to be available, which is not the case on Windows by default - however, you can get around this by installing a package that provides tail, for example Cygwin.

Paginate a big text file

I have a big text file. Each line in the file is a record. I need to parse the text file and show only 20 records at a time in an HTML table. I will have to support sorting as well.
What I am currently doing is reading the file line by line based on the parameters start, stop, and page_size, which are provided in the query string. It seems to work fine until I have to sort the records, because in order to sort I need to process every line in the text file.
So is there a Unix command I can use to extract a range of lines and sort them? I tried grep, but I do not know it well enough to solve this.
Take a look at the pr command. This is what we used to use all the time to paginate big files. You can set the page length, headers, footers, turn on line numbers, etc.
There's probably even a way to munge the output into HTML.
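If the sort has to happen before the page is cut out, here is a minimal Perl sketch, assuming the whole file fits in memory, that the records sort correctly as plain text, and that start and page_size come from the query string (records.txt and the literal numbers are assumptions):
use strict;
use warnings;

my ($file, $start, $page_size) = ('records.txt', 40, 20);   # assumed inputs

open my $fh, '<', $file or die "$file: $!";
my @records = sort <$fh>;          # sort every record lexically
close $fh;

# Cut out one page of records, $start being a 0-based record index.
my $end = $start + $page_size - 1;
$end = $#records if $end > $#records;
print @records[$start .. $end];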
How big is the file?
man sort

Comparing two files

I have two files.
file1's contents are as below:
===================================================
OUTPUT1:---------
orange
india
US
xx
OUTPUT2:---------
orange-1
india-1
US-1
xx
===================================================
file2's contents are as below:
OUTPUT1:---------
orange
india
US
xx
OUTPUT2:---------
orange-1
india-1
US-2
xx
===================================================
I want the difference of the two as below:
-----------------------
OUTPUT1: No evolution
----------------------
OUTPUT2: Evolution found
Before:US-1
After:US-2
----------------------
Is it possible to write a script in Perl for the above requirement?
Any help will be much appreciated.
No Perl, but something more awesome: diff!
It compares files:
[blender#arch Desktop]$ diff file1.txt file2.txt
11c11
< US-1
---
> US-2
11c11 says that line 11 of the first file was changed into line 11 of the second file (the numbers are line numbers in each file, not character positions).
Algorithm::Diff should do the job. It works on arrays (i.e. you can parse whatever input format you like) and generates diff-like output.
However, it might turn out the LCS algorithm is a bit of overkill for the task, and you should just go with hash tables instead.
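For illustration, a minimal Algorithm::Diff sketch, assuming both files fit in memory and are named file1.txt and file2.txt (the names and the output format are assumptions, not the exact report the question asks for):
use strict;
use warnings;
use Algorithm::Diff qw(diff);

# Read each file into an array of chomped lines.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "$path: $!";
    my @lines = <$fh>;
    chomp @lines;
    return @lines;
}

my @file1 = read_lines('file1.txt');
my @file2 = read_lines('file2.txt');

# diff() returns hunks; each change is [ '+' or '-', line index, text ].
for my $hunk ( diff(\@file1, \@file2) ) {
    for my $change (@$hunk) {
        my ($sign, $index, $text) = @$change;
        print $sign eq '-' ? "Before:$text\n" : "After:$text\n";
    }
}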
Possible in Perl, for sure; it's a pretty powerful language.
The degree of difficulty will be affected by the assumptions we can make about the data. Is it sorted? How big are the files?
If the data is unsorted and the files are too large to be held entirely in memory, then you may need to adopt a pipeline approach, first sorting and then "diffing"; in that case, if you have access to Unix heritage tools such as diff and sort, you may not even need Perl.
Assuming you want to use Perl, I'd suggest looking at the problem in stages (a rough sketch of the record-reading and comparison stages follows these steps):
Identify "records", which span multiple lines. Write code to consume a single file and build a representation of each record.
Solve the sort problem, if need be build an intermediate file containing the sorted records.
Do the diff across the two sorted files, if you can build a hash of one entire file in memory this is easy, otherwise you need to fetch records from one file or the other depending upon which one has the "next" record.
Having identified a change, print out the details in the desired format.
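Here is that rough sketch for the sample data above, assuming the files are small enough to hold in memory and that every record starts with a header line matching OUTPUTn:--------- (the file names and output format are assumptions):
use strict;
use warnings;

# Stage 1: read one file into a hash of records keyed by the OUTPUT header.
sub read_records {
    my ($path) = @_;
    my (%records, $key);
    open my $fh, '<', $path or die "$path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^=+$/;           # skip the ==== separator lines
        if ($line =~ /^(OUTPUT\d+):/) {    # a new record starts here
            $key = $1;
            $records{$key} = [];
        } elsif (defined $key) {
            push @{ $records{$key} }, $line;
        }
    }
    close $fh;
    return \%records;
}

# Stage 3: compare the two files record by record.
my $before = read_records('file1.txt');
my $after  = read_records('file2.txt');

for my $key (sort keys %$before) {
    my $old = join "\n", @{ $before->{$key} };
    my $new = join "\n", @{ $after->{$key} || [] };
    print $old eq $new ? "$key: No evolution\n" : "$key: Evolution found\n";
}
Printing the Before:/After: lines for a changed record would then just be a matter of walking the two arrays side by side.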

Editing YAML documents in-place with Perl

I am fairly new to Perl and YAML. I would like to read from a YAML file and also edit/write some of the property values without rewriting the entire config file (preserving the existing comments, blank lines, spacing, etc.).
I am using the YAML library in Perl. What would be a good way to achieve this?
You cannot readily write part of the file - you will end up rewriting the whole file. If you did write a partial file, you'd have to seek to a start position, truncate the file to that length (or truncate then seek/append), and then write the new tail of the file after the unchanged start. File systems do not support operations such as 'delete 329 bytes at offset 193 and insert 46 bytes after resulting offset 227'.
If your YAML module (library) preserves, or makes available, leading comments and blank lines somehow, then you'll be able to preserve them easily. If not, then you'll probably have to do the job yourself - read and save the comment lines, then use YAML to parse the file, then write the preserved comments and the replacement YAML.
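For the simple read-modify-rewrite case, a minimal sketch with the YAML module; config.yml and the log_level key are assumptions, and note that YAML.pm itself does not preserve comments or blank lines, so those would have to be carried over by hand as described above.
use strict;
use warnings;
use YAML qw(LoadFile DumpFile);

my $config = LoadFile('config.yml');   # parse the whole document into a Perl data structure
$config->{log_level} = 'debug';        # change the property value in memory
DumpFile('config.yml', $config);       # rewrite the whole file (comments are not preserved)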