Long text file to SAS dataset - import

I am trying to load a large text file (report) as a single cell in a SAS dataset, but because of multiple spaces and formatting the data is getting split into multiple cells.
Data l1.MD;
infile 'E:\Sasfile\f1.txt' truncover;
input text $char50.;
run;
I have 50 such files to upload, so keeping each file in a single cell is most important. What am I missing here?

Data l1.MD;
infile 'E:\Sasfile\f1.txt' recfm=f lrecl=32767 pad;
input text $char32767.;
run;
That would do it. RECFM=F tells SAS to read fixed-length records (ignoring line feeds), LRECL=32767 sets the record length to the maximum for a single character variable (records can be longer, but one variable is limited to 32,767 characters), and PAD fills the record with blanks if it is too short.
You'd only end up with more than one cell if your text file is longer than that. Note that the line feed and/or carriage return characters will be in this value, which may be good or may be bad. You can identify them as '0A'x and/or '0D'x (depending on your OS you may have one or both), and you can remove them with the COMPRESS function's 'c' (control characters) modifier or translate them to a line separator of your preference.


SAS PROC IMPORT GROUPED VARIABLES

How do I keep variables in separate columns when using PROC IMPORT with a tab-delimited txt file? Only one variable is created, called Name__Gender___Age. Is it only possible with the data step?
This is the code:
proc import datafile= '/folders/myfolders/practice data/IMPORT DATA/class.txt'
out=new
dbms=tab
replace;
delimiter='09'x;
run;
You told PROC IMPORT that your text file had tabs between the fields. From the name of the variable it created, it is most likely that your file instead just has spaces between the fields - multiple spaces, so that the lines look neatly aligned when viewed with a fixed-width font.
Just write your own data step to read the file (something you should do anyway for text files).
data new;
infile '/folders/myfolders/practice data/IMPORT DATA/class.txt' firstobs=2 truncover;
length Name $30 Gender $6 Age 8 ;
input name gender age;
run;
If there are missing values for either NAME or GENDER that are not entered as a period, then you will probably want to read the file using formatted or column-mode input instead of the simple list-mode input style above.
The data file appears to have space delimiters instead of tab, contrary to your expectations.
Because you specified tab delimiting, the spaces in the header row are considered part of a single column named Name Gender Age. Because spaces are not allowed in SAS variable names (by default), the spaces were converted to underscores. That is why you ended up with Name__Gender___Age.
Change the delimiter to space and you should be able to import.
If the data file has a mix of space and tab delimiting, you will want to edit the data file to be consistent.

Is there a way for Spark to read this odd text format?

The file format I have is sort of like CSV and looks like this (an Ab Initio .dat file of some sort):
1,apple,10.00,\n
2,banana,12.35,\n
3,orange,9.23,\n
The commas are actually "Start of Header" 0x01 byte characters, but I will use commas for simplicity. I can easily read the above sample by reading the file as a string RDD with a custom line split on ,\n and then passing that into spark.read.csv. I am splitting lines on ,\n because there may be newlines in the data, and I thought those two characters were unique to the end of each record. However, a problem occurs when there are newline characters at the start of text fields. For example:
1,one \n apple,10.00,\n
2,two banana,12.35,\n
3,\n three orange,9.23,\n
My current code is able to ignore the newline in record 1 but picks up the ,\n after the 3 and splits the 3 lines into 4. How can I reliably read in this format?
My current ideas are:
Check that there are the right number of , column delimiters before allowing a split. I am not sure how to implement this: is it possible to do a regex look-behind when Spark sees a ,\n and check for the correct number of delimiters? (See the sketch after this list.)
Try to coerce the file into some other format besides CSV
Make my own InputFormat class, although I am not sure what this entails.
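If every genuine record boundary is a ,\n followed by the next record's integer id and delimiter (or the end of the file), the first idea can be implemented with a regex look-ahead rather than a look-back. Here is a minimal sketch, using commas as the stand-in for the 0x01 delimiter just as in the samples above; the path is hypothetical, and the heuristic fails if a field value itself happens to contain ,\n followed by digits and a comma:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("odd-format").getOrCreate()
import spark.implicits._

// wholeTextFiles keeps each file intact, so embedded newlines are not
// broken up by the default line-oriented reader.
val records = spark.sparkContext
  .wholeTextFiles("/path/to/data.dat")  // hypothetical path
  .values
  // split on ",\n" only when it is followed by the start of the next
  // record (an integer id plus delimiter) or by the end of the input
  .flatMap(_.split(""",\n(?=\d+,|\z)"""))
  .filter(_.trim.nonEmpty)

val df = spark.read.csv(records.toDS())

Note that wholeTextFiles reads each file into memory in one piece, which is fine for many medium-sized files but not for one enormous one; in that case a custom InputFormat (your third idea) is the more scalable route.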

How can I remove trailing whitespace while loading data from a CSV file into a Postgres table?

I want to remove the trailing whitespace from a CSV file.
Sample CSV file data (delimiter = ";"):
X ;Y;Z
X1 ; Y1;Z1
X2;Y2; Z2
I would have gone for something like sed or grep, but the file is huge, so the extra preprocessing pass may hurt performance.
I am looking for a way to remove this whitespace at load time only.
The COPY command does not support preprocessing - you can't do it "at the time of loading". From https://www.postgresql.org/docs/current/static/sql-copy.html:
In CSV format, all characters are significant. A quoted value surrounded by white space, or any characters other than DELIMITER, will include those characters. This can cause errors if you import data from a system that pads CSV lines with white space out to some fixed width. If such a situation arises you might need to preprocess the CSV file to remove the trailing white space, before importing the data into PostgreSQL.
I think the best solution here would be to import the data with the spaces and then run:
update t set attr = rtrim(attr);

What code format shows proper line breaks?

I am exporting some Access tables to txt files and there are a lot of problems with the txt files. One of those problems is that line breaks are not visible in the txt file itself, yet if I copy a line containing a line break from Notepad into Notepad++, it breaks into 2 lines.
So I believe this may be an encoding problem, but I can't find the proper format to resolve it. I'm currently exporting with the default Western European encoding, but should I export to UTF-8, Unicode, ASCII, or something else?
When exporting from MS Access (or VB/VBA in general), make sure you're using the vbCrLf constant (Carriage Return plus Line Feed) for line breaks. That constant corresponds to the hex values 0D 0A.
In Windows, it is the convention to use those 2 characters together as a line break, while many other platforms, such as Unix/Linux/macOS, typically use just 0A.
That brings up an issue: Notepad, the standard Windows text file viewer, cannot deal with 0A alone and does not treat such bytes as line breaks. More advanced editors, such as Notepad++ or UltraEdit, display such files correctly, though.
The CSV export function in Microsoft Office applications (Excel, Access) terminates a data row with CR+LF and writes just LF into the file for a line break within a data value (multi-line string). (I think older versions of Office, before Office 2007, wrote just CR into the CSV file for a line break.)
Most text editors detect those LFs without CR (or CRs without LF) and convert them to CR+LF when loading the CSV file. The file then appears to have broken CSV lines, because the number of data values is wrong on every data row containing a value with a line break.
However, newline characters within a double-quoted value are correct according to the CSV specification, as described in the Wikipedia article on comma-separated values.
But most applications that can import CSV files do not support newline characters within a double-quoted value, and therefore import some data values incorrectly. Regular-expression replacements also can't be done on such a file, because the number of separator characters is not constant across lines.
UltraEdit has a special configuration setting for editing CSV files that use only LF (or CR) for a line break within a data value. At Advanced - Configuration - File Handling - DOS/Unix/Mac Handling, select either Never prompt to convert files to DOS format or Prompt to convert if file is not DOS format (clicking No if the prompt appears), and additionally enable Only recognize DOS terminated lines (CR/LF) as new lines for editing.
With those settings, a CSV file with CR+LF at the end of each data row and only LF (or CR) for line breaks within data values loads in UltraEdit with the number of lines equal to the number of data rows. The line feeds without carriage return (or carriage returns without line feed) show up in the lines as small rectangles, because no font defines a glyph for a carriage return or line feed; they are whitespace characters with no width. A Perl regular expression find searching for \r(?!\n)|(?<!\r)\n can then be used to locate those line breaks within data values and replace them with something else, such as a space character, or to remove them.
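If you would rather repair such a file programmatically than in an editor, the same expression works in any Perl-compatible regex engine. A minimal sketch in Scala (the file names and the Western European encoding are assumptions, chosen to match the export discussed above):

import java.io.PrintWriter
import scala.io.Source

// Read the whole export, then replace every lone LF or lone CR
// (a line break inside a quoted value) with a space, leaving the
// CR+LF row terminators untouched.
val text = Source.fromFile("export.csv", "ISO-8859-1").mkString
val cleaned = text.replaceAll("""\r(?!\n)|(?<!\r)\n""", " ")
new PrintWriter("export-fixed.csv", "ISO-8859-1") { write(cleaned); close() }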
Which character encoding (ASCII, ANSI, Unicode (UTF-16), UTF-8) to use on export depends on which characters can occur in the string values. A Unicode encoding is necessary if string values can also contain characters not included in the local code page.

Updating a line in a large text file using Scala

I have a large text file, around 43 GB, in .ttl format, containing triples of the form:
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://la.dbpedia.org/resource/Mahatma_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .
I want to find the fastest way to update a specific line inside the file without rewriting all the text that follows - either by updating the line in place, or by deleting it and appending the new version to the end of the file.
To access the specific line I use this code:
val lines = io.Source.fromFile("text.txt").getLines
val targetLine = lines.drop(10000000).next()
If you want to use text files, consider a fixed length/record size for each line/record.
This way you can use a RandomAccessFile to seek to the exact position of each line by number: You just seek to line * LineSize, and then update it.
It will not really help if you have to insert a new line, though. Other limitations: the file size will grow (because of the fixed record length), and there is always the risk of a record that is too big to fit.
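A minimal sketch of that seek-and-overwrite step, assuming ASCII content (so characters map one-to-one to bytes) and a record size you have fixed up front - both assumptions, and 512 is just an example:

import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets

val recordSize = 512  // fixed bytes per record, chosen up front

// Overwrite record `lineNumber` (0-based) in place: pad the new text
// with spaces to recordSize - 1 characters and keep the trailing '\n'
// so the file still reads as normal text.
def updateLine(path: String, lineNumber: Long, newText: String): Unit = {
  require(newText.length < recordSize, "new text does not fit the record")
  val raf = new RandomAccessFile(path, "rw")
  try {
    raf.seek(lineNumber * recordSize)  // direct jump, no scanning
    raf.write((newText.padTo(recordSize - 1, ' ') + "\n")
      .getBytes(StandardCharsets.US_ASCII))
  } finally raf.close()
}

With 512-byte records, updating line 10,000,000 becomes a single seek to byte 5,120,000,000 instead of streaming through tens of gigabytes.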
As for the initial conversion:
Get the maximum line length of the current file, then add 10% for example.
Now you have to convert the file once: Read a line from the text file, and convert it into a fixed-size record.
You could use a special character like | to separate the fields. If possible, use something like ;, so you get a .csv file.
I suggest padding the remaining space with spaces, so the result still looks like a text file which you can parse with shell utilities.
You could use a \n to terminate the record.
For example
http://x.com|http://x.com|http://x.com|...\n
or
http://x.com;http://x.com;http://x.com;...\n
where each . at the end represents a space character. So it's still somehow compatible with a "normal" text file.
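A sketch of that one-off conversion, under the same assumptions as above (fixed record size, ASCII content, hypothetical file names):

import java.io.{BufferedOutputStream, FileOutputStream}
import java.nio.charset.StandardCharsets
import scala.io.Source

val recordSize = 512  // measured maximum line length plus ~10%

val in  = Source.fromFile("text.ttl", "US-ASCII")
val out = new BufferedOutputStream(new FileOutputStream("fixed.ttl"))
try {
  // pad every line to the fixed record size so that record n
  // always starts at byte n * recordSize
  for (line <- in.getLines()) {
    require(line.length < recordSize, "line exceeds the fixed record size")
    out.write((line.padTo(recordSize - 1, ' ') + "\n")
      .getBytes(StandardCharsets.US_ASCII))
  }
} finally { in.close(); out.close() }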
On the other hand, looking at your data, consider using a key-value data store like Redis: You could use the line number or the 1st URL as the key.