SAS import word by word

I have one (maybe silly) question. I have a text file (it comes from an XML which is longer) which looks like:
|<interview><rul| |ebook><name>EQU| |OTE_RULES</name| |><version>aaaa| |ON
QUOTE TR2 v2| |.14</version></| |rulebook><creat| |edDate>2017-10-|
|23`16:00:16.581| | UTC</createdDa| |te><status>IN_P| |ROGRESS</$10tus|
|>`<lives>`<life n| |umber="1" clien| |tId="1" status=| |"IN_PROGRESS"><|
|pages>`<page cod| | e="QD_APP" numb| |er="1" name="Pl| |an type" create|
|dDate="2017-10-|
I would like to know if there is any way to import word by word, so I could clean the text and remove characters such as $, or keep the spaces in chunks such as
|umber="1" clien|
| e="QD_APP" numb|
Thank you for your help
Julen

SAS can certainly input by word as opposed to by ... whatever else you're inputting by. In fact the simplest way to import would be:
data want;
infile "yourfile.xml";
length word $1024; *or whatever the longest feasible "word" would be;
input word $ @@; *double trailing @ holds the line so each word becomes its own observation;
run;
If you don't tell SAS how to split words, it assumes they are delimited by spaces.
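If the words are delimited by something other than spaces, such as the pipes in your file, you can name the delimiters on the INFILE statement and strip unwanted characters with COMPRESS() afterwards. A minimal sketch, assuming "yourfile.xml" stands in for your actual path:
data want;
infile "yourfile.xml" dlm='| '; /* treat pipes and blanks as delimiters */
length word $1024;
input word $ @@; /* hold the line and read word by word */
word = compress(word, '$`'); /* drop characters you want cleaned away */
run;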

Related

How to remove multiple characters between 2 special characters in a column in SSIS expression

I want to remove the characters starting from '#' up to the ';' in a derived column expression in SSIS.
For example,
my input column values are,
and I want the output as,
Note: Length after '#' is not fixed.
Already tried in SQL but want to do it via SSIS derived column expression.
First of all: please do not post pictures. We prefer copy-and-pastable sample data. And please try to provide a minimal, complete and reproducible example, best served as DDL, INSERT and code, as I've done here for you.
And just to mention this: If you control the input, you should not mix information within one string... If this is needed, try to use a "real" text container like XML or JSON.
SQL Server is not meant for string manipulation. There is no RegEx or repeated/nested pattern matching. So we would have to use a recursive / procedural / looping approach. But - if performance is not so important - you might use an XML hack.
--DDL and INSERT
DECLARE @tbl TABLE(ID INT IDENTITY, YourString VARCHAR(1000));
INSERT INTO @tbl VALUES('Here is one without')
,('One#some comment;in here')
,('Two comments#some comment;in here#here is the second;and some more text');
--The query
SELECT t.ID
,t.YourString
,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML) AS SeeTheIntermediateXML
,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML).value('.','nvarchar(max)') AS CleanedValue
FROM @tbl t;
The result
+----+-------------------------------------------------------------------------+-----------------------------------------+
| ID | YourString                                                              | CleanedValue                            |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 1  | Here is one without                                                     | Here is one without                     |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 2  | One#some comment;in here                                                | One in here                             |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 3  | Two comments#some comment;in here#here is the second;and some more text| Two comments in here and some more text |
+----+-------------------------------------------------------------------------+-----------------------------------------+
The idea in short:
Using some string methods we can wrap your unwanted text in XML comments.
Look at this
Two comments<!--some comment--> in here<!--here is the second--> and some more text
When this XML is read with .value(), the content is returned without the comments.
Hint 1: Use '-->;' in your replacement to keep the semi-colon as delimiter.
Hint 2: If there might be a semi-colon ; somewhere else in your string, you would see the --> in the result. In this case you'd need a third REPLACE() against the resulting string.
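A minimal sketch of that third REPLACE (an assumption on my part: every leftover '--> ' comes from a lone semicolon, so it is simply turned back into ';'):
--Hint 2 cleanup: run REPLACE against the already-cleaned value
SELECT t.ID
,REPLACE(
CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML).value('.','nvarchar(max)')
,'--> ',';') AS CleanedValue
FROM @tbl t;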

Spark CSV Read Ignore characters

I'm using Spark 2.2.1 through Zeppelin.
Right now my spark read code is as follows:
val data = spark.read.option("header", "true").option("delimiter", ",").option("treatEmptyValuesAsNulls","true").csv("listings.csv")
I've noticed that when I use the .show() function, the cells are shifted to the right. In the CSV all the cells are in the correct places, but after going through Spark they get shifted. I was able to identify the culprit: the quotation marks are misplacing cells. Some cells in the CSV file are written like so:
{TV,Internet,Wifi,"Air conditioning",Kitchen,"Indoor fireplace",Heating,"Family/kid friendly",Washer,Dryer}
Actual output (please note that I used .select() and picked some columns to show the issue I am having.):
| description| amenities| square_feet| price|
+--------------------+--------------------+-----------------+--------------------+
|This large, famil...|"{TV,Internet,Wif...| Kitchen|""Indoor fireplace""|
|Guest room in a l...| "{TV,""Cable TV""| Internet| Wifi|
Expected output:
| description| amenities| square_feet| price|
+--------------------+--------------------+-----------------+--------------------+
|This large, famil...|"{TV,Internet,Wif...| 1400 | $400.00 |
|Guest room in a l...| "{TV,""Cable TV""| 1100 | $250.00 |
Is there a way to get rid of the quotations or replace them with apostrophes? Apostrophes appear to not affect the data.
What you are looking for is the regexp_replace function, with the syntax regexp_replace(str, pattern, replacement).
Unfortunately, I could not reproduce your issue as I didn't know how to write the listings.csv file.
However, the example below should give you an idea on how to replace certain regex patterns when dealing with a data frame in Spark.
This is reflecting your original data
data.show()
+-----------+----------+-----------+--------+
|description| amenities|square_feet| price|
+-----------+----------+-----------+--------+
|'This large| famil...'| '{TV|Internet|
+-----------+----------+-----------+--------+
With regexp_replace you can replace suspicious string patterns like this
import org.apache.spark.sql.functions.regexp_replace
data.withColumn("amenitiesNew", regexp_replace(data("amenities"), "famil", "replaced")).show()
+-----------+----------+-----------+--------+-------------+
|description| amenities|square_feet| price| amenitiesNew|
+-----------+----------+-----------+--------+-------------+
|'This large| famil...'| '{TV|Internet| replaced...'|
+-----------+----------+-----------+--------+-------------+
Replacing the problematic characters with this function should solve your problem. Feel free to use regular expressions in it.
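Applied to the quotation marks specifically, a sketch using the data frame and column names from the question (replacing each double quote with an apostrophe, as you suggested):
import org.apache.spark.sql.functions.regexp_replace
// Sketch: swap every double quote in the amenities column for an apostrophe.
// The double quote only needs escaping inside the Scala string literal;
// as a regex it matches a literal quote character.
val cleaned = data.withColumn("amenities", regexp_replace(data("amenities"), "\"", "'"))
cleaned.show()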

Is there a way to see raw string values using SQL / presto SQL / athena?

Edit after asked to better specify my need:
TL;DR: How do I show escaped whitespace characters (such as \r) in the Athena console when performing a query? So this: "abcdef\r" instead of this: "abcdef ".
I have a dataset with a column that contains some strings of variable length, all of them with a trailing whitespace.
Now, since I had analyzed this data before using Python, I know that this whitespace is a \r; however, if I SELECT my_column in Athena, it obviously doesn't show the escaped whitespace.
Essentially, what I'm trying to achieve:
my_column | ..
----------+--------
abcdef\r | ..
ghijkl\r | ..
What I'm getting instead:
my_column | ..
----------+--------
abcdef | ..
ghijkl | ..
If you're asking why I would want that: it's just to avoid having to parse this data through Python if I ever run into this situation again, so that I can immediately tell whether there are any weird escaped characters in my strings.
Any help is much appreciated.
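One approach, sketched here as an assumption rather than taken from the original thread (my_table stands in for your table name): render the raw bytes with Presto's to_utf8()/to_hex(), or substitute a visible token for the carriage return with chr(13):
-- Sketch: make invisible characters visible in Athena/Presto.
-- to_utf8() yields the string's bytes and to_hex() prints them, so a trailing
-- carriage return shows up as 0D; chr(13) is the carriage return itself.
SELECT my_column,
       to_hex(to_utf8(my_column)) AS raw_bytes,
       replace(my_column, chr(13), '\r') AS escaped
FROM my_table;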

How do I read the rest of a line of a text file in MATLAB with TEXTSCAN?

I am trying to read a text file with data according to a specific format. I am using textscan together with a string containing the format, so I can read the whole data set in one line of code. I've found how to read a whole line with fgetl, but I would like to use as few lines of code as possible, so I want to avoid writing my own for loops. textscan seems great for that.
As an example I'll include a part of my code which reads five strings representing a modified dataset, its heritage (name of old dataset), the date and time of the modification and lastly any comment.
fileID = fopen(filePath,'r+');
readContentFormat = '%s = %s | %s %s | %s';
content = textscan(fileID, readContentFormat, 'CollectOutput', 1);
This works for the time being if the comment doesn't have any delimiters (like a white space) in it. However, I would like to be able to write comments at the end of the line.
Is there a way to use textscan and let it know that I want to read the rest of a line as one string/character array (including any white spaces)? I am hoping for something to put in my variable readContentFormat, instead of that last %s. Or is there another method which does not involve looping through each row in the file?
Also, even though my data is very limited I am keen to know any pros or cons with different methods regarding computational efficiency or stability. If you know something you think is worth sharing, please do so.
One way that is satisfactory to me (but please share any other methods anyway!) is to set the delimiters to characters other than white space, and trim away any leading or trailing white spaces with strtrim. This seemed to work well, but I have no idea how demanding the computations are.
Example:
The text file 'testFile.txt' in the current folder has the following lines
File |Heritage |Date and time |Comment
file1.mat | oldFile1.mat | 2018-03-01 14:26:00 | -
file2.mat | oldFile2.mat | 2018-03-01 13:26:00 | -
file3.mat | oldFile3.mat | 2018-03-01 12:26:00 | Time for lunch!
The following code will read the data and put it into a cell array without leading or trailing white spaces, with few lines of code. Neat!
function contentArray = myfun()
fileID = fopen('testFile.txt','r');
content = textscan(fileID, '%s%s%s%s','Delimiter', {'|'},'CollectOutput', 1);
fclose(fileID); % close the file once read
contentArray = strtrim(content{1}(2:4,:));
end
The output:
contentArray =
3×4 cell array
'file1.mat' 'oldFile1.mat' '2018-03-01 14:26:00' '-'
'file2.mat' 'oldFile2.mat' '2018-03-01 13:26:00' '-'
'file3.mat' 'oldFile3.mat' '2018-03-01 12:26:00' 'Time for lunch!'
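As for reading the rest of a line with the original format string: the %[^\n] conversion should do exactly that, since it consumes every remaining character up to the newline, white space included. A sketch against the question's own format:
% Sketch: %[^\n] reads the remainder of the line as one field,
% so trailing comments may contain white space.
fileID = fopen(filePath,'r');
readContentFormat = '%s = %s | %s %s | %[^\n]';
content = textscan(fileID, readContentFormat, 'CollectOutput', 1);
fclose(fileID);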

Search text between symbols

I have this text (taken from a concatenated field row):
Astronomic Event 2013/1434H - Aceh ....
How do we search it by the keywords 2013 or 1434h?
I have tried the code below, but it returns no rows.
to_tsvector result:
'2013/1434h':8,12 'aceh':1 'bin.....
Sample Case:
WITH sample_table as
(SELECT to_tsvector('Astronomic Event 2013/1434H - Aceh') sample_content)
SELECT *
FROM sample_table, to_tsquery('2013') q
WHERE sample_content @@ q
How do we search it by the keywords 2013 or 1434h?
It seems like you want to replace:
to_tsquery('2013') q
with:
to_tsquery('1434h | 2013') q
http://www.postgresql.org/docs/current/static/functions-textsearch.html
Side note: the to_tsquery() syntax is extremely capricious. It doesn't allow for much, if any, creativity, and many of the assumptions in Postgres are anything but end-user friendly.
More often than not, you'll be better off using plainto_tsquery(), which allows any amount of garbage to be thrown at it. Thus, consider pre-processing the string before issuing the query. For instance, you could split the string, and OR the original parts together:
where sc.text_index @@ (plainto_tsquery('1434h') || plainto_tsquery('2013'))
Doing so will make your code a bit more complex, but it won't rely on your users needing to understand that (contrary to what they're accustomed to in Google) they should enter 'quick & brown & fox & jumps & lazy & dog' instead of plain 'The quick brown fox jumps over the lazy dog'.
Edit: I ended up actually trying your sample query, and it seems you're actually running into a parser issue:
# SELECT alias, description, token FROM ts_debug('Astronomic Event 2013/1434H - Aceh');
alias | description | token
-----------+-------------------+------------
asciiword | Word, all ASCII | Astronomic
blank | Space symbols |
asciiword | Word, all ASCII | Event
blank | Space symbols |
file | File or path name | 2013/1434H
blank | Space symbols |
blank | Space symbols | -
asciiword | Word, all ASCII | Aceh
(8 rows)
http://www.postgresql.org/docs/current/static/textsearch-parsers.html
It looks like you might need to write (or find) and configure an app-specific parser. Having never done this personally, the best I can do is to highlight that Postgres allows this and includes a sample:
http://www.postgresql.org/docs/current/static/test-parser.html
Alternatively, change your tsvector-related trigger so that it matches e.g. \d{4}/\d+[a-zA-Z] or whatever seems most appropriate, and adds spaces accordingly, before converting it to a tsvector. Something as simple as the following might do the trick if you never need to store file names:
SELECT alias, description, token
FROM ts_debug(replace('Astronomic Event 2013/1434H - Aceh', '/', ' / '));
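Putting the pieces together, a sketch of the sample case with that pre-processing applied; both keywords now match because 2013 and 1434h are tokenized separately:
-- Sketch: replace() keeps the parser from seeing a file path,
-- after which either keyword matches.
WITH sample_table AS (
SELECT to_tsvector(replace('Astronomic Event 2013/1434H - Aceh', '/', ' / ')) AS sample_content
)
SELECT *
FROM sample_table, to_tsquery('2013 | 1434h') q
WHERE sample_content @@ q;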