Stata: importing a txt file with several multi-character delimiters

I have data with very odd delimiters:
1,|ABC1|,|BUD|,|Fed Budget & Appropriations|,|t1|
2,|ABC2|,|LBR|,|Labor, Antitrust & Workplace|,|t2|
3,|ABC3|,|UNM|,|Unemployment|,|t1|
So the delimiter is a comma, and each variable except the first one (the identifier) is enclosed in pipes. The problem is that the fourth variable also contains commas, so I can't simply use commas as delimiters and delete the pipes. I have found a way to clean the data with some find-and-replace operations in the terminal, but I would like to do this within Stata. Does anyone have an idea how to do that?

I put your data example into a text file and found that the delimiters were detected quite well automatically. Then I dropped any variable that was all commas or all missing, using findname from the Stata Journal.
. import delimited "troublesome.txt"
(9 vars, 3 obs)
. list
     +-----------------------------------------------------------------+
     | v1  v2    v3  v4   v5  v6                            v7  v8  v9 |
     |-----------------------------------------------------------------|
  1. | 1,  ABC1  ,   BUD  ,   Fed Budget & Appropriations   ,   t1  .  |
  2. | 2,  ABC2  ,   LBR  ,   Labor, Antitrust & Workplace  ,   t2  .  |
  3. | 3,  ABC3  ,   UNM  ,   Unemployment                  ,   t1  .  |
     +-----------------------------------------------------------------+
. findname, all(# == ",")
v3 v5 v7
. drop `r(varlist)'
. findname, all(missing(#))
v9
. drop `r(varlist)'
. destring v1, ignore(",") replace
v1: character , removed; replaced as byte
. list
+-----------------------------------------------------+
| v1 v2 v4 v6 v8 |
|-----------------------------------------------------|
1. | 1 ABC1 BUD Fed Budget & Appropriations t1 |
2. | 2 ABC2 LBR Labor, Antitrust & Workplace t2 |
3. | 3 ABC3 UNM Unemployment t1 |
+-----------------------------------------------------+

Postgresql - Chain multiple regex_replace functions in single query?

Using PostgreSQL 11.6. I have values in tab_a.sysdescr that I want to convert using regexp_replace and write the converted values into tab_b.os_type.
Here is table tab_a that contains the source string in sysdescr :
hostname | sysdescr |
-------------+-----------------+
wifiap01 | foo HiveOS bar |
switch01 | foo JUNOS bar |
router01 | foo IOS XR bar |
Here is table tab_b that is the target for my update, in column os_type :
hostname | mgmt_ip | os_type
-------------+--------------+---------
wifiap01 | 10.20.30.40 |
switch01 | 20.30.40.50 |
router01 | 30.40.50.60 |
This is example desired state for tab_b :
hostname | mgmt_ip | os_type
-------------+--------------+---------
wifiap01 | 10.20.30.40 | hiveos
switch01 | 20.30.40.50 | junos
router01 | 30.40.50.60 | iosxr
I have a working query that works against a single os_type. In this example, HiveOS:
UPDATE tab_b
SET os_type = (
    SELECT REGEXP_REPLACE(sysdescr, '.*HiveOS.*', 'hiveos')
    FROM tab_a
    WHERE tab_a.hostname = tab_b.hostname
)
WHERE EXISTS (
    SELECT sysdescr
    FROM tab_a
    WHERE tab_a.hostname = tab_b.hostname
);
What I can't figure out is how to "chain" multiple regexp_replace calls together into a single query, or via nested sub-queries. Adding 'OR' after that SELECT REGEXP_REPLACE line doesn't work, and I haven't been able to find examples online of something like this.
The end goal is a single query that will replace the strings as specified, updating os_type on every row in tab_b. I was hoping to avoid having to delve into PL/Python, but if that is the best way to solve this, that's okay. Ideally, I could define a third table that contains the pattern and replacement-string arguments and iterate over that somehow.
Edit: Example of what I am trying to accomplish
This is not valid code, but it hopefully demonstrates the goal: a single query that can be executed once and will translate/transform every sysdescr in one table into the proper os_type value in another table.
UPDATE tab_b
SET os_type = (
    SELECT REGEXP_REPLACE(sysdescr, '.*HiveOS.*', 'hiveos') OR
    SELECT REGEXP_REPLACE(sysdescr, '.*JUNOS.*', 'junos') OR
    SELECT REGEXP_REPLACE(sysdescr, '.*IOS XR.*', 'iosxr')
    FROM tab_a
    WHERE tab_a.hostname = tab_b.hostname
)
WHERE EXISTS (
    SELECT sysdescr
    FROM tab_a
    WHERE tab_a.hostname = tab_b.hostname
);
If foo and bar are consistent in all rows (as indicated in your example), then this should work:
postgres=# SELECT lower(replace(regexp_replace('foo IOS XR bar','foo (.*) bar','\1'),' ',''));
lower
-------
iosxr
(1 row)
In short, this does the following:
Trim off foo and bar from the front and back with regexp_replace()
Remove the spaces with replace()
Lower-case the text with lower()
If you need to do anything more than removing foo and bar, you can keep nesting string functions as demonstrated above.
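Folding that expression into the UPDATE from the question gives a single statement over all rows. This is only a sketch that assumes every sysdescr really has the 'foo ... bar' shape; it is not tested against your data:
UPDATE tab_b
SET os_type = (
    SELECT lower(replace(regexp_replace(a.sysdescr, 'foo (.*) bar', '\1'), ' ', ''))
    FROM tab_a a
    WHERE a.hostname = tab_b.hostname
)
WHERE EXISTS (
    SELECT 1
    FROM tab_a a
    WHERE a.hostname = tab_b.hostname
);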
I was able to solve this using a third table (lookup table). It contains two columns, one holding the match string and one holding the return string.
New table tab_lookup:
id | match_str | return_str
----+-----------------------------------------------+------------
1 | HiveOS | hiveos
2 | IOS XR | iosxr
3 | JUNOS | junos
5 | armv | opengear
6 | NX-OS | nxos
7 | Adaptive Security Appliance | asa
17 | NetScreen | netscreen
19 | Cisco Internetwork Operating System Software | ios
18 | Cisco IOS Software | ios
20 | ProCurve | hp
21 | AX Series Advanced Traffic Manager | a10
22 | SSG | netscreen
23 | M13, Software Version | m13
24 | WS-C2948 | catos
25 | Application Control Engine Appliance | ace
Using this query I can update tab_b.os_type with the appropriate value from tab_lookup.return_str:
UPDATE tab_b
SET os_type = (
    SELECT return_str
    FROM tab_lookup
    WHERE EXISTS (
        SELECT regexp_matches(sysdescr, match_str)
        FROM tab_a
        WHERE tab_a.hostname = tab_b.hostname
    )
);
The only catch I have encountered is that there must be only one match against a given row. But this is easily handled by using sufficiently specific match_str values: e.g. don't use 'IOS' but instead use 'Cisco IOS Software'.
All in all, very happy with this solution since it provides an easy way to update the lookup values, as more device types are added to the network.
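If keeping exactly one match per row ever becomes hard to guarantee, one possible refinement (an untested sketch, not part of the solution above) is to let the query pick the most specific pattern, for example the longest matching match_str:
UPDATE tab_b
SET os_type = (
    SELECT l.return_str
    FROM tab_lookup l
    JOIN tab_a a ON a.hostname = tab_b.hostname
    WHERE a.sysdescr ~ l.match_str        -- ~ is a regex match, like regexp_matches()
    ORDER BY length(l.match_str) DESC     -- prefer the most specific (longest) pattern
    LIMIT 1
);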

Importing text file with length delimiter

I have a text file, which contains only numbers.
For example:
2001 31110
199910 311
Its layout can be explained as follows:
Digits 1-4  : Year
Digits 5-6  : Month
Digits 7-8  : Day
Digit 9     : Sex
Digit 10    : Married
However, I can't figure out how to import this file into Stata.
For instance, if I use the command:
import delimited input.txt, delimiter(??)
What should I write in delimiter?
I don't necessarily need to use the above. I just want to import the data using whatever method.
The answer depends on what you want to do with the data later.
My understanding is that a space indicates a single-digit date component and that, in the text file, only the month or the day can be a single digit, but not both. In addition, sex and married are binary indicators taking values 0 and 1.
Assuming the above are correct and the data below are included in a file data.txt:
2001 31110
199910 311
1983 41201
2012121500
Here's one way to do it:
clear
import delimited data.txt, delimiter(" ") stringcols(_all)
list
+--------------------+
| v1 v2 |
|--------------------|
1. | 2001 31110 |
2. | 199910 311 |
3. | 1983 41201 |
4. | 2012121500 |
+--------------------+
replace v2 = "0" + v2 if v2 != ""
generate v3 = v1 + v2
generate year = substr(v3, 1, 4)
generate month = substr(v3, 5, 2)
generate day = substr(v3, 7, 2)
generate date = substr(v3, 1, 8)
generate sex = substr(v3, 9, 1)
generate married = substr(v3, 10, 1)
list
+----------------------------------------------------------------------------------+
| v1 v2 v3 year month day date sex married |
|----------------------------------------------------------------------------------|
1. | 2001 031110 2001031110 2001 03 11 20010311 1 0 |
2. | 199910 0311 1999100311 1999 10 03 19991003 1 1 |
3. | 1983 041201 1983041201 1983 04 12 19830412 0 1 |
4. | 2012121500 2012121500 2012 12 15 20121215 0 0 |
+----------------------------------------------------------------------------------+
You basically import everything into at most two string variables, with a single space " " acting as the separator. The single-digit months or days are changed to two digits by adding a 0 at the front. Then, after you extract the relevant parts of the strings using the substr() function, you can simply convert the resulting variables to numeric as needed.
For example:
destring year month day sex married, replace
generate date2 = daily(date, "YMD")
format date2 %tdDD-NN-CCYY
. list date2
+------------+
| date2 |
|------------|
1. | 11-03-2001 |
2. | 03-10-1999 |
3. | 12-04-1983 |
4. | 15-12-2012 |
+------------+
If in your text file both the month and the day can be single digits, you follow the same logic as above, but you will need to deal with a third variable as well after you import the data.

String splitting and operations on only some results

I have strings that look like this:
schedulestart | event_labels
2018-04-04 | 9=TTR&11=DNV&14=SWW&26=DNV&2=QQQ&43=FTW
That is how it looks when I view it in the database. I have code that relies on this string being in this format to display a schedule with events that have those labels on those days.
Now I find myself needing to break down the string in Postgres for reporting/analysis, and I can't really pull the string out and parse it in another language, so I have to stick to Postgres.
I've figured out a way to unpack the string so my results look like this:
User ID | Schedule Start | Unpacked String
2 | 2018-04-04 | TTR
2 | 2018-04-04 | 9
2 | 2018-04-04 | DNV
2 | 2018-04-04 | 11
2 | 2018-04-04 | SWW
2 | 2018-04-04 | 14
2 | 2018-04-04 | DNV
2 | 2018-04-04 | 26
select schedulestart, unnest(string_to_array(unnest(string_to_array(event_labels, '&')), '=')) from table;
Now what I need is a way to actually perform an interval calculation (so 2018-04-04 + 11 days::interval), which I can do if I extract only the numbers, but I also need to bind each result to its label. So the goal is an output like this:
eventdate | event_label
2018-04-12 | TTR
2018-04-20 | DNV
Here eventdate is the schedule start plus the day of the schedule the event falls on. I'm not sure how to take the unpacked string I created, use it to perform date calculations, and then tie the result back to the label.
I've considered doing only one unnest, so that it's 11=TTR and 14=DNV, but I'm not sure how to get from that to my desired result either. Is there a way to read a string up to a certain character and use that part in calculations, and then read everything past that character into a new column?
I'm aware that completely rewriting how this is handled would be ideal, but I did not initially write it, and I don't have the time or means to rewrite the ~20 places where this is used.
Here is your table (I added a userid column):
CREATE TABLE test(userid INTEGER, schedulestart DATE, event_labels VARCHAR);
And input data:
INSERT INTO test(userid, schedulestart, event_labels) VALUES
(2, DATE '2018-04-04', '9=TTR&11=DNV&14=SWW&26=DNV&2=QQQ&43=FTW');
And finally the solution:
SELECT
    userid,
    (schedulestart + (SPLIT_PART(kv, '=', 1) || ' days')::INTERVAL)::DATE AS eventdate,
    SPLIT_PART(kv, '=', 2) AS event_label
FROM (
    SELECT
        userid,
        schedulestart,
        REGEXP_SPLIT_TO_TABLE(event_labels, '&') AS kv
    FROM test
    WHERE userid = 2
) a;
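For the sample row above, this should return something along these lines (each date is schedulestart plus the number in front of the '='):
 userid | eventdate  | event_label
--------+------------+-------------
      2 | 2018-04-13 | TTR
      2 | 2018-04-15 | DNV
      2 | 2018-04-18 | SWW
      2 | 2018-04-30 | DNV
      2 | 2018-04-06 | QQQ
      2 | 2018-05-17 | FTW
(6 rows)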

How to get back aggregate values across 2 dimensions using Python Cubes?

Situation
Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my data tables, shown here in text format:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to use cubes to be able to do a display by pagination in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
Note the following:
Even though there were no records in sales for Saucer 12 under Store S3, I displayed 0 instead of null or none.
I want to be able to sort by store, say in descending order for S3.
The cells indicate the SUM total of that particular product spent in that particular store.
I also want to have pagination.
What I tried
This is the configuration I used:
"cubes": [
{
"name": "sales",
"dimensions": ["product", "store"],
"joins": [
{"master":"product_id", "detail":"product.id"},
{"master":"store_id", "detail":"store.id"}
]
}
],
"dimensions": [
{ "name": "product", "attributes": ["code", "name"] },
{ "name": "store", "attributes": ["code", "address"] }
]
This is the code I used:
result = browser.aggregate(drilldown=['Store', 'Product'],
                           order=[("Product.name", "asc"), ("Store.name", "desc"), ("total_products_sale", "desc")])
I didn't get what I wanted. Instead, I got this:
----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1 | 1 | 7.05 |
| 1 | 2 | 9 |
| 2 | 3 | 2.00 |
| and many more .... |
|---------------------------------------------|
which is the whole table with no pagination, and products that were not sold in a given store do not show up as zero.
My question
How do I get what I want?
Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?
Update
I have read more and realised that what I want is called dicing, as I need to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations
Cross-posted at Cubes GitHub issues to get more attention.
This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:
PostgreSQL Crosstab Query
And this about the "extra" column in the crosstab() output:
Pivot on Multiple Columns using Tablefunc
SELECT product_id, product
, COALESCE(s1, 0) AS s1 -- 1. ... displayed 0 instead of null
, COALESCE(s2, 0) AS s2
, COALESCE(s3, 0) AS s3
, COALESCE(s4, 0) AS s4
, COALESCE(s5, 0) AS s5
FROM crosstab(
'SELECT s.product_id, p.name, s.store_id, s.sum_amount
FROM product p
JOIN (
SELECT product_id, store_id
, sum(amount) AS sum_amount -- 3. SUM total of product spent in store
FROM sales
GROUP BY product_id, store_id
) s ON p.id = s.product_id
ORDER BY s.product_id, s.store_id;'
, 'VALUES (1),(2),(3),(4),(5)' -- desired store_id's
) AS ct (product_id int, product text -- "extra" column
, s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER BY s3 DESC; -- 2. ... descending order for S3
Produces your desired result exactly (plus product_id).
To include products that have never been sold replace [INNER] JOIN with LEFT [OUTER] JOIN.
SQL Fiddle with base query.
The tablefunc module is not installed on sqlfiddle.
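In your own database the extension only needs to be installed once per database (assuming the contrib modules are available):
CREATE EXTENSION IF NOT EXISTS tablefunc;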
Major points
Read the basic explanation in the reference answer for crosstab().
I am including product_id because product.name is hardly unique. Relying on the name alone might lead to sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works, because s3 references the output column where NULL values have been replaced with COALESCE. Else we would need DESC NULLS LAST to sort NULL values last:
PostgreSQL sort by datetime asc, null first?
For building crosstab() queries dynamically consider:
Dynamic alternative to pivot with CASE and GROUP BY
I also want to have pagination.
That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:
Displaying data in grid view page by page
I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size, I would add page numbers to the MV for easy and fast results; a rough sketch follows at the end of this answer.
To optimize performance for big result sets, consider:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Optimize query with OFFSET on large table
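As a sketch of that materialized-view idea (the view name product_sales_pivot is made up here, and the SELECT is just the crosstab() query from above without the final ORDER BY), pagination then becomes a plain LIMIT/OFFSET on the precomputed pivot:
CREATE MATERIALIZED VIEW product_sales_pivot AS
SELECT product_id, product
     , COALESCE(s1, 0) AS s1
     , COALESCE(s2, 0) AS s2
     , COALESCE(s3, 0) AS s3
     , COALESCE(s4, 0) AS s4
     , COALESCE(s5, 0) AS s5
FROM crosstab(
    'SELECT s.product_id, p.name, s.store_id, s.sum_amount
     FROM product p
     JOIN (
        SELECT product_id, store_id, sum(amount) AS sum_amount
        FROM sales
        GROUP BY product_id, store_id
        ) s ON p.id = s.product_id
     ORDER BY s.product_id, s.store_id'
  , 'VALUES (1),(2),(3),(4),(5)'
    ) AS ct (product_id int, product text
           , s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric);

-- Page 2 with a page size of 20 rows, sorted by store S3:
SELECT *
FROM product_sales_pivot
ORDER BY s3 DESC
LIMIT 20
OFFSET 20;
Run REFRESH MATERIALIZED VIEW product_sales_pivot whenever the underlying sales data changes.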

Iterate on a tMssqlInput in Talend

I use the latest version of Talend, 5.3.1.
I have a tMssqlInput component which queries my database like this:
SELECT IdInvoice, DateInvoice, IdStuff, Name FROM Invoice
INNER JOIN Stuff ON Invoice.IdInvoice = Stuff.IdInvoice
which results in something like this:
IdInvoice | DateInvoice | IdStuff | Name
1 | 2013-01-01 | 10 | test
1 | 2013-01-01 | 11 | test2
2 | 2013-02-01 | 12 | test3
2 | 2013-02-01 | 13 | test4
I'd like to export one file per invoice. Here are the specifications:
one header line with IdInvoice;DateInvoice
then one line per stuff like IdStuff;Name
example file 1:
1;2013-01-01
10;test
11;test2
example file 2:
2;2013-02-01
12;test3
13;test4
How can I resolve this case with Talend?
Probably with tFileOutputDelimited, but how can I have one file with multiple pieces of information and iterate over each IdInvoice?
Please go through the following link; it gives a clear idea of how to split data into multiple files:
http://www.talendfreelancer.com/2013/09/talend-tflowtoiterate.html