GROUP BY clause using SYNCSORT

I have a file on which I must generate statistics such as how many records are of type 1, type 2, etc. The number of types can change and is unknown to the code until the file arrives. In a SQL system, I could do this using COUNT and a GROUP BY clause, but I am not sure how to do it with SYNCSORT or a COBOL program. Would anyone here have an idea of how I can implement a 'GROUP BY' type query on a file using SYNCSORT?
Sample Data:
TYPE001 SUBTYPE001 TYPE01-DESC
TYPE001 SUBTYPE002 TYPE01-DESC
TYPE001 SUBTYPE003 TYPE01-DESC
TYPE002 SUBTYPE001 TYPE02-DESC
TYPE002 SUBTYPE004 TYPE02-DESC
TYPE002 SUBTYPE008 TYPE02-DESC
I want to get information such as TYPE001 ==> 3 Records, TYPE002 ==> 3 Records. What the code doesn't know until runtime is the TYPENNN values.

You show data already in sequence, so there is no need to sort the data itself, which makes SUM FIELDS= with SORT a poor solution if anyone suggests it (plus code for the formatting).
MERGE with a single input file and SUM FIELDS= would be better, but still require the code for formatting.
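If anyone does want to go that route, here is an untested sketch of what the control statements might look like; the count field at positions 81-83 and its three-digit width are assumptions about your record layout, so adjust to suit:
* Untested sketch of the MERGE plus SUM FIELDS= alternative.
* Assumes fixed-length records, that positions 81-83 are unused, and that
* no type occurs more than 999 times (widen the count field if it might).
 INREC OVERLAY=(81:C'001')
 MERGE FIELDS=(1,7,CH,A)
 SUM FIELDS=(81,3,ZD)
 OUTREC BUILD=(1,7,C' ==> ',81,3,ZD,EDIT=(IIT),C' Records')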
The simplest way to produce output which may suit you is to use OUTFIL reporting functions:
 OPTION COPY
 OUTFIL NODETAIL,
        REMOVECC,
        SECTIONS=(1,7,
          TRAILER3=(1,7,
                    ' ==> ',
                    COUNT=(M10,LENGTH=3),
                    ' Records'))
The NODETAIL says "remove all the data lines". The REMOVECC says "although it is a report, don't use printer-control characters on position one of the output records". The SECTIONS says "we're going to use control-breaks, and here they (it in this case) are". In this case, your control-field is 1,7. The TRAILER3 defines the output which will be produced at each control-break: COUNT here is the number of records in that particular break. M10 is an editing mask which will change leading zeros to blanks. The LENGTH gives a length to the output of COUNT, three is chosen from your sample data with sub-types being unique and having three digits as the unique part of the data. Change to whatever suits your actual data.
You've not been clear, and perhaps you want the output "floating" (3bb instead of bb3, where b represents a blank)? That would require more code...

Related

Using NEsper to read LogFiles for reporting purposes

We are evaluating NEsper. Our focus is to monitor data quality in an enterprise context. In an application we are going to log every change on a lot of fields - for example in an "order". So we have fields like
Consignee name
Consignee street
Orderdate
...and a lot more fields. As you can imagine, the log files are going to grow big.
Because the data is sent by different customers and is imported in the application, we want to analyze how many (and which) fields are updated from "no value" to "a value" (just as an example).
I tried to build a test case with just the fields
order reference
fieldname
fieldvalue
For my test cases I added two statements with context-information. The first one should just count the changes in general per order:
epService.EPAdministrator.CreateEPL("create context RefContext partition by Ref from LogEvent");
var userChanges = epService.EPAdministrator.CreateEPL("context RefContext select count(*) as x, context.key1 as Ref from LogEvent");
The second statement should count updates from "no value" to "a value":
epService.EPAdministrator.CreateEPL("create context FieldAndRefContext partition by Ref,Fieldname from LogEvent");
var countOfDataInput = epService.EPAdministrator.CreateEPL("context FieldAndRefContext SELECT context.key1 as Ref, context.key2 as Fieldname,count(*) as x from pattern[every (a=LogEvent(Value = '') -> b=LogEvent(Value != ''))]");
To read the test-logfile I use the csvInputAdapter:
CSVInputAdapterSpec csvSpec = new CSVInputAdapterSpec(ais, "LogEvent");
csvInputAdapter = new CSVInputAdapter(epService.Container, epService, csvSpec);
csvInputAdapter.Start();
I do not want to use the update listener, because I am interested only in the result of all events (probably this is not possible and this is my failure).
So after reading the csv (csvInputAdapter.Start() returns), I read all events, which are stored in the statement's NewEvents stream.
Using 10 entries in the CSV file everything works fine. Using 1 million lines it takes way too long. I tried it without any EPL statement (so just the CSV import) - it took about 5 seconds. With the first statement (not the complex pattern statement) I always stopped it after 20 minutes - so I am not sure how long it would take.
Then I changed the EPL of the first statement: I introduced a group by instead of the context.
select Ref,count(*) as x from LogEvent group by Ref
Now it is really fast - but I do not have any results in my NewEvents Stream after the CSVInputAdapter comes back...
My questions:
Is the way I want to use NEsper a supported use case or is this the root cause of my failure?
If this is a valid use case: Where is my mistake? How can I get the results I want in a performant way?
Why are there no NewEvents in my EPL-statement when using "group by" instead of "context"?
To 1), yes
To 2) this is valid; your EPL design is probably just a little inefficient. You would want to understand how patterns work: they use filter indexes and index entries, which are more expensive to create but extremely fast at discarding unneeded events.
Read:
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#processingmodel_indexes_filterindexes and also
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#pattern-walkthrough
Try the "previous" perhaps. Measure performance for each statement separately.
Also I don't think the CSV adapter is optimized for processing a large file. I think CSV may not stream.
To 3) check your code? Don't use a CSV file for large stuff. Make sure a listener is attached.
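For instance, a minimal sketch of attaching a listener to the group-by statement (the statement text is taken from the question; the Events wiring below assumes the usual NEsper C# handler and is untested):
// Sketch only: attach a listener so each aggregation update is delivered
// while the CSV adapter feeds events in, instead of polling afterwards.
var grouped = epService.EPAdministrator.CreateEPL(
    "select Ref, count(*) as x from LogEvent group by Ref");
grouped.Events += (sender, args) =>
{
    // args.NewEvents holds the updated aggregation rows for this batch
    foreach (var row in args.NewEvents)
    {
        Console.WriteLine("{0} => {1}", row.Get("Ref"), row.Get("x"));
    }
};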

Truncating a mergefield in Microsoft Word for mailmerge

I'm looking to create a formula and apply it to a mergefield in Word 2013. I have no access to the db and am only given merge fields. My goal is to truncate/shorten certain mergefields, e.g. Full Name to just the initial, Age [27 years] to short age [27].
Access and Excel have the function 'left', which I've tried to use with no success. There seem to be a lot more options available for numbers.
{=left({ MERGEFIELD First_Name },1)}
However, this gives a syntax error. Is there a list of formulas that work with mergefields?
Outcome
'Steven' -> 'S'
'27 Years' -> '27'
The short answer to your question is that there is nothing in the Word field language that can reliably do string manipulation such as left(), mid() and so on. The {=} field only has numeric functions such as SUM, ABS, PRODUCT and so on. There are unreliable approaches, and in some cases they may be reliable enough for your requirement, but that really depends on how sure you can be that the data source will always contain values formatted as you expect.
As a simple example, let's take the "27 years" thing.
If every value in the relevant data source column is in the same general format, which I will describe as "something that Word recognises as an individual number, followed by an alpha string", then you can in fact use
{ SET dat { MERGEFIELD age } }{ =dat }
Notice that in that case, if you are merging to a new document, { =dat } fields will remain in the output, and updating those fields will cause errors. You can avoid that by nesting either the { =dat } field or all the fields in a QUOTE field:
{ QUOTE "{ SET dat { MERGEFIELD age } }{ =dat }" }
{ SET dat { MERGEFIELD age } }{ QUOTE { =dat } }
However, if your data source field could contain a value such as
4 years 2 months
then this will not work, because in that case { =dat } will evaluate to 6, not 4. Word will also evaluate anything that looks like an { = } field expression, e.g. if your data source contains
SUM(23,25)
then { =dat } will evaluate to 48. There are further oddities that I will not describe now.
The simplest unreliable approach to extracting the first letter from a field is to use a large number of IF fields to test for every possible initial letter, e.g.
{ IF "{ MERGEFIELD First_Name }" = "A*" "A" }{ IF "{ MERGEFIELD First_Name }" = "B*" "B" } etc.
If you don't need to distinguish between lower and upper case you can use
{ IF "{ MERGEFIELD First_Name \*Upper }" = "A*" "A" } etc.
That's OK if you know (for example) that the names can only start with A-Z or a-z (and you could obviously test for 0-9 etc. as well). But what if they could start with any Unicode letter? I am not sure that inserting thousands of IF fields is a reliable approach.
There is a similarly "unreliable" - and resource-consuming - way to use functions such as left, mid etc., as long as you are using recent versions of Windows Word (not Mac Word).
What you can do is create a completely empty Access/Jet database .mdb (let's say it is at c:\i\i.mdb), then insert a DATABASE field nested in a QUOTE field like this:
{ QUOTE { DATABASE \d "c:\\i\\i.mdb" \s "SELECT left('{ MERGEFIELD First_Name }',1)" } }
Normally, a DATABASE field inserts a Word table (unless the data source has more columns than a Word table can contain), but when you only insert a single value with no headings, Word does not put the value in a cell. Unfortunately, these days Word does add a paragraph mark, but nesting the DATABASE field inside a QUOTE field appears to remove that again.
So why is that "unreliable"? Well, the main reason is that if the First_Name field contains any quotation marks (certainly single quotation marks, and OTTOMH I think double quotation marks too) then the query that Word sends to Jet will look like this
SELECT left('a name containing a ' mark',1)
and Jet will return a syntax error.
There are other problems with the DATABASE field approach, including
Word restricts the SELECT statement to 255 characters (I think). If your data source field causes the SELECT statement length to exceed that, Jet will return an error.
You have to put the database somewhere. If you are just using this merge yourself, that may not be a problem, but if you have to distribute the Word document etc. for others to use, you also have to ensure they have the .mdb and that it's at the specified location.
Word sometimes gets confused between a Mail Merge data source and a data source introduced via a DATABASE field.
Even one DATABASE field will execute a query for every record in the data source. If you use this technique in several places, a very large number of queries will be issued. That could cause problems.
As far as "single letter extraction" is concerned, there is another approach, rather similar to the DATABASE one, that uses an external .XML file and a set of INCLUDETEXT fields to specify a node in the file and return its content. But there are also similar difficulties. I may modify this Answer to describe that approach at some point, but as far as I know it has never been used in a real-world scenario.
So what if you need something more reliable? Well, there are several approaches, but all of them suffer from shortcomings of one kind or another. The main approaches I know are:
(1) use Word VBA and the OpenDataSource method to open the data source. That allows you to specify a query in the SQL dialect understood by the data source.
(2) use a Query/View defined in an intermediate database to extract the data items you need, and use that Query/View as your data source
(3) use Word VBA's MailMerge events to manipulate the data for each record in the data source as Word processes the mailmerge
(4) use a manual intermediate step
(5) (more drastic) ditch Word MailMerge and find another approach altogether, e.g. create a .docx using .NET, the relevant database provider, and the Office Open XML SDK
If you are creating this merge for use by other people, two side-effects of all those approaches are that the overall process becomes more complicated or unfamiliar for the user and, in particular, that they may not be able to use Word's facilities for data source record filtering and so on. Another issue that some people encounter is that if your database contains long text fields/memo fields longer than 255 characters, they have a tendency to be truncated by Jet whenever you do anything much more complicated than the default "SELECT * FROM TABLE".
(1) requires that you can write a suitable query to get the columns you need from your data source. Because the query is executed using OLE DB you don't actually need to create any permanent objects in your database. So it may be a viable approach as long as the backend database allows you to execute external queries. But Word also imposes a 255 or 511 character limit on the query, so if you have to manipulate a lot of fields or the functions you need are complicated, you may find that you exceed the character limit quite quickly.
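For illustration only (untested, with a hypothetical path, table and field names), the VBA call might look something like this:
' Sketch: the path, table and column names are placeholders for your own data source.
' Jet/ACE SQL understands Left() and Val(), so the string work happens in the query.
ActiveDocument.MailMerge.OpenDataSource _
    Name:="C:\data\source.accdb", _
    SQLStatement:="SELECT Left(First_Name,1) AS Initial, Val(Age) AS ShortAge FROM MyTable"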
(2) is rather similar to (1) but may allow you to specify a much more complex query. For example, if your data source is a Jet .accdb, you may be able to create your own .accdb and define a query in it that accesses the tables in the .accdb that you are not allowed to modify. You might either use "linked tables" to achieve that, or in certain cases you can specify the locations of the underlying tables/queries in the SQL.
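For example (again untested and with hypothetical names), a query saved in your own .accdb can reach into the database you cannot modify like this:
SELECT Left(First_Name,1) AS Initial, Val(Age) AS ShortAge
FROM MyTable IN 'C:\data\ReadOnlySource.accdb'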
(3) means that you use VBA to intercept Word as it processes each data source record. I leave you to research that. You have to control the process from VBA to ensure that the MailMerge events are invoked. There have been reports of various unreliabilities. VBA can only access the first 255 characters of any memo fields.
(4), e.g. you create an Excel workbook and use it to query the database. In that case you may be able to issue a much longer SQL query than you can in Word, and you may be able to create new Excel columns that manipulate the data using Excel formulas (I have never tried that, though). Then use that workbook as your data source.
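If you do go that way, the helper columns could be plain worksheet formulas, for example (a sketch assuming the full name is in A2 and the age text in B2):
Initial column:   =LEFT(A2,1)
Short age column: =LEFT(B2,FIND(" ",B2)-1)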
Finally, a web search should reveal a list of functions recognised by Word's "=" field, but recent Microsoft documentation tends to omit the IF() function. The ISO29500 documents on the .docx standard omit it as well, but I think that was not the intention and may be fixed in a future version of the standard. The functions are:
ABS, AND, AVERAGE, COUNT, DEFINED, FALSE, IF, INT, MIN, MAX, MOD, NOT, OR, PRODUCT, ROUND, SUM, TRUE.

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should indeed work, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement. Indeed, the job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation :
Your input is read from a CSV file (could be a database or any other kind of input)
The tDenormalize component will denormalize your value column (column 2), based on the value of the id column (column 1), separating fields with a specific delimiter (";" in my case), resulting in the 2 rows shown.
tMap: split the aggregated column into multiple columns by using Java's String.split() method and spreading the resulting array across multiple columns. The tMap should look like this:
Since Talend doesn't let you store array objects directly, make sure to store the split String as an Object. Then cast that Object back to an array on the right side of the map.
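A rough sketch of what those tMap settings might contain (the row, variable and column names are assumptions, since I cannot see your schema):
// Var table: one variable of type Object, called "parts" here
row2.aggregated_value.split(";")
// Output columns: cast the Object back to String[] and pick each element
((String[]) Var.parts)[0]
((String[]) Var.parts)[1]
((String[]) Var.parts)[2]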
That approach should give you the expected result.
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might encounter unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize is similar to an aggregation component, meaning it scans the whole input before processing, which can result in performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with 1 as id, and 6 entries with 2 as id). 6 columns are expected, meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you will probably HAVE TO add a column with a key.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name.
Like "Identifier, field name, value "
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInputDelimited.
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited

Is there any way for Access 2016 to sort the numbers that are part of a "text" data type formatted field as though they are numeric values?

I am working on a database that (hopefully) will end up using a primary key with both numbers and letters in the values to track lots of agricultural product. Due to the way in which the weighing of product takes place at more than one facility, I have no other option but to keep the same base number and add letters to it to denote split portions of each lot of product. The problem is that after I create record number 99, the number 100 suddenly floats up to sit underneath 10. This makes it difficult to maintain consistency and forces me to replace the alphanumeric lot ID with a strictly numeric value (for which I use "autonumber" as the data type) in order to keep it sorted. Either way, I need the alphanumeric lot ID, so having 2 IDs for the same lot can be confusing for anyone inputting values into the form. Is there a way around this that I am just not seeing?
If you're using a query as the data source then you may try to sort it by the string converted to a number, something like
SELECT id, field1, field2, ..
FROM YourTable
ORDER BY CLng(YourAlphaNumericField)
Edit: you may also try the Val function instead of CLng - it should not fail on non-numeric input.
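For example, with the same placeholder names as above:
SELECT id, field1, field2, ..
FROM YourTable
ORDER BY Val(YourAlphaNumericField)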
Why not format your key properly before saving? E.g. "0000099". You will avoid a costly conversion later.
Alternatively, you could use 2 fields as the composite PK. One with the Number (as Long) and one with the Location (as String).

Tableau: Create a table calculation that sums distinct string values (names) when condition is met

I am getting my data from a denormalized table, where I keep names and actions (among other things). I want to create a calculated field that returns a count of workgroup names, but only when there are more than five actions present in the DB for a given workgroup.
Here's how I did it when I wanted to check whether a certain action had been registered for a workgroup:
WINDOW_SUM(COUNTD(IF [action] = "ADD" THEN [workgroup_name] END))
When I try to do a similar thing with a count, I get "Cannot mix aggregate and non-aggregate arguments":
WINDOW_SUM(COUNTD(IF COUNT([Number of Records]) > 5 THEN [workgroup_name] END))
I know that there's a problem with the IF clause, but I don't know how to fix it.
How can I change the IF to be valid? Or maybe there's an easier way to do this that I am missing?
EDIT:
(after Inox's response)
I know that my problem is mixing aggregate with non-aggregate fields. I can't use a filter to do it, because I want to use this later as part of a more complicated view - filtering would destroy the whole idea.
No, the problem is mixing aggregated arguments (e.g., sum, count) with non-aggregated ones (e.g., any field used directly). And that's what you're doing by mixing COUNT([Number of Records]) with [workgroup_name].
If your goal is to know how many (unique) workgroup_name values have more than 5 records (which seems to be the idea behind your code), I think it's easier to filter and then count.
So first drag workgroup_name to Filters, go to the Condition tab, and select By field, Number of Records, Count, >, 5.
This way you'll keep only the workgroup_name values that have more than 5 records.
Now you can go with a simple COUNTD(workgroup_name)
EDIT: After clarification
Okay, then you need to add a marker that is fixed in your database, so table calculations won't help you.
By definition, a table calculation depends on the fields that are on the worksheet (and on how you decide to use those fields to partition or address), and it's only calculated AFTER being used in a sheet. That means each time you call the function it will recalculate, and for some analyses you may want to do, the fields needed to make the table calculation correct won't be there.
The same applies to aggregations (counts, sums, ...): the aggregation depends, well, on the level of aggregation you have.
In this case it's better to manipulate your data prior to connecting it to Tableau. I don't see a direct way (a single calculated field) that would solve your problem. What you can do is generate a new data source from Tableau (with the aggregation of the number of records for each workgroup_name), export it to csv or mdb, and then reconnect it in Tableau. But if you can manipulate your database outside Tableau, that's usually a better solution.
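For instance, if your source is a database you can query directly, the pre-aggregation might be as simple as the following sketch (table and column names are assumptions):
SELECT workgroup_name, COUNT(*) AS action_count
FROM your_denormalized_table
GROUP BY workgroup_name
Joining that count back onto the detail rows gives every record a fixed per-workgroup marker, which you can then filter on (action_count > 5) in Tableau without any table calculation.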