Error in makeClassifTask - columns to join must specify "on=" - classification

I am getting an error from makeClassifTask() in the mlr package.
task = makeClassifTask(data = data[,2:20441], target='Disease')
When I run this, I get the following error:
Provided data is not a pure data.frame but from class data.table, hence it will be converted.
Error in [.data.table(data, target) :
When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
If someone could help me out it'd be great.

Given that you did not provide the data, I can only guess; I suggest reading the documentation at https://mlr3book.mlr-org.com/tasks.html.
It looks like you left out the first column of your dataset, which might be your target, so makeClassifTask() cannot find the target column.
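A quick sanity check (a sketch, assuming the object and column names from the question) is to confirm that the target column survives the subsetting:
# should print TRUE if 'Disease' is still present after dropping column 1
"Disease" %in% names(data[, 2:20441])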

As @Shreyash Gputa correctly pointed out, converting the data.table object to a data.frame solves the issue:
task = makeClassifTask(data = as.data.frame(data[,2:20441]), target='Disease')
Provided, of course, that data[,2:20441] contains the target variable Disease...
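For completeness, a minimal sketch of the full conversion step (object name and column range taken from the question; setDF() from data.table converts the table to a data.frame in place):
library(data.table)
library(mlr)
setDF(data)                                  # data is now a plain data.frame
task = makeClassifTask(data = data[, 2:20441], target = 'Disease')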

Related

Which Stage is used to Combine Two Data Streams without a Common Key Field in DataStage (IBM)

I'm using DataStage version 11.7 and encountered the error message below from the Lookup stage while compiling the job:
"The supplied expression was empty."
In the Lookup Stage, there are two links from two transformers and there is no common key column between the two datasets.
I googled how to merge or combine two datasets from two transformers without a common key column, but I couldn't find a proper way to solve this issue or to implement my job in DataStage.
Is there anyone who knows how to solve this problem? If so, please let me know which stage is good for my job or how to solve the error. I would appreciate it.
If you need to join n:m, add a dummy column to each input link and fill it with a constant value like 1, then join over that column. Decide whether multiple matches should result in multiple output rows or whether the first match 'wins' - which would then be like a random n:1, since every row matches when joining over a constant value.
But if you need to join specific rows, that indicates there actually is a common key, just not an obvious or visible one. Either transform the sources so that they get a common key, or use an anchor table that provides the relationships: join it to the first source, then join the second source.
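For clarity, here is the dummy-key idea expressed as plain SQL rather than DataStage stages (table and column names are made up for illustration):
-- give both inputs the same constant key and join on it (an n:m cross join)
SELECT s1.*, s2.*
FROM   (SELECT a.*, 1 AS dummy_key FROM table_a a) s1
JOIN   (SELECT b.*, 1 AS dummy_key FROM table_b b) s2
  ON   s1.dummy_key = s2.dummy_key;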

PostgreSQL - Compare ts_vector fields

I have two tables holding data that comes from two different sources. One field in each table contains the title of a movie, but for reasons out of my control the titles are not always exactly the same.
So I use a tsvector to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two tsvectors on their text content alone, ignoring the numeric position values. If I compare the two fields directly, I only get exact matches between values, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare tsvectors.
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, the index is unlikely to help much. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other table), then please share the real query.
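For reference, a sketch of the comparison itself using strip(), assuming the tbl1/tbl2 tables and ts_title columns from the question:
-- match rows whose title vectors are equal once positions and weights are removed
SELECT t1.*, t2.*
FROM   tbl1 t1
JOIN   tbl2 t2
  ON   strip(t1.ts_title) = strip(t2.ts_title);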

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should indeed work, but I have not tried it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement; the job has only two transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation:
Your input is read from a CSV file (could be a database or any other kind of input)
The tDenormalize component denormalizes your value column (column 2) based on the id column (column 1), separating fields with a specific delimiter (";" in my case), resulting in 2 rows.
tMap: splits the aggregated column into multiple columns by using Java's String.split() method and spreading the resulting array across multiple columns (a small sketch of this split follows the explanation). The tMap should look like this:
Since Talend doesn't allow storing Array objects directly, make sure to store the split String as an Object, then cast that Object back to an array on the right side of the map.
That approach should give you the expected result.
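Here is a small, self-contained Java sketch of what the tMap expressions do (variable and class names are illustrative, not taken from an actual job):
// split the denormalized string on the tDenormalize delimiter and assign
// each element to its own output column
public class SplitSketch {
    public static void main(String[] args) {
        String aggregated = "a;b;c;d;e;f";      // one denormalized row, delimiter ";"
        String[] parts = aggregated.split(";");
        String col1 = parts.length > 0 ? parts[0] : null;
        String col2 = parts.length > 1 ? parts[1] : null;
        System.out.println(col1 + ", " + col2 + ", ...");
        // in the real tMap, each parts[i] feeds its own output column
    }
}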
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might get unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize behaves like an aggregation component, meaning it scans the whole input before processing, which can cause performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with id 1 and 6 entries with id 2). Since 6 columns are expected, you should always have 6 lines per id. If not, you should implement dbh's solution, and you probably HAVE TO add a key column.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name.
Like "Identifier, field name, value "
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInoutDelimited .
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited
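For illustration, the input this component expects would look roughly like this, one "identifier;field name;value" triple per line (the values are made up):
1;field1;a
1;field2;b
2;field1;c
2;field2;d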

GROUP BY CLAUSE using SYNCSORT

I have a file on which I must generate statistics, such as how many records are of type 1, type 2, etc. The number of types can change and is unknown to the code until the file arrives. In a SQL system I could do this with COUNT and a GROUP BY clause, but I am not sure whether I can do this with SYNCSORT or a COBOL program. Would anyone here have an idea how to implement a 'GROUP BY'-type query on a file using SYNCSORT?
Sample Data:
TYPE001 SUBTYPE001 TYPE01-DESC
TYPE001 SUBTYPE002 TYPE01-DESC
TYPE001 SUBTYPE003 TYPE01-DESC
TYPE002 SUBTYPE001 TYPE02-DESC
TYPE002 SUBTYPE004 TYPE02-DESC
TYPE002 SUBTYPE008 TYPE02-DESC
I want to get information such as TYPE001 ==> 3 Records, TYPE002 ==> 3 Records. What the code doesn't know until runtime is the TYPENNN value.
You show data already in sequence, so there is no need to sort the data itself, which makes SUM FIELDS= with SORT a poor solution if anyone suggests it (plus code for the formatting).
MERGE with a single input file and SUM FIELDS= would be better, but still require the code for formatting.
The simplest way to produce output which may suit you is to use OUTFIL reporting functions:
 OPTION COPY
 OUTFIL NODETAIL,
        REMOVECC,
        SECTIONS=(1,7,
          TRAILER3=(1,7,
            ' ==> ',
            COUNT=(M10,LENGTH=3),
            ' Records'))
The NODETAIL says "remove all the data lines". The REMOVECC says "although it is a report, don't use printer-control characters in position one of the output records". The SECTIONS says "we're going to use control-breaks, and here they are" (just one in this case); your control field is 1,7. The TRAILER3 defines the output produced at each control-break: COUNT here is the number of records in that particular break. M10 is an editing mask which changes leading zeros to blanks. The LENGTH gives a length to the output of COUNT; three is chosen from your sample data, where the sub-types are unique and have three digits as the unique part of the data. Change it to whatever suits your actual data.
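With the sample data above, the output from these control statements should look like this (the count right-justified in three positions by the M10 mask and LENGTH=3):
TYPE001 ==>   3 Records
TYPE002 ==>   3 Records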
You've not been clear, and perhaps you want the output "floating" (3bb instead of bb3, where b represents a blank)? That would require more code...

Stata merging changing my values?

I feel like I'm missing something really basic here...
I am trying to merge two datasets in Stata, FranceSQ.dta and FranceHQ.dta. They both have a variable that I created named "uid" that uniquely identifies the observations.
use FranceSQ, clear
merge 1:1 uid using FranceHQ, gen(_merge) keep(match)
Now what's confusing me is that it tells me that uid doesn't uniquely identify my observations. What I realized is happening is that when I open FranceSQ, everything is normal, and when I look at my uid variable I have the following values...
25010201
25010202
25010203
...
But then once I try to run the merge, it changes all of my values, so that I see...
2.50101e+10
2.50101e+10
2.50101e+10
...
Any help would be very appreciated...I'm sure there's a simple answer but it's eluding me at the moment.
*** EDIT ***
So Nick's advice helped, thanks! This is what I was doing that went wrong, so I wonder if someone could point out why it didn't work.
1) I created the uid variable in each dataset by concatenating two numeric variables, which cast the uid variable as a string.
2) I ran destring on the whole dataset (because there were a lot of incorrectly cast variables), which turned uid into a double.
3) Then I recast uid as a string. It was with this version that I could not do the initial merge. I noticed that the value it was changing all of my observations to was the last value in the dataset.
4) Just to experiment, I recast the uid variable as double and got the same results.
Now I finally got it to work by just starting over and not recasting the uid variable as a string in the first place, but I'm still at a loss as to why my previous efforts did not work, or how the merge command actually decided to change my values.
Very likely, this is a problem with precision. Long integers need to be held in long or double data types. You might need to recast one identifier before the merge.
You should check by looking at the results of describe whether uid has the same data type in both datasets.
To check whether your variable really identifies observations, type isid uid. Stata would complain if uid is not a unique identifier when performing merge, anyway, but that's a useful check on its own. If uid passes the check in both files, it should still do so in the merged file; it must be failing in at least one of the source files in order to fail in the merged file.
On top of Nick Cox's answer concerning data types, the issue may simply be formatting. Type describe uid to find out what the current display format is, and maybe format uid %12.0f to get rid of the scientific notation.
I think Stata promotes variables to the more accurate type when it needs to, say when you replace an integer-valued variable with non-integer values; the same thing should happen with merge when you have, say, byte values in one data set and merge in float values on the same variable from the other data set.
Missing values in uid may be the reason Stata does not believe this variable works well. Check for these, too, before and after the merge (see help data types, referenced above, concerning the valid ranges for each type).
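Putting the suggested checks together, a minimal Stata sketch (file and variable names taken from the question):
use FranceSQ, clear
describe uid       // check the storage type: should be long, double, or string, not float
isid uid           // errors out if uid does not uniquely identify observations
format uid %12.0f  // if uid is numeric, show the full integer instead of scientific notation
use FranceHQ, clear
describe uid
isid uid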