Amino acid screening library in KNIME - workflow

I have a task to create a tetrapeptide screening library from amino acids using KNIME. Sadly, I have never used KNIME before. I need to create a workflow with all 20 amino acids, combine it with another 20, then combine the result with another 20, and repeat to get a final result of tetrapeptides. Can someone suggest how to input the amino acids in KNIME? Thank you very much!

Use a Table Creator node to enter the amino acid single-letter codes, one per row. Now use a Cross Joiner node to cross-join the table to itself - you should now have a table with rows like:
A|A
A|C
etc.
Now put this table into both inputs of a second Cross Joiner node, which should now give you quite a long table starting something like:
A|A|A|A
A|A|A|C
A|C|A|A
A|C|A|C
etc.
Now use a Column Aggregator node, select all columns as aggregation columns, set the aggregation method to Concatenate, and change the delimiter to an empty string.
This will give you a table with a single column, 'Peptide':
AAAA
AAAC
ACAA
ACAC
etc.
If you want the output as a chemical structure, then as of v1.36.0 the Vernalis community contribution contains a node, Speedy Sequence to SMILES, which will convert the sequence to a SMILES string (make sure you select the option that your input column is a protein!)
The full workflow is as shown:
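If you want to sanity-check the combinatorics outside KNIME, the same cross-join-and-concatenate logic can be sketched in plain SQL; the amino table and code column below are made-up names for illustration, not part of the workflow:
-- hypothetical table amino(code) holding the 20 single-letter codes, one per row
SELECT a1.code || a2.code || a3.code || a4.code AS peptide
FROM amino a1
CROSS JOIN amino a2
CROSS JOIN amino a3
CROSS JOIN amino a4;
-- 20 x 20 x 20 x 20 = 160,000 tetrapeptides
Either way you should end up with 160,000 rows, which is a quick check that the two Cross Joiner nodes are wired correctly.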


Use the bin binary on a per-ticker basis

For two tables, each with datetime and ticker symbol columns, how can we achieve the functionality of the binary function bin within each ticker group? That is, instead of returning the latest index from the entire left table prior to the time of each right-table row, for a given right-table row it should return the latest index from the left table amongst only the rows with the same ticker symbol as that right-table row.
My first thought would be to add a per-group index in the left table, apply bin within each ticker group on its group index, and then use the unique (ticker, group-index) pair to find the index in the full left table. However, I am not sure how to implement this or whether this is the best way to achieve the desired functionality.
Could you give some sample inputs and desired output?
This sounds like something you can solve with aj
Check https://code.kx.com/q/ref/aj/ for details
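For reference, the per-ticker "latest row at or before each time" semantics described in the question can be sketched in plain SQL; the table and column names below are assumptions, and this is only an illustration of the semantics, not kdb+/q syntax:
-- hypothetical tables: left_tab(ticker, ts, ...) and right_tab(ticker, ts, ...)
SELECT r.ticker,
       r.ts,
       (SELECT MAX(l.ts)            -- latest left-table time at or before r.ts
        FROM left_tab l
        WHERE l.ticker = r.ticker   -- restricted to the same ticker
          AND l.ts <= r.ts) AS matched_ts
FROM right_tab r;
In q, aj matches exactly on all join columns except the last and as-of on the last (time) column, so listing the ticker column before the time column gives this same per-ticker behaviour.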

Is it possible to limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the documentation on sampling, but that only applies to the input of the Transformer tool and not to the outcome of the process.
You can do this, but it does take a bit of extra work in your recipe. Set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, filter rows to keep only those where that random integer equals a value of your choosing, and your output will contain only that randomized subset. Just have the last part of the recipe remove the temporary column.
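Purely to make the arithmetic concrete, here is the same keep-roughly-one-in-a-thousand idea expressed in plain SQL (an illustration only, not Dataprep recipe syntax; big_table is a stand-in name and random() is the PostgreSQL flavour of a random-number function):
-- illustration only: keep roughly 1 row in 1,000, i.e. about a million out of a billion
SELECT *
FROM big_table
WHERE floor(random() * 1000) + 1 = 1;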
So indeed there are 2 approaches to this.
As Courtney Grimes said, you can use one of the two functions that create a random number within a range:
randbetween
rand
These methods can be used to slice an "even" portion of your data. As suggested, use randbetween(1,1000), then pick one value x between 1 and 1,000 and keep only the rows where the result equals x, because that is roughly 1/1,000 of the data (a million out of a billion).
Alternatively, if you just want to have a million records in your output, but either
don't want to rely on knowing the size of the entire table, or
just want the first million rows, agnostic to how many rows there are,
you can instead use the row-filtering methods (Top rows / Range).
P.S.
By understanding the $sourcerownumber metadata parameter (see the in-product documentation), you can filter/keep a portion of the data (as per the first scenario) in one step, i.e. without creating an additional column.
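As a rough SQL analogue of that row-position idea (again an illustration only, with made-up names, not Dataprep syntax):
-- illustration only: number the source rows, then keep the first million by position
SELECT *
FROM (
  SELECT t.*, ROW_NUMBER() OVER () AS rn
  FROM big_table t
) numbered
WHERE rn <= 1000000;
-- or keep every 1,000th row instead: WHERE MOD(rn, 1000) = 0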
BTW, an easy way of discovering how-tos in Trifacta is to just type what you're looking for in the "Search transformation" pane (accessed via Ctrl+K). By searching "filter", you'll get most of the relevant options for your problem.
Cheers!

Cloud Dataprep - Multiply rows in one column based on values in another column

I am working in Cloud Dataprep and I have a case like this:
Basically I need to create new rows in column 2 based on how many rows there are with matching data in column 1.
Is it possible and how?
I understand that the scenario you want is: obtain all values from column1 that match a value present in column2. There are many things to consider in this scenario which you did not describe, such as: can values in column2 be repeated? Or, if there is a value in column2 missing from column1, what should happen? Or what happens the other way around?
However, as a general approach to this issue, I would do the following flow:
With a flow such as this one, you take the input table, which has two columns like this:
In the FIRST_COLUMN and SECOND_COLUMN recipes you split both columns into different branches and do the necessary steps to clean each column. For column1, I understand nothing needs to be done. For column2, I understand that you will have to remove duplicates (again, this is my guess, but it will depend on your specific implementation, which you have not completely described) and delete empty values. You can do that by applying the following transforms:
Finally, you can join both columns together. Depending on your needs (only values present in both columns should appear, only values present in columnX should appear, etc.) you should apply a different JOIN strategy. You should use a Join key like column1 = column2 (as in the image), and if you choose only the second column in the left-side menu, you will have a single-column result.
Note that in this case I used an Inner-join, but using other JOIN types will provide completely different results. Use the one that fits your requirements better.
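As a rough SQL analogy of this split / deduplicate / join flow (the table and column names are guesses from the description, and this is not Dataprep syntax):
-- illustration only: deduplicate column2, drop empty values, then inner-join back on column1
SELECT t1.column1
FROM input_table t1
JOIN (SELECT DISTINCT column2
      FROM input_table
      WHERE column2 IS NOT NULL) t2
  ON t1.column1 = t2.column2;
Switching the join type (LEFT, FULL, etc.) changes which unmatched values survive, just as with the JOIN strategies mentioned above.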

How to get all missing days between two dates

I will try to explain the problem on an abstract level first:
I have X amount of data as input, which is always going to have a DATE field. Before, the dates that came as input (after some processing) were put in a table as output. Now I am asked to output both the input dates and every date between the minimum date received and one year from that moment. If there was originally no input for some day between these two dates, all fields must come out as 0, or equivalent.
Example: I have two inputs, one with '18/03/2017' and the other with '18/03/2018'. I now need to create output data for all the missing dates between those two dates: so, output '19/03/2017' with every field set to 0, and the same for the 20th and the 21st and so on.
I know how to do this programmatically, but not in PowerCenter. I've been told to do the following (which I have done, but I would like to know of a better method):
Get the minimum date, day0. Then, with an Aggregator, create 365 fields, each holding day0+1, day0+2, and so on, to create an artificial year.
After that we do several transformations, like sorting the dates and a union between them, to get the data ready for a Joiner. The idea of the Joiner is to do a Full Outer Join between the original data and the all-zero data that we got from the previous Aggregator.
Then a Router picks, in one of its groups, the data that had actual dates (and fields without nulls), and in another group the rows where all fields are null; said fields are then given a 0 before finally being written to a table.
I am wondering how this can be achieved while, for starters, removing the need to add 365 days to a date. If I were to do this same process for 10 years instead of one, the task gets ridiculous really quickly.
I was wondering about an XOR type of operation, or some other function, that would cut the number of steps needed for what I (maybe wrongly) feel is a simple task. Currently I need 5 steps just to know which dates are missing between two dates, a minimum and one year from that point.
I have tried to be as clear as possible, but if I failed at any point please let me know!
I'm not sure what the aggregator is supposed to do?
The same with the 'full outer' join? A normal join on a constant port is fine :)
Can you calculate the needed number of 'duplicates' before the Joiner? In that case a Lookup configured to return 'all rows' and a less-than-or-equal predicate can help make the mapping much more readable.
In any case you will need a helper table (or file) with a sequence of numbers between 1 and the number of potential duplicates (or more).
I use our time dimension in the warehouse, which has one row per day from 1753-01-01 through the 200,000 following days, and a primary integer column with values from 1 and up ...
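To make that helper-table idea concrete, here is a hedged SQL sketch (the table and column names are assumptions) of expanding the minimum date with a numbers table:
-- sketch only: helper_numbers(n) holds 1, 2, 3, ... up to the number of days you might need
SELECT t.min_date + h.n - 1 AS missing_date
FROM (SELECT MIN(datefield) AS min_date FROM source_table) t
JOIN helper_numbers h
  ON h.n <= 365;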
You've identified that you know how to do this programmatically, and to be fair this problem is more suited to that sort of solution... but that doesn't exclude PowerCenter by any means: just feed the 2 dates into a Java transformation and apply some code to produce all dates between them, outputting a record for each. The Java transformation is ideal for record generation.
OK... so you could override your Source Qualifier to achieve this in the selection query itself (I'm giving an Oracle-based example as it's what I'm used to, and I'm assuming your input data comes from a table). I looked up the CONNECT BY syntax here:
SQL to generate a list of numbers from 1 to 100
SELECT (MIN(tablea.DATEFIELD) + levquery.n - 1) AS Port1
FROM tablea,
     (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 365) levquery
GROUP BY levquery.n
(Check if the query works for you - I don't have access to test it at the minute.)
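If it helps to see the remaining step as well, here is a hedged sketch (names like tablea, DATEFIELD and some_measure are assumptions) that outer-joins the generated calendar back to the source and defaults the missing measure to 0:
-- sketch only: generated one-year calendar LEFT JOINed to the source, missing measure defaulted to 0
SELECT cal.d AS datefield,
       COALESCE(t.some_measure, 0) AS some_measure
FROM (SELECT (SELECT MIN(DATEFIELD) FROM tablea) + LEVEL - 1 AS d
      FROM DUAL
      CONNECT BY LEVEL <= 365) cal
LEFT JOIN tablea t
  ON t.DATEFIELD = cal.d;
Swapping 365 for a larger number covers the 10-year case without adding any extra ports.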

How do you insert part of an existing column record into a new table in PostgreSQL?

I'm in the process of formatting a database and have found a column I'd like to format. It has 3 types of information in every record of the column. For example, a record in my history column shows as (American, born Estonia. 19011974). What I want to do is put this data into new individual columns to make them atomic: extract 'American' into a country column, 'Estonia' into a birthplace column, '1901' into a born column, and '1974' into a death column. UPDATE: However, some records hold nulls for parts of this; for example, another record in the same column might be (German, 19242004), so a normal regular expression wouldn't work for all the data, would it? Any help is appreciated!
What PostgreSQL statements would I use to obtain specific parts of this data from the individual records? I understand it would be an INSERT, and I have already come up with:
INSERT INTO historian (id,url)
SELECT object_id, url FROM maintable;
That statement allowed me to get those values into new columns, but those were already atomic so I could easily transition them. Thanks for any help! :)
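For the extraction itself, here is a hedged sketch using PostgreSQL's substring() with regular expressions; it assumes the history value always contains the two concatenated four-digit years and that the 'born ...' part is optional (as in the German example), so adjust the patterns to your real data:
-- sketch only: pull the pieces out of values like 'American, born Estonia. 19011974'
SELECT substring(history FROM '^\(?([^,]+),') AS country,       -- 'American'
       substring(history FROM 'born ([^.]+)\.') AS birthplace,  -- 'Estonia', or NULL if absent
       substring(history FROM '(\d{4})\d{4}') AS born,          -- '1901'
       substring(history FROM '\d{4}(\d{4})') AS death          -- '1974'
FROM maintable;
You could then feed those expressions into an INSERT ... SELECT like the one you already have, or into an UPDATE that fills the new columns.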