Zed specification: Promotion and applying an operation to more than one schema - specifications

I have an Array schema that keeps track of a sequence of Data schemas. Using promotion, I am able to promote the Increment operation for use with Array.
ArrayIncrement increments only one Data inside Array. How do I make it increment every Data in \ran data?

The basic obstacle in your approach to increment all values is that the use of the relational override in Promote (last line) specifies that all values in data' map to the same value as in data except at position index?.
One approach is to explicitly "iterate" over the relation for all elements:
--- ArrayIncrement ---
| ΔArray
| dom data = dom data'
| ∀ i:dom data; ΔData ·
| θData = data i ∧ θData' = data' i ∧ Increment
In the first line of the body we state that the domain stays the same; without it there would be infinitely many solutions with additional elements.
In the next line we set up the variables to represent the before and after state at the specific index analogously to the second line in Promote of your solution.
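To make the difference between the two predicates concrete, here is a small Python model of them. It is only an illustration, not Z: it assumes that a Data is a single number and that Increment means "after = before + 1" (neither schema is shown in the question), and the function names are made up for this sketch.

```python
def increment(before, after):
    """Assumed Increment predicate on one Data value: after = before + 1."""
    return after == before + 1


def array_increment(data, data_prime):
    """ArrayIncrement as a predicate: same domain, Increment holds at every index.

    `data` and `data_prime` model the sequences as dicts from index to value.
    """
    same_domain = data.keys() == data_prime.keys()
    return same_domain and all(increment(data[i], data_prime[i]) for i in data)


def promoted_increment(data, data_prime, index):
    """Promotion with relational override: only position `index` may change."""
    return (
        index in data
        and data.keys() == data_prime.keys()
        and increment(data[index], data_prime[index])
        and all(data_prime[i] == data[i] for i in data if i != index)
    )
```

With data = {1: 10, 2: 20}, the after state {1: 11, 2: 21} satisfies array_increment but not promoted_increment for any index, while {1: 11, 2: 20} satisfies promoted_increment at index 1 only.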


Talend sequence Generation with Data

I have generated my own sequence based on the data. I need to compare the current sequence with the previous sequence generated from the data.
If both sequences are the same, I should not increment the value. If the sequences are different, I need to increment the sequence using the Numeric.sequence system routine. How do I do that?
Example :
Generated sequence --1234567890 --1
Next sequence --1234567890 --1
If both have the same generated sequence, the value should remain the same.
Store the previous sequence in a variable so that you can compare against it. Instead of comparing now == next, in Talend you need to compare now == previous.
For this a tJavaRow should be enough: you can store the previous sequence in a global variable and compare it in the next iteration.
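The compare-with-previous logic can be sketched as follows. This is plain Python rather than the Java you would write inside a tJavaRow; the `previous` variable stands in for the global variable, and the counter stands in for the call to Numeric.sequence. The function name is hypothetical.

```python
def assign_sequence(keys, start=1):
    """Give each row a number that only increases when the key changes."""
    previous = None        # stand-in for the value kept in a global variable
    counter = start - 1    # stand-in for the Numeric.sequence counter
    result = []
    for key in keys:
        if key != previous:
            counter += 1   # only increment when the generated sequence differs
        result.append((key, counter))
        previous = key     # remember the current key for the next row
    return result
```

For the example in the question, two consecutive rows with 1234567890 both get the value 1, and the next different key gets 2.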
Alternatively, use a lookup on the target and filter out the row if the sequence is the same:
SOURCE (row1) -> filter
(if(row1.sequence !=row2.sequence))
>insert out
Target (Lookup row2)

How to do recursive calculation in SPSS Modeler

If I want to compute a value that relies on the previous one (Recursive functions) how can I do it in SPSS ? Example:
Q0 = 0
Qn = Q(n-1) + Constant
If by "... the previous one ..." you mean the value of the same field (or a different field) for the previous record, you can use the @OFFSET(FIELD, EXPR) function.
The function allows you to access values from records other than the current one based on a relative reference.
After much research I couldn't find any way to write a recursive function in SPSS Modeler. The only workaround is to use the R Transform node within SPSS. HTH.
Depending on what you need to do, you can either chain many derive nodes or refer to the previous value in a column after sorting them.
I started with creating a domain context for the stream data flow (iterations) with a simple csv source file with records keeping one field N (range from 1 to 100), just to limit the example. Then I connected this data source with a derive node that defines the variable field Q:
if not(@NULL(@OFFSET(N,1))) then @OFFSET(Q,1) + 2 else 0 endif
Here I used the value 2 for the Constant in the example above. I see this as a recursive function, and it relies on @OFFSET just as Kenneth suggested above.
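For readers who want to check what values the derive expression should produce, the same recurrence can be written in a few lines of Python. This is only a model of the logic (first record gets 0, every later record gets the previous Q plus the constant), not SPSS Modeler code, and the function name is made up.

```python
def derive_q(n_values, constant=2):
    """Compute Q0 = 0, Qn = Q(n-1) + constant, one Q per input record."""
    q = []
    for i, _ in enumerate(n_values):
        if i == 0:
            q.append(0)                    # no previous record: start at 0
        else:
            q.append(q[i - 1] + constant)  # previous record's Q plus the constant
    return q
```

For a linear recurrence like this one, record number n (counting from 0) simply ends up with n * constant, so the result is easy to verify by hand.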

How to delete data from an RDBMS using Talend ELT jobs?

What is the best way to delete from a table using Talend?
I'm currently using a tELTJDBCoutput with the action on Delete.
It looks like Talend always generates a DELETE ... WHERE EXISTS (<your generated query>) query.
So I am wondering whether we have to use the field values or can just put a fixed value of 1 (even in only one field) in the tELTmap mapping.
To me, putting real values looks useless, since in the WHERE EXISTS only the WHERE clause matters.
Is there a better way to delete using ELT components?
My current job is set up like so:
The tELTMAP component with real data values looks like:
But I can also do the same thing with the following configuration:
Am I missing the reason why we should put something in the fields?
The following answer is a demonstration of how to perform deletes using ETL operations where the data is extracted from the database, read in to memory, transformed and then fed back into the database. After clarification, the OP specifically wants information around how this would differ for ELT operations
If you need to delete certain records from a table then you can use the normal database output components.
In the following example, the use case is to take some updated database, check which records are no longer in the new data set compared to the old data set, and then delete the relevant rows in the old data set. This might be used for refreshing data from a live system to a non-live system, or in some other case where you need to manually move data deltas from one database to another.
We set up our job like so:
Which has two tMySqlConnection components that connect to two different databases (potentially on different hosts), one containing our new data set and one containing our old data set.
We then select the relevant data from the old data set and inner join it using a tMap against the new data set, capturing any rejects from the inner join (rows that exist in the old data set but not in the new data set):
We are only interested in the key for the output as we will delete with a WHERE query on this unique key. Notice as well that the key has been selected for the id field. This needs to be done for updates and deletes.
And then we simply need to tell Talend to delete these rows from the relevant table by configuring our tMySqlOutput component properly:
Alternatively you can simply specify some constraint that would be used to delete the records as if you had built the DELETE statement manually. This can then be fed in as the key via a main link to your tMySqlOutput component.
For instance, I might want to read in a CSV with a list of email addresses, first names and last names of people who are opting out of being contacted. I would then make all of these fields a key and connect this to the tMySqlOutput, and Talend will generate a DELETE for every row that matches the email address, first name and last name of the records in the database.
In the first example shown in your question:
you are specifically only selecting (for the deletion) products where SOME_TABLE.CODE_COUNTRY is equal to JS_OPP.CODE_COUNTRY and SOME_TABLE.FK_USER is equal to JS_OPP.FK_USER in your where clause, and the data you send to the delete statement sets CODE_COUNTRY equal to JS_OPP.CODE_COUNTRY and FK_USER equal to JS_OPP.FK_USER.
If you were to put a tLogRow (or some other output) directly after your tELTxMap you would be presented with something that looks like:
| tLogRow_1 |
|GBR |1 |
|GBR |2 |
|USA |3 |
In your second example:
You are setting CODE_COUNTRY to an integer of 1 (your database will then translate this to a VARCHAR "1"). This would then mean the output from the component would instead look like:
|tLogRow_1 |
|1 |
|1 |
|1 |
In your use case this would mean that the deletion should only delete the rows where the CODE_COUNTRY is equal to "1".
You might want to test this a bit further though because the ELT components are sometimes a little less straightforward than they seem to be.
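One way to test this outside Talend is to run the statement shape quoted in the question, DELETE ... WHERE EXISTS (subquery), against a throwaway SQLite database with two different select lists. The sketch below assumes that shape and reuses the table and column names from the question; the sample rows and the helper name are invented. It only shows how plain SQL behaves and does not verify what the ELT components actually emit.

```python
import sqlite3


def rows_left_after_delete(select_list):
    """Run DELETE ... WHERE EXISTS with the given select list; return remaining rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE some_table (code_country TEXT, fk_user INTEGER);
        CREATE TABLE js_opp (code_country TEXT, fk_user INTEGER);
        INSERT INTO some_table VALUES ('GBR', 1), ('GBR', 2), ('USA', 3), ('FRA', 4);
        INSERT INTO js_opp VALUES ('GBR', 1), ('GBR', 2), ('USA', 3);
        """
    )
    # Only the select list varies; the correlated WHERE clause stays the same.
    con.execute(
        f"""
        DELETE FROM some_table
        WHERE EXISTS (
            SELECT {select_list} FROM js_opp
            WHERE some_table.code_country = js_opp.code_country
              AND some_table.fk_user = js_opp.fk_user
        )
        """
    )
    rows = con.execute(
        "SELECT code_country, fk_user FROM some_table ORDER BY fk_user"
    ).fetchall()
    con.close()
    return rows
```

Calling it once with "js_opp.code_country, js_opp.fk_user" and once with "1" lets you compare the surviving rows directly. In plain SQL, EXISTS only checks whether the subquery returns any row, so both calls leave the same rows behind.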

store list in key value database

I am looking for the best way to store lists associated with a key in a key-value database (like berkleydb or leveldb).
For example:
I have users, and orders from user to user.
I want to store a list of order ids for each user, with fast access via range selects (for pagination).
How should I store this structure?
I don't want to store it in serializable format for each user:
user_1_orders = serialize(1,2,3..)
user_2_orders = serialize(1,2,3..)
because the list can be long.
I thought about a separate db file for each user, storing order ids as keys in it, but this does not solve the range select problem. What if I want to get user ids with range [5000:5050]?
I know about redis, but I am interested in a key-value implementation like berkleydb or leveldb.
Let's start with a single list. You can work with a single hashmap:
store in row 0 the count of the user's orders
for each new order store a new row with the count incremented
So your hashmap looks like the following:
key | value
0 | 5
1 | tomato
2 | celery
3 | apple
4 | pie
5 | meat
Steady increment of the key makes sure that every key is unique. Because the db is ordered by key, and provided the pack function translates integers into byte strings that sort in the same order as the integers, you can fetch slices of the list. To fetch orders between 5000 and 5050 you can use bsddb's Cursor.set_range or leveldb's createReadStream (JS API).
Now let's expand to multiple users' orders. If you can open several hashmaps, you can apply the above with one hashmap per user. You may hit some system limits, though (max number of open file descriptors or max number of files per directory). So you can instead share a single hashmap between several users.
What I explain in the following works for both leveldb and bsddb, provided that you pack keys correctly so that they follow the lexicographic order (byte order). So I will assume that you have a pack function. In bsddb you have to build a pack function yourself. Have a look at wiredtiger.packing or bytekey for inspiration.
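If you need something to start from, a pack function for a (user id, counter) pair can be sketched with the standard struct module. This is only one possible encoding, assuming both numbers are non-negative and fit in 64 bits; it is not what wiredtiger.packing or bytekey do internally.

```python
import struct

# Two unsigned 64-bit integers, big-endian: byte-wise comparison of the
# packed keys then agrees with comparing the (user_id, n) tuples.
_KEY_FORMAT = ">QQ"


def pack(user_id, n):
    """Encode (user_id, n) as a 16-byte key that sorts correctly."""
    return struct.pack(_KEY_FORMAT, user_id, n)


def unpack(key):
    """Decode a key produced by pack() back into (user_id, n)."""
    return struct.unpack(_KEY_FORMAT, key)
```

The fixed-width big-endian layout is what matters: with a textual key such as "1|10", the string would sort before "1|2", whereas pack(1, 2) sorts before pack(1, 10).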
The principle is to namespace the keys using the user's id. It's also called key composition.
Say your database looks like the following:
key | value
1 | 0 | 2 <--- count row for user 1
1 | 1 | tomato
1 | 2 | orange
... | ... | ...
32 | 0 | 1 <--- count row for user 32
32 | 1 | banana
... | ... | ...
You create this database with the following (pseudo) code:
db.put(pack(1, make_uid(1)), 'tomato')
db.put(pack(1, make_uid(1)), 'orange')
db.put(pack(32, make_uid(32)), 'banana')
The make_uid implementation looks like this:
def make_uid(user_uid):
    # retrieve the current count (row 0 of this user); start from 0 if it is missing
    counter_key = pack(user_uid, 0)
    value = db.get(counter_key) or 0
    value += 1  # increment
    # save new count
    db.put(counter_key, value)
    return value
Then you have to do the correct range lookup; it is similar to the single-list case, but with the composite key. Using the bsddb API cursor.set_range(key), we retrieve all items between 5000 and 5050 for user 42:
def user_orders_slice(user_id, start, end):
    key, value = cursor.set_range(pack(user_id, start))
    while True:
        key_user_id, order_id = unpack(key)
        # stop once we leave the requested range or this user's keys
        if key_user_id != user_id or order_id > end:
            break
        # the value is probably packed somehow...
        yield value
        key, value = cursor.next()
No error checks are done. Among other things, the slice user_orders_slice(42, 5000, 5050) is not guaranteed to return 51 items if you delete items from the list. A correct way to query, say, 50 items is to implement a user_orders_query(user_id, start, limit).
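A rough sketch of what user_orders_query(user_id, start, limit) could look like is below. To keep it runnable without bsddb or leveldb, it uses a tiny in-memory class, OrderedStore, as a stand-in for a key-ordered store; that class, its method names and the extra db parameter are inventions for this example and are not the API of either library. The pack function is the same assumed fixed-width encoding as before.

```python
import bisect
import struct


def pack(user_id, n):
    # assumed encoding: two big-endian unsigned 64-bit integers
    return struct.pack(">QQ", user_id, n)


def unpack(key):
    return struct.unpack(">QQ", key)


class OrderedStore:
    """In-memory stand-in for a key-ordered store (hypothetical, for illustration)."""

    def __init__(self):
        self._keys = []      # sorted list of keys
        self._values = {}    # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def delete(self, key):
        if key in self._values:
            del self._values[key]
            self._keys.remove(key)

    def scan_from(self, key):
        """Yield (key, value) pairs in key order, starting at the first key >= key."""
        i = bisect.bisect_left(self._keys, key)
        for k in self._keys[i:]:
            yield k, self._values[k]


def user_orders_query(db, user_id, start, limit):
    """Return up to `limit` (order_no, value) pairs for user_id with order_no >= start."""
    results = []
    for key, value in db.scan_from(pack(user_id, start)):
        key_user_id, order_no = unpack(key)
        # stop at the next user's keys or once enough items were collected
        if key_user_id != user_id or len(results) >= limit:
            break
        results.append((order_no, value))
    return results
```

Unlike the start/end slice, this keeps reading until it has collected `limit` items, so deleted orders leave no holes in a page. For the next page you would pass the last returned order number plus one as `start`.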
I hope you get the idea.
You can use Redis to store the list in a zset (sorted set), like this:
// this line is called whenever a user place an order
$redis->zadd($user_1_orders, time(), $order_id);
// list orders of the user
$redis->zrange($user_1_orders, 0, -1);
Redis is fast enough. But one thing you should know about Redis is that it stores all data in memory, so if the data eventually exceeds the physical memory, you have to shard the data on your own.
You can also use SSDB (https://github.com/ideawu/ssdb), which is a wrapper around leveldb and has APIs similar to Redis, but stores most data on disk; memory is only used for caching. That means SSDB's capacity is 100 times that of Redis, up to TBs.
One way you could model this in a key-value store which supports scans, like leveldb, would be to add the order id to the key for each user. So the new keys would be userId_orderId for each order. Now, to get the orders for a particular user, you can do a simple prefix scan: scan(userId*). This makes the userId range query slow; in that case you can maintain another table just for userIds, or use another key convention, Id_userId, for getting userIds between [5000-5050].
Recently I have seen hyperdex adding data types support on top of leveldb : ex: http://hyperdex.org/doc/04.datatypes/#lists , so you could give that a try too.
In BerkeleyDB you can store multiple values per key, either in sorted or unsorted order. This would be the most natural solution. LevelDB has no such feature. You should look into LMDB(http://symas.com/mdb/) though, it also supports sorted multi-value keys, and is smaller, faster, and more reliable than either of the others.

Removing duplicate lines from a large dataset

Let's assume that I have a very large dataset that cannot fit into memory. There are millions of records in the dataset and I want to remove duplicate rows (actually keeping one row from each set of duplicates).
What's the most efficient approach in terms of space and time complexity?
What I thought :
1. Using a bloom filter. I am not sure how it's implemented, but I guess the side effect is having false positives. In that case, how can we find out whether it's REALLY a duplicate or not?
2. Using hash values. In this case, if we have a small number of duplicate values, the number of unique hash values would be large and again we may have a problem with memory.
Your solution 2, using hash values, doesn't force a memory problem. You just have to partition the hash space into slices that fit into memory. More precisely:
Consider a hash table storing the set of records, where each record is only represented by its index in the table. Say for example that such a hash table would be 4GB. Then you split your hash space into k=4 slices. Depending on the last two bits of the hash value, each record goes into one of the slices. So the algorithm would go roughly as follows:
let k = 2^M
for i from 0 to k-1:
    t = new table
    for each record r on the disk:
        h = hashvalue(r)
        if (the M last bits of h == i):
            insert r into t with respect to hash value h >> M
    search t for duplicates and remove them
    delete t from memory
The drawback is that you have to hash each record k times. The advantage is that it can trivially be distributed.
Here is a prototype in Python:
# Fake huge database on the disks
records = ["askdjlsd", "kalsjdld", "alkjdslad", "askdjlsd"]*100
M = 2
mask = 2**M - 1  # selects the M last bits of a hash value

class HashLink(object):
    def __init__(self, idx):
        self._idx = idx
        self._hash = hash(records[idx])  # file access

    def __hash__(self):
        return self._hash >> M

    # hashlink are equal if they link to equal objects
    def __eq__(self, other):
        return records[self._idx] == records[other._idx]  # file access

    def __repr__(self):
        return str(records[self._idx])

to_be_deleted = list()
for i in range(2**M):
    t = set()  # only holds the records of slice i
    for idx, rec in enumerate(records):
        h = hash(rec)
        if (h & mask == i):
            if HashLink(idx) in t:
                to_be_deleted.append(idx)  # duplicate of a record already seen
            else:
                t.add(HashLink(idx))
The result is:
>>> [records[idx] for idx in range(len(records)) if idx not in to_be_deleted]
['askdjlsd', 'kalsjdld', 'alkjdslad']
Since you need to delete duplicate items, without sorting or indexing you may end up scanning the entire dataset for every delete, which is unbearably costly in terms of performance. Given that, you could consider an external sort or a database.
If you don't care about the ordering of the output dataset, create 'n' files, each of which stores a subset of the input dataset according to the hash of the record or of the record's key: compute the hash, take it modulo 'n', and store the record in the corresponding output file. Since every output file is now small, your delete operation will be very fast. For the output files you could use normal files or sqlite/Berkeley DB; I would recommend sqlite/bdb though.
To avoid scanning on every write to an output file, you could put a Bloom filter in front of every output file. A Bloom filter isn't that difficult, and a lot of libraries are available.
Choosing 'n' depends on your main memory, I would say. Be pessimistic and go with a huge value for 'n'. Once your work is done, concatenate all the output files into a single one.
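The "hash modulo n into separate files" idea can be sketched like this, using plain temporary files rather than sqlite/bdb and leaving out the Bloom filter. The function name is made up, records are assumed to be single lines of text without embedded newlines, and for the sake of a short example the result is returned as a list, whereas a real job would write each deduplicated bucket back to disk.

```python
import hashlib
import os
import tempfile


def dedupe_lines(lines, n_buckets=8):
    """Remove duplicate lines by partitioning them into n_buckets files first."""
    kept = []
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, "bucket_%d.txt" % b) for b in range(n_buckets)]
        files = [open(p, "w", encoding="utf-8") for p in paths]
        try:
            # Pass 1: equal lines hash to the same value, so they always
            # land in the same bucket file.
            for line in lines:
                digest = hashlib.sha1(line.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:8], "big") % n_buckets
                files[bucket].write(line + "\n")
        finally:
            for f in files:
                f.close()
        # Pass 2: each bucket is small enough to deduplicate in memory.
        for p in paths:
            seen = set()
            with open(p, encoding="utf-8") as f:
                for raw in f:
                    line = raw.rstrip("\n")
                    if line not in seen:
                        seen.add(line)
                        kept.append(line)
    return kept
```

Only one bucket's set of lines is in memory at a time during the second pass, which is the point of choosing a large 'n'. The output order follows the buckets, not the input, which matches the "if you don't care about ordering" condition.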