How to merge only some files? - merge

I am trying to merge part of a commit from the default branch (not all files, and only parts of some files) to a named branch. I tried graft, but it just takes the whole commit without giving me a chance to choose. How would this be done?
Example:
A---B---C---D
 \
  E---(G)
G does not exist yet. Let's say C and D each added 5 files and modified 5 files. I want G to have 2 of the 5 files added at C, all the modifications to one file, and one modification to another file. I would ideally like it to also have something similar from D.
When I selected "graft to local...", all I got was the whole C changeset. Same for "merge with local...".

The unit of merging is a whole changeset, so C and D should have been committed in smaller pieces. You could merge the whole thing now and revert some files, but then you won't be able to merge the rest later; it's considered merged already.
What I'd do is make a branch parallel to C-D, rooted at B in your example, that contains copies of the changes in C and D but splits them into coherent parts. Then you can merge whole changesets from that branch, and close (or perhaps even delete) the original C-D branch.
      C---D
     /
A---B--C1--D1--C2--D2   (equivalent to C--D)
     \
      E---(G?)
In the above, C1 and C2 together are equivalent to C. While I was at it, I went ahead and reordered the four new changesets (using a history-rewriting tool such as rebase), so that you can then simply merge D1 with E:
      C---D
     /
A---B--C1--D1--C2--D2
     \       \
      E-------G
If reordering the new changesets is not an option, you'd have to do some fancy tapdancing to commit the partial changesets in the order C1, D1, C2, D2; it's probably a lot less trouble to use graft (or transplant) to copy the changes that you're not allowed to merge separately. E.g., in the following you can still merge C1, but then you need a copy of D1 (labeled D1'), since there's no way to merge D1 without pulling C2 along with it.
      C---D
     /
A---B--C1--C2--D1--D2
     \   \
      E---G1--D1'

Related

How to handle concurrent adds on the same key in Last Write Wins map?

I am implementing an LWW map and in my design, all added key-value pairs have timestamps as is expected from LWW. That works for me until the same key is added in two replicas with different values at the same time. I can't understand how to make the merge operation commutative in this scenario.
Example:
Replica1 => add("key1", "value1", "time1")
Replica2 => add("key1", "value2", "time1")
Merge(Replica1, Replica2) # What should be the value of key1 in the resulting map?
Let's see what last write wins means in terms of causality. Say two clients C1 and C2 changed the same data D (same key) to values D1 and D2 respectively. If they never talk to each other directly, you can pick either D1 or D2 as the last value, and both will be OK.
But suppose they do talk to each other: C1 changed the value to D1 and informed C2, which as a result changed it to D2. Now D1 and D2 have a causal dependency; D1 happens before D2, so if your system picks D1 as the last value in the merge, you have broken the last-write-wins guarantee.
Now coming to your question: when two clients make two requests in parallel, those requests cannot have a causal dependency, since both were in flight together, so any value you pick is fine, as long as every replica picks the same one.
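To make that concrete: the usual way to keep the merge commutative in the concurrent case is to break timestamp ties with a deterministic total order, for example on the replica id. A minimal Python sketch (the names and tuple layout are illustrative, not from any particular library):

```python
def lww_merge(a, b):
    """Merge two LWW maps whose entries are key -> (value, timestamp, replica_id).

    The later timestamp wins; on a timestamp tie, the higher replica id wins.
    Any total order works, as long as every replica applies the same one.
    """
    merged = dict(a)
    for key, entry in b.items():
        current = merged.get(key)
        # Compare (timestamp, replica_id) lexicographically.
        if current is None or (entry[1], entry[2]) > (current[1], current[2]):
            merged[key] = entry
    return merged

# The scenario from the question: same key, same timestamp, different values.
r1 = {"key1": ("value1", 1, "replica1")}
r2 = {"key1": ("value2", 1, "replica2")}

# Both merge orders deterministically pick replica2's write.
assert lww_merge(r1, r2) == lww_merge(r2, r1) == {"key1": ("value2", 1, "replica2")}
```

Which replica wins the tie is arbitrary; what matters is that all replicas apply the same tie-breaking rule, so Merge(Replica1, Replica2) and Merge(Replica2, Replica1) converge to the same value.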

Snowpipe skips files

We have used Snowpipe for ~10 months now, and we recently ran into a case where only some of the files in a stage got loaded into the corresponding Snowflake table and any subsequent files were not detected. We verified that the underlying stage and pipe are in valid states.
Let's assume that the staging location is s3://<some_bucket>/some/path and that there are 5 files: file1.csv, file2.csv, file3.csv, file4.csv, file5.csv.
select metadata$filename, count(*) from @<DB_NAME>.<SCHEMA>.<STAGE_NAME> group by metadata$filename;
The output indicates that all 5 files were detected and the counts align with what's expected, but file4.csv and file5.csv never got ingested.
select * from table(information_schema.copy_history(table_name=>'<TABLE_NAME>', start_time=> dateadd(hours, -1000, current_timestamp())));
does not show the copy history, which makes us suspect the table/pipe is in an inconsistent state and wonder if there's a way out of this.
Note: the copy_history command works for other tables in the database.

Cherry-pick commit with its merge commit without solving conflict

Here is my case:
├── (c0) ── (c1) ── (c2-merge-commit) ── (b0)── (b1-merge-commit)
I wanted to combine c0, c1 and c2 into one commit and have this:
├── (squashed : c0, c1, c2-merge-commit) ── (b0)── (b1-merge-commit)
So I moved my HEAD to c2, then back to c0, and did a squash merge of c2 (HEAD@{1}):
git merge --squash HEAD@{1}
This part went quite well and I had:
├── (squashed : c0, c1, c2)
Now I need to add the two commits (b0) and (b1-merge-commit). I used cherry-pick, but in that case I need to resolve conflicts for b0. I don't want to resolve those conflicts again, as I already did that in (b1-merge-commit).
As a solution, I first squashed b0 and b1 into a new commit, and then cherry-picked that one onto (squashed : c0, c1, c2). But obviously this solution won't scale if I'm squashing 10+ commits with many merge commits flying around.
TL;DR:
I want to carry two commits on top of my branch: one of them is an actual change (b0) with conflicts, and the other is a merge commit (b1) where the conflicts are actually resolved. When I cherry-pick them one by one, git asks me to handle b0's conflicts, which were already resolved by the b1 merge commit. So far I could only manage by squashing b0 and b1 into one new commit and cherry-picking that onto my branch.
I could find two approaches so far, but neither of them gives exactly what I want. At least I wouldn't need to resolve the conflicts again.
The first one is as already mentioned in the question.
I squashed b0 and b1 in another branch, turned back to the original branch, and then cherry-picked the squashed b0+b1:
├── (squashed : c0, c1, c2)-(squashed: b0, b1)
The second one is to cherry-pick the merge commit directly, telling git to treat parent 1 as the mainline. This one is faster and simpler.
$> git cherry-pick -m 1 <hash-b1>
And the branch now looks like this:
├── (squashed : c0, c1, c2)-(b1-merge-commit)

Sort unmatched records using joinkeys

I have two GDG files (-1 and 0 versions). Using these two files, a flat file needs to be generated containing Insert records (records which are not in the -1 version but are in the 0 version), Delete records (records which are in the -1 version but not in the 0 version) and Update records (records which are in both versions, but where the 0 version might have changes in some of the fields). How can I get those update records? Can I do it using JOINKEYS, and if yes, how?
Note: an update can be anywhere from column 1 to the last column of the file (the 0 version of the GDG).
It is a simple JOINKEYS:
OPTION COPY
JOINKEYS F1=INA,FIELDS=(4,80),SORTED,NOSEQCK
JOINKEYS F2=INB,FIELDS=(4,80),SORTED,NOSEQCK
JOIN UNPAIRED
REFORMAT FIELDS=(F1:1,227,F2:1,227,?)
The OPTION COPY is for the Main Task, the bit which runs after the joined file is produced. SORT FIELDS=COPY is equivalent to OPTION COPY.
The assumption is that your data is already in key order. If not, remove the SORTED,NOSEQCK parameters, but bear in mind that you may then get "spurious" matches, where equal keys end up in different relative positions on the files because of inserts and deletes.
JOIN UNPAIRED gives you matches and both types of mismatch. JOIN UNPAIRED,F1,F2 is equivalent.
The REFORMAT statement defines the records on the joined file. You want all the data from both/either record, and you want to know whether there was a match and, if not, which input file had the record. That is what the question mark (?) is for. It will contain 'B' (on both files), '1' (on F1, or the first physically present JOINKEYS, only) or '2' (on the other JOINKEYS file only).
Then you need to output the data. I'll assume you want the data in separate places:
OUTFIL FNAMES=INSERT,
INCLUDE=(455,1,CH,EQ,C'1'),
BUILD=(1,227)
OUTFIL FNAMES=DELETE,
INCLUDE=(455,1,CH,EQ,C'2'),
BUILD=(228,227)
OUTFIL FNAMES=CHANGE,
INCLUDE=(455,1,CH,EQ,C'B',
AND,
1,227,CH,NE,228,227,CH),
BUILD=(1,454)
OUTFIL FNAMES=UNCHNGE,
SAVE,
BUILD=(1,227)
INCLUDE= (or OMIT=) includes or omits the data for that "OUTFIL Group". OUTFILs "run" concurrently (as in, the same record is presented to each in turn, then the next record, and so on).
FNAMES gives you the DDname to put in the JCL.
For CHANGE, the INCLUDE tests that the first record (known to match, due to the test for 'B') is not equal to the second. It is not exactly clear what output you want here. Currently those records are output as F2 appended to F1, with the entire (double-length) record written. You could also write the records in "pairs" (BUILD=(1,227,/,228,227)) or just one or the other of the records.
SAVE says "if this record hasn't appeared on any OUTFIL, output it here". It is certainly useful for testing, even if you don't want it in the final code.
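Stripping away the SORT syntax, the matching logic behind the '?' indicator and the OUTFIL routing can be sketched in Python (the keys and record contents are illustrative; here F1 is the 0 generation and F2 the -1 generation, matching the OUTFIL tests above):

```python
def join_unpaired(f1, f2):
    """Full outer join on key, tagged like the JOINKEYS '?' indicator:
    'B' = key on both files, '1' = F1 only, '2' = F2 only."""
    out = {}
    for key in sorted(set(f1) | set(f2)):
        if key in f1 and key in f2:
            out[key] = (f1[key], f2[key], "B")
        elif key in f1:
            out[key] = (f1[key], None, "1")
        else:
            out[key] = (None, f2[key], "2")
    return out

new_gen = {"k1": "rec1-new", "k3": "rec3"}   # F1: the 0 generation
old_gen = {"k1": "rec1-old", "k2": "rec2"}   # F2: the -1 generation

joined = join_unpaired(new_gen, old_gen)
# Route records the way the OUTFILs do: '1' -> insert, '2' -> delete,
# 'B' with differing record contents -> change.
inserts = [k for k, (_, _, flag) in joined.items() if flag == "1"]
deletes = [k for k, (_, _, flag) in joined.items() if flag == "2"]
changes = [k for k, (r1, r2, flag) in joined.items() if flag == "B" and r1 != r2]
assert (inserts, deletes, changes) == (["k3"], ["k2"], ["k1"])
```

Unchanged records ('B' with equal contents) are what the SAVE group would catch in the SORT step.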

Using merge in sas

I am a little bit confused about merging in SAS. For example, when people use the merge statement, it is sometimes followed by (in=a) or (in=b). What does that do exactly?
To elaborate more on vknowles' answer, in=a and in=b are useful in different types of merges. Let's say we have the following data step:
data inner left right outer;
  merge have1(in=a) have2(in=b);
  by ...;
  if a and b then output inner;
  else if a and not b then output left;
  else if not a and b then output right;
  else if not (a or b) then output outer;
run;
The data step will create 4 different datasets which are the basis of an inner join, left join, and right join.
The statement if a and b then output inner; will output only records in which the key is found in the datasets have1 and have2 which is the equivalent of a SQL inner join.
else if a and not b then output left; will output only the records that occur in the have1 dataset and not in have2. This is the equivalent of a left outer join in SQL, minus the matches. If you wanted a full left join you could either append the left dataset to the inner dataset or just change the statement to if (a and b) or (a and not b) then output left; (which simplifies to if a then output left;).
The third else if is just the opposite of the previous. Here you can perform a right join on the data.
The last else if outputs to the outer dataset. In a two-dataset merge this condition can never be true, since every observation comes from at least one of the datasets, so it is mainly useful as a debugging safeguard. Thanks to Robert for this addition.
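For readers coming from general-purpose languages, the a/b logic is just a keyed full outer join split by where each key appears. A rough Python sketch (dataset contents are illustrative; this ignores SAS's handling of duplicate by-values):

```python
def split_merge(have1, have2):
    """Split keys the way the data step's a/b flags do."""
    inner, left, right = [], [], []
    for key in sorted(set(have1) | set(have2)):
        a, b = key in have1, key in have2
        if a and b:
            inner.append(key)    # if a and b: key in both datasets
        elif a:
            left.append(key)     # a and not b: key only in have1
        else:
            right.append(key)    # not a and b: key only in have2
    # A key in neither dataset cannot occur, so there is no fourth bucket.
    return inner, left, right

have1 = {1: "x", 2: "y"}
have2 = {2: "z", 3: "w"}
assert split_merge(have1, have2) == ([2], [1], [3])
```

A full outer join is then the union of all three buckets, and a full left join is inner plus left.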
When you see a dataset referenced in SAS with () following it, the items inside are called dataset options. The IN= dataset option is valid when reading datasets with statements like SET, MERGE and UPDATE. The word following IN= names a variable that SAS will create; it will be true (1) when that dataset contributes to the current observation and false (0) otherwise.
A good example is using one dataset to subset another. For example, you might want to merge on data from a master lookup table to add an extra variable like an address or account type, but not want to pull in every id from the lookup table.
data want;
  merge my_data(in=in1) master_lookup(in=in2);
  by id;
  if in1;
run;
Or if you are stacking or interleaving data from more than one table and want to take action depending on which table the record came from:
data want;
  set one(in=in1) two(in=in2);
  by id;
  if in1 then source='ONE';
  if in2 then source='TWO';
run;
Let's say you have MERGE A (in=INA) B (in=INB);
When merging two datasets with a BY statement, you can have the following situations:
Dataset A has an observation with a given by-value and dataset B does not. In this case, INA will be true and INB will be false.
Dataset A and B both have an observation with a given by-value. In this case, INA and INB will be true.
[Edited to correct error] Datasets A and B have different numbers of observations with a given by-value. Let's say A has more than B. Even after B runs out of observations for that by-value, both INA and INB will remain true. If you want to know whether B actually contributed a new observation, you need something like the following code. As @Tom pointed out, you often want to know which dataset is contributing observations.
data want;
ina=0;
inb=0;
merge a (in=ina) b (in=inb);
by mybyvariable;
...
run;
The above code takes advantage of the fact that SAS retains the variables from the last observation contributed to the by-group by the dataset with the smaller number of observations. If you reinitialize ina and inb before the MERGE statement, they keep their reinitialized values (0) when no new observation arrives; but if there is a new observation, SAS will set them again. I hope that makes sense.
Dataset A does not have an observation with a given by-value and B does. In this case, INA will be false and INB will be true.
Note that this does not cover the situation of merging without a BY statement.