
Conversation

pthomadakis (Collaborator)

This PR touches/fixes the following issues:

  1. According to Remove SetOp operation #93, the ta.set operation has been removed. In its place, each of the tensor arithmetic operations (add, sub, mul, eltwise_mul, transpose) accepts an optional argument that specifies the lhs where the result of the operation should be stored. This lhs is the same value returned as the result of the operation. If it is not provided, the lowering passes automatically generate an empty tensor to hold the result.
    This can simplify analysis and rewrites significantly as all the related information can be retrieved from the op itself.
  2. Fusion + Parallel Execution causes issues #84 and Parallel GNN After Fusion has Problems in Lowering and Performance #91. We now support parallel execution + fusion, but without dimension reduction (which has been made optional). Performance still needs to be examined. The problem with parallel + fusion + dimension reduction is that dimension reduction makes it impossible to share a tensor among threads. For example, when a 2D matrix is reduced to a vector, all threads would try to access the same vector, which is unsafe; instead, each thread should acquire a private copy of the vector (see the sketch after this list). scf.forall appears to handle this automatically when dimension reduction is not applied.
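To illustrate why the reduced tensor cannot be shared, here is a minimal Python sketch (not COMET code; the array names and the row-sum reduction are made up for illustration): when a 2D matrix is reduced to a 1D vector, concurrent threads accumulate into the same elements and race, whereas per-thread private copies merged after the join are safe.

```python
import threading
import numpy as np

A = np.ones((4, 1000))

# Unsafe: every thread accumulates into the same shared vector.
shared = np.zeros(1000)

def worker_unsafe(row):
    shared += A[row]  # read-modify-write on shared elements -> data race

# Safe: each thread reduces into its own private copy,
# and the copies are combined once the threads have joined.
private = [np.zeros(1000) for _ in range(4)]

def worker_safe(tid, row):
    private[tid] += A[row]

threads = [threading.Thread(target=worker_safe, args=(t, t)) for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()
result = sum(private)  # merge the per-thread partial reductions
```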

This PR also introduces a new optimization:
In a compound expression, e.g., D = A + B + C, the old version would create a new allocation for the intermediate result, i.e., AB = A + B, D = AB + C. In contrast, for operations where this is safe, we now store the result of A + B directly in D (D = A + B) and then add C in place (D = D + C).
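The effect is analogous to NumPy's destination-passing `out=` parameter (this is only an analogy for illustration, not how COMET implements it):

```python
import numpy as np

A, B, C = (np.ones(1000) for _ in range(3))

# Old behavior: the intermediate A + B gets its own allocation.
AB = A + B          # fresh buffer for the intermediate
D_old = AB + C      # fresh buffer for the final result

# New behavior: the first sub-expression writes straight into D,
# and the remaining operand is accumulated in place.
D = np.empty(1000)
np.add(A, B, out=D)  # D = A + B, no intermediate allocation
np.add(D, C, out=D)  # D = D + C, accumulate in place
```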

…accept an optional operand `lhs` which can

be used to specify the output storage of the operation during bufferization, similar to linalg's destination-passing paradigm.
For example, an expression C = Ax + Bx would previously be computed as
A' = Ax, B' = Bx, C = A' + B', so each sub-expression would cause a new allocation (the A' and B' tensors).
In contrast, by optionally specifying the output of each operation, we can store the result of the first sub-expression directly into the output storage and accumulate the rest, i.e., C = Ax, B' = Bx, C = C + B'. This removes some of the intermediate allocations.
Note that we could eventually also avoid the allocation of B' by changing operations to accumulate into C (C[i,j] += op) instead of storing (C[i,j] = op), as we already do in matmul.
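A minimal Python sketch of the store-versus-accumulate distinction (illustration only; the sizes and names are made up): accumulating directly into C removes the need for the B' intermediate entirely.

```python
import numpy as np

n = 64
A = np.random.rand(n, n); B = np.random.rand(n, n); x = np.random.rand(n)

# Store version: the second product needs its own intermediate B'.
C = A @ x
Bp = B @ x       # intermediate allocation for B'
C = C + Bp

# Accumulate version: the second product is accumulated straight into C,
# so B' is never materialized (matmul-style C[i] += B[i,k] * x[k]).
C2 = A @ x
for i in range(n):
    for k in range(n):
        C2[i] += B[i, k] * x[k]

assert np.allclose(C, C2)
```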

2. Fixed fusion + parallel execution, but only when dimension reduction is not applied

3. Made dimension reduction optional
…e of the operands in a ComputeOp are also found in the LHS
…the output (lhs) instead of creating new allocations.

2. Fixed an issue with opt-dense-transpose not lowering properly when its result is returned from a function.
3. Added significantly more functionality to cometpy and made sure we avoid unnecessary copies.
1. Fixed small inconsistencies in output formats
2. Made the `flags` argument of `comet.compile` optional
3. Added tests/examples for the new Python constructs supported
4. Added "in_place" and "return" versions of several tests, which store results in an input argument or return them normally.
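A hedged sketch of what the two test flavors might look like, assuming cometpy's NumPy-like front end with the `@comet.compile` decorator (the function names and bodies here are hypothetical, not taken from the actual test suite):

```python
import comet

@comet.compile()          # `flags` is now optional
def add_in_place(A, B, C):
    # "in_place" flavor: the result is stored in the input argument C
    C[:] = A + B

@comet.compile()
def add_return(A, B):
    # "return" flavor: the result is returned normally
    return A + B
```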