
Conversation

@zasdfgbnm
Collaborator

@zasdfgbnm zasdfgbnm commented Oct 20, 2019

Stack from ghstack:

Type casting is used in copy, and will also be used in TensorIterator in the next stacked diff. I move it to c10 so that it can serve as a common utility for different things.

I also add two dynamic casting functions:

  • fetch_and_cast
  • cast_and_store

fetch_and_cast fetches a value whose dynamic type is specified by a ScalarType from a void pointer and casts it to a static type.

cast_and_store casts a statically typed value to the dynamic type specified by a ScalarType and stores it into a void pointer.
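As a usage sketch (the include path, the copy_loop name, and the strided byte-pointer arguments below are illustrative assumptions, not part of this diff), a plain copy loop built on these two functions would look roughly like:

#include <c10/util/TypeCast.h>   // assumed location of the casting utilities after the move
#include <c10/core/ScalarType.h>
#include <cstdint>

// Copy n elements from src (runtime type src_dtype) to dst (runtime type dst_dtype),
// using float as the static intermediate type. Strides are in bytes.
void copy_loop(char* dst, c10::ScalarType dst_dtype, int64_t dst_stride,
               const char* src, c10::ScalarType src_dtype, int64_t src_stride,
               int64_t n) {
  for (int64_t i = 0; i < n; i++) {
    // read one element of dynamic type src_dtype and cast it to float
    float v = c10::fetch_and_cast<float>(src_dtype, src + i * src_stride);
    // cast the float to dynamic type dst_dtype and write it out
    c10::cast_and_store<float>(dst_dtype, dst + i * dst_stride, v);
  }
}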

@zasdfgbnm zasdfgbnm requested review from colesbury and ezyang October 20, 2019 07:38
@zasdfgbnm zasdfgbnm added the module: internals (Related to internal abstractions in c10 and ATen) and module: type promotion (Related to semantics of type promotion) labels Oct 20, 2019
@zasdfgbnm zasdfgbnm mentioned this pull request Oct 21, 2019
@ezyang
Contributor

ezyang commented Oct 21, 2019

Fetch and cast seems like a bad idea. It does a dynamic test, which means if you ever do it in a loop, you are going to do a billion tests, no?

I'm OK with the movement, that can go in whenever.

@zasdfgbnm
Collaborator Author

zasdfgbnm commented Oct 21, 2019

@ezyang Yes and no. It will execute a billion tests, but it is not as bad as it might look:

  • on CPU, this is a branch that always has the same outcome, so hopefully the branch predictor can do its job pretty well
  • on GPU, these branches do not diverge, so the same warp still executes the same line of code
  • most kernels, like add, are bandwidth bound; adding a few clock cycles to check an integer does not hurt performance much, because the ALUs would be waiting on load instructions anyway

For example, for the benchmark in #28344, add_ speeds up ~3x, which supports the argument about being bandwidth bound (previously: 1 read + 1 write to cast the input, 2 reads + 1 write for the add, and 1 read + 1 write to cast back, i.e. 7 memory accesses in total; now: 1 read, cast, add, cast, 1 write, i.e. 2 memory accesses, roughly 3x less).
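To make the branch-prediction point concrete, a simplified sketch of what the dynamic fetch boils down to (illustrative only, not the actual implementation in this diff) is a single switch over the runtime ScalarType; inside a copy or add loop the ScalarType never changes, so the switch takes the same case on every iteration:

#include <c10/core/ScalarType.h>
#include <cstdint>
#include <stdexcept>

// One dynamic check per element, but the outcome is loop-invariant: the CPU
// branch predictor sees the same target every time, and all GPU threads in a
// warp take the same case, so there is no divergence.
template <typename dest_t>
dest_t fetch_and_cast_sketch(c10::ScalarType src_type, const void* ptr) {
  switch (src_type) {
    case c10::ScalarType::Float:
      return static_cast<dest_t>(*static_cast<const float*>(ptr));
    case c10::ScalarType::Double:
      return static_cast<dest_t>(*static_cast<const double*>(ptr));
    case c10::ScalarType::Long:
      return static_cast<dest_t>(*static_cast<const int64_t*>(ptr));
    // ... remaining dtypes elided ...
    default:
      throw std::runtime_error("unsupported dtype");
  }
}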

The choice of using dynamic casting is mostly driven by compilation time and binary size. For example, to implement add_ with type promotion without dynamic casting, if we support N dtypes, we would need to do something like

AT_DISPATCH_ALL_TYPES(output.scalar_type(), "add_", [&] {
  using out_t = scalar_t;
  AT_DISPATCH_ALL_TYPES(input1.scalar_type(), "add_", [&] {
    using arg0_t = scalar_t;
    AT_DISPATCH_ALL_TYPES(input2.scalar_type(), "add_", [&] {
      using arg1_t = scalar_t;
      // instantiate the a + b kernel for this (out_t, arg0_t, arg1_t) combination
      [](arg0_t a, arg1_t b) -> out_t { return a + b; };
    });
  });
});

which would instantiate the a + b kernel for all N * N * N combinations of supported types; the compilation time and binary size would become horrible.
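With dynamic casting we instead dispatch once on a single computation dtype and convert at the load/store boundary, so only N kernels are instantiated instead of N * N * N. A rough sketch (common_dtype, the *_dtype/*_ptr variables, and the byte strides below are made-up names for illustration, not the literal TensorIterator code):

AT_DISPATCH_ALL_TYPES(common_dtype, "add_", [&] {
  // scalar_t is the single computation type chosen by type promotion
  for (int64_t i = 0; i < n; i++) {
    // inputs are read through char* byte pointers and converted to scalar_t on the fly
    scalar_t a = c10::fetch_and_cast<scalar_t>(input1_dtype, input1_ptr + i * stride1);
    scalar_t b = c10::fetch_and_cast<scalar_t>(input2_dtype, input2_ptr + i * stride2);
    // the result is converted back to the output's dynamic dtype before being stored
    c10::cast_and_store<scalar_t>(output_dtype, output_ptr + i * stride_out, a + b);
  }
});

Only one a + b instantiation per computation dtype is generated, and the dynamic checks sit only at the memory boundary, which keeps compilation time and binary size under control.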

@zasdfgbnm
Collaborator Author

@ezyang See also #28352 (comment). The performance of the copy kernel with static vs. dynamic casting is also comparable.

@ezyang
Contributor

ezyang commented Oct 22, 2019

OK, that's a lovely explanation. Let's put it in the code? Then I'll approve.

@zasdfgbnm zasdfgbnm merged commit b3009ac into gh/zasdfgbnm/8/base Oct 22, 2019
@zasdfgbnm
Collaborator Author

It seems that I messed things up...
