
Conversation

@kimishpatel
Contributor

@kimishpatel kimishpatel commented Jul 24, 2020

Stack from ghstack:

Summary:
This PR introduces a simple CPU caching allocator. It is specifically
intended for mobile use cases and for inference. Nothing in the
implementation prevents it from being used elsewhere, but its simplicity
may not be suitable for every workload.
The allocator tracks allocations by size and relies on deterministic,
repeatable behavior where allocations of the same sizes are made on
every inference. Thus, once a pointer is freed after the first
allocation, instead of returning the memory to the system, the allocator
caches it for subsequent reuse.
Cached memory is freed automatically at the end of the process, or it
can be freed explicitly.
At the moment this is enabled only in DefaultMobileCPUAllocator, and its use has to be explicitly opted into via WithCPUCachingAllocatorGuard.
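
To make this concrete, here is a minimal sketch of the caching-by-size idea described above (illustrative names only, not the actual c10 classes; the real code uses c10::alloc_cpu/c10::free_cpu and flat hash maps):

#include <cstddef>
#include <cstdlib>
#include <deque>
#include <unordered_map>

// Sketch only: cache freed blocks keyed by their exact request size.
class SizeCachingAllocator {
 public:
  void* allocate(std::size_t bytes) {
    auto it = free_lists_.find(bytes);
    if (it != free_lists_.end() && !it->second.empty()) {
      void* ptr = it->second.back();  // reuse a cached block of the same size
      it->second.pop_back();
      return ptr;
    }
    void* ptr = std::malloc(bytes);   // the real allocator uses c10::alloc_cpu
    sizes_[ptr] = bytes;              // remember the size for when it is freed
    return ptr;
  }

  void free(void* ptr) {
    // Instead of returning the memory to the system, park it for reuse.
    free_lists_[sizes_.at(ptr)].push_back(ptr);
  }

  void release_cached() {
    for (auto& kv : free_lists_)
      for (void* ptr : kv.second) std::free(ptr);
    free_lists_.clear();
  }

 private:
  std::unordered_map<std::size_t, std::deque<void*>> free_lists_;
  std::unordered_map<void*, std::size_t> sizes_;
};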

Test Plan:
android test: cpu_caching_allocator_test

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D22726976

@dr-ci

dr-ci bot commented Jul 24, 2020

💊 CI failures summary and remediations

As of commit dd32215 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed


ci.pytorch.org: 1 failed



@AshkanAliabadi
Contributor

High level question: Do you have an eviction policy or will the cache keep growing?

@kimishpatel
Contributor Author

High level question: Do you have an eviction policy or will the cache keep growing?

No eviction policy. At the moment it frees the entire cache if an allocation fails; otherwise it will keep growing. But it only keeps growing if you make many allocations of different sizes; usually I expect the workload to fall into a pattern and hence get reuse. I did think about adding a watermark, such that if you go above it you first free some cached allocations before allocating new ones, but that watermark would probably have to be user defined.

But please apply as much scrutiny to this as you can, because it is a crucial piece. I also wanted to keep it simple so that it does not itself add another layer of overhead.

Also note that you have to explicitly opt in to use the caching allocator.

@vadimkantorov
Contributor

That's pretty cool as a move towards more general arena allocators! (even for GPU it makes sense, since we could signal that a new iteration is starting and all previous tensor sizes don't matter much anymore)

@kimishpatel
Contributor Author

That's pretty cool as a move towards more general arena allocators! (even for GPU it makes sense, since we could signal that a new iteration is starting and all previous tensor sizes don't matter much anymore)

For GPUs, at least on the non-mobile side, there is the CUDA caching allocator. We definitely need to aim for that on the mobile side as well, now that we are adding mobile GPU support.

@vadimkantorov
Contributor

The CUDA caching allocator sometimes behaves sub-optimally (fragmentation and time spent reallocating) for variable batch sizes, so being able to give it some hints would be nice. But there was a long discussion about this with @ngimel, and so far the conclusion has been that the existing caching allocator is good enough.

@AshkanAliabadi
Contributor

AshkanAliabadi commented Jul 24, 2020

I agree that for the vast majority of use cases memory consumption will stabilize around a reasonable figure. But considering that this solution is incorporated into an open source library as popular as PyTorch, which can be used in many difficult-to-predict ways in models we are not aware of, I think we need to strike some balance between memory consumption and performance. This design is all in on performance, which is great, but it comes at the cost of unpredictable memory consumption for the few scenarios that do not behave according to our expectations. Usually, by the time a program reaches the point where an allocation fails because the system has run out of memory, the system has already come to a grinding halt, with really bad performance and hiccups. It is extreme cases like this that warrant a more robust approach IMHO, where having an upper limit on the size of the cache (in megabytes) is a safer design. Just my two cents.

@kimishpatel
Contributor Author

Usually if a program reaches a point where an allocation fails because the system has run out of memory, the system has come to a grinding halt already making for really bad performance and hiccups

This is absolutely right, and I 100% agree. We can certainly go with some user-specified limit, which would be the simplest option, but I would also like to see what else we can do that does not add too much runtime complexity to the allocator.
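
Purely as illustration, a hedged sketch of what such a user-specified limit could look like on the free path (max_cache_bytes_ and cached_bytes_ are hypothetical fields continuing the illustrative sketch near the top of this thread, not part of this PR):

// Hypothetical cap: if caching one more block would exceed the limit,
// give the memory back to the system instead of parking it.
void free(void* ptr) {
  const std::size_t bytes = sizes_.at(ptr);
  if (cached_bytes_ + bytes > max_cache_bytes_) {
    sizes_.erase(ptr);
    std::free(ptr);              // real code would call c10::free_cpu
    return;
  }
  cached_bytes_ += bytes;
  free_lists_[bytes].push_back(ptr);
}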

@AshkanAliabadi
Contributor

Alternatively, on 64-bit systems, what I have seen done is to allocate from the virtual address space without worrying much about deallocations, and let the OS commit those pages to physical memory upon use. This requires a 64-bit address space, though, so it does not apply to a good portion of our user base.

@ilia-cher
Contributor

Since this is a new allocator, please make sure to add reportMemoryUsageToProfiler to it (see examples in the code of other allocators). It is a very simple API that lets the PyTorch profiler know about memory allocations/deallocations.

Contributor

@dreiss dreiss left a comment

This looks great!

I think there's a memory leak here. If I allocate a tensor with the caching allocator active, then free it with the allocator inactive, its size will stay in the allocation_map_ forever, right? I'm not sure how to resolve this. Maybe in DefaultMobileCPUAllocator::free, even if the caching allocator is disabled, you could inform the caching allocator that the pointer is being deleted. However, my next suggestion will break that strategy. Maybe we could take advantage of the fact that DataPtr stores more than just the raw buffer pointer.

What would you think about this slight refactor: instead of having CPUCachingAllocator be global, require the user to create and own it. Probably rename it to "CPUAllocationCache" as well, to be more specific about what it does. Then WithCPUCachingAllocatorGuard would take an argument, a pointer (or reference) to the cache, and set that pointer in the thread local, so CachingAllocatorInfo is removed and just replaced with a pointer (always assumed to be enabled if it's non-null). This has a few benefits (a rough sketch of the resulting API is below the list):

  • During construction of the cache, there's an easy way for the user to specify configuration parameters, like maximum size. I also think a good addition that could be configured here is the minimum allocation size. For example, even on the Pixel 3, I bet jemalloc caches allocations under 4 kB, so we probably don't need to cache those ourselves.
  • When the user knows they are done with their use case, they can let the cache be destroyed, freeing all of the memory. It doesn't stick around for the lifetime of the process.
  • If the expectation is that you'll have one cache per use case, we can store some model-specific information (like a memonger-like allocation plan) in the cache object.

Downsides:

  • It's harder to fix the memory leak I described above, since you don't know which cache needs to be informed of the deletion. But I think it's possible.
  • Can't share cached allocations between use cases.
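
A rough sketch of how the proposed API could look from the caller's side (CPUAllocationCache and its constructor parameters are hypothetical names from this suggestion, not an existing c10 API):

// Hypothetical: the user owns the cache; the guard installs it in a
// thread-local pointer for the duration of the scope (nullptr == disabled).
c10::CPUAllocationCache cache(
    /*max_cache_bytes=*/64 << 20,
    /*min_cached_alloc_size=*/4 * 1024);
{
  c10::WithCPUCachingAllocatorGuard guard(&cache);
  auto output = module.forward(inputs);  // allocations inside go through cache
}  // guard restores the previous (possibly null) cache pointer
// Destroying `cache` here releases all of the cached memory.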

std::cout << module.forward(inputs) << std::endl;
}

c10::WithCPUCachingAllocatorGuard caching_allocator_guard;
Contributor

We should probably make a flag to disable this.

Contributor Author

@kimishpatel kimishpatel Aug 5, 2020

I see what you mean.

// it must not exist in available_map_, and vice versa.
// 2. All the pointers in available_map_ must be unique.
ska::flat_hash_map<void*, size_t> allocation_map_;
ska::flat_hash_map<size_t, std::deque<void*>> available_map_;
Contributor

IIRC, std::deque has pretty bad constant-factor performance (though it might be better in libc++). I would make this a vector (or even c10::SmallVector) and treat it as a stack. LIFO also has the potential to give better temporal locality in the CPU cache.
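
For example (sketch only; the code that eventually lands does roughly this with c10::SmallVector used in LIFO order):

// Per-size free list stored as a small vector and used as a stack.
ska::flat_hash_map<size_t, c10::SmallVector<void*, 16>> available_map_;

// Caching a freed block:
available_map_[alloc_size].push_back(ptr);

// Reusing the most recently freed block of that size:
void* ptr = available_map_[bytes].pop_back_val();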

Contributor Author

I see. I was avoiding vector to avoid dynamic resizing and copying on the allocation path. I did not know about deque's bad constant-factor performance, though.

Contributor Author

Another reason to use deque was that the only ops needed were push_back and pop_front or pop_back, so I thought it would be faster with deque. But I think SmallVector should do just as well.

Contributor Author

Currently SmallVector doesn't compile with flat_hash_map; @smessmer has a fix in #42694. Will rebase once that lands.

public:
void* allocate(const size_t bytes);
void free(void* ptr);
void free_cached();
Contributor

Semantics should be documented.

}

inline void* CPUCachingAllocator::allocate_and_cache(const size_t bytes) {
void* ptr;
Contributor

Possible future optimization: if there's some way to query the underlying allocator for the buffer size that will actually be returned when requesting an allocation of size bytes, we could use that here and in use_cached. That way, a cached allocation for 1023 bytes can (most likely) be used to satisfy a request for 1024 bytes.

Contributor Author

The CUDA caching allocator does this by using a binary tree for allocation, but that adds some latency to allocations. We don't know if this is significant; we should probably add some profiling to confirm whether it is. My intuition is that for static-graph-like networks the allocation patterns are the same from run to run. The optimization you mentioned could help us reduce memory footprint, though.
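
For reference, an ordered lookup of that kind might look roughly like this (hypothetical sketch of a best-fit-style reuse path; not what this PR implements):

#include <map>
#include <vector>

// Free lists keyed by block size in an ordered map, so we can reuse the
// smallest cached block that is large enough, within a slack bound.
std::map<size_t, std::vector<void*>> free_lists_;

void* try_reuse(size_t bytes) {
  auto it = free_lists_.lower_bound(bytes);      // smallest size >= bytes
  if (it == free_lists_.end() || it->second.empty()) {
    return nullptr;                              // no suitable cached block
  }
  if (it->first > bytes + bytes / 4) {
    return nullptr;                              // would waste >25% of the block
  }
  void* ptr = it->second.back();
  it->second.pop_back();
  return ptr;
}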

Comment on lines 38 to 39
// Is this assert necessary?
TORCH_INTERNAL_ASSERT(allocation_map_.find(ptr) == allocation_map_.end());
Contributor

No, I don't think so.

* No speculative allocation for any future allocations.
*/
private:
std::mutex mutex_;
Contributor

I think this needs to be used to guard all accesses to allocation_map_ and available_map_.

Contributor Author

OK, so I realized that there was one place, free_cached, where this was not the case, but it is only used by 1) the xctor and 2) allocate_and_cache. The latter is already protected by the lock; the former is the xctor, so that was probably my thought process in the first place. But free_cached was exposed before, so I have now moved it to a function.

Comment on lines 45 to 50
std::lock_guard<std::mutex> guard(mutex_);
if (available_map_.find(bytes) == available_map_.end() ||
available_map_[bytes].empty()) {
return allocate_and_cache(bytes);
}
return use_cached(bytes);
Contributor

If I'm reading this properly, the cache hit path does 4 lookups in available_map_. (find, [bytes].empty, [bytes].front, [bytes].pop_front). My suggestion would be to fully inline use_cached, then capture the result of find in a variable (I think you would type it as auto&). Then you could use that for the rest of the accesses.
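
Concretely, something along these lines (sketch of the suggested single-lookup fast path; not necessarily the exact final code):

void* CPUCachingAllocator::allocate(const size_t bytes) {
  std::lock_guard<std::mutex> guard(mutex_);
  auto it = available_map_.find(bytes);          // one hash lookup
  if (it == available_map_.end() || it->second.empty()) {
    return allocate_and_cache(bytes);
  }
  void* ptr = it->second.back();                 // reuse via the same iterator
  it->second.pop_back();
  return ptr;
}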

Contributor Author

Yeah, makes sense. I need to change my coding habits.

return;
}
const size_t alloc_size = allocation_map_[ptr];
allocation_map_.erase(ptr);
Contributor

Why do this? Since we're keeping the buffer around, maintaining the size in the map should have a trivial memory cost. Dropping this line would allow us to skip not just this mutation, but also the insertion in the cache hit case.

Contributor Author

This was to maintain the invariant that a pointer allocated by CPUCachingAllocator can be either in allocation_map_ or in available_map_, but not in both places. What you are suggesting is fine. With that, the invariants would be: 1) a pointer allocated by CPUCachingAllocator can be in allocation_map_ and available_map_ simultaneously, and 2) an allocated pointer cannot be in available_map_ if it is not in allocation_map_. Thus allocation_map_ will always keep growing, and if we remove a cached allocation from available_map_ it has to be removed from allocation_map_ as well. I think this is fine, given that it gains some performance. I will rename allocation_map_ appropriately to reflect that.

Comment on lines 70 to 74
if (available_map_.find(alloc_size) == available_map_.end()) {
available_map_[alloc_size] = std::deque<void*>({ptr});
} else {
available_map_[alloc_size].push_back(ptr);
}
Contributor

I think you can do available_map_[alloc_size].push_back(ptr), which is just one lookup instead of two.

void CPUCachingAllocator::free_cached() {
for (const auto& it : available_map_) {
for (void* ptr : it.second) {
free(ptr);
Contributor

This should call c10::free_cpu.

free(ptr);
}
}
available_map_.clear();
Contributor Author

I thought I protected this one too with mutex. Thanks for the catch.

Contributor Author

Turns out this was intentional.

@kimishpatel
Contributor Author

I think there's a memory leak here. If I allocate a tensor with the caching allocator active, then free it with the allocator inactive, its size will stay in the allocation_map_ forever, right? I'm not sure how to resolve this. Maybe in DefaultMobileCPUAllocator::free, even if the caching allocator is disabled, you could inform the caching allocator that the pointer is being deleted. However, my next suggestion will break that strategy. Maybe we could take advantage of the fact that DataPtr stores more than just the raw buffer pointer.

Great catch; I missed this. We definitely have to handle this, perhaps by having free always pass through this allocator for MobileCPUAllocator.

What would you think about this slight refactor: instead of having CPUCachingAllocator be global, require the user to create an own it. Probably rename it to "CPUAllocationCache" as well, to be more specific about what it does. Then WithCPUCachingAllocatorGuard would take an argument: a pointer (or reference) to the cache, and set that pointer in the thread local, so CachingAllocatorInfo is removed and just replaced with a pointer (always assumed to be enabled if it's non-null). This has a few benefits;

  • During construction of the cache, there's an easy way for the user to specify configuration parameters, like maximum size. I also think a good addition that could be configured here is the minimum allocation size. For example, even on the Pixel 3, I bet jemalloc caches allocations under 4 kB, so we probably don't need to cache those ourselves.

I am wondering if we can do the same via a user-created AllocationProfile, which the user owns, instead of a user-owned allocator itself. The Guard would then use the AllocationProfile to guide allocations.
This way we still use the same allocator, but we can allocate according to some profile. A problem I see: if, say, an allocation smaller than a specified size is made and then freed outside of the guard, we have lost the context available from the guard. I guess we can 1) record all the allocations, even the ones that are not candidates for caching, and 2) make all frees go through this allocator, which checks the map to see if the pointer being freed was ever allocated by us and then removes it from allocation_map_. We probably have to do this anyway to avoid the issue you mentioned earlier.
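
Roughly, the free-side routing would look like this (sketch; owns and record_free are hypothetical helpers over the global allocation_map_, not existing functions):

// Sketch: DefaultMobileCPUAllocator's deleter always consults the caching
// allocator's record, regardless of whether a guard is currently active.
void mobile_free(void* ptr) {
  if (CPUCachingAllocator::owns(ptr)) {     // hypothetical allocation_map_ lookup
    CPUCachingAllocator::record_free(ptr);  // hypothetical: cache or drop the entry
    return;
  }
  c10::free_cpu(ptr);                       // not ours: free it normally
}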

  • When the user knows they are done with their use case, the allow the cache to be destroyed, freeing all of the memory. It doesn't stick around for the lifetime of the process.

We don't quite get this with the approach I mentioned, but I don't know what the use case would be. Do we create models temporarily and destroy them on the fly after one or a few inferences?

  • If the expectation is that you'll have one cache per use case, we can store some model-specific information (like a memonger-like allocation plan) in the cache object.

This can possibly be achieved on an entirely parallel path by a different type of allocator, or by adding an interface to the existing allocator that allocates from a different pool. I don't have much clarity here, though.

Downsides:

  • It's harder to fix the memory leak I described above, since you don't know which cache needs to be informed of the deletion. But I think it's possible.
  • Can't share cached allocations between use cases.

So I am open to refactoring the way you mentioned. I was just trying to think of another way to achieve some of the things you mentioned.

@kimishpatel
Contributor Author

since this is a new allocator, please make sure to add reportMemoryUsageToProfiler to it (examples in the code of other allocators), this is a very simple API to allow pytorch profiler to know about memory allocations/deallocations

@ilia-cher, thanks for the pointer. I can try to add an option to do the profiling, mainly because I don't want the hot path to be slow on account of profiling. I can add that in a separate follow-up PR, perhaps right after this one.

Contributor

@dreiss dreiss left a comment

A few minor things. The only big thing I'm worried about is the inconsistent handling of failure from c10::alloc_cpu. Your TORCH_CHECK in the catch block implies that it can return null, but we don't check for that in the non-catch case.

Comment on lines 207 to 208
std::unique_ptr<c10::CPUCachingAllocator> caching_allocator =
std::make_unique<c10::CPUCachingAllocator>();
Contributor

Maybe

c10::CPUCachingAllocator caching_allocator;

instead of putting it on the heap?

Contributor Author

I thought it was cleaner to provide a pointer to caching_allocator from the heap than from the stack.

Comment on lines 209 to 210
c10::WithCPUCachingAllocatorGuard caching_allocator_guard(
caching_allocator.get(), FLAGS_use_caching_allocator);
Contributor

How about

c10::optional<c10::WithCPUCachingAllocatorGuard> maybe_guard;
if (FLAGS_use_caching_allocator) {
  maybe_guard.emplace(caching_allocator);
}

? This way you could remove a concept from the API.

Contributor Author

Sounds fine.

// again.
// Try again.
ptr = c10::alloc_cpu(bytes);
TORCH_CHECK(ptr, "Memory allocation failed for ", bytes, " bytes.");
Contributor

Wait, does alloc_cpu throw or return null on failure? It seems like you are looking for both, but only checking null on this path.

Contributor Author

My bad. The reason I had both it and the try..catch was that alloc_cpu throws; then I forgot about that and thought, OMG, I am not checking for valid memory. Thanks for the catch.
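
So the retry path looks roughly like this (sketch, relying on c10::alloc_cpu signaling failure by throwing, as discussed above):

inline void* CPUCachingAllocator::allocate_and_cache(const size_t bytes) {
  void* ptr;
  try {
    ptr = c10::alloc_cpu(bytes);
  } catch (const std::exception&) {
    // First attempt failed: return all cached blocks to the system and
    // try once more; if this also fails, the exception propagates.
    free_cached();
    ptr = c10::alloc_cpu(bytes);
  }
  allocation_map_[ptr] = bytes;
  return ptr;
}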

Comment on lines 101 to 103
bool CachingAllocatorInfo::enabled() {
return is_enabled_;
}
Contributor

What would you think of eliminating CachingAllocatorInfo and enabled entirely, and just having a thread-local "current allocation cache" pointer?

Contributor Author

Yeah, I thought about it. I just felt it was cleaner this way to check whether the CachingAllocator is enabled. But I don't have a strong preference; I will make the change. I assume you want to do it that way to simplify the code.

Comment on lines +41 to +42
ska::flat_hash_map<size_t, c10::SmallVector<void*, 16>> available_map_;
static ska::flat_hash_map<void*, size_t> allocation_map_;
Contributor

I think the mix of static and non-static state makes the code a bit hard to follow. This is probably fine for now since we're talking about breaking this up into two different classes to allow users to override behavior. But if that plan falls through for whatever reason, I think we should come back and refactor this.

Contributor Author

OK, so I thought about this one a bit. It seemed to me that this class is the natural place for it, since it is coupled with what the class does and what it implies. Basically, allocation_map_ wants to keep global information about everything that happens through any instance of this class. My main concern with moving it outside, where I might have to encapsulate it in yet another class, is that it would decouple two things that have a strong association with each other. So if that continues to be the case, I would still think this is the place for it.
I agree that it is a little hard to follow, but to me that is OK. What do you think?

Contributor

I think it should be two separate classes, but let's leave it as-is for now. If/when we allow the user to override the caching logic, we can re-evaluate.

}

inline void* CPUCachingAllocator::use_cached(const size_t bytes) {
void* ptr = available_map_[bytes].pop_back_val();
Contributor

This is still doing an extra map lookup. If you manually inline this code into allocate, you can just return it->second.pop_back_val(), right?

Contributor Author

Oh I see. Yeah, makes sense.

Comment on lines +46 to +49
// NB: since we are not really freeing the memory
// the cases such as quantization code freeing original weights
// on mobile, will not quite work, as we likely will hold
// onto that memory.
Contributor

This is less true now since caching is unlikely to be enabled during model loading, so we won't find those buffers in the allocation_map.

Contributor Author

Right now we free the original quantized weights the first time we run the op, because QNNPACK requires the bias to be packed in with the weights. So this still holds.

c10::free_cpu(ptr);
return;
}
const size_t alloc_size = allocation_map_[ptr];
Contributor

Still doing an extra lookup here. Could save the iterator from find above and dereference that here.

Contributor Author

Thanks for catching this. I thought I fixed it, but clearly not.

@kimishpatel kimishpatel requested a review from dreiss August 13, 2020 14:27
@facebook-github-bot
Contributor

This pull request has been merged in 2a08566.

namespace c10 {

namespace {
thread_local CPUCachingAllocator* caching_allocator_ptr{nullptr};
Contributor

Hey, sorry I'm late to the game. Do you know if thread_local here compiles to a __tls_get_addr call on mobile? In recent profiling we've been doing for core overheads, we've noticed that __tls_get_addr is quite expensive, so it's a bit surprising to see it here. (thread_local doesn't always compile to this, only in certain situations, so I'm wondering whether that applies to mobile too.)

// returned the memory to OS via free_cached.
// 1.1. Therefore even when the said memory is "freed" via this
// allocator (and thus cached), it will continue to stay
// in allocaiton_map_. Furthermore it will also exist in
Contributor

allocation not allocaiton


namespace c10 {

class C10_API CPUCachingAllocator {
Contributor

I'd suggest not naming this a "Caching" allocator, as it makes it sound like it behaves similarly to CUDACachingAllocator (when actually there is very little algorithmic similarity at all). This would be less of a big deal if it were in a mobile-specific directory, but since the file is in c10/core, people are much more likely to get the wrong idea. I'm not sure what a good replacement name is; I'm reminded of the "BlackBox" nomenclature from the Caffe2 days. SizeArenaAllocator is another pretty descriptive alternative.

free_cached();
// Furthermore to consider: If we ever come here running out of memory
// perhaps it is best to disable caching, since this is likely to happen
// again.
Contributor

Not going to log in this situation at all?

} // namespace

std::mutex CPUCachingAllocator::mutex_;
ska::flat_hash_map<void*, size_t> CPUCachingAllocator::allocation_map_;
Contributor

Pray that there are never many small allocations of varying sizes? :o) (It would be nice if the allocator's description clearly stated which situations it is suitable for and which situations it is likely to do poorly in.)


Labels: Merged, oncall: jit (Add this issue/PR to JIT oncall triage queue)
