
Conversation

@kimishpatel
Contributor

@kimishpatel kimishpatel commented Jul 24, 2020

Stack from ghstack:

Summary:
This PR introduces a simple CPU caching allocator. It is specifically
intended for mobile use cases and for inference. Nothing in the
implementation prevents it from being used elsewhere, but its simplicity
may not be suitable for every workload.
The allocator tracks allocations by size and relies on deterministic,
repeatable behavior where allocations of the same sizes are made on
every inference. Thus, once a pointer is freed after the first
allocation, instead of returning the memory to the system, the allocator
caches it for subsequent reuse.
Cached memory is freed automatically at the end of the process, or it
can be freed explicitly.
At the moment this is enabled only in DefaultMobileCPUAllocator, and its use has to be explicitly opted into via WithCPUCachingAllocatorGuard.
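
To make this concrete, here is a minimal sketch of the caching-by-size idea described above (illustrative names only, not the actual c10 classes; the real code uses c10::alloc_cpu/c10::free_cpu and flat hash maps):

#include <cstddef>
#include <cstdlib>
#include <deque>
#include <unordered_map>

// Sketch only: cache freed blocks keyed by their exact request size.
class SizeCachingAllocator {
 public:
  void* allocate(std::size_t bytes) {
    auto it = free_lists_.find(bytes);
    if (it != free_lists_.end() && !it->second.empty()) {
      void* ptr = it->second.back();  // reuse a cached block of the same size
      it->second.pop_back();
      return ptr;
    }
    void* ptr = std::malloc(bytes);   // the real allocator uses c10::alloc_cpu
    sizes_[ptr] = bytes;              // remember the size for when it is freed
    return ptr;
  }

  void free(void* ptr) {
    // Instead of returning the memory to the system, park it for reuse.
    free_lists_[sizes_.at(ptr)].push_back(ptr);
  }

  void release_cached() {
    for (auto& kv : free_lists_)
      for (void* ptr : kv.second) std::free(ptr);
    free_lists_.clear();
  }

 private:
  std::unordered_map<std::size_t, std::deque<void*>> free_lists_;
  std::unordered_map<void*, std::size_t> sizes_;
};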

Test Plan:
android test: cpu_caching_allocator_test

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D22726976

@dr-ci

dr-ci bot commented Jul 24, 2020

💊 CI failures summary and remediations

As of commit dd32215 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed


ci.pytorch.org: 1 failed



@AshkanAliabadi
Contributor

High level question: Do you have an eviction policy or will the cache keep growing?

@kimishpatel
Contributor Author

High level question: Do you have an eviction policy or will the cache keep growing?

No eviction policy. At the moment it frees the entire cache if an allocation fails; otherwise it will keep growing. But it only keeps growing if you make many allocations of different sizes; usually I expect the workload to fall into a pattern and hence get reuse. I did think about adding a watermark, such that if you go above it you first free some cached allocations before allocating new ones, but that watermark would probably have to be user defined.

But please apply as much scrutiny to this as you can, because it is a crucial piece. I also wanted to keep it simple so that it does not itself add another layer of overhead.

Also note that you have to explicitly opt in to use the caching allocator.

@vadimkantorov
Contributor

That's pretty cool as a move towards more general arena allocators! (even for GPU it makes sense, since we could signal that a new iteration is starting and all previous tensor sizes don't matter much anymore)

@kimishpatel
Contributor Author

That's pretty cool as a move towards more general arena allocators! (even for GPU it makes sense, since we could signal that a new iteration is starting and all previous tensor sizes don't matter much anymore)

For GPUs, at least on the non-mobile side, there is the CUDA caching allocator. We definitely need to aim for that on the mobile side as well, now that we are adding mobile GPU support.

@vadimkantorov
Contributor

The CUDA caching allocator sometimes behaves sub-optimally (fragmentation and time spent reallocating) for variable batch sizes, so being able to give it some hints would be nice. But there was a long discussion about this with @ngimel, and so far the conclusion has been that the existing caching allocator is good enough.

@AshkanAliabadi
Contributor

AshkanAliabadi commented Jul 24, 2020

I agree that for the vast majority of use cases memory consumption will stabilize around a reasonable figure. But considering that this solution is incorporated into an open source library as popular as PyTorch, which can be used in many difficult-to-predict ways in models we are not aware of, I think we need to strike some balance between memory consumption and performance. This design is all in on performance, which is great, but it comes at the cost of unpredictable memory consumption for the few scenarios that do not behave according to our expectations. Usually, by the time a program reaches the point where an allocation fails because the system has run out of memory, the system has already come to a grinding halt, with really bad performance and hiccups. It is extreme cases like this that warrant a more robust approach IMHO, where having an upper limit on the size of the cache (in megabytes) is a safer design. Just my two cents.

@kimishpatel
Contributor Author

Usually if a program reaches a point where an allocation fails because the system has run out of memory, the system has come to a grinding halt already making for really bad performance and hiccups

This is absolutely right, and I 100% agree. We can certainly go with some user-specified limit, which would be the simplest option, but I would also like to see what else we can do that does not add too much runtime complexity to the allocator.
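
Purely as illustration, a hedged sketch of what such a user-specified limit could look like on the free path (max_cache_bytes_ and cached_bytes_ are hypothetical fields continuing the illustrative sketch near the top of this thread, not part of this PR):

// Hypothetical cap: if caching one more block would exceed the limit,
// give the memory back to the system instead of parking it.
void free(void* ptr) {
  const std::size_t bytes = sizes_.at(ptr);
  if (cached_bytes_ + bytes > max_cache_bytes_) {
    sizes_.erase(ptr);
    std::free(ptr);              // real code would call c10::free_cpu
    return;
  }
  cached_bytes_ += bytes;
  free_lists_[bytes].push_back(ptr);
}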

@AshkanAliabadi
Contributor

Alternatively, on 64-bit systems, what I have seen done is to allocate from the virtual address space without worrying much about deallocations, and let the OS commit those pages to physical memory upon use. This requires a 64-bit address space, though, so it does not apply to a good portion of our user base.

@ilia-cher
Contributor

Since this is a new allocator, please make sure to add reportMemoryUsageToProfiler to it (see examples in the code of other allocators). It is a very simple API that lets the PyTorch profiler know about memory allocations/deallocations.

Contributor

@dreiss dreiss left a comment

This looks great!

I think there's a memory leak here. If I allocate a tensor with the caching allocator active, then free it with the allocator inactive, its size will stay in the allocation_map_ forever, right? I'm not sure how to resolve this. Maybe in DefaultMobileCPUAllocator::free, even if the caching allocator is disabled, you could inform the caching allocator that the pointer is being deleted. However, my next suggestion will break that strategy. Maybe we could take advantage of the fact that DataPtr stores more than just the raw buffer pointer.

What would you think about this slight refactor: instead of having CPUCachingAllocator be global, require the user to create and own it. Probably rename it to "CPUAllocationCache" as well, to be more specific about what it does. Then WithCPUCachingAllocatorGuard would take an argument, a pointer (or reference) to the cache, and set that pointer in the thread local, so CachingAllocatorInfo is removed and just replaced with a pointer (always assumed to be enabled if it's non-null). This has a few benefits (a rough sketch of the resulting API is below the list):

  • During construction of the cache, there's an easy way for the user to specify configuration parameters, like maximum size. I also think a good addition that could be configured here is the minimum allocation size. For example, even on the Pixel 3, I bet jemalloc caches allocations under 4 kB, so we probably don't need to cache those ourselves.
  • When the user knows they are done with their use case, they can let the cache be destroyed, freeing all of the memory. It doesn't stick around for the lifetime of the process.
  • If the expectation is that you'll have one cache per use case, we can store some model-specific information (like a memonger-like allocation plan) in the cache object.

Downsides:

  • It's harder to fix the memory leak I described above, since you don't know which cache needs to be informed of the deletion. But I think it's possible.
  • Can't share cached allocations between use cases.
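
A rough sketch of how the proposed API could look from the caller's side (CPUAllocationCache and its constructor parameters are hypothetical names from this suggestion, not an existing c10 API):

// Hypothetical: the user owns the cache; the guard installs it in a
// thread-local pointer for the duration of the scope (nullptr == disabled).
c10::CPUAllocationCache cache(
    /*max_cache_bytes=*/64 << 20,
    /*min_cached_alloc_size=*/4 * 1024);
{
  c10::WithCPUCachingAllocatorGuard guard(&cache);
  auto output = module.forward(inputs);  // allocations inside go through cache
}  // guard restores the previous (possibly null) cache pointer
// Destroying `cache` here releases all of the cached memory.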

std::cout << module.forward(inputs) << std::endl;
}

c10::WithCPUCachingAllocatorGuard caching_allocator_guard;
Contributor

We should probably make a flag to disable this.

Contributor Author

@kimishpatel kimishpatel Aug 5, 2020

I see what you mean.

// it must not exist in available_map_, and vice versa.
// 2. All the pointers in available_map_ must be unique.
ska::flat_hash_map<void*, size_t> allocation_map_;
ska::flat_hash_map<size_t, std::deque<void*>> available_map_;
Contributor

IIRC, std::deque has pretty bad constant-factor performance (though it might be better in libc++). I would make this a vector (or even c10::SmallVector) and treat it as a stack. LIFO also has the potential to give better temporal locality in the CPU cache.
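
For example (sketch only; the code that eventually lands does roughly this with c10::SmallVector used in LIFO order):

// Per-size free list stored as a small vector and used as a stack.
ska::flat_hash_map<size_t, c10::SmallVector<void*, 16>> available_map_;

// Caching a freed block:
available_map_[alloc_size].push_back(ptr);

// Reusing the most recently freed block of that size:
void* ptr = available_map_[bytes].pop_back_val();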

Contributor Author

I see. I was avoiding vector to avoid dynamic resizing and copying on the allocation path. I did not know about deque's bad constant-factor performance, though.

Contributor Author

Another reason to use deque was that the only ops needed were push_back and pop_front or pop_back, so I thought it would be faster with deque. But I think SmallVector should do just as well.

Contributor Author

Currently SmallVector doesn't compile with flat_hash_map; @smessmer has a fix in #42694. Will rebase once that lands.

public:
void* allocate(const size_t bytes);
void free(void* ptr);
void free_cached();
Contributor

Semantics should be documented.

}

inline void* CPUCachingAllocator::allocate_and_cache(const size_t bytes) {
void* ptr;
Contributor

Possible future optimization: if there's some way to query the underlying allocator for the buffer size that will actually be returned when requesting an allocation of size bytes, we could use that here and in use_cached. That way, a cached allocation for 1023 bytes can (most likely) be used to satisfy a request for 1024 bytes.

Contributor Author

The CUDA caching allocator does this by using a binary tree for allocation, but that adds some latency to allocations. We don't know if this is significant; we should probably add some profiling to confirm whether it is. My intuition is that for static-graph-like networks the allocation patterns are the same from run to run. The optimization you mentioned could help us reduce memory footprint, though.
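
For reference, an ordered lookup of that kind might look roughly like this (hypothetical sketch of a best-fit-style reuse path; not what this PR implements):

#include <map>
#include <vector>

// Free lists keyed by block size in an ordered map, so we can reuse the
// smallest cached block that is large enough, within a slack bound.
std::map<size_t, std::vector<void*>> free_lists_;

void* try_reuse(size_t bytes) {
  auto it = free_lists_.lower_bound(bytes);      // smallest size >= bytes
  if (it == free_lists_.end() || it->second.empty()) {
    return nullptr;                              // no suitable cached block
  }
  if (it->first > bytes + bytes / 4) {
    return nullptr;                              // would waste >25% of the block
  }
  void* ptr = it->second.back();
  it->second.pop_back();
  return ptr;
}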

Comment on lines 38 to 39
// Is this assert necessary?
TORCH_INTERNAL_ASSERT(allocation_map_.find(ptr) == allocation_map_.end());
Contributor

No, I don't think so.

* No speculative allocation for any future allocations.
*/
private:
std::mutex mutex_;
Contributor

I think this needs to be used to guard all accesses to allocation_map_ and available_map_.

Contributor Author

OK, so I realized that there was one place, free_cached, where this was not the case, but it is only used by 1) the xctor and 2) allocate_and_cache. The latter is already protected by the lock; the former is the xctor, so that was probably my thought process in the first place. But free_cached was exposed before, so I have now moved it to a function.

Comment on lines 45 to 50
std::lock_guard<std::mutex> guard(mutex_);
if (available_map_.find(bytes) == available_map_.end() ||
available_map_[bytes].empty()) {
return allocate_and_cache(bytes);
}
return use_cached(bytes);
Contributor

If I'm reading this properly, the cache hit path does 4 lookups in available_map_. (find, [bytes].empty, [bytes].front, [bytes].pop_front). My suggestion would be to fully inline use_cached, then capture the result of find in a variable (I think you would type it as auto&). Then you could use that for the rest of the accesses.
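
Concretely, something along these lines (sketch of the suggested single-lookup fast path; not necessarily the exact final code):

void* CPUCachingAllocator::allocate(const size_t bytes) {
  std::lock_guard<std::mutex> guard(mutex_);
  auto it = available_map_.find(bytes);          // one hash lookup
  if (it == available_map_.end() || it->second.empty()) {
    return allocate_and_cache(bytes);
  }
  void* ptr = it->second.back();                 // reuse via the same iterator
  it->second.pop_back();
  return ptr;
}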

Contributor Author

Yeah, makes sense. I need to change my coding habits.

return;
}
const size_t alloc_size = allocation_map_[ptr];
allocation_map_.erase(ptr);
Contributor

Why do this? Since we're keeping the buffer around, maintaining the size in the map should have a trivial memory cost. Dropping this line would allow us to skip not just this mutation, but also the insertion in the cache hit case.

Contributor Author

This was to maintain the invariant that a pointer allocated by CPUCachingAllocator can be either in allocation_map_ or in available_map_, but not in both places. What you are suggesting is fine. With that, the invariants would be: 1) a pointer allocated by CPUCachingAllocator can be in allocation_map_ and available_map_ simultaneously, and 2) an allocated pointer cannot be in available_map_ if it is not in allocation_map_. Thus allocation_map_ will always keep growing, and if we remove a cached allocation from available_map_ it has to be removed from allocation_map_ as well. I think this is fine, given that it gains some performance. I will rename allocation_map_ appropriately to reflect that.

Comment on lines 70 to 74
if (available_map_.find(alloc_size) == available_map_.end()) {
available_map_[alloc_size] = std::deque<void*>({ptr});
} else {
available_map_[alloc_size].push_back(ptr);
}
Contributor

I think you can do available_map_[alloc_size].push_back(ptr), which is just one lookup instead of two.

void CPUCachingAllocator::free_cached() {
for (const auto& it : available_map_) {
for (void* ptr : it.second) {
free(ptr);
Contributor

This should call c10::free_cpu.

free(ptr);
}
}
available_map_.clear();
Contributor Author

I thought I protected this one too with mutex. Thanks for the catch.

Contributor Author

Turns out this was intentional.

@kimishpatel
Contributor Author

I think there's a memory leak here. If I allocate a tensor with the caching allocator active, then free it with the allocator inactive, its size will stay in the allocation_map_ forever, right? I'm not sure how to resolve this. Maybe in DefaultMobileCPUAllocator::free, even if the caching allocator is disabled, you could inform the caching allocator that the pointer is being deleted. However, my next suggestion will break that strategy. Maybe we could take advantage of the fact that DataPtr stores more than just the raw buffer pointer.

Great catch; I missed this. We definitely have to handle this, perhaps by having free always pass through this allocator for MobileCPUAllocator.

What would you think about this slight refactor: instead of having CPUCachingAllocator be global, require the user to create an own it. Probably rename it to "CPUAllocationCache" as well, to be more specific about what it does. Then WithCPUCachingAllocatorGuard would take an argument: a pointer (or reference) to the cache, and set that pointer in the thread local, so CachingAllocatorInfo is removed and just replaced with a pointer (always assumed to be enabled if it's non-null). This has a few benefits;

  • During construction of the cache, there's an easy way for the user to specify configuration parameters, like maximum size. I also think a good addition that could be configured here is the minimum allocation size. For example, even on the Pixel 3, I bet jemalloc caches allocations under 4 kB, so we probably don't need to cache those ourselves.

I am wondering if we can do the same via a user-created AllocationProfile, which the user owns, instead of a user-owned allocator itself. The Guard would then use the AllocationProfile to guide allocations.
This way we still use the same allocator, but we can allocate according to some profile. A problem I see: if, say, an allocation smaller than a specified size is made and then freed outside of the guard, we have lost the context available from the guard. I guess we can 1) record all the allocations, even the ones that are not candidates for caching, and 2) make all frees go through this allocator, which checks the map to see if the pointer being freed was ever allocated by us and then removes it from allocation_map_. We probably have to do this anyway to avoid the issue you mentioned earlier.
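
Roughly, the free-side routing would look like this (sketch; owns and record_free are hypothetical helpers over the global allocation_map_, not existing functions):

// Sketch: DefaultMobileCPUAllocator's deleter always consults the caching
// allocator's record, regardless of whether a guard is currently active.
void mobile_free(void* ptr) {
  if (CPUCachingAllocator::owns(ptr)) {     // hypothetical allocation_map_ lookup
    CPUCachingAllocator::record_free(ptr);  // hypothetical: cache or drop the entry
    return;
  }
  c10::free_cpu(ptr);                       // not ours: free it normally
}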

  • When the user knows they are done with their use case, the allow the cache to be destroyed, freeing all of the memory. It doesn't stick around for the lifetime of the process.

We don't quite get this with the approach I mentioned, but I don't know what the use case would be. Do we create models temporarily and destroy them on the fly after one or a few inferences?

  • If the expectation is that you'll have one cache per use case, we can store some model-specific information (like a memonger-like allocation plan) in the cache object.

This can possibly be achieved on an entirely parallel path by a different type of allocator, or by adding an interface to the existing allocator that allocates from a different pool. I don't have much clarity here, though.

Downsides:

  • It's harder to fix the memory leak I described above, since you don't know which cache needs to be informed of the deletion. But I think it's possible.
  • Can't share cached allocations between use cases.

So I am open to refactoring the way you mentioned. I was just trying to think of another way to achieve some of the things you mentioned.

@kimishpatel
Contributor Author

since this is a new allocator, please make sure to add reportMemoryUsageToProfiler to it (examples in the code of other allocators), this is a very simple API to allow pytorch profiler to know about memory allocations/deallocations

@ilia-cher, thanks for the pointer. I can try to add an option to do the profiling, mainly because I don't want the hot path to be slow on account of profiling. I can add that in a separate follow-up PR, perhaps right after this one.

Contributor

@dreiss dreiss left a comment

A few minor things. The only big thing I'm worried about is the inconsistent handling of failure from c10::alloc_cpu. Your TORCH_CHECK in the catch block implies that it can return null, but we don't check for that in the non-catch case.

Comment on lines 207 to 208
std::unique_ptr<c10::CPUCachingAllocator> caching_allocator =
std::make_unique<c10::CPUCachingAllocator>();
Contributor

Maybe

c10::CPUCachingAllocator caching_allocator;

instead of putting it on the heap?

Contributor Author

I thought it was cleaner to provide a pointer to caching_allocator from the heap than from the stack.

Comment on lines 209 to 210
c10::WithCPUCachingAllocatorGuard caching_allocator_guard(
caching_allocator.get(), FLAGS_use_caching_allocator);
Contributor

How about

c10::optional<c10::WithCPUCachingAllocatorGuard> maybe_guard;
if (FLAGS_use_caching_allocator) {
  maybe_guard.emplace(caching_allocator);
}

? This way you could remove a concept from the API.

Contributor Author

Sounds fine.

// again.
// Try again.
ptr = c10::alloc_cpu(bytes);
TORCH_CHECK(ptr, "Memory allocation failed for ", bytes, " bytes.");
Contributor

Wait, does alloc_cpu throw or return null on failure? It seems like you are looking for both, but only checking null on this path.

Contributor Author

My bad. The reason I had both it and the try..catch was that alloc_cpu throws; then I forgot about that and thought, OMG, I am not checking for valid memory. Thanks for the catch.
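
So the retry path looks roughly like this (sketch, relying on c10::alloc_cpu signaling failure by throwing, as discussed above):

inline void* CPUCachingAllocator::allocate_and_cache(const size_t bytes) {
  void* ptr;
  try {
    ptr = c10::alloc_cpu(bytes);
  } catch (const std::exception&) {
    // First attempt failed: return all cached blocks to the system and
    // try once more; if this also fails, the exception propagates.
    free_cached();
    ptr = c10::alloc_cpu(bytes);
  }
  allocation_map_[ptr] = bytes;
  return ptr;
}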

Comment on lines 101 to 103
bool CachingAllocatorInfo::enabled() {
return is_enabled_;
}
Contributor

What would you think of eliminating CachingAllocatorInfo and enabled entirely, and just having a thread-local "current allocation cache" pointer?

Contributor Author

Yeah, I thought about it. I just felt it was cleaner this way to check whether the CachingAllocator is enabled. But I don't have a strong preference; I will make the change. I assume you want to do it that way to simplify the code.

Comment on lines +41 to +42
ska::flat_hash_map<size_t, c10::SmallVector<void*, 16>> available_map_;
static ska::flat_hash_map<void*, size_t> allocation_map_;
Contributor

I think the mix of static and non-static state makes the code a bit hard to follow. This is probably fine for now since we're talking about breaking this up into two different classes to allow users to override behavior. But if that plan falls through for whatever reason, I think we should come back and refactor this.

Contributor Author

OK, so I thought about this one a bit. It seemed to me that this class is the natural place for it, since it is coupled with what the class does and what it implies. Basically, allocation_map_ wants to keep global information about everything that happens through any instance of this class. My main concern with moving it outside, where I might have to encapsulate it in yet another class, is that it would decouple two things that have a strong association with each other. So if that continues to be the case, I would still think this is the place for it.
I agree that it is a little hard to follow, but to me that is OK. What do you think?

Contributor

I think it should be two separate classes, but let's leave it as-is for now. If/when we allow the user to override the caching logic, we can re-evaluate.

}

inline void* CPUCachingAllocator::use_cached(const size_t bytes) {
void* ptr = available_map_[bytes].pop_back_val();
Contributor

This is still doing an extra map lookup. If you manually inline this code into allocate, you can just return it->second.pop_back_val(), right?

Contributor Author

Oh I see. Yeah, makes sense.

Comment on lines +46 to +49
// NB: since we are not really freeing the memory
// the cases such as quantization code freeing original weights
// on mobile, will not quite work, as we likely will hold
// onto that memory.
Contributor

This is less true now since caching is unlikely to be enabled during model loading, so we won't find those buffers in the allocation_map.

Contributor Author

Right now we free the original quantized weights the first time we run the op, because QNNPACK requires the bias to be packed in with the weights. So this still holds.

c10::free_cpu(ptr);
return;
}
const size_t alloc_size = allocation_map_[ptr];
Contributor

Still doing an extra lookup here. Could save the iterator from find above and dereference that here.

Contributor Author

Thanks for catching this. I thought I fixed it, but clearly not.

@kimishpatel kimishpatel requested a review from dreiss August 13, 2020 14:27
@facebook-github-bot
Contributor

This pull request has been merged in 2a08566.

namespace c10 {

namespace {
thread_local CPUCachingAllocator* caching_allocator_ptr{nullptr};
Contributor

Hey, sorry I'm late to the game. Do you know if thread_local here compiles to a __tls_get_addr call on mobile? In recent profiling we've been doing for core overheads, we've noticed that __tls_get_addr is quite expensive, so it's a bit surprising to see it here. (thread_local doesn't always compile to this, only in certain situations, so I'm wondering whether that applies to mobile too.)

// returned the memory to OS via free_cached.
// 1.1. Therefore even when the said memory is "freed" via this
// allocator (and thus cached), it will continue to stay
// in allocaiton_map_. Furthermore it will also exist in
Contributor

allocation not allocaiton


namespace c10 {

class C10_API CPUCachingAllocator {
Contributor

I'd suggest not naming this a "Caching" allocator, as it makes it sound like it behaves similarly to CUDACachingAllocator (when actually there is very little algorithmic similarity at all). This would be less of a big deal if it were in a mobile-specific directory, but since the file is in c10/core, people are much more likely to get the wrong idea. I'm not sure what a good replacement name is; I'm reminded of the "BlackBox" nomenclature from the Caffe2 days. SizeArenaAllocator is another pretty descriptive alternative.

free_cached();
// Furthermore to consider: If we ever come here running out of memory
// perhaps it is best to disable caching, since this is likely to happen
// again.
Contributor

Not going to log in this situation at all?

} // namespace

std::mutex CPUCachingAllocator::mutex_;
ska::flat_hash_map<void*, size_t> CPUCachingAllocator::allocation_map_;
Contributor

Pray that there are never many small allocations of varying sizes? :o) (It would be nice if the allocator's description clearly stated which situations it is suitable for and which situations it is likely to do poorly in.)


Labels: Merged, oncall: jit (Add this issue/PR to JIT oncall triage queue)
