-
Notifications
You must be signed in to change notification settings - Fork 552
kernel crashing, due to mismatch with disk cache for updated JIT code. #3182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👍🏾 It is awesome that you finally found out the reason for the JIT kernel cache issue. I am just concerned that it increases the startup cost slightly in the case of loading from disk. Did you happen to time by how much the loading time increased with this change ? @umar456 Is there a way we can actually fix the kernel name generation of JIT ? I was under the impression that JIT kernel names encoded the some form of kernel operations info. |
|
There is indeed a time increase, although it is much smaller than my
measurement precision.
A comparison for JIT:
1st run (empty disk cache)
File Old New Timing difference
jit.cpp:146 generate functionname generate functionname =
jit.cpp:147 hash functionname hash functionname =
jit.cpp:152 search mem with hash functionname search mem with hash
functionname =
jit.cpp:154 not found in mem cache not found in mem cache =
jit.cpp:155 generate full code generate full code =
jit.cpp:159 generate hash full code generate hash full code =
jit.cpp:171 call getKernel call getkernel =
kernel_cache.cpp:82 hash functionname hash functionname =
kernel_cache.cpp:93 search mem with hash functionname search mem with hash
functionname =
kernel_cache.cpp:95 not found in mem cache not found in mem cache =
kernel_cache.cpp:101 combine hash full code {jit_cl_src, jitKer_cl_src}
4Bytes
kernel_cache.cpp:104 combine hash options " -D USE_DOUBLE" 14Bytes
kernel_cache.cpp:112 search disk with hash functionname search disk with
hash full code =
kernel_cache.cpp:113 not found on disk not found on disk =
kernel_cache.cpp:118 compile compile =
kernel_cache.cpp:118 save on disk with hash functionname save on disk with
hash full code =
kernel_cache.cpp:129 save in mem with hash functionname save in mem with
hash functionname =
kernel_cache.cpp:136 return kernel return kernel =
jit.cpp:175 return kernel return kernel =
1st run (cached on disk)
Old New Timing difference
jit.cpp:146 generate functionname generate functionname =
jit.cpp:147 hash functionname hash functionname =
jit.cpp:152 search mem with hash functionname search mem with hash
functionname =
jit.cpp:154 not found in mem cache not found in mem cache =
jit.cpp:155 generate full code generate full code =
jit.cpp:159 generate hash full code generate hash full code =
jit.cpp:171 call getKernel call getkernel =
kernel_cache.cpp:82 hash functionname hash functionname =
kernel_cache.cpp:93 search mem with hash functionname search mem with hash
functionname =
kernel_cache.cpp:95 not found in mem cache not found in mem cache =
kernel_cache.cpp:101 combine hash full code {jit_cl_src, jitKer_cl_src}
4Bytes
kernel_cache.cpp:104 combine hash options " -D USE_DOUBLE" 14Bytes
kernel_cache.cpp:112 search disk with hash functionname search disk with
hash full code =
kernel_cache.cpp:112 found on disk found on disk =
kernel_cache.cpp:112 read from disk read from disk =
kernel_cache.cpp:129 save in mem with hash functionname save in mem with
hash functionname =
kernel_cache.cpp:136 return kernel return kernel =
jit.cpp:175 return kernel return kernel =
2nd run (independent if previous cached on disk)
Old New Timing difference
jit.cpp:146 generate functionname generate functionname =
jit.cpp:147 hash functionname hash functionname =
jit.cpp:152 search mem with hash functionname search mem with hash
functionname =
jit.cpp:154 found in mem found in mem =
jit.cpp:175 return kernel return kernel =
As you can see, ONLY AT 1st CALL the existing common::Source.hash (header,
generated code & options) are combined --> 18Bytes (18x XOR, 18x MUL).
For all following runs, no extra time is spent.
af::timeit() will even not contain these extra operations, because the func
is called in the sample_times before the real measurement starts.
By the way, static code performs the above operation EACH time it gets a
Kernel, since it calls getKernel each time.
BR, Willy
…On Wed, 13 Oct 2021 at 12:08, pradeep ***@***.***> wrote:
👍🏾 It is awesome that you finally found out the reason for the JIT
kernel cache issue. I am just concerned that it increases the startup cost
slightly in the case of loading from disk. Did you happen to time by how
much the loading time increased with this change ?
@umar456 <https://github.com/umar456> Is there a way we can actually fix
the kernel name generation of JIT ? I was under the impression that JIT
kernel names encoded the some form of kernel operations info.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#3182 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AQ2WGPHMHAMYSXNOQASWS7LUGVLCDANCNFSM5F4XD56A>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
|
Regarding the JIT kernel names. As result, fixes in the code, without dynamic interface changes, do result in the same JIT kernel name. |
|
@willyborn I couldn't see the times in the log you posted. I assume its pretty small based on your comments. I can't imagine this hashing would increase the times significantly. I am curious if we can't achieve the same thing by including the AF_REVISION when storing the names. This will allow us to refresh the cache for each commit and avoid issues like this. Can you think of a reason why this would fail? |
|
@umar
I first included the AF_REVISION, as proposed.
l still had to clear the cache for each test run, because this version
number is constant during development recompiles.
I assume the version nr changes after a commit in GitHub, which is not the
case during development. On my system they never changed.
With the static kernels, I did not have this problem so I searched further
to understand why, with this PR as result.
Regarding the timings: the 1st run gives a theoretical time almost equal to
the static kernel (any run), where the combining of the hashes is always
performed.
I did not give times, because as soon as you put it inside a loop, the
extra steps are no longer executed. The measured times, before or after
the change, are as result equal. (I believe we are talking here over
microseconds, if not picoseconds).
PS: Only the combining is performed because the hashing on the code text is
already performed before the call, also before the change. These hashes
are stored in common::Source struct and are present in the parameters of
the getKernel call. (also for JIT)
…On Thu, Oct 14, 2021, 20:03 Umar Arshad ***@***.***> wrote:
@willyborn <https://github.com/willyborn> I couldn't see the times in the
log you posted. I assume its pretty small based on your comments. I can't
imagine this hashing would increase the times significantly. I am curious
if we can't achieve the same thing by including the AF_REVISION when
storing the names. This will allow us to refresh the cache for each commit
and avoid issues like this. Can you think of a reason why this would fail?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3182 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AQ2WGPGR7Z7UFRMHW2QVZ3TUG4LNRANCNFSM5F4XD56A>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
umar456
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. This should alleviate a lot of confusion when developing kernels. I was able to test it on my local system and everything was running correctly and working as expected. Thank you for your contribution.
| vector<string> sources_str; | ||
| for (auto s : sources) { sources_str.push_back({s.ptr, s.length}); } | ||
| currModule = compileModule(to_string(moduleKey), sources_str, | ||
| for (const auto& s : sources) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
Disk caching continued loading expired JIT code, resulting in exceptions. Clearing the disk cached, hided the problem.
For interface changes, an exception was generated.
For all other changes, the old code (initial cached) was loaded and executed.
Description
JIT code uses the hashing of the function name (based on parameters for JIT) for fast retrieval of the corresponding kernel. The same hashing was also used to save the code on disk. When JIT code was updated, the same function name still applies, resulting in an mismatch with a previous saved/cached version.
Static code uses the hashing based on the full code including options.
Now for JIT code, the hashing for disk caching is based on the full code including options (same as static code). Once the kernel is saved into memory everything remains the same:
. JIT code remains using the hashing based on function name (no code generation)
. static code remains using the hashing based on full code + options (hashing is also static)
Result, JIT is now reacting exactly the same as the static code and the correct kernel is always loaded.
Changes to Users
Checklist