kernel crashing, due to mismatch with disk cache for updated JIT code. #3182

willyborn · 2021-10-13T09:34:27Z

Disk caching continued loading expired JIT code, resulting in exceptions. Clearing the disk cached, hided the problem.
For interface changes, an exception was generated.
For all other changes, the old code (initial cached) was loaded and executed.

Description

JIT code uses the hashing of the function name (based on parameters for JIT) for fast retrieval of the corresponding kernel. The same hashing was also used to save the code on disk. When JIT code was updated, the same function name still applies, resulting in an mismatch with a previous saved/cached version.
Static code uses the hashing based on the full code including options.

Now for JIT code, the hashing for disk caching is based on the full code including options (same as static code). Once the kernel is saved into memory everything remains the same:
. JIT code remains using the hashing based on function name (no code generation)
. static code remains using the hashing based on full code + options (hashing is also static)
Result, JIT is now reacting exactly the same as the static code and the correct kernel is always loaded.

This is a bug fix.
No new functions,
This PR can be back-ported to previous versions.

Changes to Users

No changes to the end-user. Clearing of the disk cache is no longer needed to eliminate the exception.

Checklist

Rebased on latest master
Code compiles
Tests pass

9prady9 · 2021-10-13T10:08:21Z

👍🏾 It is awesome that you finally found out the reason for the JIT kernel cache issue. I am just concerned that it increases the startup cost slightly in the case of loading from disk. Did you happen to time by how much the loading time increased with this change ?

@umar456 Is there a way we can actually fix the kernel name generation of JIT ? I was under the impression that JIT kernel names encoded the some form of kernel operations info.

willyborn · 2021-10-13T13:31:58Z

There is indeed a time increase, although it is much smaller than my measurement precision. A comparison for JIT: 1st run (empty disk cache) File Old New Timing difference jit.cpp:146 generate functionname generate functionname = jit.cpp:147 hash functionname hash functionname = jit.cpp:152 search mem with hash functionname search mem with hash functionname = jit.cpp:154 not found in mem cache not found in mem cache = jit.cpp:155 generate full code generate full code = jit.cpp:159 generate hash full code generate hash full code = jit.cpp:171 call getKernel call getkernel = kernel_cache.cpp:82 hash functionname hash functionname = kernel_cache.cpp:93 search mem with hash functionname search mem with hash functionname = kernel_cache.cpp:95 not found in mem cache not found in mem cache = kernel_cache.cpp:101 combine hash full code {jit_cl_src, jitKer_cl_src} 4Bytes kernel_cache.cpp:104 combine hash options " -D USE_DOUBLE" 14Bytes kernel_cache.cpp:112 search disk with hash functionname search disk with hash full code = kernel_cache.cpp:113 not found on disk not found on disk = kernel_cache.cpp:118 compile compile = kernel_cache.cpp:118 save on disk with hash functionname save on disk with hash full code = kernel_cache.cpp:129 save in mem with hash functionname save in mem with hash functionname = kernel_cache.cpp:136 return kernel return kernel = jit.cpp:175 return kernel return kernel = 1st run (cached on disk) Old New Timing difference jit.cpp:146 generate functionname generate functionname = jit.cpp:147 hash functionname hash functionname = jit.cpp:152 search mem with hash functionname search mem with hash functionname = jit.cpp:154 not found in mem cache not found in mem cache = jit.cpp:155 generate full code generate full code = jit.cpp:159 generate hash full code generate hash full code = jit.cpp:171 call getKernel call getkernel = kernel_cache.cpp:82 hash functionname hash functionname = kernel_cache.cpp:93 search mem with hash functionname search mem with hash functionname = kernel_cache.cpp:95 not found in mem cache not found in mem cache = kernel_cache.cpp:101 combine hash full code {jit_cl_src, jitKer_cl_src} 4Bytes kernel_cache.cpp:104 combine hash options " -D USE_DOUBLE" 14Bytes kernel_cache.cpp:112 search disk with hash functionname search disk with hash full code = kernel_cache.cpp:112 found on disk found on disk = kernel_cache.cpp:112 read from disk read from disk = kernel_cache.cpp:129 save in mem with hash functionname save in mem with hash functionname = kernel_cache.cpp:136 return kernel return kernel = jit.cpp:175 return kernel return kernel = 2nd run (independent if previous cached on disk) Old New Timing difference jit.cpp:146 generate functionname generate functionname = jit.cpp:147 hash functionname hash functionname = jit.cpp:152 search mem with hash functionname search mem with hash functionname = jit.cpp:154 found in mem found in mem = jit.cpp:175 return kernel return kernel = As you can see, ONLY AT 1st CALL the existing common::Source.hash (header, generated code & options) are combined --> 18Bytes (18x XOR, 18x MUL). For all following runs, no extra time is spent. af::timeit() will even not contain these extra operations, because the func is called in the sample_times before the real measurement starts. By the way, static code performs the above operation EACH time it gets a Kernel, since it calls getKernel each time. BR, Willy

…

On Wed, 13 Oct 2021 at 12:08, pradeep ***@***.***> wrote: 👍🏾 It is awesome that you finally found out the reason for the JIT kernel cache issue. I am just concerned that it increases the startup cost slightly in the case of loading from disk. Did you happen to time by how much the loading time increased with this change ? @umar456 <https://github.com/umar456> Is there a way we can actually fix the kernel name generation of JIT ? I was under the impression that JIT kernel names encoded the some form of kernel operations info. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3182 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQ2WGPHMHAMYSXNOQASWS7LUGVLCDANCNFSM5F4XD56A> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

willyborn · 2021-10-13T13:52:04Z

Regarding the JIT kernel names.
It does contain a description of the dynamic calling parameters, incl types etc.
It does not include, the real code or a description of the always present parameters.

As result, fixes in the code, without dynamic interface changes, do result in the same JIT kernel name.
For my opinion, I would keep this, because each time generating the code of those kernels is really time consuming and are static inside one run. (not between runs/disk caches, since code could have changed).

umar456 · 2021-10-14T18:02:53Z

@willyborn I couldn't see the times in the log you posted. I assume its pretty small based on your comments. I can't imagine this hashing would increase the times significantly. I am curious if we can't achieve the same thing by including the AF_REVISION when storing the names. This will allow us to refresh the cache for each commit and avoid issues like this. Can you think of a reason why this would fail?

willyborn · 2021-10-14T20:12:54Z

@umar I first included the AF_REVISION, as proposed. l still had to clear the cache for each test run, because this version number is constant during development recompiles. I assume the version nr changes after a commit in GitHub, which is not the case during development. On my system they never changed. With the static kernels, I did not have this problem so I searched further to understand why, with this PR as result. Regarding the timings: the 1st run gives a theoretical time almost equal to the static kernel (any run), where the combining of the hashes is always performed. I did not give times, because as soon as you put it inside a loop, the extra steps are no longer executed. The measured times, before or after the change, are as result equal. (I believe we are talking here over microseconds, if not picoseconds). PS: Only the combining is performed because the hashing on the code text is already performed before the call, also before the change. These hashes are stored in common::Source struct and are present in the parameters of the getKernel call. (also for JIT)

…

On Thu, Oct 14, 2021, 20:03 Umar Arshad ***@***.***> wrote: @willyborn <https://github.com/willyborn> I couldn't see the times in the log you posted. I assume its pretty small based on your comments. I can't imagine this hashing would increase the times significantly. I am curious if we can't achieve the same thing by including the AF_REVISION when storing the names. This will allow us to refresh the cache for each commit and avoid issues like this. Can you think of a reason why this would fail? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3182 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQ2WGPGR7Z7UFRMHW2QVZ3TUG4LNRANCNFSM5F4XD56A> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

umar456

Thanks for the explanation. This should alleviate a lot of confusion when developing kernels. I was able to test it on my local system and everything was running correctly and working as expected. Thank you for your contribution.

umar456 · 2021-10-15T15:43:16Z

src/backend/common/kernel_cache.cpp

            vector<string> sources_str;
-            for (auto s : sources) { sources_str.push_back({s.ptr, s.length}); }
-            currModule = compileModule(to_string(moduleKey), sources_str,
+            for (const auto& s : sources) {


Disk hash is now based on the full code + options, also for JIT code.

53f8769

umar456 approved these changes Oct 15, 2021

View reviewed changes

umar456 merged commit 1d03d07 into arrayfire:master Oct 15, 2021

willyborn deleted the JIT-DiskCaching branch September 29, 2022 13:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

kernel crashing, due to mismatch with disk cache for updated JIT code. #3182

kernel crashing, due to mismatch with disk cache for updated JIT code. #3182

Uh oh!

willyborn commented Oct 13, 2021 •

edited

Loading

Uh oh!

9prady9 commented Oct 13, 2021

Uh oh!

willyborn commented Oct 13, 2021 via email

Uh oh!

willyborn commented Oct 13, 2021

Uh oh!

umar456 commented Oct 14, 2021

Uh oh!

willyborn commented Oct 14, 2021 via email

Uh oh!

umar456 left a comment

Uh oh!

umar456 Oct 15, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kernel crashing, due to mismatch with disk cache for updated JIT code. #3182

kernel crashing, due to mismatch with disk cache for updated JIT code. #3182

Uh oh!

Conversation

willyborn commented Oct 13, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes to Users

Checklist

Uh oh!

9prady9 commented Oct 13, 2021

Uh oh!

willyborn commented Oct 13, 2021 via email

Uh oh!

willyborn commented Oct 13, 2021

Uh oh!

umar456 commented Oct 14, 2021

Uh oh!

willyborn commented Oct 14, 2021 via email

Uh oh!

umar456 left a comment

Choose a reason for hiding this comment

Uh oh!

umar456 Oct 15, 2021

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

willyborn commented Oct 13, 2021 •

edited

Loading