Conversation
Summary of Changes

Hello @danielhanchen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the model loading process by introducing a mechanism to prefer FlexAttention when it is supported and explicitly enabled via an environment variable. This aims to improve performance by leveraging more efficient attention implementations where possible, ensuring that both LLM and vision models can dynamically adapt their attention backend while maintaining consistent configuration.
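The helper at the center of the change is discussed in the review comments below. As a minimal sketch of how such a helper could behave, assuming the environment variable name from the quoted diff and a `_supports_flex_attn` class flag as the support check (the body here is illustrative, not the PR's actual implementation):

```python
import os

def prefer_flex_attn_if_supported(model_class, config):
    # Illustrative sketch only: opt in via an environment variable, then check
    # whether the resolved model class advertises FlexAttention support before
    # switching the attention implementation on the config.
    if os.environ.get("UNSLOTH_PREFER_FLEX_ATTENTION", "0") != "1":
        return None
    try:
        # torch.nn.attention.flex_attention only exists on recent PyTorch builds,
        # so the import doubles as an availability check.
        from torch.nn.attention.flex_attention import flex_attention  # noqa: F401
        if model_class is None or not getattr(model_class, "_supports_flex_attn", False):
            return None
        setattr(config, "attn_implementation", "flex_attention")
        return "flex_attention"
    except (ImportError, AttributeError):
        return None
```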
Code Review
This pull request introduces a helper function to prefer 'flex attention' when it's available and enabled via an environment variable. The changes are well-structured and improve performance options. My review includes a couple of suggestions to improve code clarity and robustness by refining exception handling and removing a redundant check. Overall, a good addition.
```python
        setattr(config, "attn_implementation", "flex_attention")
        return "flex_attention"
```
Using a broad `except Exception:` can hide unexpected errors and make debugging more difficult. It's better to catch more specific exceptions that you expect to handle, such as `ImportError` or `AttributeError`. This will make the code more robust.
Suggested change:

```python
        setattr(config, "attn_implementation", "flex_attention")
        return "flex_attention"
    except (ImportError, AttributeError):
        return None
```
unsloth/models/vision.py (Outdated)
```python
flex_attn_impl = None
if os.environ.get("UNSLOTH_PREFER_FLEX_ATTENTION", "0") == "1":
    if auto_config is None:
        auto_config = AutoConfig.from_pretrained(
            model_name,
            token = token,
            trust_remote_code = trust_remote_code,
        )
    try:
        model_class = auto_model._model_mapping[auto_config.__class__]
    except Exception:
        model_class = None
    flex_attn_impl = prefer_flex_attn_if_supported(model_class, auto_config)
```
This code block can be simplified.

- The check for the `UNSLOTH_PREFER_FLEX_ATTENTION` environment variable is redundant, as `prefer_flex_attn_if_supported` already handles this internally. Removing this outer `if` will simplify the code and avoid duplication.
- The `except Exception:` is too broad. It's better to catch a more specific `KeyError`, which might be raised when accessing `auto_model._model_mapping`.

Refactoring this will make the code cleaner and more maintainable.
```python
if auto_config is None:
    auto_config = AutoConfig.from_pretrained(
        model_name,
        token = token,
        trust_remote_code = trust_remote_code,
    )
try:
    model_class = auto_model._model_mapping[auto_config.__class__]
except KeyError:
    model_class = None
flex_attn_impl = prefer_flex_attn_if_supported(model_class, auto_config)
```
/gemini review
Code Review
This pull request introduces support for flex_attention as a preferred backend, which is a good performance enhancement. The changes are applied consistently across both language and vision models. My feedback focuses on improving code robustness by narrowing down exception handling and increasing efficiency by avoiding redundant configuration loading. Overall, the changes look good and align with the goal of optimizing attention mechanisms.
```python
    except Exception:
        return None
```
Using a broad `except Exception:` can hide unexpected errors and make debugging more difficult. It's better to catch more specific exceptions. In this case, since you are mainly guarding against a missing transformers utility, `except ImportError:` would be more appropriate. If other exceptions are possible, they should be caught explicitly or logged.
Suggested change:

```diff
-    except Exception:
-        return None
+    except ImportError:
+        return None
```
```python
if auto_config is None:
    auto_config = AutoConfig.from_pretrained(
        model_name,
        token = token,
        trust_remote_code = trust_remote_code,
    )
try:
    model_class = auto_model._model_mapping[auto_config.__class__]
except Exception:
    model_class = None
flex_attn_impl = prefer_flex_attn_if_supported(model_class, auto_config)
```
The current logic loads the model configuration twice using `AutoConfig.from_pretrained` (here and later around line 671). This is inefficient as it can involve unnecessary network and disk I/O.

Consider refactoring to load the configuration only once. You can load `auto_config`, modify it with the chosen attention implementation, and then reuse it as `model_config` later, avoiding the second `from_pretrained` call.

For example:

- Load `auto_config` once at the beginning.
- Determine `flex_attn_impl` and update `kwargs['attn_implementation']`.
- Instead of reloading, directly set the `attn_implementation` on the `auto_config` object if it wasn't already set by `prefer_flex_attn_if_supported`.
- Use the modified `auto_config` as `model_config`.

This would make the model loading process more efficient.
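As a rough illustration of that shape, here is a sketch with a hypothetical helper name; the variable names follow the diff above, and mirroring the chosen backend into `kwargs` is an assumption rather than the PR's actual code:

```python
from transformers import AutoConfig

def resolve_config_and_attention(auto_model, model_name, token, trust_remote_code, kwargs):
    # Hypothetical helper: load the configuration a single time, pick the
    # attention backend, and return the config for reuse as model_config.
    auto_config = AutoConfig.from_pretrained(
        model_name,
        token = token,
        trust_remote_code = trust_remote_code,
    )
    try:
        model_class = auto_model._model_mapping[auto_config.__class__]
    except (KeyError, AttributeError):
        model_class = None

    flex_attn_impl = prefer_flex_attn_if_supported(model_class, auto_config)
    if flex_attn_impl is not None:
        kwargs["attn_implementation"] = flex_attn_impl
    elif kwargs.get("attn_implementation") is not None:
        # Mirror an explicitly requested backend onto the loaded config instead
        # of calling AutoConfig.from_pretrained a second time.
        setattr(auto_config, "attn_implementation", kwargs["attn_implementation"])
    return auto_config  # reuse directly as model_config later in the loader
```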
```python
except Exception:
    model_class = None
```
Using a broad `except Exception:` is not ideal as it can mask bugs. Here, you are accessing a dictionary, `_model_mapping`. A `KeyError` is a more specific exception to catch if the key is not found. If `auto_model` might not have `_model_mapping`, an `AttributeError` could also be caught.
Suggested change:

```diff
-except Exception:
-    model_class = None
+except (KeyError, AttributeError):
+    model_class = None
```
Force-pushed from a6f0859 to 16a72ca
Summary

- Preferring FlexAttention is opt-in via the `UNSLOTH_ENABLE_FLEX_ATTENTION` environment variable.

Testing
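As a hedged usage sketch of the opt-in flag named in the summary, assuming it is read when the model is loaded (the model name and loader arguments below are illustrative, not part of the PR):

```python
import os

# Set the flag before loading so unsloth can pick the attention backend during
# model loading (assumption based on the summary above).
os.environ["UNSLOTH_ENABLE_FLEX_ATTENTION"] = "1"

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct",  # illustrative model name
    max_seq_length = 2048,
    load_in_4bit = True,
)
```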