-
Notifications
You must be signed in to change notification settings - Fork 2.2k
sprintf: add support for explicit arg positions #11857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
sprintf: add support for explicit arg positions #11857
Conversation
|
Looking at
So I think we have these options:
The first option is what is implemented now. There would be no user-facing changes. Translators could use argument reordering, but only for messages which don't appear in fish scripts. This is problematic, because supporting different syntax for different messages is highly unintuitive. If messages appear both in Rust and a fish script, they will show up in the Rust section at the beginning of the PO file, so there are no messages which can safely use the new syntax. Under these circumstances, I don't think the first option makes sense. Options 2 and 3 are similar. For option 2, the Option 2 has the downside of potentially breaking existing user code. Its advantage over option 3 is that we would continue to have a single If we are ok with breaking the implicit |
|
Option 1 and 3 are OK, I think. For option 1, the repeated behavior still exists when an argument doesn't specify position explicitly. And I hope that no code that needs i18n uses this behavior:) |
That might be the best approach:
This doesn't break existing behavior, while adding support for the new behavior. |
|
That would be great. I hope this feature would complete as soon as possible :) |
The `have_foo: bool` + `foo: i64` combination is more idiomatically represented as `foo: Option<i64>`. This change is applied for `field_width` and `precision`. In addition, the sketchy explicit cast from `i64` to `c_int`, and the subsequent implicit cast from `c_int` to `i32` are avoided by using `i64` consistently.
The abbreviation is ambiguous, which makes the code unnecessarily hard to read. (possible misleading expansions: direct, direction, director, ...)
As with `msgmerge`, this introduces unwanted empty comment lines above `#, c-format` lines. We don't need this strict formatting, so we get rid of the flag and the associated empty comment lines.
This allows explicit specification of the argument order in sprintf invocations, e.g. when using these arguments `"%2$s %1$s" "a" "b"`, the expected result would be `"b a"`. These specifiers are also available for the width and precision, and should have the same semantics as other sprintf implementations. The main motivation for this change is giving translators the freedom to change the argument order, which can allow for more idiomatic translations. This can also be made available to fish's `printf` command, but that has some additional input validation in `src/builtins/printf.rs:print_formatted` which would have to be updated to make it compatible with the expanded syntax.
This allows specifying explicit argument positions in the format string of the `printf` command, e.g. `'%2$*1$.*2$d' 10 4` Note that this feature is incompatible with applying arguments to the format string until all arguments are used up. Also note that using explicit argument positions is all-or-nothing. If they appear in the format string, referring to arguments implicitly is not possible. Documentation updates and tests are missing for now.
becc0ef to
3d0b89c
Compare
|
I added code to make the I'm in favor of getting rid of length modifier support. We have no need for them, they are not mentioned in our docs, they make the code somewhat more complicated, add unnecessary characters to format string, and result in a few duplicate entries in our PO files which only differ in length modifiers. I have a branch with according changes ready at https://github.com/danielrainer/fish-shell/tree/printf_remove_length_modifier_support, but these changes will probably result in a lot of conflicts with the two ongoing translation update PRs, so I think these changes should probably wait until these PRs are finished, then I can apply the necessary changes on top of them. This PR is now also based on top of #11863, to avoid useless empty comments in PO files. |
|
|
||
| # Get rid of duplicates and sort. | ||
| msguniq --no-wrap --strict --sort-output $rust_extraction_file | ||
| msguniq --no-wrap --sort-output $rust_extraction_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at
printfit seems that the its current semantics are
incompatible with explicit argument positions
yeah builtin printf's implicit repetition is used a lot; breaking
that would be unnecessarily surprising. printf also doesn't take any
options so the design space is pretty narrow.
builtin printf.
It's probably a big problem in practice to leave the reordering
feature out of fish script for now, as long as we can add it later.
What I could glean from the implementation so far look good.
It's less permissive than ZSH behavior;
ZSH allows combining repetition with ordering parameters:
zsh% printf '%2$s\n' a b c d
b
d
zsh% printf '%2$s\n' a b c
zsh:printf:1: 2: argument specifier out of range
b
I think we would be able to allow this too in future without breaking change,
but I don't know if we want to.
If we want to make bigger steps, we can give it a different name,
maybe abandon printf format-string syntax altogether (like we're also
trying to move Rust sources to format! syntax):
status localize-messages my-msgid --key1=value ...
which would remove the need to touch the printf builtin, but it seems worse overall.
As of today, user-written scripts cannot bring their own translations anyway;
If we want to allow that in future, we could do that either way
status add-translation my-msgid --keyword-argument=key1 -- zh_CN=... de_DE=...
though I don't expect that this will be a good idea:
shell scripts can already use the gettext command for simple things, and if they become more complicated, users are better off writing them in Python etc,
so I'm not sure how much actual use there is for exposing the reordering feature in our custom gettext engine.
(In general, I'm not sure how many good reasons there are to write nontrival fish scripts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW the only reason why we are still stuck on legacy1 printf syntax is that there were concerns around localization.
I don't think those concerns are relevant anymore (not sure if they ever were);
if we can translate "%s", then we can also translate "{}" the same way.
so we want to
- convert all non-lozalized sprintf etc. calls to Rust syntax (simple but tedious)
- do the same for localized ones, and make sure to update existing translations accordingly
- this implies adding the parameter-ordering syntax for Rust
- have the option to switch to fluent later, which uses different format string syntax
I think 2 and 3 suggest
that we may not want to add this feature to builtin printf:
this is because we want to change the parameter syntax of translatable
strings (most of which are in Rust sources2) once or twice (first to Rust syntax, and possibly to fluent),
and crucially, neither of the destination syntaxes works for builtin printf.
Unless we are okay with a mix of printf-style and non-printf-style msgids in po/,
I'd rather create a dedicated builtin like status localize-messages sketched above that calls wgettext_fmt!,
that has exactly the same powers available to Rust code.
Footnotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we can translate "%s", then we can also translate "{}" the same way.
We can do that if our arguments are available as the desired types in Rust, which is not the case if they are taken from a shell script. If we want type-less format strings in Rust, we will have to use a different syntax for fish scripts, or treat all inputs as strings.
Having more than one syntax style might make translation more annoying, but generally the formatting syntax can just be copied, so that might be not much of an issue.
Do you have a concrete proposal for how Rust format strings should look, especially considering reordering for localization? We would have to write our own formatter for it, otherwise we lose the ability to print arbitrary bytes.
I'd rather create a dedicated builtin like
status localize-messagessketched above that callswgettext_fmt!,
that has exactly the same powers available to Rust code.
I agree that would be nice, but as mentioned above, we don't have type info in fish scripts, so it's just not possible to do the same stuff we can do in Rust.
So essentially I see these options:
Stick with printf for everything
Advantages:
- No need to come up with a new formatting syntax (and converting to it)
- Works for Rust and shell scripts, so we can have a single formatter implementation
- Well-known format
- Consistent format is nicer for translators
- Reasonably flexible for both Rust and shell scripts
- Argument reordering can be supported using explicit argument index syntax
Disadvantages:
- Requires redundant type info in Rust
- Unnecessarily restrictive in Rust (arbitrary types can have formatting implementations). Currently one would have to work around this by first formatting the argument separately, and then using
%sin the format string. - Fairly ugly syntax for more complicated directives
Use Rust-style strings for everything
Advantages:
- Eliminates redundant type info in Rust (although it does not shorten the strings, both
{}and%stake two chars) - Allows more Rust types to be formatted directly
- Nicer, more intuitive style, hopefully?
- Compile-time checks for some things. Checking that the arguments implement some formatting trait should be simple. Other checks seem harder, especially when the format string is localized. Maybe checking Rust's
format!can help with understanding what can be done there.
Disadvantages:
- Dumbs down formatting for shell scripts. Every argument would have to be treated as a string, unless there is explicit info telling us otherwise (e.g. if the width can be specified as an argument, we can probably treat that argument as an integer. This depends on the exact format we choose).
- All usages need to be converted. A large part might be easily automatable (converting
%sto{}is not exactly hard). But avoiding false positives, and handling more complicated cases might require significant effort. - Unclear how argument reordering would be handled. Argument indices in brackets are one option, similar to how it is done in printf. Argument names provide more info (especially to translators), but are a bit harder to parse. For localization, it would also require having both the original and the translated format string available to figure out how the reordering should be done. I don't have a good idea yet how that could be implemented sensibly.
Use Rust-style strings for Rust, separate syntax for shell scripts
Advantages:
- Choices can be made independently to best suit the respective use cases (see previous sections for how/why that might be helpful)
Disadvantages:
- Need to maintain 2 formatting implementations (3 if shell formatting uses something other than printf)
- Translators don't have one consistent syntax to work with
My conclusion
I think a reasonable case can be made for all three approaches, with none being clearly better or worse than the others. If someone is willing to propose a more detailed plan for the second or third option, and is willing to implement it, I don't think I would object. If that's not the case, I think sticking with the current approach is by far the least effort, since it already works and there is a clear way of implementing the missing feature of argument reordering, so that would be my default choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably a big problem in practice to leave the reordering
feature out of fish script for now, as long as we can add it later.
I'm assuming you meant that its not a big problem. While I would agree with that, the primary reason for introducing the feature is providing more flexibility for translators. If the printf command doesn't support explicit positions, translators need to know whether a string might be used by the printf command, which would already be really annoying if it could be determined from the section headers in the PO files, but that's not the case, since deduplication means that if a message is present in a fish script and the Rust sources, it will show up in the Rust section. So I prefer not introducing the feature at all over introducing it only for sprintf but not the printf command.
Regarding implicit/explicit mixing, see #11874 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with taking this PR assuming we add docs and tests, and assuming
it doesn't add any behavior to printf that we ever want to remove again.
(I haven't read too much of the second commit though, but you can take care
of keeping it correct and reasonably simple. It's an ugly problem because,
as you say, only Rust consumers have type info.)
But I'm deifnitely interested in the next steps:
Do you have a concrete proposal for how Rust format strings should look,
I think I have a clear picture of our new localizable format strings' would look like in Rust
(there are only loose ends on the equivalent for fish script, should we want to use a subset of the syntax there).
Roughly:
-
replace
sprintf!()withwformat!().- maybe share no code with the printf implementation, not yet sure
wformat!()is to support a subset offormat!()'s format string syntax- The main difference is that
wformat!()uses our PUA-trick (str2wcstring), so the result can represent non-UTF-8 data - The format string argument to
wformat!()must always be a compile-time constant
(except perhaps for translations, see below).- even if
wformat!()doesn't do compile-time checking for format args yet,
it's a step in the right direction, and we can add that later.
- even if
-
For localizable strings where we need parameter reordering,
implement either a relevant subset of- explicit parameter positions
- or of named parameters
I probably wouldn't bother with positionals, since we never use them,
and using named parameters everywhere seems like better practice anyway.
I have not looked into how complicated it would be to implement this,
but remember, it's fine if we do everything at runtime for now
(because that's whatsprintfdoes today), so I don't expect any blockers.
There are some pre-existing issue:
just like withsprintf!, we can fail (i.e. panic) if we're given a string that we don't support yet.
If something unsupported likewformat!("{:02}")sneaks through code review, that'll be tragic but we can probably write a clippy check for that.
We'd also want to check all translations.
-
Regarding localizable strings in fish script: I checked all existing uses,
and no placeholder produces anything other that what we'd get withformat!("{}", val).
So I think we get away with implementing only that for now.
I didn't see anything like octal (printf %d 0123) or floating point numbers, whereprintfuses a locale-specific decimal separator.
If we need those, we can still use printf, and feed the resultfish_localize_message "{percentage}% complete" -- percentage=(printf %0.2f (math $have / $want))
This new DSL template language might be weird for a shell.
but I also don't expect that we want to stick to printf for translations forever,
since it probably can't support future translations features like plurals.
I haven't worked out the concrete details how this would work in fish script
(since I'm not yet sure when we would do this)
Stick with
printffor everything
yeah, this seems fine as a first step
Use Rust-style strings for everything
Use Rust-style strings for Rust, separate syntax for shell scripts
I think either of these two options is a good next step.
- Dumbs down formatting for shell scripts.
This is only about localized-message formatting right.
I'm no i18n expert but I'd think that's fine.
It does take away some (already very limited) power from translators
(e.g. they'd no longer have the choice between %X and %x)
but Fluent probably has the same issue (coming from the Javascript land),
so I'm sure they have a way more flexible solution for letting translators format dynamic values.
To be safe, I'd do some research into Fluent,
we don't want to make it harder for us to switch to that.
All uses of %s need to be converted.
that's something we want to do regardless
Unclear how argument reordering would be handled.
As suggested above, we'd implement a subset of Rust's std::fmt syntax
For localization, it would also require having both the original and the translated format string available to figure out how the reordering should be done. I don't have a good idea yet how that could be implemented sensibly.
this is about the hypothetical fish_localize_message builtin, right?
That builtin takes a msgid and some key-value pairs for placeholders,
looks like the msgid like _ does, and then
- either invokes a version of
wformat!()that takes a dynamic format string - or have our
CATALOGSnot expose translations as&wstrbut rather
as aFn(msgid: &wstr, params: ParameterBag) -> WString.
This should never fail unless there's a bug in our code.
Again, I'd want to take a look at how fluent-rs does it
(our switch to using that.. I know that it might be awkward for completion scripts,
but maybe we can rework our existing po file generation to create *.ftl files.. not sure if that's feasible).
If the
printfcommand doesn't support explicit positions, translators need to know whether a string might be used by theprintfcommand,
Right, if we didn't add this feature to the printf builtin,
that would speak for option 3, to make it more obvious that
one of the syntaxes doesn't allow reordering.
But since it looks like we are gonna add it to printf, it doesn't matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't read too much of the second commit though, but you can take care
of keeping it correct and reasonably simple.
Yeah, don't bother with reviewing the current state. Before we add features to the printf implementation it should see some major refactoring.
(e.g. they'd no longer have the choice between %X and %x)
It's worse than that, they'd no longer have the choice to format anything as something other than a string if the format specifier does not contain type info. Which may be ok, I haven't actually checked if anything else is used/needed.
this is about the hypothetical fish_localize_message builtin, right?
I was thinking of the Rust side as well. I just realized that default Rust formatting supports named parameters by specifying arguments as key-value pairs. With k-v approach for arguments (both in Rust and shell scripts) the concerns I raised about named parameters are void.
One option, which for some reason I didn't think of earlier, is using existing formatting implementations (like Rust's format!, or the formatx crate for localized format strings), and then add a post-processing step which converts PUA-encoded bytes back into the original bytes before printing. This should also work with fluent, I think. Is there anything preventing this approach from working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we add features to the printf implementation it should see some major refactoring.
the sprintf change seems like a lot of bang for the buck (if translators are waiting on this),
so we could do only that initially, your call
Which may be ok
it probably is, I'd be very surprised if any existing translation changes any format specifiers relative to the source string
Is there anything preventing this approach from working?
So you're saying we should operate on String, because WString
invariants already guarantee that every char is a valid char,
i.e. doing effectively:
let fish_string: WString;
let rust_string: String = fish_string.to_string()
let formatted_message = format!("some message: {rust_string}");
let formatted_wide_string = WString::from_str(&formatted_message);yeah that should work actually.
The number of temporary allocations seems odd but probably not a show-stopper.
A wide-char-native implementation would write to WString directly,
but IIRC the standard traits like Display, Debug want String,
so that might be an uphill battle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the
sprintfchange seems like a lot of bang for the buck (if translators are waiting on this),
so we could do only that initially, your call
The actual buck is quite a bit larger, since it requires more refactoring of printf (the current code path for explicit arg positions doesn't support escape sequences, for example, and there is already way too much code duplication). I might continue to work on it, but I haven't decided yet.
i.e. doing effectively:
let fish_string: WString;
let rust_string: String = fish_string.to_string()
let formatted_message = format!("some message: {rust_string}");
let formatted_wide_string = WString::from_str(&formatted_message);I think the second step is unnecessary, our wstr type implements Display: https://docs.rs/widestring/latest/widestring/utfstr/struct.Utf32Str.html#impl-Display-for-Utf32Str
While we could convert the formatting result back to WString, it might make sense to directly convert to bytes in the cases where we want to output and not keep the formatted string around (which I assume is most/all cases). With that, the only allocation we need is for formatted_message, which would happen even if we were only dealing with Rust's str and no non-UTF-8 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The format strings themselves should of course be Rust strs, but that should not be a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the second step is unnecessary,
Right. So as long as we implement Display for everything, it's at most one extra allocation, which is surely fine. Nice.
it might make sense to directly convert to bytes in the cases where we want to output
yeah, if we're filling an IoBuffer or similar, we can skip WString::from_str+wcs2string in favor of
fn internal_chars_to_external_bytes(out: &mut Vec<u8>, cs: Iterator<Item = char>) {
for c in cs {
if is_pua(c) {
out.push(decode_byte_from_char(c));
} else {
for b in wcrtomb(c) {
out.push(b);
}
}
}
}| /// The format string contains a format specifier using explicit argument indices and one using | ||
| /// implicit indices. | ||
| /// Only all-explicit and all-implicit are allowed. | ||
| InconsistentUseOfArgumentIndices, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inconsistent Arg/Argument.
I'd change everything to Argument,
and use use printf::Error::* in callers.
zsh supports mixing explicit and implicit indices:
$ zsh -c 'printf \'%2$s%s\n\' a b c d'
ba
dc
I don't have a super strong opinion here,
as long as we're implementing a subset of POSIX printf(3p) or zsh printf,
(that we can still extend it later if we really want).
Maybe we should document which printf dialect we're following, if any.
| InconsistentUseOfArgumentIndices, | ||
| /// Not all arguments are used. | ||
| /// Vector contains indices of unused arguments (1-based) | ||
| UnusedArgs(Vec<usize>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem reasonable to err on the side of giving better errors for now rather than implementing rare edge cases.
| buffer.as_str() | ||
| } | ||
|
|
||
| fn take_arg_position(&mut self, num_args: usize) -> Result<Option<usize>, Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oof, the (existing) duplication is ugly.
I guess we could try to extract a shared trait.
| s.advance_by(1); | ||
| let arg_index = match arg_positions { | ||
| ArgPositions::Uninitialized => { | ||
| panic!("arg_positions must be initialized before parsing width.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we replace ArgPositions::Uninitialized enum variant with
Option<ArgPositions>, to get rid of repeated assertions?
Description
This allows explicit specification of the argument order in sprintf invocations, e.g. when using these arguments
"%2$s %1$s" "a" "b", the expected result would be"b a". These specifiers are also available for the width and precision, and should have the same semantics as other sprintf implementations.The main motivation for this change is giving translators the freedom to change the argument order, which can allow for more idiomatic translations.
This can also be made available to fish's
printfcommand, but that has some additional input validation insrc/builtins/printf.rs:print_formattedwhich would have to be updated to make it compatible with the expanded syntax.TODOs: