75

I have recently seen a colleague of mine using std::string as a buffer:

std::string receive_data(const Receiver& receiver) {
  std::string buff;
  int size = receiver.size();
  if (size > 0) {
    buff.resize(size);
    const char* dst_ptr = buff.data();
    const char* src_ptr = receiver.data();
    memcpy((char*) dst_ptr, src_ptr, size);
  }
  return buff;
}

I guess this guy wants to take advantage of auto destruction of the returned string so he needs not worry about freeing of the allocated buffer.

This looks a bit strange to me since according to cplusplus.com the data() method returns a const char* pointing to a buffer internally managed by the string:

const char* data() const noexcept;

Memcpy-ing to a const char pointer? AFAIK this does no harm as long as we know what we do, but have I missed something? Is this dangerous?

16
  • 31
    With C++17 data() has an overload returning a pointer to non-const qualified char. Commented Jun 3, 2019 at 7:35
  • 40
    ...which is mentioned by cppreference. cplusplus.com is not the best source. Commented Jun 3, 2019 at 7:36
  • 11
    I think that the operation of casting from const char* to char* by itself implies some sort of hazard in your program. If dst_ptr points to a read-only memory block, then you should not attempt to write into that block using this pointer. Commented Jun 3, 2019 at 7:37
  • 8
    Whenever you see code using C-style casts (like e.g. (char*) dst_ptr) you should take that as a red flag. Commented Jun 3, 2019 at 7:38
  • 9
    I think this question is mostly opinion-based. IMO using std::string as a buffer is fine if you know you're receiving text data. If you're receiving binary data std::vector<char> is probably a better choice. Commented Jun 3, 2019 at 7:42

9 Answers 9

78

Don't use std::string as a buffer.

It is bad practice to use std::string as a buffer, for several reasons (listed in no particular order):

  • std::string was not intended for use as a buffer; you would need to double-check the description of the class to make sure there are no "gotchas" which would prevent certain usage patterns (or make them trigger undefined behavior).
  • As a concrete example: Before C++17, you can't even write through the pointer you get with data() - it's const Tchar *; so your code would cause undefined behavior. (But &(str[0]), &(str.front()), or &(*(str.begin())) would work.)
  • Using std::strings for buffers is confusing to readers of your function's definition, who assume you would be using std::string for, well, strings. In other words, doing so breaks the Principle of Least Astonishment.
  • Worse yet, it's confusing for whoever might use your function - they too may think what you're returning is a string, i.e. valid human-readable text.
  • std::unique_ptr would be fine for your case, or even std::vector. In C++17, you can use std::byte for the element type, too. A more sophisticated option is a class with an SSO-like feature, e.g. Boost's small_vector (thank you, @gast128, for mentioning it).
  • (Minor point:) libstdc++ had to change its ABI for std::string to conform to the C++11 standard, so in some cases (which by now are rather unlikely), you might run into some linkage or runtime issues that you wouldn't with a different type for your buffer.

Also, your code may make two instead of one heap allocations (implementation dependent): Once upon string construction and another when resize()ing. But that in itself is not really a reason to avoid std::string, since you can avoid the double allocation using the construction in @Jarod42's answer.

Sign up to request clarification or add additional context in comments.

13 Comments

And with std::string_view, you can define readable portions of the buffer.
@M.M For example the terminating \0 character which std::string guarantees to be present. Would you know off-hand whether it is included in the valid range of data()? Overwriting it is UB! It needs a few extra mental cycles (and possibly reference lookups) to verify this function does not have UB from that.
@M.M: The point is - you don't intuitively know! I wasn't even aware of what MaxLanghof wrote in his comment.
+1 for std::byte, unbelievable I haven't heard of it before! It's crazy how often one looks into some C++ reference every day and still new things keep popping up...
@einpoklum: note that it's not only in creation but also in access pattern. With SSO one probably has locality of memory access patterns which modern processors optimize for. boost offers small_vector which is a vector with SSO. Pity that these things don't end up in the std.
|
68

You can completely avoid a manual memcpy by calling the appropriate constructor:

std::string receive_data(const Receiver& receiver) {
    return {receiver.data(), receiver.size()};
}

That even handles \0 in a string.

BTW, unless content is actually text, I would prefer std::vector<std::byte> (or equivalent).

10 Comments

As original buffer data should not be not const, you should not have UB.
And by then a good programmer goes: "Why the hell do I need a separate one-liner conversion function? Not only that, but this function named 'receive' doesn't actually perform any 'receiving'. Delete!"
@screwnut: To be fair, it is theoretically possible that receiver.data() waits on the reception to happen, or does something else beyond returning a member pointer.
@screwnut: Then I'm a bad programmer; I'll keep the conversion in a separate function, even if it's a one liner, because I appreciate abstraction and do not like to repeat myself. If I later need to add some checks, some logging, etc... the function is right here and I don't have to hunt around the codebase for all instances of conversion.
@screwnut: Prefer non-member non-friend functions.
|
10

Memcpy-ing to a const char pointer? AFAIK this does no harm as long as we know what we do, but is this good behavior and why?

The current code may have undefined behavior, depending on the C++ version. To avoid undefined behavior in C++14 and below take the address of the first element. It yields a non-const pointer:

buff.resize(size);
memcpy(&buff[0], &receiver[0], size);

I have recently seen a colleague of mine using std::string as a buffer...

That was somewhat common in older code, especially circa C++03. There are several benefits and downsides to using a string like that. Depending on what you are doing with the code, std::vector can be a bit anemic, and you sometimes used a string instead and accepted the extra overhead of char_traits.

For example, std::string is usually a faster container than std::vector on append, and you can't return std::vector from a function. (Or you could not do so in practice in C++98 because C++98 required the vector to be constructed in the function and copied out). Additionally, std::string allowed you to search with a richer assortment of member functions, like find_first_of and find_first_not_of. That was convenient when searching though arrays of bytes.

I think what you really want/need is SGI's Rope class, but it never made it into the STL. It looks like GCC's libstdc++ may provide it.


There a lengthy discussion about this being legal in C++14 and below:

const char* dst_ptr = buff.data();
const char* src_ptr = receiver.data();
memcpy((char*) dst_ptr, src_ptr, size);

I know for certain it is not safe in GCC. I once did something like this in some self tests and it resulted in a segfault:

std::string buff("A");
...

char* ptr = (char*)buff.data();
size_t len = buff.size();

ptr[0] ^= 1;  // tamper with byte
bool tampered = HMAC(key, ptr, len, mac);

GCC put the single byte 'A' in register AL. The high 3-bytes were garbage, so the 32-bit register was 0xXXXXXX41. When I dereferenced at ptr[0], GCC dereferenced a garbage address 0xXXXXXX41.

The two take-aways for me were, don't write half-ass self tests, and don't try to make data() a non-const pointer.

8 Comments

Prefer std::copy for type-safety. It won't be slower.
Seems like the only answer directly answering the question.
"you can't return std::vector from a function. (Or you could not do so back in C++98 or C++03)" is wrong.
And it has never been a legal optimization for a compiler to confuse an address with stored content. buff.data() cannot be a register containing 'A', it must be an address.
@jww: Indeed, although you could avoid that copy with NRVO and a swap() call. But the copy would also be required of std::string. Small string optimization could make it a little better. I think some implementations tried to solve this using copy-on-write (for string, never allowed for vector) but even in C++98 and C++03 there were some specifications on std::string that couldn't be reasonable met by COW. Of course rvalue references and moves solved it neatly.
|
8

From C++17, data can return a non const char *.

Draft n4659 declares at [string.accessors]:

const charT* c_str() const noexcept;
const charT* data() const noexcept;
....
charT* data() noexcept;

18 Comments

@SergeBallesta - Removing a const qualifier is not UB. Modifying a const object is UB. The object in question is not const.
@SergeBallesta - Seriously? And how do you reconcile that with &str[0] being a non-const pointer to the very same buffer? The object is guaranteed not to be const. The core rules of the language still apply, even to pointers returned from library types, ergo, no UB.
@Jarod42: I agree that I am nitpicking here, but the library could expect the buffer not be be changed, and later reuse a cached version. Coming for old K&R C I am now frightened by optimizing compilers and very cautious for constness and strict aliasing.
@StoryTeller Serge is correct that "Modifying the character array accessed through the const overload of data has undefined behavior." according to cppreference and the standard.
(For completeness, here is the C++11 wording: timsong-cpp.github.io/cppwp/n3337/string.ops#string.accessors-3)
|
7

The code is unnecessary, considering that

std::string receive_data(const Receiver& receiver) {
    std::string buff;
    int size = receiver.size();
    if (size > 0) {
        buff.assign(receiver.data(), size);
    }
    return buff;
}

will do exactly the same.

4 Comments

You can cut even more code; the if is also unnecessary. assign will be a no-op then. But continue to unnecessary remove code, and you end up with Jarod42's answer. None of these lines are necessary, as std::string already has an appropriate constructor.
@MSalters I prefer not to assume things that are not given. What if receiver.size() can return negative values?
That would be rather unexpected, given that sizes are typically a size_t and therefore unsigned. That does show a possible problem with your code: it might suffer from signed integer overflow, which is undefined behavior. And that's on a code path which handles input, so this may constitute an externally exploitable vulnerability.
@MSalters True, Jarod42's changes could introduce an externally exploitable vulnerability. They could also introduce crashes if receiver.data() is UB when receiver.size() is zero.
5

The big optimization opportunity I would investigate here is: Receiver appears to be some kind of container that supports .data() and .size(). If you can consume it, and pass it in as a rvalue reference Receiver&&, you might be able to use move semantics without making any copies at all! If it’s got an iterator interface, you could use those for range-based constructors or std::move() from <algorithm>.

In C++17 (as Serge Ballesta and others have mentioned), std::string::data() returns a pointer to non-const data. A std::string has been guaranteed to store all its data contiguously for years.

The code as written smells a bit, although it’s not really the programmer’s fault: those hacks were necessary at the time. Today, you should at least change the type of dst_ptr from const char* to char* and remove the cast in the first argument to memcpy(). You could also reserve() a number of bytes for the buffer and then use a STL function to move the data.

As others have mentioned, a std::vector or std::unique_ptr would be a more natural data structure to use here.

Comments

4

One downside is performance. The .resize method will default-initialize all the new byte locations to 0. That initialization is unnecessary if you're then going to overwrite the 0s with other data.

Comments

2

I do feel std::string is a legitimate contender for managing a "buffer"; whether or not it's the best choice depends on a few things...

Is your buffer content textual or binary in nature?

One major input to your decision should be whether the buffer content is textual in nature. It will be less potentially confusing to readers of your code if std::string is used for textual content.

char is not a good type for storing bytes. Keep in mind that the C++ Standard leaves it up to each implementation to decide whether char will be signed or unsigned, but for generic black-box handling of binary data (and sometimes even when passing characters to functions like std::toupper(int) that have undefined behaviour unless the argument is in range for unsigned char or is equal to EOF) you probably want unsigned data: why would you assume or imply that the first bit of each byte is a sign bit if it's opaque binary data?

Because of that, it's undeniably somewhat hackish to use std::string for "binary" data. You could use std::basic_string<std::byte>, but that's not what the question asks about, and you'd lose some inoperability benefits from using the ubiquitous std::string type.

Some potential benefits of using std::string

Firstly a few benefits:

  • it sports the RAII semantics we all know and love

  • most implementations feature short-string optimisation (SSO), which ensures that if the number of bytes is small enough to fit directly inside the string object, dynamic allocation/deallocation can be avoided (but there may be an extra branch each time the data is accessed)

    • this is perhaps more useful for passing around copies of data read or to be written, rather than for a buffer which should be pre-sized to accept a decent chunk of data if available (to improve throughput by handling more I/O at a time)
  • there's a wealth of std::string member functions, and non-member functions designed to work well with std::strings (including e.g. cout << my_string): if your client code would find them useful to parse/manipulate/process the buffer content, then you're off to a flying start

  • the API is very familiar to most C++ programmers

Mixed blessings

  • being a familiar, ubiquitous type, the code you interact with may have specialisations to for std::string that better suit your use for buffered data, or those specialisations might be worse: do evaluate that

Concerned

As Waxrat observed, what is lacking API wise is the ability to grow the buffer efficiently, as resize() writes NULs/'\0's into the characters added which is pointless if you're about to "receive" values into that memory. This isn't relevant in the OPs code where a copy of received data is being made, and the size is already known.

Discussion

Addressing einpoklum's concern:

std::string was not intended for use as a buffer; you would need to double-check the description of the class to make sure there are no "gotchas" which would prevent certain usage patterns (or make them trigger undefined behavior).

While it's true that std::string wasn't originally intended for this, the rest is mainly FUD. The Standard has made concessions to this kind of usage with C++17's non-const member function char* data(), and string has always supported embedded zero bytes. Most advanced programmers know what's safe to do.

Alternatives

  • static buffers (C char[N] array or std::array<char, N>) sized to some maximum message size, or ferrying slices of the data per call

  • a manually allocated buffer with std::unique_ptr to automate destruction: leaves you to do fiddly resizing, and track the allocated vs in-use sizes yourself; more error-prone overall

  • std::vector (possibly of std::byte for the element type; is widely understood to imply binary data, but the API is more restrictive and (for better or worse) it can't be expected to have anything equivalent to Short-String Optimisation.

  • Boost's small_vector: perhaps, if SSO was the only thing holding you back from std::vector, and you're happy using boost.

  • returning a functor that allows lazy access to the received data (providing you know it won't be deallocated or overwritten), deferring the choice of how it's stored by client code

Comments

1

use C++23's string::resize_and_overwrite

https://en.cppreference.com/w/cpp/string/basic_string/resize_and_overwrite

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1072r10.html

[[nodiscard]] static inline string formaterr (DWORD errcode) {
    string strerr;
    
    strerr.resize_and_overwrite(2048, [errcode](char* buf, size_t buflen) {
        // https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-formatmessage
        return FormatMessageA(
            FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
            nullptr,
            errcode,
            0,
            buf,
            static_cast<DWORD>(buflen),
            nullptr
        );
    });
    
    return strerr;
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.