Skip to content

Conversation

@massakam
Copy link
Contributor

@massakam massakam commented Feb 6, 2025

Motivation

If a consumer with a receiver queue size of 0 is waiting for messages using receive or receiveAsync, and the topic is unloaded or the broker is restarted, the consumer will no longer be able to receive messages. This is because receive and receiveAsync have different bugs. Both bugs are highly reproducible.

receive

When the receive method of a zero queue consumer is executed, fetchSingleMessageFromBroker is called internally. In fetchSingleMessageFromBroker, the connection of the received message is compared with the connection when the flow permit was sent, and if they do not match, the message is discarded.

ClientConnectionPtr currentCnx = getCnx().lock();
Lock lock(mutexForReceiveWithZeroQueueSize);
// Just being cautious
if (incomingMessages_.size() != 0) {
LOG_ERROR(
getName() << "The incoming message queue should never be greater than 0 when Queue size is 0");
incomingMessages_.clear();
}
waitingForZeroQueueSizeMessage = true;
sendFlowPermitsToBroker(currentCnx, 1);
while (true) {
if (!incomingMessages_.pop(msg)) {
return ResultInterrupted;
}
{
// Lock needed to prevent race between connectionOpened and the check "msg.impl_->cnx_ ==
// currentCnx.get())"
Lock localLock(mutex_);
// if message received due to an old flow - discard it and wait for the message from the
// latest flow command
if (msg.impl_->cnx_ == currentCnx.get()) {

If the topic is unloaded or the broker is restarted between sending the flow permit and receiving the message, the connections at the two times will no longer match, so the message will be discarded and no further flow permits will be sent, causing message delivery to stop.

receiveAsync

If the consumer is reopened by unloading the topic or restarting the broker, flow permits that were sent before the reopening must be sent again. However, if a consumer with a receiver queue size of 0 is waiting for messages with receiveAsync, it does not seem to send flow permits even if the consumer is reopened.

LOG_INFO(getName() << "Created consumer on broker " << cnx->cnxString());
{
Lock lock(mutex_);
setCnx(cnx);
incomingMessages_.clear();
possibleSendToDeadLetterTopicMessages_.clear();
state_ = Ready;
backoff_.reset();
// Complicated logic since we don't have a isLocked() function for mutex
if (waitingForZeroQueueSizeMessage) {
sendFlowPermitsToBroker(cnx, 1);
}
availablePermits_ = 0;
}
LOG_DEBUG(getName() << "Send initial flow permits: " << config_.getReceiverQueueSize());
if (config_.getReceiverQueueSize() != 0) {
sendFlowPermitsToBroker(cnx, config_.getReceiverQueueSize());
} else if (messageListener_) {
sendFlowPermitsToBroker(cnx, 1);
}
consumerCreatedPromise_.setValue(get_shared_this_ptr());

Modifications

receive

In fetchSingleMessageFromBroker, I think the connection of the received message should be compared with the current connection, not the connection at the time the flow permit was sent. In fact, the Java client makes such a comparison.
https://github.com/apache/pulsar/blob/8a40b30cf47a91ec02d931e6371d02409ba5951e/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ZeroQueueConsumerImpl.java#L121

In addition, if the handleCreateConsumer method, which is executed when the consumer is reopened, is executed "after setting waitingForZeroQueueSizeMessage to true" and "before sending a flow permit", there is a possibility that the flow permit will be sent twice, so we lock mutex_ to perform exclusive control.

receiveAsync

We can find the number of flow permits sent before the reconnection and the number of callbacks waiting for receiveAsync to complete by checking the number of elements in pendingReceives_.

pendingReceives_.push(callback);
lock.unlock();
if (config_.getReceiverQueueSize() == 0) {
sendFlowPermitsToBroker(getCnx().lock(), 1);
}

Therefore, when a zero queue consumer is reopened, it should send the same number of flow permits as the elements of pendingReceives_.

In this case too, if handleCreateConsumer is executed "after adding an element to pendingReceives_" and "before sending a flow permit", it is possible that more flow permits will be sent than expected. So we need to lock mutex_ in receiveAsync too.

Verifying this change

  • Make sure that the change passes the CI checks.

Documentation

  • doc-not-needed

@massakam massakam added the bug Something isn't working label Feb 6, 2025
@massakam massakam added this to the 3.8.0 milestone Feb 6, 2025
@massakam massakam self-assigned this Feb 6, 2025
Copy link
Contributor

@BewareMyPower BewareMyPower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great fix!

@BewareMyPower BewareMyPower merged commit afeac78 into apache:main Feb 8, 2025
15 checks passed
@massakam massakam deleted the fix-zero-queue-consumer branch February 12, 2025 01:32
BewareMyPower pushed a commit that referenced this pull request Apr 28, 2025
…s after topic unloading (#473)

(cherry picked from commit afeac78)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants