-1 votes
0 answers
31 views

I'm trying to quantize a YOLOv8s model to INT8 using TensorRT on a Jetson Orin (JetPack, TensorRT 8.6.2, Ultralytics 8.2.83, CUDA 12.2). The FP16 engine works correctly but the INT8 engine produces ...
Adel Ali Taleb
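The failure mode described here often comes down to calibration. A rough sketch of symmetric per-tensor INT8 quantization (plain Python, not the TensorRT API; all values invented) shows how a poorly chosen calibration range clips activations and wrecks the output, while FP16 is unaffected:

```python
# Symmetric per-tensor INT8 quantization sketch: a bad calibration range
# (amax) clips outliers and dominates the error. Values are made up.

def quantize_int8(values, amax):
    """Map floats in [-amax, amax] to int8 with a symmetric scale."""
    scale = amax / 127.0
    return [max(-128, min(127, round(v / scale))) for v in values], scale

def dequantize(q, scale):
    return [v * scale for v in q]

activations = [0.02, -0.5, 1.7, 3.9, -2.2]

# A calibration range that covers the observed values.
q_good, s_good = quantize_int8(activations, amax=4.0)
# A range picked from unrepresentative calibration data clips outliers.
q_bad, s_bad = quantize_int8(activations, amax=0.5)

err_good = max(abs(a - b) for a, b in zip(activations, dequantize(q_good, s_good)))
err_bad = max(abs(a - b) for a, b in zip(activations, dequantize(q_bad, s_bad)))
print(err_good < err_bad)  # True: clipping, not rounding, dominates the error
```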
3 votes
0 answers
36 views

I'm doing some research on information encoding with LLMs and need to find a way to quantize the weights of the MLP layers (MoE) to 4 bits, and even custom mixed precision. Consider from ...
ShoutOutAndCalculate
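A minimal sketch of what group-wise 4-bit weight quantization looks like, with two codes packed per byte (the group size, layout, and rounding scheme are assumptions, not any particular library's format):

```python
# Group-wise 4-bit (asymmetric) weight quantization sketch: each group
# stores unsigned 4-bit codes plus a (scale, min) pair, two codes per byte.

def quantize_4bit(weights, group_size=4):
    """Quantize each group to 4-bit codes plus per-group (scale, min)."""
    packed, params = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15.0 or 1.0          # 15 = 2**4 - 1 levels
        codes = [round((w - lo) / scale) for w in group]
        params.append((scale, lo))
        # Pack two 4-bit codes into one byte (low nibble first).
        for j in range(0, len(codes), 2):
            hi_nib = codes[j + 1] if j + 1 < len(codes) else 0
            packed.append((hi_nib << 4) | codes[j])
    return packed, params

w = [0.1, -0.3, 0.25, 0.0, 1.0, 0.9, 0.95, 1.05]
packed, params = quantize_4bit(w)
print(len(packed))  # 4 bytes for 8 weights
```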
0 votes
0 answers
32 views

I got a pretrained ResNet-18 model from this lane-detection repo in order to use it as an ADAS (advanced driver assistance systems) function for an electric-car-building competition. My current goal is ...
Ekim • 3
0 votes
1 answer
125 views

I want to apply a quantization function to a deep CNN. This CNN is used for an image classification task (4 classes), and my data consists of 224×224 images. When I run this code, I get an error. ...
jasmine • 31
2 votes
0 answers
100 views

I am trying to reproduce the exact layer-wise output of a quantized EfficientNet model (TFLite model, TensorFlow 2.17) by re-implementing Conv2D, DepthwiseConv2D, FullyConnected, Add, Mul, Sub and ...
Jolverine
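Bit-exact reproduction of TFLite kernels usually hinges on the requantization step: the int32 accumulator is scaled by M = (input_scale × weight_scale) / output_scale, realized as a fixed-point multiplier plus a right shift. A simplified sketch (scales are invented; the real kernels use a saturating rounding-doubling high multiply):

```python
# TFLite-style requantization sketch: split the real multiplier into an
# int32 mantissa and a shift, then rescale the accumulator in integers.

def quantize_multiplier(m):
    """Split a real multiplier m in (0, 1) into an int mantissa and shift."""
    shift = 0
    while m < 0.5:
        m *= 2.0
        shift += 1
    return int(round(m * (1 << 31))), shift

def requantize(acc, q_mult, shift, zero_point):
    """Scale an int32 accumulator to int8 using only integer arithmetic."""
    prod = (acc * q_mult + (1 << 30)) >> 31     # high bits, round-to-nearest
    if shift:
        prod = (prod + (1 << (shift - 1))) >> shift
    return max(-128, min(127, prod + zero_point))

m = (0.05 * 0.02) / 0.1                         # = 0.01 (invented scales)
q_mult, shift = quantize_multiplier(m)
print(requantize(4000, q_mult, shift, zero_point=-1))  # 4000*0.01 + (-1) = 39
```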
0 votes
2 answers
240 views

I'm debugging a model conversion using onnx2tf and a post-training quantization issue involving Einsum, BatchMatMul, and FullyConnected layers across different model formats. Pipeline: ONNX → TF ...
Saurav Rai • 2,217
0 votes
0 answers
59 views

I'm applying QAT to a YOLOv8n model with the following configuration: QConfig( activation=FakeQuantize.with_args( observer=MovingAverageMinMaxObserver, quant_min=0, quant_max=...
Matteo • 111
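For reference, the forward pass such a QConfig describes can be sketched as a fake-quantize: round through the uint8 grid but stay in float, so gradients pass straight through during training. The scale and zero-point are hand-picked here; a real observer derives them from running min/max statistics:

```python
# Fake-quantize sketch matching quant_min=0, quant_max=255 (uint8 grid).
# The output stays float, which is what makes QAT differentiable.

def fake_quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    q = round(x / scale) + zero_point
    q = max(quant_min, min(quant_max, q))       # clamp to the uint8 grid
    return (q - zero_point) * scale             # dequantize back to float

# Hand-picked parameters for illustration only; an observer such as
# MovingAverageMinMaxObserver would track these from activations.
scale, zero_point = 4.0 / 255.0, 128
print(fake_quantize(0.5, scale, zero_point))    # close to 0.5, snapped to grid
```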
1 vote
0 answers
43 views

I am trying to quantize a model in tensorflow using tfmot. This is a sample model, inputs = keras.layers.Input(shape=(512, 512, 1)) x = keras.layers.Conv2D(3, kernel_size=1, padding='same')(inputs) x =...
Sai • 11
0 votes
1 answer
316 views

I'm trying to load the Qwen2.5-VL-7B-Instruct model from Hugging Face with 4-bit weight-only quantization using TorchAoConfig (similar to how it's mentioned in the documentation here), but I'm getting ...
Sankalp Dhupar
1 vote
0 answers
159 views

I’ve been working on fine-tuning LLaMA 2–7B using QLoRA with bitsandbytes 4-bit quantization and ran into a weird issue. I did adaptive pretraining on Arabic data with a custom tokenizer (vocab size ~...
orchid Ali
0 votes
2 answers
70 views

In my model, I use vector quantization (VQ) inside a recurrent neural network. The VQ is trained using straight-through estimation, with that particular code being identical to [1]: ...
Cola Lightyear
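The straight-through trick referenced here can be sketched without autograd: the forward output is the nearest codebook vector, expressed as z + (z_q − z) with the bracketed term treated as a constant, so the gradient with respect to z is the identity. The codebook values below are arbitrary:

```python
# VQ-VAE-style vector quantization with a straight-through estimator,
# in pure Python (no autograd). Codebook entries are made up.

def nearest_code(z, codebook):
    """Return index and vector of the closest codebook entry (L2)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(z, c))
    idx = min(range(len(codebook)), key=lambda i: dist(codebook[i]))
    return idx, codebook[idx]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
z = [0.9, 1.2]
idx, z_q = nearest_code(z, codebook)
# Straight-through: the forward value equals z_q, but written as
# z + (z_q - z) with the difference detached, d(output)/d(z) = 1.
forward = [a + (b - a) for a, b in zip(z, z_q)]
print(idx, forward)
```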
0 votes
0 answers
42 views

I'm trying to manually reproduce the inference forward pass to understand exactly how quantized inference works. To do so, I trained and quantized a model in PyTorch using QAT, manually simulated the ...
greifswald
1 vote
0 answers
112 views

I am quantizing a neural network using QAT and I want to convert it into TFLite. Quantization nodes get added to the skeleton graph and we get a new graph. I am able to load the trained QAT ...
Prateek Sharma
1 vote
0 answers
59 views

I’m trying to implement an MP3-like compression algorithm for audio and have followed the general steps, but I’m encountering a few issues with the quantization step. Here's the overall process I'm ...
Muchacho
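For context, the quantization step in MP3-style coders is nonuniform: values are companded with a 3/4 power law before rounding, so small spectral values keep relatively finer resolution. A sketch under an arbitrary step size (the real codec picks step sizes per scalefactor band via the psychoacoustic model):

```python
# MP3-style nonuniform quantizer sketch: x^(3/4) companding before uniform
# rounding, inverted with x^(4/3). The step size here is arbitrary.

def quantize(x, step):
    return int(round((abs(x) / step) ** 0.75)) * (1 if x >= 0 else -1)

def dequantize(q, step):
    return (abs(q) ** (4.0 / 3.0)) * step * (1 if q >= 0 else -1)

samples = [0.001, 0.02, 0.5, -0.3]
step = 0.01
codes = [quantize(s, step) for s in samples]
recon = [dequantize(q, step) for q in codes]
# Small values map to small codes (the tiniest rounds to 0), while large
# values are represented coarsely, which the step-size choice trades off.
print(codes)
```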
0 votes
1 answer
276 views

I'm trying to quantize the YOLOv11 model in TensorFlow and get this as a result: The target should be int8. Is this normal behaviour? When running it with TFLite Micro on an ESP32 I quickly run out of ...
gillo04 • 148
1 vote
0 answers
106 views

First of all, I want to help my mom with her embroidery projects and secondly, I want to get better in Python. So I don't need an exact solution. But it would be great to be pointed in the right ...
Ricked • 11
3 votes
1 answer
793 views

I'm encountering a RuntimeError while running a BitsAndBytes bf16 quantized Gemma-2-2b model on Hugging Face Spaces with a Gradio UI. The error specifically mentions unused kwargs and an ...
doniker99
0 votes
1 answer
720 views

I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:
File Name          Size
model.onnx         654 MB
model_fp16.onnx    327 MB
model_q4.onnx      200 MB
model_q4f16.onnx   134 MB
I understand ...
Franck Dernoncourt
1 vote
0 answers
61 views

I am trying to write a simple quantized tensor linear multiplication. Assuming the weight matrix w3 of shape (14336, 4096) and the input tensor x of shape (2, 512, 4096), where the first dim is ...
hafezmg48
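One way to sketch such a weight-only quantized linear op: int8 weights with one symmetric scale per output channel, dequantized on the fly against float activations. Tiny stand-in shapes are used here in place of the question's (14336, 4096):

```python
# Weight-only quantized linear sketch: per-output-channel symmetric int8
# weights, float activations, dequantization fused into the dot product.

def quantize_rows(w):
    """Per-output-channel symmetric int8 quantization of a weight matrix."""
    q_rows, scales = [], []
    for row in w:
        amax = max(abs(v) for v in row) or 1.0
        scale = amax / 127.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales

def linear(x, q_rows, scales):
    """y[i] = scale[i] * sum_k x[k] * q[i][k] — dequantize on the fly."""
    return [scales[i] * sum(xk * qk for xk, qk in zip(x, q_rows[i]))
            for i in range(len(q_rows))]

w = [[0.5, -0.25], [1.0, 2.0]]      # (out_features=2, in_features=2)
x = [2.0, 4.0]
q_rows, scales = quantize_rows(w)
y = linear(x, q_rows, scales)       # close to the float result [0.0, 10.0]
```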
3 votes
2 answers
3k views

I'm new to quantization and working with visual language models (VLM). I'm trying to load a 4-bit quantized version of the Ovis1.6-Gemma model from Hugging Face using the transformers library. I ...
meysam • 194
1 vote
0 answers
110 views

I converted an existing TensorFlow EfficientNet model built on TensorFlow 2.3.1 to a TFLite FP16 version to reduce its size. I want to run it on CPU and use it in my API. But while testing I ...
Harry Ali
1 vote
1 answer
1k views

We are trying to deploy a quantized Llama 3.1 70B model (from Hugging Face, using bitsandbytes). The quantizing part works fine, as we check the model memory, which is correct, and also test getting ...
Luis Leal • 3,554
1 vote
0 answers
2k views

I want to fine-tune locally the Meta's Llama 3.1 8B Instruct model with custom data and then save it in a format compatible with Ollama for further inference. As I do everything locally and don't have ...
Adrien • 13
1 vote
0 answers
162 views

RuntimeError: 'inputs.size() == 1' when setting the input tensor for an OpenVINO model with multiple inputs. I'm trying to use an OpenVINO model that was originally designed for PyTorch, and I'm running into ...
Framefact
2 votes
1 answer
4k views

I'm developing LLM agents using llama.cpp as inference engine. Sometimes I want to use models in safetensors format and there is a python script (https://github.com/ggerganov/llama.cpp/blob/master/...
arkuzo • 41
0 votes
1 answer
2k views

I'm stuck at this issue; any idea on how I can rectify it? I tried installing openbb and upgrading pydantic; however, I am unable to rectify this issue. Any suggestions would be appreciated. Thank you ...
milner pch
1 vote
0 answers
70 views

I want to do quantization-aware training. Here's my model architecture: Model: "sequential_4" _________________________________________________________________ Layer (type) ...
Vina • 27
0 votes
1 answer
133 views

We are trying to deploy vision transformer models (EfficientViT_B0, MobileViT_V2_175, and RepViT_M11) on our flutter application using the tflite_flutter_plus and tflite_flutter_plus_helper ...
D.Varam
1 vote
0 answers
148 views

I am new and want to try converting models to ONNX format, and I have the following issue. I have a model that has been quantized to 4-bit, and then I converted this model to ONNX. My quantized model ...
Toàn Nguyễn Phúc
0 votes
1 answer
158 views

Example:
# pip install transformers
from transformers import AutoModelForTokenClassification, AutoTokenizer
# Load model
model_path = 'huawei-noah/TinyBERT_General_4L_312D'
model = ...
Franck Dernoncourt
5 votes
1 answer
2k views

I am using the ONNX Python library. I am trying to quantize AI models statically using the quantize_static() function imported from onnxruntime.quantization. This function takes a ...
Zylon • 51
2 votes
0 answers
1k views

Summary I am trying to export the CIDAS/clipseg-rd16 model to ONNX using optimum-cli as given in the HuggingFace documentation. However, I get an error saying ValueError: Unrecognized configuration ...
Sattwik Kumar Sahu
3 votes
2 answers
1k views

I am currently only able to play around with a V100 on GCP. I understand that I can load an LLM in 4-bit quantization as shown below. However (assuming due to the quantization), it is taking up to 10 ...
sachinruk • 10k
0 votes
1 answer
1k views

I am trying to make a Gradio chatbot in Hugging Face Spaces using the Mistral-7B-v0.1 model. As this is a large model, I have to quantize it, otherwise the free 50 GB storage gets full. I am using bitsandbytes to ...
Anish • 13
0 votes
0 answers
68 views

I have a project that is basically to analyze the effects of quantization on orientation estimation algorithms. I have sensor data from gyroscope that looks like this when using float datatype: gx=-0....
user3662181
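The core effect being analyzed can be sketched directly: snap each float gyro sample to a fixed-point grid and measure the injected error. The ±250 deg/s full scale and 16-bit width are assumptions mirroring a typical MEMS gyro register, not the asker's sensor:

```python
# Sensor quantization sketch: float gyro sample -> signed 16-bit code.
# Full scale and bit width are assumptions for illustration.

FULL_SCALE = 250.0                  # deg/s (assumed gyro range)
LSB = FULL_SCALE / 32768.0          # step size for a signed 16-bit code

def quantize_sample(x):
    return max(-32768, min(32767, round(x / LSB)))

gx = -0.137                         # a float sample like the question's
code = quantize_sample(gx)
recon = code * LSB
error = abs(gx - recon)
print(error <= LSB / 2)  # True: uniform quantization error is bounded by LSB/2
```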
0 votes
1 answer
209 views

I'm using Keras with tensorflow-model-optimization (tfmot) for quantization-aware training (QAT). My model is based on a pre-trained backbone from keras.applications. As mentioned in the transfer ...
Никита Шубин
0 votes
1 answer
614 views

Does the GGUF format perform model quantization even though it's already quantized with LoRA? Hello! I'm new to LLMs, and I've fine-tuned the CodeLlama model on Kaggle using LoRA. I've merged and ...
Samar • 3
1 vote
0 answers
282 views

I am trying to learn about quantization, so I was playing with a GitHub repo, trying to quantize it into int8 format. I have used the following code to quantize the model. modelClass = DTLN_model() ...
Niaz Palak
2 votes
0 answers
1k views

I have been facing an issue when trying to run inference with a dynamically quantized yolov8s ONNX model on GPU. I have used yolov8s.pt and exported it to yolov8.onnx using ONNX export. Then I ...
Suraj Rao
3 votes
1 answer
3k views

Not sure if it's the right forum to ask, but: assuming I have a GPTQ model that is 4-bit, how does using from_pretrained(torch_dtype=torch.float16) work? In my understanding, 4-bit means changing the ...
aceminer • 4,375
5 votes
2 answers
6k views

I'm currently fine-tuning the Mistral 7B model and encountered the following error: ValueError: You cannot simultaneously pass the load_in_4bit or load_in_8bit arguments while also passing the ...
Jyoti yadav
0 votes
1 answer
1k views

I am using model = 'filipealmeida/Mistral-7B-Instruct-v0.1-sharded' and quantize it in 4-bit with the following function. def load_quantized_model(model_name: str): """ :param ...
Gabriele Castaldi
1 vote
2 answers
715 views

I am working on a school project that requires me to perform manual quantization of each layer of a model. Specifically, I want to implement manually: quantized activation, combined with quantized ...
longbow • 11
0 votes
1 answer
219 views

I wanted to have a look at the example code for image quantization from here. However, it's rather old, and Python and NumPy have changed since then. from pylab import imread,imshow,figure,show,subplot ...
Ghoul Fool • 7,047
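A stdlib-only restatement of the same idea, with the deprecated pylab-style imports gone: snap each 0–255 pixel to the nearest of k evenly spaced levels. A toy 2×3 "image" (list of rows) stands in for real data:

```python
# Image level-quantization sketch: reduce a grayscale image to k evenly
# spaced levels using only the standard library.

def quantize_levels(image, k):
    """Snap each 0-255 pixel to the nearest of k evenly spaced levels."""
    step = 255.0 / (k - 1)
    return [[int(round(round(p / step) * step)) for p in row] for row in image]

image = [[0, 10, 120], [130, 250, 255]]
print(quantize_levels(image, 4))  # [[0, 0, 85], [170, 255, 255]]
```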
0 votes
0 answers
102 views

I am in the process of quantizing a model to int8 in order to make it run on the Coral Edge TPU. In order to do that I am using the TFLite converter. My code looks like this: class ...
Kilian Tiziano Le Creurer
2 votes
1 answer
3k views

I wanted to use the 'Salesforce/SFR-Embedding-Mistral' embedding model, but it is too large for the GPU partition I have access to. Therefore, I considered quantizing the model, but I couldn't find a ...
Firevince
2 votes
1 answer
724 views

I was reading about quantization (specifically about int8) and trying to figure out if there is a method to avoid dequantizing and requantizing the output of a node before feeding it to the next one. ...
Andrea Tedeschi
1 vote
0 answers
833 views

I'm trying to run Llama 2 locally on my Windows PC. This is my code here: import torch import transformers model_id = 'meta-llama/Llama-2-7b-chat-hf' device = f'cuda:{torch.cuda.current_device()}' ...
Scaevola
2 votes
0 answers
237 views

I know that quantization uses int8 to reduce memory usage. But when I print the weights, they are float16, so how does quantization help accelerate? Do they convert float to int only when doing ...
ada • 21
1 vote
1 answer
76 views

I have some Qt-based software that graphs an audio-transform function in 2D (with frequency-in-Hz as the X axis, and decibels-gain on the Y axis). It does this by choosing a set of X positions to ...
Jeremy Friesner
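The sampling step such a widget performs can be sketched as picking log-spaced X positions, so each octave (or decade) of frequency gets equal screen width. The endpoints and point count below are arbitrary:

```python
# Log-spaced frequency sampling sketch for a Hz-axis graph: equal ratios
# between neighbouring sample points rather than equal differences.

def log_spaced(f_min, f_max, n):
    """n frequencies from f_min to f_max with a constant ratio between them."""
    ratio = (f_max / f_min) ** (1.0 / (n - 1))
    return [f_min * ratio ** i for i in range(n)]

freqs = log_spaced(20.0, 20000.0, 4)
# Each step multiplies by the same ratio (10x here): 20, 200, 2000, 20000
```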
