-1 votes
0 answers
31 views

I'm trying to quantize a YOLOv8s model to INT8 using TensorRT on a Jetson Orin (JetPack, TensorRT 8.6.2, Ultralytics 8.2.83, CUDA 12.2). The FP16 engine works correctly but the INT8 engine produces ...
Adel Ali Taleb
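The failure mode described here often comes down to calibration. A rough sketch of symmetric per-tensor INT8 quantization (plain Python, not the TensorRT API; all values invented) shows how a poorly chosen calibration range clips activations and wrecks the output, while FP16 is unaffected:

```python
# Symmetric per-tensor INT8 quantization sketch: a bad calibration range
# (amax) clips outliers and dominates the error. Values are made up.

def quantize_int8(values, amax):
    """Map floats in [-amax, amax] to int8 with a symmetric scale."""
    scale = amax / 127.0
    return [max(-128, min(127, round(v / scale))) for v in values], scale

def dequantize(q, scale):
    return [v * scale for v in q]

activations = [0.02, -0.5, 1.7, 3.9, -2.2]

# A calibration range that covers the observed values.
q_good, s_good = quantize_int8(activations, amax=4.0)
# A range picked from unrepresentative calibration data clips outliers.
q_bad, s_bad = quantize_int8(activations, amax=0.5)

err_good = max(abs(a - b) for a, b in zip(activations, dequantize(q_good, s_good)))
err_bad = max(abs(a - b) for a, b in zip(activations, dequantize(q_bad, s_bad)))
print(err_good < err_bad)  # True: clipping, not rounding, dominates the error
```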
3 votes
0 answers
36 views

I'm doing some research on information encoding with LLMs and need to find a way to quantize the weights of the MLP layers (MoE) to 4 bits, and even custom mixed precision. Consider from ...
ShoutOutAndCalculate
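A minimal sketch of what group-wise 4-bit weight quantization looks like, with two codes packed per byte (the group size, layout, and rounding scheme are assumptions, not any particular library's format):

```python
# Group-wise 4-bit (asymmetric) weight quantization sketch: each group
# stores unsigned 4-bit codes plus a (scale, min) pair, two codes per byte.

def quantize_4bit(weights, group_size=4):
    """Quantize each group to 4-bit codes plus per-group (scale, min)."""
    packed, params = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15.0 or 1.0          # 15 = 2**4 - 1 levels
        codes = [round((w - lo) / scale) for w in group]
        params.append((scale, lo))
        # Pack two 4-bit codes into one byte (low nibble first).
        for j in range(0, len(codes), 2):
            hi_nib = codes[j + 1] if j + 1 < len(codes) else 0
            packed.append((hi_nib << 4) | codes[j])
    return packed, params

w = [0.1, -0.3, 0.25, 0.0, 1.0, 0.9, 0.95, 1.05]
packed, params = quantize_4bit(w)
print(len(packed))  # 4 bytes for 8 weights
```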
0 votes
0 answers
32 views

I got a pretrained ResNet-18 model from this lane-detection repo in order to use it as an ADAS (advanced driver assistance systems) function for an electric-car-building competition. My current goal is ...
Ekim • 3
0 votes
1 answer
125 views

I want to apply a quantization function to a deep CNN. This CNN is used for an image classification task (4 classes), and my data consists of 224×224 images. When I run this code, I get an error. ...
jasmine • 31
2 votes
0 answers
100 views

I am trying to reproduce the exact layer-wise output of a quantized EfficientNet model (TFLite model, TensorFlow 2.17) by re-implementing Conv2D, DepthwiseConv2D, FullyConnected, Add, Mul, Sub and ...
Jolverine
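Bit-exact reproduction of TFLite kernels usually hinges on the requantization step: the int32 accumulator is scaled by M = (input_scale × weight_scale) / output_scale, realized as a fixed-point multiplier plus a right shift. A simplified sketch (scales are invented; the real kernels use a saturating rounding-doubling high multiply):

```python
# TFLite-style requantization sketch: split the real multiplier into an
# int32 mantissa and a shift, then rescale the accumulator in integers.

def quantize_multiplier(m):
    """Split a real multiplier m in (0, 1) into an int mantissa and shift."""
    shift = 0
    while m < 0.5:
        m *= 2.0
        shift += 1
    return int(round(m * (1 << 31))), shift

def requantize(acc, q_mult, shift, zero_point):
    """Scale an int32 accumulator to int8 using only integer arithmetic."""
    prod = (acc * q_mult + (1 << 30)) >> 31     # high bits, round-to-nearest
    if shift:
        prod = (prod + (1 << (shift - 1))) >> shift
    return max(-128, min(127, prod + zero_point))

m = (0.05 * 0.02) / 0.1                         # = 0.01 (invented scales)
q_mult, shift = quantize_multiplier(m)
print(requantize(4000, q_mult, shift, zero_point=-1))  # 4000*0.01 + (-1) = 39
```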
0 votes
2 answers
240 views

I'm debugging a model conversion using onnx2tf and a post-training quantization issue involving Einsum, BatchMatMul, and FullyConnected layers across different model formats. Pipeline: ONNX → TF ...
Saurav Rai • 2,217
0 votes
0 answers
59 views

I'm applying QAT to a YOLOv8n model with the following configuration: QConfig( activation=FakeQuantize.with_args( observer=MovingAverageMinMaxObserver, quant_min=0, quant_max=...
Matteo • 111
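For reference, the forward pass such a QConfig describes can be sketched as a fake-quantize: round through the uint8 grid but stay in float, so gradients pass straight through during training. The scale and zero-point are hand-picked here; a real observer derives them from running min/max statistics:

```python
# Fake-quantize sketch matching quant_min=0, quant_max=255 (uint8 grid).
# The output stays float, which is what makes QAT differentiable.

def fake_quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    q = round(x / scale) + zero_point
    q = max(quant_min, min(quant_max, q))       # clamp to the uint8 grid
    return (q - zero_point) * scale             # dequantize back to float

# Hand-picked parameters for illustration only; an observer such as
# MovingAverageMinMaxObserver would track these from activations.
scale, zero_point = 4.0 / 255.0, 128
print(fake_quantize(0.5, scale, zero_point))    # close to 0.5, snapped to grid
```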
1 vote
0 answers
43 views

I am trying to quantize a model in tensorflow using tfmot. This is a sample model, inputs = keras.layers.Input(shape=(512, 512, 1)) x = keras.layers.Conv2D(3, kernel_size=1, padding='same')(inputs) x =...
Sai • 11
0 votes
1 answer
316 views

I'm trying to load the Qwen2.5-VL-7B-Instruct model from Hugging Face with 4-bit weight-only quantization using TorchAoConfig (similar to how it's mentioned in the documentation here), but I'm getting ...
Sankalp Dhupar
1 vote
0 answers
159 views

I’ve been working on fine-tuning LLaMA 2–7B using QLoRA with bitsandbytes 4-bit quantization and ran into a weird issue. I did adaptive pretraining on Arabic data with a custom tokenizer (vocab size ~...
orchid Ali
0 votes
2 answers
70 views

In my model, I use vector quantization (VQ) inside a recurrent neural network. The VQ is trained using straight-through estimation, with that particular code being identical to [1]: ...
Cola Lightyear
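The straight-through trick referenced here can be sketched without autograd: the forward output is the nearest codebook vector, expressed as z + (z_q − z) with the bracketed term treated as a constant, so the gradient with respect to z is the identity. The codebook values below are arbitrary:

```python
# VQ-VAE-style vector quantization with a straight-through estimator,
# in pure Python (no autograd). Codebook entries are made up.

def nearest_code(z, codebook):
    """Return index and vector of the closest codebook entry (L2)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(z, c))
    idx = min(range(len(codebook)), key=lambda i: dist(codebook[i]))
    return idx, codebook[idx]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
z = [0.9, 1.2]
idx, z_q = nearest_code(z, codebook)
# Straight-through: the forward value equals z_q, but written as
# z + (z_q - z) with the difference detached, d(output)/d(z) = 1.
forward = [a + (b - a) for a, b in zip(z, z_q)]
print(idx, forward)
```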
0 votes
0 answers
42 views

I'm trying to manually reproduce the inference forward pass to understand exactly how quantized inference works. To do so, I trained and quantized a model in PyTorch using QAT, manually simulated the ...
greifswald
1 vote
0 answers
112 views

I am quantizing a neural network using QAT and I want to convert it into TFLite. Quantization nodes get added to the skeleton graph and we get a new graph. I am able to load the trained QAT ...
Prateek Sharma
1 vote
0 answers
59 views

I’m trying to implement an MP3-like compression algorithm for audio and have followed the general steps, but I’m encountering a few issues with the quantization step. Here's the overall process I'm ...
Muchacho
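For context, the quantization step in MP3-style coders is nonuniform: values are companded with a 3/4 power law before rounding, so small spectral values keep relatively finer resolution. A sketch under an arbitrary step size (the real codec picks step sizes per scalefactor band via the psychoacoustic model):

```python
# MP3-style nonuniform quantizer sketch: x^(3/4) companding before uniform
# rounding, inverted with x^(4/3). The step size here is arbitrary.

def quantize(x, step):
    return int(round((abs(x) / step) ** 0.75)) * (1 if x >= 0 else -1)

def dequantize(q, step):
    return (abs(q) ** (4.0 / 3.0)) * step * (1 if q >= 0 else -1)

samples = [0.001, 0.02, 0.5, -0.3]
step = 0.01
codes = [quantize(s, step) for s in samples]
recon = [dequantize(q, step) for q in codes]
# Small values map to small codes (the tiniest rounds to 0), while large
# values are represented coarsely, which the step-size choice trades off.
print(codes)
```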
0 votes
1 answer
276 views

I'm trying to quantize the YOLOv11 model in TensorFlow and get this as a result: The target should be int8. Is this normal behaviour? When running it with TFLite Micro on an ESP32 I quickly run out of ...
gillo04 • 148
1 vote
0 answers
106 views

First of all, I want to help my mom with her embroidery projects and secondly, I want to get better in Python. So I don't need an exact solution. But it would be great to be pointed in the right ...
Ricked • 11
3 votes
1 answer
793 views

I'm encountering a RuntimeError while running a BitsAndBytes bf16 quantized Gemma-2-2b model on Hugging Face Spaces with a Gradio UI. The error specifically mentions unused kwargs and an ...
doniker99
0 votes
1 answer
720 views

I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:
File Name          Size
model.onnx         654 MB
model_fp16.onnx    327 MB
model_q4.onnx      200 MB
model_q4f16.onnx   134 MB
I understand ...
Franck Dernoncourt
1 vote
0 answers
61 views

I am trying to write a simple quantized tensor linear multiplication. Assuming the weight matrix w3 of shape (14336, 4096) and the input tensor x of shape (2, 512, 4096), where the first dim is ...
hafezmg48
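One way to sketch such a weight-only quantized linear op: int8 weights with one symmetric scale per output channel, dequantized on the fly against float activations. Tiny stand-in shapes are used here in place of the question's (14336, 4096):

```python
# Weight-only quantized linear sketch: per-output-channel symmetric int8
# weights, float activations, dequantization fused into the dot product.

def quantize_rows(w):
    """Per-output-channel symmetric int8 quantization of a weight matrix."""
    q_rows, scales = [], []
    for row in w:
        amax = max(abs(v) for v in row) or 1.0
        scale = amax / 127.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales

def linear(x, q_rows, scales):
    """y[i] = scale[i] * sum_k x[k] * q[i][k] — dequantize on the fly."""
    return [scales[i] * sum(xk * qk for xk, qk in zip(x, q_rows[i]))
            for i in range(len(q_rows))]

w = [[0.5, -0.25], [1.0, 2.0]]      # (out_features=2, in_features=2)
x = [2.0, 4.0]
q_rows, scales = quantize_rows(w)
y = linear(x, q_rows, scales)       # close to the float result [0.0, 10.0]
```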
3 votes
2 answers
3k views

I'm new to quantization and working with visual language models (VLM). I'm trying to load a 4-bit quantized version of the Ovis1.6-Gemma model from Hugging Face using the transformers library. I ...
meysam • 194
1 vote
0 answers
110 views

I converted an existing TensorFlow EfficientNet model built on TensorFlow 2.3.1 to a TFLite FP16 version to reduce its size. I want to run it on CPU and use it in my API. But while testing I ...
Harry Ali
1 vote
1 answer
1k views

We are trying to deploy a quantized Llama 3.1 70B model (from Hugging Face, using bitsandbytes). The quantizing part works fine, as we check the model memory, which is correct, and also test getting ...
Luis Leal • 3,554
1 vote
0 answers
2k views

I want to fine-tune locally the Meta's Llama 3.1 8B Instruct model with custom data and then save it in a format compatible with Ollama for further inference. As I do everything locally and don't have ...
Adrien • 13
1 vote
0 answers
162 views

RuntimeError: 'inputs.size() == 1' when setting the input tensor for an OpenVINO model with multiple inputs. I'm trying to use an OpenVINO model that was originally designed for PyTorch, and I'm running into ...
Framefact
2 votes
1 answer
4k views

I'm developing LLM agents using llama.cpp as inference engine. Sometimes I want to use models in safetensors format and there is a python script (https://github.com/ggerganov/llama.cpp/blob/master/...
arkuzo • 41
0 votes
1 answer
2k views

I'm stuck at this issue; any idea on how I can rectify it? I tried installing openbb and upgrading pydantic; however, I am unable to rectify this issue. Any suggestions would be appreciated. Thank you ...
milner pch
1 vote
0 answers
70 views

I want to do quantization-aware training. Here's my model architecture: Model: "sequential_4" _________________________________________________________________ Layer (type) ...
Vina • 27
0 votes
1 answer
133 views

We are trying to deploy vision transformer models (EfficientViT_B0, MobileViT_V2_175, and RepViT_M11) on our flutter application using the tflite_flutter_plus and tflite_flutter_plus_helper ...
D.Varam
1 vote
0 answers
148 views

I am new and want to try converting models to ONNX format, and I have the following issue. I have a model that has been quantized to 4-bit, and then I converted this model to ONNX. My quantized model ...
Toàn Nguyễn Phúc
0 votes
1 answer
158 views

Example:
# pip install transformers
from transformers import AutoModelForTokenClassification, AutoTokenizer
# Load model
model_path = 'huawei-noah/TinyBERT_General_4L_312D'
model = ...
Franck Dernoncourt
5 votes
1 answer
2k views

I am using the ONNX Python library. I am trying to quantize AI models statically using the quantize_static() function imported from onnxruntime.quantization. This function takes a ...
Zylon • 51
2 votes
0 answers
1k views

Summary I am trying to export the CIDAS/clipseg-rd16 model to ONNX using optimum-cli as given in the HuggingFace documentation. However, I get an error saying ValueError: Unrecognized configuration ...
Sattwik Kumar Sahu
3 votes
2 answers
1k views

I am currently only able to play around with a V100 on GCP. I understand that I can load an LLM in 4-bit quantization as shown below. However (assuming due to the quantization), it is taking up to 10 ...
sachinruk • 10k
0 votes
1 answer
1k views

I am trying to make a Gradio chatbot in Hugging Face Spaces using the Mistral-7B-v0.1 model. As this is a large model, I have to quantize it, otherwise the free 50 GB storage gets full. I am using bitsandbytes to ...
Anish • 13
0 votes
0 answers
68 views

I have a project that is basically to analyze the effects of quantization on orientation estimation algorithms. I have sensor data from gyroscope that looks like this when using float datatype: gx=-0....
user3662181
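The core effect being analyzed can be sketched directly: snap each float gyro sample to a fixed-point grid and measure the injected error. The ±250 deg/s full scale and 16-bit width are assumptions mirroring a typical MEMS gyro register, not the asker's sensor:

```python
# Sensor quantization sketch: float gyro sample -> signed 16-bit code.
# Full scale and bit width are assumptions for illustration.

FULL_SCALE = 250.0                  # deg/s (assumed gyro range)
LSB = FULL_SCALE / 32768.0          # step size for a signed 16-bit code

def quantize_sample(x):
    return max(-32768, min(32767, round(x / LSB)))

gx = -0.137                         # a float sample like the question's
code = quantize_sample(gx)
recon = code * LSB
error = abs(gx - recon)
print(error <= LSB / 2)  # True: uniform quantization error is bounded by LSB/2
```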
0 votes
1 answer
209 views

I'm using Keras with tensorflow-model-optimization (tfmot) for quantization-aware training (QAT). My model is based on a pre-trained backbone from keras.applications. As mentioned in the transfer ...
Никита Шубин
0 votes
1 answer
614 views

Does the GGUF format perform model quantization even though it's already quantized with LoRA? Hello! I'm new to LLMs, and I've fine-tuned the CodeLlama model on Kaggle using LoRA. I've merged and ...
Samar • 3
1 vote
0 answers
282 views

I am trying to learn about quantization, so I was playing with a GitHub repo, trying to quantize it into int8 format. I have used the following code to quantize the model. modelClass = DTLN_model() ...
Niaz Palak
2 votes
0 answers
1k views

I have been facing an issue when trying to run inference with a dynamically quantized yolov8s ONNX model on GPU. I have used yolov8s.pt and exported it to yolov8.onnx using ONNX export. Then I ...
Suraj Rao
3 votes
1 answer
3k views

Not sure if it's the right forum to ask, but: assuming I have a GPTQ model that is 4-bit, how does using from_pretrained(torch_dtype=torch.float16) work? In my understanding, 4-bit means changing the ...
aceminer • 4,375
5 votes
2 answers
6k views

I'm currently fine-tuning the Mistral 7B model and encountered the following error: ValueError: You cannot simultaneously pass the load_in_4bit or load_in_8bit arguments while also passing the ...
Jyoti yadav
0 votes
1 answer
1k views

I am using model = 'filipealmeida/Mistral-7B-Instruct-v0.1-sharded' and quantize it in 4-bit with the following function. def load_quantized_model(model_name: str): """ :param ...
Gabriele Castaldi
1 vote
2 answers
715 views

I am working on a school project that requires me to perform manual quantization of each layer of a model. Specifically, I want to implement manually: quantized activation, combined with quantized ...
longbow • 11
0 votes
1 answer
219 views

I wanted to have a look at the example code for image quantization from here. However, it's rather old, and Python and NumPy have changed since then. from pylab import imread,imshow,figure,show,subplot ...
Ghoul Fool • 7,047
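A stdlib-only restatement of the same idea, with the deprecated pylab-style imports gone: snap each 0–255 pixel to the nearest of k evenly spaced levels. A toy 2×3 "image" (list of rows) stands in for real data:

```python
# Image level-quantization sketch: reduce a grayscale image to k evenly
# spaced levels using only the standard library.

def quantize_levels(image, k):
    """Snap each 0-255 pixel to the nearest of k evenly spaced levels."""
    step = 255.0 / (k - 1)
    return [[int(round(round(p / step) * step)) for p in row] for row in image]

image = [[0, 10, 120], [130, 250, 255]]
print(quantize_levels(image, 4))  # [[0, 0, 85], [170, 255, 255]]
```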
0 votes
0 answers
102 views

I am in the process of quantizing a model to int8 in order to make it run on the Coral Edge TPU. In order to do that I am using the TFLite converter. My code looks like this: class ...
Kilian Tiziano Le Creurer
2 votes
1 answer
3k views

I wanted to use the 'Salesforce/SFR-Embedding-Mistral' embedding model, but it is too large for the GPU partition I have access to. Therefore, I considered quantizing the model, but I couldn't find a ...
Firevince
2 votes
1 answer
724 views

I was reading about quantization (specifically about int8) and trying to figure out if there is a method to avoid dequantizing and requantizing the output of a node before feeding it to the next one. ...
Andrea Tedeschi
1 vote
0 answers
833 views

I'm trying to run Llama 2 locally on my Windows PC. This is my code here: import torch import transformers model_id = 'meta-llama/Llama-2-7b-chat-hf' device = f'cuda:{torch.cuda.current_device()}' ...
Scaevola
2 votes
0 answers
237 views

I know that quantization uses int8 to reduce memory usage. But when I print the weights, they are float16, so how does quantization help accelerate? Do they convert float to int only when doing ...
ada • 21
1 vote
1 answer
76 views

I have some Qt-based software that graphs an audio-transform function in 2D (with frequency-in-Hz as the X axis, and decibels-gain on the Y axis). It does this by choosing a set of X positions to ...
Jeremy Friesner
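The sampling step such a widget performs can be sketched as picking log-spaced X positions, so each octave (or decade) of frequency gets equal screen width. The endpoints and point count below are arbitrary:

```python
# Log-spaced frequency sampling sketch for a Hz-axis graph: equal ratios
# between neighbouring sample points rather than equal differences.

def log_spaced(f_min, f_max, n):
    """n frequencies from f_min to f_max with a constant ratio between them."""
    ratio = (f_max / f_min) ** (1.0 / (n - 1))
    return [f_min * ratio ** i for i in range(n)]

freqs = log_spaced(20.0, 20000.0, 4)
# Each step multiplies by the same ratio (10x here): 20, 200, 2000, 20000
```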
