0 votes
1 answer
202 views

GCC offers a 16-bit floating point type, outside of the C language standard: _Float16 - at least for x86_64. This allowance is described here. However - the GCC documentation does not seem to indicate ...
einpoklum
  • 138k
0 votes
1 answer
70 views

I have a C project configured with CMake. Some program within this project uses _Float16 (a "half-precision" type). I know how to determine, within the code, whether _Float16 is available: ...
einpoklum
  • 138k
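The excerpt cuts off before showing the in-code check; for context, a common form of it relies on the __FLT16_* macros that GCC and Clang predefine only on targets where _Float16 is usable (an assumption about the asker's setup, not something stated in the question):

```c
/* Minimal sketch: compile-time detection of _Float16 via the __FLT16_*
 * predefined macros (assumed to exist only where the type is supported). */
#if defined(__FLT16_MANT_DIG__)
typedef _Float16 half_t;
#define HAVE_FLOAT16 1
#else
typedef float half_t;   /* fall back to single precision */
#define HAVE_FLOAT16 0
#endif
```

A CMake-side test would typically wrap a snippet like this in check_c_source_compiles, which appears to be what the question itself is after.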
4 votes
2 answers
223 views

I have a _Float16 half-precision variable named x in my C program, and would like to printf() it. Now, I can write: printf("%f", (double) x);, and this will work; but - can I printf x ...
einpoklum
  • 138k
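A minimal compilable illustration of the cast-to-double approach mentioned in the excerpt (a sketch, assuming a GCC/Clang target that provides _Float16; C17's printf has no conversion specifier or length modifier for the type):

```c
#include <stdio.h>

int main(void)
{
    _Float16 x = (_Float16)0.5f;
    /* Convert explicitly so printf receives a plain double. */
    printf("x = %f\n", (double)x);
    return 0;
}
```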
6 votes
1 answer
570 views

I'm developing a library which uses _Float16s for many of the constants to save space when passing them around. However, in my testing, it seems that telling GCC to just "set it to 1" isn't ...
Coarse Rosinflower
1 vote
0 answers
62 views

I'm working on implementing a mathematical approach to bit flipping in IEEE 754 FP16 floating-point numbers without using direct bit manipulation. The goal is to flip a specific bit (particularly in ...
Muhammad Zaky
2 votes
0 answers
149 views

I'm concerned with mixed precision in deep-learning LLMs. The intermediates are mostly F32, and the weights could be any other type like BF16, F16, or even a quantized type such as Q8_0 or Q4_0. It would be very useful if ...
dentry
  • 21
1 vote
1 answer
582 views

Is it safe to assume that all machines on which AVX2 is supported also support F16C instructions? I haven't yet encountered a machine that doesn't. Thanks
Srihari S
  • 107
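Rather than relying on the implication, a runtime check of both bits is cheap. A sketch for GCC/Clang on x86 (F16C is CPUID leaf 1, ECX bit 29; AVX2 is CPUID leaf 7/subleaf 0, EBX bit 5; the OSXSAVE/XGETBV check for OS-enabled AVX state is omitted for brevity):

```c
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    int has_f16c = 0, has_avx2 = 0;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        has_f16c = (ecx >> 29) & 1;               /* CPUID.01H:ECX.F16C */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        has_avx2 = (ebx >> 5) & 1;                /* CPUID.07H:EBX.AVX2 */

    printf("AVX2: %d  F16C: %d\n", has_avx2, has_f16c);
    return 0;
}
```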
2 votes
1 answer
108 views

I am implementing emulation of ARM float16_t for x64 using SSE; the idea is to have bit-exact values on both platforms. I have mostly finished the implementation, except for one thing: I cannot correctly ...
Bogi
  • 2,718
0 votes
1 answer
67 views

Hi everyone. I've been learning about floating-point truncation errors recently, but I found that print(np.half(500.2)) and print(f"{np.half(500.2)}") yield different results. Here are the logs I got in ...
Cestimium
-2 votes
1 answer
667 views

I read on https://github.com/huggingface/smollm/tree/main/smol_tools (mirror 1): All models are quantized to 16-bit floating-point (F16) for efficient inference. Training was done on BF16, but in our ...
Franck Dernoncourt
3 votes
2 answers
791 views

I'm the developer of aerobus and I'm facing difficulties with half-precision arithmetic. At some point in the library, I need to convert an IntType to the related FloatType (same bit count) in a constexpr ...
Regis Portalez
0 votes
1 answer
137 views

Example: # pip install transformers from transformers import AutoModelForTokenClassification, AutoTokenizer # Load model model_path = 'huawei-noah/TinyBERT_General_4L_312D' model = ...
Franck Dernoncourt
-1 votes
1 answer
3k views

I load a huggingface-transformers float32 model, cast it to float16, and save it. How can I load it as float16? Example: # pip install transformers from transformers import ...
Franck Dernoncourt
0 votes
1 answer
777 views

I train a Huggingface model with fp16=True, e.g.: training_args = TrainingArguments( output_dir="./results", evaluation_strategy="epoch", learning_rate=4e-5, ...
Franck Dernoncourt
6 votes
1 answer
1k views

On CPUs with AVX-512 and BF16 support, you can use the 512-bit vector registers to store 32 16-bit floats. I have found intrinsics to convert FP32 values to BF16 values (for example: ...
Thijs Steel
  • 1,272
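The excerpt stops before naming the missing direction, but assuming it is BF16 back to FP32, no dedicated up-conversion intrinsic is needed: bfloat16 is just the top 16 bits of a binary32, so a shift suffices. A sketch using only AVX-512F intrinsics, treating the 16 BF16 values as raw 16-bit lanes of a __m256i:

```c
#include <immintrin.h>

/* Widen 16 bfloat16 values (raw bit patterns in 16-bit lanes) to 16 floats:
   zero-extend each lane to 32 bits, then move the payload into the high half. */
static inline __m512 bf16x16_to_f32x16(__m256i raw_bf16)
{
    __m512i widened = _mm512_cvtepu16_epi32(raw_bf16);
    widened = _mm512_slli_epi32(widened, 16);
    return _mm512_castsi512_ps(widened);
}
```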
0 votes
1 answer
330 views

To date I have had no issue compiling and running complex ARM Neon assembly language routines in Xcode/Clang, and the Apple M1 supposedly supports ARMv8.4. But - when I try to use half precision with ...
user2465201
0 votes
1 answer
143 views

I would like to know if CUDA provides a concept similar to std::floating_point but including all IEEE 754 types, e.g. __half. I provide below a code sample that tests that the __half template ...
Dimitri Lesnoff
0 votes
0 answers
209 views

I know that FEAT_FP16 is supported on my ARM CPU. I expected to see fp16 in the list of features reported by cat /proc/cpuinfo: $ cat /proc/cpuinfo | grep fp | sort -u Features : fp asimd evtstrm aes ...
pmor
  • 6,775
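For context on this unanswered question: on AArch64 Linux the half-precision capabilities are reported as fphp (scalar) and asimdhp (Advanced SIMD) rather than a literal fp16 string, and the same bits can be read from the auxiliary vector. A sketch, assuming an AArch64 Linux system:

```c
#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>   /* HWCAP_FPHP, HWCAP_ASIMDHP on AArch64 Linux */

int main(void)
{
    unsigned long caps = getauxval(AT_HWCAP);
    printf("scalar FP16 (fphp):  %s\n", (caps & HWCAP_FPHP)    ? "yes" : "no");
    printf("SIMD FP16 (asimdhp): %s\n", (caps & HWCAP_ASIMDHP) ? "yes" : "no");
    return 0;
}
```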
4 votes
2 answers
540 views

_mm256_mul_ps is the Intel intrinsic for "Multiply packed single-precision (32-bit) floating-point elements". _mm256_mul_ph is the intrinsic for "Multiply packed half-precision (16-bit) ...
dmeister
  • 35.9k
1 vote
1 answer
583 views

I want to train a Yolov8 model on a custom dataset with my Mac and this is my first time working on deep learning. Unfortunately, I experienced an error, RuntimeError: "...
Figtor
  • 11
0 votes
1 answer
92 views

In an application that can write numeric values to a file using BinaryWriter, I have a class that is typed to the number type that should be used for the file. It looks like this: class ValueCollection<...
ygoe
  • 20.8k
3 votes
1 answer
436 views

How do I use arm float16 intrinsics on Android? Consider the following program: #include <arm_neon.h> int main(int, char** argv) { const float16x8_t a = vdupq_n_f16(1.0F); const ...
fabian
  • 1,881
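A hedged sketch of the usual pattern in this situation: the float16 arithmetic intrinsics are only available when the compiler targets an FP16-capable architecture (e.g. something like -march=armv8.2-a+fp16 with the NDK's Clang), and ACLE exposes that as a feature macro that code can gate on:

```c
#include <arm_neon.h>

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
/* Only compiled when the target actually supports FP16 vector arithmetic. */
float16x8_t scale_by_two(float16x8_t v)
{
    return vmulq_n_f16(v, (float16_t)2.0f);
}
#endif
```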
2 votes
2 answers
638 views

This is a variant of: How to print float value from binary file in shell? In that question, we wanted to print IEEE 754 single-precision (i.e. 32-bit) floating-point values from a binary file. Now ...
einpoklum
  • 138k
1 vote
1 answer
879 views

I have a kernel I'm running on an NVIDIA GPU, which uses the FP16 type __half, provided by cuda_fp16.hpp. To check something about its behavior, I also want to manipulate such __half values on the CPU....
einpoklum
  • 138k
0 votes
1 answer
243 views

I have some CUDA code which uses the half2 datatype. It should be just two 16 bit floating point numbers packed together in a 32 bit space. Apparently there are the methods __low2half and __high2half ...
Martin Ueding
1 vote
0 answers
270 views

I am working on an IEEE 754 16-bit adder, and I am confused at the round to nearest, ties to even logic. The first addition which confuses me is 169.8 (0x594E) + -0.06256 (0xAC01). After shifting and ...
Benjamin Owen
0 votes
0 answers
90 views

Am I correct in my assumption that reading a value from a .r16SNorm texture into the Metal Shading Language half data type always unavoidably incurs precision loss? It wasn't obvious to me from the start ...
simd
  • 2,059
5 votes
3 answers
872 views

According to https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ , each SM has three types of CUDA cores, e.g. int32 cores, fp32 cores, and fp64 cores. If the datatype is int32/fp32/fp64, I think the ...
irasin
  • 175
1 vote
1 answer
1k views

I'm trying to use bfloat16 as a format for an application for work on HPC clusters. For this I've installed g++ 13, which supposedly supports the bfloat16 format, but this hasn't been working ...
Vistemboir
1 vote
3 answers
4k views

How can I convert a float (float32) to a half (float16) and the other way around in C while accounting for edge cases like NaN, Infinity, etc.? I don't need arithmetic because I just need the types in ...
juffma
  • 189
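A sketch of the easier direction (binary16 bit pattern to float), written against plain C99 with no compiler-specific types; the float-to-half direction additionally needs round-to-nearest-even and overflow-to-infinity handling:

```c
#include <stdint.h>
#include <string.h>

/* Widen an IEEE 754 binary16 bit pattern to a binary32 float, covering
   zeros, subnormals, infinities and NaNs. */
static float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1F) {                       /* Inf or NaN: exponent all ones */
        bits = sign | (0xFFu << 23) | (mant << 13);
    } else if (exp != 0) {                   /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant == 0) {                  /* signed zero */
        bits = sign;
    } else {                                 /* subnormal: renormalize */
        int shift = 0;
        while (!(mant & 0x400u)) { mant <<= 1; ++shift; }
        mant &= 0x3FFu;                      /* drop the now-implicit leading 1 */
        bits = sign | ((uint32_t)(113 - shift) << 23) | (mant << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```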
0 votes
1 answer
269 views

I am trying to compile a simple CUDA kernel with CuPy using the half precision format provided by the cuda_fp16 header file. My kernel looks like this: code = r''' extern "C" { #include <...
Markus Holzer
0 votes
0 answers
125 views

How can I divide a 16-bit floating-point number by another 16-bit floating-point number (half-precision)? I did the sign with an XOR gate and the exponent with a 5-bit subtractor, but couldn't do the mantissa. How can I ...
Arthur
  • 1
0 votes
1 answer
2k views

Arm Architecture Reference Manual for A-profile architecture (emphasis added): FPHP, bits [27:24] 0b0011 As for 0b0010, and adds support for half-precision floating-point arithmetic. A simple ...
pmor
  • 6,775
2 votes
0 answers
102 views

In my OpenCL kernel I use 16bit floating point values of type half from the cl_khr_fp16 extension. Although this gives me code that works well, I noticed with AMD's radeon developer tools that the ...
Bram
  • 8,463
3 votes
0 answers
151 views

I'm trying to train a TensorFlow (version 2.11.0) code in float16. I checked that FP16 is supported on the RTX 3090 GPU. So, I followed the below link to train the whole code in reduced precision. ...
Sherlock
1 vote
1 answer
986 views

For example, according to https://cocktailpeanut.github.io/dalai/#/ the relevant figures for LLaMA-65B are: Full: The model takes up 432.64GB Quantized: 5.11GB * 8 = 40.88GB The full model won't fit ...
rwallace
  • 34.2k
0 votes
0 answers
26 views

I have run into an issue where the value of a tensor is 6.3982e-2 in float32. After I changed it to float16 using the half() function, it became 6.3965e-2. Is there a method to convert the tensor without ...
zhangbw
  • 13
1 vote
1 answer
2k views

I am trying to atomically add a float value to a __half in CUDA 5.2. This architecture does support the __half data type and its conversion functions, but it does not include any arithmetic and atomic ...
Skip
  • 40
0 votes
0 answers
742 views

I want to train the model with FP32 and perform inference with FP16. For other networks (ResNet) with FP16, it worked. But EDSR (super resolution) with FP16 did not work. The differences I found are ...
SIwoo Lee
0 votes
1 answer
897 views

It's clear that float16 can save bandwidth, but can float16 also save compute cycles when computing transcendental functions, like exp()?
Leonardo Physh
38 votes
2 answers
6k views

I wonder why operating on Float64 values is faster than operating on Float16: julia> rnd64 = rand(Float64, 1000); julia> rnd16 = rand(Float16, 1000); julia> @benchmark rnd64.^2 ...
Shayan
  • 6,722
0 votes
1 answer
434 views

__device__​ __half2 __h2div ( const __half2 a, const __half2 b ) Description: Divides half2 input vector a by input vector b in round-to-nearest mode. __device__​ __half2 __hmul2 ( const __half2 a, ...
Aryan
  • 442
1 vote
0 answers
296 views

I am converting from f32 to bf16 in Rust and want to control the direction of the rounding error. Is there an easy way to do this? Converting using the standard bf16::to_f32 rounds to the nearest ...
Amir
  • 898
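The question is about Rust, but the underlying trick is language-agnostic: bfloat16 is the upper 16 bits of a binary32, so the rounding behaviour is decided by how the low 16 bits are disposed of. A C sketch of truncation (round toward zero) versus round-to-nearest-even (function names are mine; NaN payloads need separate care):

```c
#include <stdint.h>
#include <string.h>

static uint16_t f32_to_bf16_trunc(float f)        /* round toward zero */
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (uint16_t)(u >> 16);
}

static uint16_t f32_to_bf16_nearest_even(float f) /* round to nearest, ties to even */
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    /* Adding 0xFFFF instead of this bias would round away from zero. */
    uint32_t round = 0x7FFFu + ((u >> 16) & 1u);
    return (uint16_t)((u + round) >> 16);
}
```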
4 votes
1 answer
197 views

I'm using half floats as implemented in the SoftFloat library (read: 100% IEEE 754 compliant), and, for the sake of completeness, I wish to provide my code with definitions equivalent to those ...
cesss
  • 913
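The excerpt truncates before naming which definitions are meant; assuming they are <float.h>-style limits, these are the IEEE 754 binary16 characteristics (the macro names here are illustrative, not from the question):

```c
#define HALF_MANT_DIG   11                       /* 10 stored bits + implicit 1 */
#define HALF_DIG        3
#define HALF_MIN_EXP    (-13)
#define HALF_MAX_EXP    16
#define HALF_MAX        65504.0f                 /* bit pattern 0x7BFF          */
#define HALF_MIN        6.103515625e-05f         /* 2^-14, smallest normal      */
#define HALF_TRUE_MIN   5.9604644775390625e-08f  /* 2^-24, smallest subnormal   */
#define HALF_EPSILON    9.765625e-04f            /* 2^-10                       */
```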
1 vote
1 answer
3k views

I'm trying to write a basic FP16-based calculator in Python to help me debug some hardware. I can't seem to find how to convert 16-bit hex values into floating-point values I can use in my code to do the ...
ajcrm125
  • 353
2 votes
2 answers
1k views

I have a simple question about the C language. I am implementing half-precision software using _Float16 in C (my Mac is ARM-based), but the running time is not much faster than single- or double-precision ...
YUNBLACK
8 votes
1 answer
2k views

It's clear why a 16-bit floating-point format has started seeing use for machine learning; it reduces the cost of storage and computation, and neural networks turn out to be surprisingly insensitive ...
rwallace
  • 34.2k
2 votes
1 answer
6k views

I'm trying to train a deep-learning model in VS Code, so I would like to use the GPU for that. I have CUDA 11.6, an NVIDIA GeForce GTX 1650, tensorflow-gpu==2.5.0 and pip version 21.2.3 on Windows 10. ...
samar
  • 33
1 vote
2 answers
2k views

I recently ran into a surprising and annoying bug in which I converted an integer into a float16 and the value changed: >>> import numpy as np >>> np.array([2049]).astype(np....
guhur
  • 2,916
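For context: binary16 carries an 11-bit significand, so above 2048 only even integers are representable, and 2049 (exactly halfway between 2048 and 2050) rounds to 2048 under round-to-nearest-even. The same effect, sketched in C with a compiler that provides _Float16:

```c
#include <stdio.h>

int main(void)
{
    _Float16 h = (_Float16)2049.0f;  /* rounds to the nearest binary16 value */
    printf("%.1f\n", (double)h);     /* prints 2048.0 */
    return 0;
}
```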
1 vote
2 answers
2k views

I have no choice but to read in 2 bytes that make up a half-float. I would like to work with this in the form of a 4-byte float. I've done some research and the only thing I can come up with is bit ...
Justin Barren
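One hedged possibility, assuming an x86 target with F16C and little-endian byte order (compile with something like -mf16c): assemble the two bytes into a binary16 bit pattern and let the hardware widen it. Otherwise, a bit-twiddling decode like the conversion sketch further up works anywhere.

```c
#include <stdint.h>
#include <immintrin.h>

static float half_bytes_to_float(const unsigned char bytes[2])
{
    /* Little-endian: bytes[0] is the low byte of the binary16 pattern. */
    uint16_t bits = (uint16_t)(bytes[0] | ((uint16_t)bytes[1] << 8));
    return _cvtsh_ss(bits);   /* F16C scalar half -> single conversion */
}
```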