GPUs are specialized for running an enormous number of small tasks in parallel, while CPUs are optimized for running a few large tasks quickly in sequence. A typical CUDA program follows four steps: 1) allocate memory on the GPU, 2) copy data from the CPU to the GPU, 3) launch kernels on the GPU, and 4) copy the results back to the CPU. Accordingly, GPU performance is measured in terms of throughput (how much work is completed per unit of time) rather than the latency of any individual task.
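
A minimal sketch of this four-step flow, using a hypothetical `square` kernel that squares each element of an array (the kernel and array size are illustrative, not from the original):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical kernel: each thread squares one element of the array.
__global__ void square(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main(void) {
    const int n = 1 << 20;              // 1M elements (illustrative size)
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // 1) Allocate memory on the GPU.
    float *d_data;
    cudaMalloc(&d_data, bytes);

    // 2) Copy data from the CPU (host) to the GPU (device).
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 3) Launch the kernel: one thread per element, many small tasks in parallel.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    square<<<blocks, threadsPerBlock>>>(d_data, n);

    // 4) Copy the results back to the CPU.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    printf("h_data[2] = %f\n", h_data[2]);  // expect 4.0

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

Compiled with `nvcc`, this launches roughly a million threads at once; the throughput framing above is why that is the natural shape for GPU work, even though any single thread is slower than its CPU counterpart.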