Miriam Leeser

Northeastern University, Electrical and Computer Engineering, Faculty Member

Papers by Miriam Leeser

Verifying a logic synthesis tool in Nuprl

Computer Aided Verification, 1993

Effect of data truncation in an implementation of pixel clustering on a custom computing machine

Reconfigurable Technology: FPGAs for Computing and Applications II, 2000

We investigate the effect of truncating the precision of hyperspectral image data for the purpose of more efficiently segmenting the image using a variant of k-means clustering. We describe the implementation of the algorithm on field-programmable gate array (FPGA) hardware. Truncating the data to only a few bits per pixel in each spectral channel permits a more compact hardware design, enabling greater parallelism, and ultimately a more rapid execution. It also enables the storage of larger images in the onboard memory. In exchange for faster clustering, however, one trades off the quality of the produced segmentation. We find, however, that the clustering algorithm can tolerate considerable data truncation with little degradation in cluster quality. This robustness to truncated data can be extended by computing the cluster centers to a few more bits of precision than the data. Since there are so many more pixels than centers, the more aggressive data truncation leads to significant gains in the number of pixels that can be stored in memory and processed in hardware concurrently.
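The trade-off described in this abstract can be illustrated with a minimal sketch. It is not the authors' hardware design: the Manhattan distance metric, the 8-bit channel width, and all pixel and center values below are assumptions chosen for illustration. Each pixel channel is truncated to its most significant bits, while the cluster centers keep full precision, and the resulting labels are compared with the untruncated case.

```python
def truncate(value, total_bits, kept_bits):
    """Keep only the kept_bits most significant bits of an unsigned value."""
    shift = total_bits - kept_bits
    return (value >> shift) << shift


def nearest_center(pixel, centers):
    """Return the index of the closest center (Manhattan distance, an assumption)."""
    distances = [sum(abs(p - c) for p, c in zip(pixel, center)) for center in centers]
    return distances.index(min(distances))


def assign_labels(pixels, centers, total_bits=8, kept_bits=8):
    """Label every pixel after truncating each channel to kept_bits."""
    labels = []
    for pixel in pixels:
        truncated = tuple(truncate(v, total_bits, kept_bits) for v in pixel)
        labels.append(nearest_center(truncated, centers))
    return labels


# Hypothetical 3-channel, 8-bit pixels and two full-precision cluster centers.
pixels = [(200, 180, 35), (12, 40, 220), (190, 170, 50), (25, 33, 240)]
centers = [(195, 175, 40), (20, 35, 230)]

full = assign_labels(pixels, centers, kept_bits=8)    # no truncation
coarse = assign_labels(pixels, centers, kept_bits=3)  # 3 bits per channel
print(full, coarse)
```

For this toy data the labels are identical at 8 and 3 bits per channel, which mirrors the abstract's observation that well-separated clusters tolerate considerable truncation.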

"Data truncation tests revealed minimal clustering result variation when retaining significant bits, validating preprocessing for efficiency."

Algorithmic transformations in the implementation of K-means clustering on reconfigurable hardware

Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays - FPGA '01, 2001

In mapping the k-means algorithm to FPGA hardware, we examined algorithm level transforms that dramatically increased the achievable parallelism. We apply the k-means algorithm to multi-spectral and hyper-spectral images, which have tens to hundreds of channels per pixel of data. K-means is an iterative algorithm that assigns to each pixel a label indicating which of K clusters the pixel belongs to.

Rothko: A three dimensional FPGA architecture, its fabrication, and design tools

Lecture Notes in Computer Science, 1997

We are designing and plan to fabricate a 3-dimensional field programmable gate array. The three dimensional technology, developed at Northeastern University, is based on transferred circuits with interconnections between layers of active devices. Interconnections are in metal, and can be placed anywhere on the chip. Our FPGA architecture, called Rothko, extends the Routing and Logic Block (RLB) model developed for the Triptych architecture [1]. This model is similar to a sea-of-gates model where individual cells can be used for routing or logic. We extend this to three dimensions by adding connections to each RLB from above and below. This makes our architecture truly 3-D with each logic block having connections to logic blocks on other layers. In this paper we present the architecture of a two layer RLB, discuss the 3-D technology we use, and discuss CAD tools for mapping designs onto Rothko.

From programs to transistors: Verifying hardware synthesis tools

Lecture Notes in Computer Science, 1990

We describe a project for synthesizing circuits from a high-level language description. The aims of this project are to guarantee the correctness of the resulting designs while allowing the designer flexibility in interacting with the system. In this paper we discuss two components of the project. The first starts with a state transition system and generates a specification of a datapath and an implementation of a controller as a microcode ROM. The second generates correct CMOS implementations of boolean expressions. This component produces highly optimized circuits which contain transmission gates as well as series and parallel networks of transistors. These two components are part of a larger goal: to go from programs to transistors with a flexible, yet guaranteed correct system.

Efficient FPGA implementation of QR decomposition using a systolic array architecture

Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays - FPGA '08, 2008

QR decomposition is used in many signal processing applications. We have implemented a systolic array QR decomposition on a Xilinx Virtex5 FPGA using the Givens rotation algorithm. It uses a truly two dimensional systolic array architecture so latency scales well for large matrices. To accommodate the dynamic range of input data, floating-point arithmetic is chosen, using the Northeastern University Variable Precision Floating-Point (VFloat) library. We support any general floating-point format including IEEE single precision. Our design uses straightforward floating-point divide and square root implementations, compared to prior work which uses special operations or formats such as CORDIC or the logarithmic number system (LNS). This makes our design more standard and portable to different systems, thus easier to fit into a larger design. We support square, tall and short matrices. The input matrix size can be configured at compile-time to virtually any size. Therefore, it can be easily scaled to future larger FPGA devices, or over multiple FPGAs. The QR module is fully pipelined with a throughput of over 130 MHz for IEEE single precision floating-point format. 35 GFlops throughput peak performance is achieved for a 12 by 12 matrix with this implementation.
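For readers unfamiliar with the Givens rotation algorithm named in this abstract, the following is a minimal sequential software sketch. It does not model the systolic array, the pipelining, or the VFloat library; the function name `givens_qr` and the example matrix are illustrative assumptions. Each rotation needs only a square root (here via `math.hypot`) and two divisions, which is the kind of arithmetic the abstract says the hardware design relies on.

```python
import math


def givens_qr(a):
    """Return (q, r) with a = q * r for a list-of-lists matrix, using Givens rotations."""
    m, n = len(a), len(a[0])
    r = [[float(v) for v in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for col in range(n):
        # Work upward from the bottom row, zeroing r[row][col] each time.
        for row in range(m - 1, col, -1):
            x, y = r[row - 1][col], r[row][col]
            norm = math.hypot(x, y)  # square root
            if norm == 0.0:
                continue
            c, s = x / norm, y / norm  # two divisions
            # Apply the rotation to rows (row - 1, row) of r.
            for k in range(n):
                upper, lower = r[row - 1][k], r[row][k]
                r[row - 1][k] = c * upper + s * lower
                r[row][k] = -s * upper + c * lower
            # Accumulate the transposed rotation into columns (row - 1, row) of q.
            for k in range(m):
                left, right = q[k][row - 1], q[k][row]
                q[k][row - 1] = c * left + s * right
                q[k][row] = -s * left + c * right
    return q, r


# Hypothetical 3 x 3 input; tall and short matrices work the same way.
matrix = [[6.0, 5.0, 0.0], [5.0, 1.0, 4.0], [0.0, 4.0, 3.0]]
q, r = givens_qr(matrix)
product = [[sum(q[i][k] * r[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
```

Here `r` is upper triangular up to rounding error and `product` reproduces `matrix`. In a systolic array the rotations that touch disjoint row pairs can proceed concurrently, which this sequential loop does not attempt to show.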

Implementing a Highly Parameterized Digital PIV System on Reconfigurable Hardware

2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2009

Parameterization of circuits is increasingly in demand. It opens the door both for investigating the right parameters for different application domains and reuse of components with the specific parameters in building new custom hardware. This can significantly reduce the time to market. In this work, we investigate the parameterization of particle image velocimetry (PIV), a technique that is used in many engineering domains.

"Implemented PIV system shows an average of 50x speedup compared to a 3 GHz PC software implementation, attributed to parallelism and pipelining."

Accelerating phase unwrapping and affine transformations for optical quadrature microscopy using CUDA

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units - GPGPU-2, 2009

Optical Quadrature Microscopy (OQM) is a process which uses phase data to capture information about the sample being studied. OQM is part of an imaging framework developed by the Optical Science Laboratory at Northeastern University. In one particular application of interest, the framework is used to extract phase information from the image of an embryo to determine embryo viability.

A Methodology for Reusable Hardware Proofs

Higher Order Logic Theorem Proving and its Applications, 1993

Verifying a logic synthesis tool in Nuprl: A case study in software verification

Lecture Notes in Computer Science, 1993

We have proved a logic synthesis tool with the Nuprl proof development system. The logic synthesis tool, Pbs, implements the weak division algorithm, and is part of the Bedroc hardware synthesis system. Our goal was to develop a proven and usable implementation of a hardware synthesis tool. Pbs consists of approximately 1000 lines of code implemented in a functional subset of Standard ML. The program was verified by embedding this subset of SML in Nuprl and then verifying the correctness of the implementation of Pbs in Nuprl. In the process of doing the proof we learned many lessons which can be applied to efforts in verifying functional software. In particular, we were able to safely perform several optimizations to the program. In addition, we have invested effort into verifying software which will be used many times, rather than verifying the output of that software each time the program is used. The work required to verify hardware design tools and other similar software is worthwhile because the results of the proofs will be used many times.

"Pbs's proof ensures output circuits are functionally equivalent to inputs and meet minimality, enhancing fault testability."

Reasoning about pipelines with structural hazards

Lecture Notes in Computer Science, 1995

We have developed a formal definition of correctness for pipelines that ensures that transactions terminate and satisfy a functional specification. This definition separates the correctness criteria associated with the pipelining aspects of a design from the functional relationship between input and output transactions. Using this definition, we developed and formally verified a technique that divides the verification of a pipeline...

Automatic Sliding Window Operation Optimization for FPGA-Based

2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2006

FPGA-based computing boards are frequently used as hardware accelerators for image processing algorithms based on sliding window operations (SWOs). SWOs are both computationally intensive and data intensive and benefit from hardware acceleration with FPGAs, especially for delay sensitive applications. The current design process requires that, for each specific application using SWOs with different size of window, image, etc.; a detail...

Design issues for hardware implementation of an algorithm for segmenting hyperspectral imagery

Imaging Spectrometry VI, 2000

Modern hyperspectral imagers can produce data cubes with hundreds of spectral channels and millions of pixels. One way to cope with this massive volume is to organize the data so that pixels with similar spectral content are clustered together in the same category. This provides both a compression of the data and a segmentation of the image that can be useful for other image processing tasks downstream. The classic approach for segmentation of multidimensional data is the k-means algorithm; this is an iterative method that produces successively better segmentations. It is a simple algorithm, but the computational expense can be considerable, particularly for clustering large hyperspectral images into many categories. The ASAPP (Accelerating Segmentation And Pixel Purity) project aims to relieve this processing bottleneck by putting the k-means algorithm into field-programmable gate array (FPGA) hardware. The standard software implementation of k-means uses floating-point arithmetic and...

CUDA and OpenCL implementations of 3D CT reconstruction for biomedical imaging

by S. Mukherjee and Miriam Leeser

2012 IEEE Conference on High Performance Extreme Computing, 2012

"Proposed implementation of Feldkamp CT aims to enhance processing speed using GPU compatibility and parallelization."

A Library of Parameterized Floating-Point Modules and Their Use

Lecture Notes in Computer Science, 2002

We present a parameterized floating-point library for use with reconfigurable hardware. Our format is both general and flexible. All IEEE formats are a subset of our format, as are all previously published floating-point formats for reconfigurable hardware. We have developed a library of fully parameterized hardware modules for format control, arithmetic operations and conversion to and from any fixed-point format. The format converters allow for hybrid implementations that combine both fixed and floating-point calculations. This permits the designer to choose between the increased range of floating-point and the increased precision of fixed-point within the same application. We illustrate the use of this library with a hybrid implementation of the K-means clustering algorithm applied to multispectral satellite images.

Enabling a RealTime Solution for Neuron Detection with Reconfigurable Hardware (abstract only)

Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays - FPGA '05, 2005

FPGAs provide a speed advantage in processing for embedded systems, especially when processing is moved close to the sensors. Perhaps the ultimate embedded system is a neural prosthetic, where probes are inserted into the brain and recorded electrical activity is analyzed to determine which neurons have fired. In turn, this information can be used to manipulate an external device such as a robot arm or a computer mouse.

"Transitioning the EM algorithm from software to hardware enables real-time processing, significantly enhancing user command training speed."

Real-Time Particle Image Velocimetry for Feedback Loops Using FPGA Implementation

Journal of Aerospace Computing, Information, and Communication, 2006

Digital Particle Image Velocimetry (PIV) is well established as a fluid dynamics measurement tool, being capable of non-intrusively and concurrently measuring a distributed velocity field. Yet the intensive computational requirements of PIV limit its usage almost exclusively to off-line processing, analysis and modelling. This paper proposes hardware implementation of the cross-correlation algorithm as a means to make real-time PIV available for closed-loop control. This paper introduces a real-time PIV system which exploits the low-level parallelism of the cross-correlation computation by implementing it with reconfigurable hardware. The system processes 15 complete image pairs per second, which is more than 70 times speedup over a sequential software implementation. Moreover, our hardware structure can be easily expanded to a more parallel design for faster processing given sufficient hardware resources. This design can be reused with only minor modifications for different image sizes and interrogation areas.
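The cross-correlation step at the heart of PIV can be sketched in a few lines of sequential software. This is an illustrative assumption of how one interrogation area might be processed, not the paper's hardware structure: the helper names, the 6 x 6 window size, the search range, and the particle positions are all hypothetical. The sketch slides one window over the other, sums the products of overlapping intensities, and takes the shift with the highest score as the displacement.

```python
def make_window(size, particles):
    """Build a size x size intensity grid from {(row, col): intensity}."""
    window = [[0] * size for _ in range(size)]
    for (row, col), intensity in particles.items():
        window[row][col] = intensity
    return window


def cross_correlate(window_a, window_b, max_shift):
    """Return {(dy, dx): score} for every shift within max_shift pixels."""
    rows, cols = len(window_a), len(window_a[0])
    scores = {}
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            total = 0
            for y in range(rows):
                for x in range(cols):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < rows and 0 <= xx < cols:
                        total += window_a[y][x] * window_b[yy][xx]
            scores[(dy, dx)] = total
    return scores


def displacement(window_a, window_b, max_shift=2):
    """Return the (dy, dx) shift with the highest correlation score."""
    scores = cross_correlate(window_a, window_b, max_shift)
    return max(scores, key=scores.get)


# Two hypothetical particles that move one row down and two columns right.
first = make_window(6, {(2, 2): 9, (3, 1): 5})
second = make_window(6, {(3, 4): 9, (4, 3): 5})
shift = displacement(first, second, max_shift=2)
print(shift)
```

Every candidate shift is scored independently of the others, so the inner multiply-accumulate loops are natural candidates for the kind of low-level parallelism the abstract describes.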

Accelerating protein coordinate conversion using GPUs

by Miriam Leeser and Mahsa Bayati

2014 IEEE High Performance Extreme Computing Conference (HPEC), 2014

For modeling proteins in conformational states, two methods of representation are used: internal coordinates and Cartesian coordinates. Each of these representations contains a large amount of structural and simulation information. Different processing steps require one or the other representation. Our goal is to rapidly translate between these coordinate spaces so that a scientist can choose whichever method he or she would like independent of the coordinate representation required. An algorithm to convert Cartesian to internal coordinates is implemented by taking a protein structure file and the trajectories of the protein's atoms within a time frame. The implementation then computes bond distances, bond angles and torsion angles of the atoms. This is implemented on two types of hardware: CPU and a heterogeneous system combining CPU and GPU. The CPU sequential codes in MATLAB and C are compared with MATLAB Parallel Computing Toolbox, OpenMP, and GPU versions in CUDA-C and CUDA-MATLAB. The performance is evaluated on two different protein structure files and their trajectories. Our results show that this computation is well suited to the parallelism offered in modern Graphics Processing Units. We see many orders of magnitude improvement in speed over the original MATLAB code and have brought the computation time from over an hour down to tens of milliseconds.
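The three internal coordinates mentioned in this abstract follow from standard vector geometry. The sketch below is a plain sequential illustration, not the authors' MATLAB, C, or CUDA code; the function names, the atom positions, and the choice to report angles in degrees are assumptions. The torsion angle uses the common atan2 formulation over the normals of the two planes defined by four consecutive atoms.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def bond_distance(a, b):
    """Distance between atoms a and b."""
    return norm(sub(a, b))


def bond_angle(a, b, c):
    """Angle at atom b, in degrees, between the bonds b-a and b-c."""
    u, v = sub(a, b), sub(c, b)
    cosine = dot(u, v) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def torsion_angle(a, b, c, d):
    """Signed torsion about the b-c bond, in degrees, for atoms a-b-c-d."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = dot(b2, cross(n1, n2)) / norm(b2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Four hypothetical atom positions forming a right-angled chain.
a, b, c, d = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)
distance = bond_distance(a, b)
angle = bond_angle(a, b, c)
torsion = torsion_angle(a, b, c, d)
```

Each bond, angle, and torsion depends only on a handful of atom positions in a single trajectory frame, so the calculations are independent of one another, which is consistent with the abstract's finding that the conversion maps well onto GPU parallelism.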

Toward a super duper hardware tactic

Lecture Notes in Computer Science, 1994

We present techniques for automating many of the tedious aspects of hardware verification in a higher order logic theorem proving environment. We employ two complementary approaches. The first involves intelligent tactics which incorporate many of the smaller steps currently applied by the user. The second uses hardware combinators to partially automate inductive proofs for iterated hardware structures. We envision a system that captures most of this reasoning in one tactic, SuperDuperHWTac. Ideally, users would use this tactic on a goal for proving that a hardware component meets its specification, and get back a proof documented at a level they would have written by hand. This paper presents preliminary work toward SuperDuperHWTac in both the HOL and Nuprl proof development systems.

"Initial tactics developed show progress toward achieving the ideal tactic, SuperDuperHWTac."

Heterogeneous tasks and conduits framework for rapid application portability and deployment

2012 Innovative Parallel Computing (InPar), 2012

Emerging heterogeneous and homogeneous processing architectures demonstrate significant increases in throughput for scientific applications over traditional single core processors. These processing architectures vary widely in their processing capabilities, memory hierarchies, and programming models. Determining the system architecture best suited to an application or deploying an application that is portable across a number of different platforms is increasingly complex and error prone within this rapidly increasing and evolving design space. Quickly and easily designing portable, high-performance applications that can function and maintain their correctness properly across these widely varied systems has become paramount. To deal with these programming challenges, there is a great need for new models and tools to be developed. One example is MIT Lincoln Laboratory's Parallel Vector Tile Optimizing Library (PVTOL) which simplifies the task of developing software in C++ for these complex systems. This work extends the Tasks and Conduits framework in PVTOL to support GPU architectures and other heterogeneous platforms supported by the NVIDIA CUDA and OpenCL programming models. This allows the rapid portability of applications to a very wide range of architectures and clusters. Using this framework, porting applications from a single CPU core to a GPU requires a change of only 5 source lines of code (SLOC) in addition to the CUDA or OpenCL kernel. Using GPU-PVTOL we have achieved 22x speedup in an application of Monte Carlo simulations of photon propagation through a biological medium, and a 60x speedup of a 3D cone beam computed tomography (CT) image reconstruction algorithm.
