• Volume/Page
  • Keyword
  • DOI
  • Citation
  • Advanced
   
 
 
 

Med. Phys. 38, 6775 (2011); http://dx.doi.org/10.1118/1.3661998 (12 pages)

Fully 3D list-mode time-of-flight PET image reconstruction on GPUs using CUDA

Jing-yu Cui

Department of Electrical Engineering, Stanford University, Stanford, California 94305

Guillem Pratx

Department of Radiation Oncology, Stanford University, Stanford, California 94305

Sven Prevrhal

Philips Healthcare, San Jose, California 95134

Craig S. Levin

Departments of Radiology, Electrical Engineering, Physics, and Molecular Imaging Program at Stanford (MIPS), Stanford University, Stanford, California 94305

View MapView Map

(Received 31 May 2011; accepted 22 October 2011; revised 28 September 2011; published online 1 December 2011)

Full Text: Read Online (HTML) | Download PDF FREE | View Cart
Purpose: List-mode processing is an efficient way of dealing with the sparse nature of positron emission tomography (PET) data sets and is the processing method of choice for time-of-flight (ToF) PET image reconstruction. However, the massive amount of computation involved in forward projection and backprojection limits the application of list-mode reconstruction in practice, and makes it challenging to incorporate accurate system modeling.
Methods: The authors present a novel formulation for computing line projection operations on graphics processing units (GPUs) using the compute unified device architecture (CUDA) framework, and apply the formulation to list-mode ordered-subsets expectation maximization (OSEM) image reconstruction. Our method overcomes well-known GPU challenges such as divergence of compute threads, limited bandwidth of global memory, and limited size of shared memory, while exploiting GPU capabilities such as fast access to shared memory and efficient linear interpolation of texture memory. Execution time comparison and image quality analysis of the GPU-CUDA method and the central processing unit (CPU) method are performed on several data sets acquired on a preclinical scanner and a clinical ToF scanner.
Results: When applied to line projection operations for non-ToF list-mode PET, this new GPU-CUDA method is >200 times faster than a single-threaded reference CPU implementation. For ToF reconstruction, we exploit a ToF-specific optimization to improve the efficiency of our parallel processing method, resulting in GPU reconstruction >300 times faster than the CPU counterpart. For a typical whole-body scan with 75 × 75 × 26 image matrix, 40.7 million LORs, 33 subsets, and 3 iterations, the overall processing time is 7.7 s for GPU and 42 min for a single-threaded CPU. Image quality and accuracy are preserved for multiple imaging configurations and reconstruction parameters, with normalized root mean squared (RMS) deviation less than 1% between CPU and GPU-generated images for all cases.
Conclusions: A list-mode ToF OSEM library was developed on the GPU-CUDA platform. Our studies show that the GPU reformulation is considerably faster than a single-threaded reference CPU method especially for ToF processing, while producing virtually identical images. This new method can be easily adapted to enable more advanced algorithms for high resolution PET reconstruction based on additional information such as depth of interaction (DoI), photon energy, and point spread functions (PSFs).

© 2011 American Association of Physicists in Medicine

ACKNOWLEDGMENTS

The authors would like to thank Peter Olcott and Garry Chinn at Stanford University for helpful discussions and comments, and NVIDIA for providing a GTX 285 GPU. This work was sponsored by a research grant from Philips Healthcare.

Article Outline

  1. INTRODUCTION
  2. MATERIALS AND METHODS
    1. Image reconstruction
    2. Reformulation and implementation for the CUDA platform
      1. Global memory layout
      2. Caching with shared memory
      3. Line partitioning
      4. Projection kernel calculation
      5. Forward projection
      6. Backprojection
      7. Time-of-flight projections
      8. Multiplicative update
      9. Fast math
      10. Loop unrolling
      11. Bank conflicts
    3. Evaluation
      1. CPU and GPU hardware
      2. Processing time analysis
      3. Imaging experiments and reconstruction parameters
  3. RESULTS
    1. Processing time
    2. Image reconstruction accuracy
  4. DISCUSSION
  5. CONCLUSION

  1. S. Surti, S. Karp, L. M. Popescu, E. Daube-Witherspoon, and M. Werner, “Investigation of time-of-flight benefit for fully 3D PET,” IEEE Trans. Med. Imaging 25, 529–538 (2006). [MEDLINE]
  2. J. S. Karp, S. Surti, Margaret E. Daube-Witherspoon, and G. Muehllehner, “Benefit of time-of-flight in PET: Experimental and clinical results,” J. Nucl. Med. 49, 462–470 (2008). [MEDLINE]
  3. S. Matej, S. Surti, S. Jayanthi, M. E. Daube-Witherspoon, R. M. Lewitt, and J. S. Karp, “Efficient 3D TOF PET reconstruction using view-grouped histo-images: DIRECT-direct image reconstruction for TOF,” IEEE Trans. Med. Imaging 28, 739–751 (2009).
  4. M. Watanabe, H. Uchida, H. Okada, K. Shimizu, N. Satoh, E. Yoshikawa, T. Ohmura, T. Yamashita, and E. Tanaka, “A high resolution PET for animal studies,” IEEE Trans. Med. Imaging 11, 577–580 (1992).
  5. S. R. Cherry, Y. Shao, R. W. Silverman, K. Meadors, S. Siegel, A. Chatziioannou, J. W. Young, W. Jones, J. C. Moyers, D. Newport, A. Boutefnouchet, T. H. Farquhar, M. Andreaco, M. J. Paulus, D. M. Binkley, R. Nutt, and M. E. Phelps, “MicroPET: A high resolution PET scanner for imaging small animals,” IEEE Trans. Nucl. Sci. 44, 1161–1166 (1997). [Inspec] [ISI]
  6. J. -Y. Peng, J. A. D. Aston, R. N. Gunn, C.-Y. Liou, and J. Ashburner, “Dynamic positron emission tomography data-driven analysis using sparse bayesian learning,” IEEE Trans. Med. Imaging 27, 1356–1369 (2008).
  7. L. A. Shepp and Y. Vardi, “Maximum likelihood reconstruction for emission tomography,” IEEE Trans. Med. Imaging 1, 113–122 (1982). [Inspec] [MEDLINE]
  8. H. M. Hudson and R. S. Larkin, “Accelerated image reconstruction using ordered subsets of projection data,” IEEE Trans. Med. Imaging 13, 601–609 (1994). [Inspec] [ISI] [MEDLINE]
  9. A. J. Reader, K. Erlandsson, M. A. Flower, and R. J. Ott, “Fast accurate iterative reconstruction for low-statistics positron volume imaging,” Phys. Med. Biol. 43, 835–846 (1998). [MEDLINE]
  10. L. Parra and H. H. Barrett, “List-mode likelihood: EM algorithm and image quality estimation demonstrated on 2-D PET,” IEEE Trans. Med. Imaging 17, 228–235 (1998). [Inspec] [MEDLINE]
  11. L. M. Popescu, S. Matej, and R. M. Lewitt, “Iterative image reconstruction using geometrically ordered subsets with list-mode data,” Nuclear Science Symposium Conference Record, Vol. 6 (IEEE, Rome, Italy, 2004), pp. 3536–3540.
  12. A. Rahmim, M. Lenox, A. J. Reader, C. Michel, Z. Burbar, T. J. Ruth, and V. Sossi, “Statistical list-mode image reconstruction for the high resolution research tomograph,” Phys. Med. Biol. 49, 4239 (2004). [ISI] [MEDLINE]
  13. G. Pratx, P. D. Olcott, G. Chinn, and C. S. Levin, “Fast, accurate and shift-varying line projections for iterative reconstruction using the GPU,” IEEE Trans. Med. Imaging 28, 435–445 (2009). [Inspec] [MEDLINE]
  14. I. K. Hong, S. T. Chung, H. K. Kim, Y. B. Kim, Y. D. Son, and Z. H. Cho, “Ultra fast symmetry and SIMD-based projection-backprojection (SSP) algorithm for 3D PET image reconstruction,” IEEE Trans. Med. Imaging 26, 789–803 (2007).
  15. F. Xu and K. Mueller, “Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware,” IEEE Trans. Nucl. Sci. 52, 654–663 (2005).
  16. F. Xu and K. Mueller, “Real-time 3D computed tomographic reconstruction using commodity graphics hardware,” Phys. Med. Biol. 52, 3405 (2007).
  17. H. G. Hofmann, B. Keck, C. Rohkohl, and J. Hornegger, “Comparing performance of many-core CPUs and GPUs for static and motion compensated reconstruction of C-arm CT data,” Med. Phys. 38, 468–473 (2011)MPHYA6000038000001000468000001.
  18. F. Xu, A. Khamene, and O. Fluck, “High performance tomosynthesis enabled via a GPU-based iterative reconstruction framework,” Proc. SPIE 7258, pp. 72585A (2009)PSISDG00725800000172585F000001.
  19. R. Jacques, J. Wong, R. Taylor, and T. McNutt, “Real-time dose computation: GPU-accelerated source modeling and superposition/convolution,” Med. Phys. 38, 294–305 (2011)MPHYA6000038000001000294000001.
  20. S. Hissoiny, B. Ozell, H. Bouchard, and P. Despres, “GPUMCD: A new GPU-oriented Monte Carlo dose calculation platform,” Med. Phys. 38, 754–764 (2011)MPHYA6000038000002000754000001.
  21. G. Pratx and L. Xing, “GPU computing in medical physics: A review,” Med. Phys. 38, 2685–2697 (2011)MPHYA6000038000005002685000001.
  22. Z. Hu, W. Wang, E. E. Gualtieri, M. J. Parma, E. S. Walsh, D. Sebok, Y. L. Hsieh, C. H. Tung, J. J. Griesmer, J. A. Kolthammer, L. M. Popescu, M. Werner, J. S. Karp, A. Bucur, J. van Leeuwen, and D. Gagnon, “Dynamic load balancing on distributed listmode time-of-flight image reconstruction,” IEEE Nuclear Science Symposium Conference Record (IEEE, San Diego, California, 2006), pp. 3392–3396.
  23. NVIDIA, NVIDIA CUDA Programming Guide 3.0 (2010).
  24. E. Veklerov, J. Llacer, and E. J. Hoffman, “MLE reconstruction of a brain phantom using a Monte Carlo transition matrix and a statistical stopping rule,” IEEE Trans. Nucl. Sci. 35, 603–607 (1988). [Inspec]
  25. R. H. Huesman, G. J. Klein, W. W. Moses, J. Qi, B. W. Reutter, and P. R. G. Virador, “List-mode maximum-likelihood reconstruction applied to positron emission mammography (PEM) with irregular sampling,” IEEE Trans. Med. Imaging 19, 532–537 (2000). [Inspec] [MEDLINE]
  26. C. A. Johnson, Y. Yan, R. E. Carson, R. L. Martino, and M. E. Daube-Witherspoon, “A system for the 3D reconstruction of retracted-septa PET data using the EM algorithm,” Nuclear Science Symposium and Medical Imaging Conference Record, Vol. 3 (IEEE, Norfolk, Virginia, 1994), pp. 1325–1329.
  27. J. Qi, R. M. Leahy, S. R Cherry, A. Chatziioannou, and T. H. Farquhar, “High-resolution 3D bayesian image reconstruction using the microPET small-animal scanner,” Phys. Med. Biol. 43, 1001 (1998). [ISI] [MEDLINE]
  28. L. Zhang, S. Staelens, R. V. Holen, J. D. Beenhouwer, J. Verhaeghe, I. Kawrakow, and S. Vandenberghe, “Fast and memory-efficient Monte Carlo-based image reconstruction for whole-body PET,” Med. Phys. 37, 3667–3676 (2010)MPHYA6000037000007003667000001. [MEDLINE]
  29. R. L. Siddon, “Fast calculation of the exact radiological path for a three-dimensional CT array,” Med. Phys. 12, 252–255 (1985).
  30. P. M. Joseph, “An improved algorithm for reprojecting rays through pixel images,” IEEE Trans. Med. Imaging 1, 192–196 (1982). [Inspec] [MEDLINE]
  31. G. Pratx and C. Levin, “Online detector response calculations for high-resolution PET image reconstruction,” Phys. Med. Biol. 56, 4023 (2011).
  32. S. Matej and R. M. Lewitt, “Practical considerations for 3D image reconstruction using spherically symmetric volume elements,” IEEE Trans. Med. Imaging 15, 68–78 (1996). [Inspec] [ISI] [MEDLINE]
  33. P. Aguiar, M. Rafecas, J. E. Ortuño, G. Kontaxakis, A. Santos, J. Pavía, and D. Ros, “Geometrical and Monte Carlo projectors in 3D PET reconstruction,” Med. Phys. 37, 5691–5702 (2010)MPHYA6000037000011005691000001.
  34. M. Herlihy and N. Shavit, The Art of Multiprocessor Programming (Morgan Kaufmann, Waltham, Massachusetts, 2008).
  35. Y. Wang, J. Seidel, B. M. W. Tsui, J. J. Vaquero, and M. G. Pomper, “Performance evaluation of the GE Healthcare eXplore VISTA dual-ring small-animal PET scanner,” J. Nucl. Med. 47, 1891–1900 (2006). [Inspec] [MEDLINE]
  36. S. Surti, A. Kuhn, M. E. Werner, A. E. Perkins, J. Kolthammer, and J. S. Karp, “Performance of Philips Gemini TF PET/CT scanner with special consideration for its time-of-flight imaging capabilities,” J. Nucl. Med. 48, 471–480 (2007). [MEDLINE]
  37. A. M. Alessio, C. W. Stearns, S. Tong, S. G. Ross, S. Kohlmyer, A. Ganin, and P. E. Kinahan, “Application and evaluation of a measured spatially variant system model for PET image reconstruction,” IEEE Trans. Med. Imaging 29, 938–949 (2010). [Inspec] [MEDLINE]

Figures (9) Tables (5)

Figures (click on thumbnails to view enlargements)

FIG.1
Illustration of the caching mechanism of the proposed GPU-CUDA method.

FIG.1 Download High Resolution Image (.zip file) | Export Figure to PowerPoint

FIG.2
By slicing the image volume orthogonal to the predominant TOR direction, the area of the intersection between the TOR and the slice is bounded. Here, a y direction TOR and x-z slice are shown as an example.

FIG.2 Download High Resolution Image (.zip file) | Export Figure to PowerPoint

FIG.3
Number of collisions per voxel for backprojecting a random set of 1 million lines. White means no collision, gray means one collision, and black means two collisions.

FIG.3 Download High Resolution Image (.zip file) | Export Figure to PowerPoint

FIG.4
With the ToF information, fewer LOR-slice pairs need to be processed. Four LORs are shown intersecting the current slice. The dots on each line denote the ToF center, and the bell shapes denote the ToF kernels. Only LORs 2 and 3 contribute to the current slice significantly.

FIG.4 Download High Resolution Image (.zip file) | Export Figure to PowerPoint

FIG.5
Cumulative contributions of different optimization strategies to the overall speedup of the GPU-CUDA method compared to the CPU-based code. Simple GPU implementation refers to the method that directly maps the computation to the GPU hardware without using the subsequent optimizations listed in this figure.

FIG.5 Download High Resolution Image (.zip file) | Export Figure to PowerPoint

FIG.6
Number of randomly-generated LORs that can be processed per second, as a function of the number of thread blocks for the GPU-CUDA method. Due to hardware limitations, block size of GTX 285 cannot be set to 1024.

FIG.6 Download High Resolution Image (.zip file) | Export Figure to PowerPoint

FIG.7
Hot rod phantom (a) acquired on a preclinical PET scanner and reconstructed with 2 iterations and 40 subsets of list-mode OSEM, using (c) CPU method and (d) GPU-CUDA method. (b) The normalization map. Profiles of (c) and (d) through the centers of the hot rods, depicted in (a), are shown in (e). Contrast between the two ROIs in (a) and noise as functions of the number of iterations are shown in (f). The method for computing contrast and noise are explained in Sec. 2C3. The processing time for the GPU and the CPU is 7.0 s and 23 min, respectively.

FIG.7 Download High Resolution Image (.zip file) | Export Figure to PowerPoint

FIG.8
Mouse PET scan (maximum intensity projection), reconstructed with three iterations and five subsets of list-mode OSEM using the CPU method (a) and the GPU-CUDA method (b). The processing time for the GPU and the CPU is 8.0 s and 28 min, respectively.

FIG.8 Download High Resolution Image (.zip file) | Export Figure to PowerPoint

FIG.9
Transaxial image taken through slice 30 of the liver of the reconstructed patient data from a Philips Gemini TF PET/CT scanner. A CT image at the same slice location is shown in (e) with a soft tissue window and inverse gray scale to provide an anatomical frame of reference. The (cropped) normalization image is shown in (f). The lesion is visualized with higher contrast for the ToF data. For non-ToF, the lesion contrast for the CPU and GPU methods are 2.6 and 2.7, respectively. For ToF, the values are 3.0 and 3.1, respectively. The processing time for the GPU and the CPU is 7.7 s and 42 min, respectively.

FIG.9 Download High Resolution Image (.zip file) | Export Figure to PowerPoint

Tables

Table I. Execution time (ms) for processing varying numbers of randomly-generated LORs for the GPU-CUDA method.

View Table
Table II. Effect of using fast math for one iteration of 1 million random ToF LORs in a 75 × 75 × 26 image.

View Table
Table III. Execution time for processing 1 million random events in an image matrix of L × L × L with TOR width Tw increasing simultaneously with L.

View Table
Table IV. Execution time for processing 1 million random events in an image matrix of L × L × L with fixed TOR width 3 × 3.

View Table
Table V. Execution time for processing 1 million random LORs in an image matrix of 75 × 75 × 26 for different TOR width Tw. Tw2 is the maximum number of voxels in a TOR-slice intersection.

View Table


Close

close