<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<item>
  <id>05793361</id>
  <dt>j</dt>
  <an>05793361</an>
  <augroup>
    <au>Soliman, Mostafa I.</au>
  </augroup>
  <ti>Exploiting ILP, TLP, and DLP to improve multi-core performance of one-sided Jacobi SVD.</ti>
  <so>Parallel Process. Lett. 19, No. 2, 355-375 (2009).</so>
  <py>2009</py>
  <pu>World Scientific, Singapore</pu>
  <lagroup>
    <la>EN</la>
  </lagroup>
  <ccgroup>
  </ccgroup>
  <utgroup>
    <ut>multi-core computing</ut>
    <ut>multi-threading techniques</ut>
    <ut>ILP</ut>
    <ut>TLP</ut>
    <ut>DLP</ut>
    <ut>SVD</ut>
    <ut>one-sided Jacobi</ut>
    <ut>block algorithms</ut>
    <ut>high-performance computing</ut>
    <ut>performance evaluation</ut>
  </utgroup>
  <cigroup>
  </cigroup>
  <ligroup>
    <li>doi:10.1142/S0129626409000262</li>
  </ligroup>
  <abgroup>
    <ab>Summary: This paper shows how the performance of singular value decomposition (SVD) is enhanced through the exploitation of ILP, TLP, and DLP on Intel multi-core processors using superscalar execution, multi-threaded computation, and streaming SIMD extensions, respectively. To facilitate the exploitation of TLP on multiple execution cores, the well-known cyclic one-sided Jacobi algorithm is restructured to work in parallel. On two dual-core Intel Xeon processors with hyper-threading technology running at 3.0 GHz, our results show that the multi-threaded implementation of one-sided Jacobi SVD is about four times faster than the single-threaded superscalar implementation. Furthermore, the multi-threaded SIMD implementation speeds up the execution of single-threaded one-sided Jacobi by a factor of 10, which is close to the ideal speedup. For a reasonably large matrix that fits in the L2 cache, our results show that a performance of 11 GFLOPS (double precision) is achieved on the target system through the exploitation of ILP, TLP, and DLP as well as the memory hierarchy.</ab>
    <rv></rv>
  </abgroup>
</item>