Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both SIMD instructions and thread-level parallelism of today’s multi-core processors. In this paper, we propose a new parallel sorting algorithm, called Aligned-Access sort (AA-sort), for shared-memory multi processors. The AA-sort algorithm can take advantage of the SIMD instructions. It is scalable with increasing numbers of processor cores. The key to high performance is eliminating unaligned memory accesses that would prevent the use of SIMD instructions. We implemented and evaluated the AA-sort on PowerPC 970MP and Cell Broadband Engine processors. In summary, a sequential version of the AA-sort using SIMD instructions outperformed IBM’s optimized sequential sorting library by 1.8 times and GPUTeraSort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 M of 32-bit random integers. Furthermore, a parallel version of the AA-sort demonstrated better scalability with increasing numbers of cores than a parallel version of GPUTeraSort. As a result the AA-sort was 4.2 times faster on 4 cores of PowerPC 970MP and 4.9 times faster on 16 cores of Cell BE processors compared to GPUTeraSort.