Subthreshold circuit design, while energy efficient, has the drawback of performance degradation. To retain the excellent energy efficiency while reducing performance loss, we propose to investigate near subthreshold techniques on chip multi-processors (CMP). We show that logic and memory cells have different optimal supply and threshold voltages, and, therefore we propose to allow the cores and memory to operate in different voltage regions. With the memory operating at a different voltage, we then explore the design space in which several slower cores clustered together share a faster L1 cache. We show that an architecture such as this is optimal for energy efficiency. In particular, SPLASH2 benchmarks show a 53% energy reduction over the conventional CMP approach (70% energy reduction over a single core machine). In addition we explore the design trade-offs that occur if we have a separate instruction and data cache. We show that some applications prefer the data cache to be clustered while the instruction cache is kept private to the core allowing further energy savings of a 77% reduction over a single core machine.