Modular exponentiation is a fundamental operation in computer arithmetic and is widely used in cryptography, for example in ElGamal encryption, the Diffie-Hellman key exchange protocol, and RSA. Implementing modular exponentiation in a residue number system (RNS) yields highly parallel computation and has been adopted in many hardware architectures. While most RNS-based architectures use the RNS Montgomery algorithm, which requires two residue number systems, the recent sum-of-residues modular multiplication algorithm performs modular reduction in only one residue number system with about the same degree of parallelism. This work shows that high-performance modular exponentiation and RSA cryptography can be implemented in RNS. Both the algorithm and the architecture are improved to achieve high performance at the cost of extra area overhead: a 1024-bit modular exponentiation completes in 0.567 ms on a Xilinx XC6VLX195t-3 platform, using 26,489 slices, 87,357 LUTs, 363 dedicated $18\times 18$-bit multipliers, and 65 Block RAMs.
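To illustrate where the parallelism of RNS arithmetic comes from, the following minimal Python sketch performs square-and-multiply exponentiation with each multiplication carried out independently in every residue channel. It is not the sum-of-residues reduction or the RNS Montgomery algorithm discussed in this work: for simplicity, the modular reduction step converts back to an ordinary integer via the Chinese Remainder Theorem, which a real RNS design would avoid. The moduli set `MODULI` and all function names are hypothetical choices made for the example.

```python
from math import prod

# Hypothetical toy moduli (pairwise coprime); real designs use much wider channels.
MODULI = (251, 253, 255, 256)
M = prod(MODULI)  # dynamic range of this residue number system


def to_rns(x):
    """Represent the integer x by its residues in every channel."""
    return tuple(x % m for m in MODULI)


def rns_mul(a, b):
    """Channel-wise product: each channel is independent, hence parallelizable."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(residues):
    """Reconstruct the integer in [0, M) with the Chinese Remainder Theorem."""
    total = 0
    for r_i, m_i in zip(residues, MODULI):
        M_i = M // m_i
        total += r_i * M_i * pow(M_i, -1, m_i)  # modular inverse, Python 3.8+
    return total % M


def rns_modexp(base, exponent, modulus):
    """Right-to-left square-and-multiply with RNS multiplications.

    Simplification (not the method of this work): reduction modulo
    `modulus` is done on the reconstructed integer after every product.
    """
    if modulus * modulus > M:
        raise ValueError("dynamic range too small for this modulus")
    result = to_rns(1 % modulus)
    acc = to_rns(base % modulus)
    while exponent > 0:
        if exponent & 1:
            result = to_rns(from_rns(rns_mul(result, acc)) % modulus)
        acc = to_rns(from_rns(rns_mul(acc, acc)) % modulus)
        exponent >>= 1
    return from_rns(result)
```

The dynamic range check reflects the usual RNS constraint that the product of two reduced operands must be representable without wrap-around, i.e. it must be smaller than the product of the moduli. The result can be compared against Python's built-in three-argument `pow` for small inputs.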