and we are down to 1.01 seconds in the 95th percentile!
Sounds good. Hopefully this holds not only for this synthetic test but also for the real application.
The message "all connection are used" describes a condition that can occur with any other server as well; NaviServer just tells the user what is going on.
NaviServer crashes are rare (on our production site only a few times a year; the last one I see is from Jan 18, and most of them have an explainable cause). If you see many of these (3 cores per week qualifies as many), you should compile NaviServer and Tcl with -g and keep the core files to determine the cause.
For the snippet you are showing, it would be easy for me to say "it is a problem of the Oracle driver", ... but it probably is not. Oracle just tells the system that an abort() happened. It would certainly be interesting to see where this abort() is coming from (via the core file).
bug fix and push it out during the day without restarting naviserver which can take about 1 minute to restart
When using OpenACS, you can use the reload feature of the package manager to reload packages without a restart. Why are you not using this?
I am not saying that horizontal scaling is a bad idea, but, depending on your application, it might be a long road, especially because of cache coherency. NaviServer has per-request caches, per-thread caches, and caches via ns_cache* and nsv, where ns_cache supports transaction semantics: if a complex transaction performs API calls that cache some content, and the transaction then fails, the cached content has to follow the first three letters of the ACID properties as well (atomicity, consistency, isolation). Getting this right for all packages across several servers takes a long time. The easiest approach would be to deactivate all caching in OpenACS, but this has some performance impact as well.
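To make the point about transaction semantics more concrete, here is a minimal sketch in Python. It is not NaviServer code and not how ns_cache is implemented; the class `TransactionalCache` and the helpers `run_transaction`, `failing_work` and `successful_work` are hypothetical names made up for this illustration. The idea is only that values cached inside a transaction are staged and become visible to others on commit, and are thrown away when the transaction fails.

```python
class TransactionalCache:
    """Toy in-memory cache (hypothetical, for illustration only)."""

    def __init__(self):
        self._committed = {}   # entries visible to everyone
        self._pending = None   # staged entries; None when no transaction is open

    def begin(self):
        if self._pending is not None:
            raise RuntimeError("transaction already open")
        self._pending = {}

    def set(self, key, value):
        # Inside a transaction, writes are only staged.
        if self._pending is not None:
            self._pending[key] = value
        else:
            self._committed[key] = value

    def get(self, key, default=None):
        # The open transaction sees its own staged writes first.
        if self._pending is not None and key in self._pending:
            return self._pending[key]
        return self._committed.get(key, default)

    def commit(self):
        if self._pending is None:
            raise RuntimeError("no transaction open")
        self._committed.update(self._pending)
        self._pending = None

    def rollback(self):
        # Discard everything cached during the failed transaction.
        self._pending = None


def run_transaction(cache, work):
    """Run work(cache) in a transaction; return True on commit, False on rollback."""
    cache.begin()
    try:
        work(cache)
    except Exception:
        cache.rollback()
        return False
    cache.commit()
    return True


def failing_work(cache):
    cache.set("user:42", "new name")        # an API call caches some content ...
    raise RuntimeError("database error")    # ... and then the transaction fails


def successful_work(cache):
    cache.set("user:42", "new name")
```

With a single server, one cache instance can apply this rule locally. With several servers, every node holding a copy of the entry would have to agree on the commit or the rollback, which is the hard part of horizontal scaling mentioned above.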