Forum OpenACS Q&A: Re: OpenACS clustering setup and how it relates to xotcl-core.

and we are down to 1.01 seconds in the 95th percentile!
Sounds good. Hopefully this holds not only for the synthetic test but also for the real application.

The message "all connections are used" describes a condition that can occur with any other server as well; NaviServer just tells the user what is going on.

NaviServer crashes are rare (on our production site only a few times a year; the last one I see is from Jan 18, and most happen in explainable cases). If you see many of them (3 cores per week qualifies as many), you should compile NaviServer and Tcl with -g and keep the cores to determine the cause.
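
Keeping the cores requires that the process is allowed to write them. A minimal shell sketch for checking this, assuming a Linux host; the helper name describe_core_limit is made up for illustration, and how the limit is set for the nsd service (systemd unit, container, init script) is site-specific:

```shell
#!/bin/sh
# Hypothetical helper: interpret the value printed by `ulimit -c`.
describe_core_limit() {
    case "$1" in
        0)         echo "core dumps disabled" ;;
        unlimited) echo "core dumps enabled (no size limit)" ;;
        *)         echo "core dumps limited to $1 blocks" ;;
    esac
}

# Soft limit of the current shell; nsd inherits the limit of whatever starts it.
describe_core_limit "$(ulimit -c)"

# On Linux, this kernel setting decides where core files are written.
if [ -r /proc/sys/kernel/core_pattern ]; then
    echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
fi
```

If the limit turns out to be 0, running `ulimit -c unlimited` in the shell that launches the server is the usual way to raise it.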

For the snippet you are showing, it would be easy for me to say "it is a problem of the Oracle driver", ... but it probably is not. Oracle just tells the system that an abort() happened. It would certainly be interesting to see where this abort() is coming from (via the core).

bug fix and push it out during the day without restarting naviserver which can take about 1 minute to restart

When using OpenACS, you can use the reload feature of the package manager to reload packages without a restart. Why are you not using this?

I am not saying that horizontal scaling is a bad idea, but, depending on your application, it might be a long road, especially regarding cache coherency. NaviServer has per-request caches, per-thread caches, and caches via ns_cache* and nsv, where ns_cache supports transaction semantics: if a complex transaction performs API calls that cache some content and the transaction then fails, the cached content must also honor the first three letters of the ACID properties (atomicity, consistency, isolation). Getting this right for all packages takes a long time. The easiest approach would be to deactivate all caching in OpenACS, but that has a performance impact as well.

Thanks Gustaf,

I will look into compiling naviserver with the -g option to see if we can generate a core file that can generate a backtrace with gdb.
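
Once a core file exists, the backtrace can also be pulled non-interactively. A rough sketch, assuming gdb is installed; the helper name core_backtrace is hypothetical, and the binary and core paths in the usage line are placeholders to adapt:

```shell
#!/bin/sh
# Hypothetical helper: print backtraces of all threads from a core file.
# Usage: core_backtrace /path/to/binary /path/to/core
core_backtrace() {
    binary=$1
    core=$2
    if [ ! -r "$binary" ] || [ ! -r "$core" ]; then
        echo "usage: core_backtrace BINARY CORE (both must be readable)" >&2
        return 2
    fi
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found in PATH" >&2
        return 3
    fi
    # -batch makes gdb exit after running the -ex commands.
    gdb -batch -ex "thread apply all bt" "$binary" "$core"
}
```

A call such as `core_backtrace /usr/local/ns/bin/nsd ./core > backtrace.txt` would then save the output for posting. Using "thread apply all bt" rather than plain "bt" is a choice made here so that all threads are shown, not only the one that crashed.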

Some time ago we adopted a policy of always restarting the server after an upgrade because of some oddities some developers were seeing. We believe the issue arose when a developer instantiated a proc in the shell (for debugging purposes). After talking to Tony about it, we think you are right and we should just use the "Reload" option for daytime releases.

Your point is well taken about caching. It would be interesting to disable the caching for testing purposes. Is there a parameter that does this or a compile option? Or would we need to disable it for each type of cache by modifying the code?

Thanks for your help, Marty
Just a question about compiling with debug info:

I can run the install-ns.sh script to build NaviServer, but I do not see a -g option in it. I can see that configure is run inside the script with --enable-symbols, though. Does this mean the debug symbols are included by default?


if [ $with_mongo = "1" ] ; then
    echo "------------------------ WITH MONGO"

    ./configure --enable-threads --enable-symbols \
                --prefix=${ns_install_dir} --exec-prefix=${ns_install_dir} --with-tcl=${ns_install_dir}/lib \
                --with-nsf=../../ \
                --with-mongoc=/usr/local/include/libmongoc-1.0/,/usr/local/lib/ \
                --with-bson=/usr/local/include/libbson-1.0,/usr/local/lib/
else
    ./configure --enable-threads --enable-symbols \
                --prefix=${ns_install_dir} --exec-prefix=${ns_install_dir} --with-tcl=${ns_install_dir}/lib
fi

${make}
${make} install

Marty
When I look at the gcc compile lines, it does look like there is a -g in there, so I assume the debug symbols are included by default.
gcc -DPACKAGE_NAME=\"tdom\" -DPACKAGE_TARNAME=\"tdom\" -DPACKAGE_VERSION=\"0.9.1\" -DPACKAGE_STRING=\"tdom\ 0.9.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_GETRANDOM=1 -DXML_DEV_URANDOM=1 -DXML_POOR_ENTROPY=1 -DBUILD_tdom=/\*\*/ -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_MEMMOVE=1 -DHAVE_BCOPY=1 -DXML_POOR_ENTROPY=1 -DUSE_THREAD_ALLOC=1 -D_REENTRANT=1 -D_THREAD_SAFE=1 -DTCL_THREADS=1 -DUSE_TCL_STUBS=1 -DUSE_TCLOO_STUBS=1 -DMODULE_SCOPE=extern\ __attribute__\(\(__visibility__\(\"hidden\"\)\)\) -DHAVE_HIDDEN=1 -DHAVE_CAST_TO_UNION=1 -D_LARGEFILE64_SOURCE=1 -DTCL_WIDE_INT_IS_LONG=1 -DUSE_TCL_STUBS=1 -DXML_DTD=1 -DXML_NS=1 -DTDOM_NO_UNKNOWN_CMD=1 -DUSE_NORMAL_ALLOCATOR=1  -I../expat -I../generic -I"/usr/local/ns/include" -I.    -g -O2 -pipe -O2 -fomit-frame-pointer -DNDEBUG -Wall -fPIC  -c `echo ../generic/domxpath.c` -o domxpath.o
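
Spotting the flag in a long compile line by eye is error-prone, so here is a small sketch that checks for it. The helper name has_debug_flag is made up for illustration, and the sample line is an abbreviated copy of the tdom line above; note that it covers tdom only, so the nsd and Tcl compile lines would need the same check:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a compiler command line contains a
# debug flag (-g, -g0..-g9, -ggdb...) as a separate word.
# The unquoted $1 is split on whitespace on purpose; fine for a quick check.
has_debug_flag() {
    for word in $1; do
        case "$word" in
            -g|-g[0-9]|-ggdb*) return 0 ;;
        esac
    done
    return 1
}

line='gcc -I../generic -g -O2 -pipe -O2 -fomit-frame-pointer -DNDEBUG -Wall -fPIC -c ../generic/domxpath.c -o domxpath.o'
if has_debug_flag "$line"; then
    echo "debug flag present"
else
    echo "no debug flag"
fi
```

The same line also carries -O2 and -fomit-frame-pointer, and with optimization enabled gdb may report some values as optimized out even though symbols are present. For an installed binary, `readelf -S /usr/local/ns/bin/nsd | grep debug_info` is one way to see whether a debug section exists.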
Hi Gustaf,

We got a signal 11 core today on our live system. I was able to open it with gdb and get a backtrace, which is shown here.

Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/ns/bin/nsd...
...
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/local/ns/bin/nsd -u root -g web -i -t /web/etc/config.tcl'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI_abort () at abort.c:107
107     abort.c: No such file or directory.
[Current thread is 1 (Thread 0x7fcc74ff9700 (LWP 37))]
(gdb) bt
#0  __GI_abort () at abort.c:107
#1  0x00007fccba4b3064 in skgdbgcra () from /opt/oracle/instantclient_21_1/libclntsh.so.21.1
#2  0x00007fccba480693 in skgesigCrash () from /opt/oracle/instantclient_21_1/libclntsh.so.21.1
#3  0x00007fccba4809de in skgesig_sigactionHandler () from /opt/oracle/instantclient_21_1/libclntsh.so.21.1
#4  <signal handler called>
#5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#6  0x00007fccc2292859 in __GI_abort () at abort.c:79
#7  0x00007fccc24cf27f in Panic (fmt=<optimized out>) at log.c:943
#8  0x00007fccc215bf8d in Tcl_PanicVA () from /usr/local/ns/lib/libtcl8.6.so
#9  0x00007fccc215c0ff in Tcl_Panic () from /usr/local/ns/lib/libtcl8.6.so
#10 0x00007fccc25143ca in Abort (signal=<optimized out>) at unix.c:1119
#11 <signal handler called>
#12 0x00007fccc24b825c in NsTclConnChanProc (UNUSED_sock=<optimized out>, arg=0x7fcbf581d460, why=1) at connchan.c:694
#13 0x00007fccc24e279d in SockCallbackThread (UNUSED_arg=<optimized out>) at sockcallback.c:531
#14 0x00007fccc2210fd4 in NsThreadMain (arg=<optimized out>) at thread.c:232
#15 0x00007fccc221169f in ThreadMain (arg=<optimized out>) at pthread.c:870
#16 0x00007fccc1b72609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#17 0x00007fccc238f293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Any ideas on what could be causing the signal 11 issue? If you want me to open a new discussion on it, I can. Thanks! Marty