A number of years ago I wrote the collectl monitoring utility, which is
currently in use on a fairly large number of HPC clusters, including
many on the top500 list. I just wanted to let you all know that I have
updated the collectl-utils package, which contains a utility called
colmux that I think can revolutionize one's ability to see what's going
on with any resource on clusters of almost any size. I have tested it
on clusters of over 2000 nodes with great results.
Basically it's a collectl multiplexor, and by that I mean it's a
utility that starts a copy of collectl running on multiple systems,
multiplexes the output back to a single point, sorts it by a specific
column number and displays the output in a continuously refreshing
window in a top-like fashion. It can also be used to multiplex
historical data.
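To make the multiplex-and-sort idea concrete, here is a rough Python
sketch of that one step. It is not colmux's actual code or interface:
the host names, the sample lines, the 1-based column convention and the
function name merge_and_sort are all made up for illustration, and the
part that actually starts collectl on remote systems is left out.

```python
def merge_and_sort(samples, column, top=None):
    """Merge one line of output per host and sort by a numeric column.

    samples: dict mapping host name -> whitespace-separated line of numbers
    column:  1-based column number to sort on (an assumed convention)
    top:     optionally keep only the first `top` rows
    """
    rows = []
    for host, line in samples.items():
        fields = line.split()
        rows.append((float(fields[column - 1]), host, line))
    # Highest value first, as top does; ties are broken by host name
    rows.sort(key=lambda r: (-r[0], r[1]))
    if top is not None:
        rows = rows[:top]
    return [f"{host:<8} {line}" for _, host, line in rows]


# Made-up samples: one line per host for a single interval
samples = {
    "node01": "12 3 450",
    "node02": "97 1 120",
    "node03": "45 8 990",
}

for row in merge_and_sort(samples, column=1):
    print(row)
```

Repeating this once per interval and redrawing the screen each time
would give the top-like effect described above.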
Since it can support almost anything collectl can monitor (which is
substantial), you can quickly identify a busy nfs client or server, a
slow disk anywhere in the cluster, a network interface generating too
many errors, a slow infiniband link, a system doing too many
interrupts, or almost anything else you can think of. It's even been
used to find the systems running at the highest temperatures on a
multi-thousand node cluster during a top500 linpack run! I've
successfully used it to find systems on a cluster leaking slab memory,
and it only took seconds.
You can read more about it on the collectl-utils project page on
sourceforge. There's even a nifty photo of it running using an
alternate output format displaying the CPU load on 192 systems, all on
the same line once a second, taking 3 30" displays side-by-side!
-mark
