[Opensim-users] How to get ROBUST to notify that it has finished setup?
Seth Nygard
sethnygard at gmail.com
Thu Nov 18 16:07:10 UTC 2021
I don't use systemd for OpenSimulator as I find it lacks the necessary
handling for all that can go wrong with OpenSimulator and its
interdependent services. It can also complicate things when you want to
do maintenance or other manual work. Simply knowing a process is
running is not sufficient to cover what may go wrong.
In my case I use a series of wrappers and file semaphores to control how
OpenSimulator is started and shut down, both at the individual
application level and for the entire grid.
I generally run FSAssets as a separate service and the rest of Robust as
another. For cases where there may be high levels of concurrency I
separate Robust into more services.
If we consider my startup sequence for a grid with two Robust services,
one for FSAssets and another for everything else (core), then my main
wrapper would do the following for startup:
Start the grid
- check for an AUTORUN file semaphore for the grid
- if we do not see AUTORUN then abort any startup
- check if MySQL is running and accepting connections
- if MySQL is not OK then wait for 60 seconds and try again
- if MySQL has not been found to be OK after 5 minutes then generate a
log message and return with a fail exit code
- execute the wrapper for FSAssets
- if the exit code is OK then start Robust-core, otherwise generate a log
message and return with a fail exit code
- if the exit code is OK then loop through each simulator and run its
wrapper for startup; if any exit code is not OK then generate a log
message but continue
- if no not-OK exit codes were encountered then generate a log message
for successful grid startup
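That startup flow could be sketched in POSIX shell along these lines (the paths, script names, wrapper arguments, and the mysqladmin-based check are illustrative assumptions, not the actual scripts):

```shell
#!/bin/sh
# Illustrative sketch only; paths and script names are assumptions.
GRID_HOME=/opt/opensim
LOG=/var/log/opensim/gridctrl.log

log() { printf '%s %s\n' "$(date '+%F %T')" "$*" >> "$LOG"; }

# Retry a command every $2 seconds, up to $3 attempts; 0 on success.
wait_for() {
    cmd=$1 delay=$2 tries=$3
    while [ "$tries" -gt 0 ]; do
        $cmd && return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}

start_grid() {
    # Abort unless the grid-level AUTORUN semaphore exists.
    [ -f "$GRID_HOME/AUTORUN" ] || { log "no AUTORUN, aborting"; return 1; }

    # Wait up to ~5 minutes (5 tries, 60 s apart) for MySQL.
    wait_for "mysqladmin ping --silent" 60 5 \
        || { log "MySQL never became ready"; return 1; }

    "$GRID_HOME/gridsvcctrl.sh" START fsassets \
        || { log "FSAssets failed"; return 1; }
    "$GRID_HOME/gridsvcctrl.sh" START core \
        || { log "Robust-core failed"; return 1; }

    rc=0
    for sim in "$GRID_HOME"/sims/*/; do
        "$GRID_HOME/simctrl.sh" START "$sim" \
            || { log "sim $sim failed to start"; rc=1; }
    done
    [ "$rc" -eq 0 ] && log "grid startup OK"
    return "$rc"
}
```

Note that simulator failures only log and continue, while a FSAssets or Robust-core failure aborts the whole sequence, matching the steps above.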
Each of my OpenSim services has its own wrapper script, as follows:
FSAssets wrapper script:
- check to see if MySQL is running and I can make a simple query on the
table to get the record count for assets (this will fail on the
first-run startup but that is never done as part of an automated
sequence in my case)
- if MySQL is not OK then generate a log message and return with a
fail exit code
- start the FSAssets robust executable, in my case as a tmux session
- loop 10 times checking if we still have the new tmux session each second
- if during any check we no longer see our new tmux session we assume
something went wrong, generate a log message and return with a fail exit
code
- if after our 10 second loop we still have our tmux session then return
with an OK exit code
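The FSAssets wrapper steps might look roughly like this (the session name, table name, and Robust command line are assumptions):

```shell
# Illustrative sketch only; session, table, and command line are assumptions.
SESSION=fsassets
DB_CHECK="SELECT COUNT(*) FROM fsassets"

db_ok() { mysql -N -e "$DB_CHECK" opensim >/dev/null 2>&1; }

session_alive() { tmux has-session -t "$1" 2>/dev/null; }

# Run a check once per second for $2 seconds; fail as soon as it fails.
survive_for() {
    check=$1 secs=$2
    while [ "$secs" -gt 0 ]; do
        $check || return 1
        secs=$((secs - 1))
        [ "$secs" -gt 0 ] && sleep 1
    done
    return 0
}

start_fsassets() {
    db_ok || { echo "MySQL check failed" >&2; return 1; }
    tmux new-session -d -s "$SESSION" \
        "mono Robust.exe -inifile=Robust.FSAssets.ini"
    # If the session dies inside 10 seconds, assume startup failed.
    survive_for "session_alive $SESSION" 10 \
        || { echo "FSAssets died during startup" >&2; return 1; }
}
```

The Robust-core wrapper described next is structurally identical, differing only in the table queried and the ini file passed to Robust.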
Robust-core wrapper script:
- check to see if MySQL is running and I can make a simple query on the
table to get the record count for user accounts (this will fail on the
first-run startup but that is never done as part of an automated
sequence in my case)
- if MySQL is not OK then generate a log message and return with a
fail exit code
- start the core Robust executable, in my case as a tmux session
- loop 10 times checking if we still have the new tmux session each second
- if during any check we no longer see our new tmux session we assume
something went wrong, generate a log message and return with a fail exit
code
- if after our 10 second loop we still have our tmux session then return
with an OK exit code
Simulator wrapper script:
- check for an AUTORUN file semaphore for the simulator
- if we do not see AUTORUN then return with an OK exit code (this is a
skipped simulator and not an error)
- check if the main Robust service is running by requesting
get_grid_info
- if we couldn't get the grid info or did not find our expected grid uri
in the response then return with a fail exit code
- check to see if MySQL is running and our simulator schema exists (this
is OK for the first-run startup since we always require our schema to be
present to continue)
- if MySQL is not OK then generate a log message and return with a
fail exit code
- start our simulator executable as a new tmux session
- loop 10 times checking if we still have the new tmux session each second
- if during any check we no longer see our new tmux session we assume
something went wrong, generate a log message and return with a fail exit
code
- if after our 10 second loop we still have our tmux session then return
with an OK exit code
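The simulator-specific checks could be sketched like so (the Robust endpoint, expected grid URI, schema check, and command line are all assumptions):

```shell
# Illustrative sketch only; endpoint, URI, and command line are assumptions.
ROBUST_URL=http://grid.example.org:8002
EXPECTED_URI=http://grid.example.org:8002

# Fetch get_grid_info and confirm the expected grid URI appears in it.
grid_info_has() {
    curl -fsS --max-time 10 "$1/get_grid_info" | grep -q "$2"
}

start_sim() {
    simdir=$1
    # No AUTORUN means a deliberately skipped simulator, not an error.
    [ -f "$simdir/AUTORUN" ] || return 0

    grid_info_has "$ROBUST_URL" "$EXPECTED_URI" \
        || { echo "Robust did not answer get_grid_info" >&2; return 1; }

    mysql -N -e "SHOW TABLES" "$(basename "$simdir")" >/dev/null 2>&1 \
        || { echo "simulator schema missing" >&2; return 1; }

    tmux new-session -d -s "$(basename "$simdir")" \
        -c "$simdir" "mono OpenSim.exe"
    # (the same 10-second survival loop as the other wrappers would follow)
}
```

The get_grid_info probe is what answers the original question here: it only succeeds once Robust is actually serving HTTP requests, not merely once its process exists.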
My grid shutdown process is very similar but in reverse order, without
the DB or service checks, and with longer delays. For startup, the
10-second checks could be shortened, but I prefer to wait long enough
that I know the OpenSimulator executable is not going to die due to an
error. That usually happens in the first couple of seconds. In the case
of simulators I often increase the 10-second loop to cover the typical
time needed to start the scripts for that particular build, so I can
better balance the load on the host and not have too many sims all
trying to start their scripts at once.
Over the years this is what I have found to be the most flexible way to
handle an OpenSimulator grid and its services while avoiding many of the
errors that can happen along the way. The AUTORUN file semaphores are
added to make it easy to remove one or more select parts from the
automated sequences without needing to edit scripts. They exist within a
folder unique to each service and are easily noticed when you are
working in a terminal session. In my case I also use the AUTORUN file
semaphores to help with moving a simulator from one server to another
while avoiding any race conditions during the process.
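An AUTORUN semaphore can be as simple as an empty file toggled by hand; a minimal sketch (the directory layout is assumed):

```shell
# Illustrative sketch; each service directory holds (or not) an empty
# AUTORUN file that the wrappers test before acting.
enable_autostart()  { touch "$1/AUTORUN"; }
disable_autostart() { rm -f "$1/AUTORUN"; }
autostart_enabled() { [ -f "$1/AUTORUN" ]; }

# e.g. before moving a sim between servers:
#   disable_autostart /opt/opensim/sims/mysim
```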
In my case my gridctrl wrapper intentionally runs things in a
sequential manner, but that is not mandatory for the simulators. I
prefer, however, to limit the load on the host during script startup,
especially when I may force a recompile of all of them. That is a
CPU-intensive operation that can take some time if there are a large
number of scripts. The whole thing, however, happens mostly on a single
thread, so you could do several sims at once, but I would still try to
limit that to no more than the number of cores your servers have.
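One way to start several simulator wrappers at once while capping concurrency at the core count is xargs -P; a sketch (the directory layout and wrapper interface are assumptions):

```shell
# Illustrative sketch; directory layout and wrapper interface are assumptions.
CORES=$(nproc 2>/dev/null || echo 1)

# Start every simulator wrapper, at most $CORES at a time.
start_all_sims() {
    ls -d "$1"/*/ 2>/dev/null |
        xargs -r -P "$CORES" -n 1 "$2" START
}

# e.g. start_all_sims /opt/opensim/sims /opt/opensim/simctrl.sh
```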
My main grid wrapper in my case is gridctrl.sh, the Robust wrappers are
simply gridsvcctrl.sh, and the simulator wrappers are simctrl.sh. The
wrappers then contain all the necessary logic and accept a simple
command switch for START, STOP, or RESTART depending on what is needed.
This makes them suitable for both automated and manual use.
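A skeleton of that command-switch handling (the do_* bodies are placeholders):

```shell
# Illustrative skeleton of the START/STOP/RESTART dispatch.
do_start() { echo "starting"; }
do_stop()  { echo "stopping"; }

dispatch() {
    case $1 in
        START)   do_start ;;
        STOP)    do_stop ;;
        RESTART) do_stop && do_start ;;
        *) echo "usage: $0 START|STOP|RESTART" >&2; return 2 ;;
    esac
}

# e.g. dispatch "$1" from the top of simctrl.sh
```

Routing everything through one case statement is what makes the same script usable both by the grid-level wrapper and by a human at a terminal.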
I still find that using my own wrapper scripts gives me much more
control over everything than trying to do it within the OpenSimulator
startup mechanism. My wrapper scripts have additional functions to
provide both DB- and OAR-based backups and several other common admin
tasks.
I now normally run all the above using multiple docker containers but
the whole startup/shutdown concept remains very much the same.
Hopefully this gives you some ideas for how you can better handle what
you want to do, to make sure Robust is running before starting the
simulators. For your specific use case you can just look at what my
wrapper for the simulator is doing and ignore the rest.
-Seth
On 2021-11-17 4:03 p.m., Gwyneth Llewelyn wrote:
> Hi all,
>
> I've been tinkering with my automation scripts under Ubuntu Linux
> 20.04.3 LTS, trying to get them fully integrated with systemd. It's
> tougher than I imagined!
>
> My question is rather simple. OpenSim.ini lists a few options to run
> some scripts and/or send some notifications when the instance is fully
> loaded and operational (for instance, once the instance is fully
> loaded, you could check for the statistics API). These can be used for
> a variety of purposes, from simple notifications to a sysadmin to let
> them know that an instance has rebooted, to letting users get some
> sort of feedback on which regions are up, and so forth. These can also
> be used for system maintenance purposes.
>
> I can't find anything similar for ROBUST, though — at least, not on
> the configuration files. The closest I could find was a reference to
> the 'console'. I'm assuming that this would technically allow a bash
> script to connect to ROBUST and perform some sort of check...? A bit,
> uh, 'clunky' but... I guess it's a possibility?
>
> What are you using to signal that ROBUST has finished loading?
>
> Thanks in advance!
>
> - Gwyn
>
> P. S. Some background notes, for those interested in understanding
> what I'm trying to accomplish and why I've been having some trouble.
> One of the great things about systemd (arguably one of the few...) is
> that it launches everything in parallel, as much as possible; the
> theory being that services will not need to block each other, which is
> what happened in early systems (which relied on a serial sequence of
> steps, each having to finish before the next one was launched).
>
> This is great for launching all the OpenSim instances for the whole
> grid — they will load in parallel, and, since they're pretty much
> self-contained, they will happily get what they need from the database
> server, and — in theory! — finish faster than launching each instance,
> one by one (in practice, it's not so rosy, since the database server
> becomes the bottleneck... although it ought to be possible to
> fine-tune it to deal with so many requests in parallel).
>
> However, there are two catches with this approach.
>
> Firstly, if the MySQL database is not ready before ROBUST and/or the
> instances launch, OpenSim will assume a 'broken' or non-existing
> database connection, and gracefully fail, by asking for the Estate
> name and so forth — i.e. basically the instances will be up, but
> blocked. The good news is that there are several ways to check that
> MySQL is up and running (using some external scripts), so this can
> be checked before ROBUST or any of the OpenSim instances are launched.
>
> Secondly — and the reason for this message to the list! — _if_ ROBUST
> hasn't launched yet, then none of the OpenSim instances will register
> themselves with the core grid services (including the asset server).
> I'm not quite sure if each instance, after failing its attempts at
> contacting ROBUST, will make any attempt at a later stage to register
> with it again. If not, it effectively means a broken grid, where sections of
> it, on individual instances, will simply be isolated from the rest of
> the grid.
>
> ROBUST is quite fast in loading everything — compared with the OpenSim
> instances, at least — which means that there is a good chance that it
> launches before the instances. But we cannot be sure that this
> actually happens.
>
> Now, systemd has a way to generate a list (rather, a directed
> graph...) of dependencies. One can, indeed, make sure that ROBUST has
> already been launched *before* launching any of the instances. But
> this won't help much in this case, because systemd is only able to
> check that the *process* has been launched — not if it's ready to
> accept requests. There are some tricks to achieve that, but most
> require some changes in the ROBUST code, and I'm not even sure that,
> running inside Mono, the C# code has any access to system calls. The
> alternative is to use scripts that check for other things — such as,
> say, a status page or a file that has been written somewhere — in
> order to deduce that something has not only been launched but is
> actively accepting requests. I know how to do that inside an OpenSim
> instance, but not on ROBUST.
>
> --
> "I'm not building a game. I'm building a new country."
> -- Philip "Linden" Rosedale, interview to Wired, 2004-05-08
> _______________________________________________
> Opensim-users mailing list
> Opensim-users at opensimulator.org
> http://opensimulator.org/cgi-bin/mailman/listinfo/opensim-users
--
- Seth