* Babblefro1 (n=Babblefr@71-34-95-79.ptld.qwest.net) has joined #osgrid
* Babblefrog has quit (Read error: 104 (Connection reset by peer))
* synalx has quit (Read error: 113 (No route to host))
* [awebb] (n=awebb@c-76-28-82-254.hsd1.ct.comcast.net) has joined #osgrid
* krisbfunk (n=chatzill@chtwpe0105w-142068127011.pppoe-dynamic.pei.aliant.net) has joined #osgrid
anyone tested osgrid login lately? stalling on region handshake for me
"login packet never recieved by login server"
i'm guessing that "while [ 1 ]; do mono OpenSim.exe; done" didn't do the trick
or while(true); do mono OpenSim.exe ; done
well, I'm logged in on osgrid w/out issues this morning
several times
ok, good to know. it's my client or connection then, i'll try when i get to work
* krisbfunk has quit ("ChatZilla 0.9.81 [Firefox 2.0.0.12/2008020121]")
There are some problems on OSGrid. Some regions are not accepting logins consistently. Nebadon and I were finding this earlier today. Looks like it is some sort of network problem.
Time for lunch. Can't type when I am hungry :)
daTwitch: We were having strange problems earlier today which we never solved. Nebadon could login to WP2 without a problem. However Tedd and I could not. I found the same on some other regions i.e. the Cameos. Ping times showed 200ms from me to the PC running WP2.
ChrisD: sounds almost certainly like an intermittent network failure. Did anyone move beyond ping in the diagnostic process?
I did a quick network trace which indicated that the response times deteriorated when it got to Level3. OpenSim seems to be a bit susceptible to network latency.
yeah, I have experienced those same issues with level3 before. I actually managed to get them fixed at one point by harassing level3 support.
Nebadon was able to login to WP2 with 10 bots and himself. We could not get in at all.
There should be some things we can do to make OS more fault tolerant.
But I confess I am no network programmer
This problem also prevents successful teleporting and region crossing
I restarted Wright Plaza, it was at a bash prompt. I think I will *not* use the while true incantation, as it doesn't seem to do anything.
I really don't think anything is wrong with the software.
Charles, it had to be restarted earlier today.
I ran my region server overnight and stayed logged in as well.
So, we appear to be in a time of instability this week.
If you want to know what I think, I think that taoki's region being constantly misconfigured/reconfigured is dramatically impacting the stability of WP
Yes, we are going through a very unstable patch
I don't think 'unstable' puts a fine enough point on it
Fermi is next to Taoki's region and it has been up for 24 hours and still running
It is interesting that some regions run for days and others cannot run for more than a few hours.
I think it is more along the lines of "OpenSim instances have become extremely sensitive to variations in network reliability and extremely sensitive to the state of neighbor regions"
Feel free to invade Fermi and try and crash it. I am logging the console to a file.
That could certainly be. We have differing, unknown and unknowable versions on adjacent sims and a grid that spans the planet.
all day yesterday, Taok Vixen showed as a dark blue square on the mini map. In the past, this has indicated a 'black hole' in the grid
* krisbfunk (n=chatzill@137.149.66.161) has joined #osgrid
I agree daTwitch, if the region can't talk to a neighbour, it creates an exception. It should really time out gracefully.
well, it's not a binary state. a region can be misconfigured in a variety of ways. Some misconfigurations work partially.
for instance, the circumstance where the public can log into a region but the region operator cannot
got logged in ok from work daTwitch, thanks.
you bet kris :D
Got to grab some lunch. Back in 15 minutes.
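The suggestion above, that a region should time out gracefully instead of raising an exception when it cannot talk to a neighbour, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: `neighbor_reachable` is a hypothetical helper, a plain TCP connect stands in for whatever transport OpenSim really uses between regions, and the host and port are placeholders.

```python
import socket


def neighbor_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to the neighbour succeeds in time.

    Hypothetical helper: refusals, unreachable routes and timeouts all
    come back as False instead of propagating an exception to the caller.
    """
    try:
        # create_connection applies the timeout to the connect attempt.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, no route to host and socket timeouts.
        return False


if __name__ == "__main__":
    # Placeholder address; a real grid would supply its own neighbour list.
    if neighbor_reachable("127.0.0.1", 9000, timeout=1.0):
        print("neighbour is up, safe to notify it")
    else:
        print("neighbour unreachable, skipping it for now")
```

The design choice is simply that a dead or misconfigured neighbour costs the caller one bounded wait and a `False`, so the caller can skip it and retry later rather than fail.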
I know that last weekend I got a bunch of new neighbors on the grid that I've never seen before nor met on IRC. some of their regions worked, some didn't. Varying degrees of functionality
suddenly at exactly the same time, it was all I could do to keep my region server online.
some of it was -my- fault
If this diagnosis is true, the question becomes "What is the correct way to move forward?". We need some sort of test that developers can duplicate to prove or disprove our premise.
some not. fixing my issues did not eliminate the instabilities
however complete relocation of the regions to a spot on the map with zero neighbors did.
well, I think confirmation from some of the other users would help
it's a bit fuzzy though - I'm reaching this conclusion through reasoning and circumstantial evidence myself
The trick is going to be getting past "it don't work" to "it don't work because"
it is however, the simplest explanation that fits the facts
Occam's Razor is pretty good.
Indeed it is :D
Close shave every time :D
It gets complicated in a world where there are 8 neighboring sims..
indeed. I think one way to test the hypothesis is with a small test grid where all regions were in control of a single operator
so regions could be selectively broken in various ways, for purposes of documenting the impact on surrounding or adjacent regions
If I recall, we did have issues way back where we had to have a blank spot between each sim, i.e., each had to be an isolated island in order for them to work. We got past that September or October.
hmmmm
I think one thing working against us is that when everything *is* properly configured, it all works great.
I also know that most of the developers test a quick standalone and only one or two are even looking at sim-sim interactions on a heterogeneous grid.
this encourages a sort of complacency that is such a disjuncture from how the software is employed
Is it possible that POS vs ODE between neighboring regions could have an effect on region-region interaction?
* krisbfunk has quit ("ChatZilla 0.9.81 [Firefox 2.0.0.12/2008020121]")
one has to wonder :)
but unfortunately, I know significantly less about the potential impact of the physics engine on networking than I do about the networking itself (and that isn't much)
Well, if you are right, then we have some significant challenges in creating a heterogeneous grid. By the logic espoused, a homogeneous grid has a better chance of working, but that means the whole notion of individuals putting their regions on a grid is in jeopardy.
'jeopardy' may be a bit extreme.
Which means that Sakai's idea of a closed grid where one person instantiates all the regions and hence knows that all the configurations are the same will work better
If we consider the development cycles of past server architectures, what generally works well in a homogeneous environment can be tuned and made fault tolerant for a heterogeneous one.
And I certainly hope that last statement is incorrect.
basically, as I see it, Sakai/Dalien are lazy.
Yes, I agree. The trick is going to be to figure out what the issues really are.
we're at a point where the issues are much harder to debug.
This is the really challenging part.
That's the reward for figuring out each issue. The next one is a little tougher.
this one is genuine rocket science.
which brings us to this other juncture.
Maybe we are at the stage where updating the UGAI each and every day should slow down.
Charles I have come to the conclusion that the single most important work that could be done right now is a complete audit of the underlying packet handling systems.
For instance we don't need to take Sir_Ahzz' word on whether his current work in that area is good. Nor do we need to take core's word.
We need to take some personal responsibility, learn what he is doing, and be able to critique his work personally. From both a design and testing perspective.
if it isn't right, we need to see it, say why, and fix it. Whoever wrote it.
without that subsystem working like it was made by an advanced civilization, we're never gonna figure out why the rest of this doesn't work, or fails intermittently
alien tech Chuck, we gotta make it alien tech
it's really up to us, cuz like you say - core is testing something else entirely.
Well, here is the problem. I read virtually everything written on all the irc channels. I see differing results from differing folks. *Some* of the results are consistent with my own observations, but not all.
that is why we need to preen and harden that communications code.
To the extent the results of others are inconsistent with my own results, I need to move more slowly as I cannot tell who is confused.
it is the only way we can manage the signal to noise ratio
So, how do you want to move forward?
as soon as Ahzz comes in today, I'm going to interview him some about his code - have him point it out to me in the source.
As you say reported results are inconsistent. This is partly due to the varying technical expertise of sim owners. I am not sure as to how we would move forward on this.
I think that is a bad assumption for two reasons Charles
for one, it's always a bad idea to assume the data you gather is defective
for another, we need to be sure what is done is tolerant in an HA sense
One suggestion perhaps is that we set aside a portion of OSgrid where the neighbours are under control of yourselves.
so it needs to work in that circumstance where the net is busted or the neighbor is hosed
WB ChrisD :D
I have to admit that I attend to Mw, Lbsa, Sdague & Adam's opinions about architecture with a high degree of credibility.
Others, including myself, have a low degree of credibility when it comes to architectural questions in the midst of the scene or packet handling logic.
we already have that - anyone can relocate away from everyone else.
All owners of the sims use a standard pre-agreed OpenSim.ini and region xml file.
I just have, and improved my stability dramatically
Now you need to add controlled neighbours.
you might be on to something ChrisD
I do have a theory of why some of these problems occur.
hence the reason I am suggesting controlled neighbours
We also have the issue of the moving target OpenSim.ini.example. Most of us do not update our OpenSim.ini on a controlled basis and new settings are added into OpenSim.ini.example on a somewhat uncontrolled basis from time to time.
From my observation, I believe that the "see into neighbouring region" feature is having an impact.
ChrisD: almost certainly
it impacts in the following ways:
if the neighbor is properly configured, it speeds region transit to the neighbor by preloading the prims
if the neighbor is misconfigured, it blocks queues and their attendant threads waiting for connections and transfers to time out
Charles, your observations re: OpenSim.ini are spot-on
also, another thing biting us is the notion of the 'stable release'
in a het grid (especially this one) there's no such thing as a stable release
we're all running svn
So, maybe a good next step is for us all to start with a common OpenSim.ini which we should perhaps post on the osgrid.org web site?
* krisbfunk_ (n=chatzill@postgresql.vre.UPEI.CA) has joined #osgrid
or at least, we should be
We can also control when the UGAI is updated. It doesn't have to be every single svn.
What I am also seeing is that if someone logs into or transits to an adjacent region then my sim starts a new Client View thread. I have found high client counts with no one actually logged in to my sim.
I see that also, by the way.
same here
it's somewhat understandable
there needs to be a process in each region adjacent to that where the avatar is, to be ready to receive the avatar if it transits there
Would "child_get_tasks" when set to true possibly exacerbate that?
there are definitely some issues with it. I think it's why we sometimes see the ghost of an avatar in a region adjacent to the one s/he's actually in
I gotta run for a few, fellers
nature calls :)
Ok, I have an idea. Let's post a "recommended" OpenSim.ini based on the latest OpenSim.ini.example and set all our regions to that .ini and perhaps set child_get_tasks to false for a few weeks.
"child_get_tasks" is no longer used. It is now called "see_into_this_sim_from_neighbor". I am not sure if this is documented or not.
Well, all my OpenSim.ini use child_get_tasks and that includes the plazas, Yang and the moons, so maybe we are on to something regarding making all the OpenSim.ini more consistent.
To do that I would have to reset my sim and delete all inventory as I have local asset storage. To be consistent we would all need to use grid storage,
I think the "see_into_this_sim_from_neighbor" defaults to true and that is why it works even with child_get_tasks.
There was a Mantis that Babblefrog wrote that provided a C# program to convert SQLite to MySQL that we ran on Wright Plaza about 3 or 4 weeks ago when we converted the plazas from local to grid assets.
yes it is the default now
My local assets are stored in a local MySQL database.
let's take it one step further. Let's start a discussion thread on the forum
essentially by posting a log of this convo into it
Ok, sounds good, go ahead and let's see where we go.
then we can drag certain folk into it there
we should also post your example on the website too.
essentially, we need to start taking a bit more active stake in the grid bits of this project
daTwitch, can you lead that charge? I just can't go multiple directions at the same time.
BTW LoL yes I can Sir :)
but one more thing that has bearing
Thank you, Sir.
3Di is about to drop a bomb
With the parcel stuff, I know.
they warned us it's coming on the dev list
well, it's a stability patch too as I understand it
Not to mention the Rex "soup" which is warming up in a branch
yes
Or the "Ahzz" stew which is simmering in git.
this might get hairy for a bit, but I think now is the time to start generating awareness of some of these issues, potential fixes/workarounds, and our awareness of them
Put 'em all together and we don't know if we will have a tasty dish or garbage.
I will set up another sim on another linux box using an agreed standard OpenSim.ini. This will be able to handle up to 4 regions connected to OSgrid.
sweet ChrisD
If you look a bit off shore se of WP you'll see my regions down there
Nexus Prime, Bodhgaya, etc
To be honest, I am more worried about Ahzz than I am about either 3Di or Rex, but that's just because I have listened to their irc for a little longer.
I have growing confidence in Ahzz. Not because I think his code is perfect, but because he is playing well with others and is willing to admit he may be wrong.
Currently there is potential for major problems with a lot of merging going on from multiple sources.
yes, ChrisD, if you want to bring those regions up alongside mine, we should be in a position to track the impact of some of the changes in a more-or-less controlled environment
Well, that is certainly one of the biggest worries right now. We have a time of instability *and* we are about to merge in slightly unknown patches from 3 different sources.
Also subversion is NOT the ideal system for dealing with this type of merge. i believe that using git would provide a much better way forward.
doing things in this fashion allows us to meet both the goal of providing a free region parking service and that of testing the grid ops of this system.
I am not too familiar with git, which reflects more upon me than it.
What I am learning about it, I like
Swell. So we have a) instability b) 3 different slightly unknown patches and c) a new source archive program, all at the same time. Let's change languages at the same time, just to make it completely impossible.
That's the spirit Chuck ;p
You forgot the different operating systems and the different versions of mono :)
Actually, that really does break it down into a set of knowns. If we can do that, we can address specifics.
I'm going to go buy some coffee. Back in a little while.
ok
I'm gonna paste as much of this as I can into a new thread. Something for you to look at in a bit.
Looks like there is a patch that may fix the frozen avatar.
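The idea raised earlier in this log, that region operators agree on a common OpenSim.ini and track drift against OpenSim.ini.example, could be checked with a short script along these lines. This is a rough sketch, not a tool from the project: `diff_ini` is a hypothetical helper, the `[Startup]` section name and the sample values are assumptions made for illustration, and Python's `configparser` may not accept every construct a real OpenSim.ini contains.

```python
import configparser


def load_ini(text):
    # strict=False tolerates repeated keys; interpolation=None keeps "%" literal.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(text)
    return parser


def diff_ini(reference_text, local_text):
    """Compare a local ini against a reference ini.

    Returns keys missing locally, keys only present locally, and keys
    whose values differ (with both values).
    """
    ref = load_ini(reference_text)
    loc = load_ini(local_text)
    report = {"missing": [], "extra": [], "changed": []}
    for section in ref.sections():
        for key, ref_value in ref.items(section):
            if not loc.has_option(section, key):
                report["missing"].append((section, key))
            elif loc.get(section, key) != ref_value:
                report["changed"].append(
                    (section, key, ref_value, loc.get(section, key))
                )
    for section in loc.sections():
        for key in loc[section]:
            if not ref.has_option(section, key):
                report["extra"].append((section, key))
    return report


if __name__ == "__main__":
    # Made-up fragments; real files would be read from disk instead.
    reference = "[Startup]\nsee_into_this_sim_from_neighbor = true\n"
    local = "[Startup]\nchild_get_tasks = true\n"
    for kind, entries in diff_ini(reference, local).items():
        print(kind, entries)
```

Run against each operator's file, a report like this would flag a stale setting such as `child_get_tasks` as "extra" and its replacement as "missing", which is the kind of inconsistency discussed above.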