Überware™

> Forum Home > Announcements > Last minute new feature for 2.5: Automatic redundant masters

  Wed, 13/May/2009 3:17 AM
Robin
1138 Posts
This allows Smedge to run in a completely automatic mode, eliminating the need to worry about Masters, Engines, and whatevers. This will let Smedge operate exactly like Smedge 2 did, only with all the power and stability of Smedge 3.

This is the first phase of a transformation in the data and communication systems as part of the next major release of Smedge. The next major release will also include many of the commonly requested features, as well as a complete GUI overhaul, making the interface more efficient and completely customizable.

This feature is currently in development. Once it is through QA, we will be releasing Smedge version 2.5 beta 2, with this new system enabled. The exact release date is not yet determined, but it will be before the end of this month. This release will also include all of the previously mentioned version 2.5 features and bug fixes.

This beta version will be available for Windows, Linux and Mac simultaneously. A complete list of the new features and fixes will be made available with the release.

As ever, please let us know if you have any questions about the new release or updating your Smedge.

Thanks
-robin

Edited by author on Thu, 14/May/2009 1:35 AM
   
  Tue, 19/May/2009 6:56 AM
Jamie
102 Posts
Look forward to seeing this release.
Cheers

J
   
  Sun, 31/May/2009 12:44 PM
Robert
10 Posts
Yeah, I'm excited for this one. I've been checking daily in hopes of seeing it.
   
  Sun, 31/May/2009 4:25 PM
Alex
20 Posts
Checking daily as well. I did notice it was not released before the end of the month though.

Programmer's rule #2: Never give code delivery timeframes.

Edited by author on Sun, 31/May/2009 4:25 PM
   
  Mon, 01/Jun/2009 1:11 PM
Robin
1138 Posts
Yes, you would think that after 8 years of doing this, I would stop giving deadlines, since I've pretty much missed every deadline I have ever set for myself. On the other hand, if I don't set deadlines for myself, nothing ever gets done!

In any case, this is still in development. Most everything is in place now, but there are a few more days of development left, and then some time to work out the kinks. To get the beta released on all three platforms will probably take another week to ten days.

Thank you for your patience. Please let me know if you have any questions!
-robin

By the way, what is programmer's rule #1?

Edited by author on Mon, 01/Jun/2009 1:12 PM
   
  Mon, 01/Jun/2009 3:01 PM
Alex
20 Posts
Programmer's Rule #1 (golden rule for more than programming - though it took me over a decade of coding to realize it is true):

1) Cheap
2) Fast
3) Good

Pick two.

No worries on the wait, it is well worth it. Great software.

Edited by author on Mon, 01/Jun/2009 3:01 PM
   
  Tue, 02/Jun/2009 12:20 PM
Tim
11 Posts
Just wondering if you could expound on the Automatic redundant masters?
How will it work, what exactly will it do, how will it run, what will it require, etc.
We are in the process of setting up a completely new render farm and I would love more info so that I can set it up more intelligently. This would be huge help to us. Thanks again.
   
  Tue, 02/Jun/2009 2:17 PM
Robin
1138 Posts
Automatic Redundant Master:

Smedge currently uses a division of labor into separate processes on the machine. This division is done for stability, so that a problem that may take down one process will not generally affect other processes. For example, if something causes the SmedgeGui to crash, the SmedgeEngine process that is actually doing the work will not be affected.

Smedge also currently uses a client/server type architecture to establish communication and to ensure database consistency. The SmedgeMaster is the process that handles the database consistency and the distribution of work to the Engines.

One limitation of the current design is that you must specify a machine to be the "Master" when you are installing Smedge. This is something that people running large render farms have been used to for a long time. However, Smedge 2 used a system that dynamically selected a "master" at run time, instead of when you installed the program.

The Dynamic Redundant Master feature will allow Smedge to work in the simplified manner of Smedge 2, but with the stability and power of Smedge 3.

Specific differences:

  • You no longer need to select the machine's role (Master/Engine/Workstation) at installation time. You also no longer need to configure services or start up the SmedgeEngine or SmedgeMaster processes at login. While you can still do these things if you want, you can also make a machine available simply by starting the SmedgeGui on that machine. Double click the big red S, and you're good to go!

  • If the Master machine goes down, any other machine can take over seamlessly. All machines that are allowed to run the SmedgeMaster process will contain a complete mirror of the database, instantly ready to take over as the Master.

  • Smedge component processes' network usage will be more limited. Normally, each client application connects to the Master. With the redundant masters, client apps can connect to the Master or to any Mirror of the Master to receive data and updates. This limits the network connections between machines to the redundant master processes and the SmedgeEngine processes that do the work. (This will be optimized even further with the next major release, so that the SmedgeMaster is the only process that needs a network connection at all.)
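The last bullet leaves open how a client picks between the Master and its mirrors. Purely as a sketch of one possible policy (least connections, which is an assumption, not Smedge's documented behavior; all hostnames here are made up):

```python
def pick_endpoint(master: str, mirrors: list[str],
                  connections: dict[str, int]) -> str:
    """Choose the Master or mirror with the fewest client connections."""
    return min([master] + mirrors, key=lambda h: connections.get(h, 0))

endpoint = pick_endpoint(
    "master01",
    ["mirror01", "mirror02"],
    {"master01": 12, "mirror01": 3, "mirror02": 7},
)
# endpoint is "mirror01", the least-loaded host
```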

    As far as what you can expect when setting up a new farm, the design of the system is such that you can apply any current knowledge or setups with the new version. The dynamic system complements the existing operation of Smedge, instead of replacing it. The biggest consideration is that you no longer have to pick your "Master" and you don't need to decide what the machine is allowed to do when you install Smedge.

    I hope that clarifies it a bit. Please let me know if you have any more questions.

    -robin

    Edited by author on Wed, 03/Jun/2009 2:34 AM
      Thu, 11/Jun/2009 8:23 PM
    Robin
    1138 Posts
    Yes I know I'm way past the end of last month....

    I had to take a few days to iron out some changes in the Maxwell Module. Maxwell support now includes the ability to send a command to "stop and merge" all work at whatever state it happens to be at. Maxwell is cool, isn't it?

    Anyway, this is now done, and I am back on the automatic redundant masters, which is almost done being implemented. Sorry for the delay, but it will be worth it.

    And, if anyone wants to play with the bleeding edge Maxwell support, drop me a line and I can send the link to download the latest testing build.

    Thanks
    -robin
       
      Tue, 23/Jun/2009 12:01 PM
    Tim
    11 Posts
    any word on when the new build will be out?
    Thanks
       
      Tue, 23/Jun/2009 1:14 PM
    Robin
    1138 Posts
    Well, I'm sorry it has taken significantly longer than I thought. It's all in place now, and I'm just ironing out the details at this point. But it's still not quite stable yet under all possible situations, so I have to keep trying to find the issues. I don't have a specific day for release yet, however. I am aiming for trying to get it out by the end of this month, but we can all see what happened last time I said that!

    -r
       
      Fri, 26/Jun/2009 12:53 AM
    Robin
    1138 Posts
    Ok, it's almost ready for people. Part of why it took longer than expected is that I also fixed a messaging related issue that could cause the system to break. Smedge used to use a fixed size chunk of memory to compose the messages it sends along the network. Sometimes, a render could generate so many files that the message would exceed that fixed size, and this would kill the messenger.

    So, now it uses a dynamically sized buffer to hold messages. Messages can grow as large as your system will allow, so a job that generates lots and lots of files won't break the messenger (though an enormous message could eventually exhaust the machine's memory). Obviously, you should keep an eye on this, as it can start to degrade network performance and consume a fair bit of RAM if you let the list of files get too big. Remember, you can always avoid the problem (even on the earlier versions) by disabling the image filename detection system for the job (in the Advanced Info tab of the Submit Job window).
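The fix described above (a growable buffer in place of a fixed-size one) can be sketched roughly like this; the class and method names are illustrative, not Smedge's actual code:

```python
class MessageBuffer:
    """Accumulates message payloads into a dynamically sized buffer."""

    def __init__(self):
        self._data = bytearray()   # grows as needed, unlike a fixed char[N]

    def append(self, payload: bytes) -> None:
        self._data += payload      # bytearray resizes automatically

    def compose(self) -> bytes:
        # Prefix the payload with its length so the receiver knows how
        # much to read, regardless of how large the message grew.
        return len(self._data).to_bytes(8, "big") + bytes(self._data)

buf = MessageBuffer()
for name in ("frame_0001.exr", "frame_0002.exr"):
    buf.append(name.encode() + b"\n")
msg = buf.compose()
```

The length prefix is one simple framing convention; the point is only that nothing in the buffer itself caps the message size any more.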

    There have been a lot of other optimizations and improvements as part of the implementation of the dynamic redundant masters. I'll be posting a full list once I get everything synchronized and tested on all platforms.
       
      Tue, 30/Jun/2009 1:15 PM
    Alex
    20 Posts
    Great to know. Been holding off building a new drive image until the new version is out.

    Kudos again on the great work.
       
      Tue, 30/Jun/2009 10:36 PM
    Robin
    1138 Posts
    To keep you updated:

    The extra challenges of the Master being able to shift between being a Master and being a client have helped shed some light on issues that clients were having in general when things went wrong. Some of the weirder issues that may have occurred from time to time include a process disconnecting and then never being able to reconnect until you shut it down and restart it.

    I'm looking into this issue right now. Unfortunately, it's something buried deep, because it doesn't happen consistently, and when it does, there seems to be nothing different about the operation of the process, except that the listening messenger simply never gets any connections, even though the OS reports that the socket is open and listening. For some reason, any connecting clients are always denied until the process restarts.

    Once this issue is worked out, the last major hurdle is what to do about having multiple masters on a network at the same time. The system will currently detect this situation and resolve it on its own, but there are some options about what should happen.

    Let me describe a scenario. Imagine I have a farm at the office, and I just got a new laptop. Say I install Smedge at home, so it has never seen the office network data. I happen to leave it running on my laptop, and I go to work.

    At work, I plug my laptop into the network and it finds the Smedge network. Because my laptop had been running isolated from this network, it had set itself up as a Master, serving nothing in particular to no other machines but itself, but still ready to go. When I connect it to my office network, the master on my laptop and the master on the office network find each other, and one will choose to resign.

    What happens next? If my laptop resigns, no worries, the other master stays master, and updates my laptop with all of its data. Now my laptop is synchronized with the system, and ready to go. But, say the office machine resigns. Suddenly, my laptop becomes the master for the whole system. The office network dumps all of the jobs that have been queued, and is updated by the (empty) queue of Jobs on my laptop. Poof, all of the jobs on my farm just vanished.

    Obviously, this is not workable. There are two general approaches to help resolve this:


    1. Adjust the policy used to determine which Master resigns.


      Choosing a Master at random is clearly not going to work. Choosing one based on the start time is also no guarantee, as I could have started my laptop before the Master restarted at the office. The policy could include something like the number of connected clients, the number of jobs, or the most recently updated engine or job time. These may help reduce the incidence of this situation.


    2. Adjust the behavior when masters are merged.


      The "cleanest" approach is have one master simply dump all of its data, and do a refresh from the other. Instead, the resigning Master could forward all of its job and engine data to the remaining Master, and "merge" the data. This could lead to old jobs that had been deleted being re-inserted into the job queue, instead of the opposite problem, which could be equally or more annoying



    Of course, the best solution is not to run the dynamically redundant SmedgeMaster process at all on a machine that is not regularly connected to the same network. If the Master is not running, it can't possibly become the primary master, and none of this can happen.

    There is another option that would be more useful on larger networks, where there is a vastly increased potential for the problem, say from artists with Smedge running on their personal laptops coming to and from their home and work networks. The default "policy" for selecting the master could be overridden by allowing administrators to specifically define which machines are allowed to become Master. If this set was empty (the default state), any machine could become the Master. If not, the machine would have to be included in the set. If it was, then the normal policy determination would select a specific Master, but if not, then the machine would not be allowed to be a Master at all.

    This option would mean that the system would be in a vulnerable state when you first install it, but would be easily customized by an administrator one time, and then the problem would be eliminated (as long as the machines included in that list were not used in a manner that could produce the behavior).
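The proposed override boils down to a very small rule. This is only a sketch of the rule as described; the function name is made up:

```python
def may_become_master(machine: str, allowed: set[str]) -> bool:
    """An empty allowed set (the default) means any machine may become Master."""
    return not allowed or machine in allowed

# Default installation: vulnerable, but everything works out of the box.
assert may_become_master("laptop", set())
# After the admin restricts the set, stray laptops are excluded.
assert not may_become_master("laptop", {"render01", "render02"})
assert may_become_master("render01", {"render01", "render02"})
```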

    Anyway, that's a pretty technical description of what's going on and where I am, and I hope it makes sense. Let me know if anyone has any thoughts on this, or any other questions related to the dynamic master, or Smedge 2.5 operation.

    Thanks
    -r

    Edited by author on Tue, 30/Jun/2009 10:38 PM
       
      Fri, 03/Jul/2009 6:33 PM
    Robin
    1138 Posts
    Hi everyone. Well, the communication issues seem to be resolved at this point. There were some bugs in how the communication system reconnected after a failure. There was also a bug in the low level data transport buffer. This probably did not cause any issues with old versions, because the system in question was not used by earlier versions of the program.

    I did have another concern about the redundant master system that I wanted to share with everyone. This is not so much an issue about the software itself, but one that will be a user issue, and one over which I have just about zero control (that I can think of at this point anyway).

    Imagine that you have 2 machines, and both are up and running. You submit a Job, and the 2 machines both get updated with the new job, and it starts going. Now, about half way through the Job, you turn off one of the computers. The other proceeds to continue working, and eventually finishes the job. Now, you turn that computer off, then later start the first computer again. What's going to happen?

    The first computer was disconnected from the system (not to mention powered off) as the job finished. So, it has the job and half the history, but the other half is missing. As far as that machine knows, the second half of the job is still pending, and so it starts working away. This, clearly, is wasted effort, and it is annoying when a finished (and even potentially deleted) Job is restarted by a machine coming online. If you restart the second machine, it will synchronize with the first, and it, too, will forget about all the work it already did on the Job.

    This situation involves having a Master that was not up-to-date become the managing Master on your network before it gets updated. This is similar to the situation in my earlier post. The difference is that the previous situation involved machines that are both online trying to merge their differences. This situation involves machines that are both offline dealing with different points in the history of Jobs when they restart. Specifically, this second situation is entirely dependent upon user interaction with the system, whereas the other could arise from factors outside of the user's control, like network failures or machine failures.
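One common safeguard for this kind of stale-history problem in replicated systems generally (offered as an illustration, not as Smedge's implementation) is a revision counter: stamp the database on every change, and make a rejoining Master with a lower revision adopt the newer database instead of handing out old work.

```python
class JobDatabase:
    def __init__(self, pending):
        self.revision = 0             # bumped on every change
        self.pending = set(pending)   # work units still to be rendered

    def mark_done(self, unit: str) -> None:
        self.pending.discard(unit)
        self.revision += 1

    def sync_from(self, other: "JobDatabase") -> None:
        """Adopt the other database wholesale if it is newer."""
        if other.revision > self.revision:
            self.revision = other.revision
            self.pending = set(other.pending)

# Machine A is powered off halfway; machine B finishes the job alone.
a = JobDatabase({"unit1", "unit2"})
b = JobDatabase({"unit1", "unit2"})
b.mark_done("unit1")
b.mark_done("unit2")
a.sync_from(b)   # on restart, A sees B's newer revision and adopts it
# a.pending is now empty, so A does not re-render finished work
```

The hard part, of course, is that a freshly restarted Master has to find a newer peer to sync from before it starts handing out work, which is exactly the window described above.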

    I'm not sure yet what to do about protecting users from themselves; if anyone has any ideas, I'd love to hear them. Again, the only way to ensure that you completely avoid the situation is to not run the Master on machines where this can happen.

    Thanks
    -robin
       
      Fri, 03/Jul/2009 8:02 PM
    Robin
    1138 Posts
    If anyone is worried about this new release being such a drastic change from the old, let me give you two levels of protection from the new operation.

    First, Smedge is still built of the same exact components as before. If you install Smedge the way you currently have it installed, with a single specific Master machine, specific Engine machines, and everyone else as a Workstation, then it will still work exactly the same. You can also disable the master mirror system on your Master, which prevents parallel mirrors: any other SmedgeMaster process that connects to mirror it, whether started manually, by an automatic service, or by another component process, will simply and immediately stop.

    Second, you can still use the dynamic Master system without allowing the data to be saved on disk. As long as at least one process is running, all of the data will be preserved, but if every process stops, then all data will be permanently lost. This is useful for running Smedge on machines without write permission to save the machine specific data. As long as at least one machine is running, others can join in.

    Both of these options also allow exceptions, so you can build up a central core of several interconnected Master servers for redundancy. Only one would be the primary Master at a given moment, and the others could serve as dynamically available backups, ready to take over at any time. The rest of your nodes could be automatically excluded from being Masters.

    By the way, the next major release, Smedge 4.0, will take this further by allowing the masters to dynamically distribute the load. This will take Smedge's already incredible scalability to a new order of magnitude. Once this release is ready, 4.0 will be a grand wish fulfillment, including a bunch of the long requested features, optimized functionality for loading scenes once per job (for products that support it), a GUI refresh, a simplified common work flow for programmatic use, and a new data storage system which will unify access to all system data. This last will be cool for those who like to customize Smedge, because you will have access to every parameter of every object associated with every event. If you want the Engine process start time from the Engine that just started a work unit (for whatever reason), you could access it with something like a $(EngineID.StartTime) variable. (Details are still in design!)
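The $(Object.Parameter) syntax is explicitly still in design. Purely as a guess at how such an expansion might behave, here is a sketch using a made-up variable table; none of these names come from Smedge:

```python
import re

def expand(template: str, variables: dict[str, str]) -> str:
    """Replace $(Name.Field) tokens with values from a lookup table.
    Unknown tokens are left untouched rather than erased."""
    return re.sub(r"\$\(([\w.]+)\)",
                  lambda m: variables.get(m.group(1), m.group(0)),
                  template)

vars_ = {"EngineID.StartTime": "2009-06-30 10:36:00"}
line = expand("Engine started at $(EngineID.StartTime)", vars_)
# line is "Engine started at 2009-06-30 10:36:00"
```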

    The Smedge 4.0 official release will be in 2010.
       

    ©2000 - 2013 Überware. All rights reserved