I am taking a look at gc3pie as something to help in building dynamic workflows of large job campaigns to be executed in grid environments such as SGE
Been reading through the docs and so far it sounds like there are at least some promising ideas to use.
I like the notion of having auth and resources setup via a config file.
Having a bit of difficulty understanding the architecture, especially in the GC3Apps section.
I'm sure it will click eventually, but as of right now I am completely stumped and feel like a fool. :(
So I am thinking the GC3Apps is a collection of scripts to run well known "pipelines", "algorithms", or other units of known calculations.
I don't think anything in GC3Apps part of gc3pie are applicable to work currently on our plate.
Perhaps, they might serve as examples as to how to write our own scripts.
I think I am understanding it a bit better. I kind of like the
TaskCollection object models. Still not clear to me yet on how to structure workflows.
Next step, is to work through some tutorials.
One thing to note is that the links to the slides are broken in the docs, but there are available in the source tree.
All the examples in the tutorials are calling the base class methods directly rather than using
super(), for example:
class SquareApplication(Application): def __init__(self, x): self.to_square = x Application.__init__(self, ...)
class SquareApplication(Application): def __init__(self, x): self.to_square = x super(SquareApplication, self).__init__(...)
I wonder why this is done.
One thing I immediately like about this GC3pie over SAGA-python is the API.
For instance, with SAGA-python you have to explicitly copy over your input files before starting to execute anything. With GC3pie, you just list your local input files along with the executable that you want to execute remotely, and it handles copying the files to the remote server where the execution will happen.
Same for listing the outputs. This list (passed to the Application constructor) will be the files that will be copied back locally once the execution completes.
Firing up an SGE cluster this morning on EC2 using StarCluster to have to work through some of the tutorials and exercises in GC3Pie.
Now that this is available (
starcluster start mycluster is pretty awesome for on demand compute clusters), I am going to work through the exercises at the end of part03.
Dealing with environmental issues, now.
It connects fine, however, when I connect manually via SSH with the same user/key, I have a complete environment (all the expected env variables are set). When the script connects with
SshTransport it's throwing exceptions and error messages that indicate that the environment variables are not set.
So the environmental issues were caused by the StarCluster image loading SGE environment variables on login shells instead of through both login and non-login shells.
I added this to my user's
~/.bashrc script and all is working now:
if [[ $- != *i* ]]; then . /etc/profile.d/sge.sh fi
There was a minor change I needed to make to squares.py as it was causing a syntax error on the node of execution:
On line 33, I changed:
arguments=['expr', x, '*', x],
arguments=['expr', x, '\*', x],
In order to escape the
* character so that it would be interpreted properly by the shell.
Have successfully working through most of the 2 exercises at the end of part03's slide deck, however, having a bit of trouble now in understanding how the postprocess/terminate steps work and how their output makes their way back to the consumer.
I think the warholize demo might provide deeper insight into how all this works together with a bit more real-world example.
So after running
apt-get install -y imagemagick across all nodes in my cluster, the warholize script ran against:
python warholize.py -C 1 ~/Pictures/paltman.png --copies 9
Using all 9 nodes of my cluster in parallel following the following workflow:
The coolest part, I think, is how you can dynamically influence the next steps, either by changing parameters based on previous output, or selecting different jobs entirely at runtime.
It's all about how you write your
next() method on your workflow objects. For example:
The main objects to work with in gc3pie are:
The *Collection objects can other *Collections as well as Applications.
The Workflow isn't really it's own object but the top most Collection that owns all other collections.
You compose workflows by combining the various Applications (or executable bits that will run on a node in the cluster) to either run in parallel or serial, and these collections can run in parallel or serial all the way up until you have a single collection that will run everything. This top most collection is the Workflow.
Each collection specifies a task list in the
__init__ constructor. You can also implement a
next() method that will be called after each task completes.
You have access to this list and can dynamically pick the next Application to run from within this method.
After a task has completed, the next method is called with the index of the finished task in the ‘self.tasks‘ list
the return value of the next method is then made the collection execution.state.
If the returned state is ‘RUNNING‘, then the subsequent task is started, otherwise no action is performed.
For every distinct binary (or binary with certain parameters), I think an
Application subclass will be created in a library of apps that can be used to construct Workflows of various types of collections that glue these apps together funneling the output from some into the input of others and making decisions about what next to run in the
The trick is how we marry this construction of workflows up with Django models. I think it might be a little far reaching to make it complete data driven.
I think the goal (need clarification) of using models/Django is record what was run, when, and with what inputs (as well as what outputs, final and intermediary) were generated.
Whatever local files you pass to the
inputs parameter of the Application instance, will get pushed to the remote server, this includes any binary or script that you are going to execute. It's not just limited to data files used for input.
I wrote on an earlier card how I needed to add:
if [[ $- != *i* ]]; then . /etc/profile.d/sge.sh fi
~/.bashrc but failed to mention it needs to be on the
sgeadmin user's account (or whatever user you have configured to connect with).
Just produced a Littlewood fractal of (18 degrees/400 size) but it seems skewed, perhaps I made an error in transposing the code when moving to scripts: