There is a lot to wrap one's head around in this project, so first I am just going to do a simple little experiment to see how it will coordinate activities on my local machine.
After that, I think I should figure out how to simulate running it on multiple nodes. Without access to a supercomputer, I am not sure how I'll do this.
Since we'll be interfacing with SGE, and ruffus supposedly offers support for SGE, I think we'll be in good shape. I just need to understand how it works instead of expecting magic.
I think I'll use fractal image generation as a problem to solve and test this out.
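To make that concrete, the kind of computation I have in mind is a roots-density heatmap, something in the spirit of the sketch below. This is only my sketch, assuming polynomials with all ±1 coefficients; the function names and parameters are made up, not the real script.

```python
import itertools

import numpy as np

def littlewood_roots(degree):
    """Roots of every degree-n polynomial whose coefficients are
    all +1 or -1 (leading coefficient fixed at 1, since negating a
    polynomial does not change its roots)."""
    all_roots = []
    for tail in itertools.product((1.0, -1.0), repeat=degree):
        all_roots.append(np.roots((1.0,) + tail))
    return np.concatenate(all_roots)

def roots_heatmap(roots, size=200, extent=2.0):
    """Bin the root locations into a size x size 2-D histogram
    over the square [-extent, extent]^2 of the complex plane."""
    hist, _, _ = np.histogram2d(
        roots.real, roots.imag,
        bins=size, range=[[-extent, extent], [-extent, extent]])
    return hist
```

The roots loop is embarrassingly parallel (each polynomial is independent of the others), which is what makes this a nice test problem for a pipeline.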
There is parity now (verified by running), but I did remove writing the roots to a file, keeping them in memory instead. My thinking is that this will make it easier to parallelize the roots generation. We'll see.
I have replaced the normal sequential run of the top-level heatmap functions with making them tasks and running them in the pipeline. Still no parallelization; that's next. This just simulates exactly how the script ran before.
Interesting thing to note: my previous runs of degree-16, size-200 fractals took approx. 33-36 seconds. Replaced to run in the pipeline, they take 45 seconds. That seems like a good bit of overhead, but perhaps it isn't linear, and it will be more than made up for by running the roots in parallel.
Finished the conversion. Running with multiple processes certainly speeds up the roots calculation a bit, but on my MacBook Air, with only 2 cores, it's hard to see that much of a difference.
I wonder what it would do on larger machines with more CPUs and more cores. Adjust how many processes run in parallel by changing the multiprocess argument at the bottom of the file.
The next phase of this research is figuring out how ruffus interfaces with SGE and farms work out to it.
Picking up on this adventure, I am going to try to use SAGA-python as an abstraction layer over different distributed execution environments to run external processes.
I had originally thought SAGA-python was on Google Code, which dismayed me a bit because I already see a change in the code that I want to contribute back. So I was happy to learn it's on GitHub.
First step is going to be pulling out:
Into a stand-alone script, and then replace the internals of that function with a SAGA execution of that new script.
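As a local smoke test of that split, before SAGA enters the picture, the stand-alone script can be exercised as a plain child process. Everything here is hypothetical scaffolding (the worker is inlined as a string for brevity; the real one would live in its own file):

```python
import json
import subprocess
import sys

# Stand-in for the stand-alone script that will eventually be shipped
# to a remote node: read coefficients from argv, print roots as JSON.
WORKER = """
import json, sys
import numpy as np
coeffs = json.loads(sys.argv[1])
roots = np.roots(coeffs)
print(json.dumps([[float(r.real), float(r.imag)] for r in roots]))
"""

def run_roots_worker(coeffs):
    # Later, this subprocess call is what gets swapped for a SAGA job.
    result = subprocess.run(
        [sys.executable, "-c", WORKER, json.dumps(coeffs)],
        capture_output=True, text=True, check=True)
    return [complex(re, im) for re, im in json.loads(result.stdout)]
```

The nice property is that the interface is already process-shaped: argv in, stdout out, so moving the execution behind SAGA shouldn't change the caller.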
Working with SAGA-python is very hard.
Finally got remote execution via SAGA/ssh with SFTP transfers working in a ruffus job, but it was painful.
Also, it seems very flaky: flaky in the sense that I am getting non-deterministic IO size mismatch errors. My connection is via localhost, so I would not expect any network issues.
OK, the flakiness was all my doing. I had copied and pasted from some example code and didn't correct it for my particular use case.
I do wonder if these remote nodes will come prepared with a runtime environment, or if I will be responsible for bootstrapping all of them.
Another thing to sort out is how to choose a node.
With a PBS cluster, you can queue the task, so presumably the cluster is choosing the node for you.
I guess the point of "grid" computing is that you don't pick a node, but rather you address a cluster.
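That is, as far as I can tell, exactly how SAGA spells it: the job service URL addresses a backend, not necessarily a machine. These scheme strings come from the saga-python adaptors (the hostnames are placeholders):

```python
# saga.job.Service endpoint URLs: same pipeline code, different backends.
LOCAL = "fork://localhost"        # run as a local subprocess
ONE_NODE = "ssh://node-hostname"  # one specific node, chosen by me
CLUSTER = "sge+ssh://head-node"   # queued with SGE; the cluster picks the node
```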
I do wonder how you bootstrap nodes in your cluster. Surely we don't have to bootstrap an execution environment every time.
I'd like it to be a split: something like providing an image with a virtualenv prepped with numpy and any other libs that might be needed, while the code that needs to execute gets copied over at run time.
I plan on going through the SAGA-python interface, but reading about SGE makes me wonder just how this bootstrapping is done.
So I got a 10-node cluster fired up with little problem.
Now it's just a matter of getting the SGE plugin in SAGA-python working. So far, it's all kinds of SSH connection issues.