(pygr is a neat bioinformatics framework in Python.)
After some commenters on my last post seemed happy to hear that pygr was the focus of some summer work, I realized I had only discussed the pygr summer work in a post to the biology-in-python list.
Whoops.
So, here's the scoop: not only is pygr the focus of Rachel McCreary's Google Summer of Code project, but Jenny Qian will be using pygr to build an ENSEMBL interface, also as part of the Google Summer of Code.
That's not all!
In addition to Rachel and Jenny (under the sterling mentorship of Chris Lee, Robert Kirkpatrick, Namshin Kim, and myself) I have two MSU students working with me over the summer, Alex Nolley and Marie Buckner. They'll both be working with pygr-related things, although like Jenny their efforts may end up being more on ways to use pygr than on pygr's code itself.
I also have a grad student or two that may drop in on pygr, if only to use it for something research-y.
So all in all, pygr will get a lot of love this summer. Hopefully we can polish the code and documentation and tutorials to the point where the learning curve is as minimal as it can get, and this fabulous package will become readily available to many others...
Why am I personally putting so much effort into pygr? Well, I've been using it more and more over the last few months, and (somewhat like scipy) it's transformed my work by turning annoyingly difficult data organization problems into trivial Python transformations. I can literally throw together a custom genome browser in a matter of hours -- I've implemented two or three already, for different projects -- and it has enabled several new research programs. pygr seems to be one of those rare packages (kind of like Python itself) that is not only functional and effective but presents a unified and coherent intellectual interface. pygr is the only good middleware layer I've seen for sequence intertwingling in bioinformatics. It's not that mature yet, but it has serious promise, and I'm hoping to get in on the ground floor, so to speak :).
cheers,
--titus
Dear Lazyweb, help!
I'm embarking on a number of summer projects in my new lab at MSU, and several of them focus on using pygr to do cool genomic stuff. In particular, I'm planning to build a personal genome annotation system that will let people run their own full genome Web sites and annotate the genomes with private information such as Solexa data, cDNA/EST projects, ChIP-seq, cis-regulatory reporter constructs, ncRNA predictions, etc. etc. (If you're interested in this sort of thing, get in touch -- it will, of course, be open source and open development, albeit in Python :)
As I've been thinking more about how to do the display side of things, I've been running headfirst into a serious lack of knowledge. I would like to make an interface that looks somewhat like your standard genome browser/GMOD/UCSC interface, such as this UCSC view of the chicken genome. I already have the basics of that view working; for example, see this simple example and a group-feature example. But I'd like to add more -- a LOT more -- interactivity.
Ideally I'd like to be able to draw simple objects (squares, rectangles, lines) on some sort of canvas and then use JavaScript and AJAX to pop up windows and display bits of information. But I don't really know this space of functionality very well.
So I'm turning to the lazyweb.
Are JavaScript+image maps the right way to go (for example, this, this, and this)? Do they work well with multiple browsers? Or are there good JS libraries for drawing images on the fly in the browser? Is SVG a good thing to look at? Were you stuck with this task, what would you use?
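To make the image-map option concrete, here is a minimal sketch of how a server-side script might emit a plain HTML image map for a row of features drawn on a static picture. Everything in it is an assumption for illustration: the function name `feature_image_map`, the `(name, start, end)` feature tuples, the one-feature-per-row layout, and the `href` targets are all hypothetical and are not taken from any existing genome browser.

```python
from html import escape


def feature_image_map(features, start, end, width_px, row_height=12,
                      map_name="features"):
    """Build an HTML <map> with one rectangular <area> per feature.

    `features` is a list of (name, feature_start, feature_end) tuples in
    genome coordinates; `start`/`end` bound the displayed region.
    Each feature is placed on its own row (a simplifying assumption).
    """
    span = end - start
    areas = []
    for row, (name, f_start, f_end) in enumerate(features):
        # Integer arithmetic maps genome coordinates onto pixel columns.
        x1 = (f_start - start) * width_px // span
        x2 = (f_end - start) * width_px // span
        y1 = row * row_height
        y2 = y1 + row_height
        label = escape(name, quote=True)
        areas.append(
            '<area shape="rect" coords="%d,%d,%d,%d" title="%s" href="#%s">'
            % (x1, y1, x2, y2, label, label)
        )
    return '<map name="%s">\n%s\n</map>' % (map_name, "\n".join(areas))


# Hypothetical features on a 10 kb region rendered 500 pixels wide.
features = [("geneA", 1000, 2000), ("geneB", 3500, 5000)]
html_map = feature_image_map(features, start=0, end=10000, width_px=500)
print(html_map)
```

The resulting map would be attached to the rendered picture with the `usemap` attribute of the `img` tag; JavaScript handlers for pop-ups could then hang off the individual `area` elements.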
The most important things for this project are, in order of importance:
- basic functionality (JS image maps seem fine for this)
- cross-browser functionality
- selection (e.g. GMOD RubberBandSelection)
- flexibility: reordering and redrawing of images.
Your thoughts are much appreciated! Please drop me a line or comment, whichever is most convenient. I'll summarize the options.
thanks,
--titus
p.s. I'm perfectly fine with "Google this, dumby!" I just don't have much in the way of google keyword knowledge in this area...
I was looking for an introductory book on peer-to-peer (P2P) applications and their use in grid computation. Web services were a bonus, as they are something I don’t usually play with.
The book is split into four parts. The first chapter is an introduction to distributed systems, with definitions and some examples of what is and what is not a centralized (as opposed to decentralized) application or framework. For people new to P2P, the examples and the problems are well presented.
The first part goes into the details of a distributed environment. The P2P solution is presented along with its specific aspects, both social (P2P does not have a good reputation, to say the least) and technical, such as routing (reaching peers behind firewalls and routers). As web services are one of the main subjects of the book, they are presented next. The last chapter tackles grid computing through its evolution, the current definition of a grid and the Globus Toolkit 2 architecture. Those chapters are really interesting because they lay the groundwork for the rest of the book.
The second part covers several P2P technologies that can be used, as well as some specific issues. Jini and Gnutella are presented first. They were not developed to answer the same questions: Jini is about remote objects and Gnutella is about file sharing. These technologies introduce the issues of scalability and security; the first concerns adding more nodes to the grid, the second protecting the grid. Finally, Freenet and JXTA are presented. The first is dedicated to file storage on a distributed data grid, the second is a generic P2P framework. The chapters on the different technologies do not bring more information than can be found in tutorials on the net, but they explain them clearly. Scalability and security are aspects that are sometimes forgotten in the design of a distributed system, so their presence in this part of the book is a helpful reminder.
Part three tackles Jini, JXTA and web service deployment. The first two chapters have some code samples that can be used; for web services, there are only some XML fragments. Several formats are presented in this chapter with their advantages and drawbacks.
Finally, the fourth part presents web services applied to grid systems, which gives grid services. The Globus Toolkit 3 can be used for those grid services, and the future version 4 is introduced as well. This part is the shortest, maybe because these special services are not widely used, and a lot still has to be explored before a clear software design emerges (one which may be used by the Globus Toolkit 4, according to the book).
The good writing style helps with reading the book, as some pages can be difficult to understand. The final goal is to present grid services, with the underlying frameworks and tools that are grid (and P2P) systems and web services. The beginner is taken from the basics to advanced concepts, which can be applied to concrete grids.
If grid computing, how it can be done, and web services are of interest to you, I suggest you read this book.
From P2P to Web Services and Grids: Peers in a Client/Server World (Paperback)
by Ian J. Taylor, Andrew Harrison
ISBN: 1852338695
I am currently changing jobs and changing countries. This is why I have been really bad at dealing with questions on the mailing-lists, bug-reports or feature requests.
So far I have been working as a physicist, doing atomic physics (Bose Einstein Condensation). I studied quantum physics, mostly theory, and I did a PhD in an experimental lab, building a couple of experiments on Bose Einstein Condensation and atom interferometry. After this, I moved to Florence to do a post-doc also on a BEC experiment.
This kind of work is very experimental. These experiments are monsters that you have to keep alive by doing a lot of homemade mechanics, optics, and electronics. I thought I would love that, because I used to like working with my hands, but I grew tired of it. I wanted to work more with abstractions. In addition, I am a computer geek, and the parts of my job I preferred were related to computers.
My contract ended at the end of April, and I have not renewed it. I was missing my girlfriend and wanted to find an excuse to come back to Paris. So now I am jobless, living at the expense of my girlfriend. I decided to take some time without a job, as I have the feeling I have been working without stopping for the last few years, not having time to travel and visit the world as I like to. We are planning a three-week trip to Uzbekistan and Kyrgyzstan in two weeks.
After this I am going to devote my summer to hacking. The big news is that I am going to the States. I will spend most of my time in Austin, working for Enthought. I am very excited about this, as I see it as an opportunity to learn more about building scientific GUIs with Python. Building usable scientific programs is something that I am passionate about. I will also spend some time at Berkeley, with Fernando Perez, hopefully to work on Ipython1. I need to thank Enthought for making this possible for me, as they are providing the money. With some luck, this summer I will be productive on the free software side.
Of course right now I am battling with moving houses, fighting for visas, trying to fall back on my feet and organize the summer. I still don’t have my visa for the states, and it is making me nervous. I would really hate to have to cancel my trip to Kyrgyzstan because of visa problems with the states: I take time off work, I expect to spend it enjoying myself, and not waiting for visas.
So I am quitting atomic physics. I am starting a new adventure in something totally new for me. Starting from October, I will be working with JB Poline and Bertrand Thirion, at Neurospin, on neuroimaging. This work is mostly data processing, even though it has a lot of interplay with the physics of NMR. This is something very new for me and I will have to discover a new field. The good news is that a lot of the work is centered on computers, and one of the core technologies used at Neurospin is Python.
I noticed some days ago that I mainly use one design pattern in my scientific (but not only) code: the registry. How does it work? A registry is a list/dictionary/… of objects. Applications add a new entry when one is needed, and a user can then tap into the registry to find the most suitable object for their purpose.
In fact, the registry is one of the best replacements for the switch statement. It is far more modular, as new cases can be introduced and deleted, and it is more readable as well.
I used a registry in several pieces of code:
Python, with its dynamic and duck typing, is naturally inclined to use registries, IMHO. This is more scientific-oriented coding, but the ease of use of a registry is very helpful in my everyday work.
Here is a sample of the automatic use in pyP2P:
__init__.py:
    from advertisement_core import *
    from peer_advertisement import *

advertisement_core.py:
    class Advertisement(object):
        pass

    registry = {}

peer_advertisement.py:
    import advertisement_core

    class PeerAdvertisement(advertisement_core.Advertisement):
        pass

    advertisement_core.registry["PA"] = PeerAdvertisement
This way, when the module is imported, the registry is also automatically populated.
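As a single-file variation on the same idea, here is a small sketch of a registry filled by a class decorator and then used in place of a switch-like if/elif chain. The `register` decorator, the `make_advertisement` helper and the `GroupAdvertisement` class are hypothetical names added for illustration; they are not part of pyP2P.

```python
registry = {}


def register(tag):
    """Class decorator that files a class in the registry under `tag`."""
    def decorator(cls):
        registry[tag] = cls
        return cls
    return decorator


class Advertisement(object):
    pass


@register("PA")
class PeerAdvertisement(Advertisement):
    pass


@register("GA")  # hypothetical second type, only here to show a new "case"
class GroupAdvertisement(Advertisement):
    pass


def make_advertisement(tag):
    """Look the class up by tag instead of walking an if/elif chain."""
    try:
        cls = registry[tag]
    except KeyError:
        raise ValueError("unknown advertisement type: %r" % (tag,))
    return cls()


print(type(make_advertisement("PA")).__name__)
```

Adding a new case then means adding one decorated class; the lookup code does not change.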
After Ipython and Sympy, Mayavi is now using sphinx to build its docs. Sphinx is very neat because it allows for high quality pdf and html from the same restructured text source. The killer feature is that the resulting html pages have a builtin search that works with javascript, and thus works on the client without the need of a server.
In addition, the developer is very responsive and dedicated to making sphinx versatile enough to generate high-quality docs for many packages. As a result many Python projects are switching to sphinx. First Python itself (that’s what sphinx was created for), but now more and more others. It seems that zope is even considering it. One great side effect is that documentation for different Python modules will be consistent, with the same look and feel (although you can tweak sphinx output if you want).
We don’t have a server serving the html docs yet (it is planned, we just need a bit of time), but you can check out the pdf generated here.
This book is different from the last two books I read. It tackles a specific Python library, Twisted, and how to use it.
Twisted is a network library aiming at simplifying the development of network applications. It is based on an event loop for all processing (unfortunately, there is no word in the book about managing several event loops, as is the case with GUI-based applications).
After an introduction to event programming with simple clients and servers, the reader is introduced to basic web clients and servers. Twisted offers a lot of bridges to create web services with XMLRPC or SOAP. The explanations and the code are pretty clear, and it is easy to build one’s own small distributed application with these blocks.
When authentication is introduced, it is hard to understand at first. Zope interfaces are used, but I didn’t find an explanation of what they are and what the function implements() is and does. One can work it out from the context, but a complete introduction to these techniques should be given at this point. Once authentication is understood, other services are presented, like mail clients and servers (how to send a mail, how to process the information in a mail to send an answer), as well as NNTP.
SSH is only introduced towards the end of the book, and it is not explained simply, as it is mixed with shells. Finally, network applications are often services or daemons; how to create them is covered in the last chapter.
This book is good: the many explanations and code samples (some mistakes can be found here and there) help in understanding how to use the library. Some parts of the book are outdated, so I hope that a new edition will be published soon, and some software tools should be explained in more detail. Not every aspect of Twisted is covered in the book (it’s only Network Programming Essentials), but once the basics are known, the rest can be learnt from the documentation.
Twisted Network Programming Essentials (Paperback)
by Abe Fettig
ISBN: 0596100329
This release contains patches from 15 developers, which is so far the highest number of people/release (33 people have sent patches to SymPy so far, see the list of contributors):
by Ondřej Čertík (noreply@blogger.com) at April 26, 2008 04:13 PM
I’ve already given some answers in one of my first tickets on manifold learning. Here I will give some more complete results on the quality of the dimensionality reduction performed by the most well known techniques.
First of all, my test is about respecting the geodesic distances in the reduced space. This is not possible for some manifolds like a Gaussian 2D plot. I used the SCurve to create the test, as the curve is traversed at unit speed and thus the distances in the coordinate space (the one I used to create the SCurve) are the same as the geodesic ones on the manifold. My test measures the matrix (Frobenius) norm of the difference between the original coordinates and the computed ones, up to an affine transform of the latter.
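One way to read this measure is sketched below with NumPy: fit, by least squares, the affine transform that best maps the computed coordinates onto the original ones, then take the Frobenius norm of what is left. This is only a sketch of the idea, not the code behind the numbers in this post; the function name `affine_frobenius_error` and the toy data are assumptions.

```python
import numpy as np


def affine_frobenius_error(original, embedded):
    """Frobenius norm of `original - (embedded @ M + b)` for the best M, b.

    Both arguments are (n_samples, n_dims) arrays. The affine map (M, b)
    is found by ordinary least squares.
    """
    n_samples = original.shape[0]
    # A column of ones lets the least-squares fit include the offset b.
    design = np.hstack([embedded, np.ones((n_samples, 1))])
    coeffs, _, _, _ = np.linalg.lstsq(design, original, rcond=None)
    residual = original - design @ coeffs
    return np.linalg.norm(residual, "fro")


# Toy check: an embedding that is an exact affine image of the original
# coordinates should give an error close to zero.
rng = np.random.RandomState(0)
coords = rng.rand(100, 2)
embedding = coords @ np.array([[2.0, 1.0], [0.0, 3.0]]) + np.array([1.0, -2.0])
print(affine_frobenius_error(coords, embedding))
```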
I tested several noise levels :
Here are the results:
| Method | no noise | Gaussian Noise 5% | Laplacian noise 2% | Impulsive noise |
|---|---|---|---|---|
| PCA | 43.6 | 43.6 | 44.4 | na |
| Isomap | 3.01 | 8.55 | 7.01 | 3.80 |
| Sp | 2.29 | 2.94 | 6.46 | 2.93 |
| Ssam | 2.61 | 2.60 | 6.10 | 3.22 |
| Scca | 3.01 | 6.22 | 4.70 | 3.09 |
| Laplacian Eigenmaps | 21.13 | 23.51 | 23.47 | na |
| Diffusion Maps | 67.50 | 67.76 | 67.54 | na |
| Hessian Eigenmaps | 3.05 | 18.57 | 20.51 | na |
| LLE | 40.1 | 90.2 | 69.2 | na |
The geodesic-based algorithms perform, obviously and logically, better than all the other algorithms. In my case, I want this to happen as I want to estimate a mapping function between the reduced space and the original space. This estimation and the effect of the reduction algorithm on it will be the subjects of future tickets.
After Advanced Computer Architecture and Parallel Processing, I’m going to review another book from the same series. As the title hints, the goal of this book is to introduce the tools that may be used in parallel, grid and distributed computing. This is the layer above the architecture the last book presented.
This book is split into six very different chapters. The first introduces the basic and necessary notions of computer or cluster architecture that are inherent to parallel computing. The issues that remain to be solved are clearly presented, although they are not the main topic of this book (which is programming tools and environments). People familiar with general computer science will benefit a lot from this chapter; people accustomed to grids and clusters will refresh their memories by reading it.
The second chapter is dedicated to message passing tools. After a small comparison between Distributed Shared Memory models (DSMs, which are presented later) and MP tools, the important aspects of these tools are presented. Then several of them are described, with their advantages and their drawbacks (often in terms of the richness of the interface they provide). Several pages present experimental results, from the communication time for a given message to specific parallel applications like FFT transforms, and the conclusions are very interesting (the widely used MPI tool may not be the most adequate one for your parallel application…)
DSMs are presented in the third chapter. Hardware-based and software-based ones are analyzed through their specificities. Although DSMs are very attractive, they are not that widespread in labs and clusters (as far as I know, and this is a very personal opinion). Here, some of the reasons are presented.
The next chapter is a “UFO” in the whole book. For the first time, there is code (!!) for three distributed-object protocols (with tests and times), and perhaps more than the usual text. At least, with DO, one can test even at the office (or even at home) whether the technology may be interesting. Besides, DO are not difficult to use, and they can be very efficient in a small application.
As the different tools are used on grids, the state of the art of the grid is summed up in the fifth chapter. It encompasses the goals of a grid and the associated management issues (security, data, scheduling, …), the different frameworks (that may use some of the tools presented in the first chapters) and mainly the Globus toolkit, and of course the applications of the grid (astrophysics for instance). This chapter is enjoyable because it shows that the frameworks are used and developed for a lot of applications. This leads to robust libraries that can be reused for one’s application(s). A small presentation of web services is given at the end of this chapter.
Finally, some development process stages are presented. A typical parallel application cannot be developed like a simple application (text processing, browser, …), even less scientific applications. What was surprising when I read this chapter is that the different tools that can help developing parallel applications are listed after the presentation of the whole process. Besides, an example is used as a support to show how to use the process, but as no tool is used in the very example, its impact (as an explanation and as a support) is limited. For each step of the process, the conclusions are not always shown. This is annoying as it would have been a good argument for the proposed software development (which seems appropriate for parallel scientific applications).
This book was far more interesting for me than Advanced Computer Architecture and Parallel Processing, but this feeling is biased as I have a larger background in electronics and computer architecture than in parallel tools. Nevertheless, I enjoyed this reading which taught me a lot about the tools I could use and their diversity (shared-memory models, distributed objects, frameworks, …).
So I'm pretty bullish on testing for maintenance reasons. It was nice to see how well it worked out for me when a user recently reported a problem with Cartwheel.
This is what happened: a third-party package (LAGAN) that the user was running through the Web interface depended on certain command-line behavior from 'sort'. Now, I wasn't aware that the command-line arguments to sort were still evolving, but apparently they are -- my latest Debian upgrade removed some options (the '+1' behavior) in favor of '-k 1'. In any case, I did this big upgrade of many packages, and didn't realize that this third-party program was now broken. (More on that later.)
The user reported weird results, so I went and verified that he'd set everything up properly and that this was in fact a real problem. Then I ran the Cartwheel automated test suite. Voila! Problem was instantly pinpointed in a reproducible manner.
I fixed the program (editing Perl, ick), re-ran the tests, and then re-ran the user's analyses. Tada, done.
OK, so, great, the tests pinpointed the error for me after the user had found it.
Why did I have to wait for a user to report it?
Because I wasn't running the tests under continuous integration on my compute server.
Why not?
Can't think of why.
What would you have done differently?
I would have made sure all my tests were passing on my compute server after I upgraded the thing, i.e. not been a schmuck.
What have we learned?
Tests are only useful if (first) you write them -- that's half the battle -- and (second) you run them. Oops.
More generally, it was fun to note that by putting a fairly high-level functional test on the batch-processing backend, I discovered a bug several levels down in my software stack -- a problem lying between a third-party package and a system utility. Unit tests wouldn't have found this bug, unless the third-party package had them (don't think so) and I was running the third-party package unit tests (good grief...)
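For a flavor of what such a high-level check can look like, here is a minimal sketch of a functional test that shells out to the system 'sort' and compares the end result, so that a change in the utility's command-line behavior shows up as a test failure. It is not the Cartwheel test suite; the function names, the sample input and the choice of sort key are assumptions for illustration, and it presumes a POSIX-style sort(1) on the PATH.

```python
import os
import shutil
import subprocess


def sort_by_second_field(text):
    """Sort lines on their second whitespace-separated field via sort(1)."""
    # Pin the locale so the ordering does not depend on the machine setup.
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        ["sort", "-k", "2"],
        input=text,
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # a rejected option makes sort exit non-zero and raises
    )
    return result.stdout


def test_sort_pipeline():
    """End-to-end check: fails if the system sort stops accepting '-k'."""
    assert sort_by_second_field("b 2\na 3\nc 1\n") == "c 1\nb 2\na 3\n"
```

Run under a test runner on every machine that hosts the pipeline, a check like this exercises the real system utility rather than a mock of it.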
OK, back to work.
--titus
I'm having a long-running discussion with some people about threading and why using threads with simple subprocess calls is almost certainly an overcomplicated (== BAD) use of threads. Everyone seems to think I'm wrong (at least, there's either deafening silence or straight out argument ;) and I think I finally figured out why.
The task at hand: use subprocess to run some command (say, 'ping') a bunch of times. Because the command is I/O bound, you want to run the commands in parallel. Should you use threads to do this? Is it necessary in order to achieve good performance?
Well, consider these two examples ('common.py' is down at the bottom; it just contains the list of IP addresses to ping, and a function to call subprocess.Popen).
nothread.py:
    from common import IP_LIST, do_ping

    z = []
    for i in range(0, len(IP_LIST)):
        p = do_ping(IP_LIST[i])
        z.append(p)

    for p in z:
        p.wait()
thread.py:
    import threading
    from common import IP_LIST, do_ping

    def run_do_ping(addr):
        p = do_ping(addr)
        p.wait()

    ###

    # start all threads
    z = []
    for i in range(0, len(IP_LIST)):
        t = threading.Thread(target=run_do_ping, args=(IP_LIST[i],))
        t.start()
        z.append(t)

    # wait for all threads to finish
    for t in z:
        t.join()
Both of these work fine, and in both cases are easily modifiable to retrieve the output, exit status, etc. of the ping command. (In the threaded example you have to keep track of 'p' in 'run_do_ping' to retrieve this kind of info, and I wanted to keep things as simple as possible.)
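As a sketch of that modification for the non-threaded version, the snippet below starts every process first and only then collects each one's output and exit status. To stay self-contained it launches small Python child processes as stand-ins for 'ping'; the `start` helper and the commands are assumptions for illustration. It uses `communicate()` rather than `wait()` because, with `stdout=PIPE`, a child that writes more than the pipe can hold would block until someone reads it.

```python
import subprocess
import sys


def start(command):
    """Launch `command` without waiting for it; capture stdout as text."""
    return subprocess.Popen(command, stdout=subprocess.PIPE,
                            universal_newlines=True)


# Stand-ins for the ping commands: each child prints one number and exits.
commands = [[sys.executable, "-c", "print(%d * %d)" % (n, n)]
            for n in range(4)]

# Start everything first, so the children run side by side...
procs = [start(command) for command in commands]

# ...then gather output and exit status from each one in turn.
results = []
for p in procs:
    out, _ = p.communicate()
    results.append((p.returncode, out.strip()))

print(results)
```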
They also run in about the same amount of time, although the non-threaded one is quicker by a few milliseconds for me. I think this is because thread starts & joins are extra overhead.
The key misunderstanding in the discussion seems to have been that the examples at hand were using subprocess.call, which blocks waiting for the subprocess to exit, i.e. equivalent to using this code in nothread.py:
    for i in range(0, len(IP_LIST)):
        p = do_ping(IP_LIST[i])
        p.wait()
Here the pings would execute serially rather than in parallel, with the obvious performance problem :). However, you can bypass this effect of subprocess.call by using subprocess.Popen, which creates a new process that executes in parallel with the calling process.
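The difference is easy to measure. The following sketch times the blocking `subprocess.call` loop against the start-all-then-wait `subprocess.Popen` version, using a child process that just sleeps as a stand-in for 'ping'. The sleep length and the number of processes are arbitrary choices, and the exact timings will vary by machine.

```python
import subprocess
import sys
import time

# Stand-in for an I/O-bound command: a child that sleeps for half a second.
SLEEP_CMD = [sys.executable, "-c", "import time; time.sleep(0.5)"]
N = 3

# subprocess.call blocks until each child exits, so the sleeps add up.
t0 = time.perf_counter()
for _ in range(N):
    subprocess.call(SLEEP_CMD)
serial = time.perf_counter() - t0

# subprocess.Popen returns immediately, so the sleeps overlap.
t0 = time.perf_counter()
procs = [subprocess.Popen(SLEEP_CMD) for _ in range(N)]
for p in procs:
    p.wait()
parallel = time.perf_counter() - t0

print("serial: %.2fs, parallel: %.2fs" % (serial, parallel))
```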
So, for this simple use of subprocess -- running a shell command and gathering the output -- which is "better"? I think 'nothread.py' is better because it is simpler, shorter, clearer, and less complicated. Of course, as soon as you start doing more complicated stuff like reading the streams of information coming out of the subprocess commands, the threaded version may well have its advantages. But that's not the case here, I think.
Comments welcome.
--titus
common.py:
import subprocess

    IP_LIST = [ '131.215.17.3',
                '131.215.17.4',
                '131.215.17.5',
                '131.215.17.16',
                '131.215.17.17',
                '131.215.17.18',
                '131.215.17.19',
                '131.215.17.24',
                '131.215.17.25',
                '131.215.17.31']

    cmd_stub = 'ping -c 5 %s'

    def do_ping(addr):
        cmd = cmd_stub % (addr,)
        return subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
In some discussions with a moderately new Python programmer who seems to value complexity over simplicity, I may have coined a new term:
"Penis size" style of programming -- the (mistaken) belief that the more advanced programming language features you use, the more impressive your code will look.
I think it's a fair generalization to say that experienced programmers value simplicity over complexity, all other things being equal.
A search for "penis size programming" came up with this link, which is entertainingly apropos.
--titus
p.s. I originally used "dick size", but now that I'm a professor, I have to be decorous, right?
I have been struggling for the last few days trying to understand the issues behind packaging and installing the Enthought Tool Suite. I think I have been making progress, though only in my head; no actual code or packages so far are terribly satisfying.
If you are developing a Python-only program, with only dependencies on the standard library, you have no problems with packaging. You can ship tarballs, MSI installers, eggs, … all this works.
However, if you want to develop a rich program that provides many features in a closely integrated and consistent way to the user, you will have to depend on external packages. I know that many projects work around this by including the external dependencies inside the project, or simply reinventing the wheel. Well, this does not scale. We cannot expect to develop a major scientific tool and community this way. Reuse is the key to scalability, in my opinion. Thus comes the problem: how do we ship our program?
The problem can be seen very well with the Enthought Tool Suite (ETS). The ETS is a suite of many different packages, all pretty much geared towards building interactive scientific applications. In house, Enthought the company (disclaimer: I do not work for Enthought) uses these packages to develop domain-specific applications for customers. They have broken up the suite into a set of small packages, to enable assembling applications by requiring only the features you need. This is important because if you want to use ETS’s 3D plotting package (TVTK or Mayavi), but you want to stick with MatPlotLib to do 2D plotting, and not use Chaco, you should be able to download only what you need.
As a result the ETS is made of a set of interdependent packages. Maybe they went a bit too far in the modularity, and there are almost 50 packages. The dependency graph looks like this:
Just to reassure you, the next version of the ETS has a much reduced number of packages, just because some packages were grouped, and the dependency graph is indeed sane:
As you can see, there is a complex dependency graph. So how do you ship this to the user? Another problem that should not be underestimated is: how do you make it easy for people who distribute your projects to package this?
Python has no good answer for this problem, but setuptools does go part of the way. Dependencies in the ETS are declared using setuptools, and installing the ETS strongly relies on setuptools.
Setuptools provides a way of automatically downloading dependencies. However, it is not a full packaging system replacement. The reason I say this is that it does not have the knowledge of a dependency graph: it just downloads packages, introspects them to find their dependencies, and recursively tries to satisfy them by downloading more. Phillip J. Eby (the author of setuptools) has been quite clear that he does not want to write an APT replacement, though people keep getting it wrong and making the equation “easy_install = apt for Python” (IMHO this is due to bad communication on the setuptools webpage).
Moreover, setuptools does not provide an easy-to-use API to extract all the information it has about packages, dependencies, and download URLs. It is thus not trivial to plug packages shipped with setuptools into another package manager like rpm or apt. This is what bothers me most, because it strongly limits the exposure the ETS is getting in distributions (whether they be Linux distributions, or scientific computing “superpacks”). Recently I have had discussions with somebody on how to ship Mayavi in a monolithic distribution he has developed. He agreed to ship setuptools with the distribution, so now I need to give him a list of eggs to provide. There is no obvious way to get this list using setuptools (insert here big big rant). So I thought that an option was to install Mayavi in a virtual environment to track the eggs added, and use this information. However, this person’s internet access was possible only by logging in to dumbed-down servers for security reasons. So we hit a wall. And for me this is a wall we keep hitting with setuptools: setuptools does everything for you, the download, the building, the install. It does have flags to control these processes, but it does not expose the information you need to do this without using it. I actually think the reason it does not expose this information is that it does not know it a priori. Looking at the code, it does seem so. In addition, the structure of the packages makes it hard to do.
On the other side, Dave Peterson, at Enthought, has been working on a tool to allow checking out of the ETS SVN only the projects you are interested in. I played a bit with it, and modified it to generate the dependency graphs. I quickly found out that I actually like this tool much more than setuptools, even though it was pretty much using the same concepts. It took me a while to understand what I like about the tool. It is that it uses a map file to gather all the package and dependency information. As a result, it has the equivalent of a dependency graph. This makes it possible to do the operations I am interested in, e.g. listing all the packages required for installing a given project without actually downloading them.
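To illustrate why having the whole map up front matters, here is a minimal sketch of that last operation: given a project map held in memory, list everything a project needs without downloading anything. This is not Dave's tool and does not reflect its map file format; the dictionary layout, the `required_packages` function and the package names are all made up for the example.

```python
def required_packages(project_map, project):
    """Return every package `project` needs, dependencies listed first.

    `project_map` maps each package name to the list of packages it
    depends on directly. Nothing is downloaded or installed.
    """
    ordered = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dependency in project_map[name]:
            visit(dependency)
        ordered.append(name)  # appended only after all its dependencies

    visit(project)
    return ordered


# Hypothetical project map; names and edges are invented for illustration.
project_map = {
    "viewer": ["plot3d", "core"],
    "plot3d": ["core", "util"],
    "plot2d": ["core"],
    "core": ["util"],
    "util": [],
}

print(required_packages(project_map, "viewer"))
```

Because the result comes out with dependencies first, the same list can double as an install order, and a package the project does not need (here the hypothetical "plot2d") never appears in it.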
The reason this is possible is that with the ETS we are not dealing with an open set of packages, like PyPI, in which packages can come and go, and no consistency is enforced. We are dealing with one suite of multiple projects that are made to work with each other. The base entity is thus a project set, on which we can make a “project map”.
What Dave has done works fantastically for development, and I would like to push it further for distribution. What we expose to the user can now be a repository, in the sense of APT: a set of packages with consistent inter-dependencies, and a way of easily retrieving this information. The difference between the two, and the implications of the difference, is not something I had clearly in my mind in the beginning, but it is becoming clearer that having a repository with a project map gives a lot of added value for distributing. I’ll see if I can reuse Dave’s work to build such a tool, but do not hold your breath: I am not willingly in the business of packaging, and will probably not spend enough time on this to make it a good tool.
Edit: Correct Phillip’s name.
This is my first review. I read this book some time ago but I still want to write about it because the topic is very interesting.
The main matter of this book is not about how to write parallel applications, but more on what underlying architecture is interesting for those applications.
First, the reader is introduced to some vocabulary about parallelism (SIMD, MIMD, …) and its impact on applications. Then, interconnections between processors are introduced. In every current parallel application, communication between processors has a big impact at different scales: when more processors are added to the cluster, how the application is distributed, … Several chapters discuss the issues that users will face.
Once the processors are connected, the different processes that are executed need a way of communicating: they share some memory or they post messages to each other. Each model has its advantages and drawbacks. They are described in two complete chapters.
These architectures solve problems. The reader is introduced to the concept of an abstract model. The underlying assumption is that a problem that cannot be solved in parallel on the abstract model cannot be solved with a shared memory or message passing architecture. This part is also abstract in its content and is perhaps the hardest part of the book to understand. I agree that mastering the abstract model will ease the pain of designing the application, because it will have been split into parallel chunks first, but there is some overhead in using this model.
A complete chapter is dedicated to the Message Passing Interface, MPI. This is a widely used standard, and this chapter is perhaps the closest to programming issues (there is some code!). The C interface is presented (not the C++ or Fortran ones) in a reference-manual way.
The last chapter is about scheduling. This is a very difficult topic that few books introduce. It is a difficult field (an NP-complete problem), and this is hidden in a lot of explanations. Here, the reader gets a complete introduction to the topic and can benefit from it when he has a lot of parallel tasks to run.
The topics covered by this book are very pragmatic. Each programmer should read it so that he can understand the issues with communication in a parallel application. The main reason why parallel applications are not so widespread is that hardware intercommunication does not scale very well, but solutions are finally starting to be available. In conclusion, I think this book should be in every university library rather than in your own, because it is still very expensive.
Advanced Computer Architecture and Parallel Processing (Wiley Series on Parallel and Distributed Computing) (Hardcover)
by Hesham El-Rewini, Mostafa Abd-El-Barr
ISBN: 0471467405
Price: USD 118.50
Pavel Vinogradov <fastnix> has been keeping me updated on an issue he discovered while testing TCMalloc with Python as a Google Highly Open Participation (GHOP) task, task 105.
Briefly, Pavel discovered a situation in which replacing the Python memory allocator with TCMalloc resulted in really bad performance. The latest is that there appears to be a bug or gotcha in TCMalloc with glibc, where TCMalloc does a poor job in cases where mremap can be used by glibc. The TCMalloc folk are going to look into it more, I gather. (See google-perftools thread here.)
Anyway, this was a situation where we just threw the task at the students to see if anything interesting would pop out -- not expecting much of anything other than a learning experience for the student -- and yet through some simple-yet-dogged testing, Pavel really contributed something.
Awesome stuff!
There have been several real success stories to GHOP. I need to write them down, sigh... my kingdom for some time :)
--titus
In its March 2008 issue, IEEE Computer published a case study on large-scale parallel scientific code development. I’d like to comment on this article, a very good one in my opinion.
Five research centers were analyzed, or more precisely their development tools and processes. Each center does research in a particular domain, but they seem to share some Computational Fluid Dynamics basis.
Although the centers are very different, they use a common set of technologies :
Whereas Computer Science (CS) students are taught how to write an application in an efficient way (robust but rapidly written), Scientific Computing (SC) students must develop fast algorithms in a short time. This is needed because parallel computation is used when a serial computation would take too much time, but even parallelized, these computations can take several hours or days.
Having a prototype is great, but it is only half the job. Once you have a prototype, you can test it, tune it if needed, and then it must be parallelized (sometimes it is parallelized during prototyping; I tend to think that parallel code must be introduced after a first draft, but that doesn’t mean I haven’t thought about how to parallelize my code). At this point, it is not certain that the code executes well on several dozen processors, but it can be tested on a small farm (talking about farms, one great thing about Subversion is that it can trigger actions, like building and testing code on a farm; this is a must-have).
It seems that none of these centers have found a correct parallel debugger for their application. Even for a multithread program on a simple computer, mastering the debugger and then debugging the code is hard. A lot of manpower will have to be put in this domain…
Here are some of my thoughts about what could be used to enhance the quality of such an application, some of them are already used in some of these centers (so it is not completely crazy to express them here) :
I do not pretend to know the truth with my comments; these applications have been developed over a very long time, far longer than my own development experience, and thus I’m not in a position to know better than the people working on them (if one of them is reading this post, I’d like to congratulate her/him for the hard work). But I think that sometimes a new look at a problem may solve it, and Python may be an efficient tool for these applications, leading to even better scientific applications.
At Google Campfire One, v 2.0 -- introducing AppEngine.
IT'S FREEZING. The cider ran out. Brr.
Deploying Web apps is annoyingly difficult. Technical hurdles, etc. Need machines. Blech. Costly.
AppEngine solves all these problems. Runs web apps, handles app lifecycles, apps run on Google infrastructure and can make use of GFS, auth etc. etc. etc.
Components
Config is in YAML, with mapping done by regexp.
from google.appengine.ext import webapp, looks like Java to me.
Naah, that was mean. It's python.
Oooh, WSGI built in.
def get, def post -- looks like web.py. why do people do that?
Using introspection, building SQL-like GQL, to drop stuff into/suck stuff out of database with proper names.
Django templates.
Single-command deployment.
Scalable serving infrastructure: when app pushed, pushed to multiple fault tolerant servers. Any one may fail, but request will always go through. THE ROADS MUST ROLL.
All Python runtime and many third-party libraries available. (How do they deal with security? What about SQL and ORMs?)
SDK: releasing for Linux etc.
Web-based admin console. Standard stuff, stats collected in "near-real time".
Data store. BigTable. Yah. Horizontally distributed fault-tolerant system.
No joins in GQL?! Rationale: joins may need to work across computers and individual ram capacity.
Send e-mail, make HTTP reqs, auth with Google accounts, use a variety of frameworks.
Went to get cider, they replenished. There are also meatballs wrapped in dough (!?!) It is still FREEZING.
Guido gets up. He's on the AppEngine team.
All about making tools for developers. But hates root password. Thinks AppEngine solves this.
"First time Google lets other people run stuff on their servers."
Isn't this a support nightmare? Software versions etc?? Well, can upload your own frameworks.
Stdlib emasculated in three ways: writing to the file system is forbidden; cannot talk directly to the network (urlfetch & mail sending API); no threads (chuckle, I think GvR not so secretly hates threads).
Python is not the only language (you heard it from Guido). Perl? COBOL? Assembly?
Stuff about admin infrastructure.
Lots of nice error logging/tracking stuff.
Data viewer. Interactive query. Nifty. (Hmm, how do you upload bulks of data??)
Version control built in. Yay.
Is testing built in???
Adding a domain...
Host limits: 5mn page views a month for a well written application is free.
Over and out. I'm freezing. Still.
--titus
I have been reading an article about a new language paradigm (Erasmus, a modular language for concurrent programming). The authors discuss the limitations of objects in terms of modularity. To sum up their point (and most probably distort it completely), the limitations of objects come from the fact that you can’t be sure what is modifying what: suppose you have a method foo of an object bar that you call in a method of an object baz; you cannot be sure that this method hasn’t modified private attributes of your object baz, as foo could have called a method of your object. This does happen in large code bases. Of course, best practice tries to reduce this to a minimum, but this reduces modularity, and thus limits both code reuse and concurrency (as side effects are not well controlled).
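A contrived sketch of the problem, reusing the names foo, bar and baz from the description above (the classes are purely illustrative and do not come from the Erasmus article):

```python
class Bar:
    def foo(self, caller):
        # Looks harmless from the outside, but calls back into the caller.
        caller.reset()
        return 42


class Baz:
    def __init__(self):
        self._count = 10  # "private" state

    def reset(self):
        self._count = 0

    def work(self, bar):
        before = self._count
        result = bar.foo(self)
        # Nothing written in work() touches _count, yet it has changed.
        return before, self._count, result


print(Baz().work(Bar()))
```

Reading `work` alone, there is no way to tell that `_count` ends up at zero; you have to read `Bar.foo` as well.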
Erasmus’s solution is to adopt a new container, which they call modules rather than objects, and which is based on message passing rather than method calls. These modules live in separate processes and can themselves be made of more conventional code (I am extrapolating a bit from the original article here).
This strikes me as being related to a pattern that I see more and more in my code that uses Traits. The objects deriving from HasTraits have a very easy and cheap way of coupling callbacks to the modification of their attributes. This induces a programming style known as reactive programming that is entirely callback-driven. In addition, this is a nice way of ensuring that the internal state of an object is always consistent. This is a first step to message passing and decoupling: you no longer call methods, you just set attributes and let the object do the rest. The limitation of this model in a large code base is that you have to carry around references to the objects you are interested in, and their attributes. Traits has patterns to help you do this (delegation, namely), but it is still a limitation.
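To illustrate the style without depending on Traits, here is a minimal pure-Python sketch. The `Observable` class and its `observe` method are stand-ins invented for this example; they are not the Traits API, which is richer and has a different syntax.

```python
class Observable:
    """Tiny stand-in for a HasTraits-like object (NOT the Traits API)."""

    def __init__(self):
        object.__setattr__(self, "_observers", {})

    def observe(self, name, callback):
        """Register callback(obj, name, old, new) for changes to `name`."""
        self._observers.setdefault(name, []).append(callback)

    def __setattr__(self, name, new):
        old = getattr(self, name, None)
        object.__setattr__(self, name, new)
        if old != new:
            for callback in self._observers.get(name, []):
                callback(self, name, old, new)


class Rectangle(Observable):
    def __init__(self, width, height):
        super().__init__()
        # Keep the derived attribute consistent whenever a side changes.
        self.observe("width", Rectangle._update_area)
        self.observe("height", Rectangle._update_area)
        self.width = width
        self.height = height

    def _update_area(self, name, old, new):
        self.area = getattr(self, "width", 0) * getattr(self, "height", 0)


rect = Rectangle(3, 4)
rect.width = 5  # no method call: setting the attribute is the "message"
print(rect.area)
```

The caller never asks the rectangle to recompute anything; it only sets an attribute, and the object keeps its own state consistent.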
This is where the Envisage framework comes into play. Envisage introduces the notion of plugins which provide extension points. These extension points are special traits attributes that are published in a registry (which can be application-wide, or not, in Envisage3). You can query the registry to retrieve these extension points and contribute to them. After that, the traits callback mechanism triggers an action in the plugin contributing the extension point.
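The idea can be sketched with a toy registry. Everything here (`Registry`, `connect`, `contribute`, the `viewer.filters` identifier) is hypothetical and only mimics the mechanism; it is not the Envisage API.

```python
class Registry:
    """Toy extension registry (NOT the Envisage API)."""

    def __init__(self):
        self._contributions = {}
        self._listeners = {}

    def connect(self, point_id, listener):
        """The plugin offering an extension point asks to be notified."""
        self._listeners.setdefault(point_id, []).append(listener)

    def contribute(self, point_id, extension):
        """Another plugin contributes an extension to the point."""
        self._contributions.setdefault(point_id, []).append(extension)
        for listener in self._listeners.get(point_id, []):
            listener(extension)

    def get_extensions(self, point_id):
        return list(self._contributions.get(point_id, []))


class ViewerPlugin:
    """Offers a hypothetical 'viewer.filters' extension point."""

    def __init__(self, registry):
        self.filters = []
        registry.connect("viewer.filters", self.filters.append)

    def render(self, value):
        for extension in self.filters:
            value = extension(value)
        return value


registry = Registry()
viewer = ViewerPlugin(registry)
# A second plugin only needs the registry and an identifier,
# not a reference to the viewer object itself.
registry.contribute("viewer.filters", str.upper)
print(viewer.render("abc"))
```

The contributor holds no reference to the plugin it extends, which is the decoupling that plain attribute callbacks could not give me.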
This contribution mechanism could be based on message passing between processes quite easily (although for GUIs it breaks down, because AFAIK you cannot assemble a consistent GUI from different widgets living in different process spaces without using some Xwindows-specific tricks). Of course this does not give me hard guarantees of decoupling and control of the side effects, as a call to a plugin can induce calls to other plugins inside it. This is where best practice comes along: core plugins should be able to run and provide their basic functionality outside of Envisage, as normal objects. Envisage should only be a thin wrapper allowing them to expose this functionality and extend other plugins. This introduces a distinction between objects and method calls, which do not need to be arranged in self-consistent entities and which you use very often, and plugins and extension contributions, which form standalone entities and should be used more sparingly.
Of course Envisage cannot go too far in terms of providing guarantees for decoupling. It gives a mechanism and best practices, and could even help plugin decoupling by having plugins live in different processes, but as long as it does not enforce rules in the semantics of the language, it cannot achieve what projects like Erasmus are trying to do. I do think, however, that it is good to have a look at the work done in these projects to see what we can learn.
PS: Web apps suck! I made a few shortcut mistakes under wordpress, wanted to undo them and hit “Ctrl-R”, which is “redo” under vim, and lost all my post. I strongly don’t believe in web apps, amongst other things because they don’t allow me to use vim.
Please send this on to anyone who might be interested...
Disney Animation has an opening for a summer intern to work on a testing project under the supervision of Paul Hildebrandt and Dr. C. Titus Brown. The ideal candidate will have experience with a dynamic language supporting introspection (Python preferred) as well as experience developing unit, functional, and regression tests, but the ability to learn quickly is the only requirement. The work will consist of building a new testing tool for aiding test automation on large existing code bases. We expect to release the tool under an Open Source license.
Please send your CV and a short personal statement to da-testing-2008@idyll.org, c/o Paul Hildebrandt and C. Titus Brown.
- BA/BS requested, MA or PhD in-progress desired.
- Housing stipend provided; competitive hourly wage.
- 2 months commitment minimum, 3 months preferred.
- Opportunity for follow-on work or continued employment.
Disney Animation is located in Burbank, CA.
I'll post what I can of the specific project proposal when I have a cleaner version.
--titus
Some of the widely used methods are based on a similarity graph built from the local structure. For instance, LLE uses the relative distances, which are related to similarities. Using similarities allows the use of sparse techniques. Indeed, a lot of points are not similar, so the similarity matrix is sparse. This also means that a lot of manifolds can be reduced with these techniques, but not with Isomap or the other geodesic-based techniques.
It is worth mentioning that I only implemented Laplacian Eigenmaps with a sparse matrix, due to the lack of a generalized eigensolver for sparse matrices, but it will be available in a short time, I hope.
Laplacian Eigenmaps are the best-known technique using the similarity graph (save for LLE, which is nothing more than a special case of Laplacian Eigenmaps). The similarities are computed between neighbors (neighbors meaning samples that are near one another in terms of distance, or samples that are adjacent, like pixels in an image), generally with a Gaussian kernel. The trick here is to choose the correct width of the kernel. Then, the similarity matrix is weighted (each column and row must sum to one; this is the Laplacian of the graph) and eigenvectors are extracted from it. The first eigenvalue is one and must not be used.
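The graph-building part of this recipe can be sketched in pure Python. This is a minimal illustration under my own assumptions, not the implementation discussed here: the function names are made up, the graph is stored as nested dictionaries to show the sparsity, only the rows are normalized, and the eigenvector extraction is left out.

```python
import math


def gaussian_knn_similarities(points, n_neighbors, width):
    """Sparse similarities {i: {j: w_ij}} between each point and its neighbors."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:n_neighbors]:
            w = math.exp(-d * d / (2.0 * width ** 2))
            # Symmetrize: i and j are linked if either is a neighbor of the other.
            graph[i][j] = w
            graph[j][i] = w
    return graph


def row_normalize(graph):
    """Scale each row so that its weights sum to one."""
    normalized = {}
    for i, row in graph.items():
        total = sum(row.values())
        normalized[i] = {j: w / total for j, w in row.items()}
    return normalized


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
graph = row_normalize(gaussian_knn_similarities(points, n_neighbors=1, width=1.0))
for i, row in graph.items():
    print(i, row)
```

Pairs that are not neighbors simply have no entry, i.e. a similarity of 0, which is what makes the matrix sparse and also what lets the eigenproblem pull such points apart.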
Here is what I get :
One may wonder why the reduction is so poor, but I’m not the only one to get this result. I tried every width for the kernel to no avail. The literature says that Laplacian Eigenmaps tend to cluster points, which is easily explained by the algorithm. The eigenproblem extracts the main eigenvectors so that the weighted similarity matrix is preserved (in a quadratic way). This means that even if points should be close, if they are not close enough, they have a similarity of 0, so the eigenproblem will separate them.
Diffusion maps are another similarity graph technique. Although there is a Markovian/probabilistic interpretation, diffusion maps are basically Laplacian Eigenmaps with similarities computed between every pair of points. This means that they have the same drawbacks as Laplacian Eigenmaps, except for the clustering. The width of the kernel is still difficult to estimate.
Here is the result :
The fact that every similarity is used explains the fact that diffusion maps cannot reduce the SwissRoll correctly. In this precise case, the kernel width was obviously too big, but smaller width gives a result similar to the Laplacian Eigenmaps, which is not correct either.
The other technique I will present is Hessian Eigenmaps. Instead of estimating the Laplacian of the similarity graph, it tries to estimate the Hessian. This gives very good results for the SwissRoll :
Unfortunately, the technique is not robust to noise, as I will show you in the result ticket. Save for this fact, the technique is robust to holes in the manifold (non-uniformly sampled manifolds, for instance), which are one of the biggest drawbacks of techniques based on geodesic distances.
Stay tuned.
Analytical solutions to the dimensionality reduction problem are only possible for quadratic cost functions, like Isomap, LLE, Laplacian Eigenmaps, … All these solutions are sensitive to outliers. The issue with the quadratic hypothesis is that it assumes there are no outliers, but on real manifolds, the noise is always there.
Some cost functions have been proposed, also known as stress functions, as they measure the difference between the estimated geodesic distance and the computed Euclidean distance in the “feature” space. Every metric MDS cost can be used as a stress function; here are some of them.
The oldest function is Sammon’s NonLinear Mapping. Originally based on Euclidean distances, I implemented it with the approximated geodesic distances described in the Isomap ticket. The goal of this function is to add a weight (the inverse of the geodesic distance), leading to less weight for the greatest distances, but also an important weight for small distances.
Here is the cost function for the distances (y are the coordinates in the original space and x in the feature/reduced space) :
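In its commonly cited form, Sammon's stress divides each squared distance error by the target distance and normalizes by the sum of the target distances. A minimal sketch, assuming that form (the function name `sammon_stress` and the normalization are my choices for illustration, and plain nested lists stand in for real arrays):

```python
import math


def sammon_stress(geodesic, reduced):
    """Sammon's stress between target distances and an embedding.

    geodesic: square matrix (list of lists) of distances between the y points
    reduced:  list of coordinate tuples x in the reduced space
    """
    total = 0.0
    stress = 0.0
    n = len(reduced)
    for i in range(n):
        for j in range(i + 1, n):
            target = geodesic[i][j]
            actual = math.dist(reduced[i], reduced[j])
            total += target
            # Weight 1/target: small distances count more than large ones.
            stress += (target - actual) ** 2 / target
    return stress / total


# Three points on a line, embedded perfectly and then collapsed onto one point.
geodesic = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
print(sammon_stress(geodesic, [(0.0,), (1.0,), (2.0,)]))
print(sammon_stress(geodesic, [(0.0,), (0.0,), (0.0,)]))
```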

Optimizing this function with a conjugate-gradient descent from a random start can give this result :
Another function that is cited in the literature is Demartines’ one from Curvilinear Component Analysis :

The F() function is 1 when the argument is small (less than an arbitrary value lambda), and 0 otherwise.
As a consequence, this function is not convex, not even continuous. The algorithm proposed in the associated paper is not great; I never managed to make it work on a SwissRoll, even with few points. So here are the steps I use to optimize the function :
Each time, the new point is moved according to every already placed point; then, when every point is moving, only the local stresses are used. But the optimization can still go wrong, and some points that should be close can end up far from one another, because their associated stress is zero.
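To make the zero-stress failure concrete, here is a minimal sketch of a CCA-style stress. It assumes the usual form, in which the step function F() is applied to the distance in the reduced space; constant factors are dropped, and the names `step` and `cca_stress` are illustrative.

```python
import math


def step(distance, lam):
    """F(): 1 for distances below the threshold lam, 0 otherwise."""
    return 1.0 if distance < lam else 0.0


def cca_stress(geodesic, reduced, lam):
    """CCA-style stress; the weight depends on the reduced-space distance."""
    stress = 0.0
    n = len(reduced)
    for i in range(n):
        for j in range(i + 1, n):
            actual = math.dist(reduced[i], reduced[j])
            stress += (geodesic[i][j] - actual) ** 2 * step(actual, lam)
    return stress


geodesic = [[0.0, 1.0], [1.0, 0.0]]  # the two points should be 1 apart
print(cca_stress(geodesic, [(0.0,), (1.5,)], lam=2.0))  # slightly off: penalized
print(cca_stress(geodesic, [(0.0,), (5.0,)], lam=2.0))  # far off: no penalty at all
```

Once two points drift further apart than lambda in the reduced space, their term vanishes, so nothing pulls them back together.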
Here is the result for this optimization :
The cost function I use is a robust one, not “recursive” as Demartines puts it (the weight is not a function of the estimated distance, as is the case for the CCA cost function) :

The first term is the robust term, differentiable when the (geodesic estimated and Euclidean computed) distances are equal; the second term allows for a fast convergence when the distances are not correctly estimated (useful at the beginning of the optimization, less afterwards); and the last term gives a small weight to small distances, as they can be polluted by noise for noisy manifolds. Gamma should only be a small value, tau is set to be equal to 80% of the geodesic distances and sigma to 5%. This gives good results in every case.
Here is the result with this cost function :
Its optimization is not easy as it can give folded reduced space as an answer. I proposed two algorithms to solve the issue :
The first one :
The second one :
The second algorithm is slower than the first, but it works every time.
Stay tuned for the results…
From Wages or Shortage, this comment
""" A-grade engineers are unfortunately similar to Welsh longbowmen: devastatingly potent compared to their peers, but you have to start their training at age 10 or so. Simply upping the salaries of A-grade engineers won't magically create more of them. We know this, as we tried exactly that experiment in the boom." """
and this comment
""" ... the notion of "best practices" is widely misunderstood in IT. It is not organizational best practices that most improve the output, it's best engineering practices. And those are accepted first by the "rockstar" types and least understood, why, most often resisted, by the management and subsistence engineers. Where do you think that 10-20x gap comes from, lightning-fast typing skill? ;-) """
and this one
""" ... such practices often fall victim to the hero mentality that is the odious legacy of the dot-com boom. It basically says that for a tech company to do well, it has to find some rockstars, clear the decks for them, and sell the gold that trickles out of their foosball-table equipped office. It fosters a warlike mentality in the workplace and sacrifices long term growth for short term market share. It also happily sacrifices a vast middle ground of engineers who would improve and be profitably productive with a positive environment and some solid mentoring so it can lavish luxury on the super-productive who may not, as Dave seems to concede, necessarily add business value. Contrary to Dave's assertions, I've also seen good engineers get better in such an environment. """
all ring true.
--titus
This month I have traveled a bit for scientific-computing related reasons, and of course it was pure delight.
First of all, I was speaking at the OKcon, open knowledge conference in London, about Scientific tools in Python in general, and Mayavi in particular. I jumped on the occasion to visit the Airbus campus in Bristol. We have had some contacts with these guys, because they use Mayavi in some of their homegrown applications, and I was curious to put faces on friendly names on the mailing list. In addition, I was eager to find out how they were using Mayavi and Python scientific tools in an industrial environment, as I have never worked in another place than a physics lab.
The Airbus visit was enlightening: the Bristol campus is a major research facility (several thousand people) dedicated to wing design. A good part of the work is done through simulations deployed on big clusters. These calculations have historically been run in Fortran and C, but apparently the engineers are switching to a mix of compiled languages and Python. Moreover, steering of these simulations, through mesh design, visualization of the results, and analysis of the data, is done mainly through an interactive program, ‘flightpad’, that is developed fully in Python, using the Envisage framework to couple together a bunch of scientific components, including Mayavi. I got to spend a fair amount of time with the guys doing this, and it was great to see how they did it. They have a good approach to scientific software design (loosely coupled components, reuse of all the existing libraries), even though their goal (automatic generation of Python scripts from user interaction) is way more ambitious than anything I have in mind. I was pleased to see that they were using Mayavi in a way completely consistent with its design, and did not have to hack around limitations.
It was really very encouraging to talk with the software strategist. He obviously completely got it as far as how an open-source model can be profitable to a company like Airbus. Seeing so many people using open-source tools as their main tools, as well as a manager ready to back this position and explain how it can be beneficial to contribute to an open-source project, really filled me with hope.
Of course visiting the Airbus campus was not only about software, it was also about planes (I got a drive around the campus, and it is quite fun to ride a mini cooper between two 747s), and beers (reinventing the world to make it a better place at the pub, after work). I must say there is something special about the scientific Python community: it is the nicest community I know (with the sailing one :->). You meet people that you have never seen before, and you immediately feel at ease.
The Open Knowledge conference was fun. Not too much like the geek conferences I am used to, as here the focus was on the data, and not the tools, aka the software (for instance, the big deal is when you can get access to the complete public transport time-tables, and you can make maps of poorly connected areas). I met Martin Albrecht from the sage project. It was very interesting to talk with him. I generally consider myself as doing rather fundamental research (Bose-Einstein condensation), but for him I was in the applied science section, because I use math and computers to do applied things. This distinction between applied and fundamental maths yields a distinction in the application of the code, and therefore in the way an open-source scientific project can survive. It was very interesting to see how sage’s development process therefore differs from scipy’s. I think that both Martin’s talk on sage and mine on Python and interactive visualization had a lot of success: the room was full of scholars, and they wanted tools to do their work.
In London, I had the occasion to catch up with my brother, and Rob, a former colleague. That was nice too (and yielded more beers).
The week after, I was attending a sprint in Paris on nipy: neuroimaging in Python. We were a bunch of enthusiastic scientific Python users crammed in a small room during the day. There was the team from Berkeley, including Jarrod and Fernando, and all their friends. I got to make new friends, and catch up with old ones. The goal of the nipy effort is to build a complete processing pipeline for neuroimaging data, especially fMRI, in Python. This is a lot of work, as many transformations are applied to the raw data to make it useful for scientific publications. As the field matures, these transformations pile up, and the processing pipeline gets more and more complex. There already exists a good pipeline under MatLab (SPM); the problem is that, due to the poor language features of MatLab, it is a codebase that is hard to extend and to modify. One of the goals of the nipy project is to make a pluggable architecture, so that researchers can replace part of the pipeline with their own code, and thus explore new methods while comparing them to the reference one. This means that there are some interesting software engineering problems in here (pluggable pipelines, framework…, the kind of stuff I like); however, the current focus is to get the algorithms right before trying to do software over-engineering.
The Berkeley group got an NSF grant to work on the project and has been able to hire two developers for two years (Chris Burns and Tom Waite). The effort is led by Jarrod Millman, and they have put a lot of work into making the underlying libraries better (that is, improving numpy and scipy).
I had difficulties contributing any useful code, as I don’t know neuroimaging, but I had the pleasure of seeing people pick up the mayavi API and use it to quickly build domain-specific tools for displaying brains and activation regions. As usual this also revealed some shortcomings in the mlab API that I plan to address ASAP.
The weekend after, Fernando, Laurent Dufréchou, Stéfan van der Walt and myself crashed at my parents’ place to work on ipython1 and the front ends. My mother cooked us some fabulous food and I had a great time.
Unfortunately we did not get as far as I would have liked. The right abstraction for talking between the ipython1 execution engine and the front end is not really easy to get right, as the engine is nothing more than an abstract execution engine that basically only has a namespace and knows how to execute stuff in a non-blocking mode (that’s where it gets hard: how do you know what is going on with your engine and the commands you have sent to it? How do you deal with introspection requests such as tab-completion or docstring exploration?). We want as little logic in the front ends as possible: let us not duplicate tab-completion or history. This is why we are progressively building an object, which Fernando dubbed “InputStateManager”, that does the impedance matching between the front end and the engine. I am starting to believe that the best way to connect this object (ISM) to the front end is via a callback-based mechanism: the front end calls the ISM methods and gives them a callback to call when finished (for instance, if running in a different thread, a Wx front end would pass something based on wx.CallAfter to display the result). That way the mechanism is very general, can adapt to event-driven front ends or readline-based ones, and knows nothing about the front end. Of course not much code got written, because I am way too slow, and it took me ages to figure this out.
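A minimal sketch of what I mean by the callback-based mechanism. The class below is a toy stand-in: its name is borrowed from Fernando's, but the `execute` method, its signature and the use of a plain thread with `eval` are assumptions made for this example, not the ipython1 code.

```python
import threading


class InputStateManager:
    """Toy stand-in for the ISM; names and behaviour are illustrative only."""

    def __init__(self):
        self.namespace = {}
        self.history = []  # history lives here, not in the front end

    def execute(self, source, callback):
        """Run `source` without blocking; call `callback(result)` when done."""
        self.history.append(source)

        def run():
            try:
                result = eval(source, self.namespace)
            except Exception as exc:  # hand errors to the front end as well
                result = exc
            callback(result)

        worker = threading.Thread(target=run)
        worker.start()
        return worker


# A front end only supplies a callback. A Wx front end would wrap its
# display function so that it is executed in the GUI thread.
results = []
ism = InputStateManager()
ism.execute("1 + 2", results.append).join()
print(results)
```

The ISM never imports or inspects the front end; a readline-based one could pass a function that prints, and an event-driven one a function that posts an event.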
We had a lot of fun, and for me the highlight of the week end was when my girlfriend joined us to do some hacking on a really cool project trying to use the scipy.org