How I use Linux for theoretical physics

Opensource.com

In 2008, I started studying physics and first came into contact with Linux, since many people around me used it for data analysis and simulations. Comprehension came fast and easy with such people around, and I was strongly encouraged to get things done with Linux. I installed Ubuntu on my notebook and soon got familiar with Bash and the standard tools.

After some years I turned to theoretical physics. While I was writing my master's thesis I gained access to a workstation running Scientific Linux, and to a cluster system with a few hundred cores. I was impressed that each of my peers had implemented their own customized workflow, and that it was actually possible to work entirely with the keyboard, which is inconceivable for a Windows user. Using Linux on a daily basis luckily resulted in a steep learning curve. My task was to write a piece of software able to determine material properties from video sequences that we received from experimenters. I got it to work using C++, which handled the image analysis and lots of math with satisfactory runtime, which was a critical factor. This project could have been done in Windows, but using Bash and powerful command line tools like MEncoder, convert, and AWK turned the development into a quick and efficient process.
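A minimal sketch of that kind of pipeline looks like this (the file names, the MPlayer and ImageMagick invocations, and the per-frame measurement format are illustrative assumptions, not the actual project code):

```shell
#!/bin/sh
# Hypothetical pipeline: dump video frames, convert them for analysis,
# and reduce per-frame measurements with AWK. The extraction steps only
# run if the respective tools and input file are present.

# 1. Extract frames from a (hypothetical) experiment video as PNG files.
if command -v mplayer >/dev/null 2>&1 && [ -f experiment.avi ]; then
    mplayer -nosound -vo png:z=1 experiment.avi
fi

# 2. Convert frames to grayscale PGM for a downstream C++ analyzer.
if command -v convert >/dev/null 2>&1; then
    for f in *.png; do
        [ -f "$f" ] && convert "$f" "${f%.png}.pgm"
    done
fi

# 3. Post-process per-frame measurements (frame number, strain value)
#    emitted by the analyzer; fake data stands in for real output here.
printf '%s\n' '1 0.10' '2 0.20' '3 0.40' '4 0.50' > frames.dat
awk '{ sum += $2; n++ } END { printf "mean strain: %.3f\n", sum / n }' frames.dat
```

The point is less the specific tools than the shape of the workflow: each stage is a small command that can be rerun, swapped out, or piped into the next one.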

I have been working on my PhD project for nearly three years using exclusively C++ and Bash. I enjoy the old C99 coding style without smart pointers, functionals and other stuff intended to make coding comfortable. Throughout my set of self-written tools I use libraries like the GNU Scientific Library, Cairo, OpenCV, OpenMP, and some more. Since I produce a lot of numeric data requiring heavy post-processing, I would be stranded without the Bash shell and its tools. For visualization I use gnuplot and Visual Molecular Dynamics (VMD), which are great and simple-to-use tools that save a lot of time.
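As an example of that post-processing style, a batch of numeric samples can be binned into a histogram with AWK and handed straight to gnuplot (the data, column layout, and bin width are made up for illustration):

```shell
#!/bin/sh
# Bin a column of numeric samples into a histogram with AWK, then plot
# it with gnuplot if available. Sample data and bin width are invented.

printf '%s\n' 0.1 0.2 0.25 0.8 0.85 0.9 > samples.dat

# Bin width 0.5: count samples per bin and print "bin_center count".
awk -v w=0.5 '{ b = int($1 / w); c[b]++ }
     END { for (b in c) printf "%.2f %d\n", (b + 0.5) * w, c[b] }' \
    samples.dat | sort -n > hist.dat

cat hist.dat

# Plot only if gnuplot is installed; writes hist.png.
if command -v gnuplot >/dev/null 2>&1; then
    gnuplot -e "set terminal png; set output 'hist.png'; plot 'hist.dat' using 1:2 with boxes"
fi
```

Because the histogram lands in a plain text file, the same data can feed gnuplot today and some other tool years later.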

Coding in Windows left me unsatisfied because the mechanisms hidden behind the standard system calls stayed out of reach. Because I spent a lot of time condensing large data blocks into simple graphics, I became more interested in finding out what happens behind the scenes. This resulted in my own database project, which I worked on for about three years. Of course I know there are hundreds or thousands of existing projects, but I was interested in understanding it from the bottom up, and this is what Linux makes possible. It involved a lot of problems that loads of people have probably tackled before me, but I liked the challenge. Reading books like The Linux Programming Interface and Unix Domain Sockets inspired me to work out some simple concepts. This project helped me build useful experience, since in the future I would like to contribute to open source projects, either by starting a new one or by joining an existing one.

Jonas Hegemann is a physicist from Germany. In February 2018 he finished his dissertation in theoretical biophysics including computer simulations on cell biology and video analyses of elastic materials.

7 Comments

Thanks to the author for using the phrase "steep learning curve" to signify rapid learning instead of difficulty learning. You might consider investigating python as a wrapper to all of your lower level analysis/simulation code; its syntax is very nice to work with, and you end up with far more human readable code than shell scripting, with all of the power and speed.

Thanks for your comment! Actually many of my colleagues use Python for this purpose, and I do so as well from time to time. Concerning readability you're definitely right. My Bash script collection is kind of a historical heritage, since I began with Bash and have proceeded with it ever since. For the future, I'll definitely consider Python as an alternative to Bash.

In reply to by anon (not verified)

Reading experiences like these impresses me with how much the tools have grown over the years. I completed my PhD quite a while ago, and though I didn't have to carve my microprocessors out of wood, I did my control software in 6502 assembly and used an assembler written in BASIC.

I hope you're enjoying the challenges and the improvements in tools, and can share what you're creating with others. If you get rich, that's okay, too.

Good luck and keep at it.
Uncle Ed McNerd, PhD

Good to hear that Linux worked for you.

However, I'd like to make a few critical remarks about (1) how you came to use Linux, (2) how you structured your workload, (3) how you dived into databases, and (4) your eulogy of the CLI (command line interface).

(1) I recognise and understand your introduction to Linux. It's ubiquitous in today's Physics computing environment because it's free, allows easy access to all kinds of ports (for data capture), because it's fairly easy to set up your own fully customised analysis software to pipe your data through and above all because work under Linux tends to be cumulative. What I mean is that work from previous PhD students within a research group will usually give the next set a head start. So that's fine, and often a better alternative than Windows.

(2) I understand that part of your work was to process video data captured from experiments. You said you did all the work in C++. I would have wondered about that and asked why you didn't use e.g. Matlab, and if you needed parallel processing, why not use a Hadoop cluster with Matlab? Of course it depends a bit on the exact nature of the algorithms (but even then you could have C++ subroutines for the fiddly bits), but in my experience Matlab is about 5-10 times quicker to code and has comparable runtimes. I have seen students mess up terribly with C++ when they tried to make active use of its class features together with memory management, to the extent that they had to ask for help from a real programmer, or I even had to gently nudge them towards Matlab, Python, or existing library routines. Matlab does its own memory management, has powerful graphics, excellent maths primitives, and decent analysis routines. Of course Matlab on Linux is relatively expensive, so you either use it on Windows or there will be a time/cost trade-off. I wonder if you did a comparison and made a deliberate choice, or if you just went with what was nearest to hand. While it's largely your own responsibility to choose your tools (feel free to make a detour through C++ if you like), I'm not convinced your choice was optimal from the perspective of solving your problem.

(3) As to your interest in finding out what goes on "behind the scenes" in condensing data blocks: this is entirely laudable. It left me wondering, however, how much of your time and energy was spent wisely, as you weren't working towards becoming a programmer but towards a PhD in Physics. As to digging into the specifics of database processing, Linux system interfacing, and sockets programming: as a hobby it's fine, but from a professional (or research) point of view I'd regard it as a waste of time (unless you have clear evidence that available routines are inefficient or insufficiently adaptable). But that's just my opinion.

(4) As regards the CLI, which you so praise, I have a healthy respect for it but usually hide its workings behind make scripts, Python, and GUIs. As far as I've seen, it helps students (even PhD students) to build a simple GUI (Matlab is excellent for this; you might like Kivy if you insist on FOSS) from which to run their software to produce final output. The main reason is that students tend to end up with a myriad of (usually undocumented) subtly different versions of whatever software tools they've built, and are often unable to reliably reproduce their work, let alone reliably adapt it. Having a GUI for any "results version" leads to one set of traceable routines that went into a certain result, which is easy to archive and can easily (and reliably) reproduce your results even after a few years. Besides, it helps whoever comes after you.

Hi Golodh, thanks for your detailed comment. Let me briefly respond to some of the points you made.

(2) We already had an existing implementation in Wolfram Mathematica, but it was barely automated and much too slow to push large data sets through. Especially the numerical methods slowed down the process, and optimization was really necessary, since the equations used are hard to solve and unstable. This becomes even worse with decreasing quality of the input sequences. The decision to use C++ was actually not mine, it was part of the task, but we got a speedup from a few weeks in Wolfram Mathematica to a few hours in C++, so I think the decision was not too bad. I definitely agree with you on using existing libraries as far as possible, but my experience has been that "blind" usage of libraries can also lead to issues. Sometimes you have to change, tune, modify, or adapt a library function. Sometimes this can be done by writing a wrapper function, but in some cases it's faster to write the whole function yourself. I see the benefits of Matlab and Python concerning rapid prototyping, but being a trained C++ programmer with the possibility to draw on a large existing pool of code makes this advantage somewhat less important. Of course you are better off programming in Matlab if you don't know how to manage memory in C++ efficiently, but you should agree that in principle Matlab or any other math software cannot be faster than a customized C++ implementation. I know of a PhD project that consists of writing highly optimized C++ code intended to beat Comsol Multiphysics in performance. It's really worth the effort for runtime-critical applications.

(3) You are probably right to call it a waste of time, given that there are no benefits for my current work. Up to now it is just a hobby, and I'm not spending too much time on it. Yes, I'm doing a PhD in physics, but I'm not necessarily tied to academic research. Actually, I'm planning to get a job in software development or a related field.

(4) Instead of a GUI, I have a config file setting all relevant parameters, and only a few of them have to be adapted after changing the input files. So usually it's fairly simple to run the program, and the results are wrapped into a large HTML report showing all relevant information. This is not too far from a GUI, from my point of view. We already share the code with some chemists. They normally use Windows but have had no issues using the CLI in a virtual machine so far.
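A minimal sketch of that config-plus-report pattern (the key names, file names, and report contents are invented for illustration, not the actual project's format):

```shell
#!/bin/sh
# Read key=value parameters from a config file and wrap the results in
# a small self-documenting HTML report. All names are illustrative.

cat > run.conf <<'EOF'
input=movie042.avi
threshold=0.35
frames=500
EOF

{
    echo '<html><body>'
    echo '<h1>Analysis report</h1>'
    echo "<p>Generated: $(date)</p>"
    echo '<h2>Parameters</h2><ul>'
    while IFS='=' read -r key value; do
        echo "<li>$key = $value</li>"
    done < run.conf
    echo '</ul></body></html>'
} > report.html
```

Because every run's parameters are echoed into the report itself, the output documents how it was produced, which serves much the same traceability purpose as a GUI.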

In reply to by Golodh (not verified)

Hi Jonas,

Happy to see you had the time to read and comment on my ramblings. I just couldn't resist commenting when I read your article because of its enthusiasm for the programming side of things (well ... this is a programmer's forum).

(2) Ok, I didn't know that you already had a working prototype (basic algorithm clear, a working system design available). Rewriting the software to optimise performance then makes good sense. I also didn't realise that you already were an experienced C++ programmer. Thanks for pointing that out; I believe this combination makes all the difference.

I agree that Mathematica isn't the best choice for high-performance numerical work, but it's an excellent choice for prototyping if equations are hard to solve and numerically unstable (which is probably why it was used in the first place).

As to Matlab performance, I have seen cleverly structured Matlab functions (leading to calls to the Kuck (handcrafted assembly language) library routines) beat "optimised" C++ code on a regular basis.

The problem is that the people doing the "optimising" (not being software engineers or numerical mathematicians but researchers) often don't understand detailed software engineering issues like pipelining, the machine's memory hierarchy, cache coherence, the use of stride in array access, etc., but instead love to use abstractions that hinder the compiler in producing good assembly code, and are fond of using all sorts of clever tricks to save a few percentage points of operations at the cost of many pipeline stalls. They are then very surprised to learn that their "optimised" code doesn't set performance records (or worse: suffers from numerical instability) :-)

But by and large I think you are right: if you do know about software engineering details and are a competent C++ programmer, then your implementation is usually quite hard to beat.

(3) Ok, didn't know that either. So you probably made the right choice again :-)

(4) Ok, fair enough. As an experienced programmer you probably check your inputs and make sure your code doesn't crash and burn when one or more inputs aren't there or are in the wrong format. Lots of research students (and researchers) never bother and leave whoever comes after them to deal with nasty configuration issues and totally avoidable hunt-the-bug exercises.

I also like the touch of producing a single coherent self-documenting output file. It's amazing how many people don't even include the date/time of the run in their output, let alone exactly which input went into it (ah well ... they usually learn fast), and instead produce a myriad of disparate output files, all named alike ;-P.

Again, it looks as if you did the thing right. Cheers.

In reply to by tzunami_bomber

You make some great points. My thoughts, as a Master's student in computational physics:
(1) A more significant factor is the role of Linux in HPC. You're not going to run a cluster on Windows! Scientific Linux in particular is specialized for high performance distributed computing. Your hand is forced here, as a computational physicist.
(2) I can only speak to my own experience here, but I've always found it much quicker and easier to write fast, robust code in C and Python than in Matlab. I have a minor in compsci, though, and used scientific python in my undergrad; so I'm sufficiently disillusioned to resist the Siren's call of unnecessary OOP (and have stuck with C99 almost exclusively), and already have a high level language preference. I use Python anytime I can get away with it, C for heavy lifting, and C libraries loaded into python for everything in between. In my opinion Matlab as a language is too restricted for general usage and too slow and high-level (I WANT to manage my memory!) for HPC, but it's redeemed as a tool by its excellent libraries.
Most importantly, C and C++ (and Fortran) fully support MPI and can port to any server architecture, which makes them the canonical languages for large-scale comp-phys applications. You can't stay away from them for long!
(4) I'm a config-file guy, for simplicity and adaptability. You make an excellent argument for GUIs, but I have to wonder: do these students use version control and build tools? The development history of much of my work is a horrific mess, full of hacks to pump out data I needed immediately and the occasional apocalyptic refactoring... but I can load up v0.01 at the drop of a hat and immediately compile and run a simulation, thanks to Git, make, and shell scripts. GUIs are useful (on my to-do list!), but version control and build tools are absolutely essential.
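One lightweight way to get that kind of traceability without a GUI is a run script that stamps each result with the code revision and the parameters used (the directory layout, the parameter values, and the simulate command are placeholders, not anyone's actual setup):

```shell
#!/bin/sh
# Archive each run with enough metadata to reproduce it later: the VCS
# revision, a timestamp, and the exact parameters. "./simulate" is a
# placeholder for the real program.

run_id=$(date +%Y%m%d-%H%M%S)
outdir="runs/$run_id"
mkdir -p "$outdir"

# Record the code revision; fall back gracefully outside a git repo.
rev=$(git rev-parse --short HEAD 2>/dev/null || echo unknown)

cat > "$outdir/metadata.txt" <<EOF
revision=$rev
date=$run_id
params=threshold=0.35 frames=500
EOF

# The actual run would redirect its results into the same directory:
# ./simulate --threshold 0.35 --frames 500 > "$outdir/results.dat"
echo "run archived in $outdir"
```

Wrapping every production run in a script like this means each result directory answers "which code, which inputs, when?" even years later.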

In reply to by Golodh (not verified)

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.