The scientific Python community has grown rapidly over the years, thanks to the language's ability to analyze data quickly and interactively and to act as a glue language that integrates code and data from a variety of environments. The SciPy conference started off as a gathering for developers of scientific packages to discuss future directions, but it has grown into a six-day conference: two days of tutorials; two days of conference talks, plenary talks, poster presentations, lightning talks, and birds-of-a-feather (BOF, pronounced b-aw-ff as in David Hasselhoff) sessions; and two days of sprints (hackathons).
This year’s themes were Machine Learning and Reproducible Science, and mini-symposia were held on Medical Imaging; Bioinformatics; Meteorology, Climatology, Atmospheric, and Oceanic Science; Geospatial Data Analysis (GIS); and Astronomy and Astrophysics.
In addition to sponsoring the conference, Kitwareans maintained a strong presence, with CEO Will Schroeder delivering a keynote, Pat Marion presenting a talk and a poster, and Matt McCormick acting as Program Committee Co-Chair.
Interactive tutorials covered introductory, intermediate, and advanced topics. Video recordings of the tutorials are available on YouTube, as are all the talks, keynotes, and mini-symposia (some are still being uploaded at the time of this writing). Most tutorial chairs also host their material in public GitHub repositories; see a presentation for details.
IPython in Depth
Matt attended the IPython in Depth tutorial (video, repository) delivered by members of the IPython community, during which there was a long discussion of how they abandoned their -users mailing list in favor of a single -dev mailing list to encourage participation and a sense of ownership. The tutorial covered usage and configuration of the IPython shell, the IPython Notebook web interface, and the IPython system for parallel and distributed computing. The IPython Notebook has huge traction within the community, and it was used as the teaching platform for most of the tutorials. Many presenters also used it, coupled with reveal.js, for their presentations.
An Introduction to scikit-learn (I) and (II)
Matt also attended a day-long set of tutorials (video, repository) on machine learning by the developers of scikit-learn: Gaël Varoquaux, Jake Vanderplas, and Olivier Grisel. All three are brilliant programmers and scientists. Gaël and Jake’s primary disciplines are medical imaging and astronomy, respectively, and Olivier is a leader in machine learning who also delivered a keynote on the subject this year. Gaël and Olivier were two of the many French participants at the conference; the French contingent was at least as large as Kitware’s!
The tutorial was outstanding, and it is highly recommended for anyone who wants a quick yet thorough introduction to machine learning. The material covered an introduction to the subject, including an overview of supervised classification and regression and of unsupervised clustering and dimensionality reduction; introductory and in-depth overviews and examples of the most popular algorithms; high-level concepts; practical issues such as regularization, training versus test sets in cross-validation, pre-processing, and parallel analysis; and how to get started quickly with scikit-learn.
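For readers new to the idea of training versus test sets in cross-validation, here is a minimal sketch in plain Python. It is not taken from the tutorial material, and the helper name `k_fold_indices` is hypothetical. It only shows how the samples are partitioned so that each one is held out for testing exactly once; scikit-learn ships its own utilities for this, which are the better choice in practice.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    # Shuffle once so folds do not follow the original sample order.
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        # Fold k takes every n_folds-th shuffled sample, starting at k.
        test = indices[k::n_folds]
        held_out = set(test)
        # Everything not held out is available for training.
        train = [i for i in indices if i not in held_out]
        yield train, test


folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train, test in folds:
    print(len(train), "training samples,", len(test), "test samples")
```

A model would be fitted on each `train` subset and scored on the matching `test` subset, and the scores averaged. Because no sample appears in both subsets of a fold, the score reflects performance on data the model has not seen.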
Talks and posters were categorized into one of three tracks: General, Reproducible Research, and Machine Learning. Some talks will be published in the conference proceedings and in a special issue of the IOP Computational Science & Discovery (CSD) journal. Video recordings of the talks are available online, as are the slides and presentation material.
Will’s presentation, The New Scientific Publishers (video), discussed how the driving need for scientific reproducibility and new ways of publishing on the web are converging to build a larger publishing collective, which consists of open source communities, data curators, and open access journals based on open standards. These disparate communities are joining together to create a broad, scalable system for scientific publishing.
The subject of Pat’s talk was the Python import problem: importing modules from a shared network filesystem at extreme scale will destroy the performance of a parallel Python program. Guido van Rossum described the issue at his PyCon keynote last year (video), and Fernando Perez mentioned the issue in his SciPy keynote last week when presenting IPython.parallel (video).
This is a major bottleneck to using Python at supercomputing scale, but it’s also a solved problem in many ways. However, a variety of solutions are still at the proof-of-concept stage, and no solution has been officially adopted and supported by Python, NumPy, and SciPy. That means every research group has to encounter the issue for themselves, and figure out which approach they should choose and how to implement it in their code. This is especially important to groups that are using ParaView Catalyst, a Python-driven ParaView compute engine that is linked to simulation codes running on supercomputers.
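To make the problem concrete, here is a minimal sketch of one family of workarounds. It is not necessarily an approach covered in the talk, and the helper `load_module_from_source` is hypothetical. It assumes a single process reads a module's source once and sends the text to the others (for example over MPI), so that the receiving processes build the module from memory instead of searching the shared filesystem. Only the receiving side is shown, with a literal string standing in for the broadcast text.

```python
import sys
import types


def load_module_from_source(name, source):
    """Build a module from source text already held in memory.

    Hypothetical helper: in a parallel job, `source` would arrive from
    the one process that actually read the file.
    """
    module = types.ModuleType(name)
    module.__file__ = "<broadcast:%s>" % name  # marker, not a real path
    code = compile(source, module.__file__, "exec")
    exec(code, module.__dict__)
    # Register the module so later `import name` statements reuse it.
    sys.modules[name] = module
    return module


# Stand-in for text that would arrive from the broadcasting process.
source = "def area(w, h):\n    return w * h\n"
shapes = load_module_from_source("shapes_demo", source)

import shapes_demo  # found in sys.modules, so no filesystem search

print(shapes_demo.area(3, 4))
```

A real solution also has to handle packages, compiled extension modules, and shared libraries, which is part of why no single approach has been adopted.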
The talk was well received, and lots of people came up to say how important the issue was to them. In addition to the talk, Pat presented the poster “3D Perception: Point cloud data processing and visualization” (.pdf).
Matt attended both the first Bioinformatics symposium and the Medical Imaging symposium. The Bioinformatics symposium demonstrated how far Python has come in the -omics world, and showed that a number of emerging packages provide rapid, visual, interactive, and distributed analysis of genomic data, such as Gemini, metaseq, bcbio-nextgen, and synapse.
Talks of interest to Kitware at the Medical Imaging symposium included a well-implemented open source DICOM anonymization system. Additionally, Ventana Medical Systems (associated with Roche) gave an interesting talk describing how the SciPy stack, coupled with OMERO and wrapped ITK and VTK, enables a rapid technology prototyping and assessment platform. However, they lamented the difficulties in building, maintaining, and distributing the software stack, which may be an opportunity for delivering a scientific Python/Kitware toolkit stack distribution based on better CMake/Python integration.
Other interesting business contacts we made included Rich Signell from the USGS, who was interested in ParaView for 3D oceanography visualization, and Ondřej Čertík from LANL, who was eager to demo his VTK/IPython Notebook integration.
On Wednesday, Matt moderated a BOF panel on reproducible research. This was very well attended (over 35 people), especially considering that many attendees sacrificed lunch time to participate. In general, the reproducibility theme was a huge success — it is clear that we are at the beginning of a new era!
A sampling of the tools presented at the conference covered different areas such as:
Document and analysis tracking/literate programming
- IPython presentations
- Open Science Framework
Research metadata tracking and analysis
Satisfying Software Dependencies
Areas that were identified for improvement include:
- Software and data dependency specification and resolution
- Open standards
- Increased peer pressure for satisfying a reproducible research standard
There is discussion about creating tutorials related to reproducible research for next year’s conference.
Matt also participated as a panelist on the SciPy 2014 BOF. In conclusion, this year’s conference was a record-breaker that required capping registration after 300 attendees. The intention for next year is to broaden the community even more with a target attendance of over 500.
Austin, Texas in June is not the best place for a run, but we did it anyway!