We love open source software and open data, and try to use tools available to the wider community whenever possible. The data science stack underlying our platforms and analyses includes the following:
Python has become one of the most popular languages for scientific computing and data analysis, and we love it too! We’re currently using the Anaconda Python 3 distribution, packaged up by the kind folks at Continuum Analytics.
Jupyter is an interactive scripting and analysis framework that makes playing with data fun. It grew out of IPython, which was born right here in Boulder, Colorado at CU, in the hands of a physics grad student named Fernando Pérez.
Pandas provides easy-to-use, heterogeneous data frames for interactive data manipulation and visualization, including fairly large data sets.
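Here’s a tiny sketch of what that looks like in practice. The column names and values are made up for illustration; they aren’t from any real dataset.

```python
import pandas as pd

# A small frame mixing integer, string, and float columns
# (plant IDs, fuel types, and capacities here are invented).
plants = pd.DataFrame({
    "plant_id": [1, 2, 3],
    "fuel_type": ["coal", "gas", "wind"],
    "capacity_mw": [600.0, 450.5, 120.0],
})

# Group and aggregate in one expressive line:
capacity_by_fuel = plants.groupby("fuel_type")["capacity_mw"].sum()
```

The same few lines scale from a toy table like this one up to millions of rows.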
For flexible, static two-dimensional visualizations, we use Matplotlib, which is tightly integrated with all of the tools above. We’re also excited to get more familiar with Plotly, for interactive online data presentation.
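A minimal Matplotlib sketch, using the non-interactive Agg backend so it runs headless and writes the figure straight to a file:

```python
import matplotlib
matplotlib.use("Agg")  # render to files, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
fig.savefig("sine.png")
```

Because Matplotlib accepts NumPy arrays and Pandas columns directly, plotting a column of a data frame is just as short.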
High-level machine learning and clustering techniques have become very accessible and powerful. The scikit-learn (sklearn) suite of Python tools makes it much easier to automatically associate records in different datasets without overlapping IDs, and to extract interesting patterns from datasets that might otherwise be too large or complex to tackle.
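As a flavor of how little code this takes, here is a clustering sketch on made-up 2-D points (not a real record-linkage pipeline, just the sklearn idiom):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two obvious groups of points (values are invented).
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])

# Fit and label in one call; random_state makes the run repeatable.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
```

Swapping in a different estimator (nearest neighbors, random forests, etc.) is usually just a one-line change, which is what makes the library so pleasant for exploration.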
Pangeo is a project (partly run out of NCAR in Boulder) that makes it easier for geoscientists to do collaborative, open development and to make use of large shared data catalogs on cloud computing resources. Catalyst is in the process of creating a Pangeo instance for energy data work, so that we and our users can more easily access and manipulate the terabytes of US electricity data available from FERC, EIA, EPA, the ISOs, and other agencies. Users will simply open a Jupyter notebook served from the remote Pangeo instance where all the data lives, with no software installation or setup required.
Dask builds on Pandas by adding a task graph generator and scheduler, and a way of working with data that is significantly larger than the memory available on an individual computer. This allows a user to analyze the larger datasets within PUDL, conveniently making full use of a multi-core processor, or a cloud computing platform like Pangeo.
Testing may not be exciting… but it does make our code more likely to work! We use pytest to ensure that PUDL stays functional, and continuously test the full data processing pipeline from the original government sources to our live database.
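For anyone who hasn’t used pytest: tests are just plain functions whose names start with `test_`, full of `assert` statements. A minimal sketch (the `heat_rate` helper is invented for the example, not part of PUDL):

```python
# contents of a file like test_fuel.py -- pytest discovers test_* functions

def heat_rate(mmbtu, mwh):
    """Illustrative helper: fuel heat rate in MMBtu per MWh."""
    if mwh == 0:
        raise ValueError("generation must be nonzero")
    return mmbtu / mwh

def test_heat_rate():
    assert heat_rate(10_000, 1_000) == 10.0

def test_heat_rate_zero_generation():
    import pytest
    with pytest.raises(ValueError):
        heat_rate(10_000, 0)
```

Running `pytest` from the command line finds and executes every such test automatically.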
Under the hood, NumPy and SciPy are libraries that really make Python hum for data analysis and scientific computation. They build upon decades worth of high performance computing libraries, and make that functionality much more accessible than say… Fortran77.
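The big win is vectorization: one expression operates on a whole array at compiled speed, with no explicit Python loop. A tiny sketch with invented numbers:

```python
import numpy as np

# Hourly demand values (made up); apply a 5% loss factor to all of
# them in a single vectorized expression.
demand_mw = np.array([100.0, 250.0, 175.0, 300.0])
losses = 0.05
net_mw = demand_mw * (1 - losses)
```

Under the hood that multiplication dispatches to the same kind of optimized numerical kernels those older HPC libraries provided, without the reader ever seeing a loop index.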
Not that any of us would know anything about that. No sirree.
SQLAlchemy provides a database-agnostic Python programming environment that lets us build complex data structures, linked directly to an underlying database through an advanced Object Relational Mapper (ORM).
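Here’s what that looks like in miniature, assuming SQLAlchemy 1.4 or later. The `Plant` table is an invented example, not the actual PUDL schema, and an in-memory SQLite database stands in for any SQL backend:

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Plant(Base):
    """Illustrative mapped class: a Python object backed by a table row."""
    __tablename__ = "plants"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")   # any supported backend works here
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Plant(name="Comanche"))
    session.commit()
    names = [p.name for p in session.query(Plant).all()]
```

The same class definitions and queries work unchanged whether the engine points at SQLite, PostgreSQL, or another supported database, which is the database-agnostic part.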
For large local data sets, PostgreSQL is a fully featured open source database that seeks to implement the full SQL standard, which means lots of native support for specific data types.
Can you fall in love with a text editor? Apparently the answer is yes! Does that make us dorks? Who cares!