Bringing you the news about the future of Open Source
In this episode of Open Source Directions we were joined by Jeff Bezanson and Katie Hyatt who talk about the work they have been doing with Julia. Julia is a programming language that was designed from the beginning for high performance. Its programs compile to native code for multiple platforms via LLVM. Julia is dynamically typed, feels like a scripting language, and has good support for interactive use. Julia has a rich language of descriptive datatypes, and type declarations can be used to clarify and solidify programs. The language uses multiple dispatch as a paradigm, making it easy to express many object-oriented and functional programming patterns. It provides asynchronous I/O, debugging, logging, profiling, a package manager, and more.
In this episode of Open Source Directions we were joined by Aditya Mukhopadhyay who talked about the work he has been doing with RecallGraph. RecallGraph is a versioned-graph data store - it retains all changes that its data (vertices and edges) have gone through to reach their current state. It supports point-in-time graph traversals, letting the user query any past state of the graph just as easily as the present.
In this episode of Open Source Directions we were joined by Matthew Seal who talked about the work he has been doing with Jupyter and nteract. Matthew also discussed a particular topic: common Jupyter tools and their adoption for various use cases in the wild.
OpenTechResponse is the hub for information sharing and coordination between open source projects responding to an emergency or crisis situation.
Spyder is a powerful scientific environment written in Python, for Python, and designed by and for scientists, engineers and data analysts. It offers a unique combination of the advanced editing, analysis, debugging, and profiling functionality of a comprehensive development tool with the data exploration, interactive execution, deep inspection, and beautiful visualization capabilities of a scientific package.
Fortran is a compiled language, which means that once written, the source code must be passed through a compiler to produce a machine executable that can be run.
Apache Arrow is a cross-language development platform for in-memory data. It supports zero-copy streaming messaging and has support for a number of languages, including C, C++, Python, R, Rust, and many others.
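For a sense of the Python bindings, here is a minimal sketch using pyarrow to build an in-memory columnar table from a pandas DataFrame and convert it back; the column names and data are made up for illustration.

```python
# A minimal sketch of Arrow's columnar, in-memory tables via pyarrow.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

table = pa.Table.from_pandas(df)   # columnar Arrow table held in memory
print(table.schema)                # language-independent schema
print(table.to_pandas())           # and back to pandas again
```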
Jupyter Book lets you build an online book using a collection of Jupyter Notebooks and Markdown files. Its output is similar to the excellent Bookdown tool, and it adds extra functionality for people running a Jupyter stack.
Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm. Data preprocessing usually consists of a series of steps that involve transforming raw data into an understandable/usable format.
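As a rough illustration of that method-chaining style (assuming pyjanitor and pandas are installed; the column names are made up), a cleaning sequence might look like this:

```python
# Importing janitor registers extra cleaning methods on pandas DataFrames.
import pandas as pd
import janitor  # noqa: F401

raw = pd.DataFrame({
    "First Name": ["Ada", "Grace", None],
    "Hours Worked": [40, 38, None],
})

cleaned = (
    raw
    .clean_names()                 # "First Name" -> "first_name", etc.
    .remove_empty()                # drop rows/columns that are entirely empty
    .fillna({"hours_worked": 0})   # plain pandas steps chain right in
)
print(cleaned)
```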
Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics, and affords high-performance interactivity over large or streaming datasets. Bokeh can help anyone who would like to quickly and easily make interactive plots, dashboards, and data applications. This is our second time visiting Bokeh, in preparation for the v2.0 release!
Lale is a Python library for semi-automated data science. Lale makes it easy to automatically select algorithms and tune hyperparameters of pipelines that are compatible with scikit-learn, in a type-safe fashion. If you are a data scientist who wants to experiment with automated machine learning, this library is for you! Lale adds value beyond scikit-learn along three dimensions: automation, correctness checks, and interoperability. For automation, Lale provides a consistent high-level interface to existing pipeline search tools including GridSearchCV, SMAC, and Hyperopt. For correctness checks, Lale uses JSON Schema to catch mistakes when there is a mismatch between hyperparameters and their types, or between data and operators. And for interoperability, Lale has a growing library of transformers and estimators from popular libraries such as scikit-learn, XGBoost, and PyTorch. Lale can be installed just like any other Python package and can be edited with off-the-shelf Python tools such as Jupyter notebooks.
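For context, the snippet below is plain scikit-learn, not Lale's own API; it shows the kind of pipeline-plus-hyperparameter search that Lale wraps behind a higher-level, type-checked interface.

```python
# Ordinary scikit-learn pipeline search, of the kind Lale automates.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))])

search = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [2, 3], "clf__C": [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```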
STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of time series data mining tasks such as: pattern/motif (approximately repeated subsequences within a longer time series) discovery, anomaly/novelty (discord) discovery, shapelet discovery, semantic segmentation, density estimation, time series chains (temporally ordered set of subsequence patterns), and more!
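A minimal sketch of computing a matrix profile with STUMPY on a toy series; the series and window length here are arbitrary.

```python
# stumpy.stump(T, m) returns a matrix profile whose first column holds the
# z-normalized distance from each length-m subsequence to its nearest neighbor.
import numpy as np
import stumpy

T = np.random.rand(1_000)                       # a toy time series
m = 50                                          # subsequence (window) length
mp = stumpy.stump(T, m)

profile = mp[:, 0].astype(float)
motif_idx = np.argmin(profile)                  # most-repeated pattern (motif)
discord_idx = np.argmax(profile)                # most anomalous subsequence (discord)
print(motif_idx, discord_idx)
```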
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
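A small example of the typical pyplot workflow, producing a publication-style hardcopy figure.

```python
# Build a figure with the object-oriented Matplotlib API and save it to disk.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("amplitude")
ax.legend()
fig.savefig("sine.png", dpi=150)   # other hardcopy formats: .pdf, .svg, ...
```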
stdlib ("standard lib") is a standard library for JavaScript and Node.js, with an emphasis on numerical and scientific computing applications. The library provides a collection of robust, high performance libraries for mathematics, statistics, data processing, streams, and more and includes many of the utilities you would expect from a standard library.
Voilà turns Jupyter notebooks into standalone web applications. Unlike the usual HTML-converted notebooks, each user connecting to the Voilà tornado application gets a dedicated Jupyter kernel which can execute the callbacks to changes in Jupyter interactive widgets. By default, Voilà disallows execute requests from the front-end, preventing execution of arbitrary code, and it runs with the strip_sources option, which strips the input cells out of the rendered notebook.
The Econ-ARK project provides open-source toolkits for researchers trying to understand how economic and social outcomes result from the actions of heterogeneous individuals. The primary goals of the project are to make entry into the world of such modeling easy; to accelerate the development of this kind of modeling for policy-making and academic research; and to increase the openness, replicability, and interoperability of modeling tools.
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data: 1. The data is uniformly distributed on a Riemannian manifold; 2. The Riemannian metric is locally constant (or can be approximated as such); 3. The manifold is locally connected. From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.
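A minimal sketch using the umap-learn package, which exposes the algorithm through a scikit-learn-style estimator; the digits dataset is just a convenient example.

```python
# Reduce 64-dimensional digit images to a 2-D embedding for visualization.
import umap
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
embedding = reducer.fit_transform(X)   # shape: (n_samples, 2)
print(embedding.shape)
```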
Panel provides tools for easily composing widgets, plots, tables, and other viewable objects and controls into control panels, apps, and dashboards. Panel works with visualizations from Bokeh, Matplotlib, HoloViews, and other Python plotting libraries, making them instantly viewable either individually or when combined with interactive widgets that control them. Panel works equally well in Jupyter Notebooks, for creating quick data-exploration tools, or as standalone deployed apps and dashboards, and allows you to easily switch between those contexts as needed.
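As a rough sketch (using Matplotlib as the plotting library and a made-up plotting function), a tiny Panel app that links a widget to a plot might look like this:

```python
# A minimal Panel app: a slider controlling a Matplotlib figure.
import numpy as np
import matplotlib.pyplot as plt
import panel as pn

pn.extension()

def sine_plot(frequency=1.0):
    fig, ax = plt.subplots(figsize=(4, 3))
    x = np.linspace(0, 2 * np.pi, 200)
    ax.plot(x, np.sin(frequency * x))
    plt.close(fig)      # let Panel render the figure; avoid duplicate display
    return fig

# pn.interact builds a slider for `frequency` and links it to the plot.
dashboard = pn.interact(sine_plot, frequency=(0.5, 5.0))
dashboard.servable()    # `panel serve this_script.py` deploys it as an app
```

The same object displays inline in a Jupyter notebook, which is what makes switching between notebook exploration and a deployed dashboard straightforward.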
OpenTeams brings together organizations using open source software with creators and maintainers of the software to facilitate and grow funding opportunities.
Vega is a declarative format for creating, saving, and sharing visualization designs. With Vega, visualizations are described in JSON and rendered as interactive views using either HTML5 Canvas or SVG.
Have a repository full of Jupyter notebooks? With Binder, open those notebooks in an executable environment, making your code immediately reproducible by anyone, anywhere.
The nteract project is an ecosystem of open source tools to enable people to build their own front-ends and workflows on top of the Jupyter ecosystem.
conda-forge is a community-led collection of recipes, build infrastructure, and distributions. Conda-forge currently builds conda packages for Linux, macOS, Windows, ARM, and POWER8 architectures. Conda-forge has 1400 members in its GitHub organization and more than 7000 repositories. The conda-forge channel has about 80 million downloads a month, and growing. Conda-forge is an official NumFOCUS project.
scikit-learn provides simple and efficient tools for data mining and data analysis which are accessible to everybody, and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib.
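A minimal example of the common estimator workflow: fit on a training split, then score on a held-out test split.

```python
# Train a classifier on the wine dataset and report test accuracy.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # mean accuracy on the test split
```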
xtensor provides an extensible expression system enabling lazy broadcasting, an API following the idioms of the C++ standard library, and tools to manipulate array expressions and build upon xtensor. xtensor containers are inspired by NumPy, the Python array programming library. Adaptors can easily be written to plug existing data structures into its expression system. xtensor requires a modern C++ compiler supporting C++14.
uarray is an array interface for Python with pluggable backends and a multiple-dispatch mechanism for defining downstream functions. CORRECTION: In the episode Hameer implied moving data from GPUs to CPUs won’t be a problem in PCIe 4.0. It’s actually in an Intel-proposed extension to PCIe 5.0.
Pyodide provides transparent conversion of objects between JavaScript and Python. When inside a browser, this means Python has full access to the Web APIs. While closely related to the Iodide project, Pyodide may be used standalone in any context where you want to run Python inside a web browser.
PyMC3 is a probabilistic programming package for Python that allows users to fit Bayesian models using a variety of numerical methods, most notably Markov chain Monte Carlo (MCMC) and variational inference (VI).
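A minimal sketch of fitting a simple Bayesian model with MCMC in PyMC3; the data and priors below are toy choices for illustration.

```python
# Estimate the mean and noise scale of normally distributed observations.
import numpy as np
import pymc3 as pm

data = np.random.normal(loc=2.0, scale=1.0, size=100)   # toy observations

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)        # prior on the mean
    sigma = pm.HalfNormal("sigma", sigma=5.0)       # prior on the noise scale
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
    trace = pm.sample(1000, tune=1000)              # NUTS, a form of MCMC

print(pm.summary(trace))
```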
TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications.
Chainer is a powerful, flexible and intuitive deep learning framework. Chainer supports CUDA computation; it only requires a few lines of code to leverage a GPU, and it runs on multiple GPUs with little effort. Chainer supports various network architectures including feed-forward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures. Forward computation can include any Python control flow statements without sacrificing the ability to backpropagate, which makes code intuitive and easy to debug.
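A small, hedged sketch of Chainer's define-by-run style, where ordinary Python expressions build the graph that gradients flow back through.

```python
# The computation graph is recorded as plain Python code runs.
import numpy as np
from chainer import Variable

x = Variable(np.array([5.0], dtype=np.float32))
y = x ** 2 - 2 * x + 1          # arithmetic on Variables builds the graph
y.backward()                    # backpropagate through whatever actually ran
print(x.grad)                   # -> [8.] since dy/dx = 2x - 2 at x = 5
```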
Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code. Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN. Users do not need to replace the Python interpreter, run a separate compilation step, or even have a C/C++ compiler installed. Applying one of the Numba decorators to a Python function is all that is needed.
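A minimal example of the decorator in action: the first call compiles the function for the argument types it sees, and subsequent calls reuse the machine code.

```python
# A plain Python loop compiled to machine code with Numba's @njit decorator.
import numpy as np
from numba import njit

@njit
def array_sum(a):
    total = 0.0
    for i in range(a.shape[0]):   # explicit loop, fast under Numba
        total += a[i]
    return total

x = np.random.rand(1_000_000)
print(array_sum(x))   # compiled on first call, fast thereafter
```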
ITK is an open-source, cross-platform system that provides developers with an extensive suite of software tools for image analysis. Developed through extreme programming methodologies, ITK employs leading-edge algorithms for registering and segmenting multidimensional data.
Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages. The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
The Spark Python API (PySpark) exposes the Spark programming model to Python.
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love. Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.
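A minimal sketch with dask.array, which mirrors the NumPy API while splitting work into chunks that are only executed on request.

```python
# Lazy, chunked array computation that runs in parallel on .compute().
import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(2_000, 2_000))
result = (x + x.T).mean(axis=0)   # builds a task graph, nothing runs yet
print(result.compute()[:5])       # executes the graph across the chunks
```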
The aim of PyData/Sparse is to create sparse containers that implement the ndarray interface. Traditionally in the PyData ecosystem, sparse arrays have been provided by the scipy.sparse submodule. All containers there depend on and emulate the numpy.matrix interface. This means that they are limited to two dimensions and also do not work well in places where numpy.ndarray would work. PyData/Sparse is well on its way to replacing scipy.sparse as the de-facto sparse array implementation in the PyData ecosystem.
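A minimal sketch using the sparse package's COO container, which supports n dimensions and NumPy-style operations, unlike the 2-D scipy.sparse matrices.

```python
# A 3-D sparse array built from a mostly-zero NumPy array.
import numpy as np
import sparse

dense = np.zeros((100, 100, 100))
dense[0, 0, 0] = 1.0
dense[50, 50, 50] = 2.0

s = sparse.COO.from_numpy(dense)          # n-dimensional, ndarray-like
print(s.nnz, s.shape)                     # stored non-zeros and full shape
print(s.sum(axis=(0, 1)).todense()[:3])   # NumPy-style reductions work
```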
Datashader is a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly. Datashader breaks the creation of images into a series of explicit steps that allow computations to be done on intermediate representations. This approach allows accurate and effective visualizations to be produced automatically without trial-and-error parameter tuning, and also makes it simple for data scientists to focus on particular data and relationships of interest in a principled way.
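As a rough sketch of that pipeline (aggregate points onto a fixed-size canvas first, then shade the aggregate into an image), using made-up random data:

```python
# Datashader's two explicit steps: aggregate, then shade.
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

df = pd.DataFrame({
    "x": np.random.standard_normal(1_000_000),
    "y": np.random.standard_normal(1_000_000),
})

canvas = ds.Canvas(plot_width=400, plot_height=400)
agg = canvas.points(df, "x", "y")   # intermediate aggregate: counts per pixel
img = tf.shade(agg, how="log")      # colormap the aggregate into an image
img.to_pil().save("points.png")
```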
SciPy is open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more. SciPy provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization. SciPy runs on all popular operating systems, is easy to use, and powerful enough to be depended upon by the world's leading scientists & engineers.
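A couple of those routines in a small example, using scipy.integrate and scipy.optimize.

```python
# Numerically integrate one function and minimize another.
import numpy as np
from scipy import integrate, optimize

# Integral of sin(x) over [0, pi]; the exact answer is 2.
value, error = integrate.quad(np.sin, 0, np.pi)

# Minimize a simple quadratic; the minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)

print(value, result.x)
```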
GeoViews is a Python library that makes it easy to explore and visualize geographical, meteorological, and oceanographic datasets, such as those used in weather, climate, and remote sensing research. GeoViews is built on the HoloViews library for building flexible visualizations of multidimensional data. GeoViews adds a family of geographic plot types based on the Cartopy library, plotted using either the Matplotlib or Bokeh packages. With GeoViews, you can now work easily and naturally with large, multidimensional geographic datasets, instantly visualizing any subset or combination of them, while always being able to access the raw data underlying any plot.
Intake appeals to different groups of users but is useful to all of them, acting as a common platform that smooths the progression of data from developers and providers to users.
CuPy is a NumPy-compatible array library for GPU-accelerated computing with NVIDIA CUDA. CuPy's interface is highly compatible with NumPy; in most cases it can be used as a drop-in replacement. Blog Post: https://quansight.github.io/Episode-5-CuPy/
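A hedged sketch of the drop-in style, assuming a CUDA-capable GPU and a CuPy build matching the installed CUDA version.

```python
# Move an array to the GPU, compute with the NumPy-like API, copy back.
import numpy as np
import cupy as cp

x_cpu = np.random.rand(10_000, 10_000)
x_gpu = cp.asarray(x_cpu)               # copy host array to the GPU

y_gpu = cp.linalg.norm(x_gpu, axis=1)   # same call signature as numpy.linalg.norm
y_cpu = cp.asnumpy(y_gpu)               # copy the result back to the host
print(y_cpu[:3])
```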
This episode features Bokeh, which is a web-based and interactive visualization library for Python. Its goal is to provide elegant, concise construction of versatile graphics, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.
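A minimal sketch of the bokeh.plotting API, writing a standalone interactive HTML page.

```python
# Build a line plot and save it as a self-contained, interactive HTML file.
import numpy as np
from bokeh.plotting import figure, output_file, show

x = np.linspace(0, 10, 200)
p = figure(title="Damped oscillation", x_axis_label="t", y_axis_label="y")
p.line(x, np.exp(-x / 5) * np.cos(2 * x), line_width=2)

output_file("oscillation.html")   # standalone HTML with pan/zoom/hover tools
show(p)
```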