Landscape of High-Performance Python to Develop Data Science and Machine Learning Applications


1 Introduction

2.1 Approach and Context

2.2 Search Strategy

Database: web-based resources (Google Scholar, GitHub, PyPI, Reddit, Stack Overflow, Data Science Central)
Date of publication: 2018–2022 (for scientific publications)
Keywords: (… or … or … or … or … or … or … or … or …) combined with (… and (… or …))
Language: English
Type of publication: conference proceedings, journal articles, PyPI documentation, GitHub vignettes, social media posts, web news and articles
Inclusion criteria: tools/packages aimed at increasing the performance of Python code for ML and DS, directly or indirectly
Exclusion criteria: tools/packages outside Python; Python packages focusing on a single application use case; packages dedicated to a specific type of hardware other than CPU and GPU

2.3 Inclusion and Exclusion Criteria

3 Pure Python Performance Improvement
3.1 Distributed Memory and Shared Memory Approaches
3.2 Task-Based Approaches
3.3 Program Transformation and Compilation
3.3.1 Semi-Automatic Approaches
3.3.2 Automatic Approaches
4 Accelerating Numerical Libraries Usage
4.1.1 Legacy Drop-in
4.1.2 GPU Acceleration
4.1.3 Compilation Based
4.3 Scikit-learn
4.3.1 dislib
4.3.2 cuML
4.3.3 MLlib
5 Structuring Frameworks
5.1 Deep Learning Frameworks
5.2 Distributed Computation Frameworks
6 Discussion and Results
6.1 Results

6.1.1 First Scenario: Pure Python Performance Improvement.

Tool Name | Activity Period | Technique | Hardware | Usage Complexity | Open Source | Maintained by | Popularity

[ ] | 10.2006–11.2022 ★ | MPI | CPU | +++ | Y | Individuals | 632/174,734
  Used by other libraries and software such as H5py, vtk, dask, and PocketFlow. DS/ML practitioners may prefer using a library that hides low-level MPI instructions to obtain parallelism (a minimal MPI sketch follows this table).
[ ] | 06.2007–11.2016 ∗ | MPI | CPU | +++ | Y | Individuals | 68/na
  Used by other projects such as ANUGA, TCRM, and Wind multipliers. Avoids the construction of memory buffers by using Pickle. Not compatible with Python 3+.
[ ] | 05.2016–03.2022 ★ | OpenMP | CPU | ++ | Y | Individuals | 256/66,880
  Only works on systems with fork support. Incurs overhead to circumvent the GIL by using the OS fork method.
[ ] | 06.2021–01.2023 ∗ | OpenMP | CPU | ++ | Y | Individuals | 33/na
  Only works on Linux x86-64 with one specific Python and NumPy version. Distributed as a forked Numba package with a version out of the main branch.
[ ] | 01.2017–11.2022 ★ | Task based | CPU | + | Y | University | na/96
  Wrapper for COMPSs [ ]. Used mostly in academic environments. The user needs to know task-based programming and use the library's definitions.
[ ] | na | Task based | CPU | + | na | na | na
  Wrapper for Legion [ ].
[ ] | 04.2021–08.2022 ∗ | Task based | CPU | + | Y | Individuals | 58/41
  Wrapper for Kokkos [ ]. Works only on Ubuntu with gcc 7.5.0 and NVCC 10.2.
[ ] | 07.2019–05.2022 ★ | Task based | CPU | ++ | Y | Individuals | 390/567
  The user requires familiarity with task-based programming and the library's definitions.
[ ] | 01.2016–01.2016 ★ | Task based | CPU | ++ | Y | Individuals | 5/1
  The user requires familiarity with task-based programming and the library's definitions.
08.2007–01.2023 ★ | Compiler | CPU | +++ | Y | Individuals | 7.9k/30,221,789
  Makes it easy to create C extensions for Python, including parallel processing. Used by many popular libraries such as SciPy, pandas, and NumPy. Requires manual refactoring of code for performance improvement.
[ ] | 12.2017–12.2022 ★ | JIT | Both | + | Y | Enterprise | 8.5k/2,321,402
  A highly popular package according to these statistics. Uses decorators to give hints to the compiler and depends on LLVM.
[ ] | 10.2014–05.2018 ★ | Transpilation | CPU | +/– | Y | University | 384/167
  Used mainly in academic environments as support for other packages such as PyCosmo. Only supports a subset of Python.
[ ] | 07.2009–03.2023 ∗ | Transpilation | CPU | ++ | Y | Individuals | 673/35
  All variables are implicitly typed. Only supports Python 2.4–2.6.
[ ] | 02.2013–01.2023 ★ | Transpilation | CPU | na | Y | Individuals | 8.7k/66,296
  Translates Python code to highly optimized C or C++ code. Used by at least 25 other packages.
[ ] | 08.2012–01.2023 ★ | Compilation | CPU | na | Y | Individuals | 1.9k/262,749
  Used by at least 13 packages on GitHub to accelerate computations. Generates C++ code and optimizes it. The code needs to be compiled before it can be used in another project as a module.
12.2017–08.2019 ★ | Transpilation | CPU | + (annotations) | Y | Individuals | 122/78
  Can bridge Python, Fortran, and C/C++. Only a subset of the language is supported.
[ ] | 11.2017–08.2020 ∗ | Compilation | CPU | + | Y | Individuals | 11/na
  Relies on PyCOMPSs and PLUTO. The code is annotated and then automatically translated into PyCOMPSs task definitions.
[ ] | 11.2020–11.2022 ★ | JIT | CPU | +++ | Y | Individuals | 1.3k/894
  Compiles CPython bytecode into improved machine code. Does not currently support with blocks or async/await (YIELD_FROM) statements.
[ ] | 03.2021–03.2023 ∗ | JIT | CPU | na | Y | Enterprise | 2.4k/272,262
  JIT extension to CPython. Usage is straightforward, but PyPI releases are infrequent; it needs to be built from source for recent Python versions and is planned for incorporation into CPython.
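As a concrete illustration of this first scenario, here is a minimal sketch of the distributed-memory (MPI) approach. It uses mpi4py, a widely used Python MPI binding, chosen here purely for illustration: each process sums its own slice of the data, and the partial sums are reduced onto rank 0.

```python
# Run with, for example: mpiexec -n 4 python demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process computes a partial sum over its own strided slice of the data.
n = 1_000_000
chunk = np.arange(rank, n, size, dtype=np.float64)
partial = chunk.sum()

# Combine the partial sums onto rank 0.
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print("total =", total)
```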

6.1.2 Second Scenario: Accelerating Numerical Libraries Usage.

Tool Name | Activity Period | Library/Technique | Hardware | Usage Complexity | Open Source | Maintained by | Popularity

[ ] | 04.2014–10.2015 ★ | NumPy/Drop-in | CPU | + | Y | Individuals | 4/15
  Distributed NumPy arrays based on MPI. No updates since 2015.
[ ] | 04.2008–09.2011 ∗ | NumPy/Drop-in | CPU | na | Y | Individuals | 5/na
  The project is deprecated; it moved to Bohrium.
[ ] | 11.2017–11.2020 ★ | NumPy/Drop-in | Both | + | Y | Individuals | 218/336
  Bohrium is used as support for other packages such as Veros and Weld. Drop-in replacement for NumPy, but with limited coverage; it claims to fall back to NumPy, yet crashes were observed experimentally.
[ ] | 05.2016–09.2017 ∓ | NumPy/Drop-in | CPU | + | Y | Individuals | na/na
  No updates since 2017. Manual refactoring is needed to use the distributed data structure.
[ ] | 06.2016–01.2023 ★ | NumPy/Drop-in | GPU | + | Y | Enterprise | 6.9k/22,849
  Good but not complete NumPy coverage. Used by other libraries and part of the RAPIDS ecosystem. The user must add instructions for data copies between GPU memory and main memory (a minimal sketch follows this table). CuPy provides many packages for different CUDA and AMD ROCm versions.
[ ] | 04.2018–02.2021 | NumPy/Drop-in | Both | ++ | Y | Individuals | 1/na
  Numerical computations based on OpenCL and CUDA. Not documented.
[ ] | xx.2020–xx.2020 | NumPy/Drop-in | Both | na | na | na | na
  The method remains conceptual, as no associated code can be found.
[ ] | 03.2014–05.2014 ★ | NumPy/Drop-in | Both | na | Y | Individuals | 32/27
  Provides functions to create data structures (Vector and Matrix) and apply mathematical operations on multi-CPUs and GPUs.
[ ] | 12.2018–01.2023 ★ | NumPy/Drop-in/JIT | Both | –/++ | Y | Enterprise | 23,000/4,135,872
  Uses the Accelerated Linear Algebra (XLA) compiler [ ]. Used by many packages such as ColabDesign, trax, ml-workspace, and TensorNetwork. Can be used as a drop-in library for NumPy; nonetheless, JAX arrays are always immutable, and JAX is designed to work best with a functional programming style.
[ ] | 01.2009–10.2022 ★ | NumPy/Library | Both | + | Y | Individuals | 2k/2,975,123
  Used by other popular packages such as pandas, zipline, and osmnx. The API is not as extensive as NumPy's. Requires large arrays to amortize the overhead of compilation and usage.
[ ] | 02.2015–12.2022 ★ | Pandas/Drop-in | CPU | –/+ | Y | Individuals | 7.9k/52,307
  Used by at least 33 GitHub packages such as neural-lifetimes, geospatial-ml, radis, and optimus. Prioritizes memory optimization for handling large datasets, which may impact performance.
[ ] | 06.2018–01.2023 ★ | Pandas/Drop-in | CPU | na | Y | Individuals | 8.6k/1,390,926
  Can use execution engines like Dask [ ] or Ray [ ]. Used by at least 47 GitHub packages such as aws-sdk-pandas, ludwig, and pandera. Modin lacks full implementation of certain Pandas functions; invoking unsupported ones incurs overhead, as execution falls back to Pandas and results are transferred back to Modin, requiring data type transformation.
[ ] | 11.2018–02.2023 ∓ | Pandas/Drop-in | GPU | –/+ | Y | Enterprise | 5.5k/2,212
  Part of the rapids.ai collection of software. cuDF is restricted to NVIDIA GPUs, and it can lead to certain memory constraints: due to the relatively smaller size of GPU memory, users may encounter overheads related to memory swaps or copies, which can impact performance.
[ ] | 03.2018–07.2021 ★ | Pandas/Library | CPU | + | Y | Individuals | 1.7k/73,586
  Used by at least 3 other packages on PyPI.
[ ] | 03.2021–03.2023 ★ | Pandas/Library | CPU | + | Y | Enterprise | 16.9k/466,445
  Used by 186 packages and 1,664 repositories on GitHub. There are some differences between the Pandas API and the polars API.
[ ] | 02.2019–11.2022 ★ | SciKit/Drop-in | CPU | ++ | Y | University | 41/411
  Provides a set of distributed algorithms (related to ML and data processing) such as regression techniques, k-nearest neighbors, and k-means. Implemented on top of the PyCOMPSs programming model.
[ ] | 01.2019–06.2020 ★ | SciKit/Library | GPU | ++ | Y | Enterprise | 220/2,059
  Compatible with CuPy for NumPy support. Used by at least 6 packages and 178 repositories on GitHub. Does not convert code to run on GPUs; helps with the deployment and management of Dask workers.
[ ] | 10.2012–04.2023 ∓ | SciKit/Library | CPU | + | Y | Enterprise | na/na
  Must be used within Spark, of which MLlib is a part; on PyPI it is distributed as pyspark.
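As an illustration of the GPU drop-in pattern in this second scenario, here is a minimal sketch using CuPy (named in the table above). The explicit copies between main memory and GPU memory are the extra instructions its description refers to; it assumes an NVIDIA GPU and a matching cupy-cuda package.

```python
import numpy as np
import cupy as cp

x_cpu = np.random.rand(10_000, 100)

x_gpu = cp.asarray(x_cpu)        # explicit copy: main memory -> GPU memory
col_means = x_gpu.mean(axis=0)   # same API as NumPy, executed on the GPU
result = cp.asnumpy(col_means)   # explicit copy back: GPU -> main memory
```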

6.1.3 Third Scenario: Structuring Frameworks.

Tool Name | Activity Period | Techniques | Hardware | Usage Complexity | Open Source | Maintained by | Popularity

[ ] | 12.2016–03.2023 ★ | Framework, JIT | Both | +++ | Y | Enterprise | 175k/15,397,352
  Used by 4,934 packages and 275,967 repositories on GitHub. Powerful and widely adopted, with a steep learning curve; an industry standard.
[ ] | 11.2015–03.2023 ★ | Library | Both | + | Y | Enterprise | 58k/10,078,784
  Formerly supported several frameworks; bound to TensorFlow since 2019.
[ ] | 12.2018–03.2023 ★ | Framework, JIT | Both | ++ | Y | Enterprise | 67.2k/9,992,796
  Used by 225,826 repositories on GitHub. A major deep learning framework with high flexibility, more popular in research and academic contexts.
[ ] | 11.2010–07.2020 ★ | Framework | Both | ++ | Y | University | 9.7k/229,250
  Used by 203 packages and 13,754 repositories on GitHub. A former competitor of TensorFlow; development stopped in 2020, and it has since been forked as Aesara [ ].
[ ] | 05.2017–03.2022 ★ | Framework, JIT | Both | +++ | Y | Foundation | 20.4k/351,957
  Used by 94 packages and 6,381 repositories on GitHub. Similar to PyTorch in terms of flexibility, but lacks convenient IO primitives, and the release frequency has slowed recently.
[ ] | 10.2019–12.2021 ∗ | Task based | CPU | ++ | Y | Enterprise | 16/na
  Extends Python's built-in CPU parallelization to cluster environments, but has seen little adoption.
[ ] | 08.2017–02.2023 ★ | Task based, MPI | GPU | + | Y | Enterprise | 13.3k/71,025
  Used by 22 packages and 892 repositories on GitHub. Distributed deep learning framework based on OpenMPI or Gloo for communication.
[ ] | 11.2018–09.2019 ★ | Task based | CPU | ++ | Y | University | 280/162
  Task-based distributed computing framework with distributed Python objects. Note that Python 3+ is not supported.
[ ] | 01.2013–08.2022 ★ | Task based | CPU | ++ | Y | University | 591/1,415
  Used by 6 packages and 326 repositories on GitHub. Similar to a MapReduce framework, with extensive information available for deployment on high-performance clusters.
[ ] | na–na ∓ | Task based | CPU | ++ | Y | na | na/na
  Job-based distributed computing framework. Absent from the main repositories, and Python 3+ is not supported.
[ ] | 04.2009–02.2023 ★ | Task based | CPU | +++ | Y | Enterprise | 21.5k/6,198,447
  Used by 1,515 packages and 102,009 repositories on GitHub. Popular distributed task queue system with many features, but can be complex to deploy.
[ ] | 03.2010–02.2011 ★ | Task based | Both | ++ | Y | University | 67/17
  Distributed map function that also supports CUDA code. Not maintained since 2011 and does not support Python 3+.
[ ] | 04.2015–03.2023 ★ | Task based | CPU | ++ | Y | University | 2.4k/143,330
  Powerful, but requires significant code adaptation.
[ ] | 02.2017–03.2023 ★ | Task based | CPU | ++ | Y | University | 376/416,826
  Task-based parallelization library that uses asynchronous function invocation, backed by an active community.
[ ] | 06.2021–10.2022 ★ | Task based, Compil. | CPU | ++ | Y | University | 798/90
  Task-based and compilation-based framework for CPU parallelism. Requires the use of its own domain-specific language (DSL).
[ , ] | 01.2015–03.2023 ★ | Task based | Both | +/++ | Y | Enterprise | 10.9k/7,016,339
  Used by 1,583 packages and 47,013 repositories on GitHub. Supports both single-machine and distributed computing. Strongly bound to NumPy and Pandas.
[ ] | 06.2017–03.2023 ★ | Task based, Backend | Both | +++ | Y | Enterprise | 25.8k/2,095,907
  Used by 553 packages and 9,355 repositories on GitHub. Task-based framework with a versatile backend supporting both single-machine and distributed computing. Powerful, but can be complex to set up effectively.
[ ] | 03.2019–03.2021 ★ | Task based | Both | +++ | Y | University | 378/8,348
  Used by 3 packages and 6 repositories on GitHub. Supports both single-machine and distributed computing. Powerful capabilities, but requires significant work overhead for complex use cases and in a GPU context.
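To make the task-based pattern that dominates this table concrete, here is a minimal sketch using Dask (a task-based tool named elsewhere in this survey, chosen here for illustration). The function names load and summarize are hypothetical: decorated functions become lazy tasks, and the scheduler runs independent tasks in parallel.

```python
import dask

@dask.delayed
def load(i):
    # Stand-in for reading one block of data.
    return list(range(i * 100, (i + 1) * 100))

@dask.delayed
def summarize(block):
    return sum(block)

tasks = [summarize(load(i)) for i in range(8)]  # builds a task graph lazily
totals = dask.compute(*tasks)                   # executes the graph in parallel
print(sum(totals))
```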

6.2 Scope and Limitations of the Study

6.3 Python Interpreters

7 Conclusion


Using Python for Research

Take your introductory knowledge of Python programming to the next level and learn how to use Python 3 for your research.

[Course image: random walks generated using Python 3]

Associated school: Harvard T.H. Chan School of Public Health

What you'll learn

Python 3 programming basics (a review)

Python tools (e.g., NumPy and SciPy modules) for research applications

How to apply Python research tools in practical settings

Course description

This course bridges the gap between introductory and advanced courses in Python. While there are many excellent introductory Python courses available, most typically do not go deep enough for you to apply your Python skills to research projects. In this course, after first reviewing the basics of Python 3, we learn about tools commonly used in research settings.

Using a combination of a guided introduction and more independent in-depth exploration, you will get to practice your new Python skills with various case studies chosen for their scientific breadth and their coverage of different Python features. This run of the course includes revised assessments and a new module on machine learning.

Course Outline

Python Basics

Review of basic Python 3 language concepts and syntax.

Python Research Tools

Introduction to Python modules commonly used in scientific computation, such as NumPy.

Case Studies

This collection of six case studies from different disciplines provides opportunities to practice Python research skills.

Statistical Learning

Exploration of statistical learning using the scikit-learn library followed by a two-part case study that allows you to further practice your coding skills.

Instructors

Jukka-Pekka Onnela



  • Review Article
  • Open access
  • Published: 16 September 2020

Array programming with NumPy

  • Charles R. Harris 1 ,
  • K. Jarrod Millman   ORCID: orcid.org/0000-0002-5263-5070 2 , 3 , 4 ,
  • Stéfan J. van der Walt   ORCID: orcid.org/0000-0001-9276-1891 2 , 4 , 5 ,
  • Ralf Gommers   ORCID: orcid.org/0000-0002-0300-3333 6 ,
  • Pauli Virtanen 7 , 8 ,
  • David Cournapeau 9 ,
  • Eric Wieser 10 ,
  • Julian Taylor 11 ,
  • Sebastian Berg 4 ,
  • Nathaniel J. Smith 12 ,
  • Robert Kern 13 ,
  • Matti Picus   ORCID: orcid.org/0000-0002-1771-9949 4 ,
  • Stephan Hoyer   ORCID: orcid.org/0000-0002-5207-0380 14 ,
  • Marten H. van Kerkwijk 15 ,
  • Matthew Brett 2 , 16 ,
  • Allan Haldane 17 ,
  • Jaime Fernández del Río 18 ,
  • Mark Wiebe   ORCID: orcid.org/0000-0003-3603-8038 19 , 20 ,
  • Pearu Peterson   ORCID: orcid.org/0000-0001-7328-4305 6 , 21 , 22 ,
  • Pierre Gérard-Marchant 23 , 24 ,
  • Kevin Sheppard   ORCID: orcid.org/0000-0001-8700-2292 25 ,
  • Tyler Reddy 26 ,
  • Warren Weckesser 4 ,
  • Hameer Abbasi 6 ,
  • Christoph Gohlke   ORCID: orcid.org/0000-0001-8108-7707 27 &
  • Travis E. Oliphant 6  

Nature 585, 357–362 (2020)


  • Computational neuroscience
  • Computational science
  • Computer science
  • Solar physics

Array programming provides a powerful, compact and expressive syntax for accessing, manipulating and operating on data in vectors, matrices and higher-dimensional arrays. NumPy is the primary array programming library for the Python language. It has an essential role in research analysis pipelines in fields as diverse as physics, chemistry, astronomy, geoscience, biology, psychology, materials science, engineering, finance and economics. For example, in astronomy, NumPy was an important part of the software stack used in the discovery of gravitational waves 1 and in the first imaging of a black hole 2 . Here we review how a few fundamental array concepts lead to a simple and powerful programming paradigm for organizing, exploring and analysing scientific data. NumPy is the foundation upon which the scientific Python ecosystem is constructed. It is so pervasive that several projects, targeting audiences with specialized needs, have developed their own NumPy-like interfaces and array objects. Owing to its central position in the ecosystem, NumPy increasingly acts as an interoperability layer between such array computation libraries and, together with its application programming interface (API), provides a flexible framework to support the next decade of scientific and industrial analysis.


Two Python array packages existed before NumPy. The Numeric package was developed in the mid-1990s and provided array objects and array-aware functions in Python. It was written in C and linked to standard fast implementations of linear algebra 3 , 4 . One of its earliest uses was to steer C++ applications for inertial confinement fusion research at Lawrence Livermore National Laboratory 5 . To handle large astronomical images coming from the Hubble Space Telescope, a reimplementation of Numeric, called Numarray, added support for structured arrays, flexible indexing, memory mapping, byte-order variants, more efficient memory use, flexible IEEE 754-standard error-handling capabilities, and better type-casting rules 6 . Although Numarray was highly compatible with Numeric, the two packages had enough differences that it divided the community; however, in 2005 NumPy emerged as a ‘best of both worlds’ unification 7 —combining the features of Numarray with the small-array performance of Numeric and its rich C API.

Now, 15 years later, NumPy underpins almost every Python library that does scientific or numerical computation 8 , 9 , 10 , 11 , including SciPy 12 , Matplotlib 13 , pandas 14 , scikit-learn 15 and scikit-image 16 . NumPy is a community-developed, open-source library, which provides a multidimensional Python array object along with array-aware functions that operate on it. Because of its inherent simplicity, the NumPy array is the de facto exchange format for array data in Python.

NumPy operates on in-memory arrays using the central processing unit (CPU). To utilize modern, specialized storage and hardware, there has been a recent proliferation of Python array packages. Unlike with the Numarray–Numeric divide, it is now much harder for these new libraries to fracture the user community—given how much work is already built on top of NumPy. However, to provide the community with access to new and exploratory technologies, NumPy is transitioning into a central coordinating mechanism that specifies a well defined array programming API and dispatches it, as appropriate, to specialized array implementations.

NumPy arrays

The NumPy array is a data structure that efficiently stores and accesses multidimensional arrays 17 (also known as tensors), and enables a wide variety of scientific computation. It consists of a pointer to memory, along with metadata used to interpret the data stored there, notably ‘data type’, ‘shape’ and ‘strides’ (Fig. 1a ).

Figure 1 | a, The NumPy array data structure and its associated metadata fields. b, Indexing an array with slices and steps. These operations return a ‘view’ of the original data. c, Indexing an array with masks, scalar coordinates or other arrays, so that it returns a ‘copy’ of the original data. In the bottom example, an array is indexed with other arrays; this broadcasts the indexing arguments before performing the lookup. d, Vectorization efficiently applies operations to groups of elements. e, Broadcasting in the multiplication of two-dimensional arrays. f, Reduction operations act along one or more axes. In this example, an array is summed along select axes to produce a vector, or along two axes consecutively to produce a scalar. g, Example NumPy code, illustrating some of these concepts.

The data type describes the nature of elements stored in an array. An array has a single data type, and each element of an array occupies the same number of bytes in memory. Examples of data types include real and complex numbers (of lower and higher precision), strings, timestamps and pointers to Python objects.

The shape of an array determines the number of elements along each axis, and the number of axes is the dimensionality of the array. For example, a vector of numbers can be stored as a one-dimensional array of shape N, whereas colour videos are four-dimensional arrays of shape (T, M, N, 3).

Strides are necessary to interpret computer memory, which stores elements linearly, as multidimensional arrays. They describe the number of bytes to move forward in memory to jump from row to row, column to column, and so forth. Consider, for example, a two-dimensional array of floating-point numbers with shape (4, 3), where each element occupies 8 bytes in memory. To move between consecutive columns, we need to jump forward 8 bytes in memory, and to access the next row, 3 × 8 = 24 bytes. The strides of that array are therefore (24, 8). NumPy can store arrays in either C or Fortran memory order, iterating first over either rows or columns. This allows external libraries written in those languages to access NumPy array data in memory directly.
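These quantities can be inspected directly on any array; a short illustrative sketch reproducing the (4, 3) example above:

```python
import numpy as np

# The (4, 3) float64 array from the text: 8-byte elements in C (row-major) order.
a = np.zeros((4, 3), dtype=np.float64)
print(a.shape)    # (4, 3)
print(a.strides)  # (24, 8): 3 * 8 bytes to the next row, 8 bytes to the next column

# Fortran (column-major) order lays out the same data column by column instead.
b = np.zeros((4, 3), dtype=np.float64, order='F')
print(b.strides)  # (8, 32)
```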

Users interact with NumPy arrays using ‘indexing’ (to access subarrays or individual elements), ‘operators’ (for example, +, − and × for vectorized operations and @ for matrix multiplication), as well as ‘array-aware functions’; together, these provide an easily readable, expressive, high-level API for array programming while NumPy deals with the underlying mechanics of making operations fast.

Indexing an array returns single elements, subarrays or elements that satisfy a specific condition (Fig. 1b ). Arrays can even be indexed using other arrays (Fig. 1c ). Wherever possible, indexing that retrieves a subarray returns a ‘view’ on the original array such that data are shared between the two arrays. This provides a powerful way to operate on subsets of array data while limiting memory usage.
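A short sketch of the view-versus-copy distinction described above:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

view = x[1:, :2]   # slicing returns a view; data are shared with x
view[0, 0] = 99
print(x[1, 0])     # 99: the assignment is visible through the original array

copy = x[x > 5]    # indexing with a boolean mask returns a copy
copy[0] = -1
print(x.max())     # 99: the original array is unchanged
```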

To complement the array syntax, NumPy includes functions that perform vectorized calculations on arrays, including arithmetic, statistics and trigonometry (Fig. 1d ). Vectorization—operating on entire arrays rather than their individual elements—is essential to array programming. This means that operations that would take many tens of lines to express in languages such as C can often be implemented as a single, clear Python expression. This results in concise code and frees users to focus on the details of their analysis, while NumPy handles looping over array elements near-optimally—for example, taking strides into consideration to best utilize the computer’s fast cache memory.

When performing a vectorized operation (such as addition) on two arrays with the same shape, it is clear what should happen. Through ‘broadcasting’ NumPy allows the dimensions to differ, and produces results that appeal to intuition. A trivial example is the addition of a scalar value to an array, but broadcasting also generalizes to more complex examples such as scaling each column of an array or generating a grid of coordinates. In broadcasting, one or both arrays are virtually duplicated (that is, without copying any data in memory), so that the shapes of the operands match (Fig. 1d ). Broadcasting is also applied when an array is indexed using arrays of indices (Fig. 1c ).
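For instance, centring each column of an array combines vectorization and broadcasting in a single expression:

```python
import numpy as np

data = np.random.rand(1000, 3)

# Vectorized arithmetic: no explicit Python loop over the 3,000 elements.
scaled = 10.0 * data + 1.0

# Broadcasting: the shape-(3,) row of column means is virtually duplicated
# across all 1,000 rows, without copying any data in memory.
centred = data - data.mean(axis=0)
print(centred.shape)  # (1000, 3)
```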

Other array-aware functions, such as sum, mean and maximum, perform element-by-element ‘reductions’, aggregating results across one, multiple or all axes of a single array. For example, summing an n-dimensional array over d axes results in an array of dimension n − d (Fig. 1f).
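A sketch of this dimension bookkeeping: summing a three-dimensional array over two axes leaves one dimension, and summing over all axes leaves a scalar.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)  # n = 3 dimensions

v = x.sum(axis=(0, 2))  # d = 2 axes summed away: result has shape (3,)
s = x.sum()             # summing over all axes reduces to a scalar
print(v.shape, s)       # (3,) 276
```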

NumPy also includes array-aware functions for creating, reshaping, concatenating and padding arrays; searching, sorting and counting data; and reading and writing files. It provides extensive support for generating pseudorandom numbers, includes an assortment of probability distributions, and performs accelerated linear algebra, using one of several backends such as OpenBLAS 18 , 19 or Intel MKL optimized for the CPUs at hand (see Supplementary Methods for more details).
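For example, pseudorandom number generation and the backend-accelerated linear algebra routines are both exposed through array-aware functions:

```python
import numpy as np

rng = np.random.default_rng(42)  # pseudorandom generator with a fixed seed
m = rng.normal(size=(3, 3))      # samples from a normal distribution

# Linear algebra is delegated to an optimized backend such as OpenBLAS or Intel MKL.
b = np.ones(3)
x = np.linalg.solve(m, b)        # solve m @ x = b
print(np.allclose(m @ x, b))     # True
```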

Altogether, the combination of a simple in-memory array representation, a syntax that closely mimics mathematics, and a variety of array-aware utility functions forms a productive and powerfully expressive array programming language.

Scientific Python ecosystem

Python is an open-source, general-purpose interpreted programming language well suited to standard programming tasks such as cleaning data, interacting with web resources and parsing text. Adding fast array operations and linear algebra enables scientists to do all their work within a single programming language—one that has the advantage of being famously easy to learn and teach, as witnessed by its adoption as a primary learning language in many universities.

Even though NumPy is not part of Python’s standard library, it benefits from a good relationship with the Python developers. Over the years, the Python language has added new features and special syntax so that NumPy would have a more succinct and easier-to-read array notation. However, because it is not part of the standard library, NumPy is able to dictate its own release policies and development patterns.

SciPy and Matplotlib are tightly coupled with NumPy in terms of history, development and use. SciPy provides fundamental algorithms for scientific computing, including mathematical, scientific and engineering routines. Matplotlib generates publication-ready figures and visualizations. The combination of NumPy, SciPy and Matplotlib, together with an advanced interactive environment such as IPython 20 or Jupyter 21 , provides a solid foundation for array programming in Python. The scientific Python ecosystem (Fig. 2 ) builds on top of this foundation to provide several, widely used technique-specific libraries 15 , 16 , 22 , that in turn underlie numerous domain-specific projects 23 , 24 , 25 , 26 , 27 , 28 . NumPy, at the base of the ecosystem of array-aware libraries, sets documentation standards, provides array testing infrastructure and adds build support for Fortran and other compilers.

Figure 2 | Essential libraries and projects that depend on NumPy’s API gain access to new array implementations that support NumPy’s array protocols (Fig. 3).

Many research groups have designed large, complex scientific libraries that add application-specific functionality to the ecosystem. For example, the eht-imaging library 29 , developed by the Event Horizon Telescope collaboration for radio interferometry imaging, analysis and simulation, relies on many lower-level components of the scientific Python ecosystem. In particular, the EHT collaboration used this library for the first imaging of a black hole. Within eht-imaging, NumPy arrays are used to store and manipulate numerical data at every step in the processing chain: from raw data through calibration and image reconstruction. SciPy supplies tools for general image-processing tasks such as filtering and image alignment, and scikit-image, an image-processing library that extends SciPy, provides higher-level functionality such as edge filters and Hough transforms. The ‘scipy.optimize’ module performs mathematical optimization. NetworkX 22 , a package for complex network analysis, is used to verify image comparison consistency. Astropy 23 , 24 handles standard astronomical file formats and computes time–coordinate transformations. Matplotlib is used to visualize data and to generate the final image of the black hole.

The interactive environment created by the array programming foundation and the surrounding ecosystem of tools—inside of IPython or Jupyter—is ideally suited to exploratory data analysis. Users can fluidly inspect, manipulate and visualize their data, and rapidly iterate to refine programming statements. These statements are then stitched together into imperative or functional programs, or notebooks containing both computation and narrative. Scientific computing beyond exploratory work is often done in a text editor or an integrated development environment (IDE) such as Spyder. This rich and productive environment has made Python popular for scientific research.

To complement this facility for exploratory work and rapid prototyping, NumPy has developed a culture of using time-tested software engineering practices to improve collaboration and reduce error 30 . This culture is not only adopted by leaders in the project but also enthusiastically taught to newcomers. The NumPy team was early to adopt distributed revision control and code review to improve collaboration on code, and continuous testing that runs an extensive battery of automated tests for every proposed change to NumPy. The project also has comprehensive, high-quality documentation, integrated with the source code 31 , 32 , 33 .

This culture of using best practices for producing reliable scientific software has been adopted by the ecosystem of libraries that build on NumPy. For example, in a recent award given by the Royal Astronomical Society to Astropy, they state: “The Astropy Project has provided hundreds of junior scientists with experience in professional-standard software development practices including use of version control, unit testing, code review and issue tracking procedures. This is a vital skill set for modern researchers that is often missing from formal university education in physics or astronomy” 34 . Community members explicitly work to address this lack of formal education through courses and workshops 35 , 36 , 37 .

The recent rapid growth of data science, machine learning and artificial intelligence has further and dramatically boosted the scientific use of Python. Examples of its important applications, such as the eht-imaging library, now exist in almost every discipline in the natural and social sciences. These tools have become the primary software environment in many fields. NumPy and its ecosystem are commonly taught in university courses, boot camps and summer schools, and are the focus of community conferences and workshops worldwide. NumPy and its API have become truly ubiquitous.

Array proliferation and interoperability

NumPy provides in-memory, multidimensional, homogeneously typed (that is, single-pointer and strided) arrays on CPUs. It runs on machines ranging from embedded devices to the world’s largest supercomputers, with performance approaching that of compiled languages. For most of its existence, NumPy addressed the vast majority of array computation use cases.

However, scientific datasets now routinely exceed the memory capacity of a single machine and may be stored on multiple machines or in the cloud. In addition, the recent need to accelerate deep-learning and artificial intelligence applications has led to the emergence of specialized accelerator hardware, including graphics processing units (GPUs), tensor processing units (TPUs) and field-programmable gate arrays (FPGAs). Owing to its in-memory data model, NumPy is currently unable to directly utilize such storage and specialized hardware. However, both distributed data and also the parallel execution of GPUs, TPUs and FPGAs map well to the paradigm of array programming: therefore leading to a gap between available modern hardware architectures and the tools necessary to leverage their computational power.

The community’s efforts to fill this gap led to a proliferation of new array implementations. For example, each deep-learning framework created its own arrays; the PyTorch 38 , Tensorflow 39 , Apache MXNet 40 and JAX arrays all have the capability to run on CPUs and GPUs in a distributed fashion, using lazy evaluation to allow for additional performance optimizations. SciPy and PyData/Sparse both provide sparse arrays, which typically contain few non-zero values and store only those in memory for efficiency. In addition, there are projects that build on NumPy arrays as data containers, and extend its capabilities. Distributed arrays are made possible that way by Dask, and labelled arrays—referring to dimensions of an array by name rather than by index for clarity, compare x[:, 1] versus x.loc[:, 'time']—by xarray 41 .
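A minimal sketch of the labelled-array style, using xarray’s DataArray; the 'time' label mirrors the comparison in the text and is a hypothetical coordinate name:

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.random.rand(4, 2),
    dims=("sample", "field"),
    coords={"field": ["time", "value"]},
)

first_col = data[:, 0]          # positional indexing, as with plain NumPy
by_label = data.loc[:, "time"]  # the same column, selected by name
```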

Such libraries often mimic the NumPy API, because this lowers the barrier to entry for newcomers and provides the wider community with a stable array programming interface. This, in turn, prevents disruptive schisms such as the divergence between Numeric and Numarray. But exploring new ways of working with arrays is experimental by nature and, in fact, several promising libraries (such as Theano and Caffe) have already ceased development. And each time that a user decides to try a new technology, they must change import statements and ensure that the new library implements all the parts of the NumPy API they currently use.

Ideally, operating on specialized arrays using NumPy functions or semantics would simply work, so that users could write code once, and would then benefit from switching between NumPy arrays, GPU arrays, distributed arrays and so forth as appropriate. To support array operations between external array objects, NumPy therefore added the capability to act as a central coordination mechanism with a well specified API (Fig. 2 ).

To facilitate this interoperability, NumPy provides ‘protocols’ (or contracts of operation), that allow for specialized arrays to be passed to NumPy functions (Fig. 3 ). NumPy, in turn, dispatches operations to the originating library, as required. Over four hundred of the most popular NumPy functions are supported. The protocols are implemented by widely used libraries such as Dask, CuPy, xarray and PyData/Sparse. Thanks to these developments, users can now, for example, scale their computation from a single machine to distributed systems using Dask. The protocols also compose well, allowing users to redeploy NumPy code at scale on distributed, multi-GPU systems via, for instance, CuPy arrays embedded in Dask arrays. Using NumPy’s high-level API, users can leverage highly parallel code execution on multiple systems with millions of cores, all with minimal code changes 42 .

Figure 3 | In this example, NumPy’s ‘mean’ function is called on a Dask array. The call succeeds by dispatching to the appropriate library implementation (in this case, Dask) and results in a new Dask array. Compare this code to the example code in Fig. 1g.
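A minimal sketch of this dispatch mechanism, under the assumption that both NumPy and Dask are installed:

```python
import numpy as np
import dask.array as da

x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))

m = np.mean(x)      # dispatched to Dask's implementation via NumPy's protocols
print(type(m))      # a Dask array, not a NumPy array
print(m.compute())  # only now is the (possibly distributed) computation run
```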

These array protocols are now a key feature of NumPy, and are expected to only increase in importance. The NumPy developers—many of whom are authors of this Review—iteratively refine and add protocol designs to improve utility and simplify adoption.

NumPy combines the expressive power of array programming, the performance of C, and the readability, usability and versatility of Python in a mature, well tested, well documented and community-developed library. Libraries in the scientific Python ecosystem provide fast implementations of most important algorithms. Where extreme optimization is warranted, compiled languages can be used, such as Cython 43 , Numba 44 and Pythran 45 ; these languages extend Python and transparently accelerate bottlenecks. Owing to NumPy’s simple memory model, it is easy to write low-level, hand-optimized code, usually in C or Fortran, to manipulate NumPy arrays and pass them back to Python. Furthermore, using array protocols, it is possible to utilize the full spectrum of specialized hardware acceleration with minimal changes to existing code.
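As a hedged sketch of this transparent acceleration of a bottleneck, the Numba JIT mentioned above can compile a plain Python loop over a NumPy array to machine code:

```python
import numpy as np
from numba import njit

@njit
def total(a):
    # Explicit loops like these are compiled to machine code by Numba's JIT.
    acc = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            acc += a[i, j]
    return acc

x = np.random.rand(500, 500)
print(total(x))  # the first call compiles; subsequent calls run at native speed
```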

NumPy was initially developed by students, faculty and researchers to provide an advanced, open-source array programming library for Python, which was free to use and unencumbered by license servers and software protection dongles. There was a sense of building something consequential together for the benefit of many others. Participating in such an endeavour, within a welcoming community of like-minded individuals, held a powerful attraction for many early contributors.

These user–developers frequently had to write code from scratch to solve their own or their colleagues’ problems—often in low-level languages that preceded Python, such as Fortran 46 and C. To them, the advantages of an interactive, high-level array library were evident. The design of this new tool was informed by other powerful interactive programming languages for scientific computing such as Basis 47 , 48 , 49 , 50 , Yorick 51 , R 52 and APL 53 , as well as commercial languages and environments such as IDL (Interactive Data Language) and MATLAB.

What began as an attempt to add an array object to Python became the foundation of a vibrant ecosystem of tools. Now, a large amount of scientific work depends on NumPy being correct, fast and stable. It is no longer a small community project, but core scientific infrastructure.

The developer culture has matured: although initial development was highly informal, NumPy now has a roadmap and a process for proposing and discussing large changes. The project has formal governance structures and is fiscally sponsored by NumFOCUS, a nonprofit that promotes open practices in research, data and scientific computing. Over the past few years, the project attracted its first funded development, sponsored by the Moore and Sloan Foundations, and received an award as part of the Chan Zuckerberg Initiative’s Essentials of Open Source Software programme. With this funding, the project was (and is) able to have sustained focus over multiple months to implement substantial new features and improvements. That said, the development of NumPy still depends heavily on contributions made by graduate students and researchers in their free time (see Supplementary Methods for more details).

NumPy is no longer merely the foundational array library underlying the scientific Python ecosystem, but it has become the standard API for tensor computation and a central coordinating mechanism between array types and technologies in Python. Work continues to expand on and improve these interoperability features.

Over the next decade, NumPy developers will face several challenges. New devices will be developed, and existing specialized hardware will evolve to meet diminishing returns on Moore’s law. There will be more, and a wider variety of, data science practitioners, a large proportion of whom will use NumPy. The scale of scientific data gathering will continue to increase, with the adoption of devices and instruments such as light-sheet microscopes and the Large Synoptic Survey Telescope (LSST) 54 . New generation languages, interpreters and compilers, such as Rust 55 , Julia 56 and LLVM 57 , will create new concepts and data structures, and determine their viability.

Through the mechanisms described in this Review, NumPy is poised to embrace such a changing landscape, and to continue playing a leading part in interactive scientific computation, although to do so will require sustained funding from government, academia and industry. But, importantly, for NumPy to meet the needs of the next decade of data science, it will also need a new generation of graduate students and community contributors to drive it forward.

References

1. Abbott, B. P. et al. Observation of gravitational waves from a binary black hole merger. Phys. Rev. Lett. 116, 061102 (2016).

2. Chael, A. et al. High-resolution linear polarimetric imaging for the Event Horizon Telescope. Astrophys. J. 286, 11 (2016).

3. Dubois, P. F., Hinsen, K. & Hugunin, J. Numerical Python. Comput. Phys. 10, 262–267 (1996).

4. Ascher, D., Dubois, P. F., Hinsen, K., Hugunin, J. & Oliphant, T. E. An Open Source Project: Numerical Python (Lawrence Livermore National Laboratory, 2001).

5. Yang, T.-Y., Furnish, G. & Dubois, P. F. Steering object-oriented scientific computations. In Proc. TOOLS USA 97. Intl Conf. Technology of Object Oriented Systems and Languages (eds Ege, R., Singh, M. & Meyer, B.) 112–119 (IEEE, 1997).

6. Greenfield, P., Miller, J. T., Hsu, J. & White, R. L. numarray: a new scientific array package for Python. In PyCon DC 2003 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.9899 (2003).

7. Oliphant, T. E. Guide to NumPy 1st edn (Trelgol Publishing, 2006).

8. Dubois, P. F. Python: batteries included. Comput. Sci. Eng. 9, 7–9 (2007).

9. Oliphant, T. E. Python for scientific computing. Comput. Sci. Eng. 9, 10–20 (2007).

10. Millman, K. J. & Aivazis, M. Python for scientists and engineers. Comput. Sci. Eng. 13, 9–12 (2011).

11. Pérez, F., Granger, B. E. & Hunter, J. D. Python: an ecosystem for scientific computing. Comput. Sci. Eng. 13, 13–21 (2011). Explains why the scientific Python ecosystem is a highly productive environment for research.

12. Virtanen, P. et al. SciPy 1.0—fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020); correction 17, 352 (2020). Introduces the SciPy library and includes a more detailed history of NumPy and SciPy.

13. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

14. McKinney, W. Data structures for statistical computing in Python. In Proc. 9th Python in Science Conf. (eds van der Walt, S. & Millman, K. J.) 56–61 (2010).

15. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

16. van der Walt, S. et al. scikit-image: image processing in Python. PeerJ 2, e453 (2014).

17. van der Walt, S., Colbert, S. C. & Varoquaux, G. The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13, 22–30 (2011). Discusses the NumPy array data structure with a focus on how it enables efficient computation.

18. Wang, Q., Zhang, X., Zhang, Y. & Yi, Q. AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs. In SC’13: Proc. Intl Conf. High Performance Computing, Networking, Storage and Analysis 25 (IEEE, 2013).

19. Xianyi, Z., Qian, W. & Yunquan, Z. Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In 2012 IEEE 18th Intl Conf. Parallel and Distributed Systems 684–691 (IEEE, 2012).

20. Pérez, F. & Granger, B. E. IPython: a system for interactive scientific computing. Comput. Sci. Eng. 9, 21–29 (2007).

21. Kluyver, T. et al. Jupyter Notebooks—a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas (eds Loizides, F. & Schmidt, B.) 87–90 (IOS Press, 2016).

22. Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using NetworkX. In Proc. 7th Python in Science Conf. (eds Varoquaux, G., Vaught, T. & Millman, K. J.) 11–15 (2008).

23. Astropy Collaboration et al. Astropy: a community Python package for astronomy. Astron. Astrophys. 558, A33 (2013).

24. Price-Whelan, A. M. et al. The Astropy Project: building an open-science project and status of the v2.0 core package. Astron. J. 156, 123 (2018).

25. Cock, P. J. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).

26. Millman, K. J. & Brett, M. Analysis of functional magnetic resonance imaging in Python. Comput. Sci. Eng. 9, 52–55 (2007).

27. The SunPy Community et al. SunPy—Python for solar physics. Comput. Sci. Discov. 8, 014009 (2015).

28. Hamman, J., Rocklin, M. & Abernathy, R. Pangeo: a big-data ecosystem for scalable Earth system science. In EGU General Assembly Conf. Abstracts 12146 (2018).

29. Chael, A. A. et al. ehtim: imaging, analysis, and simulation software for radio interferometry. Astrophysics Source Code Library https://ascl.net/1904.004 (2019).

30. Millman, K. J. & Pérez, F. Developing open source scientific practice. In Implementing Reproducible Research (eds Stodden, V., Leisch, F. & Peng, R. D.) 149–183 (CRC Press, 2014). Describes the software engineering practices embraced by the NumPy and SciPy communities with a focus on how these practices improve research.

31. van der Walt, S. The SciPy Documentation Project (technical overview). In Proc. 7th Python in Science Conf. (SciPy 2008) (eds Varoquaux, G., Vaught, T. & Millman, K. J.) 27–28 (2008).

32. Harrington, J. The SciPy Documentation Project. In Proc. 7th Python in Science Conf. (SciPy 2008) (eds Varoquaux, G., Vaught, T. & Millman, K. J.) 33–35 (2008).

33. Harrington, J. & Goldsmith, D. Progress report: NumPy and SciPy documentation in 2009. In Proc. 8th Python in Science Conf. (SciPy 2009) (eds Varoquaux, G., van der Walt, S. & Millman, K. J.) 84–87 (2009).

34. Royal Astronomical Society. Report of the RAS ‘A’ Awards Committee 2020: Astropy Project: 2020 Group Achievement Award (A). https://ras.ac.uk/sites/default/files/2020-01/Group%20Award%20-%20Astropy.pdf (2020).

35. Wilson, G. Software carpentry: getting scientists to write better code by making them more productive. Comput. Sci. Eng. 8, 66–69 (2006).

36. Hannay, J. E. et al. How do scientists develop and use scientific software? In Proc. 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering 1–8 (IEEE, 2009).

37. Millman, K. J., Brett, M., Barnowski, R. & Poline, J.-B. Teaching computational reproducibility for neuroimaging. Front. Neurosci. 12, 727 (2018).

38. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 8024–8035 (Neural Information Processing Systems, 2019).

39. Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In OSDI’16: Proc. 12th USENIX Conf. Operating Systems Design and Implementation (chairs Keeton, K. & Roscoe, T.) 265–283 (USENIX Association, 2016).

40. Chen, T. et al. MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. Preprint at http://www.arxiv.org/abs/1512.01274 (2015).

41. Hoyer, S. & Hamman, J. xarray: N-D labeled arrays and datasets in Python. J. Open Res. Softw. 5, 10 (2017).

42. Entschev, P. Distributed multi-GPU computing with Dask, CuPy and RAPIDS. In EuroPython 2019 https://ep2019.europython.eu/media/conference/slides/fX8dJsD-distributed-multi-gpu-computing-with-dask-cupy-and-rapids.pdf (2019).

43. Behnel, S. et al. Cython: the best of both worlds. Comput. Sci. Eng. 13, 31–39 (2011).

44. Lam, S. K., Pitrou, A. & Seibert, S. Numba: a LLVM-based Python JIT compiler. In Proc. Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM ’15 7:1–7:6 (ACM, 2015).

45. Guelton, S. et al. Pythran: enabling static optimization of scientific Python programs. Comput. Sci. Discov. 8, 014001 (2015).

46. Dongarra, J., Golub, G. H., Grosse, E., Moler, C. & Moore, K. Netlib and NA-Net: building a scientific computing community. IEEE Ann. Hist. Comput. 30, 30–41 (2008).

47. Barrett, K. A., Chiu, Y. H., Painter, J. F., Motteler, Z. C. & Dubois, P. F. Basis System, Part I: Running a Basis Program—A Tutorial for Beginners. UCRL-MA-118543, Vol. 1 (Lawrence Livermore National Laboratory, 1995).

48. Dubois, P. F. & Motteler, Z. Basis System, Part II: Basis Language Reference Manual. UCRL-MA-118543, Vol. 2 (Lawrence Livermore National Laboratory, 1995).

49. Chiu, Y. H. & Dubois, P. F. Basis System, Part III: EZN User Manual. UCRL-MA-118543, Vol. 3 (Lawrence Livermore National Laboratory, 1995).

50. Chiu, Y. H. & Dubois, P. F. Basis System, Part IV: EZD User Manual. UCRL-MA-118543, Vol. 4 (Lawrence Livermore National Laboratory, 1995).

51. Munro, D. H. & Dubois, P. F. Using the Yorick interpreted language. Comput. Phys. 9, 609–615 (1995).

52. Ihaka, R. & Gentleman, R. R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5, 299–314 (1996).

53. Iverson, K. E. A programming language. In Proc. 1962 Spring Joint Computer Conf. 345–351 (1962).

54. Jenness, T. et al. LSST data management software development practices and tools. In Proc. SPIE 10707, Software and Cyberinfrastructure for Astronomy V 1070709 (SPIE and International Society for Optics and Photonics, 2018).

55. Matsakis, N. D. & Klock, F. S. The Rust language. Ada Letters 34, 103–104 (2014).

56. Bezanson, J., Edelman, A., Karpinski, S. & Shah, V. B. Julia: a fresh approach to numerical computing. SIAM Rev. 59, 65–98 (2017).

57. Lattner, C. & Adve, V. LLVM: a compilation framework for lifelong program analysis and transformation. In Proc. 2004 Intl Symp. Code Generation and Optimization (CGO’04) 75–88 (IEEE, 2004).

Acknowledgements

We thank R. Barnowski, P. Dubois, M. Eickenberg, and P. Greenfield, who suggested text and provided helpful feedback on the manuscript. K.J.M. and S.J.v.d.W. were funded in part by the Gordon and Betty Moore Foundation through grant GBMF3834 and by the Alfred P. Sloan Foundation through grant 2013-10-27 to the University of California, Berkeley. S.J.v.d.W., S.B., M.P. and W.W. were funded in part by the Gordon and Betty Moore Foundation through grant GBMF5447 and by the Alfred P. Sloan Foundation through grant G-2017-9960 to the University of California, Berkeley.

Author information

Authors and Affiliations

Independent researcher, Logan, UT, USA

Charles R. Harris

Brain Imaging Center, University of California, Berkeley, Berkeley, CA, USA

K. Jarrod Millman, Stéfan J. van der Walt & Matthew Brett

Division of Biostatistics, University of California, Berkeley, Berkeley, CA, USA

K. Jarrod Millman

Berkeley Institute for Data Science, University of California, Berkeley, Berkeley, CA, USA

K. Jarrod Millman, Stéfan J. van der Walt, Sebastian Berg, Matti Picus & Warren Weckesser

Applied Mathematics, Stellenbosch University, Stellenbosch, South Africa

Stéfan J. van der Walt

Quansight, Austin, TX, USA

Ralf Gommers, Pearu Peterson, Hameer Abbasi & Travis E. Oliphant

Department of Physics, University of Jyväskylä, Jyväskylä, Finland

Pauli Virtanen

Nanoscience Center, University of Jyväskylä, Jyväskylä, Finland

Mercari JP, Tokyo, Japan

David Cournapeau

Department of Engineering, University of Cambridge, Cambridge, UK

Eric Wieser

Independent researcher, Karlsruhe, Germany

Julian Taylor

Independent researcher, Berkeley, CA, USA

Nathaniel J. Smith

Enthought, Austin, TX, USA

Robert Kern

Google Research, Mountain View, CA, USA

Stephan Hoyer

Department of Astronomy and Astrophysics, University of Toronto, Toronto, Ontario, Canada

Marten H. van Kerkwijk

School of Psychology, University of Birmingham, Edgbaston, Birmingham, UK

Matthew Brett

Department of Physics, Temple University, Philadelphia, PA, USA

Allan Haldane

Google, Zurich, Switzerland

Jaime Fernández del Río

Department of Physics and Astronomy, The University of British Columbia, Vancouver, British Columbia, Canada

Amazon, Seattle, WA, USA

Independent researcher, Saue, Estonia

Pearu Peterson

Department of Mechanics and Applied Mathematics, Institute of Cybernetics at Tallinn Technical University, Tallinn, Estonia

Department of Biological and Agricultural Engineering, University of Georgia, Athens, GA, USA

Pierre Gérard-Marchant

France-IX Services, Paris, France

Department of Economics, University of Oxford, Oxford, UK

Kevin Sheppard

CCS-7, Los Alamos National Laboratory, Los Alamos, NM, USA

Tyler Reddy

Laboratory for Fluorescence Dynamics, Biomedical Engineering Department, University of California, Irvine, Irvine, CA, USA

Christoph Gohlke


Contributions

K.J.M. and S.J.v.d.W. composed the manuscript with input from others. S.B., R.G., K.S., W.W., M.B. and T.R. contributed text. All authors contributed substantial code, documentation and/or expertise to the NumPy project. All authors reviewed the manuscript.

Corresponding authors

Correspondence to K. Jarrod Millman, Stéfan J. van der Walt or Ralf Gommers.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature thanks Edouard Duchesnay, Alan Edelman and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information


This file contains Supplementary Methods, including Supplementary Figure 1 and additional references.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585 , 357–362 (2020). https://doi.org/10.1038/s41586-020-2649-2


Received : 21 February 2020

Accepted : 17 June 2020

Published : 16 September 2020

Issue Date : 17 September 2020

DOI : https://doi.org/10.1038/s41586-020-2649-2


This article is cited by

Validated WGS and WES protocols proved saliva-derived gDNA as an equivalent to blood-derived gDNA for clinical and population genomic analyses

  • Katerina Kvapilova
  • Pavol Misenko
  • Zbynek Kozmik

BMC Genomics (2024)

Color-based discrimination of color hues in rock paintings through Gaussian mixture models: a case study from Chomache site (Chile)

  • Enrique Cerrillo-Cuenca
  • Marcela Sepúlveda
  • Fernando Bastías

Heritage Science (2024)

An efficient PGD solver for structural dynamics applications

  • Clément Vella
  • Pierre Gosselet
  • Serge Prudhomme

Advanced Modeling and Simulation in Engineering Sciences (2024)

Mining digital identity insights: patent analysis using NLP

  • Matthew Comb
  • Andrew Martin

EURASIP Journal on Information Security (2024)

Dashing Growth Curves: a web application for rapid and interactive analysis of microbial growth curves

  • Michael A. Reiter
  • Julia A. Vorholt

BMC Bioinformatics (2024)


Title: Program Synthesis with Large Language Models

Abstract: This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
Comments: Jacob and Augustus contributed equally
Subjects: Programming Languages (cs.PL); Machine Learning (cs.LG)
Cite as: [cs.PL]
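To make the benchmark setup above concrete, here is a minimal sketch of how an MBPP-style few-shot prompt and its functional-correctness check might look. This is an illustration under assumptions, not the paper's actual template: the example tasks, the helper names build_prompt and passes_tests, and the formatting are all invented here, and the paper's prompts and execution harness may differ in detail.

# Minimal sketch of few-shot prompting for MBPP-style program synthesis.
# The prompt template and example tasks are assumptions for illustration.

FEW_SHOT_EXAMPLES = [
    {
        "text": "Write a function to add two numbers.",
        "tests": ["assert add(2, 3) == 5"],
        "code": "def add(a, b):\n    return a + b",
    },
    {
        "text": "Write a function to reverse a string.",
        "tests": ["assert rev('abc') == 'cba'"],
        "code": "def rev(s):\n    return s[::-1]",
    },
]

def build_prompt(task_text, task_tests):
    """Prepend solved examples, then pose the new task for the model to complete."""
    parts = []
    for ex in FEW_SHOT_EXAMPLES:
        parts += [ex["text"], *ex["tests"], ex["code"], ""]
    parts += [task_text, *task_tests]  # the asserts reveal the expected signature
    return "\n".join(parts)

def passes_tests(candidate_code, tests):
    """Functional-correctness check: the task counts as solved only if the
    synthesized code defines something that satisfies every assert."""
    env = {}
    try:
        exec(candidate_code, env)   # define the candidate function
        for t in tests:
            exec(t, env)            # run each assert against it
        return True
    except Exception:
        return False

prompt = build_prompt("Write a function to find the maximum of three numbers.",
                      ["assert max3(1, 5, 3) == 5"])
candidate = "def max3(a, b, c):\n    return max(a, b, c)"  # stand-in for a model completion
print(passes_tests(candidate, ["assert max3(1, 5, 3) == 5"]))  # True

In this framing, a task is "solved" only when every assert passes, which matches the functional-correctness notion of synthesis accuracy reported in the abstract above.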


Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence


1. Introduction

1.1. Scientific Computing and Machine Learning in Python
1.2. Optimizing Python’s Performance for Numerical Computing and Data Processing
2. Classical Machine Learning
2.1. Scikit-learn, the Industry Standard for Classical Machine Learning
2.2. Addressing Class Imbalance
2.3. Ensemble Learning: Gradient Boosting Machines and Model Combination
2.4. Scalable Distributed Machine Learning
3. Automatic Machine Learning (AutoML)
3.1. Data Preparation and Feature Engineering
3.2. Hyperparameter Optimization and Model Evaluation
3.3. Neural Architecture Search

  • Entire structure: Generates the entire network from the ground up by choosing and chaining together a set of primitives, such as convolutions, concatenations, or pooling. This is known as macro search.
  • Cell-based: Searches for combinations of a fixed number of hand-crafted building blocks, called cells. This is known as micro search (the two styles are contrasted in the toy sketch after this list).
  • Hierarchical: Extends the cell-based approach by introducing multiple levels and chaining together a fixed number of cells, iteratively using the primitives defined in lower layers to construct the higher layers. This combines macro and micro search.
  • Morphism-based structure: Transfers knowledge from an existing well-performing network to a new architecture.
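As a deliberately toy illustration of the first two search-space styles, the sketch below runs random search over a handful of invented primitives: the macro variant samples the whole network layer by layer, while the micro variant designs one small cell and stacks it. Everything here, from the primitive names to the stub score function, is an assumption for illustration; real NAS systems search far richer spaces and estimate performance by actually training candidates.

# Toy random-search baseline contrasting macro (whole-network) and
# micro (cell-based) NAS search spaces. Names are invented for illustration.
import random

PRIMITIVES = ["conv3x3", "conv5x5", "maxpool", "concat", "identity"]

def sample_macro(depth=6):
    """Macro search: assemble the entire network primitive by primitive."""
    return [random.choice(PRIMITIVES) for _ in range(depth)]

def sample_cell(cell_size=3):
    """Micro search: design one small cell of fixed size..."""
    return [random.choice(PRIMITIVES) for _ in range(cell_size)]

def stack_cells(cell, repeats=3):
    """...then build the full network by stacking copies of that cell."""
    return cell * repeats

def score(architecture):
    # Stand-in for the expensive step: training the candidate and
    # measuring validation accuracy. Here it is a meaningless random number.
    return random.random()

best_macro = max((sample_macro() for _ in range(20)), key=score)
print("best macro candidate:", best_macro)

best_cell = max((sample_cell() for _ in range(20)), key=score)
print("best micro candidate:", stack_cells(best_cell))

Search strategies such as reinforcement learning, evolutionary algorithms, or gradient-based relaxation replace the random sampling step here; the macro/micro distinction concerns only what is being sampled, not how.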

4. GPU-Accelerated Data Science and Machine Learning

4.1. General Purpose GPU Computing for Machine Learning
4.2. End-to-End Data Science: RAPIDS
4.3. NDArray and Vectorized Operations
4.4. Interoperability
4.5. Classical Machine Learning on GPUs
4.6. Distributed Data Science and Machine Learning on GPUs
5. Deep Learning
5.1. Static Data Flow Graphs
5.2. Dynamic Graph Libraries with Eager Execution
5.3. JIT and Computational Efficiency
5.4. Deep Learning APIs
5.5. New Algorithms for Accelerating Large-Scale Deep Learning
6. Explainability, Interpretability, and Fairness of Machine Learning Models
6.1. Feature Importance
6.2. Constraining Nonlinear Models
6.3. Logic and Reasoning
6.4. Explaining with Interactive Visualizations
6.5. Privacy
6.6. Fairness
7. Adversarial Learning
8. Conclusions
Author Contributions
Acknowledgments
Conflicts of Interest
Abbreviations

AI: Artificial intelligence
API: Application programming interface
Autodiff: Automatic differentiation
AutoML: Automatic machine learning
BERT: Bidirectional Encoder Representations from Transformers model
BO: Bayesian optimization
CDEP: Contextual Decomposition Explanation Penalization
Classical ML: Classical machine learning
CNN: Convolutional neural network
CPU: Central processing unit
DAG: Directed acyclic graph
DL: Deep learning
DNN: Deep neural network
ETL: Extract, transform, load
GAN: Generative adversarial networks
GBM: Gradient boosting machines
GPU: Graphics processing unit
HPO: Hyperparameter optimization
IPC: Inter-process communication
JIT: Just-in-time
LSTM: Long short-term memory
MPI: Message-passing interface
NAS: Neural architecture search
NCCL: NVIDIA Collective Communications Library
OPG: One-process-per-GPU
PNAS: Progressive neural architecture search
RL: Reinforcement learning
RNN: Recurrent neural network
SIMT: Single instruction multiple thread
SIMD: Single instruction multiple data
SGD: Stochastic gradient descent
  • Cao, X.; Gong, N.Z. Mitigating evasion attacks to deep neural networks via region-based classification. In Proceedings of the 33rd Annual Computer Security Applications Conference, Orlando, FL, USA, 4–8 December 2017; pp. 278–287. [ Google Scholar ]
  • Das, N.; Shanbhogue, M.; Chen, S.T.; Hohman, F.; Chen, L.; Kounavis, M.E.; Chau, D.H. Keeping the bad guys out: Protecting and vaccinating deep learning with JPEG compression. arXiv 2017 , arXiv:1705.02900. [ Google Scholar ]
  • Raschka, S.; Kaufman, B. Machine learning and AI-based approaches for bioactive ligand discovery and GPCR-ligand recognition. arXiv 2020 , arXiv:2001.06545. [ Google Scholar ]
  • Battaglia, P.W.; Hamrick, J.B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018 , arXiv:1806.01261. [ Google Scholar ]
  • Fey, M.; Lenssen, J.E. Fast graph representation learning with PyTorch Geometric. arXiv 2019 , arXiv:1903.02428. [ Google Scholar ]
  • Law, S. STUMPY: A powerful and scalable Python library for time series data mining. J. Open Source Softw. 2019 , 4 , 1504. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Carpenter, B.; Gelman, A.; Hoffman, M.; Lee, D.; Goodrich, B.; Betancourt, M.; Brubaker, M.A.; Li, P.; Riddell, A. Stan: A probabilistic programming language. J. Stat. Softw. 2016 . [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Salvatier, J.; Wiecki, T.V.; Fonnesbeck, C. Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci. 2016 , 2016 . [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Tran, D.; Kucukelbir, A.; Dieng, A.B.; Rudolph, M.; Liang, D.; Blei, D.M. Edward: A library for probabilistic modeling, inference, and criticism. arXiv 2016 , arXiv:1610.09787. [ Google Scholar ]
  • Schreiber, J. Pomegranate: Fast and flexible probabilistic modeling in python. J. Mach. Learn. Res. 2017 , 18 , 5992–5997. [ Google Scholar ]
  • Bingham, E.; Chen, J.P.; Jankowiak, M.; Obermeyer, F.; Pradhan, N.; Karaletsos, T.; Singh, R.; Szerlip, P.; Horsfall, P.; Goodman, N.D. Pyro: Deep universal probabilistic programming. J. Mach. Learn. Res. 2019 , 20 , 973–978. [ Google Scholar ]
  • Phan, D.; Pradhan, N.; Jankowiak, M. Composable effects for flexible and accelerated probabilistic programming in NumPyro. arXiv 2019 , arXiv:1912.11554. [ Google Scholar ]
  • Broughton, M.; Verdon, G.; McCourt, T.; Martinez, A.J.; Yoo, J.H.; Isakov, S.V.; Massey, P.; Niu, M.Y.; Halavati, R.; Peters, E.; et al. TensorFlow Quantum: A Software Framework for Quantum Machine Learning. arXiv 2020 , arXiv:2003.02989. [ Google Scholar ]
  • Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. Mastering chess and Shogi by self-play with a general reinforcement learning algorithm. arXiv 2017 , arXiv:1712.01815. [ Google Scholar ]
  • Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019 , 575 , 350–354. [ Google Scholar ] [ CrossRef ]
  • Quach, K. DeepMind Quits Playing Games with AI, Ups the Protein Stakes with Machine-Learning Code. 2018. Available online: https://www.theregister.co.uk/2018/12/06/deepmind_alphafold_games/ (accessed on 1 February 2020).


| Framework / attack / defense | Cleverhans v3.0.1 | FoolBox v2.3.0 | ART v1.1.0 | DEEPSEC (2019) | AdvBox v0.4.1 |
| --- | --- | --- | --- | --- | --- |
| TensorFlow | yes | yes | yes | no | yes |
| MXNet | yes | yes | yes | no | yes |
| PyTorch | no | yes | yes | yes | yes |
| PaddlePaddle | no | no | no | no | yes |
| Box-constrained L-BFGS [ ] | yes | no | no | yes | no |
| Adv. manipulation of deep repr. [ ] | yes | no | no | no | no |
| ZOO [ ] | no | no | yes | no | no |
| Virtual adversarial method [ ] | yes | yes | yes | no | no |
| Adversarial patch [ ] | no | no | yes | no | no |
| Spatial transformation attack [ ] | no | yes | yes | no | no |
| Decision tree attack [ ] | no | no | yes | no | no |
| FGSM [ ] | yes | yes | yes | yes | yes |
| R+FGSM [ ] | no | no | no | yes | no |
| R+LLC [ ] | no | no | no | yes | no |
| U-MI-FGSM [ ] | yes | yes | no | yes | no |
| T-MI-FGSM [ ] | yes | yes | no | yes | no |
| Basic iterative method [ ] | no | yes | yes | yes | yes |
| LLC/ILLC [ ] | no | yes | no | yes | no |
| Universal adversarial perturbation [ ] | no | no | yes | yes | no |
| DeepFool [ ] | yes | yes | yes | yes | yes |
| NewtonFool [ ] | no | yes | yes | no | no |
| Jacobian saliency map [ ] | yes | yes | yes | yes | yes |
| CW/CW2 [ ] | yes | yes | yes | yes | yes |
| Projected gradient descent [ ] | yes | no | yes | yes | yes |
| OptMargin [ ] | no | no | no | yes | no |
| Elastic net attack [ ] | yes | yes | yes | yes | no |
| Boundary attack [ ] | no | yes | yes | no | no |
| HopSkipJumpAttack [ ] | yes | yes | yes | no | no |
| MaxConf [ ] | yes | no | no | no | no |
| Inversion attack [ ] | yes | yes | no | no | no |
| SparseL1 [ ] | yes | yes | no | no | no |
| SPSA [ ] | yes | no | no | no | no |
| HCLU [ ] | no | no | yes | no | no |
| ADef [ ] | no | yes | no | no | no |
| DDNL2 [ ] | no | yes | no | no | no |
| Local search [ ] | no | yes | no | no | no |
| Pointwise attack [ ] | no | yes | no | no | no |
| GenAttack [ ] | no | yes | no | no | no |
| Feature squeezing [ ] | no | no | yes | no | yes |
| Spatial smoothing [ ] | no | no | yes | no | yes |
| Label smoothing [ ] | no | no | yes | no | yes |
| Gaussian augmentation [ ] | no | no | yes | no | yes |
| Adversarial training [ ] | no | no | yes | yes | yes |
| Thermometer encoding [ ] | no | no | yes | yes | yes |
| NAT [ ] | no | no | no | yes | no |
| Ensemble adversarial training [ ] | no | no | no | yes | no |
| Distillation as a defense [ ] | no | no | no | yes | no |
| Input gradient regularization [ ] | no | no | no | yes | no |
| Image transformations [ ] | no | no | yes | yes | no |
| Randomization [ ] | no | no | no | yes | no |
| PixelDefend [ ] | no | no | yes | yes | no |
| Region-based classification [ ] | no | no | no | yes | no |
| JPEG compression [ ] | no | no | yes | no | no |
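To illustrate how one of these toolboxes is driven in practice, the sketch below mounts an FGSM attack with Foolbox against a PyTorch classifier. This is a minimal sketch, not taken from the comparison above; the wrapper and attack call signatures are assumptions based on the Foolbox 2.x-era API and should be checked against the installed release, and the image and label are placeholders.

import foolbox
import numpy as np
import torchvision.models as models

# Wrap a pretrained PyTorch classifier for Foolbox (assumed 2.x-style API).
model = models.resnet18(pretrained=True).eval()
fmodel = foolbox.models.PyTorchModel(model, bounds=(0, 1), num_classes=1000)

# Fast gradient sign method; a random image and a hypothetical label stand in for real data.
attack = foolbox.attacks.FGSM(fmodel)
image = np.random.rand(3, 224, 224).astype(np.float32)  # placeholder input
label = 283                                             # hypothetical ground-truth class index
adversarial = attack(image, label)  # returns a perturbed image, or None if the attack fails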

Share and Cite

Raschka, S.; Patterson, J.; Nolet, C. Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence. Information 2020, 11, 193. https://doi.org/10.3390/info11040193




Open Access

An Introduction to Programming for Bioscientists: A Python-Based Primer

Contributed equally to this work with: Berk Ekmekci, Charles E. McAnany

Affiliation Department of Chemistry, University of Virginia, Charlottesville, Virginia, United States of America

* E-mail: [email protected]

  • Berk Ekmekci, 
  • Charles E. McAnany, 
  • Cameron Mura


Published: June 7, 2016

https://doi.org/10.1371/journal.pcbi.1004867

Computing has revolutionized the biological sciences over the past several decades, such that virtually all contemporary research in molecular biology, biochemistry, and other biosciences utilizes computer programs. The computational advances have come on many fronts, spurred by fundamental developments in hardware, software, and algorithms. These advances have influenced, and even engendered, a phenomenal array of bioscience fields, including molecular evolution and bioinformatics; genome-, proteome-, transcriptome- and metabolome-wide experimental studies; structural genomics; and atomistic simulations of cellular-scale molecular assemblies as large as ribosomes and intact viruses. In short, much of post-genomic biology is increasingly becoming a form of computational biology. The ability to design and write computer programs is among the most indispensable skills that a modern researcher can cultivate. Python has become a popular programming language in the biosciences, largely because (i) its straightforward semantics and clean syntax make it a readily accessible first language; (ii) it is expressive and well-suited to object-oriented programming, as well as other modern paradigms; and (iii) the many available libraries and third-party toolkits extend the functionality of the core language into virtually every biological domain (sequence and structure analyses, phylogenomics, workflow management systems, etc.). This primer offers a basic introduction to coding, via Python, and it includes concrete examples and exercises to illustrate the language’s usage and capabilities; the main text culminates with a final project in structural bioinformatics. A suite of Supplemental Chapters is also provided. Starting with basic concepts, such as that of a “variable,” the Chapters methodically advance the reader to the point of writing a graphical user interface to compute the Hamming distance between two DNA sequences.

Author Summary

Contemporary biology has largely become computational biology, whether it involves applying physical principles to simulate the motion of each atom in a piece of DNA, or using machine learning algorithms to integrate and mine “omics” data across whole cells (or even entire ecosystems). The ability to design algorithms and program computers, even at a novice level, may be the most indispensable skill that a modern researcher can cultivate. As with human languages, computational fluency is developed actively, not passively. This self-contained text, structured as a hybrid primer/tutorial, introduces any biologist—from college freshman to established senior scientist—to basic computing principles (control-flow, recursion, regular expressions, etc.) and the practicalities of programming and software design. We use the Python language because it now pervades virtually every domain of the biosciences, from sequence-based bioinformatics and molecular evolution to phylogenomics, systems biology, structural biology, and beyond. To introduce both coding (in general) and Python (in particular), we guide the reader via concrete examples and exercises. We also supply, as Supplemental Chapters, a few thousand lines of heavily-annotated, freely distributed source code for personal study.

Citation: Ekmekci B, McAnany CE, Mura C (2016) An Introduction to Programming for Bioscientists: A Python-Based Primer. PLoS Comput Biol 12(6): e1004867. https://doi.org/10.1371/journal.pcbi.1004867

Editor: Francis Ouellette, Ontario Institute for Cancer Research, CANADA

Copyright: © 2016 Ekmekci et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Portions of this work were supported by the University of Virginia, the Jeffress Memorial Trust (J-971), a UVa Harrison undergraduate research award (BE), NSF grant DUE-1044858 (CM), and NSF CAREER award MCB-1350957 (CM). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Motivation: big data and biology.

Datasets of unprecedented volume and heterogeneity are becoming the norm in science, and particularly in the biosciences. High-throughput experimental methodologies in genomics [ 1 ], proteomics [ 2 ], transcriptomics [ 3 ], metabolomics [ 4 ], and other “omics” [ 5 – 7 ] routinely yield vast stores of data on a system-wide scale. Growth in the quantity of data has been matched by an increase in heterogeneity: there is now great variability in the types of relevant data, including nucleic acid and protein sequences from large-scale sequencing projects, proteomic data and molecular interaction maps from microarray and chip experiments on entire organisms (and even ecosystems [ 8 – 10 ]), three-dimensional (3D) coordinate data from international structural genomics initiatives, petabytes of trajectory data from large-scale biomolecular simulations, and so on. In each of these areas, volumes of raw data are being generated at rates that dwarf the scale and exceed the scope of conventional data-processing and data-mining approaches.

The intense data-analysis needs of modern research projects feature at least three facets: data production , reduction/processing , and integration . Data production is largely driven by engineering and technological advances, such as commodity equipment for next-gen DNA sequencing [ 11 – 13 ] and robotics for structural genomics [ 14 , 15 ]. Data reduction requires efficient computational processing approaches, and data integration demands robust tools that can flexibly represent data ( abstractions ) so as to enable the detection of correlations and interdependencies (via, e.g., machine learning [ 16 ]). These facets are closely coupled: the rate at which raw data is now produced, e.g., in computing molecular dynamics (MD) trajectories [ 17 ], dictates the data storage, processing, and analysis needs. As a concrete example, the latest generation of highly-scalable, parallel MD codes can generate data more rapidly than they can be transferred via typical computer network backbones to local workstations for processing. Such demands have spurred the development of tools for “on-the-fly” trajectory analysis (e.g., [ 18 , 19 ]) as well as generic software toolkits for constructing parallel and distributed data-processing pipelines (e.g., [ 20 ] and S2 Text , §2). To appreciate the scale of the problem, note that calculation of all-atom MD trajectories over biologically-relevant timescales easily leads into petabyte-scale computing. Consider, for instance, a biomolecular simulation system of modest size, such as a 100-residue globular protein embedded in explicit water (corresponding to ≈10^5 particles), and with typical simulation parameters (32-bit precision, atomic coordinates written to disk, in binary format, for every ps of simulation time, etc.). Extending such a simulation to 10 µs duration—which may be at the low end of what is deemed biologically relevant for the system—would give an approximately 12-terabyte trajectory (≈10^5 particles × 3 coordinates/particle/frame × 10^7 frames × 4 bytes/coordinate = 12TB). To validate or otherwise follow-up predictions from a single trajectory, one might like to perform an additional suite of >10 such simulations, thus rapidly approaching the peta-scale.

Scenarios similar to the above example occur in other biological domains, too, at length-scales ranging from atomic to organismal. Atomistic MD simulations were mentioned above. At the molecular level of individual genes/proteins, an early step in characterizing a protein’s function and evolution might be to use sequence analysis methods to compare the protein sequence to every other known sequence, of which there are tens of millions [ 21 ]. Any form of 3D structural analysis will almost certainly involve the Protein Data Bank (PDB; [ 22 ]), which currently holds over 10^5 entries. At the cellular level, proteomics, transcriptomics, and various other “omics” areas (mentioned above) have been inextricably linked to high-throughput, big-data science since the inception of each of those fields. In genomics, the early bottleneck—DNA sequencing and raw data collection—was eventually supplanted by the problem of processing raw sequence data into derived (secondary) formats, from which point meaningful conclusions can be gleaned [ 23 ]. Enabled by the amount of data that can be rapidly generated, typical “omics” questions have become more subtle. For instance, simply assessing sequence similarity and conducting functional annotation of the open reading frames (ORFs) in a newly sequenced genome is no longer the end-goal; rather, one might now seek to derive networks of biomolecular functions from sparse, multi-dimensional datasets [ 24 ]. At the level of tissue systems, the modeling and simulation of inter-neuronal connections has developed into a new field of “connectomics” [ 25 , 26 ]. Finally, at the organismal and clinical level, the promise of personalized therapeutics hinges on the ability to analyze large, heterogeneous collections of data (e.g., [ 27 ]). As illustrated by these examples, all bioscientists would benefit from a basic understanding of the computational tools that are used daily to collect, process, represent, statistically manipulate, and otherwise analyze data. In every data-driven project, the overriding goal is to transform raw data into new biological principles and knowledge.

A New Kind of Scientist

Generating knowledge from large datasets is now recognized as a central challenge in science [ 28 ]. To succeed, each type of aforementioned data-analysis task hinges upon three things: greater computing power, improved computational methods, and computationally fluent scientists. Computing power is only marginally an issue: it lies outside the scope of most biological research projects, and the problem is often addressed by money and the acquisition of new hardware. In contrast, computational methods—improved algorithms, and the software engineering to implement the algorithms in high-quality codebases—are perpetual goals. To address the challenges, a new era of scientific training is required [ 29 – 32 ]. There is a dire need for biologists who can collect, structure, process/reduce, and analyze (both numerically and visually) large-scale datasets. The problems are more fundamental than, say, simply converting data files from one format to another (“data-wrangling”). Fortunately, the basics of the necessary computational techniques can be learned quickly. Two key pillars of computational fluency are (i) a working knowledge of some programming language and (ii) comprehension of core computer science principles (data structures, sort methods, etc.). All programming projects build upon the same set of basic principles, so a seemingly crude grasp of programming essentials will often suffice for one to understand the workings of very complex code; one can develop familiarity with more advanced topics (graph algorithms, computational geometry, numerical methods, etc.) as the need arises for particular research questions. Ideally, computational skills will begin to be developed during early scientific training. Recent educational studies have exposed the gap in life sciences and computer science knowledge among young scientists, and interdisciplinary education appears to be effective in helping bridge the gap [ 33 , 34 ].

Programming as the Way Forward

For many of the questions that arise in research, software tools have been designed. Some of these tools follow the Unix tradition to “make each program do one thing well” [ 35 ], while other programs have evolved into colossal applications that provide numerous sophisticated features, at the cost of accessibility and reliability. A small software tool that is designed to perform a simple task will, at some point, lack a feature that is necessary to analyze a particular type of dataset. A large program may provide the missing feature, but the program may be so complex that the user cannot readily master it, and the codebase may have become so unwieldy that it cannot be adapted to new projects without weeks of study. Guy Steele, a highly-regarded computer scientist, noted this principle in a lecture on programming language design [ 36 ]:

“ I should not design a small language, and I should not design a large one. I need to design a language that can grow. I need to plan ways in which it might grow—but I need, too, to leave some choices so that other persons can make those choices at a later time. ”

Programming languages provide just such a tool. Instead of supplying every conceivable feature, languages provide a small set of well-designed features and powerful tools to compose these features in new ways, using logical principles. Programming allows one to control every aspect of data analysis, and libraries provide commonly-used functionality and pre-made tools that the scientist can use for most tasks. A good library provides a simple interface for the user to perform routine tasks, but also allows the user to tweak and customize the behavior in any way desired (such code is said to be extensible ). The ability to compose programs into other programs is particularly valuable to the scientist. One program may be written to perform a particular statistical analysis, and another program may read in a data file from an experiment and then use the first program to perform the analysis. A third program might select certain datasets—each in its own file—and then call the second program for each chosen data file. In this way, the programs can serve as modules in a computational workflow.

On a related note, many software packages supply an application programming interface (API), which exposes some specific set of functionalities from the codebase without requiring the user/programmer to worry about the low-level implementation details. A well-written API enables users to combine already established codes in a modular fashion, thereby more efficiently creating customized new tools and pipelines for data processing and analysis.

A program that performs a useful task can (and, arguably, should [ 37 ]) be distributed to other scientists, who can then integrate it with their own code. Free software licenses facilitate this type of collaboration, and explicitly encourage individuals to enhance and share their programs [ 38 ]. This flexibility and ease of collaborating allows scientists to develop software relatively quickly, so they can spend more time integrating and mining, rather than simply processing, their data.

Data-processing workflows and pipelines that are designed for use with one particular program or software environment will eventually be incompatible with other software tools or workflow environments; such approaches are often described as being brittle . In contrast, algorithms and programming logic, together with robust and standards-compliant data-exchange formats, provide a completely universal solution that is portable between different tools. Simply stated, any problem that can be solved by a computer can be solved using any programming language [ 39 , 40 ]. The more feature-rich or high-level the language, the more concisely can a data-processing task be expressed using that language (the language is said to be expressive ). Many high-level languages (e.g., Python, Perl) are executed by an interpreter , which is a program that reads source code and does what the code says to do. Interpreted languages are not as numerically efficient as lower-level, compiled languages such as C or Fortran. The source code of a program in a compiled language must be converted to machine-specific instructions by a compiler, and those low-level machine code instructions ( binaries ) are executed directly by the hardware. Compiled code typically runs faster than interpreted code, but requires more work to program. High-level languages, such as Python or Perl, are often used to prototype ideas or to quickly combine modular tools (which may be written in a lower-level language) into “scripts”; for this reason they are also known as scripting languages . Very large programs often provide a scripting language for the user to run their own programs: Microsoft Office has the VBA scripting language, PyMOL [ 41 ] provides a Python interpreter, VMD [ 42 ] uses a Tcl interpreter for many tasks, and Coot [ 43 ] uses the Scheme language to provide an API to the end-user. The deep integration of high-level languages into packages such as PyMOL and VMD enables one to extend the functionality of these programs via both scripting commands (e.g., see PyMOL examples in [ 44 ]) and the creation of semi-standalone plugins (e.g., see the VMD plugin at [ 45 ]). While these tools supply interfaces to different programming languages, the fundamental concepts of programming are preserved in each case: a script written for PyMOL can be transliterated to a VMD script, and a closure in a Coot script is roughly equivalent to a closure in a Python script (see Supplemental Chapter 13 in S1 Text ). Because the logic underlying computer programming is universal, mastering one language will open the door to learning other languages with relative ease. As another major benefit, the algorithmic thinking involved in writing code to solve a problem will often lead to a deeper and more nuanced understanding of the scientific problem itself.

Why Python? (And Which Python?)

Python is the programming language used in this text because of its clear syntax [ 40 , 46 ], active developer community, free availability, extensive use in scientific communities such as bioinformatics, its role as a scripting language in major software suites, and the many freely available scientific libraries (e.g., BioPython [ 47 ]). Two of these characteristics are especially important for our purposes: (i) a clean syntax and straightforward semantics allow the student to focus on core programming concepts without the distraction of difficult syntactic forms, while (ii) the widespread adoption of Python has led to a vast base of scientific libraries and toolkits for more advanced programming projects [ 20 , 48 ]. As noted in the S2 Text (§1), several languages other than Python have also seen widespread use in the biosciences; see, e.g., [ 46 ] for a comparative analysis of some of these languages. As described by Hinsen [ 49 ], Python’s particularly rapid adoption in the sciences can be attributed to its powerful and versatile combination of features, including characteristics intrinsic to the language itself (e.g., expressiveness, a powerful object model) as well as extrinsic features (e.g., community libraries for numerical computing).

Two versions of Python are frequently encountered in scientific programming: Python 2 and Python 3. The differences between these are minor, and while this text uses Python 3 exclusively, most of the code we present will run under both versions of Python. Python 3 is being actively developed and new features are added regularly; Python 2 support continues mainly to serve existing (“legacy”) codes. New projects should use Python 3.

Role and Organization of This Text

This work, which has evolved from a modular “Programming for Bioscientists” tutorial series that has been offered at our institution, provides a self-contained, hands-on primer for general-purpose programming in the biosciences. Where possible, explanations are provided for key foundational concepts from computer science; more formal, and comprehensive, treatments can be found in several computer science texts [ 39 , 40 , 50 ] as well as bioinformatics titles, from both theoretical [ 16 , 51 ] and more practical [ 52 – 55 ] perspectives. Also, this work complements other practical Python primers [ 56 ], guides to getting started in bioinformatics (e.g., [ 57 , 58 ]), and more general educational resources for scientific programming [ 59 ].

Programming fundamentals, including variables, expressions, types, functions, and control flow and recursion, are introduced in the first half of the text (“Fundamentals of Programming”). The next major section (“Data Collections: Tuples, Lists, For Loops, and Dictionaries”) presents data structures for collections of items (lists, tuples, dictionaries) and more control flow (loops). Classes, methods, and other basics of object-oriented programming (OOP) are described in “Object-Oriented Programming in a Nutshell”. File management and input/output (I/O) is covered in “File Management and I/O”, and another practical (and fundamental) topic associated with data-processing—regular expressions for string parsing—is covered in “Regular Expressions for String Manipulations”. As an advanced topic, the text then describes how to use Python and Tkinter to create graphical user interfaces (GUIs) in “An Advanced Vignette: Creating Graphical User Interfaces with Tkinter”. Python’s role in general scientific computing is described as a topic for further exploration (“Python in General-Purpose Scientific Computing”), as is the role of software licensing (“Python and Software Licensing”) and project management via version control systems (“Managing Large Projects: Version Control Systems”). Exercises and examples occur throughout the text to concretely illustrate the language’s usage and capabilities. A final project (“Final Project: A Structural Bioinformatics Problem”) involves integrating several lessons from the text in order to address a structural bioinformatics question.

A collection of Supplemental Chapters ( S1 Text ) is also provided. The Chapters, which contain a few thousand lines of Python code, offer more detailed descriptions of much of the material in the main text. For instance, variables, functions and basic control flow are covered in Chapters 2, 3, and 5, respectively. Some topics are examined at greater depth, taking into account the interdependencies amongst topics—e.g., functions in Chapters 3, 7, and 13; lists, tuples, and other collections in Chapters 8, 9, and 10; OOP in Chapters 15 and 16. Finally, some topics that are either intermediate-level or otherwise not covered in the main text can be found in the Chapters, such as modules in Chapter 4 and lambda expressions in Chapter 13. The contents of the Chapters are summarized in Table 1 and in the S2 Text (§3, “Sample Python Chapters”).

[Table 1. Summary of the Supplemental Chapters (S1 Text). https://doi.org/10.1371/journal.pcbi.1004867.t001]

Using This Text

This text and the Supplemental Chapters work like the lecture and lab components of a course, and they are designed to be used in tandem. For readers who are new to programming, we suggest reading a section of text, including working through any examples or exercises in that section, and then completing the corresponding Supplemental Chapters before moving on to the next section; such readers should also begin by looking at §3.1 in the S2 Text , which describes how to interact with the Python interpreter, both in the context of a Unix Shell and in an integrated development environment (IDE) such as IDLE. For bioscientists who are somewhat familiar with a programming language (Python or otherwise), we suggest reading this text for background information and to understand the conventions used in the field, followed by a study of the Supplemental Chapters to learn the syntax of Python. For those with a strong programming background, this text will provide useful information about the software and conventions that commonly appear in the biosciences; the Supplemental Chapters will be rather familiar in terms of algorithms and computer science fundamentals, while the biological examples and problems may be new for such readers.

Typographic Conventions


Blocks of code are typeset in monospace font, with keywords in bold and strings in italics. Output appears on its own line without a line number, as in the following example:

1 if ( True ):

2   print (" hello ")

  hello

Fundamentals of Programming

Variables and expressions.

The concept of a variable offers a natural starting point for programming. A variable is a name that can be set to represent, or “hold,” a specific value. This definition closely parallels that found in mathematics. For example, the simple algebraic statement x = 5 is interpreted mathematically as introducing the variable x and assigning it the value 5. When Python encounters that same statement, the interpreter generates a variable named x (literally, by allocating memory), and assigns the value 5 to the variable name. The parallels between variables in Python and those in arithmetic continue in the following example, which can be typed at the prompt in any Python shell (§3.1 of the S2 Text describes how to access a Python shell):

1 x = 5

2 y = 7

3 z = x + 2 * y

4 print (z)

  19

As may be expected, the value of z is set equal to the sum of x and 2*y , or in this case 19. The print() function makes Python output some text (the argument ) to the screen; its name is a relic of early computing, when computers communicated with human users via ink-on-paper printouts. Beyond addition ( + ) and multiplication ( * ), Python can perform subtraction ( - ) and division ( / ) operations. Python is also natively capable (i.e., without add-on libraries) of other mathematical operations, including those summarized in Table 2 .

[Table 2. Python’s built-in mathematical operations. https://doi.org/10.1371/journal.pcbi.1004867.t002]
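As Table 2 survives only as an image here, the following is a minimal sketch of such built-in operations (the selection is illustrative, not the full table; all are standard Python operators):

1 print (7 - 2)   # subtraction
2 print (7 / 2)   # division (true division in Python 3)
3 print (7 // 2)  # floor division
4 print (7 % 2)   # modulo (remainder)
5 print (7 ** 2)  # exponentiation

  5
  3.5
  3
  1
  49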

1 import math

2 x = 21

3 y = math.sin(x)

4 print (y)

  0.8366556385360561

In the above program, the sine of 21 rad is calculated, stored in y , and printed to the screen as the code’s sole output. As in mathematics, an expression is formally defined as a unit of code that yields a value upon evaluation. As such, x + 2*y , 5 + 3 , sin(pi) , and even the number 5 alone, are examples of expressions (the final example is also known as a literal ). All variable definitions involve setting a variable name equal to an expression.

Python’s operator precedence rules mirror those in mathematics. For instance, 2+5*3 is interpreted as 2+(5*3) . Python supports some operations that are not often found in arithmetic, such as | and is ; a complete listing can be found in the official documentation [ 60 ]. Even complex expressions, like x+3>>1|y&4>=5 or 6 == z+x , are fully (unambiguously) resolved by Python’s operator precedence rules. However, few programmers would have the patience to determine the meaning of such an expression by simple inspection. Instead, when expressions become complex, it is almost always a good idea to use parentheses to explicitly clarify the order: (((x+3 >> 1) | y&4) >= 5) or (6 == (z + x)) .

The following block reveals an interesting deviation from the behavior of a variable as typically encountered in mathematics:

1 x = 1

2 x = 2

3 print (x)

  2

Viewed algebraically, the first two statements define an inconsistent system of equations (one with no solution) and may seem nonsensical. However, in Python, lines 1–2 are a perfectly valid pair of statements. When run, the print statement will display 2 on the screen. This occurs because Python, like most other languages, takes the statement x = 2 to be a command to assign the value of 2 to x , ignoring any previous state of the variable x ; such variable assignment statements are often denoted with the typographic convention “ x ← 2”. Lines 1–2 above are instructions to the Python interpreter, rather than some system of equations with no solutions for the variable x . This example also touches upon the fact that a Python variable is purely a reference to an object such as the integer 5 (For now, take an object to simply be an addressable chunk of memory, meaning it can have a value and be referenced by a variable; objects are further described in the section on OOP.). This is a property of Python’s type system . Python is said to be dynamically typed , versus statically typed languages such as C. In statically typed languages, a program’s data (variable names) are bound to both an object and a type, and type checking is performed at compile-time; in contrast, variable names in a program written in a dynamically typed language are bound only to objects, and type checking is performed at run-time. An extensive treatment of this topic can be found in [ 61 ]. Dynamic typing is illustrated by the following example. (The pound sign, #, starts a comment ; Python ignores anything after a # sign, so in-line comments offer a useful mechanism for explaining and documenting one’s code.)

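A minimal sketch of the shared-reference behavior discussed in the next paragraph (the variable names x and y follow that description; the exact example shown in the original figure may differ):

1 x = 1  # x references an int object whose value is 1
2 y = x  # a shared reference: y now points to the same object as x
3 x = 2  # rebinds x to a new int object; y is unaffected
4 print (y)

  1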

The above behavior results from the fact that, in Python, the notion of type (defined below) is attached to an object, not to any one of the potentially multiple names (variables) that reference that object. The first two lines illustrate that two or more variables can reference the same object (known as a shared reference ), which in this case is of type int . When y = x is executed, y points to the object x points to (the integer 1 ). When x is changed, y still points to that original integer object. Note that Python strings and integers are immutable , meaning they cannot be changed in-place. However, some other object types, such as lists (described below), are mutable. These aspects of the language can become rather subtle, and the various features of the variable/object relationship—shared references, object mutability, etc.—can give rise to complicated scenarios. Supplemental Chapter 8 ( S1 Text ) explores the Python memory model in more detail.

Statements and Types

A statement is a command that instructs the Python interpreter to do something. All expressions are statements, but a statement need not be an expression. For instance, a statement that, upon execution, causes a program to stop running would never return a value, so it cannot be an expression. Most broadly, statements are instructions, while expressions are combinations of symbols (variables, literals, operators, etc.) that evaluate to a particular value. This particular value might be numerical (e.g., 5 ), a string (e.g., 'foo' ), Boolean ( True / False ), or some other type. Further distinctions between expressions and statements can become esoteric, and are not pertinent to much of the practical programming done in the biosciences.

The type of an object determines how the interpreter will treat the object when it is used. Given the code x = 5 , we can say that “ x is a variable that refers to an object that is of type int ”. We may simplify this to say “ x is an int ”; while technically incorrect, that is a shorter and more natural phrase. When the Python interpreter encounters the expression x + y , if x and y are [variables that point to objects of type] int , then the interpreter would use the addition hardware on the computer to add them. If, on the other hand, x and y were of type str , then Python would join them together. If one is a str and one is an int , the Python interpreter would “raise an exception” and the program would crash. Thus far, each variable we have encountered has been an integer ( int ) type, a string ( str ), or, in the case of sin() ’s output, a real number stored to high precision (a float , for floating-point number). Strings and their constituent characters are among the most useful of Python’s built-in types. Strings are sequences of characters, such as any word in the English language. In Python, a character is simply a string of length one. Each character in a string has a corresponding index, starting from 0 and ranging to index n-1 for a string of n characters. Fig 1 diagrams the composition and some of the functionality of a string, and the following code-block demonstrates how to define and manipulate strings and characters:

1 x = " red "

2 y = " green "

3 z = " blue "

4 print (x + y + z)

 redgreenblue

5 a = x[1]

6 b = y[2]

7 c = z[3]

8 print (a + " " + b + " " + c)

  e e e


The anatomy and basic behavior of Python strings are shown, as samples of actual code (left panel) and corresponding conceptual diagrams (right panel). The Python interpreter prompts for user input on lines beginning with >>> (leftmost edge), while a starting … denotes a continuation of the previous line; output lines are not prefixed by an initial character (e.g., the fourth line in this example). Strings are simply character array objects (of type str ), and a sample string-specific method ( replace ) is shown on line 3. As with ordinary lists, strings can be ‘sliced’ using the syntax shown here: the first list element to be included in the slice is indexed by start , and the last included element is at stop-1 , with an optional stride of size step (defaults to one). Concatenation, via the + operator, is the joining of whole strings or subsets of strings that are generated via slicing (as in this case). For clarity, the integer indices of the string positions are shown only in the forward (left to right) direction for mySnake1 and in the reverse direction for mySnake2 . These two strings are sliced and concatenated to yield the object newSnake ; note that slicing mySnake1 as [0:7] and not [0:6] means that a whitespace char is included between the two words in the resultant newSnake , thus obviating the need for further manipulations to insert whitespace (e.g., concatenations of the form word1+' '+word2 ).

https://doi.org/10.1371/journal.pcbi.1004867.g001

Here, three variables are created by assignment to three corresponding strings. The first print may seem unusual: the Python interpreter is instructed to “add” three strings; the interpreter joins them together in an operation known as concatenation . The second portion of code stores the character 'e' , as extracted from each of the first three strings, in the respective variables, a , b and c . Then, their content is printed, just as the first three strings were. Note that spacing is not implicitly handled by Python (or most languages) so as to produce human-readable text; therefore, quoted whitespace was explicitly included between the strings (line 8; see also the underscore characters, ‘_’, in Fig 1 ).
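A minimal sketch tying this back to the type-dependent behavior of + described earlier, together with the slicing syntax from Fig 1 (the particular values here are illustrative):

1 print (5 + 2)         # two ints: arithmetic addition
2 print ("5" + "2")     # two strings: concatenation
3 print ("green"[1:4])  # slicing: characters at indices 1, 2, and 3
4 # print ("5" + 2) would raise a TypeError, since a str and an int cannot be added

  7
  52
  ree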

Exercise 1 : Write a program to convert a temperature in degrees Fahrenheit to degrees Celsius and Kelvin. The topic of user input has not been covered yet (to be addressed in the section on File Management and I/O), so begin with a variable that you pre-set to the initial temperature (in °F). Your code should convert the temperature to these other units and print it to the console.

A deep benefit of the programming approach to problem-solving is that computers enable mechanization of repetitive tasks, such as those associated with data-analysis workflows. This is true in biological research and beyond. To achieve automation, a discrete and well-defined component of the problem-solving logic is encapsulated as a function. A function is a block of code that expresses the solution to a small, standalone problem/task; quite literally, a function can be any block of code that is defined by the user as being a function. Other parts of a program can then call the function to perform its task and possibly return a solution. For instance, a function can be repetitively applied to a series of input values via looping constructs (described below) as part of a data-processing pipeline.


1 def myFun (a,b):

2  c = a + b

3  d = a - b

4   return c*d  # NB: a return does not ' print ' anything on its own

5 x = myFun(1,3) + myFun(2,8) + myFun(-1,18)

6 print (x)

  -391

To see the utility of functions, consider how much code would be required to calculate x (line 5) in the absence of any calls to myFun . Note that discrete chunks of code, such as the body of a function, are delimited in Python via whitespace, not curly braces, {} , as in C or Perl. In Python, each level of indentation of the source code corresponds to a separate block of statements that group together in terms of program logic. The first line of above code illustrates the syntax to declare a function: a function definition begins with the keyword def , the following word names the function, and then the names within parentheses (separated by commas) define the arguments to the function. Finally, a colon terminates the function definition. (Default values of arguments can be specified as part of the function definition; e.g., writing line 1 as def myFun(a = 1,b = 3): would set default values of a and b .) The three statements after def myFun(a,b): are indented by some number of spaces (two, in this example), and so these three lines (2–4) constitute a block . In this block, lines 2–3 perform arithmetic operations on the arguments, and the final line of this function specifies the return value as the product of variables c and d . In effect, a return statement is what the function evaluates to when called, this return value taking the place of the original function call. It is also possible that a function returns nothing at all; e.g., a function might be intended to perform various manipulations and not necessarily return any output for downstream processing. For example, the following code defines (and then calls) a function that simply prints the values of three variables, without a return statement:

1 def readOut (a,b,c):

2   print (" Variable 1 is: ", a)

3   print (" Variable 2 is: ", b)

4   print (" Variable 3 is: ", c)

5 readOut(1,2,4)

  Variable 1 is : 1

  Variable 2 is : 2

  Variable 3 is : 4

6 readOut(21,5553,3.33)

  Variable 1 is : 21

  Variable 2 is : 5553

  Variable 3 is : 3.33
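Picking up the default-argument remark above, a minimal sketch that reuses myFun’s arithmetic with the default values def myFun(a = 1,b = 3): would supply:

1 def myFun (a = 1, b = 3):
2   return (a + b)*(a - b)
3 print (myFun())   # both defaults: (1+3)*(1-3)
4 print (myFun(2))  # a = 2, b defaults to 3: (2+3)*(2-3)

  -8
  -5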

Code Organization and Scope

Beyond automation, structuring a program into functions also aids the modularity and interpretability of one’s code, and ultimately facilitates the debugging process—an important consideration in all programming projects, large or small.

Python functions can be nested ; that is, one function can be defined inside another. If a particular function is needed in only one place, it can be defined where it is needed and it will be unavailable elsewhere, where it would not be useful. Additionally, nested function definitions have access to the variables that are available when the nested function is defined. Supplemental Chapter 13 explores nested functions in greater detail. A function is an object in Python, just like a string or an integer. (Languages that allow function names to behave as objects are said to have “first-class functions.”) Therefore, a function can itself serve as an argument to another function, analogous to the mathematical composition of two functions, g ( f ( x )). This property of the language enables many interesting programming techniques, as explored in Supplemental Chapters 9 and 13.
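A minimal sketch of both ideas (the function names here are invented for illustration):

1 def makeAdder (a):
2   def add (b):  # a nested function; it can see a from the enclosing scope
3     return a + b
4   return add    # functions are objects, so they can be returned...
5 addTwo = makeAdder(2)
6 print (addTwo(5))
7 def applyFn (g, x):  # ...and passed as arguments, as in g(f(x))
8   return g(x)
9 print (applyFn(addTwo, 10))

  7
  12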

[Fig 2. https://doi.org/10.1371/journal.pcbi.1004867.g002]

Well-established practices have evolved for structuring code in a logically organized (often hierarchical) and “clean” (lucid) manner, and comprehensive treatments of both practical and abstract topics are available in numerous texts. See, for instance, the practical guide Code Complete [ 64 ], the intermediate-level Design Patterns: Elements of Reusable Object-Oriented Software [ 65 ], and the classic (and more abstract) texts Structure and Interpretation of Computer Programs [ 39 ] and Algorithms [ 50 ]; a recent, and free, text in the latter class is Introduction to Computing [ 40 ]. Another important aspect of coding is closely related to the above: usage of brief, yet informative, names as identifiers for variables and function definitions. Even a mid-sized programming project can quickly grow to thousands of lines of code, employ hundreds of functions, and involve hundreds of variables. Though the fact that many variables will lie outside the scope of one another lessens the likelihood of undesirable references to ambiguous variable names, one should note that careless, inconsistent, or undisciplined nomenclature will confuse later efforts to understand a piece of code, for instance by a collaborator or, after some time, even the original programmer. Writing clear, well-defined and well-annotated code is an essential skill to develop. Table 3 outlines some suggested naming practices.

[Table 3. Suggested naming practices. https://doi.org/10.1371/journal.pcbi.1004867.t003]


Exercise 2 : Recall the temperature conversion program of Exercise 1. Now, write a function to perform the temperature conversion; this function should take one argument (the input temperature). To test your code, use the function to convert and print the output for some arbitrary temperatures of your choosing.

Control Flow: Conditionals

“Begin at the beginning,” the King said gravely, “and go on till you come to the end; then, stop.” —Lewis Carroll, Alice in Wonderland

Thus far, all of our sample code and exercises have featured a linear flow, with statements executed and values emitted in a predictable, deterministic manner. However, most scientific datasets are not amenable to analysis via a simple, predefined stream of instructions. For example, the initial data-processing stages in many types of experimental pipelines may entail the assignment of statistical confidence/reliability scores to the data, and then some form of decision-making logic might be applied to filter the data. Often, if a particular datum does not meet some statistical criterion and is considered a likely outlier, then a special task is performed; otherwise , another (default) route is taken. This branched if – then – else logic is a key decision-making component of virtually any algorithm, and it exemplifies the concept of control flow. The term control flow refers to the progression of logic as the Python interpreter traverses the code and the program “runs”—transitioning, as it runs, from one state to the next, choosing which statements are executed, iterating over a loop some number of times, and so on. (Loosely, the state can be taken as the line of code that is being executed, along with the collection of all variables, and their values, accessible to a running program at any instant; given the precise state, the next state of a deterministic program can be predicted with perfect precision.) The following code introduces the if statement:

1 from random import randint

2 a = randint(0,100)  # get a random integer between 0 and 100 (inclusive)

3 if (a < 50):

4   print (" variable is less than 50 ")

6   print (" the variable is not less than 50 ")

  variable is less than 50

In this example, a random integer between 0 and 100 is assigned to the variable a . (Though not applicable to randint , note that many sequence/list-related functions, such as range(a,b) , generate collections that start at the first argument and end just before the last argument. This is because the function range(a,b) produces b − a items starting at a ; with a default stepsize of one, this makes the endpoint b-1 .) Next, the if statement tests whether the variable is less than 50 . If that condition is unfulfilled, the block following else is executed. Syntactically, if is immediately followed by a test condition , and then a colon to denote the start of the if statement’s block ( Fig 3 illustrates the use of conditionals). Just as with functions, the further indentation on line 4 creates a block of statements that are executed together (here, the block has only one statement). Note that an if statement can be defined without a corresponding else block; in that case, Python simply continues executing the code that is indented by one less level (i.e., at the same indentation level as the if line). Also, Python offers a built-in elif keyword (a contraction of “else if”) that tests a subsequent conditional if and only if the first condition is not met. A series of elif statements can be used to achieve similar effects as the switch / case statement constructs found in C and in other languages (including Unix shell scripts) that are often encountered in bioinformatics.

[Fig 3. https://doi.org/10.1371/journal.pcbi.1004867.g003]

Now, consider the following extension to the preceding block of code. Is there any fundamental issue with it?

1 from random import randint

2 a = randint(0,100)

3 if (a < 50):

4   print (" variable is less than 50 ")

5 if (a > 50):

6   print (" variable is greater than 50 ")

7 else:

8   print (" the variable must be 50 ")

  variable is greater than 50

This code will function as expected for a = 50 , as well as values exceeding 50 . However, for a less than 50 , the print statements will be executed from both the less-than (line 4) and equal-to (line 8) comparisons. This erroneous behavior results because an else statement is bound solely to the if statement that it directly follows; in the above code-block, an elif would have been the appropriate keyword for line 5. This example also underscores the danger of assuming that lack of a certain condition (a False built-in Boolean type) necessarily implies the fulfillment of a second condition (a True ) for comparisons that seem, at least superficially, to be linked. In writing code with complicated streams of logic (conditionals and beyond), robust and somewhat redundant logical tests can be used to mitigate errors and unwanted behavior. A strategy for building streams of conditional statements into code, and for debugging existing codebases, involves (i) outlining the range of possible inputs (and their expected outputs), (ii) crafting the code itself, and then (iii) testing each possible type of input, carefully tracing the logical flow executed by the algorithm against what was originally anticipated. In step (iii), a careful examination of “edge cases” can help debug code and pinpoint errors or unexpected behavior. (In software engineering parlance, edge cases refer to extreme values of parameters, such as minima/maxima when considering ranges of numerical types. Recognition of edge-case behavior is useful, as a disproportionate share of errors occur near these cases; for instance, division by zero can crash a function if the denominator in each division operation that appears in the function is not carefully checked and handled appropriately. Though beyond the scope of this primer, note that Python supplies powerful error-reporting and exception-handling capabilities; see, for instance, Python Programming [ 66 ] for more information.) Supplemental Chapters 14 and 16 in S1 Text provide detailed examples of testing the behavior of code.
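For concreteness, a corrected version of the faulty block, using elif on line 5 as suggested above (a minimal sketch):

1 from random import randint
2 a = randint(0,100)
3 if (a < 50):
4   print (" variable is less than 50 ")
5 elif (a > 50):
6   print (" variable is greater than 50 ")
7 else:
8   print (" the variable must be 50 ")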

Exercise 3 : Recall the temperature-conversion program designed in Exercises 1 and 2. Now, rewrite this code such that it accepts two arguments: the initial temperature, and a letter designating the units of that temperature. Have the function convert the input temperature to the alternative scale: if the second argument is ‘C’ , convert the temperature to Fahrenheit; if it is ‘F’ , convert it to Celsius.

Integrating what has been described thus far, the following example demonstrates the power of control flow—not just to define computations in a structured/ordered manner, but also to solve real problems by devising an algorithm. In this example, we sort three randomly chosen integers:

1 from random import randint
2 def numberSort():
3     a = randint(0,100)
4     b = randint(0,100)
5     c = randint(0,100)
6     # reminder: text following the pound sign is a comment in Python.
7     # begin sort; note the nested conditionals here
8     if ((a > b) and (a > c)):
9         largest = a
10         if (b > c):
11             second = b
12             third = c
13         else:
14             second = c
15             third = b
16     # a must not be largest
17     elif (b > c):
18         largest = b
19         if (c > a):
20             second = c
21             third = a
22         else:
23             second = a
24             third = c
25     # a and b are not largest, thus c must be
26     else:
27         largest = c
28         if (b < a):
29             second = a
30             third = b
31         else:
32             second = b
33             third = a
34     # Python's assert function can be used for sanity checks.
35     # If the argument to assert() is False, the program will crash.
36     assert (largest > second)
37     assert (second > third)
38     print("Sorted: ", largest, ",", second, ",", third)
39 numberSort()

  Sorted: 50, 47, 11

Control Flow: Repetition via While Loops

Whereas the if statement tests a condition exactly once and branches the code execution accordingly, the while statement instructs an enclosed block of code to repeat so long as the given condition (the continuation condition ) is satisfied. In fact, while can be considered as a repeated if . This is the simplest form of a loop, and is termed a while loop ( Fig 3 ). The condition check occurs once before entering the associated block; thus, Python’s while is a pre-test loop. (Some languages feature looping constructs wherein the condition check is performed after a first iteration; C’s do–while is an example of such a post-test loop. This is mentioned because looping constructs should be carefully examined when comparing source code in different languages.) If the condition is true, the block is executed and then the interpreter effectively jumps to the while statement that began the block. If the condition is false, the block is skipped and the interpreter jumps to the first statement after the block. The code below is a simple example of a while loop, used to generate a counter that prints each integer between 1 and 100 (inclusive):

1 counter = 1
2 while (counter <= 100):
3     print(counter)
4     counter = counter + 1
5 print("done!")

This code initializes a counter variable to 1, then repeatedly prints and increments it; once the counter reaches 101, the while condition fails, the loop ends, and a final string (line 5) is printed. Crucially, one should verify that the loop termination condition can, in fact, be reached. If not—e.g., if the loop were specified as while(True): for some reason—then the loop would continue indefinitely, creating an infinite loop that would render the program unresponsive. (In many environments, such as a Unix shell, the keystroke Ctrl-c can be used as a keyboard interrupt to break out of the loop.)

Exercise 4 : With the above example as a starting point, write a function that chooses two randomly-generated integers between 0 and 100 , inclusive, and then prints all numbers between these two values, counting from the lower number to the upper number.

“In order to understand recursion, you must first understand recursion.” — Anonymous

Control Flow: Recursion

A function that calls itself is said to be recursive. Every recursive function needs a base case: a trivially simple input for which the function returns a value directly, thereby ending the chain of self-referential calls. The classic example is the factorial, n! = n × (n−1) × … × 2 × 1, which can be computed recursively as follows:

1 def factorial(n):
2     assert (n > 0)  # Crash on invalid input
3     if (n == 1):
4         return 1
5     else:
6         return n * factorial(n-1)

A call to this factorial function will return 1 if the input is equal to one, and otherwise will return the input value multiplied by the factorial of that integer less one ( factorial(n-1) ). Note that this recursive implementation of the factorial perfectly matches its mathematical definition. This often holds true, and many mathematical operations on data are most easily expressed recursively. When the Python interpreter encounters the call to the factorial function within the function block itself (line 6), it generates a new instance of the function on the fly, while retaining the original function in memory (technically, these function instances occupy the runtime’s call stack ). Python places the current function call on hold in the call stack while the newly-called function is evaluated. This process continues until the base case is reached, at which point the function returns a value. Next, the previous function instance in the call stack resumes execution, calculates its result, and returns it. This process of traversing the call stack continues until the very first invocation has returned. At that point, the call stack is empty and the function evaluation has completed.
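To make the call-stack behavior concrete, here is an illustrative trace (ours, not from the original text) of evaluating factorial(4); each step down corresponds to pushing a new function instance onto the call stack, and the final collapse corresponds to the successive returns:

factorial(4)
= 4 * factorial(3)
= 4 * (3 * factorial(2))
= 4 * (3 * (2 * factorial(1)))
= 4 * (3 * (2 * 1))
= 24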

Expressing Problems Recursively

Defining recursion simply as a function calling itself misses some nuances of the recursive approach to problem-solving. Any difficult problem (e.g., f ( n ) = n !) that can be expressed as a simpler instance of the same problem (e.g., f ( n ) = n * f ( n − 1)) is amenable to a recursive solution. Only when the problem is trivially easy (1!, factorial(1) above) does the recursive solution give a direct (one-step) answer. Recursive approaches fundamentally differ from more iterative (also known as procedural ) strategies: Iterative constructs (loops) express the entire solution to a problem in more explicit form, whereas recursion repeatedly makes a problem simpler until it is trivial. Many data-processing functions are most naturally and compactly solved via recursion.

The recursive descent/ascent behavior described above is extremely powerful, and care is required to avoid pitfalls and frustration. For example, consider the following addition algorithm, which uses the equality operator ( == ) to test for the base case:

1 def badRecursiveAdder(x):
2     if (x == 1):
3         return x
4     else:
5         return x + badRecursiveAdder(x-2)

This function does include a base case (lines 2–3), and at first glance may seem to act as expected, yielding a sequence of squares (1, 4, 9, 16…) for x = 1, 3, 5, 7,… Indeed, for odd x greater than 1 , the function will behave as anticipated. However, if the argument is negative or is an even number, the base case will never be reached (note that line 5 subtracts 2 ), and the recursion will continue without end, much as an infinite loop would. (In practice, Python’s maximum recursion depth will be exceeded and the interpreter will raise a RecursionError.) Thus, in addition to defining the function’s base case, it is also crucial to confirm that all possible inputs will reach the base case. A valid recursive function must progress towards—and eventually reach—the base case with every call. More information on recursion can be found in Supplemental Chapter 7 in S1 Text , in Chapter 4 of [ 40 ], and in most computer science texts.

Exercise 5 : The Fibonacci sequence of integers (0, 1, 1, 2, 3, 5, 8, 13, …) is defined by the recurrence F(n) = F(n−1) + F(n−2), with base cases F(0) = 0 and F(1) = 1. Write a recursive Python function that computes the nth Fibonacci number, and test it by generating the first several terms of the sequence. (Exercise 6, below, revisits this problem iteratively.)

Exercise 6 : Many functions can be coded both recursively and iteratively (using loops), though often it will be clear that one approach is better suited to the given problem (the factorial is one such example). In this exercise, devise an iterative Python function to compute the factorial of a user-specified integer argument. As a bonus exercise, try coding the Fibonacci sequence in iterative form. Is this as straightforward as the recursive approach? Note that Supplemental Chapter 7 in the S1 Text might be useful here.

Data Collections: Tuples, Lists, For Loops, and Dictionaries

A staggering degree of algorithmic complexity is possible using only variables, functions, and control flow concepts. However, thus far, numbers and strings are the only data types that have been discussed. Such data types can be used to represent protein sequences (a string) and molecular masses (a floating point number), but actual scientific data are seldom so simple! The data from a mass spectrometry experiment are a list of intensities at various m/z values (the mass spectrum). Optical microscopy experiments yield thousands of images, each consisting of a large two-dimensional array of pixels, and each pixel has color information that one may wish to access [ 68 ]. A protein multiple sequence alignment can be considered as a two-dimensional array of characters drawn from a 21-letter alphabet (one letter per amino acid (AA) and a gap symbol), and a protein 3D structural alignment is even more complex. Phylogenetic trees consist of sets of species, individual proteins, or other taxonomic entities, organized as (typically) binary trees with branch weights that represent some metric of evolutionary distance. A trajectory from an MD or Brownian dynamics simulation is especially dense: Cartesian coordinates and velocities are specified for upwards of 10^6 atoms at >10^6 time-points (every ps in a μs-scale trajectory). As illustrated by these examples, real scientific data exhibit a level of complexity far beyond Python’s relatively simple built-in data types. Modern datasets are often quite heterogeneous, particularly in the biosciences [ 69 ], and therefore data abstraction and integration are often the major goals. The data challenges hold true at all levels, from individual RNA transcripts [ 70 ] to whole bacterial cells [ 71 ] to biomedical informatics [ 72 ].

In each of the above examples, the relevant data comprise a collection of entities, each of which, in turn, is of some simpler data type. This unifying principle offers a way forward. The term data structure refers to an object that stores data in a specifically organized (structured) manner, as defined by the programmer. Given an adequately well-specified/defined data structure, arbitrarily complex collections of data can be readily handled by Python, from a simple array of integers to a highly intricate, multi-dimensional, heterogeneous (mixed-type) data structure. Python offers several built-in sequence data structures, including strings, lists, and tuples.

A tuple (pronounced like “couple”) is simply an ordered sequence of objects, with essentially no restrictions as to the types of the objects. Thus, the tuple is especially useful in building data structures as higher-order collections. Data that are inherently sequential (e.g., time-series data recorded by an instrument) are naturally expressed as a tuple, as illustrated by the following syntactic form: myTuple = (0,1,3) . The tuple is surrounded by parentheses, and commas separate the individual elements. The empty tuple is denoted () , and a tuple of one element contains a comma after that element, e.g., (1,) ; the final comma lets Python distinguish between a tuple and a mathematical operation. That is, 2*(3+1) must not treat (3+1) as a tuple. A parenthesized expression is therefore not made into a tuple unless it contains commas. (The type function is a useful built-in function to probe an object’s type. At the Python interpreter, try the statements type((1)) and type((1,)) . How do the results differ?)
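At the interpreter, the two statements reply as follows, which answers the question directly: without the trailing comma, the parentheses are mere grouping and the expression is just the integer 1.

>>> type((1))
<class 'int'>
>>> type((1,))
<class 'tuple'>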

A tuple can contain any sort of object, including another tuple. For example, diverseTuple = (15.38,"someString",(0,1)) contains a floating-point number, a string, and another tuple. This versatility makes tuples an effective means of representing complex or heterogeneous data structures. Note that any component of a tuple can be referenced using the same notation used to index individual characters within a string; e.g., diverseTuple[0] gives 15.38 .

In general, data are optimally stored, analyzed, modified, and otherwise processed using data structures that reflect any underlying structure of the data itself. Thus, for example, two-dimensional datasets are most naturally stored as tuples of tuples. This abstraction can be taken to arbitrary depth, making tuples useful for storing arbitrarily complex data. For instance, tuples have been used to create generic tensor-like objects. These rich data structures have been used in developing new tools for the analysis of MD trajectories [ 18 ] and to represent biological sequence information as hierarchical, multidimensional entities that are amenable to further processing in Python [ 20 ].

As a concrete example, consider the problem of representing signal intensity data collected over time. If the data are sampled with perfect periodicity, say every second, then the information could be stored (most compactly) in a one-dimensional tuple, as a simple succession of intensities; the index of an element in the tuple maps to a time-point (index 0 corresponds to the measurement at time t 0 , index 1 is at time t 1 , etc.). What if the data were sampled unevenly in time? Then each datum could be represented as an ordered pair, ( t , I ( t )), of the intensity I at each time-point t ; the full time-series of measurements is then given by the sequence of 2-element tuples, like so:

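(The specific times and intensities below are arbitrary, illustrative values.)

# five (t, I(t)) pairs, sampled unevenly in time
dataSet = ((0.0, 1.02), (1.0, 0.97), (2.2, 0.91),
           (3.3, 0.84), (4.1, 0.79))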

Three notes concern the above code: (i) From this two-dimensional data structure, the syntax dataSet[i][j] retrieves the j th element from the i th tuple. (ii) Negative indices can be used as shorthand to index from the end of most collections (tuples, lists, etc.), as shown in Fig 1 ; thus, in the above example dataSet[-1] represents the same value as dataSet[4] . (iii) Recall that Python treats all lines of code that belong to the same block (or degree of indentation) as a single unit. In the example above, the first line alone is not a valid (closed) expression, and Python allows the expression to continue on to the next line; the lengthy dataSet expression was formatted as above in order to aid readability.

Once defined, a tuple cannot be altered; tuples are said to be immutable data structures. This rigidity can be helpful or restrictive, depending on the context and intended purpose. For instance, tuples are suitable for storing numerical constants, or for ordered collections that are generated once during execution and intended only for referencing thereafter (e.g., an input stream of raw data).

A mutable data structure is the Python list . This built-in sequence type allows for the addition, removal, and modification of elements. The syntactic form used to define lists resembles the definition of a tuple, except that the parentheses are replaced with square brackets, e.g. myList = [0, 1, 42, 78] . (A trailing comma is unnecessary in one-element lists, as [1] is unambiguously a list.) As suggested by the preceding line, the elements in a Python list are typically more homogeneous than might be found in a tuple: The statement myList2 = ['a',1] , which defines a list containing both string and numeric types, is technically valid, but myList2 = ['a','b'] or myList2 = [0, 1] would be more frequently encountered in practice. Note that myList[1] = 3.14 is a perfectly valid statement that can be applied to the already-defined object named myList (as long as myList already contains two or more elements), resulting in the modification of the second element in the list. Finally, note that myList[5] = 3.14 will raise an error, as the list defined above does not contain a sixth element. The index is said to be out of range , and a valid approach would be to append the value via myList.append(3.14) .
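A short illustration of list mutability (a sketch with arbitrary values):

myList = [0, 1, 42, 78]
myList[1] = 3.14        # valid: replaces the second element
myList.append(2.718)    # valid: grows the list to five elements
# myList[5] = 9.99      # would raise an IndexError: list index out of range
print(myList)           # [0, 3.14, 42, 78, 2.718]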

The foregoing description only scratches the surface of Python’s built-in data structures. Several functions and methods are available for lists, tuples, strings, and other built-in types. For lists, append , insert , and remove are examples of oft-used methods; the function len() returns the number of items in a sequence or collection, such as the length of a string or number of elements in a list. All of these “list methods” behave similarly as any other function—arguments are generally provided as input, some processing occurs, and values may be returned. (The OOP section, below, elaborates the relationship between functions and methods.)

Iteration with For Loops

Lists and tuples are examples of iterable types in Python, and the for loop is a useful construct in handling such objects. (Custom iterable types are introduced in Supplemental Chapter 17 in S1 Text .) A Python for loop iterates over a collection, which is a common operation in virtually all data-analysis workflows. Recall that a while loop requires a counter to track progress through the iteration, and this counter is tested against the continuation condition. In contrast, a for loop handles the count implicitly, given an argument that is an iterable object:

1 myData = [1.414, 2.718, 3.142, 4.669]
2 total = 0
3 for datum in myData:
4     # the next statement uses a compound assignment operator; in
5     # the addition assignment operator, a += b means a = a + b
6     total += datum
7     print("added " + str(datum) + " to sum.")
8     # str makes a string from datum so we can concatenate with +.

  added 1.414 to sum.
  added 2.718 to sum.
  added 3.142 to sum.
  added 4.669 to sum.

9 print(total)

  11.942999999999998

In the above loop, all elements in myData are of the same type (namely, floating-point numbers). This is not mandatory. For instance, the heterogeneous object myData = ['a','b',1,2] is iterable, and therefore it is a valid argument to a for loop (though not the above loop, as string and integer types cannot be mixed as operands to the + operator). The context dependence of the + symbol, meaning either numeric addition or a concatenation operator, depending on the arguments, is an example of operator overloading . (Together with dynamic typing, operator overloading helps make Python a highly expressive programming language.) In each iteration of the above loop, the variable datum is assigned each successive element in myData ; specifying this iterative task as a while loop is possible, but less straightforward. Finally, note the syntactic difference between Python’s for loops and the for(<initialize>; <condition>; <update>) {<body>} construct that is found in C, Perl, and other languages encountered in computational biology.
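The overloading of + can be seen directly in a two-line illustration:

print(1 + 2)       # numeric addition: prints 3
print("1" + "2")   # string concatenation: prints 12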

Exercise 7 : Consider the fermentation of glucose into ethanol: C 6 H 12 O 6 → 2C 2 H 5 OH + 2CO 2 . A fermentor is initially charged with 10,000 liters of feed solution and the rate of carbon dioxide production is measured by a sensor in moles/hour. At t = 10, 20, 30, 40, 50, 60, 70, and 80 hours, the CO 2 generation rates are 58.2, 65.2, 67.8, 65.4, 58.8, 49.6, 39.1, and 15.8 moles/hour respectively. Assuming that each reading represents the average CO 2 production rate over the previous ten hours, calculate the total amount of CO 2 generated and the final ethanol concentration in grams per liter. Note that Supplemental Chapters 6 and 9 might be useful here.

Exercise 8 : Write a program to compute the distance, d ( r 1 , r 2 ), between two arbitrary (user-specified) points, r 1 = ( x 1 , y 1 , z 1 ) and r 2 = ( x 2 , y 2 , z 2 ), in 3D space. Use the usual Euclidean distance between two points—the straight-line, “as the bird flies” distance. Other distance metrics, such as the Mahalanobis and Manhattan distances, often appear in computational biology too. With your code in hand, note the ease with which you can adjust your entire data-analysis workflow simply by modifying a few lines of code that correspond to the definition of the distance function. As a bonus exercise, generalize your code to read in a list of points and compute the total path length. Supplemental Chapters 6, 7, and 9 might be useful here.

Sets and Dictionaries

Beyond sequences, Python provides built-in collection types that organize data by membership and association rather than by position. A set is an unordered collection of unique elements: adding a duplicate has no effect, and sets support mathematical operations such as union and intersection. A dictionary (type dict ) is a mutable collection of key–value pairs, in which each unique key maps to an associated value; this allows data to be indexed by a meaningful name (e.g., a residue type) rather than by an integer position, making dictionaries a natural fit for tallying occurrences and building lookup tables.
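As a minimal sketch (the peptide string here is a hypothetical example):

seq = "GAVLIGAVAG"            # a short, made-up peptide sequence
counts = {}                   # dict mapping residue -> occurrence count
for aa in seq:
    counts[aa] = counts.get(aa, 0) + 1
uniqueResidues = set(seq)     # duplicates removed; order not preserved
print(counts)                 # {'G': 3, 'A': 3, 'V': 2, 'L': 1, 'I': 1}
print(uniqueResidues)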

Further Data Structures: Trees and Beyond

Python’s built-in data structures are made for sequential data, and using them for other purposes can quickly become awkward. Consider the task of representing genealogy: an individual may have some number of children, and each child may have their own children, and so on. There is no straightforward way to represent this type of information as a flat list or tuple. A better approach is to represent each organism as a tuple containing its children; each of those elements would, in turn, be another tuple with children, and so on (see the sketch after this paragraph). A specific organism is a node in this data structure, with a branch leading to each of its child nodes; an organism having no children is effectively a leaf . A node that is not the child of any other node is the root of this tree. This intuitive description corresponds exactly to the terminology used by computer scientists in describing trees [ 73 ]. Trees are pervasive in computer science. This document, for example, could be represented purely as a list of characters, but doing so neglects its underlying structure, which is that of a tree (sections, sub-sections, sub-sub-sections, …). The whole document is the root entity, each section is a node on a branch, each sub-section a branch from a section, and so on down through the paragraphs, sentences, words, and letters. A common and intuitive use of trees in bioinformatics is to represent phylogenetic relationships. However, trees are such a general data structure that they also find use, for instance, in computational geometry applications to biomolecules (e.g., to optimally partition data along different spatial dimensions [ 74 , 75 ]).
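For instance, a three-generation genealogy might be sketched as nested tuples (the names are hypothetical):

# each individual is a (name, children) pair; a leaf has an empty tuple of children
familyTree = ("grandparent",
              (("parent_A", (("child_1", ()), ("child_2", ()))),
               ("parent_B", ())))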

Trees are, by definition, (i) acyclic , meaning that following a branch from node i will never lead back to node i , and any node has exactly one parent; and (ii) directed , meaning that a node knows only about the nodes “below” it, not the ones “above” it. Relaxing these requirements gives a graph [ 76 ], which is an even more fundamental and universal data structure: A graph is a set of vertices that are connected by edges. Graphs can be subtle to work with and a number of clever algorithms are available to analyze them [ 77 ].

There are countless data structures available, and more are constantly being devised. Advanced examples range from the biologically-inspired neural network, which is essentially a graph wherein the vertices are linked into communication networks to emulate the neuronal layers in a brain [ 78 ], to very compact probabilistic data structures such as the Bloom filter [ 79 ], to self-balancing trees [ 80 ] that provide extremely fast insertion and removal of elements for performance-critical code, to copy-on-write B-trees that organize terabytes of information on hard drives [ 81 ].

Object-Oriented Programming in a Nutshell: Classes, Objects, Methods, and All That

OOP in Theory: Some Basic Principles

Computer programs are characterized by two essential features [ 82 ]: (i) algorithms or, loosely, the “programming logic,” and (ii) data structures , or how data are represented within the program, whether certain components are manipulable, iterable, etc. The object-oriented programming (OOP) paradigm, to which Python is particularly well-suited, treats these two features of a program as inseparable. Several thorough treatments of OOP are available, including texts that are independent of any language [ 83 ] and books that specifically focus on OOP in Python [ 84 ]. The core ideas are explored in this section and in Supplemental Chapters 15 and 16 in S1 Text .

Most scientific data have some form of inherent structure, and this serves as a starting point in understanding OOP. For instance, the time-series example mentioned above is structured as a series of ordered pairs, ( t , I ( t )), an X-ray diffraction pattern consists of a collection of intensities that are indexed by integer triples ( h , k , l ), and so on. In general, the intrinsic structure of scientific data cannot be easily or efficiently described using one of Python’s standard data structures because those types (strings, lists, etc.) are far too simple and limited. Consider, for instance, the task of representing a protein 3D structure, where “representing” means storing all the information that one may wish to access and manipulate: AA sequence (residue types and numbers), the atoms comprising each residue, the spatial coordinates of each atom, whether a cysteine residue is disulfide-bonded or not, the protein’s function, the year the protein was discovered, a list of orthologs of known structure, and so on. What data structure might be capable of most naturally representing such an entity? A simple (generic) Python tuple or list is clearly insufficient.

For this problem, one could try to represent the protein as a single tuple, where the first element is a list of the sequence of residues, the second element is a string describing the protein’s function, the third element lists orthologs, etc. Somewhere within this top-level list, the coordinates of the C α atom of Alanine-42 might be represented as [x,y,z] , which is a simple list of length three. (The list is “simple” in the sense that its rank is one; the rank of a tuple or list is, loosely, the number of dimensions spanned by its rows, and in this case we have but one row.) In other words, our overall data-representation problem can be hierarchically decomposed into simpler sub-problems that are amenable to representation via Python’s built-in types. While valid, such a data structure will be difficult to use: The programmer will have to recall multiple arbitrary numbers (list and sub-list indices) in order to access anything, and extensions to this approach will only make it clumsier. Additionally, there are many functions that are meaningful only in the context of proteins, not all tuples. For example, we may need to compute the solvent-accessible surface areas of all residues in all β -strands for a list of proteins, but this operation would be nonsensical for a list of Supreme Court cases. Conversely, not all tuple methods would be relevant to this protein data structure, yet a function to find Court cases that reached a 5-4 decision along party lines would accept the protein as an argument. In other words, the tuple mentioned above has no clean way to make the necessary associations. It’s just a tuple.

OOP Terminology

This protein representation problem is elegantly solved via the OOP concepts of classes, objects, and methods. Briefly, an object is an instance of a data structure that contains members and methods. Members are data of potentially any type, including other objects. Unlike lists and tuples, where the elements are indexed by numbers starting from zero, the members of an object are given names, such as yearDiscovered . Methods are functions that (typically) make use of the members of the object. Methods perform operations that are related to the data in the object’s members. Objects are constructed from class definitions, which are blocks that define what most of the methods will be for an object. The examples in the 'OOP in Practice' section will help clarify this terminology. (Note that some languages require that all methods and members be specified in the class declaration, but Python allows duck punching , or adding members after declaring a class. Adding methods later is possible too, but uncommon. Some built-in types, such as int , do not support duck punching.)
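Duck punching can be demonstrated in a few lines (an illustrative sketch):

class Record:            # a deliberately empty class definition
    pass

r = Record()
r.yearDiscovered = 1953  # duck punching: a member added after instantiation
print(r.yearDiscovered)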

During execution of an actual program, a specific object is created by calling the name of the class, as one would do for a function. The interpreter will set aside some memory for the object’s methods and members, and then call a method named __init__ , which initializes the object for use.

Classes can be created from previously defined classes. In such cases, all properties of the parent class are said to be inherited by the child class. The child class is termed a derived class , while the parent is described as a base class . For instance, a user-defined Biopolymer class may have derived classes named Protein and NucleicAcid , and may itself be derived from a more general Molecule base class. Class names often begin with a capital letter, while object names (i.e., variables) often start with a lowercase letter. Within a class definition, a leading underscore denotes member names that will be protected. Working examples and annotated descriptions of these concepts can be found, in the context of protein structural analysis, in ref [ 85 ].
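A skeletal sketch of such a hierarchy (the class bodies are omitted for brevity):

class Molecule:
    pass

class Biopolymer(Molecule):      # Biopolymer is derived from Molecule
    pass

class Protein(Biopolymer):       # Protein and NucleicAcid are both
    pass                         # derived from Biopolymer

class NucleicAcid(Biopolymer):
    pass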

The OOP paradigm suffuses the Python language: Every value is an object. For example, the statement foo = ‘bar’ instantiates a new object (of type str ) and binds the name foo to that object. All built-in string methods will be exposed for that object (e.g., foo.upper() returns ‘BAR’ ). Python’s built-in dir() function can be used to list all attributes and methods of an object, so dir(foo) will list all available attributes and valid methods on the variable foo . The statement dir(1) will show all the methods and members of an int (there are many!). This example also illustrates the conventional OOP dot-notation, object.attribute , which is used to access an object’s members, and to invoke its methods ( Fig 1 , left). For instance, protein1.residues[2].CA.x might give the x -coordinate of the C α atom of the third residue in protein1 as a floating-point number, and protein1.residues[5].ssbond(protein2.residues[6]) might be used to define a disulfide bond (the ssbond() method) between residue-6 of protein1 and residue-7 of protein2 . In this example, the residues member is a list or tuple of objects, and an item is retrieved from the collection using an index in brackets.

Benefits of OOP

By effectively compartmentalizing the programming logic and implicitly requiring a disciplined approach to data structures, the OOP paradigm offers several benefits. Chief among these are (i) clean data/code separation and bundling (i.e., modularization), (ii) code reusability, (iii) greater extensibility (derived classes can be created as needs become more specialized), and (iv) encapsulation into classes/objects provides a clearer interface for other programmers and users. Indeed, a generally good practice is to discourage end-users from directly accessing and modifying all of the members of an object. Instead, one can expose a limited and clean interface to the user, while the back-end functionality (which defines the class) remains safely under the control of the class’ author. As an example, custom getter and setter methods can be specified in the class definition itself, and these methods can be called in another user’s code in order to enable the safe and controlled access/modification of the object’s members. A setter can ‘sanity-check’ its input to verify that the values do not send the object into a nonsensical or broken state; e.g., specifying the string "ham" as the x -coordinate of an atom could be caught before program execution continues with a corrupted object. By forcing alterations and other interactions with an object to occur via a limited number of well-defined getters/setters, one can ensure that the integrity of the object’s data structure is preserved for downstream usage.
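As an illustrative sketch of such a setter (the class name, member name, and checks here are ours):

class Particle:
    def setX(self, x):
        # sanity-check the input before accepting it (reject values like "ham")
        if not isinstance(x, (int, float)):
            raise TypeError("x-coordinate must be numeric")
        self._x = x              # leading underscore: a protected member
    def getX(self):
        return self._x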

The OOP paradigm also solves the aforementioned problem wherein a protein implemented as a tuple had no good way to be associated with the appropriate functions—we could call Python’s built-in max() on a protein, which would be meaningless, or we could try to compute the isoelectric point of an arbitrary list (of Supreme Court cases), which would be similarly nonsensical. Using classes sidesteps these problems. If our Protein class does not define a max() method, then no attempt can be made to calculate its maximum. If it does define an isoelectricPoint() method, then that method can be applied only to an object of type Protein . For users/programmers, this is invaluable: If a class from a library has a particular method, one can be assured that that method will work with objects of that class.

OOP in Practice: Some Examples

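A minimal class definition in this spirit (the class, members, and values are illustrative, not the article’s original listing):

class Protein:
    def __init__(self, name, sequence):
        # members: data bound to this particular instance
        self.name = name
        self.sequence = sequence
    def length(self):
        # a method: a function that operates on the object's own members
        return len(self.sequence)

protein1 = Protein("ubiquitin", "MQIFVKTLTGK")   # a truncated, illustrative sequence
print(protein1.length())                         # prints 11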

Note the usage of self as the first argument in each method defined in the above code. The self keyword is necessary because when a method is invoked it must know which object to use. That is, an object instantiated from a class requires that methods on that object have some way to reference that particular instance of the class, versus other potential instances of that class. The self keyword provides such a “hook” to reference the specific object for which a method is called. Every method invocation for a given object, including even the initializer called __init__ , must pass the object itself (the current instance) as the first argument to the method; this subtlety is further described at [ 86 ] and [ 87 ]. A practical way to view the effect of self is that any occurrence of objName.methodName(arg1, arg2) effectively becomes methodName(objName, arg1, arg2) . This is one key deviation from the behavior of top-level functions, which exist outside of any class. When defining methods, usage of self provides an explicit way for the object itself to be provided as an argument (self-reference), and its disciplined usage will help minimize confusion about expected arguments.

To illustrate how objects may interact with one another, consider a class that represents an atom:

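One possible minimal form of such a class (a sketch; the member names are ours):

class Atom:
    def __init__(self, element, x, y, z):
        self.element = element    # chemical element symbol, e.g. "C" or "O"
        self.x = x                # Cartesian coordinates, as floats
        self.y = y
        self.z = z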

Then, we can use this Atom class in constructing another class to represent molecules:

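Again as a sketch, a Molecule can simply hold a named, mutable collection of Atom objects:

class Molecule:
    def __init__(self, name):
        self.name = name
        self.atoms = []           # a list of Atom objects
    def addAtom(self, atom):
        self.atoms.append(atom)
    def numAtoms(self):
        return len(self.atoms)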

And, finally, the following code illustrates the construction of a diatomic molecule:

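With the two classes sketched above, carbon monoxide might be assembled like so (coordinates in Å; the bond length is approximate):

carbon = Atom("C", 0.000, 0.000, 0.000)
oxygen = Atom("O", 1.128, 0.000, 0.000)   # C–O bond length of roughly 1.13 Å
co = Molecule("carbon monoxide")
co.addAtom(carbon)
co.addAtom(oxygen)
print(co.numAtoms())                      # prints 2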

If the above code is run, for example, in an interactive Python session, then note that the aforementioned dir() function is an especially useful built-in tool for querying the properties of new classes and objects. For instance, issuing the statement dir(Molecule) will return detailed information about the Molecule class (including its available methods).


File Management and I/O

Scientific data are typically acquired, processed, stored, exchanged, and archived as computer files. As a means of input/output (I/O) communication, Python provides tools for reading, writing and otherwise manipulating files in various formats. Supplemental Chapter 11 in S1 Text focuses on file I/O in Python. Most simply, the Python interpreter allows command-line input and basic data output via the print() function. For real-time interaction with Python, the free IPython [ 89 ] system offers a shell that is both easy to use and uniquely powerful (e.g., it features tab completion and command history scrolling); see the S2 Text , §3 for more on interacting with Python. A more general approach to I/O, and a more robust (persistent) approach to data archival and exchange, is to use files for reading, writing, and processing data. Python handles file I/O via the creation of file objects, which are instantiated by calling the open function with the filename and access mode as its two arguments. The syntax is illustrated by fileObject = open("myName.pdb", mode = ‘r’) , which creates a new file object from a file named "myName.pdb" . This file will be only readable because the ‘r’ mode is specified; other valid modes include ‘w’ to allow writing and ‘a’ for appending. Depending on which mode is specified, different methods of the file object will be exposed for use. Table 4 describes mode types and the various methods of a File object.

Table 4. File access modes and the methods of a file object. https://doi.org/10.1371/journal.pcbi.1004867.t004

The following example opens a file named myDataFile.txt and reads the lines, en masse , into a list named listOfLines . (In this example, the variable readFile is also known as a “file handle,” as it references the file object.) As for all lists, this object is iterable and can be looped over in order to process the data.

1 readFile = open("myDataFile.txt", mode = 'r')
2 listOfLines = readFile.readlines()
3 # Process the lines. Simply dump the contents to the console:
4 for l in listOfLines:
5     print(l)

  (The lines in the file will be printed)

6 readFile.close()

As a more substantial example, the following code opens a PDB file (here, for the structure with PDB ID 1I8F ) and counts its HETATM records, i.e., lines that describe atoms belonging to non-standard residues, such as ligands and ions:

1 fp = open('1I8F.pdb', mode = 'r')
2 numHetatm = 0
3 for line in fp.readlines():
4     if (len(line) > 6):
5         if (line[0:6] == "HETATM"):
6             numHetatm += 1
7 fp.close()
8 print(numHetatm)

Exercise 10 : Begin this exercise by choosing a FASTA protein sequence with more than 3000 AA residues. Then, write Python code to read in the sequence from the FASTA file and: (i) determine the relative frequencies of AAs that follow proline in the sequence; (ii) compare the distribution of AAs that follow proline to the distribution of AAs in the entire protein; and (iii) write these results to a human-readable file.

Regular Expressions for String Manipulations

A regular expression, or regex, is a pattern that specifies a set of strings. Regexes are built from literal characters together with metacharacters: for example, . matches any single character, * denotes zero or more repetitions of the preceding element, + denotes one or more, and square brackets define a class of allowed characters (e.g., [GC] matches either G or C). In Python, regex functionality is provided by the built-in re module.

In Python, a regex matches a string if the string starts with that regex. Python also provides a search function to locate a regex anywhere within a string. Returning to the notion that a regex “specifies a set of strings,” given some text the matches to a regex will be all strings that start with the regex, while the search hits will be all strings that contain the regex. For clarity, we will say that a regex finds a string if the string is completely described by the regex, with no trailing characters. (There is no find in Python but, for purposes of description here, it is useful to have a term to refer to a match without trailing characters.)

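The distinction can be seen directly with the re module (the example strings are ours):

import re
print(re.match("CAT", "CATALOG"))    # a match object: the string starts with CAT
print(re.match("LOG", "CATALOG"))    # None: CATALOG does not begin with LOG
print(re.search("LOG", "CATALOG"))   # a match object: LOG occurs within the string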

Beyond the central role of the regex in analyzing biological sequences, parsing datasets, etc., note that any effort spent learning Python regexes is highly transferable. In terms of general syntactic forms and functionality, regexes behave roughly similarly in Python and in many other mainstream languages (e.g., Perl, R), as well as in the shell scripts and command-line utilities (e.g., grep) found in the Unix family of operating systems (including all Linux distributions and Apple’s OS X).

Exercise 11 : Many human hereditary neurodegenerative disorders, such as Huntington’s disease (HD), are linked to anomalous expansions in the number of trinucleotide repeats in particular genes [ 94 ]. In HD, the pathological severity correlates with the number of (CAG)n repeats in exon-1 of the gene ( htt ) encoding the protein (huntingtin): More repeats means an earlier age of onset and a more rapid disease progression. The CAG codon specifies glutamine, and HD belongs to a broad class of polyglutamine (polyQ) diseases. Healthy (wild-type) variants of this gene feature n ≈ 6–35 tandem repeats, whereas n > 35 virtually assures the disease. For this exercise, write a Python regex that will locate any consecutive runs of more than ten CAG codons, i.e., (CAG)n with n > 10, in an input DNA sequence. Because the codon CAA also encodes Q and has been found in long runs of CAGs, your regex should also allow interspersed CAAs. To extend this exercise, write code that uses your regex to count the number of CAG repeats (allow CAA too), and apply it to a publicly available genome sequence of your choosing (e.g., the NCBI GI code 588282786:1-585 is exon-1 from a human’s htt gene [accessible at http://1.usa.gov/1NjrDNJ ]).

An Advanced Vignette: Creating Graphical User Interfaces with Tkinter

Thus far, this primer has centered on Python programming as a tool for interacting with data and processing information. To illustrate an advanced topic, this section shifts the focus towards approaches for creating software that relies on user interaction, via the development of a graphical user interface (GUI; pronounced ‘gooey’). Text-based interfaces (e.g., the Python shell) have several distinct advantages over purely graphical interfaces, but such interfaces can be intimidating to the uninitiated. For this reason, many general users will prefer GUI-based software that permits options to be configured via graphical check boxes, radio buttons, pull-down menus and the like, versus text-based software that requires typing commands and editing configuration files. In Python, the tkinter package (pronounced ‘T-K-inter’) provides a set of tools to create GUIs. (Python 2.x calls this package Tkinter , with a capital T ; here, we use the Python 3.x notation.)

Tkinter programming has its own specialized vocabulary. Widgets are objects, such as text boxes, buttons and frames, that comprise the user interface. The root window is the widget that contains all other widgets. The root window is responsible for monitoring user interactions and informing the contained widgets to respond when the user triggers an interaction with them (called an event ). A frame is a widget that contains other widgets. Frames are used to group related widgets together, both in the code and on-screen. A geometry manager is a system that places widgets in a frame according to some style determined by the programmer. For example, the grid geometry manager arranges widgets on a grid, while the pack geometry manager places widgets in unoccupied space. Geometry managers are discussed at length in Supplemental Chapter 18 in S1 Text , which shows how intricate layouts can be generated.

The basic style of GUI programming fundamentally differs from the material presented thus far. The reason for this is that the programmer cannot predict what actions a user might perform, and, more importantly, in what order those actions will occur. As a result, GUI programming consists of placing a set of widgets on the screen and providing instructions that the widgets execute when a user interaction triggers an event. (Similar techniques are used, for instance, to create web interfaces and widgets in languages such as JavaScript.) Supplemental Chapter 19 ( S1 Text ) describes available techniques for providing functionality to widgets. Once the widgets are configured, the root window then awaits user input. A simple example follows:

1 from tkinter import Tk, Button
2 def buttonWindow():
3     window = Tk()
4     def onClick():
5         print("Button clicked")
6     btn = Button(window, text = "Sample Button", command = onClick)
7     btn.pack()
8     window.mainloop()

Calling buttonWindow() creates a root window containing a single button. Each time the user clicks the button, the root window dispatches the event to the widget, whose command function ( onClick ) prints a message to the console. The final statement, window.mainloop() , starts the event loop, which waits for and processes user interactions until the window is closed.

Graphical widgets, such as text entry fields and check-boxes, receive data from the user, and must communicate that data within the program. To provide a conduit for this information, the programmer must provide a variable to the widget. When the value in the widget changes, the widget will update the variable and the program can read it. Conversely, when the program should change the data in a widget (e.g., to indicate the status of a real-time calculation), the programmer sets the value of the variable and the variable updates the value displayed on the widget. This roundabout tack is a result of differences in the architecture of Python and Tkinter—an integer in Python is represented differently than an integer in Tkinter, so reading the widget’s value directly would result in a nonsensical Python value. These variables are discussed in Supplemental Chapter 19 in S1 Text .
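A brief sketch of such a conduit variable, using tkinter’s StringVar (widget layout kept to a minimum):

from tkinter import Tk, Entry, StringVar

window = Tk()
text = StringVar()                          # the Tkinter-side variable
entry = Entry(window, textvariable = text)  # the widget updates `text` as the user types
entry.pack()
text.set("initial value")                   # setting the variable updates the widget
# at any point, text.get() returns the string currently shown in the widget
window.mainloop()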

From a software engineering perspective, a drawback to graphical interfaces is that multiple GUIs cannot be readily composed into new programs. For instance, a GUI to display how a particular restriction enzyme will cleave a DNA sequence will not be practically useful in predicting the products of digesting thousands of sequences with the enzyme, even though some core component of the program (the key, non-GUI program logic) would be useful in automating that task. For this reason, GUI applications should be written in as modular a style as possible—one should be able to extract the useful functionality without interacting with the GUI-specific code. In the restriction enzyme example, an optimal solution would be to write the code that computes cleavage sites as a separate module, and then have the GUI code interact with the components of that module.

Python in General-purpose Scientific Computing: Numerical Efficiency, Libraries

In pursuing biological research, the computational tasks that arise will likely resemble problems that have already been solved, problems for which software libraries already exist. This occurs largely because of the interdisciplinary nature of biological research, wherein relatively well-established formalisms and algorithms from physics, computer science, and mathematics are applied to biological systems. For instance, (i) the simulated annealing method was developed as a physically-inspired approach to combinatorial optimization, and soon thereafter became a cornerstone in the refinement of biomolecular structures determined by NMR spectroscopy or X-ray crystallography [ 95 ]; (ii) dynamic programming was devised as an optimization approach in operations research, before becoming ubiquitous in sequence alignment algorithms and other areas of bioinformatics; and (iii) the Monte Carlo method, invented as a sampling approach in physics, underlies the algorithms used in problems ranging from protein structure prediction to phylogenetic tree estimation.

Each computational approach listed above can be implemented in Python. The language is well-suited to rapidly develop and prototype any algorithm, be it intended for a relatively lightweight problem or one that is more computationally intensive (see [ 96 ] for a text on general-purpose scientific computing in Python). When considering Python and other possible languages for a project, software development time must be balanced against a program’s execution time. These two factors are generally countervailing because of the inherent performance trade-offs between codes written in interpreted (high-level) versus compiled (lower-level) languages; ultimately, the computational demands of a problem will help guide the choice of language. In practice, the feasibility of a pure Python versus non-Python approach can be explored via numerical benchmarking. While Python enables rapid development, and is of sufficient computational speed for many bioinformatics problems, its performance simply cannot match the compiled languages that are traditionally used for high-performance computing applications (e.g., many MD integrators are written in C or Fortran). Nevertheless, Python codes are available for molecular simulations, parallel execution, and so on. Python’s popularity and utility in the biosciences can be attributed to its ease of use (expressiveness), its adequate numerical efficiency for many bioinformatics calculations, and the availability of numerous libraries that can be readily integrated into one’s Python code (and, conversely, one’s Python code can “hook” into the APIs of larger software tools, such as PyMOL). Finally, note that rapidly-developed Python software can be integrated with numerically efficient, high-performance code written in low-level languages such as C, in an approach known as “mixed-language programming” [ 49 ].

A cornerstone of scientific computing in Python is NumPy, which provides efficient multi-dimensional array objects and a broad suite of vectorized numerical operations; SciPy builds upon NumPy with routines for optimization, numerical integration, linear algebra, and statistics, while matplotlib offers publication-quality plotting. Because the numerically intensive kernels of these libraries are implemented in compiled languages, code that expresses its computations as whole-array operations can approach compiled-language performance while retaining Python’s expressiveness.
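As a small taste of this array-based style (a sketch that reuses the CO2 rates from Exercise 7):

import numpy as np

rates = np.array([58.2, 65.2, 67.8, 65.4, 58.8, 49.6, 39.1, 15.8])
print(rates.mean())        # arithmetic mean of all eight readings
print(rates * 10.0)        # elementwise scaling; no explicit Python loop
print(rates.sum() * 10.0)  # total moles of CO2, if each reading spans 10 hours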

Many additional libraries can be found at the official Python Package Index (PyPI; [ 102 ]), as well as myriad packages from unofficial third-party repositories. The BioPython project, mentioned above in the 'Why Python?' subsection, offers an integrated suite of tools for sequence- and structure-based bioinformatics, as well as phylogenetics, machine learning, and other feature sets. We survey the computational biology software landscape in the S2 Text (§2), including tools for structural bioinformatics, phylogenetics, omics-scale data-processing pipelines, and workflow management systems. Finally, note that Python code can be interfaced with other languages. For instance, current support is provided for low-level integration of Python and R [ 103 , 104 ], as well as C-extensions in Python (Cython; [ 105 , 106 ]). Such cross-language interfaces extend Python’s versatility and flexibility for computational problems at the intersection of multiple scientific domains, as often occurs in the biosciences.

Python and Software Licensing

Any discussion of libraries, modules, and extensions merits a brief note on the important role of licenses in scientific software development. As evidenced by the widespread utility of existing software libraries in modern research communities, the development work done by one scientist will almost certainly aid the research pursuits of others—either near-term or long-term, in subfields that might be near to one’s own or perhaps more distant (and unforeseen). Free software licenses promote the unfettered advance of scientific research by encouraging the open exchange, transparency, communicability, and reproducibility of research projects. To qualify as free software, a program must allow the user to view and change the source code (for any purpose), distribute the code to others, and distribute modified versions of the code to others. The Open Source Initiative provides alphabetized and categorized lists of licenses that comply, to various degrees, with the open-source definition [ 107 ]. As an example, the Python interpreter, itself, is under a free license. Software licensing is a major topic unto itself, and helpful primers are available on technical [ 38 ] and strategic [ 37 , 108 ] considerations in adopting one licensing scheme versus another. All of the content (code and comments) that is provided as Supplemental Chapters ( S1 Text ) is licensed under the GNU Affero General Public License (AGPL) version 3, which permits anyone to examine, edit, and distribute the source so long as any works using it are released under the same license.

Managing Large Projects: Version Control Systems

As a project grows, it becomes increasingly difficult—yet increasingly important—to be able to track changes in source code. A version control system (VCS) tracks changes to documents and facilitates the sharing of code among multiple individuals. In a distributed (as opposed to centralized) VCS, each developer has his own complete copy of the project, locally stored. Such a VCS supports the “committing,” “pulling,” “branching,” and “merging” of code. After making a change, the programmer commits the change to the VCS. The VCS stores a snapshot of the project, preserving the development history. If it is later discovered that a particular commit introduced a bug, one can easily revert the offending commit. Other developers who are working on the same project can pull from the author of the change (the most recent version, or any earlier snapshot). The VCS will incorporate the changes made by the author into the puller’s copy of the project. If a new feature will make the code temporarily unusable (until the feature is completely implemented), then that feature should be developed in a separate branch . Developers can switch between branches at will, and a commit made to one branch will not affect other branches. The master branch will still contain a working version of the program, and developers can still commit non-breaking changes to the master branch. Once the new feature is complete, the branches can be merged together. In a distributed VCS, each developer is, conceptually, a branch. When one developer pulls from others, this is equivalent to merging a branch from each developer. Git, Mercurial, and Darcs are common distributed VCS. In contrast, in a centralized VCS all commits are tracked in one central place (for both distributed and centralized VCS, this “place” is often a repository hosted in the cloud). When a developer makes a commit, it is pushed to every other developer (who is on the same branch). The essential behaviors—committing, branching, merging—are otherwise the same as for a distributed VCS. Examples of popular centralized VCSs include the Concurrent Versioning System (CVS) and Subversion.

While VCS are mainly designed to work with source code, they are not limited to this type of file. A VCS is useful in many situations where multiple people are collaborating on a single project, as it simplifies the task of combining, tracking, and otherwise reconciling the contributions of each person. In fact, this very document was developed using LaTeX and the Git VCS, enabling each author to work on the text in parallel. A helpful guide to Git and GitHub (a popular Git repository hosting service) was very recently published [ 109 ]; in addition to a general introduction to VCS, that guide offers extensive practical advice, such as what types of data/files are more or less ideal for version controlling.

Final Project: A Structural Bioinformatics Problem

Fluency in a programming language is developed actively, not passively. The exercises provided in this text have aimed to develop the reader’s command of basic features of the Python language. Most of these topics are covered more deeply in the Supplemental Chapters ( S1 Text ), which also include some advanced features of the language that lie beyond the scope of the main body of this primer. As a final exercise, a cumulative project is presented below. This project addresses a substantive scientific question, and its successful completion requires one to apply and integrate the skills from the foregoing exercises. Note that a project such as this—and really any project involving more than a few dozen lines of code—will benefit greatly from an initial planning phase. In this initial stage of software design, one should consider the basic functions, classes, algorithms, control flow, and overall code structure.

Conclusion

Data and algorithms are two pillars of modern biosciences. Data are acquired, filtered, and otherwise manipulated in preparation for further processing, and algorithms are applied in analyzing datasets so as to obtain results. In this way, computational workflows transform primary data into results that can, over time, become formulated into general principles and new knowledge. In the biosciences, modern scientific datasets are voluminous and heterogeneous. Thus, in developing and applying computational tools for data analysis, the two central goals are scalability , for handling the data-volume problem, and robust abstractions , for handling data heterogeneity and integration. These two challenges are particularly vexing in biology, and are exacerbated by the traditional lack of training in computational and quantitative methods in many biosciences curricula. Motivated by these factors, this primer has sought to introduce general principles of computer programming, at both basic and intermediate levels. The Python language was adopted for this purpose because of its broad prevalence and deep utility in the biosciences.

Supporting Information

S1 Text. Python Chapters.

This suite of 19 Supplemental Chapters covers the essentials of programming. The Chapters are written in Python and guide the reader through the core concepts of programming, via numerous examples and explanations. The most recent versions of all materials are maintained at http://p4b.muralab.org . For purposes of self-study, solutions to the in-text exercises are also included.

https://doi.org/10.1371/journal.pcbi.1004867.s001

S2 Text. Supplemental Text.

The supplemental text contains sections on: (i) Python as a general language for scientific computing, including the concepts of imperative and declarative languages, Python’s relationship to other languages, and a brief account of languages widely used in the biosciences; (ii) a structured guide to some of the available software packages in computational biology, with an emphasis on Python; and (iii) two sample Supplemental Chapters (one basic, one more advanced), along with a brief, practical introduction to the Python interpreter and integrated development environment (IDE) tools such as IDLE.

https://doi.org/10.1371/journal.pcbi.1004867.s002

Acknowledgments

We thank M. Cline, S. Coupe, S. Ehsan, D. Evans, R. Sood, and K. Stanek for critical reading and helpful feedback on the manuscript.

  • 7. Gerstein Lab. “OMES Table”. Available from: http://bioinfo.mbb.yale.edu/what-is-it/omes/omes.html .
  • 16. Baldi P, Brunak S. Bioinformatics: The Machine Learning Approach (2nd Edition). A Bradford Book; 2001.
  • 28. Rudin C, Dunson D, Irizarry R, Ji H, Laber E, Leek J, et al. Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society. American Statistical Association; 2014. Available from: http://www.amstat.org/policy/pdfs/BigDataStatisticsJune2014.pdf .
  • 29. Committee on a New Biology for the 21 st Century: Ensuring the United States Leads the Coming Biology Revolution and Board on Life Sciences and Division on Earth and Life Studies and National Research Council. A New Biology for the 21 st Century. National Academies Press; 2009.
  • 39. Abelson H, Sussman GJ. Structure and Interpretation of Computer Programs (2 nd Edition). The MIT Press; 1996. Available from: http://mitpress.mit.edu/sicp/full-text/book/book.html .
  • 40. Evans D. Introduction to Computing: Explorations in Language, Logic, and Machines. CreateSpace Independent Publishing Platform; 2011. Available from: http://www.computingbook.org .
  • 41. The PyMOL Molecular Graphics System, Schrödinger, LLC;. Available from: http://pymol.org .
  • 45. PBCTools Plugin, Version 2.7;. Available from: http://www.ks.uiuc.edu/Research/vmd/plugins/pbctools .
  • 49. Hinsen K. High-Level Scientific Programming with Python. In: Proceedings of the International Conference on Computational Science-Part III. ICCS’02. London, UK, UK: Springer-Verlag; 2002. p. 691–700.
  • 50. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms (3 rd Edition). The MIT Press; 2009.
  • 51. Jones NC, Pevzner PA. An Introduction to Bioinformatics Algorithms. The MIT Press; 2004.
  • 52. Wünschiers R. Computational Biology: Unix/Linux, Data Processing and Programming. Springer-Verlag; 2004.
  • 53. Model ML. Bioinformatics Programming Using Python: Practical Programming for Biological Data. O’Reilly Media; 2009.
  • 54. Buffalo V. Bioinformatics Data Skills: Reproducible and Robust Research with Open Source Tools. O’Reilly Media; 2015.
  • 55. Libeskind-Hadas R, Bush E. Computing for Biologists: Python Programming and Principles. Cambridge University Press; 2014.
  • 59. Software Carpentry;. Accessed 2016-01-18. Available from: http://software-carpentry.org/ .
  • 60. Expressions—Python 3.5.1 documentation; 2016. Accessed 2016-01-18. Available from: https://docs.python.org/3/reference/expressions.html#operator-precedence .
  • 61. Pierce BC. Types and Programming Languages. The MIT Press; 2002.
  • 63. More Control Flow Tools—Python 3.5.1 documentation; 2016. Accessed 2016-01-18. Available from: https://docs.python.org/3.5/tutorial/controlflow.html#keyword-arguments .
  • 64. McConnell S. Code Complete: A Practical Handbook of Software Construction (2 nd Edition). Pearson Education; 2004.
  • 65. Gamma E, Helm R, Johnson R, Vlissides J. Design Patterns: Elements of Reusable Object-oriented Software. Pearson Education; 1994.
  • 66. Zelle J. Python Programming: An Introduction to Computer Science. 2 nd ed. Franklin, Beedle & Associates Inc.; 2010.
  • 69. National Research Council (US) Committee on Frontiers at the Interface of Computing and Biology. On the Nature of Biological Data. In: Lin HS, Wooley JC, editors. Catalyzing Inquiry at the Interface of Computing and Biology. Washington, DC: The National Academies Press; 2005. Available from: http://www.ncbi.nlm.nih.gov/books/NBK25464 .
  • 73. Wikipedia. Tree (data structure); 2016. Accessed 2016-01-18. Available from: https://en.wikipedia.org/wiki/Tree_%28data_structure%29 .
  • 74. Scipy. scipy.spatial.KDTree—SciPy v0.14.0 Reference Guide; 2014. Accessed 2016-01-18. Available from: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html .
  • 75. Wikipedia. k-d tree; 2016. Accessed 2016-01-18. Available from: https://en.wikipedia.org/wiki/K-d_tree .
  • 76. Wikipedia. Graph (abstract data type); 2015. Accessed 2016-01-18. Available from: https://en.wikipedia.org/wiki/Graph_%28abstract_data_type%29 .
  • 77. Hagberg AA, Schult DA, Swart PJ. Exploring network structure, dynamics, and function using NetworkX. In: Proceedings of the 7th Python in Science Conference (SciPy2008). Pasadena, CA USA; 2008. p. 11–15.
  • 80. Moitzi M. bintrees 2.0.2; 2016. Accessed 2016-01-18. Available from: https://pypi.python.org/pypi/bintrees/2.0.2 .
  • 82. Wirth N. Algorithms + Data Structures = Programs. Prentice-Hall Series in Automatic Computation. Prentice Hall; 1976.
  • 83. Budd T. An Introduction to Object-Oriented Programming. 3 rd ed. Pearson; 2001.
  • 84. Phillips D. Python 3 Object Oriented Programming. Packt Publishing; 2010.
  • 86. The Self Variable in Python Explained;. Available from: http://pythontips.com/2013/08/07/the-self-variable-in-python-explained .
  • 87. Why Explicit Self Has to Stay;. Available from: http://neopythonic.blogspot.com/2008/10/why-explicit-self-has-to-stay.html .
  • 90. Python Data Analysis Library;. Available from: http://pandas.pydata.org/ .
  • 91. Friedl JEF. Mastering Regular Expressions. O’Reilly Media; 2006.
  • 92. Regexes on Stack Overflow;. Available from: http://stackoverflow.com/tags/regex/info .
  • 93. Regex Tutorials, Examples and Reference;. Available from: http://www.regular-expressions.info .
  • 96. Langtangen HP. A Primer on Scientific Programming with Python. Texts in Computational Science and Engineering. Springer; 2014.
  • 97. Jones E, Oliphant T, Peterson P, et al. SciPy: Open-source Scientific Tools for Python; 2001-. [Online; accessed 2015-06-30]. Available from: http://www.scipy.org/ .
  • 98. Scientific Computing Tools for Python;. Available from: http://www.scipy.org/about.html .
  • 100. scikit-learn: machine learning in Python;. Available from: http://scikit-learn.org/ .
  • 102. PyPI: The Python Package Index;. Available from: http://pypi.python.org .
  • 104. rpy2, R in Python;. Available from: http://rpy.sourceforge.net .
  • 106. Cython: C-extensions for Python;. Available from: http://cython.org .
  • 107. Open Source Initiative: Licenses & Standards;. Available from: http://opensource.org/licenses .

How do I reference the Python programming language in a thesis or a paper?

I'm writing a scientific article and a dissertation in biology, for which I used Python for simulations. Some people in our department, especially the "non-computer-people", don't know what Python is, so I want to reference something helpful. Open-source scientific tools such as CellProfiler usually tell you how to reference them, but Python doesn't.

How is the Python language properly referenced? Are there any articles in journals available I could link to?

  • I understand your consternation. Thank you for moving this. FWIW, R has a function to generate a citation for each package, e.g. citation("rmetadata"), and the same format could be used for Python packages (a Python sketch of this idea follows the comment list): Van Rossum, G. (2007). Python programming language. In USENIX Annual Technical Conference. Per the APA: "Do not cite standard office software (e.g. Word, Excel) or programming languages. Provide references only for specialized software. Ludwig, T. (2002). PsychInquiry [computer software]. New York: Worth." owl.english.purdue.edu/owl/resource/560/10 –  d-cubed Commented Feb 16, 2014 at 20:22
  • Use the biblatex-software package's @software entry type if you are using LaTeX. You can include the actual version or module, as well as the license. Highly recommended for citing software (version, module, fragment). –  Clément Commented Oct 29, 2020 at 16:57
  • @Clément Had I only known this 8 years ago when I did my PhD! :-) Thank you, maybe it will help someone. –  Eekhoorn Commented Oct 31, 2020 at 13:13
  • @Eekhoorn It was released this year, so it would not have changed much 8 years ago! –  Clément Commented Oct 31, 2020 at 13:46
  • Side note: if you're an academic writing an open-source software package that you think will be used by other academics, why not submit a paper announcing its existence to the Journal of Open Source Software? Then there'll be something clear for people to cite, and the referees can be very helpful in refining the software. (Having said that, my last submission there was desk-rejected because the project was too small.) –  Daniel Hatton Commented Sep 2, 2023 at 10:01
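As mentioned in the first comment above, Python has no built-in analogue of R's citation(), but the metadata of an installed package can be read at run time. A minimal sketch follows; the cite() helper is a hypothetical convenience, not a standard function, and some metadata fields may be missing for a given package:

    # Rough analogue of R's citation(): assemble a minimal citation
    # string from an installed package's metadata (Python 3.8+).
    from importlib.metadata import metadata

    def cite(package_name):
        meta = metadata(package_name)  # distribution metadata; fields may be None
        return "{}. {} (version {}). {}".format(
            meta["Author"], meta["Name"], meta["Version"], meta["Home-page"]
        )

    print(cite("numpy"))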

4 Answers

To cite a programming language, one possibility is to cite its reference manual, including the version of the language you used (your approach might no longer work with the version of Python available in 20 years ...).

For instance, you can have a citation like:

Python Software Foundation. Python Language Reference, version 2.7. Available at http://www.python.org

According to this thread, you can also cite the original CWI technical report:

"G. van Rossum, Python tutorial, Technical Report CS-R9526, Centrum voor Wiskunde en Informatica (CWI), Amsterdam, May 1995."

  • +1, although I usually cite software as … software. It's a publication, after all. Many citation managers might not recognise this as a citation type, but the reason for this is that they're stuck in the previous millennium, nothing more. –  Konrad Rudolph Commented Nov 27, 2012 at 16:51
  • +1 for the second suggestion, as I have always cited every bit of software that way, and I hope people who use my software will do the same. This is especially true for the more specialized libraries such as NumPy, SciPy, and matplotlib, for which I can share how I normally reference them if desired. –  Bas Jansen Commented Apr 18, 2017 at 14:52
  • According to APA6 (from owl.english.purdue.edu/owl/resource/560/09), the citation should be: "Centrum voor Wiskunde en Informatica (1995). Python tutorial. Technical Report CS-R9526. Amsterdam: van Rossum, G." or so. –  abukaj Commented May 19, 2017 at 10:33
  • To add on LaTeX, it was done the following way (and can be improved): @techreport{CS-R9526, title={Python tutorial}, author={G. van Rossum}, number={CS-R9526}, institution={Centrum voor Wiskunde en Informatica (CWI)}, year={1995}, address={Amsterdam}, month={May}}, and the output is: G. van Rossum. Python tutorial. Technical Report CS-R9526, Centrum voor Wiskunde en Informatica (CWI), Amsterdam, May 1995. –  danieltakeshi Commented Jun 7, 2018 at 16:46
  • Okay, you can cite that technical report, but did you actually read that technical report? Probably not. I used Python, and I'd like to cite it, but it's bad practice to cite papers/documents that you haven't actually read. I want to cite the programming language and the packages I used, not papers about them. In consideration of this, the first option seems better. –  Aaron Bramson Commented Mar 27, 2019 at 6:18

A common choice I have seen is to cite the software by name and give a link to the website or name the company (for proprietary software) or both. For MATLAB, a mathematical programming language, I have often seen:

...for the simulations we used Matlab (The MathWorks, Inc., Natick, Massachusetts, United States)....

Likewise in citation lists and also in text, you often see something like:

MATLAB and Signal Processing Toolbox Release 2012b, The MathWorks, Inc., Natick, Massachusetts, United States. http://www.mathworks.com/

Note that it is often good to cite the libraries or toolboxes as well as the languages used. Most computer languages used in academic research are not used alone but depend heavily on add-on components. For these, there may be explicitly given papers to cite, or the authors may provide preferred citation rules. The most important component of citing a software package is the website, especially if it is open-source, as that allows others to dig into the details of your work by actually using the same tools (see the version-reporting sketch after this answer).

For open-source software like Python, you could name the organization or give the website:

...for the simulations we used the Python programming language (Python Software Foundation, https://www.python.org/ ).

Obviously, check your school's formatting requirements for dissertations/theses, and note that most style guides have explicit rules for software; those apply to computer languages as well.
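To report the versions of the add-on components alongside the language itself, the versions can be printed from the running environment. A minimal sketch, assuming NumPy and SciPy stand in for whatever libraries a given project actually uses:

    # Print the name and version of each add-on library used, so the
    # exact versions can be included in the citation or methods text.
    import numpy
    import scipy

    for module in (numpy, scipy):
        print(module.__name__, module.__version__)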

This is probably a late answer, but Python's official FAQ page now has an entry for 'Are there any published articles about Python that I can reference?':

It’s probably best to cite your favorite book about Python. The very first article about Python was written in 1991 and is now quite outdated. Guido van Rossum and Jelke de Boer, “Interactively Testing Remote Servers Using the Python Programming Language”, CWI Quarterly, Volume 4, Issue 4 (December 1991), Amsterdam, pp 283–303.

I think it should be standard to cite both the programming language and the libraries used. A ready-made citation for Python can be found at http://www.citebay.com/how-to-cite/python/ , which also provides citations for many widely used Python libraries (NumPy, SciPy, etc.).
