Materials Platform
for Data Science

based on the PAULING FILE experimental inorganic database

This is the description of the MPDS application programming interface (API). Its intended audience is software engineers and data scientists. The MPDS API presents the scientific data of the PAULING FILE materials database online in the developer-friendly machine-readable formats.

The MPDS API is available by subscription within the service-level agreement. To start using it the reader needs a valid API key from the MPDS account. We encourage to contact us for opening the API-enabled MPDS account.

Some parts of data are opened and freely available. In particular, these are: (a) cell parameters - temperature diagrams and cell parameters - pressure diagrams, (b) all data for compounds containing both Ag and K, (c) all data for binary compounds of oxygen. Please, fill in the form to get the API access to these data.

§1. Downloading data

§1.1. MPDS data structure

The standard unit of the MPDS data is an entry. All the MPDS entries are subdivided into three kinds: crystalline structures, physical properties, and phase diagrams. They are called S-, P- or C-entries, correspondingly. Entries have persistent identifiers (analogous to DOIs), e.g. S377634, P600028, C100027.

Another dimension of the MPDS data is the distinct phases. The three kinds of entries are interlinked via the distinct materials phases they belong. A tremendous work was done by PAULING FILE in the past 20 years to manually distinguish more than 100 000 inorganic materials phases, appearing in the literature. Each phase has a unique combination of (a) chemical formula, (b) space group, (c) Pearson symbol. Each phase has an integer identifier called phase_id.

Consider the following example of the entries and distinct phases. There can be the following distinct phases for the titanium dioxide: rutile with the space group 136 (let us say, phase_id 1), anatase with the space group 141 (phase_id 2), and brookite with the space group 61 (phase_id 3). So then the S- and P-entries for the titanium dioxide must refer to either 1, or 2, or 3, and C-entries must refer to 1, 2, and 3 simultaneously.

§1.2. Categories of data

The most common action is to get the MPDS data according to some criteria. We support 12 basic criteria for searching and downloading the entries:

Criterion Machine-readable name Examples
Physical properties (from MPDS physical properties hierarchy) props conductivity
Chemical elements elements Cs, Fr, O
Materials classes (various classes under the same name: i.e. minerals, chemical groups, common terms etc.) classes perovskite, quaternary
Crystal system lattices orthorhombic
Chemical formula formulae SrTiO3, O3Al2
Space group (number or international short symbol) sgs Pm3m, 221
Prototype system ("Strukturbericht" or MPDS notation: formula, Pearson symbol, space group number) protos D51,
SiS2 oI12 72
Polyhedron atoms (center, ligands) aeatoms U, X-Se, UO6, UX7
(X = any atom)
Polyhedral type aetype icosahedron 12-vertex
Publication author authors Evarestov
CODEN (code of publication journal, see CODEN index) codens PPCPFQ
Publication years years 1960-1990
Table 1. Categories of MPDS data.

Thus, in an API request it is possible to combine all these criteria in a reasonable manner. Combination is always done implying conjunctive AND operator. The reasonable manner means one can combine e.g. publication authors with publication years, but cannot combine chemical formulae with chemical elements or space group with crystal system, i.e. the common sense rules are implied.

We distinguish the single-term categories, such as the physical properties and crystal systems, and many-term categories, such as the materials classes and publication authors. A combination of terms from the single-term category is not supported. That is, one cannot search for both band gap and conductivity. A combination of terms from the many-term category is easily possible. That is, one can search for both binaries and perovskites.

§1.3. REST-alike download endpoint

To serve data we employ the REST-alike endpoint, following the HTTP protocol basics. We support input parameters in the query string and request header. We provide the standard HTTP status codes in response. The address of the data download endpoint is the following:

The HTTP verb to be used is always GET. The header must always contain the Key string with the valid MPDS API key, obtained from the corresponding MPDS account. A minimal curl example looks:

curl -H Key:a_long_secret_string

The following query string parameters are supported:

Parameter Type Description
q JSON-serialized object The object q may contain any reasonable combination of criteria, given in the Table 1, column "Machine-readable name". Always required.
phases string or integer Distinct phase identifiers (called phase_id's) separated by commas can be optionally supplied to limit the search. This is convenient for programmatic chained searches. A relevant q parameter must be always given. The count of phase_id's cannot exceed 1000.
pagesize integer The maximum number of hits per a single response (pagination). Default value: 10. Allowed values: 10, 100, 500, and 1000.
page integer The page number, either in the selected or in the default pagination. Counts from 0. Defaults to the first page, which has number 0, i.e. the default value is 0.
fmt integer Output data format. Can be json or cif. The latter cif only makes sense for crystalline structures. Default value: json.
Table 2. Query string parameters for download endpoint.

An example of the q parameter to request the crystalline structures (with the spaces and line breaks added for readability):

{ "elements": "Sr-Ti-O", "classes": "perovskite, conductor", "sgs": "Pm-3m", "props": "structural properties" }

Another example of the q parameter to request the physical properties:

{ "elements": "Zr", "classes": "oxide", "props": "thermal expansion" }

By default the response has json format, containing the following properties: out (list i.e. array of the MPDS entries), npages (number of pages in the selected or default pagination), page (current page number), count (total number of hits), and error (should be null). Also other properties might be added in future.

Each kind of entries (i.e. S- or P-entries) has its own specific properties in JSON (see below).

{ "error": null, "out": [ { "entry_one_properties": "entry_one_values" }, { "entry_two_properties": "entry_two_values" } ], "npages": 1, "page": 0, "count": 2 }

And if no results are found:

{ "error": null, "out": [], "npages": 0, "count": 0, "page": 0 }

The output in cif format presents nothing more than concatenated CIFs in plain text.

The response for a correctly understood request will always have the status code 200, whereas the status codes 400 ("Wrong parameters") or 403 ("Forbidden") may signal about an error in the request. Generally, checking the response status code should be always done first.

§1.4. JSON schemata of the MPDS entries

JSON is the first-class-passenger format in the MPDS API. We provide JSON schemata for all kinds of the MPDS entries in the API output. This is visualized below:

Listing 1. Interactive JSON schema for crystalline structure or S-entry.
Listing 2. Interactive JSON schema for physical property or P-entry (clickable).
Listing 3. Interactive JSON schema for phase diagram or C-entry.

Any MPDS API JSON output can be validated against the above schemata e.g. using a Python script below. In the following example the external httplib2 and jsonschema Python packages are used:

#!/usr/bin/env python

import sys
import json
import httplib2
from jsonschema import validate, Draft4Validator
from jsonschema.exceptions import ValidationError

try: input_json = sys.argv[1]
except IndexError: sys.exit("Usage: %s input_json" % sys.argv[0])

network = httplib2.Http()
response, content = network.request("")
assert response.status == 200

schema = json.loads(content)

target = json.loads(open(input_json).read())

if not target.get("npages") or not target.get("out") or target.get("error"):
    sys.exit("Unexpected API response")

    validate(target["out"], schema)
except ValidationError, e:
    raise RuntimeError(
        "The item: \r\n\r\n %s \r\n\r\n has an issue: \r\n\r\n %s" % (
            e.instance, e.context

Listing 4. Python console script for JSON validation.

All the MPDS data are expected to be valid against the supplied schemata.

§1.5. Example scripts

Below are simple examples for the MPDS data retrieval in two programming languages: Python and JavaScript. More languages could be added by request. Note, that here the Python example requires an external package httplib2. We provide more convenient Python client library, so these examples here serve demonstration purposes only.

#!/usr/bin/env python

from urllib import urlencode
import httplib2
import json

api_key = "" # your key
endpoint = ""
search = {
    "elements": "Mn",
    "classes": "binary, oxide",
    "props": "isothermal bulk modulus", # see
    "lattices": "cubic"

req = httplib2.Http()
response, content = req.request(
    uri=endpoint + '?' + urlencode({
        'q': json.dumps(search),
        'pagesize': 10
    headers={'Key': api_key}

if response.status != 200:
    # NB 400 means wrong input, 403 means authorization issue etc.
    # see
    raise RuntimeError("Error code %s" % response.status)

content = json.loads(content)

if content.get('error'): raise RuntimeError(content['error'])

print "OK, got %s hits" % len(content['out'])
Listing 5. Sample Python script for working with MPDS API.
#!/usr/bin/env node

var https = require('https');
var qs = require('querystring');

var api_key = ""; // your key
var host = "", port = 443, path = "/v0/download/facet";
var search = {
    "elements": "Mn",
    "classes": "binary, oxide",
    "props": "isothermal bulk modulus", // see
    "lattices": "cubic"

    host: host,
    port: port,
    path: path + '?' + qs.stringify({q: JSON.stringify(search), pagesize: 10}),
    method: 'GET',
    headers: {'Key': api_key}
}, function(response){
    var result = '';
    if (response.statusCode != 200){
        // NB 400 means wrong input, 403 means authorization issue etc.
        // see
        return console.error('Error code: ' + response.statusCode);
    response.on('data', function(chunk){
        result += chunk;
    response.on('end', function(){
        result = JSON.parse(result);
        if (result.error) console.error(result.error);
        else {
            console.log('OK, got ' + result.out.length + ' hits');
}).on('error', function(err){
    console.error('Network error: ' + err);
Listing 6. Sample JavaScript code for working with MPDS API.

§1.6. API and GUI

The MPDS data provided in API and GUI (graphical user interface) are generally the same. There are however two important points to consider.

First, the GUI is intended for the human researchers, not for the automated processing. Therefore the data in GUI are less rigorously formatted, being thus closer to the original publications (which, of course, impedes machine analysis).

Second, some portion of data (~15%) present in the GUI is absent in the API. This is because currently a number of MPDS entries has only textual content or even no content at all (in process of preparation), i.e. present little value for the data mining. However, we make our best to expose the maximum MPDS data in the API, and this is the work in progress.

§2. Processing data

§2.1. Overview

To facilitate starting with the MPDS API we provide several exercises (or miniature case studies): (a) distribution density analysis, (b) clustering, and (c) correlation search. To a certain extent they could be considered as recipes, however aiming to illustrate the informatics aspect, not the materials science aspect.

All the exercises discussed here can be found in the Github repository.

§2.2. Client library

As mentioned, any programming language, able to execute HTTP requests and handle the JSON output, can be employed. However, one of the most frequently used languages in data processing is Python. Therefore we provide a client library for Python versions 2.7 and 3.6. This library takes care of many aspects of the MPDS API, such as pagination, error handling, validation, proper data extraction and more. We encourage our users to adopt this library for their needs. It is installed as any other Python library:

pip install mpds_client

The exercises below use this library. Some of them might also employ numpy, scipy, and pandas libraries (a popular data science toolbox). As a necessary preparation, we save the MPDS API key in an environment of our terminal session. Later the Python API client reads the key from there, if not explicitly provided. For Bash:

export MPDS_KEY=...

§2.3. Bond length analysis

Let us say we want to study the chemical bond of A and B elements, e.g. uranium and oxygen. What is the length or lengths of this bond the most frequently reported in the world scientific literature? MPDS contains more than 2500 crystalline structures with uranium and oxygen, so why should not we perform a quick data investigation to answer this question. Starting with the necessary imports in Python:

#!/usr/bin/env python
import pandas as pd
from mpds_client import MPDSDataRetrieval

We use a naive brute-force approach to calculate bond lengths between particular atom types in a crystalline environment. Obviously, we are interested in neighboring atoms only, so we do not consider interatomic distances more than, let us say, 4 Å. We then represent a crystalline structure with the ase's Atoms class and calculate distances using its get_distance method. Note rounding of distances marked by comment NB in the code:

def calculate_lengths(ase_obj, elA, elB, limit=4):
    assert elA != elB
    lengths = []
    for n, atom in enumerate(ase_obj):
        if atom.symbol == elA:
            for m, neighbor in enumerate(ase_obj):
                if neighbor.symbol == elB:
                    dist = round(ase_obj.get_distance(n, m), 2) # NB occurrence <-> rounding
                    if dist < limit:
    return lengths

Note that the crystalline structures are not retrieved from MPDS by default, so we need to specify additional five fields: cell_abc, sg_n, setting, basis_noneq, els_noneq. On top of that, we also ask for the phase_id's, MPDS entry numbers, and chemical formulae. Note that get_data API client method returns a usual Python list, whereas get_dataframe API client method returns a Pandas dataframe. We use the former below:

client = MPDSDataRetrieval()

answer = client.get_data(
    {"elements": "U-O", "props": "atomic structure"},

The MPDSDataRetrieval.compile_crystal API client method helps us to handle the crystalline structure in the ase's Atoms flavor. A popular Pymatgen library can be also used instead. Mind however that these two libraries are generally incompatible, so never mix crystalline structures in ase and Pymatgen flavors.

We then call the calculate_lengths function defined earlier:

lengths = []

for item in answer:
    crystal = MPDSDataRetrieval.compile_crystal(item, 'ase')
    if not crystal: continue
    lengths.extend( calculate_lengths(crystal, 'U', 'O') )

That runs a little bit slow (about five minutes), since ase's Atoms are expectedly not performing very well on dozens of thousands of bond length calculations. We may want to employ a C-extension here, but this is outside the scope of this exercise. Anyway now we have a flat list lengths. Let's convert it into a Pandas Dataframe and find which U-O distances occur more often than the others.

dfrm = pd.DataFrame(sorted(lengths), columns=['length'])
dfrm['occurrence'] = dfrm.groupby('length')['length'].transform('count')
dfrm.drop_duplicates('length', inplace=True)

What did we do here? We calculated the numbers of occurrences (counts) of each particular U-O length and then augmented our dataframe with this new information, creating a new column occurrence. The reader remembers that we have rounded the interatomic distances calculated.

Now it's up to the reader to visualize the result (this is absolutely a matter of taste). Below is a bar chart rendered using Plotly and D3 JavaScript libraries. We provide visualization exporting toolbox in Python, allowing the export of the results into CSV and JSON formats and reproducing the figures, although its usage is optional.

Probability density distribution of U-O bond length across the MPDS database
Fig. 1. Distribution of U-O bond lengths across MPDS database.

We can see now that the most frequent bond lengths between the neighboring uranium and oxygen atoms are 1.78 and 2.35 Å. This excellently agrees with the well-known study of Burns et al., done in 1997. However, Burns considered only 105 structures, and we did more than 2500, confirming very well although his findings on the uranyl ion geometry.

§2.4. Clustering the band gaps

Unsupervised grouping the values into the clusters in a many-dimensional space is a classical mathematical problem. In an applied data science it is successfully managed using e.g. such algorithms as k-means and Gaussian mixtures.

Taking the non-conducting binary compounds, let us find the clusters in their band gaps and the periodic groups of their elements. That is, we obtain the band gaps and corresponding pairs of the chemical elements, forming a binary compound, and then, for the series of values group of the first element — group of the second element — band gap we apply k-means. Given a set of points (in our case, in a three-dimensional space), k-means aims to partition the points into a number of sets, in order to minimize the within-cluster sum of distance functions of each point to the cluster center.

Here we explicitly specify some fields of P-entries we are interested in. As said, all the fields are given in the JSON schema. Filtering by the units (eV), we get rid of some band gap derivatives, which are also present in the output. Then we remove a couple of unphysical values (negative or more than 20 eV) that are probably wrongly reported:

#!/usr/bin/env python
from import chemical_symbols
from retrieve_MPDS import MPDSDataRetrieval
from kmeans import Point, kmeans, k_from_n # standalone K-means implementation

client = MPDSDataRetrieval()

dfrm = client.get_dataframe(
    {"classes": "binary", "props": "band gap"},
    fields={'P': [
    columns=['Formula', 'Elements', 'SG', 'Units', 'Bandgap']
dfrm = dfrm[dfrm['Units'] == 'eV']
dfrm = dfrm[(dfrm['Bandgap'] > 0) & (dfrm['Bandgap'] < 20)]

Note, that there might be different values of the band gaps reported in the literature for each particular compound. Clearly, this is because we have not specified the character of the desired values (direct or indirect band gap), but even apart of that variations are expected, e.g. due to slightly different experimental conditions. Here we simply take the average of all the values per a compound, although there are of course better ways to handle that:

avgbgfrm = dfrm.groupby('Formula')['Bandgap'].mean().to_frame().reset_index() \
    .rename(columns={'Bandgap': 'AvgBandgap'})
dfrm = dfrm.merge(avgbgfrm, how='outer', on='Formula')
dfrm.drop_duplicates('Formula', inplace=True)

We now have almost 600 different binary compounds with the averaged values of band gap. So our data are ready for clustering. This is how we get the periodic element groups:

def get_element_group(el_num):
    if el_num == 1: return 1
    elif el_num == 2: return 18

    elif 2 < el_num < 19:
        if (el_num - 2) % 8 == 0: return 18
        elif (el_num - 2) % 8 <= 2: return (el_num - 2) % 8
        else: return 10 + (el_num - 2) % 8

    elif 18 < el_num < 55:
        if (el_num - 18) % 18 == 0: return 18
        else: return (el_num - 18) % 18

    if 56 < el_num < 72: return 3 # custom group for lanthanoids
    elif 88 < el_num < 104: return 3 # custom group for actinoids
    elif (el_num - 54) % 32 == 0: return 18
    elif (el_num - 54) % 32 >= 17: return (el_num - 54) % 32 - 14
    else: return (el_num - 54) % 32

And this is how we define the points for clustering based on our data:

fitdata = []

for n, row in dfrm.iterrows():
    groupA, groupB = \
        get_element_group(chemical_symbols.index(row['Elements'][0])), \
            sorted([groupA, groupB]) + [round(row['AvgBandgap'], 2)],
clusters = kmeans(fitdata, k_from_n(len(fitdata)))

We use a standalone k-means Python implementation kmeans, although the reader might want to employ sklearn or scipy. Note, that the k-means algorithm does not provide the number of clusters to divide our data. We try to guess this number from the size of our data naively using k_from_n function. For homogeneous data this guess is likely to be wrong, but for heterogeneous data it should make sense.

We then export the data in a flat Python list for plotting as follows. The plot below is prepared using the visualization exporting toolbox.

export_data = []

for cluster_n, cluster in enumerate(clusters, start=1):
    for pnt in cluster.points:
        export_data.append(pnt.coords + [pnt.reference] + [cluster_n])

Binary compounds clustering by their band gaps and constituent element groups
Fig. 2. Binary compounds clustering by their band gaps and constituent element groups.

§2.5. Structure-property correlations

In this excercise we quantitatively estimate the relationship between a physical property of interest and the atomic structure. This is the very basic task in chemoinformatics, called QSAR/QSPR study, which stands for the quantitative structure-activity or structure-property relationship modeling. Technically, we request some physical property from MPDS and try to figure out, whether it depends on the crystalline structure, and if yes, how much.

Here the question "how much" is answered by a correlation coefficient, which can be defined either as a measure of how well two datasets fit on a straight line (Pearson coefficient), or difference between concordant and discordant pairs among all the possible pairs in two datasets (Kendall's tau coefficient). Both Pearson and Kendall's tau correlation coefficients are defined between -1 and +1, with 0 implying no correlation. For the Pearson coefficient, values of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases. For the Kendall's tau coefficient, values close to +1 indicate strong agreement, values close to -1 indicate strong disagreement. Calculation of both these correlation coefficients is provided by pandas Python library.

As usual, we start with the Python imports:

#!/usr/bin/env python
from __future__ import division

import numpy as np
import pandas as pd
from import chemical_symbols, covalent_radii

from retrieve_MPDS import MPDSDataRetrieval

We then present two descriptors for the crystalline structures. The term descriptor stands for the compact information-rich number, allowing the convenient mathematical treatment of the encoded complex data (crystalline structure, in our case). We employ an APF descriptor, as well as topological Wiener index, defined per a unit cell. It is up to the reader to check the other descriptors for the crystalline structures.

def get_APF(ase_obj):
    volume = 0.0
    for atom in ase_obj:
        volume += 4/3 * np.pi * covalent_radii[chemical_symbols.index(atom.symbol)]**3
    return volume/abs(np.linalg.det(ase_obj.cell))

def get_Wiener(ase_obj):
    return np.sum(ase_obj.get_all_distances()) * 0.5

Note that our descriptors generally depend on whether we use compile_crystal(item, 'ase') or compile_crystal(item, 'pmg'). This is because by default ase and Pymatgen libraries handle crystalline structures in a slightly different manner. The reader is welcomed to edit the code above for the Pymatgen library.

Concerning the property being investigated, we restrict ourselves with some materials classes, in order to reduce the time needed for descriptor calculations:

client = MPDSDataRetrieval()

dfrm = client.get_dataframe({
    "classes": "transitional, oxide",
    "props": "isothermal bulk modulus"
dfrm = dfrm[np.isfinite(dfrm['Phase'])]
dfrm = dfrm[dfrm['Units'] == 'GPa'] # cleaning
dfrm = dfrm[dfrm['Value'] > 0] # cleaning

phases = set(dfrm['Phase'].tolist())
answer = client.get_data(
    {"props": "atomic structure"},
descriptors = []

for item in answer:
    crystal = MPDSDataRetrieval.compile_crystal(item, 'ase')
    if not crystal: continue
    descriptors.append(( item[0], get_APF(crystal), get_Wiener(crystal) ))

descriptors = pd.DataFrame(descriptors, columns=['Phase', 'APF', 'Wiener'])

Now we have got the physical property values in the dfrm dataframe and the structure descriptors in the descriptors dataframe. The reader remembers that in MPDS the physical properties and crystalline structures are related via the distinct phase concept, i.e. the same phase identifiers, which we have saved in advance in the Phase columns of dfrm and descriptors dataframes. As the MPDS might have several crystalline structures per a physical property value (or vice versa), we take an average (mean) of all the respective quantities within a distinct phase. After that we join our dataframes using the Phase column:

d1 = descriptors.groupby('Phase')['APF'].mean().to_frame().reset_index()
d2 = descriptors.groupby('Phase')['Wiener'].mean().to_frame().reset_index()
dfrm = dfrm.groupby('Phase')['Value'].mean().to_frame().reset_index()

dfrm = dfrm.merge(d1, how='outer', on='Phase')
dfrm = dfrm.merge(d2, how='outer', on='Phase')

dfrm.drop('Phase', axis=1, inplace=True)
dfrm.rename(columns={'Value': 'Prop'}, inplace=True)

And this is how the both correlation coefficients are calculated:

corr_pearson = dfrm.corr(method='pearson')
corr_kendall = dfrm.corr(method='kendall')

print("Pearson. Prop vs. APF = \t%s" % corr_pearson.loc['Prop']['APF'])
print("Pearson. Prop vs. Wiener = \t%s" % corr_pearson.loc['Prop']['Wiener'])
print("Kendall Tau. Prop vs. APF = \t%s" % corr_kendall.loc['Prop']['APF'])
print("Kendall Tau. Prop vs. Wiener = \t%s" % corr_kendall.loc['Prop']['Wiener'])

Expectedly, we can see some correlation between the selected property (namely, isothermal bulk modulus for the oxides of transitional elements) and two structural descriptors. However it varies considerably with the different materials classes, since the crystalline structure also changes considerably. Using the intuition, may the reader suggest the physical property which is expected to show higher correlation with either of two structural descriptors within a certain materials class — or no correlation at all?

§2.6. Visualizations

Although the reader is encouraged to visualize the data using his habitual tools, we provide a set of helper utilities. Using a simple exporting toolbox each of the exercises considered above may output two files for the further plotting: CSV and JSON. By default these files are written in a system-wide temporary directory /tmp (subdirectory _MPDS). CSV is commonly used in the electronic sheets (such as OpenOffice Calc or Excel), and JSON has a custom layout suitable for Vis-à-vis web-viewer. This is quite unsophisticated browser-based JavaScript application, heavily used inside the MPDS GUI. It employs Plotly and D3 visualization libraries. JSON produced with the exporting toolbox may be simply drag-n-dropped in the browser window with the loaded Vis-à-vis.

We thank the reader for the time and interest! Any questions or feedback is very welcomed and greatly appreciated.

Twitter @mpdsio
See on GitHub