Design of scatter plots

Not all diagrams in this article are based on real data.

Line charts are used to show the change in one parameter over time, for instance, the change in the happiness index of one country (the USA) over the past half century.

It is not always interesting to see how one parameter changes for a single object of observation. Sometimes it is necessary to see how one parameter depends on another for a whole group of objects at once. For instance, it would be useful to explore how happy different people are depending on their salary.

For such tasks we use scatter plots. A scatter plot is an area bounded by two axes. The independent parameter, income, is displayed on the horizontal axis. The dependent parameter, happiness level, is marked on the vertical axis.

Each point on the plot represents one person with a given income and happiness level. The points scattered across the plot show that the rich are happier on average, but not always.

A scatter plot can show more than just two parameters. In addition to its position on the axes, each point may differ visually from the others. The easiest way is to use color.

For example, men can be marked in blue and women in pink. In the new plot, the blue dots lie just below the pink dots for the same income. This suggests that women are, on average, happier than men at the same salary.

Another obvious trick is to make dots of different sizes. What can we show in this way? For instance, this could be the person's weight.

Let the size of the dots reflect the level of obesity. In the next diagram, large dots are clustered on the left, and almost all medium-sized dots are on the right. It can be concluded that the lower a person's income, the harder it is for them to keep a healthy diet. At the same time, there are also many smaller dots on the left side. Poor people not only eat low-quality food that leads to obesity, but are also more likely to suffer from malnutrition.

Finally, the dots may differ in shape. The most popular shapes for a scatter plot are: circle, square, and triangle.

Let us depict married people as triangles and leave single people as circles. From the new diagram, one can see that triangles are piled up on the left. Rich people are less likely to marry than poor people. Perhaps they do not see the point in this, absorbed in business or work, or maybe they simply do not want to share their property with someone.

Design of points

A point is an abstract object that has no measurable parameters other than its coordinates. Is it not odd to talk about the design of points if points do not even have a body? It is more accurate to call them “figures”.

A big drawback of the previous diagram is that it is a real dump of figures overlapping each other. The first fix that comes to mind is to make them semi-transparent, but then the diagram becomes pale.

Figures can instead overlap in “multiply” blending mode. This mode is available not only in graphics editors: CSS offers it through the mix-blend-mode property.

Unlike with transparency, figures in multiply mode do not fade. The figure colors stay the same, but where figures intersect the colors are multiplied and become darker and more saturated.
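The difference between the two approaches can be checked with a little arithmetic. The sketch below assumes 8-bit RGB channels and the usual definition of multiply blending (channel by channel, bottom × top / 255); the color value is a hypothetical example, not one taken from the diagrams.

```python
def multiply(bottom, top):
    """Multiply blend of two RGB colors with channels in 0..255."""
    return tuple(round(b * t / 255) for b, t in zip(bottom, top))


def alpha_over(bottom, top, opacity):
    """Normal blend: paint `top` over `bottom` at the given opacity (0..1)."""
    return tuple(round(t * opacity + b * (1 - opacity))
                 for b, t in zip(bottom, top))


WHITE = (255, 255, 255)
BLUE = (70, 130, 230)  # hypothetical figure color

# One figure on a white background keeps its own color in multiply mode.
single = multiply(WHITE, BLUE)

# Where two figures overlap, each channel is multiplied again and gets darker.
overlap = multiply(single, BLUE)

# A half-transparent figure on white is washed out instead.
faded = alpha_over(WHITE, BLUE, 0.5)

print(single)   # (70, 130, 230)
print(overlap)  # (19, 66, 207)
print(faded)
```

Multiplying by white changes nothing, which is why a lone figure does not fade, while every extra layer of the same color darkens the intersection.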

This technique works well if the figures on the diagram are of the same color. On such a diagram, one can see not only which figure lies under which, but also how dense the data is in different parts of the diagram.

However, if the figures are of different colors, then the intersection will turn out to be of a mixed color that does not belong to any of the figure types.

Plots that use this trick look dirty.

For such cases, it is possible to combine two tricks: put copies of the transparent figures on top of the figures in multiply mode.

Then the colors remain saturated and the intersections do not turn black.

Design of average lines

Scatter plots can contain not only figures but also lines. However, scatter plot lines are very different from the lines of a line chart. Figures should not be connected to each other, because they are independent of each other.

Instead, a “line of best fit”, or simply an “average line”, is drawn. This line runs as close as possible to all points on average.
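One common way to compute a straight average line is ordinary least squares, which minimizes the sum of squared vertical distances from the points to the line. The sketch below assumes this method; the article does not say which fitting method its diagrams use, and the function name and sample data are made up for illustration.

```python
def best_fit_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (income, happiness) pairs lying exactly on y = 2x + 1.
sample = [(0, 1), (1, 3), (2, 5), (3, 7)]
slope, intercept = best_fit_line(sample)
print(slope, intercept)  # 2.0 1.0
```

The function needs at least two points with different x values; otherwise `sxx` is zero and the slope is undefined.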

The average line does not have to be straight. The following diagram shows the relationship between body mass and speed of swimming animals. The average line of this set bends over on the right side of the diagram.

One diagram can display multiple sets. Below, terrestrial animals and birds are added. A separate average line is drawn for each set, and all the lines are quite similar to each other.

In the previous example the sets differ in color. This is not always the case. For instance, several sets of bonds with the same investment rating can be displayed on a scatter plot. Rating is usually indicated by color. In that case, several sets will be of the same color, and their average lines will be indistinguishable.

The problem can be solved by making the lines dashed and dotted. The figures themselves can hardly be made distinct from each other on a static image, so the remaining option is to highlight a set when the cursor hovers over it.

Figure captions and legend

Any chart or plot that shows more than one type of data must have a legend. A legend can be placed in a dozen ways, and any of them will do depending on the task. On a static image, the legend can be placed at the bottom or to the right. If the diagram is part of a software product, its legend can be placed in the upper left corner and have a button to turn off unnecessary sets.

Particular elements of the diagram can be labeled with callouts. For instance, these may be the figures with the largest values. Diagrams with callouts can be used as illustrations. For this, parts of the grid and the axes can be removed, leaving only tick marks and horizontal lines. Somewhere here data design ends and infographics starts.

If the diagram is part of a software product, such callouts will not work. It is often necessary to label all figures, but this is not always possible. If there are many figures, one should label only those that can be labeled without overlapping the neighboring figures.
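The rule of keeping only the captions that do not overlap neighboring figures can be sketched as a greedy pass over candidate caption boxes. This is a minimal sketch under stated assumptions: every figure and caption is approximated by an axis-aligned rectangle in pixels, captions are tried in the order given (for example, most important first), and the `Box` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def overlaps(a, b):
    """True if two boxes share any area (touching edges do not count)."""
    return (a.x < b.x + b.width and b.x < a.x + a.width and
            a.y < b.y + b.height and b.y < a.y + a.height)


def pick_captions(captions, figure_boxes):
    """Keep each caption only if it hits no figure and no earlier caption.

    `captions` is a list of (text, Box) pairs in priority order.
    """
    occupied = list(figure_boxes)
    kept = []
    for text, box in captions:
        if not any(overlaps(box, other) for other in occupied):
            kept.append(text)
            occupied.append(box)
    return kept


# Two hypothetical figures side by side; each caption sits to the right of its figure.
figures = [Box(0, 0, 10, 10), Box(12, 0, 10, 10)]
captions = [("A", Box(10, 0, 20, 10)), ("B", Box(22, 0, 20, 10))]
print(pick_captions(captions, figures))  # ['B']
```

Caption "A" is dropped because it would run over the second figure, while "B" has free space to its right.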

This diagram shows the relationship between the number of coronavirus infections and the number of tests in a country. On average, the more tests, the more cases of the disease, which is logical. Countries are a good example of data where one can use abbreviations and label almost all figures.

Let us add a new parameter to the plot. It is logical to show the population of a country by the size of its figure. The more people live in the country, the larger the radius of its figure.
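How exactly population maps to radius is a design choice. The sketch below assumes one common convention, which the article does not state: the area of the circle, not its radius, is proportional to the population, so the radius grows with the square root. The function name and the 40-pixel maximum are hypothetical.

```python
import math


def radius_for(population, max_population, max_radius=40.0):
    """Radius in pixels such that circle area is proportional to population."""
    return max_radius * math.sqrt(population / max_population)


# Hypothetical populations, in millions.
print(radius_for(100, 100))  # 40.0
print(radius_for(25, 100))   # 20.0
```

With this mapping, a country with a quarter of the largest population gets half the radius and therefore a quarter of the area.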

The radius is shown in the legend by concentric circles.

It is unlikely that such a legend will fit on the screen, and there is not much benefit from it. What an interactive diagram needs is a choice of axis parameters. To implement it, the axis labels have to be moved up and turned into drop-down lists. Let us put “x” and “y” next to the drop-downs so that they can be told apart.

Time series

Although scatter plots are intended for other purposes, it is still possible to show change over time on them. To do this, one needs to attach tails to the figures, showing the path each figure has traveled.

Tails can have time marks. It is important to remember that time lies outside the two-dimensional plane of the diagram, so the marks are spaced unevenly along the tail.

The following plot shows the relationship between a country's life expectancy and per capita spending on healthcare. The United States stands out from the slender bunch of European countries, Japan, Korea and Israel: with the colossal cost of medical care, people on average do not live up to 80 years. The “American tail” is marked with years and highlighted in red for clarity.

It is not necessary to clutter the plot area to show the timeline. One can show the time series under the diagram after a click on a specific figure.

The example below shows a scatter plot for bonds. The horizontal axis shows duration, that is, the average time it takes to recover the money invested in a bond. Bond yields are marked along the vertical axis. Color indicates the bond's investment rating, a triangle means that the bond is in the investor's portfolio, and size indicates liquidity.

Density of scattering

There can be a lot of figures on a scatter plot. Zoom can be used to view them all. Another way is to limit the range of axes values. To do this, one needs to add a range control to the axes, which consists of two sliders and a scale between them.

The scale will be much more informative if one draws a diagram of figure density on it. Blue highlights the range that is displayed on the scatter plot when the axis constraint is enabled.
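The density diagram on the scale is essentially a histogram of the figures' values along that axis. A minimal sketch, assuming equal-width bins; the function names and sample values are hypothetical.

```python
def density_histogram(values, lo, hi, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            # The value equal to `hi` goes into the last bin.
            i = min(int((v - lo) / width), bins - 1)
            counts[i] += 1
    return counts


def visible(values, selected_lo, selected_hi):
    """Values that stay on the plot for the range chosen with the two sliders."""
    return [v for v in values if selected_lo <= v <= selected_hi]


# Hypothetical values along one axis.
xs = [1, 2, 2, 3, 9, 10]
print(density_histogram(xs, 0, 10, 5))  # [1, 3, 0, 0, 2]
print(visible(xs, 2, 9))                # [2, 2, 3, 9]
```

The bar heights drawn between the sliders come from `density_histogram`, and the highlighted part of the scale corresponds to the range passed to `visible`.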

The example below shows how axis constraints might look in the interface of a financial application.

Sometimes hundreds or even thousands of figures are plotted on a scatter plot. The example below shows the relationship between the rating and age of FIFA players, who are so numerous that only 10% of the sample is shown on the plot.

If the figures are given 25% transparency and multiply mode is turned on, the diagram becomes much easier to read. It is now possible to see how high the sample density is at the bottom left of the plot.

For illustrative purposes, such dense diagrams are often plotted with areas of density similar to heat maps.

Sometimes the individual data points are not as important as a general understanding of the distribution density. In this case, one can get rid of the figures altogether by replacing them with a stepped gradient. As in the previous example, the color here means the density of figures in the given area of the scatter plot.
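A stepped gradient of this kind can be derived by counting figures in the cells of a grid and then reducing the counts to a few discrete levels, one per color step. The sketch below assumes a rectangular grid and levels proportional to the densest cell; the function names, grid size, and sample points are hypothetical.

```python
def density_grid(points, x_range, y_range, nx, ny):
    """Count points per cell of an nx-by-ny grid; rows follow y, columns follow x."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        if x0 <= x <= x1 and y0 <= y <= y1:
            col = min(int((x - x0) / (x1 - x0) * nx), nx - 1)
            row = min(int((y - y0) / (y1 - y0) * ny), ny - 1)
            grid[row][col] += 1
    return grid


def to_steps(grid, steps):
    """Map counts to levels 0..steps: 0 for empty cells, `steps` for the densest."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0] * len(row) for row in grid]
    # Integer ceiling of count * steps / peak, so any non-empty cell gets level >= 1.
    return [[(count * steps + peak - 1) // peak for count in row] for row in grid]


# Hypothetical points in the unit square.
pts = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (0.4, 0.4), (0.9, 0.9)]
grid = density_grid(pts, (0, 1), (0, 1), 2, 2)
print(grid)               # [[4, 0], [0, 1]]
print(to_steps(grid, 2))  # [[2, 0], [0, 1]]
```

Each level would then be drawn with its own color, which produces the stepped look instead of a smooth gradient.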

Finally, the data can be presented quite concisely: only the center of density is shown as a round mark, while the amplitude is represented by a line going up and down from the mark. This is an analogue of Japanese candlesticks, which are often used in financial charts. Such plots are called “binned scatter plots”.
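A binned scatter plot can be sketched by splitting the horizontal axis into equal intervals and summarizing the vertical values in each one. The sketch below assumes the mean as the "center of density" and the minimum and maximum as the ends of the vertical line; medians and quantiles are an equally plausible choice, and the article does not specify which one is used. The function name and sample data are hypothetical.

```python
from statistics import mean


def binned_scatter(points, x_lo, x_hi, bins):
    """Return (bin_center_x, mean_y, min_y, max_y) for every non-empty bin."""
    width = (x_hi - x_lo) / bins
    buckets = [[] for _ in range(bins)]
    for x, y in points:
        if x_lo <= x <= x_hi:
            i = min(int((x - x_lo) / width), bins - 1)
            buckets[i].append(y)
    summary = []
    for i, ys in enumerate(buckets):
        if ys:  # skip bins with no figures
            center_x = x_lo + (i + 0.5) * width
            summary.append((center_x, mean(ys), min(ys), max(ys)))
    return summary


# Hypothetical (x, y) pairs.
data = [(1, 10), (2, 20), (6, 30), (9, 50)]
for row in binned_scatter(data, 0, 10, 2):
    print(row)
```

Each returned tuple corresponds to one round mark with its vertical line: the mark sits at the mean, and the line spans from the minimum to the maximum.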