Skip to main content

Exploratory data analysis

·2096 words·10 mins
Alex Haslam
Alex Haslam
I’m an engineer based in London, with expertise in optimisation, machine learning and simulation.
Table of Contents
Coffee Rating Prediction - This article is part of a series.


The objective of this project is to be able to predict how highly rated a coffee would be on based purely on information about the coffee such as:

  • Origin
  • Roaster and roasting style
  • Price
  • Flavour profile

I will use this dataset from Kaggle which contains ratings for ~1900 coffees.

Data processing

I will begin by loading in the data into a pd.DataFrame and also renaming the columns to more clearly distinguish between the country of origin and roaster country.

import pandas as pd

df = pd.read_csv("./data/simplified_coffee.csv")
for col in ["name", "roaster", "roast", "loc_country", "origin", "review"]:
    df[col] = df[col].astype("string")

df["review_date"] = pd.to_datetime(df["review_date"])
df = df.rename(columns={"loc_country": "roaster_country", "100g_$": "price_per_100g", "origin": "country_of_origin"})
Ethiopia Shakiso MormoraRevel CoffeeMedium-LightUnited StatesEthiopia4.70922017-11-01Crisply sweet, cocoa-toned. Lemon blossom, roa…
Ethiopia Suke QutoRoast HouseMedium-LightUnited StatesEthiopia4.19922017-11-01Delicate, sweetly spice-toned. Pink peppercorn…
Ethiopia Gedeb Halo BeritiBig Creek Coffee RoastersMediumUnited StatesEthiopia4.85942017-11-01Deeply sweet, subtly pungent. Honey, pear, tan…
Ethiopia Kayon MountainRed Rooster Coffee RoasterLightUnited StatesEthiopia5.14932017-11-01Delicate, richly and sweetly tart. Dried hibis…
Ethiopia Gelgelu Natural OrganicWilloughby’s Coffee & TeaMedium-LightUnited StatesEthiopia3.97932017-11-01High-toned, floral. Dried apricot, magnolia, a…

We should check for any NaNs:


The roast column is the only one containing NaNs. Since there are only 12 missing values, we could just remove these rows. However, since most coffees have the same roasting style (as will see later), let us fill with the modal value.

df["roast"] = df["roast"].fillna(df["roast"].mode().iloc[0])

Let’s fix a typo in the roaster country for one coffee.

df["roaster_country"] = df["roaster_country"].str.replace("New Taiwan", "Taiwan")

Some roasters such as “El Gran Cafe” are duplicated since they entered with slightly different spellings in different rows (eg sometimes with the accented é and sometimes with a plain e). I will rename the roasters consistently by replacing these characters:

replace = {"’s": "'s", "é": "e", "’": "'"}
for k, v in replace.items():
    df["roaster"] = df["roaster"].str.replace(k, v)

We should also verify that the information about each roaster (in this case only country) is consistent across all coffees from the same roaster.

def _assert_identical_values(df: pd.DataFrame) -> pd.Series:
    assert (df.iloc[1:, :] == df.iloc[0, :]).all().all()
    return df.iloc[0, :]

Region of origin

Most coffees come from certain regions of the world, and the coffees from each region tend to be similar in flavour profile (eg African coffees are typically more acidic and South American coffees more nutty.) Therefore it may be useful to categorise the countries by region.

The mapping from regions to countries has been manually compiled and stored in regions.json. We need to load this JSON file, invert the mapping so that it goes from country to region, and then engineer a new column for the region.

with open("./data/regions.json", "r") as f:
    REGIONS = json.load(f)

regions = {}
for r, countries in REGIONS.items():
    for c in countries:
        regions[c] = r

df["region_of_origin"] = df["country_of_origin"].map(regions).fillna("Other")

Flavour notes

As it stands, we cannot glean any information from the review column as it is unstructured. Let’s begin by analysing the keywords present in the reviews using the WorldCloud package:

from wordcloud import WordCloud, STOPWORDS

COFFEE_WORDS = {"cup", "notes", "finish", "aroma", "hint", "undertones", "mouthfeel", "structure", "toned"}

word_cloud = WordCloud(collocations=False, width = 1000, height = 500, background_color='white', stopwords=set(STOPWORDS) | COFFEE_WORDS).generate(
    " ".join(df["review"])

Word cloud

Note that we have had to remove common coffee-related nouns, which is assumed provide no information about the particular coffee. We can see that the most common words relate to the flavour of the coffee such as “acidity” or “chocolate”. This suggests that we can extract some features for the different flavours in the coffee.

Using this insight and the coffee flavour wheel, we can manually define some flavours and corresponding keywords which are stored in flavours.json.

with open("./data/flavours.json", "r") as f:
    FLAVOURS = json.load(f)

We can now engineer boolean features for each flavour.

def rating_contains_words(review: str, keywords: list[str]) -> bool:
    words = extract_words(review)
    for w in keywords:
        if w in words:
            return True
    return False

for flavour, keywords in FLAVOURS.items():
    df[flavour] = df["review"].apply(rating_contains_words, args=(keywords,))

It is also interesting to check how many flavours the different coffees have. If we have done a good job at defining the flavour keywords, we would expect that mode coffees would:

  • Have at least some flavours since this is a key component of any review
  • Not have an excessive number of flavours, as this would indicate we have chosen too “common” keywords

Indeed, this appears to be the case. All coffees have at least 2 flavours, and in fact most coffees have ~6 flavours.

Distributions of each feature

We will begin by exploring the distribution of each feature in the dataset.


We can see that the ratings appear to be approximately normally distributed. However, the median rating is surprisingly high at ~94% and the standard deviation is low, so that the effect rating range ranges roughly from 85-100.



The distribution for the price of the coffee is shown below.


There is a very long tail, due to a few very expensive coffees. This suggests that there may be a benefit in applying the log transformation when we come to fit the model.

Roasting style

The vast majority of the coffees have the medium-light roast type. This large bias in the dataset may make it challenging for a model to detect any impact of roasting style on coffee rating.


Roaster country

If we look at the value counts, we see that most of the coffees are from US roasters.

United States774
Hong Kong9

If we look at the distribution of pricing for the most common countries, we see that the distribution is quite different in each country. In particular, the distribution of coffees from US roasters has a much more pronounced peak at the lower price level. This likely indicates that there is some bias in the dataset. Given that CoffeeReview is based in the US, they have reviewed a disproportionate number of more affordable coffees from US roasters.

countries = ["United States", "Taiwan", "Guatemala"]
df[df["roaster_country"].apply(lambda c: c in countries)].histogram("price_per_100g",by="roaster_country")
This strong bias towards coffees roasted in the US means that it is unclear how well any model trained on this data will generalise to coffees roasted outside the US. In addition, the number of different countries present is very small, and we cannot for example, predict if a coffee from a German roaster would be more or less likely to be highly rated since there are no coffees from German roasters in the dataset.


Country of origin

We will first examine the different countries of origin in the dataset:


As expected, most of the coffees come from the largest coffee producing countries in the world, almost a third are from Ethiopia alone.

Region of origin

We also plot the regions of origin:


Almost all examples are from one of the following regions which are the major coffee producing regions of the world:

  • East Africa such as Ethiopia or Kenya
  • Central or South America such as Colombia or Guatemala

There are also a lot of coffees from Hawaii, which is likely again because the data is from a US-based website.


In order to examine the popularity of the different flavours, we can plot the histogram:


We can see that the most common flavours are:

  • Caramelly
  • Acidic
  • Fruity
  • Chocolate

Intuitively, this makes sense as these are the sorts of flavours seen mentioned on packets of coffee.

Influence of features on rating

We can now move onto trying to identify any influence of the different features on the rating of the coffee.


To visualise the influence of price on the rating, we can simply produce a scatter plot:

df.plot.scatter(x="price_per_100g", y="rating")

There appears to be some positive correlation between the two variables, although there is a fair amount of scatter so it is clear that there are other mechanisms at play here as well. This suggests that either:

  • Price is genuinely an indicator of quality
  • Price biases the reviewers

There is evidence of diminishing returns from increasing the price of the coffee, as the relationship between price and rating starts to flatten off.


If we look at the highest and lowest rated coffees, we see that they are dominated by certain roasters.

df.loc[df["rating"] > 96, ["name", "roaster"]].groupby("roaster").count()
Barrington Coffee Roasting1
Bird Rock Coffee Roasters1
Dragonfly Coffee Roasters1
Hula Daddy Kona Coffee1
JBC Coffee Roasters3
Kakalove Cafe1
Paradise Roasters2
df.loc[df["rating"] < 90, ["name", "roaster"]].groupby("roaster").count()
El Gran Cafe8

To gain further insight, we can plot the average rating for the coffees with the most coffees:

roasters = df["roaster"].value_counts()
popular_roasters = sorted(roasters[roasters > 10].index)

This is shown below, with the average rating across all coffees shown in black.

We can see that there is significant variation in the average rating for each roaster. For example, “El Gran Cafe” has a particularly low average rating around 3.0 below the average rating. This suggests that either:

  • Certain roasters find the best/worst coffees or roast them particularly well
  • The reviewers favour/dislike certainer roasters

Roasting style

To assess if roasting style has an impact on the rating, we can plot the histogram grouping by the roasting style:

df.hist("rating", by="roast")

It is clear that dark and medium-dark roasted coffees are often rated more poorly, since the tail has a significant skew to the left for these roasting styles. We can also see that the lighter roasted coffees have a much higher median rating, with medium roasted coffees somewhere in between.

We can plot the mean rating for each roasting style in the same way as for the roasters:


As expected, the darker roasting styles have a much lower rating on average, which is because some of the darker roasted coffees are related particularly poorly. However, there is a much smaller difference between the light and medium-light roasting styles.


Since there are many countries, we will analyse the influence of region of origin instead of country. We can plot the histogram of rating grouping by each region:

df.hist("rating", by="region_of_origin")

We can see that the distribution for Central American coffees has a left-leaning tail, and the median is slightly lower than the other regions. However, we can glean more information by looking at the mean rating for each region of origin:


This highlights that the East African coffees are more highly rated on average, and Central American are less highly rated. Therefore, there is evidence that the origin may have some influence on the rating of a coffee. However, the effect does not appear to be as strong as the roaster for example, since the difference between the highest and lowest average ratings is only around 0.5 compared to around 3.0 for the roaster.


To investigate the impact of flavour on rating, we can compute the average rating for coffees with and without each flavour, and compute the difference. If a flavour has a big impact on rating, we would expect to see a large difference.

rating_with_flavour = pd.Series({f: df.loc[df[f], "rating"].mean() for f in FLAVOURS})
rating_without_flavour = pd.Series({f: df.loc[~df[f], "rating"].mean() for f in FLAVOURS})
(rating_with_flavour-rating_without_flavour).hist("region_of_origin", by="roast")

We can see that for most flavours, the difference is quite small. However, it appears that fruitiness has a large positive influence and resinous a large negative influence on the rating.


In this article, we has processed the coffee dataset and performed analysis to establish that:

  1. The dataset has a strong bias towards:
    • Coffees from US roasters
    • Coffees from East Africa or South America
    • Medium-light roasted coffees
  2. We have detected which factors may be more likely to impact the rating:
    • There is some definite correlation between price and rating
    • Certain roasters have much higher and lower average ratings for their coffees, suggesting that this does impact the rating
    • The roast style, origin and flavour profile all appear to have some impact on the rating, but it does not appear that the relationship is as strong as the roaster

In the next post, we will use the insight gained from this analysis to engineer features, and then training a predictive model.

Coffee Rating Prediction - This article is part of a series.