TyT2024W21 - EDA:Carbon Majors Emissions Data

By Johanie Fournier, agr., M.Sc. in rstats tidymodels tidytuesday eda

October 24, 2024

I have worked extensively with spatial data over the past two years, so I decided to select a suitable #TidyTuesday dataset and document what I have learned so far.

My latest contribution to the #TidyTuesday project features a recent dataset on Carbon Majors emissions. The dataset is a compilation of emissions data starting in 1854; the year column in the summary below runs through 2022.

Goal

The overall goal of this blog series is to predict carbon emissions over time and space.

In this first part, the goal is to do some Exploratory Data Analysis (EDA) to look at the data set and summarize the main characteristics. To do so, I will look at the data structure, anomalies, outliers and relationships.

Get the data

Let’s start by reading in the data:

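library(tidyverse)            # dplyr, ggplot2, readr, lubridate
library(timetk)               # pad_by_time(), plot_time_series()
library(skimr)                # skim()
library(PerformanceAnalytics) # chart.Correlation()

emissions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-21/emissions.csv')

skim(emissions)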
Data summary
Name                    emissions
Number of rows          12551
Number of columns       7
_______________________
Column type frequency:
  character             4
  numeric               3
________________________
Group variables         None

Variable type: character

skim_variable    n_missing  complete_rate  min  max  empty  n_unique  whitespace
parent_entity            0              1    2   39      0       122           0
parent_type              0              1   12   22      0         3           0
commodity                0              1    6   19      0         9           0
production_unit          0              1    6   18      0         4           0

Variable type: numeric

skim_variable           n_missing  complete_rate     mean       sd    p0      p25      p50      p75      p100  hist
year                            0              1  1987.15    29.20  1854  1973.00  1994.00  2009.00   2022.00  ▁▁▁▅▇
production_value                0              1   412.71  1357.57     0    10.60    63.20   320.66  27192.00  ▇▁▁▁▁
total_emissions_MtCO2e          0              1   113.22   329.81     0     8.79    33.06   102.15   8646.91  ▇▁▁▁▁

chart.Correlation(select_if(emissions, is.numeric))

So, we have a temporal dataset because there is a year column, three classification columns (parent_entity, parent_type, commodity) and our variable of interest, total_emissions_MtCO2e.
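As a quick check of that structure, the sketch below counts the distinct values in each classification column and reads off the year span. It uses a tiny made-up data frame (`toy`, with hypothetical values) and base R only, so it runs without downloading anything; replacing `toy` with `emissions` should give the same kind of summary on the real data.

```r
# Toy stand-in for `emissions` (values are made up for illustration only)
toy <- data.frame(
  year                   = c(2000, 2000, 2001, 2001, 2002),
  parent_entity          = c("A", "B", "A", "B", "A"),
  commodity              = c("oil", "coal", "oil", "coal", "gas"),
  total_emissions_MtCO2e = c(10, 5, 12, 6, 3)
)

# Year span covered by the data
year_range <- range(toy$year)

# Number of distinct values in each classification column
n_levels <- sapply(toy[c("parent_entity", "commodity")],
                   function(x) length(unique(x)))

# Rows per entity: uneven counts hint at series of different lengths
rows_per_entity <- table(toy$parent_entity)

print(year_range)
print(n_levels)
print(rows_per_entity)
```

Uneven row counts per entity are worth noting early, because they mean the yearly series will not all have the same length.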

Trend over time

Is there a general trend over time?

sum_emissions_year<-emissions  |> 
  group_by(year) |>  
  summarise(sum=sum(total_emissions_MtCO2e)) |>  
  ungroup()

ggplot(data=sum_emissions_year, aes(x=year, y=sum))+
  geom_line()

We can see a clear increase in carbon emissions over time.
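To put a number on that visual impression, one option is to fit a straight line to the yearly totals and look at the slope, alongside the relative change between the first and last year. The sketch below does this on a small made-up series (`toy_year`, hypothetical values); the same two lines should work on `sum_emissions_year`, with the caveat that a single slope is only a rough summary when the growth is not linear.

```r
# Toy yearly totals (made-up values, same column names as sum_emissions_year)
toy_year <- data.frame(
  year = 2000:2009,
  sum  = c(100, 104, 109, 115, 118, 125, 131, 134, 142, 150)
)

# Average change per year, estimated with a simple linear fit
fit <- lm(sum ~ year, data = toy_year)
slope_per_year <- unname(coef(fit)["year"])

# Relative change between the first and the last year
first_value <- toy_year$sum[1]
last_value  <- toy_year$sum[nrow(toy_year)]
pct_change  <- (last_value - first_value) / first_value

print(slope_per_year)
print(pct_change)
```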

The ultimate goal for this blog series is to predict carbon emissions over time and space and to visualize the result. To achieve that, we first need to better understand the relationship between parent_entity and total_emissions_MtCO2e.

Spatial trend

sum_emissions_entity<-emissions  |> 
  group_by(parent_entity) |>  
  summarise(sum=sum(total_emissions_MtCO2e)) |>  
  ungroup() |> 
  arrange(desc(sum))

DT::datatable(sum_emissions_entity) |> 
  DT::formatRound(columns=c("sum"), digits=0)

We have a clear indication that the parent entities do not all produce the same amount of carbon.
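One way to make this imbalance concrete is to express each entity's total as a share of the overall total and accumulate those shares from the largest emitter down. The sketch below is a minimal base R illustration on made-up totals (`toy_entity`, hypothetical values) that mimics the shape of `sum_emissions_entity`.

```r
# Toy totals per entity (made-up values, same columns as sum_emissions_entity)
toy_entity <- data.frame(
  parent_entity = c("C", "A", "D", "B"),
  sum           = c(15, 50, 5, 30)
)

# Sort from largest to smallest emitter
toy_entity <- toy_entity[order(-toy_entity$sum), ]

# Share of the overall total, and running share from the top down
toy_entity$share     <- toy_entity$sum / sum(toy_entity$sum)
toy_entity$cum_share <- cumsum(toy_entity$share)

print(toy_entity)
```

In this toy example the top two entities already account for 80% of the total; the same columns computed on the real table would show how concentrated the emissions actually are.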

Spatio-temporal trend

Can we link the spatial trend to the temporal trend? Let’s find out by looking at the parent entities with the highest emissions (the code below keeps the top six).

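# Keep the six entities with the largest cumulative emissions
# (the object names keep their original top10 prefix)
top10_entity<-sum_emissions_entity |> 
  top_n(6, sum) |> 
  select(parent_entity)

emissions_top10<-emissions |> 
  filter(parent_entity %in% top10_entity$parent_entity) 

# Yearly totals per entity; make_date() pins each year to January 1,
# whereas as.Date(x, "%Y") would fill in today's month and day
plot_data<-emissions_top10 |> 
  group_by(parent_entity, year) |>
  summarize(sum=sum(total_emissions_MtCO2e), .groups="drop") |>
  mutate(date=lubridate::make_date(year),
         parent_entity_fct=as.factor(parent_entity)) |>
  select(parent_entity_fct, date, sum) |>
  pad_by_time(date, .by = "year")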

plot_data |> 
  group_by(parent_entity_fct) |> 
  plot_time_series(
    .date_var     = date,
    .value        = sum,
    .interactive  = FALSE,
    .facet_ncol   = 2,
    .facet_scales = "free"
  )

Each parent_entity has its own trend over time.
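A simple way to compare those trends numerically is to fit one straight line per entity and collect the slopes. The sketch below is a hedged base R illustration on a made-up panel (`toy_panel`, hypothetical values); a linear slope is only a rough summary and is not how the plots above are produced.

```r
# Toy panel: two entities observed over the same five years (made-up values)
toy_panel <- data.frame(
  parent_entity = rep(c("A", "B"), each = 5),
  year          = rep(2000:2004, times = 2),
  sum           = c(10, 12, 14, 16, 18,
                    20, 19, 18, 17, 16)
)

# One linear fit per entity; keep only the yearly slope
slopes <- sapply(
  split(toy_panel, toy_panel$parent_entity),
  function(d) unname(coef(lm(sum ~ year, data = d))["year"])
)

print(slopes)
```

Here entity A grows by 2 units per year while entity B declines by 1, which is the kind of contrast the faceted plot shows visually.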

Anomalies and outliers

library(anomalize)

# Decompose each entity's series and flag anomalous remainders
plot_data |> 
  group_by(parent_entity_fct) |> 
  time_decompose(sum) |> 
  anomalize(remainder) |>
  plot_anomalies(size_dots = 1, ncol = 2)

# Detailed decomposition for a single entity
plot_data |> 
  filter(parent_entity_fct=="Saudi Aramco") |> 
  time_decompose(sum) |> 
  anomalize(remainder) |>
  plot_anomaly_decomposition()

For simplicity, I will replace every detected anomaly with the corresponding trend value, across all the data. All subsequent analyses will use the corrected data for the top 50 parent entities.
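The idea of replacing an anomaly with its trend value can be illustrated on a toy series. The sketch below is a simplified stand-in, not the anomalize algorithm: it uses a straight-line trend instead of a time series decomposition, and an IQR rule whose factor of 3 is an assumption chosen for this example. The data (`toy_series`) are made up.

```r
# Toy yearly series (made-up values): steady growth plus a small wiggle
toy_series <- data.frame(year = 2000:2019)
toy_series$sum <- 100 + 2 * (toy_series$year - 2000) + rep(c(1, -1), times = 10)
toy_series$sum[11] <- 400  # inject one obvious spike (year 2010)

# Stand-in for the decomposition: a straight-line trend and its remainder
fit <- lm(sum ~ year, data = toy_series)
toy_series$trend     <- as.numeric(fitted(fit))
toy_series$remainder <- toy_series$sum - toy_series$trend

# IQR rule on the remainder; the factor 3 is an assumption for this toy example
q     <- quantile(toy_series$remainder, c(0.25, 0.75), names = FALSE)
fence <- 3 * (q[2] - q[1])
toy_series$anomaly <- toy_series$remainder < q[1] - fence |
  toy_series$remainder > q[2] + fence

# Replace flagged observations with the trend value, keep the rest untouched
toy_series$corrected <- ifelse(toy_series$anomaly,
                               toy_series$trend,
                               toy_series$sum)

print(toy_series[toy_series$anomaly, c("year", "sum", "trend", "corrected")])
```

Only the injected spike is flagged, and its corrected value falls back in line with the neighbouring years. Note that a plain linear fit is itself pulled upward by the spike, which is one reason a more robust decomposition is preferable on real data.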

# %ni% ("not in") is not part of base R or the tidyverse; it is assumed
# here to come from a helper package, so it is defined explicitly
`%ni%` <- Negate(`%in%`)

top50_entity<-sum_emissions_entity |> 
  top_n(50, sum) |> 
  select(parent_entity)


final_data<-emissions |> 
  filter(parent_entity %in% top50_entity$parent_entity) |>
  group_by(parent_entity, year) |>
  summarize(sum=sum(total_emissions_MtCO2e), .groups="drop") |>
  mutate(date=lubridate::make_date(year),
         parent_entity_fct=as.factor(parent_entity)) |>
  select(parent_entity_fct, date, sum) |>
  filter(parent_entity_fct %ni% c("Seriti Resources", "CNX Resources", "Navajo Transitional Energy Company"))|> 
  pad_by_time(date, 
              .by = "year", 
              .pad_value = NA) |> 
  group_by(parent_entity_fct) |> 
  time_decompose(sum) |> 
  anomalize(remainder) |> 
  # Where an anomaly is flagged, use the trend value instead of the observation
  mutate(observed=case_when(anomaly=="Yes" ~ trend,
                            TRUE ~ observed)) |> 
  select(parent_entity_fct, date, observed)

Conclusion

In this first part, we explored the dataset and identified its main characteristics. We have seen that carbon emissions have increased over time and that the top 50 parent entities follow different trends. We also identified some anomalies and outliers, which have been corrected for the work to come in the next part.

Session Info

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] anomalize_0.3.0            PerformanceAnalytics_2.0.4
 [3] xts_0.12.1                 zoo_1.8-12                
 [5] jofou.lib_0.0.0.9000       reticulate_1.37.0         
 [7] tidytuesdayR_1.0.2         tictoc_1.2.1              
 [9] terra_1.6-17               sf_1.0-5                  
[11] pins_1.0.1.9000            fs_1.5.2                  
[13] timetk_2.6.1               yardstick_1.2.0           
[15] workflowsets_0.1.0         workflows_0.2.4           
[17] tune_0.1.6                 rsample_0.1.0             
[19] parsnip_1.1.1              modeldata_0.1.1           
[21] infer_1.0.0                dials_0.0.10              
[23] scales_1.2.1               broom_1.0.4               
[25] tidymodels_0.1.4           recipes_0.1.17            
[27] doFuture_0.12.0            future_1.22.1             
[29] foreach_1.5.1              skimr_2.1.5               
[31] forcats_1.0.0              stringr_1.5.0             
[33] dplyr_1.1.2                purrr_1.0.1               
[35] readr_2.1.4                tidyr_1.3.0               
[37] tibble_3.2.1               ggplot2_3.4.2             
[39] tidyverse_2.0.0            lubridate_1.9.2           
[41] kableExtra_1.3.4.9000      inspectdf_0.0.11          
[43] openxlsx_4.2.4             knitr_1.36                

loaded via a namespace (and not attached):
  [1] readxl_1.4.2       backports_1.4.1    systemfonts_1.0.3 
  [4] lazyeval_0.2.2     repr_1.1.7         splines_4.1.1     
  [7] crosstalk_1.1.1    listenv_0.8.0      usethis_2.0.1     
 [10] digest_0.6.29      htmltools_0.5.8.1  fansi_0.5.0       
 [13] magrittr_2.0.3     tzdb_0.1.2         globals_0.14.0    
 [16] ggfittext_0.9.1    gower_0.2.2        vroom_1.6.0       
 [19] svglite_2.0.0      hardhat_1.3.0      timechange_0.1.1  
 [22] tseries_0.10-48    forecast_8.15      prettyunits_1.1.1 
 [25] colorspace_2.0-2   rvest_1.0.3        rappdirs_0.3.3    
 [28] xfun_0.39          crayon_1.4.2       jsonlite_1.8.4    
 [31] survival_3.2-11    iterators_1.0.13   glue_1.6.2        
 [34] gtable_0.3.0       ipred_0.9-12       webshot_0.5.2     
 [37] future.apply_1.8.1 quantmod_0.4.18    padr_0.6.0        
 [40] DBI_1.1.1          Rcpp_1.0.13        viridisLite_0.4.0 
 [43] progress_1.2.2     units_0.7-2        GPfit_1.0-8       
 [46] bit_4.0.4          proxy_0.4-26       tibbletime_0.1.8  
 [49] DT_0.19            lava_1.6.10        prodlim_2019.11.13
 [52] htmlwidgets_1.5.4  httr_1.4.6         farver_2.1.0      
 [55] pkgconfig_2.0.3    sass_0.4.0         nnet_7.3-16       
 [58] utf8_1.2.2         labeling_0.4.2     tidyselect_1.2.0  
 [61] rlang_1.1.1        DiceDesign_1.9     munsell_0.5.0     
 [64] cellranger_1.1.0   tools_4.1.1        cli_3.6.1         
 [67] sweep_0.2.5        generics_0.1.3     evaluate_0.14     
 [70] fastmap_1.2.0      yaml_2.2.1         bit64_4.0.5       
 [73] zip_2.2.0          nlme_3.1-152       xml2_1.3.4        
 [76] compiler_4.1.1     rstudioapi_0.14    plotly_4.10.0     
 [79] curl_5.2.3         png_0.1-7          e1071_1.7-9       
 [82] lhs_1.1.3          bslib_0.3.1        stringi_1.7.5     
 [85] highr_0.9          lattice_0.20-44    Matrix_1.3-4      
 [88] classInt_0.4-3     urca_1.3-0         vctrs_0.6.5       
 [91] pillar_1.9.0       lifecycle_1.0.3    furrr_0.2.3       
 [94] lmtest_0.9-38      jquerylib_0.1.4    data.table_1.14.2 
 [97] R6_2.5.1           renv_1.0.7         KernSmooth_2.23-20
[100] parallelly_1.28.1  codetools_0.2-18   assertthat_0.2.1  
[103] MASS_7.3-54        withr_2.5.0        fracdiff_1.5-1    
[106] parallel_4.1.1     hms_1.1.3          quadprog_1.5-8    
[109] grid_4.1.1         rpart_4.1-15       timeDate_3043.102 
[112] class_7.3-19       rmarkdown_2.25     TTR_0.24.2        
[115] base64enc_0.1-3   