TyT2024W21 - EDA: Carbon Majors Emissions Data

By Johanie Fournier, agr., M.Sc. in rstats tidymodels tidytuesday eda

October 24, 2024


I have worked extensively with spatial data over the past two years, so I decided to select a suitable #TidyTuesday dataset and document what I have learned so far.

My latest contribution to the #TidyTuesday project features a recent dataset on Carbon Majors emissions. The dataset compiles emissions data from 1854 to 2022.

Goal

The overall goal of this blog series is to predict carbon emissions over time and space.

In this first part, the goal is to do some Exploratory Data Analysis (EDA) to look at the data set and summarize the main characteristics. To do so, I will look at the data structure, anomalies, outliers and relationships.

Get the data

Let’s start by reading in the data:

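library(tidyverse)            # read_csv(), dplyr verbs, ggplot2
library(skimr)                # skim()
library(PerformanceAnalytics) # chart.Correlation()
library(timetk)               # pad_by_time(), plot_time_series()

emissions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-21/emissions.csv')

skim(emissions)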


Data summary

Name	emissions
Number of rows	12551
Number of columns	7
_______________________
Column type frequency:
character	4
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
parent_entity	1	2	39	122
parent_type	1	12	22	3
commodity	1	6	19	9
production_unit	1	6	18	4

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	1987.15	29.20	1854	1973.00	1994.00	2009.00	2022.00	▁▁▁▅▇
production_value	1	412.71	1357.57	0	10.60	63.20	320.66	27192.00	▇▁▁▁▁
total_emissions_MtCO2e	1	113.22	329.81	0	8.79	33.06	102.15	8646.91	▇▁▁▁▁

chart.Correlation(select_if(emissions, is.numeric))

So, we have a temporal dataset because there is a year column, three classification columns (parent_entity, parent_type, commodity) and our variable of interest, total_emissions_MtCO2e.

Trend over time

Is there a general trend over time?

sum_emissions_year<-emissions  |> 
  group_by(year) |>  
  summarise(sum=sum(total_emissions_MtCO2e)) |>  
  ungroup()

ggplot(data=sum_emissions_year, aes(x=year, y=sum))+
  geom_line()

We can see a clear increase in carbon emissions over time.
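As a rough way to put numbers on that increase, the yearly totals can be averaged by decade. The sketch below uses a small made-up data frame (`toy_year`, hypothetical values) standing in for `sum_emissions_year`; the `decade` and `mean_sum` columns are names introduced here for illustration only.

```r
library(dplyr)

# Hypothetical yearly totals standing in for sum_emissions_year.
toy_year <- data.frame(
  year = c(1990, 1995, 2000, 2005, 2010),
  sum  = c(100, 120, 150, 190, 240)
)

# Assign each year to its decade, then average the totals within each decade.
by_decade <- toy_year |>
  mutate(decade = floor(year / 10) * 10) |>
  group_by(decade) |>
  summarise(mean_sum = mean(sum), .groups = "drop")

by_decade
```

Comparing `mean_sum` across decades gives a coarse but readable view of how fast the totals grow.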

The ultimate goal of this blog series is to predict carbon emissions over time and space and to visualize the result. To achieve that, we first need to better understand the relationship between parent_entity and total_emissions_MtCO2e.

Spatial trend

sum_emissions_entity<-emissions  |> 
  group_by(parent_entity) |>  
  summarise(sum=sum(total_emissions_MtCO2e)) |>  
  ungroup() |> 
  arrange(desc(sum))

DT::datatable(sum_emissions_entity) |> 
  DT::formatRound(columns=c("sum"), digits=0)

This is a clear indication that the parent entities do not all produce the same amount of carbon.
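One way to quantify how uneven the totals are is to compute each entity's share of the overall sum, along with the cumulative share. This is a minimal sketch, assuming a table shaped like `sum_emissions_entity`; the entities and values in `toy_totals` are made up, and `share` and `cum_share` are illustrative column names.

```r
library(dplyr)

# Hypothetical totals; in the post these would come from sum_emissions_entity.
toy_totals <- data.frame(
  parent_entity = c("Entity A", "Entity B", "Entity C", "Entity D"),
  sum = c(600, 250, 100, 50)
)

# Share of the overall total per entity, and the running (cumulative) share.
shares <- toy_totals |>
  arrange(desc(sum)) |>
  mutate(
    share     = sum / sum(sum),
    cum_share = cumsum(share)
  )

shares
```

A steep `cum_share` curve (a few entities reaching most of the total) would confirm that emissions are concentrated in a handful of producers.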

Spatio-temporal trend

Can we link the spatial trend to the temporal trend? Let’s find out by looking at the six parent entities with the highest emissions.

# Keep the six entities with the largest totals (object names say "top10").
top10_entity<-sum_emissions_entity |> 
  top_n(6, sum) |> 
  select(parent_entity)

emissions_top10<-emissions |> 
  filter(parent_entity %in% top10_entity$parent_entity) 

plot_data<-emissions_top10 |> 
  group_by(parent_entity, year) |>
  summarize(sum=sum(total_emissions_MtCO2e)) |>
  ungroup() |>
  # Anchor each year on January 1; a bare "%Y" format would borrow the
  # month and day from the date the code is run.
  mutate(date=as.Date(paste0(year, "-01-01")),
         parent_entity_fct=as.factor(parent_entity)) |>
  select(parent_entity_fct, date, sum) |>
  pad_by_time(date, .by = "year")
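A note on building dates from a bare year: when the format string contains only `%Y`, R fills in the month and day from the current date, so the result depends on when the code runs. The sketch below compares that with anchoring each year on January 1; the `years` vector is just an example.

```r
years <- c(1854, 1990, 2022)

# Format "%Y" alone: month and day are taken from the day the code runs.
run_date_dependent <- as.Date(as.character(years), "%Y")

# Anchoring on January 1 gives the same result on any run date.
fixed <- as.Date(paste0(years, "-01-01"))

format(fixed, "%m-%d")
```

With a fixed anchor, yearly series built on different days line up exactly, which matters once they are padded or joined by date.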

plot_data |> 
  group_by(parent_entity_fct) |> 
  plot_time_series(
    .date_var     = date,
    .value        = sum,
    .interactive  = FALSE,
    .facet_ncol   = 2,
    .facet_scales = "free"
  )

Each parent_entity has its own trend over time.
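To summarise each entity's trend in a single number, one option is the slope of a linear fit of emissions on year. This is only a sketch on made-up data (`toy`, with hypothetical entities A and B); a straight line is an assumption that may not suit every series.

```r
# Hypothetical yearly totals for two entities.
toy <- data.frame(
  parent_entity = rep(c("A", "B"), each = 5),
  year = rep(2000:2004, times = 2),
  sum  = c(10, 12, 14, 16, 18,   # A grows by 2 per year
           50, 49, 48, 47, 46)   # B shrinks by 1 per year
)

# Fit sum ~ year separately for each entity and keep the slope.
slopes <- sapply(
  split(toy, toy$parent_entity),
  function(d) coef(lm(sum ~ year, data = d))[["year"]]
)

slopes
```

A positive slope means emissions rise on average over the period, a negative one means they fall.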

Anomalies and outliers

library(anomalize)

# Decompose each entity's series, flag anomalous remainders, and plot them.
plot_data |> 
  group_by(parent_entity_fct) |> 
  time_decompose(sum) |> 
  anomalize(remainder) |>
  plot_anomalies(size_dots = 1, ncol = 2)

# Look at the full decomposition for a single entity.
plot_data |> 
  filter(parent_entity_fct=="Saudi Aramco") |> 
  time_decompose(sum) |> 
  anomalize(remainder) |>
  plot_anomaly_decomposition()

For simplicity, I will replace the detected anomalies with the trend value across all the data. All subsequent analyses will use the corrected data for the top 50 parent entities.
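The replacement rule can be illustrated on a single made-up series. The sketch below is a simplified stand-in, not what `anomalize` does internally: it uses a running median (`stats::runmed`) as the trend, flags remainders lying more than 3 interquartile ranges outside the quartiles, and swaps the flagged values for the trend. The series `x`, the window `k = 5` and the factor 3 are all assumptions chosen for illustration.

```r
# Hypothetical yearly series with one obvious spike at position 6.
x <- c(10, 12, 11, 14, 13, 60, 15, 18, 17, 20, 19)

# Simplified trend: running median over a 5-point window.
trend <- as.numeric(runmed(x, k = 5))
remainder <- x - trend

# Flag remainders far outside the interquartile range.
q <- quantile(remainder, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
is_anomaly <- remainder < q[[1]] - 3 * iqr | remainder > q[[2]] + 3 * iqr

# Replace flagged observations with the trend, keep the rest as observed.
cleaned <- ifelse(is_anomaly, trend, x)

which(is_anomaly)
cleaned[6]
```

Only the spike is flagged here, and it is replaced by the local trend value while every other observation is left untouched.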

top50_entity<-sum_emissions_entity |> 
  top_n(50, sum) |> 
  select(parent_entity)


final_data<-emissions |> 
  filter(parent_entity %in% top50_entity$parent_entity) |>
  group_by(parent_entity, year) |>
  summarize(sum=sum(total_emissions_MtCO2e)) |>
  ungroup() |>
  # Anchor each year on January 1 so the dates do not depend on the run date.
  mutate(date=as.Date(paste0(year, "-01-01")),
         parent_entity_fct=as.factor(parent_entity)) |>
  select(parent_entity_fct, date, sum) |>
  # Exclude three entities; `!(x %in% y)` is the base-R way to write "not in".
  filter(!(parent_entity_fct %in% c("Seriti Resources", "CNX Resources", "Navajo Transitional Energy Company"))) |> 
  pad_by_time(date, 
              .by = "year", 
              .pad_value = NA) |> 
  group_by(parent_entity_fct) |> 
  time_decompose(sum) |> 
  anomalize(remainder) |> 
  # Replace flagged observations with the trend component.
  mutate(observed=case_when(anomaly=="Yes" ~ trend,
                            TRUE ~ observed)) |> 
  select(parent_entity_fct, date, observed)

Conclusion

In this first part, we explored the dataset and identified its main characteristics. We saw that carbon emissions have increased over time and that the top 50 parent entities follow different trends. We also identified some anomalies and outliers, which have been corrected for the work to come in the next part.

Session Info

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Toronto
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] anomalize_0.3.0            PerformanceAnalytics_2.0.8
 [3] xts_0.14.1                 zoo_1.8-12                
 [5] jofou.lib_0.0.0.9000       reticulate_1.40.0         
 [7] tidytuesdayR_1.1.2         tictoc_1.2.1              
 [9] terra_1.8-10               sf_1.0-19                 
[11] pins_1.4.0                 fs_1.6.5                  
[13] timetk_2.9.0               yardstick_1.3.2           
[15] workflowsets_1.1.0         workflows_1.1.4           
[17] tune_1.2.1                 rsample_1.2.1             
[19] parsnip_1.2.1              modeldata_1.4.0           
[21] infer_1.0.7                dials_1.3.0               
[23] scales_1.3.0               broom_1.0.7               
[25] tidymodels_1.2.0           recipes_1.1.0             
[27] doFuture_1.0.1             future_1.34.0             
[29] foreach_1.5.2              skimr_2.1.5               
[31] forcats_1.0.0              stringr_1.5.1             
[33] dplyr_1.1.4                purrr_1.0.2               
[35] readr_2.1.5                tidyr_1.3.1               
[37] tibble_3.2.1               ggplot2_3.5.1             
[39] tidyverse_2.0.0            lubridate_1.9.4           
[41] kableExtra_1.4.0           inspectdf_0.0.12.1        
[43] openxlsx_4.2.7.1           knitr_1.49                

loaded via a namespace (and not attached):
  [1] rstudioapi_0.17.1   jsonlite_1.8.9      magrittr_2.0.3     
  [4] farver_2.1.2        rmarkdown_2.29      vctrs_0.6.5        
  [7] base64enc_0.1-3     blogdown_1.20       htmltools_0.5.8.1  
 [10] progress_1.2.3      curl_6.1.0          TTR_0.24.4         
 [13] sass_0.4.9          parallelly_1.41.0   bslib_0.8.0        
 [16] KernSmooth_2.23-26  htmlwidgets_1.6.4   cachem_1.1.0       
 [19] ggfittext_0.10.2    lifecycle_1.0.4     iterators_1.0.14   
 [22] pkgconfig_2.0.3     Matrix_1.7-2        R6_2.5.1           
 [25] fastmap_1.2.0       digest_0.6.37       colorspace_2.1-1   
 [28] furrr_0.3.1         crosstalk_1.2.1     labeling_0.4.3     
 [31] timechange_0.3.0    compiler_4.4.2      proxy_0.4-27       
 [34] bit64_4.6.0-1       withr_3.0.2         tseries_0.10-58    
 [37] backports_1.5.0     DBI_1.2.3           MASS_7.3-64        
 [40] lava_1.8.1          rappdirs_0.3.3      classInt_0.4-11    
 [43] tibbletime_0.1.9    tools_4.4.2         units_0.8-5        
 [46] lmtest_0.9-40       quantmod_0.4.26     zip_2.3.1          
 [49] future.apply_1.11.3 nnet_7.3-20         glue_1.8.0         
 [52] quadprog_1.5-8      nlme_3.1-166        grid_4.4.2         
 [55] generics_0.1.3      gtable_0.3.6        tzdb_0.4.0         
 [58] class_7.3-23        data.table_1.16.4   hms_1.1.3          
 [61] xml2_1.3.6          pillar_1.10.1       vroom_1.6.5        
 [64] splines_4.4.2       lhs_1.2.0           lattice_0.22-6     
 [67] padr_0.6.3          renv_1.0.7          survival_3.8-3     
 [70] bit_4.5.0.1         tidyselect_1.2.1    urca_1.3-4         
 [73] svglite_2.1.3       forecast_8.23.0     xfun_0.50          
 [76] hardhat_1.4.0       timeDate_4041.110   DT_0.33            
 [79] stringi_1.8.4       DiceDesign_1.10     yaml_2.3.10        
 [82] evaluate_1.0.3      codetools_0.2-20    cli_3.6.3          
 [85] rpart_4.1.24        systemfonts_1.2.1   jquerylib_0.1.4    
 [88] repr_1.1.7          munsell_0.5.1       Rcpp_1.0.14        
 [91] globals_0.16.3      png_0.1-8           parallel_4.4.2     
 [94] fracdiff_1.5-3      assertthat_0.2.1    gower_1.0.2        
 [97] prettyunits_1.2.0   sweep_0.2.5         GPfit_1.0-8        
[100] listenv_0.9.1       viridisLite_0.4.2   ipred_0.9-15       
[103] prodlim_2024.06.25  e1071_1.7-16        crayon_1.5.3       
[106] rlang_1.1.5
