TyT2024W21 - EDA: Carbon Majors Emissions Data
By Johanie Fournier, agr., M.Sc. in rstats tidymodels tidytuesday eda
October 24, 2024
I have worked extensively with spatial data over the past two years, so I decided to select a suitable #TidyTuesday dataset and document what I have learned so far.
My latest contribution to the #TidyTuesday project features a recent dataset on Carbon Majors emissions. The dataset compiles emissions data from 1854 to 2022.
Goal
The overall goal of this blog series is to predict carbon emissions over time and space.
In this first part, the goal is to do some Exploratory Data Analysis (EDA) to look at the data set and summarize the main characteristics. To do so, I will look at the data structure, anomalies, outliers and relationships.
Get the data
Let’s start by reading in the data:
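library(tidyverse)            # dplyr, ggplot2 and readr, used throughout
library(timetk)               # pad_by_time() and plot_time_series()
library(skimr)
library(PerformanceAnalytics)
# Read the TidyTuesday emissions data
emissions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-21/emissions.csv')
skim(emissions)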
Name | emissions |
Number of rows | 12551 |
Number of columns | 7 |
Column type frequency: | |
character | 4 |
numeric | 3 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
parent_entity | 0 | 1 | 2 | 39 | 0 | 122 | 0 |
parent_type | 0 | 1 | 12 | 22 | 0 | 3 | 0 |
commodity | 0 | 1 | 6 | 19 | 0 | 9 | 0 |
production_unit | 0 | 1 | 6 | 18 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 1987.15 | 29.20 | 1854 | 1973.00 | 1994.00 | 2009.00 | 2022.00 | ▁▁▁▅▇ |
production_value | 0 | 1 | 412.71 | 1357.57 | 0 | 10.60 | 63.20 | 320.66 | 27192.00 | ▇▁▁▁▁ |
total_emissions_MtCO2e | 0 | 1 | 113.22 | 329.81 | 0 | 8.79 | 33.06 | 102.15 | 8646.91 | ▇▁▁▁▁ |
chart.Correlation(select_if(emissions, is.numeric))
So, we have a temporal dataset, because there is a year column, with 3 classification columns (parent_entity, parent_type, commodity) and our variable of interest, total_emissions_MtCO2e.
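One quick way to confirm this structure is to tabulate each classification column and check the span of year. The sketch below does this in base R on a small made-up table: the `toy` data frame and all of its values are hypothetical stand-ins, and only the column names come from the dataset.

```r
# Toy stand-in for the emissions table; every value here is invented.
toy <- data.frame(
  parent_entity = c("Entity A", "Entity A", "Entity B", "Entity C"),
  parent_type   = c("type 1", "type 1", "type 2", "type 3"),
  commodity     = c("commodity 1", "commodity 2", "commodity 1", "commodity 3"),
  year          = c(2000, 2000, 2001, 2001),
  total_emissions_MtCO2e = c(10, 5, 20, 40)
)

# Number of rows per level of each classification column
class_cols   <- c("parent_entity", "parent_type", "commodity")
level_counts <- lapply(toy[class_cols], table)

# First and last year covered by the table
year_range <- range(toy$year)

print(level_counts)
print(year_range)
```

Running the same two steps on `emissions` instead of `toy` should show how many rows each entity, type and commodity contributes.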
Trend over time
Is there a general trend over time?
sum_emissions_year<-emissions |>
group_by(year) |>
summarise(sum=sum(total_emissions_MtCO2e)) |>
ungroup()
ggplot(data=sum_emissions_year, aes(x=year, y=sum))+
geom_line()
We can see a clear increase in carbon emissions over time.
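To put a number on a trend like this, one option is a straight-line fit on the yearly totals together with a compound annual growth rate. The sketch below uses a made-up `yearly` table (hypothetical values, same shape as sum_emissions_year); it is only an illustration, not the analysis used in this post.

```r
# Hypothetical yearly totals with the same columns as sum_emissions_year
yearly <- data.frame(
  year = 2000:2009,
  sum  = c(100, 104, 109, 113, 118, 121, 127, 131, 136, 140)
)

# Average change per year from a straight-line fit
fit   <- lm(sum ~ year, data = yearly)
slope <- unname(coef(fit)["year"])

# Compound annual growth rate between the first and last year
n_steps <- nrow(yearly) - 1
cagr    <- (yearly$sum[nrow(yearly)] / yearly$sum[1])^(1 / n_steps) - 1

print(slope)
print(cagr)
```

A positive slope and growth rate would back up what the line chart suggests visually.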
The ultimate goal of this blog series is to predict carbon emissions over time and space and to visualize the result. To get there, we first need to better understand the relationship between parent_entity and total_emissions_MtCO2e.
Space trend
sum_emissions_entity<-emissions |>
group_by(parent_entity) |>
summarise(sum=sum(total_emissions_MtCO2e)) |>
ungroup() |>
arrange(desc(sum))
DT::datatable(sum_emissions_entity) |>
DT::formatRound(columns=c("sum"), digits=0)
It is clear that the parent entities do not all produce the same amount of carbon emissions.
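How uneven the totals are can be summarized with each entity's share of the grand total and the running cumulative share. Here is a minimal base R sketch on a made-up `totals` table (hypothetical entities and values, same columns as sum_emissions_entity).

```r
# Hypothetical per-entity totals with the same columns as sum_emissions_entity
totals <- data.frame(
  parent_entity = c("A", "B", "C", "D"),
  sum           = c(30, 50, 5, 15)
)

# Sort from largest to smallest, then compute shares of the grand total
totals           <- totals[order(-totals$sum), ]
totals$share     <- totals$sum / sum(totals$sum)
totals$cum_share <- cumsum(totals$share)

print(totals)
```

The cum_share column shows how few entities are needed to reach a given fraction of all emissions.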
Spatio-temporal trend
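Can we link the spatial trend to the temporal trend? Let’s find out by looking at the six entities with the highest emissions.
# Keep the entities with the largest cumulative emissions
# (the object names say "top10", but top_n() is called with n = 6)
top10_entity<-sum_emissions_entity |>
top_n(6, sum) |>
select(parent_entity)
# Restrict the raw data to those entities
emissions_top10<-emissions |>
filter(parent_entity %in% top10_entity$parent_entity)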
plot_data<-emissions_top10 |>
group_by(parent_entity, year) |>
summarize(sum=sum(total_emissions_MtCO2e)) |>
ungroup() |>
# as.Date(x, "%Y") takes month and day from the current date, so anchor each year on January 1
mutate(date=as.Date(paste0(year, "-01-01")),
parent_entity_fct=as.factor(parent_entity)) |>
select(parent_entity_fct, date, sum) |>
pad_by_time(date, .by = "year")
plot_data |>
group_by(parent_entity_fct) |>
plot_time_series(
.date_var = date,
.value = sum,
.interactive = FALSE,
.facet_ncol = 2,
.facet_scales = "free"
)
Each parent_entity has its own trend over time.
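A simple way to compare these trends numerically is to fit one straight line per entity and look at the slopes. The sketch below does that in base R on a made-up `series` table (hypothetical entities and values). A linear slope is a crude summary and an assumption of this sketch, not the method used in this post.

```r
# Hypothetical yearly totals for two entities
series <- data.frame(
  parent_entity = rep(c("A", "B"), each = 5),
  year          = rep(2000:2004, times = 2),
  sum           = c(10, 12, 14, 16, 18,
                    50, 45, 40, 35, 30)
)

# One straight-line fit per entity; keep only the slope (change per year)
slopes <- sapply(
  split(series, series$parent_entity),
  function(d) unname(coef(lm(sum ~ year, data = d))["year"])
)

print(slopes)
```

Entities with slopes of opposite sign or very different size are the ones whose panels look most different in the faceted plot.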
Anomalies and outliers
library(anomalize)
plot_data |>
group_by(parent_entity_fct) |>
time_decompose(sum) |>
anomalize(remainder) |>
plot_anomalies(size_dots = 1, ncol = 2)
plot_data |>
filter(parent_entity_fct=="Saudi Aramco") |>
time_decompose(sum) |>
anomalize(remainder) |>
plot_anomaly_decomposition()
For simplicity, I will replace the detected anomalies with the trend value across all the data. All subsequent analyses will use the corrected data for the top 50 entities.
# Keep the 50 entities with the largest cumulative emissions
top50_entity<-sum_emissions_entity |>
top_n(50, sum) |>
select(parent_entity)
final_data<-emissions |>
filter(parent_entity %in% top50_entity$parent_entity) |>
group_by(parent_entity, year) |>
summarize(sum=sum(total_emissions_MtCO2e)) |>
ungroup() |>
# as.Date(x, "%Y") takes month and day from the current date, so anchor each year on January 1
mutate(date=as.Date(paste0(year, "-01-01")),
parent_entity_fct=as.factor(parent_entity)) |>
select(parent_entity_fct, date, sum) |>
# %ni% is not defined by the packages loaded above; negate %in% instead
filter(!parent_entity_fct %in% c("Seriti Resources", "CNX Resources", "Navajo Transitional Energy Company"))|>
pad_by_time(date,
.by = "year",
.pad_value = NA) |>
group_by(parent_entity_fct) |>
time_decompose(sum) |>
anomalize(remainder) |>
# Replace flagged observations with the trend component
mutate(observed=case_when(anomaly=="Yes" ~ trend,
TRUE ~ observed)) |>
select(parent_entity_fct, date, observed)
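The logic of this correction step can be sketched in base R on a single toy series: estimate a trend, flag points whose remainder falls outside IQR-based fences, and replace those points with the trend. This is a simplified stand-in rather than what anomalize does internally: the straight-line trend, the fence factor of 3 and all the data values are assumptions made for illustration.

```r
# Toy yearly series: a linear rise with small wiggles and one injected spike
year     <- 2000:2020
wiggle   <- rep(c(0.5, -0.3, 0.1, -0.4, 0.2), length.out = length(year))
observed <- 10 + seq_along(year) + wiggle
observed[11] <- observed[11] + 25   # injected anomaly in 2010

# Stand-in trend: a straight-line fit (an assumption of this sketch;
# time_decompose() estimates the trend with a decomposition instead)
trend     <- unname(fitted(lm(observed ~ year)))
remainder <- observed - trend

# IQR fences on the remainder; the factor 3 is an assumption of this sketch
q     <- quantile(remainder, c(0.25, 0.75), names = FALSE)
fence <- 3 * (q[2] - q[1])
is_anomaly <- remainder < q[1] - fence | remainder > q[2] + fence

# Replace flagged observations with the trend, keep the rest untouched
cleaned <- ifelse(is_anomaly, trend, observed)

print(year[is_anomaly])
```

Only the injected spike is replaced here. Every other point keeps its observed value, which is the same behaviour as the case_when() call above.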
Conclusion
In this first part, we explored the dataset and identified its main characteristics. We saw that carbon emissions have increased over time and that the top 50 entities follow different trends. We also identified some anomalies and outliers, which have been corrected for the work to come in the next part.
Session Info
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] anomalize_0.3.0 PerformanceAnalytics_2.0.4
[3] xts_0.12.1 zoo_1.8-12
[5] jofou.lib_0.0.0.9000 reticulate_1.37.0
[7] tidytuesdayR_1.0.2 tictoc_1.2.1
[9] terra_1.6-17 sf_1.0-5
[11] pins_1.0.1.9000 fs_1.5.2
[13] timetk_2.6.1 yardstick_1.2.0
[15] workflowsets_0.1.0 workflows_0.2.4
[17] tune_0.1.6 rsample_0.1.0
[19] parsnip_1.1.1 modeldata_0.1.1
[21] infer_1.0.0 dials_0.0.10
[23] scales_1.2.1 broom_1.0.4
[25] tidymodels_0.1.4 recipes_0.1.17
[27] doFuture_0.12.0 future_1.22.1
[29] foreach_1.5.1 skimr_2.1.5
[31] forcats_1.0.0 stringr_1.5.0
[33] dplyr_1.1.2 purrr_1.0.1
[35] readr_2.1.4 tidyr_1.3.0
[37] tibble_3.2.1 ggplot2_3.4.2
[39] tidyverse_2.0.0 lubridate_1.9.2
[41] kableExtra_1.3.4.9000 inspectdf_0.0.11
[43] openxlsx_4.2.4 knitr_1.36
loaded via a namespace (and not attached):
[1] readxl_1.4.2 backports_1.4.1 systemfonts_1.0.3
[4] lazyeval_0.2.2 repr_1.1.7 splines_4.1.1
[7] crosstalk_1.1.1 listenv_0.8.0 usethis_2.0.1
[10] digest_0.6.29 htmltools_0.5.8.1 fansi_0.5.0
[13] magrittr_2.0.3 tzdb_0.1.2 globals_0.14.0
[16] ggfittext_0.9.1 gower_0.2.2 vroom_1.6.0
[19] svglite_2.0.0 hardhat_1.3.0 timechange_0.1.1
[22] tseries_0.10-48 forecast_8.15 prettyunits_1.1.1
[25] colorspace_2.0-2 rvest_1.0.3 rappdirs_0.3.3
[28] xfun_0.39 crayon_1.4.2 jsonlite_1.8.4
[31] survival_3.2-11 iterators_1.0.13 glue_1.6.2
[34] gtable_0.3.0 ipred_0.9-12 webshot_0.5.2
[37] future.apply_1.8.1 quantmod_0.4.18 padr_0.6.0
[40] DBI_1.1.1 Rcpp_1.0.13 viridisLite_0.4.0
[43] progress_1.2.2 units_0.7-2 GPfit_1.0-8
[46] bit_4.0.4 proxy_0.4-26 tibbletime_0.1.8
[49] DT_0.19 lava_1.6.10 prodlim_2019.11.13
[52] htmlwidgets_1.5.4 httr_1.4.6 farver_2.1.0
[55] pkgconfig_2.0.3 sass_0.4.0 nnet_7.3-16
[58] utf8_1.2.2 labeling_0.4.2 tidyselect_1.2.0
[61] rlang_1.1.1 DiceDesign_1.9 munsell_0.5.0
[64] cellranger_1.1.0 tools_4.1.1 cli_3.6.1
[67] sweep_0.2.5 generics_0.1.3 evaluate_0.14
[70] fastmap_1.2.0 yaml_2.2.1 bit64_4.0.5
[73] zip_2.2.0 nlme_3.1-152 xml2_1.3.4
[76] compiler_4.1.1 rstudioapi_0.14 plotly_4.10.0
[79] curl_5.2.3 png_0.1-7 e1071_1.7-9
[82] lhs_1.1.3 bslib_0.3.1 stringi_1.7.5
[85] highr_0.9 lattice_0.20-44 Matrix_1.3-4
[88] classInt_0.4-3 urca_1.3-0 vctrs_0.6.5
[91] pillar_1.9.0 lifecycle_1.0.3 furrr_0.2.3
[94] lmtest_0.9-38 jquerylib_0.1.4 data.table_1.14.2
[97] R6_2.5.1 renv_1.0.7 KernSmooth_2.23-20
[100] parallelly_1.28.1 codetools_0.2-18 assertthat_0.2.1
[103] MASS_7.3-54 withr_2.5.0 fracdiff_1.5-1
[106] parallel_4.1.1 hms_1.1.3 quadprog_1.5-8
[109] grid_4.1.1 rpart_4.1-15 timeDate_3043.102
[112] class_7.3-19 rmarkdown_2.25 TTR_0.24.2
[115] base64enc_0.1-3