TyT2024W21 - ML:Carbon Majors Emissions Data

By Johanie Fournier, agr., M.Sc.

October 25, 2024

This is my latest contribution to the #TidyTuesday dataset project, featuring a recent dataset on carbon major emissions.

In the first part, the goal was to do some Exploratory Data Analysis (EDA). To do so, I looked at the data structure, anomalies, outliers, and relationships to make sure I had everything I needed for this second part.

In this second part, I will predict carbon emissions over space and time, using machine learning to forecast future emissions from historical data. I will use the spdep package to build a spatial weights matrix, and the modeltime package to train the models and generate forecasts. Let's get started!

Data Preprocessing

The first step here is to add GPS coordinates for all 50 parent_entity values in the dataset. To do so, we first need to find the headquarters location of each company.

Get Headquarters Location

# Load the data
data<-read_rds("data/final_data.rds")
head(data)
# A tibble: 6 × 3
  parent_entity_fct              date       observed
  <fct>                          <date>        <dbl>
1 Abu Dhabi National Oil Company 1962-10-24    0.498
2 Abu Dhabi National Oil Company 1963-10-24    1.05 
3 Abu Dhabi National Oil Company 1964-10-24    4.17 
4 Abu Dhabi National Oil Company 1965-10-24    6.19 
5 Abu Dhabi National Oil Company 1966-10-24    7.56 
6 Abu Dhabi National Oil Company 1967-10-24    8.00 
parent_entity<-data|>  
  mutate(parent_entity = as.character(parent_entity_fct))|> 
  select(parent_entity)|>  
  unique()

DT::datatable(parent_entity)
import pandas as pd
import requests

# Load the parent_entity data frame from R (reticulate exposes it as r.parent_entity)
data = r.parent_entity

# Function to get headquarters from Wikidata
def get_wikidata_headquarters(company):
    try:
        # Search for the company in Wikidata
        search_url = f"https://www.wikidata.org/w/api.php?action=wbsearchentities&search={company}&language=en&format=json"
        search_response = requests.get(search_url).json()
        
        if search_response['search']:
            # Get the Wikidata ID of the first search result
            wikidata_id = search_response['search'][0]['id']

            # Get detailed information from Wikidata using the ID
            entity_url = f"https://www.wikidata.org/wiki/Special:EntityData/{wikidata_id}.json"
            entity_response = requests.get(entity_url).json()

            # Extract the headquarters location
            entity_data = entity_response['entities'].get(wikidata_id, {})
            claims = entity_data.get('claims', {})
            headquarters_claim = claims.get('P159', [])
            
            if headquarters_claim:
                # Get the headquarters location label
                location_id = headquarters_claim[0]['mainsnak']['datavalue']['value']['id']
                location_url = f"https://www.wikidata.org/wiki/Special:EntityData/{location_id}.json"
                location_response = requests.get(location_url).json()

                location_data = location_response['entities'].get(location_id, {})
                location_label = location_data['labels'].get('en', {}).get('value', 'Unknown')
                
                return location_label
        return 'Headquarters not found'
    except Exception as e:
        return f'Error: {str(e)}'

# Assuming 'data' is your dataframe and 'parent_entity' is the column containing company names
data['headquarters'] = data['parent_entity'].apply(get_wikidata_headquarters)

# Display the updated dataframe
data.head()
                    parent_entity            headquarters
0  Abu Dhabi National Oil Company               Abu Dhabi
1   Alpha Metallurgical Resources  Headquarters not found
2                  Anglo American                  London
3                  Arch Resources               St. Louis
4                             BHP  Headquarters not found
# Complete missing headquarters
headquarters<-py$data |> 
  mutate(headquarters=case_when(
    parent_entity=="Alpha Metallurgical Resources" ~ "Bristol, Tennessee, États-Unis",
    parent_entity=="BHP" ~ "Melbourne, Ausralie",
    parent_entity=="BP" ~ "Londres, Royaume-Uni",
    parent_entity=="British Coal Corporation" ~ "Hobart House, Grosvenor Place, London",
    parent_entity=="Chevron" ~ "San Ramon, Californie, États-Unis",
    parent_entity=="China (Cement)" ~ "China",
    parent_entity=="China (Coal)" ~ "Beijing, Chine",
    parent_entity=="Czechoslovakia" ~ "Czech Republic",
    parent_entity=="Eni" ~ "Rpme, Italie",
    parent_entity=="Former Soviet Union" ~ "Moscou, Russie",
    parent_entity=="Kazakhstan" ~ "Astana, Kazakhstan",
    parent_entity=="Kuwait Petroleum Corp." ~ "Kuwait City, Koweït",
    parent_entity=="Libya National Oil Corp." ~ "Tripoli, Libye",
    parent_entity=="National Iranian Oil Co." ~ "Tehran, Iran",
    parent_entity=="Nigerian National Petroleum Corp." ~ "Abuja, Nigeria",
    parent_entity=="ONGC India" ~ "India",
    parent_entity=="Peabody Coal Group" ~ "Saint-Louis, Missouri, États-Unis",
    parent_entity=="Poland" ~ "Poland",
    parent_entity=="Russian Federation" ~ "Russie",
    parent_entity=="Shell" ~ "Londres, Royaume-Uni",
    parent_entity=="Ukraine" ~ "Ukraine",
    TRUE ~ headquarters
  ))

Get GPS Coordinates

Now that we have the headquarters locations, we can extract the GPS coordinates for each location with the ggmap package (this requires a Google Maps API key).

# Get GPS coordinates with ggmap
ggmap::register_google("Insert_your_key_here")
geocoded_data<-ggmap::geocode(headquarters$headquarters) |> 
  bind_cols(headquarters) |> 
    drop_na()

write_rds(geocoded_data, "data/geocoded_data.rds")
#Load my data
geocoded_data<-read_rds("data/geocoded_data.rds")
head(geocoded_data)
# A tibble: 6 × 4
      lon   lat parent_entity                  headquarters                  
    <dbl> <dbl> <chr>                          <chr>                         
1  54.4    24.5 Abu Dhabi National Oil Company Abu Dhabi                     
2 -82.2    36.6 Alpha Metallurgical Resources  Bristol, Tennessee, États-Unis
3  -0.128  51.5 Anglo American                 London                        
4 -90.2    38.6 Arch Resources                 St. Louis                     
5 145.    -37.8 BHP                            Melbourne, Australie          
6  -0.128  51.5 BP                             Londres, Royaume-Uni          

Let’s put it all together!

data_final<-data |> 
  mutate(parent_entity = as.character(parent_entity_fct)) |>
  right_join(geocoded_data, by=c("parent_entity")) |> 
  select(-parent_entity_fct, -headquarters)
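
Because right_join() keeps only rows that match an entity in geocoded_data (and drop_na() removed any failed geocodes), it is worth confirming that no entities were lost along the way. A minimal check, as a sketch:

# Entities before and after adding coordinates; the counts should match
n_distinct(data$parent_entity_fct)
n_distinct(data_final$parent_entity)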

Create a Spatial Weights Matrix

We need some spatial information, extracted from the coordinates, to feed into the temporal model. We will use the GPS coordinates to build a k-nearest neighbors spatial weights matrix, then use that matrix to compute a spatial lag of the observed emissions, which will serve as a predictor in the machine learning models.

# Extract coordinates
coords <- data_final |> 
  select(lon, lat)

# Create k-nearest neighbors spatial weights matrix (example with k = 4)
knn <- spdep::knearneigh(coords, k = 4)
nb <- spdep::knn2nb(knn)
listw_knn <- spdep::nb2listw(nb, style = "W")

# Compute spatial lag using the weights matrix
spatial_lag <- spdep::lag.listw(listw_knn, data_final$observed)

# Add the spatial lag
data_spatial <- data_final |> 
  mutate(spatial_lag = spatial_lag) |> 
  select(-lon, -lat)

# Select six entities to visualize
top6_entity <- data_spatial |>
  select(parent_entity) |>
  unique() |> 
  top_n(6) 

data_spatial |> 
    group_by(parent_entity) |> 
    filter(parent_entity %in% top6_entity$parent_entity) |> 
    plot_time_series(date, observed,
                     .facet_ncol = 2,
                     .smooth = FALSE, 
                     .interactive = FALSE)
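
Before moving on, a quick sanity check on the spatial lag: with style = "W" the weights are row-standardized, so the lag for any observation is simply the average of the observed values at its 4 nearest neighbors. A minimal sketch, using the objects created above:

# With row-standardized weights, the spatial lag of the first observation
# should equal the mean of the observed values at its 4 nearest neighbors
neighbour_idx <- nb[[1]]
manual_lag <- mean(data_final$observed[neighbour_idx])
all.equal(manual_lag, spatial_lag[1]) # should return TRUE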

Modeling

Now that we have the spatial information, we can start training machine learning models to predict future emissions. We will use the modeltime package to train and evaluate the models.

Extend, Nest, and Split the Data

First, we need to extend each time series to include future observations for the next 20 years, then nest the data by parent entity, and finally split it into training and testing sets.

nested_data_tbl <- data_spatial|> 
    group_by(parent_entity)|> 
  #Extend
    extend_timeseries(
        .id_var = parent_entity,
        .date_var = date,
        .length_future = 20
    )|> 
  #Nest
    nest_timeseries(
        .id_var = parent_entity,
        .length_future = 20
    )|> 
  #Split
    split_nested_timeseries(
        .length_test = 20
    )
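
If you are new to nested modeltime tables, it helps to peek at the structure before modeling: each row of nested_data_tbl is one parent_entity, with list-columns holding the actual data, the extended future frame, and the train/test split. A quick look, as a sketch:

# Each row is one entity; the series, future frame, and split live in list-columns
nested_data_tbl |> glimpse()

# Peek at the training split of the first entity
extract_nested_train_split(nested_data_tbl) |> head()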

Model Training

Next, we will train machine learning models to predict future emissions. We will train two XGBoost models with different learning rates and a Temporal Hierarchical Forecasting (THief) model.

# Recipe
rec_xgb <- recipe(observed ~ ., extract_nested_train_split(nested_data_tbl)) |> 
    step_timeseries_signature(date) |> 
    step_rm(date) |> 
    step_zv(all_predictors()) |> 
    step_dummy(all_nominal_predictors(), one_hot = TRUE)

# Models 
wflw_xgb_1 <- workflow()|> 
    add_model(boost_tree("regression", learn_rate = 0.35)|>  set_engine("xgboost"))|> 
    add_recipe(rec_xgb)

wflw_xgb_2 <- workflow()|> 
    add_model(boost_tree("regression", learn_rate = 0.50)|>  set_engine("xgboost"))|> 
    add_recipe(rec_xgb)

wflw_thief <- workflow()|> 
    add_model(temporal_hierarchy()|>  set_engine("thief"))|> 
    add_recipe(recipe(observed ~ ., extract_nested_train_split(nested_data_tbl)))


# Nested Modeling
parallel_start(6)
nested_modeltime_tbl <- nested_data_tbl|> 
    modeltime_nested_fit(
        model_list = list(
            wflw_xgb_1,
            wflw_xgb_2,
            wflw_thief
        ),
        control = control_nested_fit(
            verbose   = TRUE,
            allow_par = TRUE
        )
    )
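
A quick way to double-check the feature engineering is to prep and bake rec_xgb on its training split: this shows the calendar features created by step_timeseries_signature(), the one-hot entity dummies, and the spatial lag that the XGBoost models actually see. A minimal sketch:

# Inspect the predictors the recipe produces on the training split
rec_xgb |>
    prep() |>
    bake(new_data = NULL) |>
    glimpse()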

Review Errors

Before selecting the best model, we need to review any errors that occurred during the training process. We will also check the accuracy of the models and investigate any anomalies.

# #1) list errors
# errors<-nested_modeltime_tbl|> 
#   extract_nested_error_report()
# 
# #2) Spot too high accuracy
# nested_modeltime_tbl |> 
#     extract_nested_test_accuracy() |> 
#     table_modeltime_accuracy()
# 
# #3) Investigate
# nested_modeltime_tbl|> 
#     filter(parent_entity == "Lukoil") |> 
#     extract_nested_train_split()
# 
# nested_modeltime_tbl|> 
#     extract_nested_test_forecast() |> 
#     filter(parent_entity == "Former Soviet Union") |> 
#     group_by(parent_entity) |> 
#     plot_modeltime_forecast(.facet_ncol = 3,
#                             .interactive = FALSE)

#4) remove small time series
ids_small_timeseries <- c("Lukoil", "Rosneft", "Glencore","Surgutneftegas", "Ukraine")

nested_modeltime_subset_tbl <- nested_modeltime_tbl|> 
    filter(!parent_entity %in% ids_small_timeseries)

Select Best Models

Next, we will select the best models based on the root mean square error (RMSE) metric. We will visualize the best models to compare their performance and identify any anomalies.

nested_best_tbl <- nested_modeltime_subset_tbl|> 
    modeltime_nested_select_best(metric = "rmse")

nested_best_tbl |> 
    extract_nested_test_forecast() |> 
    filter(parent_entity %in% c("CONSOL Energy", "Shell", "Former Soviet Union", "Anglo American","Chevron","BP")) |> 
    group_by(parent_entity) |> 
    plot_modeltime_forecast(.facet_ncol = 2, 
                            .legend_show = FALSE,
                            .interactive = FALSE)
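
To see which of the three workflows was selected for each entity, modeltime also provides extract_nested_best_model_report(); a quick sketch:

# Which model won for each entity, with its test-set accuracy metrics
nested_best_tbl |>
    extract_nested_best_model_report()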

Refit Models

Finally, we will refit the best models on the entire dataset to make predictions for future emissions. We will also review any errors that occurred during the refitting process and visualize the future forecasts for the selected companies.

# Refit
nested_best_refit_tbl <- nested_best_tbl|> 
    modeltime_nested_refit(
        control = control_refit(
            verbose   = TRUE,
            allow_par = TRUE
        )
    )

# Error last check
nested_best_refit_tbl|>  extract_nested_error_report()
# A tibble: 0 × 4
# ℹ 4 variables: parent_entity <chr>, .model_id <int>, .model_desc <chr>,
#   .error_desc <chr>
# Visualize Future Forecast 
nested_best_refit_tbl|> 
    extract_nested_future_forecast()|> 
    filter(parent_entity %in% c("CONSOL Energy", "Shell", "Former Soviet Union", "Anglo American", "Chevron", "BP")) |> 
    group_by(parent_entity)|> 
    plot_modeltime_forecast(.facet_ncol = 2, 
                            .legend_show = FALSE,
                            .interactive = FALSE)

Save Predictions

Lastly, we save the future emission predictions to a file for further analysis and visualization.

nested_best_refit_tbl |> 
    extract_nested_future_forecast() |> 
    write_rds("data/data_with_pred.rds")

Conclusion

In this post, I have shown how to predict carbon emissions over space and time using the Carbon Majors dataset: preprocessing the data, building a spatial weights matrix, training machine learning models to forecast future emissions, and evaluating their performance. This is just one example of how machine learning can be used to analyze environmental data and anticipate future trends. I hope this post has been helpful and informative, and that it inspires you to explore the Carbon Majors dataset further and apply machine learning techniques to other environmental datasets.

Session Info

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] jofou.lib_0.0.0.9000  reticulate_1.37.0     tidytuesdayR_1.0.2   
 [4] tictoc_1.2.1          terra_1.6-17          sf_1.0-5             
 [7] pins_1.0.1.9000       modeltime_1.0.0       fs_1.5.2             
[10] timetk_2.6.1          yardstick_1.2.0       workflowsets_0.1.0   
[13] workflows_0.2.4       tune_0.1.6            rsample_0.1.0        
[16] parsnip_1.1.1         modeldata_0.1.1       infer_1.0.0          
[19] dials_0.0.10          scales_1.2.1          broom_1.0.4          
[22] tidymodels_0.1.4      recipes_0.1.17        doFuture_0.12.0      
[25] future_1.22.1         foreach_1.5.1         skimr_2.1.5          
[28] forcats_1.0.0         stringr_1.5.0         dplyr_1.1.2          
[31] purrr_1.0.1           readr_2.1.4           tidyr_1.3.0          
[34] tibble_3.2.1          ggplot2_3.4.2         tidyverse_2.0.0      
[37] lubridate_1.9.2       kableExtra_1.3.4.9000 inspectdf_0.0.11     
[40] openxlsx_4.2.4        knitr_1.36           

loaded via a namespace (and not attached):
  [1] readxl_1.4.2         backports_1.4.1      systemfonts_1.0.3   
  [4] repr_1.1.7           sp_1.4-5             splines_4.1.1       
  [7] listenv_0.8.0        usethis_2.0.1        digest_0.6.29       
 [10] htmltools_0.5.8.1    fansi_0.5.0          magrittr_2.0.3      
 [13] doParallel_1.0.16    tzdb_0.1.2           globals_0.14.0      
 [16] ggfittext_0.9.1      gower_0.2.2          RcppParallel_5.1.4  
 [19] xts_0.12.1           svglite_2.0.0        hardhat_1.3.0       
 [22] timechange_0.1.1     prettyunits_1.1.1    colorspace_2.0-2    
 [25] rvest_1.0.3          rappdirs_0.3.3       xfun_0.39           
 [28] crayon_1.4.2         jsonlite_1.8.4       survival_3.2-11     
 [31] zoo_1.8-12           iterators_1.0.13     glue_1.6.2          
 [34] gtable_0.3.0         ipred_0.9-12         webshot_0.5.2       
 [37] future.apply_1.8.1   DBI_1.1.1            Rcpp_1.0.13         
 [40] spData_2.3.1         viridisLite_0.4.0    progress_1.2.2      
 [43] units_0.7-2          spdep_1.3-5          GPfit_1.0-8         
 [46] proxy_0.4-26         lava_1.6.10          StanHeaders_2.21.0-7
 [49] prodlim_2019.11.13   httr_1.4.6           ellipsis_0.3.2      
 [52] wk_0.6.0             farver_2.1.0         pkgconfig_2.0.3     
 [55] deldir_2.0-4         nnet_7.3-16          utf8_1.2.2          
 [58] labeling_0.4.2       tidyselect_1.2.0     rlang_1.1.1         
 [61] DiceDesign_1.9       munsell_0.5.0        cellranger_1.1.0    
 [64] tools_4.1.1          xgboost_1.4.1.1      cli_3.6.1           
 [67] generics_0.1.3       evaluate_0.14        fastmap_1.2.0       
 [70] yaml_2.2.1           zip_2.2.0            s2_1.0.7            
 [73] xml2_1.3.4           compiler_4.1.1       rstudioapi_0.14     
 [76] curl_5.2.3           png_0.1-7            e1071_1.7-9         
 [79] lhs_1.1.3            stringi_1.7.5        lattice_0.20-44     
 [82] Matrix_1.3-4         classInt_0.4-3       vctrs_0.6.5         
 [85] pillar_1.9.0         lifecycle_1.0.3      furrr_0.2.3         
 [88] data.table_1.14.2    R6_2.5.1             renv_1.0.7          
 [91] KernSmooth_2.23-20   parallelly_1.28.1    codetools_0.2-18    
 [94] boot_1.3-28          MASS_7.3-54          withr_2.5.0         
 [97] anomalize_0.3.0      parallel_4.1.1       hms_1.1.3           
[100] grid_4.1.1           rpart_4.1-15         timeDate_3043.102   
[103] class_7.3-19         rmarkdown_2.25       base64enc_0.1-3     