B.6 Bases de datos con dplyr
Para poder usar tablas en bases de datos relacionales con dplyr hay que emplear el paquete dbplyr (convierte automáticamente el código de dplyr en consultas SQL).
Algunos enlaces:
B.6.1 Ejemplos
Como ejemplo emplearemos la base de datos de SQLite Sample Database Tutorial, almacenada en el archivo chinook.db.
# install.packages('dbplyr')
library(dplyr)
library(dbplyr)En primer lugar hay que conectar la base de datos:
chinook <- DBI::dbConnect(RSQLite::SQLite(), "datos/chinook.db")Podemos listar las tablas:
src_dbi(chinook)## src: sqlite 3.36.0 [D:\OneDrive - Universidade da Coruña\__Actual\__IA\_intror\datos\chinook.db]
## tbls: albums, artists, customers, employees, genres, invoice_items, invoices,
## media_types, playlist_track, playlists, sqlite_sequence, sqlite_stat1, tracks
Para enlazar una tabla:
invoices <- tbl(chinook, "invoices")
invoices## # Source: table<invoices> [?? x 9]
## # Database: sqlite 3.36.0 [D:\OneDrive - Universidade da
## # Coruña\__Actual\__IA\_intror\datos\chinook.db]
## InvoiceId CustomerId InvoiceD~1 Billi~2 Billi~3 Billi~4 Billi~5 Billi~6 Total
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1 2 2009-01-0~ Theodo~ Stuttg~ <NA> Germany 70174 1.98
## 2 2 4 2009-01-0~ Ullevå~ Oslo <NA> Norway 0171 3.96
## 3 3 8 2009-01-0~ Grétry~ Brusse~ <NA> Belgium 1000 5.94
## 4 4 14 2009-01-0~ 8210 1~ Edmont~ AB Canada T6G 2C7 8.91
## 5 5 23 2009-01-1~ 69 Sal~ Boston MA USA 2113 13.9
## 6 6 37 2009-01-1~ Berger~ Frankf~ <NA> Germany 60316 0.99
## 7 7 38 2009-02-0~ Barbar~ Berlin <NA> Germany 10779 1.98
## 8 8 40 2009-02-0~ 8, Rue~ Paris <NA> France 75002 1.98
## 9 9 42 2009-02-0~ 9, Pla~ Bordea~ <NA> France 33000 3.96
## 10 10 46 2009-02-0~ 3 Chat~ Dublin Dublin Ireland <NA> 5.94
## # ... with more rows, and abbreviated variable names 1: InvoiceDate,
## # 2: BillingAddress, 3: BillingCity, 4: BillingState, 5: BillingCountry,
## # 6: BillingPostalCode
Ojo [?? x 9]: de momento no conoce el número de filas.
nrow(invoices)## [1] NA
Podemos mostrar la consulta SQL correspondiente a una operación:
show_query(head(invoices))## <SQL>
## SELECT *
## FROM `invoices`
## LIMIT 6
str(head(invoices))## List of 2
## $ src:List of 2
## ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
## .. .. ..@ ptr :<externalptr>
## .. .. ..@ dbname : chr "D:\\OneDrive - Universidade da Coruña\\__Actual\\__IA\\_intror\\datos\\chinook.db"
## .. .. ..@ loadable.extensions: logi TRUE
## .. .. ..@ flags : int 70
## .. .. ..@ vfs : chr ""
## .. .. ..@ ref :<environment: 0x00000000137eb818>
## .. .. ..@ bigint : chr "integer64"
## .. .. ..@ extended_types : logi FALSE
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:4] "src_SQLiteConnection" "src_dbi" "src_sql" "src"
## $ ops:List of 4
## ..$ name: chr "head"
## ..$ x :List of 2
## .. ..$ x : 'ident' chr "invoices"
## .. ..$ vars: chr [1:9] "InvoiceId" "CustomerId" "InvoiceDate" "BillingAddress" ...
## .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## ..$ dots: list()
## ..$ args:List of 1
## .. ..$ n: num 6
## ..- attr(*, "class")= chr [1:3] "op_head" "op_single" "op"
## - attr(*, "class")= chr [1:5] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
Al trabajar con bases de datos, dplyr intenta ser lo más vago posible:
No exporta datos a R a menos que se pida explícitamente (
colect()).Retrasa cualquier operación lo máximo posible: agrupa todo lo que se desea hacer y luego hace una única petición a la base de datos.
invoices %>% head %>% collect## # A tibble: 6 x 9
## InvoiceId CustomerId InvoiceDate Billi~1 Billi~2 Billi~3 Billi~4 Billi~5 Total
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1 2 2009-01-01~ Theodo~ Stuttg~ <NA> Germany 70174 1.98
## 2 2 4 2009-01-02~ Ullevå~ Oslo <NA> Norway 0171 3.96
## 3 3 8 2009-01-03~ Grétry~ Brusse~ <NA> Belgium 1000 5.94
## 4 4 14 2009-01-06~ 8210 1~ Edmont~ AB Canada T6G 2C7 8.91
## 5 5 23 2009-01-11~ 69 Sal~ Boston MA USA 2113 13.9
## 6 6 37 2009-01-19~ Berger~ Frankf~ <NA> Germany 60316 0.99
## # ... with abbreviated variable names 1: BillingAddress, 2: BillingCity,
## # 3: BillingState, 4: BillingCountry, 5: BillingPostalCode
invoices %>% count # número de filas## # Source: lazy query [?? x 1]
## # Database: sqlite 3.36.0 [D:\OneDrive - Universidade da
## # Coruña\__Actual\__IA\_intror\datos\chinook.db]
## n
## <int>
## 1 412
Por ejemplo, para obtener el importe mínimo, máximo y la media de las facturas:
res <- invoices %>% summarise(min = min(Total, na.rm = TRUE),
max = max(Total, na.rm = TRUE), med = mean(Total, na.rm = TRUE))
show_query(res)## <SQL>
## SELECT MIN(`Total`) AS `min`, MAX(`Total`) AS `max`, AVG(`Total`) AS `med`
## FROM `invoices`
res %>% collect## # A tibble: 1 x 3
## min max med
## <dbl> <dbl> <dbl>
## 1 0.99 25.9 5.65
Para obtener el total de las facturas de cada uno de los países:
res <- invoices %>% group_by(BillingCountry) %>%
summarise(n = n(), total = sum(Total, na.rm = TRUE))
show_query(res)## <SQL>
## SELECT `BillingCountry`, COUNT(*) AS `n`, SUM(`Total`) AS `total`
## FROM `invoices`
## GROUP BY `BillingCountry`
res %>% collect## # A tibble: 24 x 3
## BillingCountry n total
## <chr> <int> <dbl>
## 1 Argentina 7 37.6
## 2 Australia 7 37.6
## 3 Austria 7 42.6
## 4 Belgium 7 37.6
## 5 Brazil 35 190.
## 6 Canada 56 304.
## 7 Chile 7 46.6
## 8 Czech Republic 14 90.2
## 9 Denmark 7 37.6
## 10 Finland 7 41.6
## # ... with 14 more rows
Para obtener el listado de países junto con su facturación media, ordenado alfabéticamente por país:
res <- invoices %>% group_by(BillingCountry) %>%
summarise(n = n(), med = mean(Total, na.rm = TRUE)) %>%
arrange(BillingCountry)
show_query(res)## <SQL>
## SELECT `BillingCountry`, COUNT(*) AS `n`, AVG(`Total`) AS `med`
## FROM `invoices`
## GROUP BY `BillingCountry`
## ORDER BY `BillingCountry`
res %>% collect## # A tibble: 24 x 3
## BillingCountry n med
## <chr> <int> <dbl>
## 1 Argentina 7 5.37
## 2 Australia 7 5.37
## 3 Austria 7 6.09
## 4 Belgium 7 5.37
## 5 Brazil 35 5.43
## 6 Canada 56 5.43
## 7 Chile 7 6.66
## 8 Czech Republic 14 6.45
## 9 Denmark 7 5.37
## 10 Finland 7 5.95
## # ... with 14 more rows
Si el resultado lo queremos en orden decreciente por importe de facturación media:
invoices %>% group_by(BillingCountry) %>%
summarise(n = n(), med = mean(Total, na.rm = TRUE)) %>%
arrange(desc(med)) %>% collect## # A tibble: 24 x 3
## BillingCountry n med
## <chr> <int> <dbl>
## 1 Chile 7 6.66
## 2 Ireland 7 6.52
## 3 Hungary 7 6.52
## 4 Czech Republic 14 6.45
## 5 Austria 7 6.09
## 6 Finland 7 5.95
## 7 Netherlands 7 5.80
## 8 India 13 5.79
## 9 USA 91 5.75
## 10 Norway 7 5.66
## # ... with 14 more rows
Para obtener un listado con Nombre y Apellidos de cliente y el importe de cada una de sus facturas (Hint: WHERE customer.CustomerID=invoices.CustomerID):
customers <- tbl(chinook, "customers")
tbl_vars(customers) ## <dplyr:::vars>
## [1] "CustomerId" "FirstName" "LastName" "Company" "Address"
## [6] "City" "State" "Country" "PostalCode" "Phone"
## [11] "Fax" "Email" "SupportRepId"
res <- customers %>% inner_join(invoices, by = "CustomerId") %>% select(FirstName, LastName, Country, Total)
show_query(res)## <SQL>
## SELECT `FirstName`, `LastName`, `Country`, `Total`
## FROM (SELECT `LHS`.`CustomerId` AS `CustomerId`, `FirstName`, `LastName`, `Company`, `Address`, `City`, `State`, `Country`, `PostalCode`, `Phone`, `Fax`, `Email`, `SupportRepId`, `InvoiceId`, `InvoiceDate`, `BillingAddress`, `BillingCity`, `BillingState`, `BillingCountry`, `BillingPostalCode`, `Total`
## FROM `customers` AS `LHS`
## INNER JOIN `invoices` AS `RHS`
## ON (`LHS`.`CustomerId` = `RHS`.`CustomerId`)
## )
res %>% collect## # A tibble: 412 x 4
## FirstName LastName Country Total
## <chr> <chr> <chr> <dbl>
## 1 Luís Gonçalves Brazil 3.98
## 2 Luís Gonçalves Brazil 3.96
## 3 Luís Gonçalves Brazil 5.94
## 4 Luís Gonçalves Brazil 0.99
## 5 Luís Gonçalves Brazil 1.98
## 6 Luís Gonçalves Brazil 13.9
## 7 Luís Gonçalves Brazil 8.91
## 8 Leonie Köhler Germany 1.98
## 9 Leonie Köhler Germany 13.9
## 10 Leonie Köhler Germany 8.91
## # ... with 402 more rows
Para obtener el porcentaje de canciones que son vídeos:
tracks <- tbl(chinook, "tracks")
head(tracks) ## # Source: lazy query [?? x 9]
## # Database: sqlite 3.36.0 [D:\OneDrive - Universidade da
## # Coruña\__Actual\__IA\_intror\datos\chinook.db]
## TrackId Name AlbumId Media~1 GenreId Compo~2 Milli~3 Bytes UnitP~4
## <int> <chr> <int> <int> <int> <chr> <int> <int> <dbl>
## 1 1 For Those Abou~ 1 1 1 Angus ~ 343719 1.12e7 0.99
## 2 2 Balls to the W~ 2 2 1 <NA> 342562 5.51e6 0.99
## 3 3 Fast As a Shark 3 2 1 F. Bal~ 230619 3.99e6 0.99
## 4 4 Restless and W~ 3 2 1 F. Bal~ 252051 4.33e6 0.99
## 5 5 Princess of th~ 3 2 1 Deaffy~ 375418 6.29e6 0.99
## 6 6 Put The Finger~ 1 1 1 Angus ~ 205662 6.71e6 0.99
## # ... with abbreviated variable names 1: MediaTypeId, 2: Composer,
## # 3: Milliseconds, 4: UnitPrice
tracks %>% group_by(MediaTypeId) %>%
summarise(n = n()) %>% collect %>% mutate(freq = n / sum(n))## # A tibble: 5 x 3
## MediaTypeId n freq
## <int> <int> <dbl>
## 1 1 3034 0.866
## 2 2 237 0.0677
## 3 3 214 0.0611
## 4 4 7 0.00200
## 5 5 11 0.00314
media_types <- tbl(chinook, "media_types")
head(media_types)## # Source: lazy query [?? x 2]
## # Database: sqlite 3.36.0 [D:\OneDrive - Universidade da
## # Coruña\__Actual\__IA\_intror\datos\chinook.db]
## MediaTypeId Name
## <int> <chr>
## 1 1 MPEG audio file
## 2 2 Protected AAC audio file
## 3 3 Protected MPEG-4 video file
## 4 4 Purchased AAC audio file
## 5 5 AAC audio file
tracks %>% inner_join(media_types, by = "MediaTypeId") %>% count(Name.y) %>%
collect %>% mutate(freq = n / sum(n)) %>% filter(grepl('video', Name.y))## # A tibble: 1 x 3
## Name.y n freq
## <chr> <int> <dbl>
## 1 Protected MPEG-4 video file 214 0.0611
Para listar los 10 mejores clientes (aquellos a los que se les ha facturado más cantidad) indicando Nombre, Apellidos, Pais y el importe total de su facturación:
customers %>% inner_join(invoices, by = "CustomerId") %>% group_by(CustomerId) %>%
summarise(FirstName, LastName, country, total = sum(Total, na.rm = TRUE)) %>%
arrange(desc(total)) %>% head(10) %>% collect## # A tibble: 10 x 5
## CustomerId FirstName LastName Country total
## <int> <chr> <chr> <chr> <dbl>
## 1 6 Helena Holý Czech Republic 49.6
## 2 26 Richard Cunningham USA 47.6
## 3 57 Luis Rojas Chile 46.6
## 4 45 Ladislav Kovács Hungary 45.6
## 5 46 Hugh O'Reilly Ireland 45.6
## 6 28 Julia Barnett USA 43.6
## 7 24 Frank Ralston USA 43.6
## 8 37 Fynn Zimmermann Germany 43.6
## 9 7 Astrid Gruber Austria 42.6
## 10 25 Victor Stevens USA 42.6
Para listar los géneros musicales por orden decreciente de popularidad (definida la popularidad como el número de canciones de ese género), indicando el porcentaje de las canciones de ese género:
tracks %>% inner_join(tbl(chinook, "genres"), by = "GenreId") %>% count(Name.y) %>%
arrange(desc(n)) %>% collect %>% mutate(freq = n / sum(n))## # A tibble: 25 x 3
## Name.y n freq
## <chr> <int> <dbl>
## 1 Rock 1297 0.370
## 2 Latin 579 0.165
## 3 Metal 374 0.107
## 4 Alternative & Punk 332 0.0948
## 5 Jazz 130 0.0371
## 6 TV Shows 93 0.0265
## 7 Blues 81 0.0231
## 8 Classical 74 0.0211
## 9 Drama 64 0.0183
## 10 R&B/Soul 61 0.0174
## # ... with 15 more rows
Para listar los 10 artistas con mayor número de canciones de forma descendente según el número de canciones:
tracks %>% inner_join(tbl(chinook, "albums"), by = "AlbumId") %>%
inner_join(tbl(chinook, "artists"), by = "ArtistId") %>%
count(Name.y) %>% arrange(desc(n)) %>% collect## # A tibble: 204 x 2
## Name.y n
## <chr> <int>
## 1 Iron Maiden 213
## 2 U2 135
## 3 Led Zeppelin 114
## 4 Metallica 112
## 5 Lost 92
## 6 Deep Purple 92
## 7 Pearl Jam 67
## 8 Lenny Kravitz 57
## 9 Various Artists 56
## 10 The Office 53
## # ... with 194 more rows
Al finalizar hay que desconectar la base de datos:
DBI::dbDisconnect(chinook)