B.6 Bases de datos con dplyr
Para poder usar tablas en bases de datos relacionales con dplyr
hay que emplear el paquete dbplyr (convierte automáticamente el código de dplyr en consultas SQL).
Algunos enlaces:
B.6.1 Ejemplos
Como ejemplo emplearemos la base de datos de SQLite Sample Database Tutorial, almacenada en el archivo chinook.db.
# install.packages('dbplyr')
library(dplyr)
library(dbplyr)
En primer lugar hay que conectar la base de datos:
<- DBI::dbConnect(RSQLite::SQLite(), "datos/chinook.db") chinook
Podemos listar las tablas:
src_dbi(chinook)
## src: sqlite 3.36.0 [D:\OneDrive - Universidade da Coruña\__Actual\__IA\_intror\datos\chinook.db]
## tbls: albums, artists, customers, employees, genres, invoice_items, invoices,
## media_types, playlist_track, playlists, sqlite_sequence, sqlite_stat1, tracks
Para enlazar una tabla:
<- tbl(chinook, "invoices")
invoices invoices
## # Source: table<invoices> [?? x 9]
## # Database: sqlite 3.36.0 [D:\OneDrive - Universidade da
## # Coruña\__Actual\__IA\_intror\datos\chinook.db]
## InvoiceId CustomerId InvoiceD~1 Billi~2 Billi~3 Billi~4 Billi~5 Billi~6 Total
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1 2 2009-01-0~ Theodo~ Stuttg~ <NA> Germany 70174 1.98
## 2 2 4 2009-01-0~ Ullevå~ Oslo <NA> Norway 0171 3.96
## 3 3 8 2009-01-0~ Grétry~ Brusse~ <NA> Belgium 1000 5.94
## 4 4 14 2009-01-0~ 8210 1~ Edmont~ AB Canada T6G 2C7 8.91
## 5 5 23 2009-01-1~ 69 Sal~ Boston MA USA 2113 13.9
## 6 6 37 2009-01-1~ Berger~ Frankf~ <NA> Germany 60316 0.99
## 7 7 38 2009-02-0~ Barbar~ Berlin <NA> Germany 10779 1.98
## 8 8 40 2009-02-0~ 8, Rue~ Paris <NA> France 75002 1.98
## 9 9 42 2009-02-0~ 9, Pla~ Bordea~ <NA> France 33000 3.96
## 10 10 46 2009-02-0~ 3 Chat~ Dublin Dublin Ireland <NA> 5.94
## # ... with more rows, and abbreviated variable names 1: InvoiceDate,
## # 2: BillingAddress, 3: BillingCity, 4: BillingState, 5: BillingCountry,
## # 6: BillingPostalCode
Ojo [?? x 9]
: de momento no conoce el número de filas.
nrow(invoices)
## [1] NA
Podemos mostrar la consulta SQL correspondiente a una operación:
show_query(head(invoices))
## <SQL>
## SELECT *
## FROM `invoices`
## LIMIT 6
str(head(invoices))
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
## .. .. ..@ ptr :<externalptr>
## .. .. ..@ dbname : chr "D:\\OneDrive - Universidade da Coruña\\__Actual\\__IA\\_intror\\datos\\chinook.db"
## .. .. ..@ loadable.extensions: logi TRUE
## .. .. ..@ flags : int 70
## .. .. ..@ vfs : chr ""
## .. .. ..@ ref :<environment: 0x00000000137eb818>
## .. .. ..@ bigint : chr "integer64"
## .. .. ..@ extended_types : logi FALSE
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:4] "src_SQLiteConnection" "src_dbi" "src_sql" "src"
## $ ops:List of 4
## ..$ name: chr "head"
## ..$ x :List of 2
## .. ..$ x : 'ident' chr "invoices"
## .. ..$ vars: chr [1:9] "InvoiceId" "CustomerId" "InvoiceDate" "BillingAddress" ...
## .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## ..$ dots: list()
## ..$ args:List of 1
## .. ..$ n: num 6
## ..- attr(*, "class")= chr [1:3] "op_head" "op_single" "op"
## - attr(*, "class")= chr [1:5] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
Al trabajar con bases de datos, dplyr intenta ser lo más vago posible:
No exporta datos a R a menos que se pida explícitamente (
colect()
).Retrasa cualquier operación lo máximo posible: agrupa todo lo que se desea hacer y luego hace una única petición a la base de datos.
%>% head %>% collect invoices
## # A tibble: 6 x 9
## InvoiceId CustomerId InvoiceDate Billi~1 Billi~2 Billi~3 Billi~4 Billi~5 Total
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1 2 2009-01-01~ Theodo~ Stuttg~ <NA> Germany 70174 1.98
## 2 2 4 2009-01-02~ Ullevå~ Oslo <NA> Norway 0171 3.96
## 3 3 8 2009-01-03~ Grétry~ Brusse~ <NA> Belgium 1000 5.94
## 4 4 14 2009-01-06~ 8210 1~ Edmont~ AB Canada T6G 2C7 8.91
## 5 5 23 2009-01-11~ 69 Sal~ Boston MA USA 2113 13.9
## 6 6 37 2009-01-19~ Berger~ Frankf~ <NA> Germany 60316 0.99
## # ... with abbreviated variable names 1: BillingAddress, 2: BillingCity,
## # 3: BillingState, 4: BillingCountry, 5: BillingPostalCode
%>% count # número de filas invoices
## # Source: lazy query [?? x 1]
## # Database: sqlite 3.36.0 [D:\OneDrive - Universidade da
## # Coruña\__Actual\__IA\_intror\datos\chinook.db]
## n
## <int>
## 1 412
Por ejemplo, para obtener el importe mínimo, máximo y la media de las facturas:
<- invoices %>% summarise(min = min(Total, na.rm = TRUE),
res max = max(Total, na.rm = TRUE), med = mean(Total, na.rm = TRUE))
show_query(res)
## <SQL>
## SELECT MIN(`Total`) AS `min`, MAX(`Total`) AS `max`, AVG(`Total`) AS `med`
## FROM `invoices`
%>% collect res
## # A tibble: 1 x 3
## min max med
## <dbl> <dbl> <dbl>
## 1 0.99 25.9 5.65
Para obtener el total de las facturas de cada uno de los países:
<- invoices %>% group_by(BillingCountry) %>%
res summarise(n = n(), total = sum(Total, na.rm = TRUE))
show_query(res)
## <SQL>
## SELECT `BillingCountry`, COUNT(*) AS `n`, SUM(`Total`) AS `total`
## FROM `invoices`
## GROUP BY `BillingCountry`
%>% collect res
## # A tibble: 24 x 3
## BillingCountry n total
## <chr> <int> <dbl>
## 1 Argentina 7 37.6
## 2 Australia 7 37.6
## 3 Austria 7 42.6
## 4 Belgium 7 37.6
## 5 Brazil 35 190.
## 6 Canada 56 304.
## 7 Chile 7 46.6
## 8 Czech Republic 14 90.2
## 9 Denmark 7 37.6
## 10 Finland 7 41.6
## # ... with 14 more rows
Para obtener el listado de países junto con su facturación media, ordenado alfabéticamente por país:
<- invoices %>% group_by(BillingCountry) %>%
res summarise(n = n(), med = mean(Total, na.rm = TRUE)) %>%
arrange(BillingCountry)
show_query(res)
## <SQL>
## SELECT `BillingCountry`, COUNT(*) AS `n`, AVG(`Total`) AS `med`
## FROM `invoices`
## GROUP BY `BillingCountry`
## ORDER BY `BillingCountry`
%>% collect res
## # A tibble: 24 x 3
## BillingCountry n med
## <chr> <int> <dbl>
## 1 Argentina 7 5.37
## 2 Australia 7 5.37
## 3 Austria 7 6.09
## 4 Belgium 7 5.37
## 5 Brazil 35 5.43
## 6 Canada 56 5.43
## 7 Chile 7 6.66
## 8 Czech Republic 14 6.45
## 9 Denmark 7 5.37
## 10 Finland 7 5.95
## # ... with 14 more rows
Si el resultado lo queremos en orden decreciente por importe de facturación media:
%>% group_by(BillingCountry) %>%
invoices summarise(n = n(), med = mean(Total, na.rm = TRUE)) %>%
arrange(desc(med)) %>% collect
## # A tibble: 24 x 3
## BillingCountry n med
## <chr> <int> <dbl>
## 1 Chile 7 6.66
## 2 Ireland 7 6.52
## 3 Hungary 7 6.52
## 4 Czech Republic 14 6.45
## 5 Austria 7 6.09
## 6 Finland 7 5.95
## 7 Netherlands 7 5.80
## 8 India 13 5.79
## 9 USA 91 5.75
## 10 Norway 7 5.66
## # ... with 14 more rows
Para obtener un listado con Nombre y Apellidos de cliente y el importe de cada una de sus facturas (Hint: WHERE customer.CustomerID=invoices.CustomerID):
<- tbl(chinook, "customers")
customers tbl_vars(customers)
## <dplyr:::vars>
## [1] "CustomerId" "FirstName" "LastName" "Company" "Address"
## [6] "City" "State" "Country" "PostalCode" "Phone"
## [11] "Fax" "Email" "SupportRepId"
<- customers %>% inner_join(invoices, by = "CustomerId") %>% select(FirstName, LastName, Country, Total)
res show_query(res)
## <SQL>
## SELECT `FirstName`, `LastName`, `Country`, `Total`
## FROM (SELECT `LHS`.`CustomerId` AS `CustomerId`, `FirstName`, `LastName`, `Company`, `Address`, `City`, `State`, `Country`, `PostalCode`, `Phone`, `Fax`, `Email`, `SupportRepId`, `InvoiceId`, `InvoiceDate`, `BillingAddress`, `BillingCity`, `BillingState`, `BillingCountry`, `BillingPostalCode`, `Total`
## FROM `customers` AS `LHS`
## INNER JOIN `invoices` AS `RHS`
## ON (`LHS`.`CustomerId` = `RHS`.`CustomerId`)
## )
%>% collect res
## # A tibble: 412 x 4
## FirstName LastName Country Total
## <chr> <chr> <chr> <dbl>
## 1 Luís Gonçalves Brazil 3.98
## 2 Luís Gonçalves Brazil 3.96
## 3 Luís Gonçalves Brazil 5.94
## 4 Luís Gonçalves Brazil 0.99
## 5 Luís Gonçalves Brazil 1.98
## 6 Luís Gonçalves Brazil 13.9
## 7 Luís Gonçalves Brazil 8.91
## 8 Leonie Köhler Germany 1.98
## 9 Leonie Köhler Germany 13.9
## 10 Leonie Köhler Germany 8.91
## # ... with 402 more rows
Para obtener el porcentaje de canciones que son vídeos:
<- tbl(chinook, "tracks")
tracks head(tracks)
## # Source: lazy query [?? x 9]
## # Database: sqlite 3.36.0 [D:\OneDrive - Universidade da
## # Coruña\__Actual\__IA\_intror\datos\chinook.db]
## TrackId Name AlbumId Media~1 GenreId Compo~2 Milli~3 Bytes UnitP~4
## <int> <chr> <int> <int> <int> <chr> <int> <int> <dbl>
## 1 1 For Those Abou~ 1 1 1 Angus ~ 343719 1.12e7 0.99
## 2 2 Balls to the W~ 2 2 1 <NA> 342562 5.51e6 0.99
## 3 3 Fast As a Shark 3 2 1 F. Bal~ 230619 3.99e6 0.99
## 4 4 Restless and W~ 3 2 1 F. Bal~ 252051 4.33e6 0.99
## 5 5 Princess of th~ 3 2 1 Deaffy~ 375418 6.29e6 0.99
## 6 6 Put The Finger~ 1 1 1 Angus ~ 205662 6.71e6 0.99
## # ... with abbreviated variable names 1: MediaTypeId, 2: Composer,
## # 3: Milliseconds, 4: UnitPrice
%>% group_by(MediaTypeId) %>%
tracks summarise(n = n()) %>% collect %>% mutate(freq = n / sum(n))
## # A tibble: 5 x 3
## MediaTypeId n freq
## <int> <int> <dbl>
## 1 1 3034 0.866
## 2 2 237 0.0677
## 3 3 214 0.0611
## 4 4 7 0.00200
## 5 5 11 0.00314
<- tbl(chinook, "media_types")
media_types head(media_types)
## # Source: lazy query [?? x 2]
## # Database: sqlite 3.36.0 [D:\OneDrive - Universidade da
## # Coruña\__Actual\__IA\_intror\datos\chinook.db]
## MediaTypeId Name
## <int> <chr>
## 1 1 MPEG audio file
## 2 2 Protected AAC audio file
## 3 3 Protected MPEG-4 video file
## 4 4 Purchased AAC audio file
## 5 5 AAC audio file
%>% inner_join(media_types, by = "MediaTypeId") %>% count(Name.y) %>%
tracks %>% mutate(freq = n / sum(n)) %>% filter(grepl('video', Name.y)) collect
## # A tibble: 1 x 3
## Name.y n freq
## <chr> <int> <dbl>
## 1 Protected MPEG-4 video file 214 0.0611
Para listar los 10 mejores clientes (aquellos a los que se les ha facturado más cantidad) indicando Nombre, Apellidos, Pais y el importe total de su facturación:
%>% inner_join(invoices, by = "CustomerId") %>% group_by(CustomerId) %>%
customers summarise(FirstName, LastName, country, total = sum(Total, na.rm = TRUE)) %>%
arrange(desc(total)) %>% head(10) %>% collect
## # A tibble: 10 x 5
## CustomerId FirstName LastName Country total
## <int> <chr> <chr> <chr> <dbl>
## 1 6 Helena Holý Czech Republic 49.6
## 2 26 Richard Cunningham USA 47.6
## 3 57 Luis Rojas Chile 46.6
## 4 45 Ladislav Kovács Hungary 45.6
## 5 46 Hugh O'Reilly Ireland 45.6
## 6 28 Julia Barnett USA 43.6
## 7 24 Frank Ralston USA 43.6
## 8 37 Fynn Zimmermann Germany 43.6
## 9 7 Astrid Gruber Austria 42.6
## 10 25 Victor Stevens USA 42.6
Para listar los géneros musicales por orden decreciente de popularidad (definida la popularidad como el número de canciones de ese género), indicando el porcentaje de las canciones de ese género:
%>% inner_join(tbl(chinook, "genres"), by = "GenreId") %>% count(Name.y) %>%
tracks arrange(desc(n)) %>% collect %>% mutate(freq = n / sum(n))
## # A tibble: 25 x 3
## Name.y n freq
## <chr> <int> <dbl>
## 1 Rock 1297 0.370
## 2 Latin 579 0.165
## 3 Metal 374 0.107
## 4 Alternative & Punk 332 0.0948
## 5 Jazz 130 0.0371
## 6 TV Shows 93 0.0265
## 7 Blues 81 0.0231
## 8 Classical 74 0.0211
## 9 Drama 64 0.0183
## 10 R&B/Soul 61 0.0174
## # ... with 15 more rows
Para listar los 10 artistas con mayor número de canciones de forma descendente según el número de canciones:
%>% inner_join(tbl(chinook, "albums"), by = "AlbumId") %>%
tracks inner_join(tbl(chinook, "artists"), by = "ArtistId") %>%
count(Name.y) %>% arrange(desc(n)) %>% collect
## # A tibble: 204 x 2
## Name.y n
## <chr> <int>
## 1 Iron Maiden 213
## 2 U2 135
## 3 Led Zeppelin 114
## 4 Metallica 112
## 5 Lost 92
## 6 Deep Purple 92
## 7 Pearl Jam 67
## 8 Lenny Kravitz 57
## 9 Various Artists 56
## 10 The Office 53
## # ... with 194 more rows
Al finalizar hay que desconectar la base de datos:
::dbDisconnect(chinook) DBI