Regex to extract strings between two or more patterns using stringr

victordogo · December 6, 2021, 8:22pm

Good afternoon everyone, how are you all?

Over the past few weeks I've been working with a dataset about the band Metallica that has a Sets column holding each setlist as a single string. They are laid out like this, for example:

"['So What', 'Creeping Death', 'Sad But True', \"Ain't My Bitch\", 'Whiplash', 'Bleeding Me', 'King Nothing', 'One', \"Devil's Dance\", 'Bass/Guitar Doodle', 'Nothing Else Matters', 'Until It Sleeps', 'For Whom the Bell Tolls', 'Wherever I May Roam', 'Fade to Black', 'Kill/Ride Medley']"

My goal is to build, from this Sets column, a vector containing every song played in every set, repeats included, so I can count which ones are played most often. The part that's tripping me up is the regex: I can extract the songs delimited by ' characters, but some of them are delimited by " characters instead, as you can see. My first attempt looked like this:

"['So What', 'Creeping Death', 'Sad But True', \"Ain't My Bitch\", 'Whiplash', 'Bleeding Me', 'King Nothing', 'One', \"Devil's Dance\", 'Bass/Guitar Doodle', 'Nothing Else Matters', 'Until It Sleeps', 'For Whom the Bell Tolls', 'Wherever I May Roam', 'Fade to Black', 'Kill/Ride Medley']" |> 
  stringr::str_split(pattern = ",") |>
  unlist() |> 
  stringr::str_extract("(?s)'\\s*(.*?)\\s*'")
#>  [1] "'So What'"                 "'Creeping Death'"         
#>  [3] "'Sad But True'"            NA                         
#>  [5] "'Whiplash'"                "'Bleeding Me'"            
#>  [7] "'King Nothing'"            "'One'"                    
#>  [9] NA                          "'Bass/Guitar Doodle'"     
#> [11] "'Nothing Else Matters'"    "'Until It Sleeps'"        
#> [13] "'For Whom the Bell Tolls'" "'Wherever I May Roam'"    
#> [15] "'Fade to Black'"           "'Kill/Ride Medley'"

Created on 2021-12-06 by the reprex package (v2.0.0)

(I took this regex from here.)

Notice that the songs delimited by " characters come back as NA. I've tried adapting this regex to cover both cases, but I can't spot my mistake (partly because yours truly still struggles with regex in general):

"['So What', 'Creeping Death', 'Sad But True', \"Ain't My Bitch\", 'Whiplash', 'Bleeding Me', 'King Nothing', 'One', \"Devil's Dance\", 'Bass/Guitar Doodle', 'Nothing Else Matters', 'Until It Sleeps', 'For Whom the Bell Tolls', 'Wherever I May Roam', 'Fade to Black', 'Kill/Ride Medley']" |> 
  stringr::str_split(pattern = ",") |>
  unlist() |> 
  stringr::str_extract("((?s)'\\s*(.*?)\\s*' | (?s)\\\ \\s*(.*?)\\s* \\\ )")
#>  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Created on 2021-12-06 by the reprex package (v2.0.0)

Can anyone help me write this regex better (or even suggest a more convenient alternative)? Thanks a lot
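One possible way to cover both quote styles in a single pattern is to capture the opening quote and require the same character to close the title, via a backreference. The sketch below is only an illustration, not a vetted solution for the full dataset: it assumes every title is wrapped in a matching pair of ' or " and contains no backslash-escaped quotes, and the names `set`, `matches` and `songs` are chosen just for this example. Note also that spaces written inside a regex are matched literally by default, so padding the `|` alternation with spaces changes what the pattern accepts.

```r
library(stringr)

# Example setlist string, shortened from the one in the question.
set <- "['So What', 'Creeping Death', 'Sad But True', \"Ain't My Bitch\", 'Whiplash', \"Devil's Dance\", 'Kill/Ride Medley']"

# (['\"])  captures the opening quote, either ' or "
# (.*?)    lazily captures the title
# \\1      requires the same quote character that opened the title
matches <- str_match_all(set, "(['\"])(.*?)\\1")[[1]]

# Column 1 is the full match, column 2 the quote, column 3 the title.
songs <- matches[, 3]
songs
```

Because the closing quote has to equal the opening one, the apostrophe in "Ain't My Bitch" does not end the match early, and no prior split on commas is needed.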

clente · December 7, 2021, 12:29am

Victor, I'm not sure whether this meets your needs, but I came up with a solution that simply removes the first and last character of each song.

# Example data
set <- "['So What', 'Creeping Death', 'Sad But True', \"Ain't My Bitch\", 'Whiplash', 'Bleeding Me', 'King Nothing', 'One', \"Devil's Dance\", 'Bass/Guitar Doodle', 'Nothing Else Matters', 'Until It Sleeps', 'For Whom the Bell Tolls', 'Wherever I May Roam', 'Fade to Black', 'Kill/Ride Medley']"

set |>
  stringr::str_remove_all("[\\[\\]]") |>
  stringr::str_split(", ") |>
  purrr::pluck(1) |>
  stringr::str_sub(2, -2)
#>  [1] "So What"                 "Creeping Death"         
#>  [3] "Sad But True"            "Ain't My Bitch"         
#>  [5] "Whiplash"                "Bleeding Me"            
#>  [7] "King Nothing"            "One"                    
#>  [9] "Devil's Dance"           "Bass/Guitar Doodle"     
#> [11] "Nothing Else Matters"    "Until It Sleeps"        
#> [13] "For Whom the Bell Tolls" "Wherever I May Roam"    
#> [15] "Fade to Black"           "Kill/Ride Medley"

Created on 2021-12-06 by the reprex package (v2.0.1)

victordogo · December 7, 2021, 1:24pm

Caio, first of all, thank you very much for the reply. It made me notice a flaw in my reprex: although the solution works for the example data, it does not work for some other songs in the database (I'm still trying to find a reproducible way to show this, since the database itself is huge).

In general, because of a few songs, NA fields or truncated titles still showed up even with the solution you suggested (a few very specific ones had extra characters at the beginning or end). What I managed to get working was the following:

# Example data

set <- "['So What', 'Creeping Death', 'Sad But True', \"Ain't My Bitch\", 'Whiplash', 'Bleeding Me', 'King Nothing', 'One', \"Devil's Dance\", 'Bass/Guitar Doodle', 'Nothing Else Matters', 'Until It Sleeps', 'For Whom the Bell Tolls', 'Wherever I May Roam', 'Fade to Black', 'Kill/Ride Medley']"

set |>
  stringr::str_split(pattern = ", ") |>
  unlist() |>
  stringr::str_replace_all("[[:punct:]]", "")
#>  [1] "So What"                 "Creeping Death"         
#>  [3] "Sad But True"            "Aint My Bitch"          
#>  [5] "Whiplash"                "Bleeding Me"            
#>  [7] "King Nothing"            "One"                    
#>  [9] "Devils Dance"            "BassGuitar Doodle"      
#> [11] "Nothing Else Matters"    "Until It Sleeps"        
#> [13] "For Whom the Bell Tolls" "Wherever I May Roam"    
#> [15] "Fade to Black"           "KillRide Medley"

Created on 2021-12-07 by the reprex package (v2.0.0)

By splitting the songs on commas and then removing their punctuation with the regex in str_replace_all(), I got a result similar to yours that works the same way (I decided the missing punctuation doesn't matter for this particular count). I'm checking whether any problem remains; if so, I'll update this thread. Either way, thank you very much
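If the punctuation inside titles ever needs to be kept (for instance so that "Kill/Ride Medley" is not collapsed into "KillRide Medley"), one alternative is to extract the quoted titles directly and count them across all sets. This is a rough sketch under stated assumptions: `sets` is a made-up two-element stand-in for the real Sets column, and each title is assumed to sit between a matching pair of ' or " with no escaped quotes inside.

```r
library(stringr)

# Hypothetical stand-in for the Sets column: two short setlists.
sets <- c(
  "['One', \"Devil's Dance\", 'Kill/Ride Medley']",
  "['One', 'Fade to Black', \"Devil's Dance\", 'One']"
)

# One match matrix per setlist; column 3 holds the title without its quotes.
matches <- str_match_all(sets, "(['\"])(.*?)\\1")

# Flatten into a single vector, keeping repeats so they can be counted.
songs <- unlist(lapply(matches, function(m) m[, 3]))

# Frequency table, most played first.
counts <- sort(table(songs), decreasing = TRUE)
counts
```

With the real data, `sets` would be replaced by the Sets column itself; whether the pattern holds for every row of the full database is something the sketch does not check.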