Extracting phrases with stringr

Hi! I'm trying to extract some phrases (localities) from a set of strings, and I've included several examples below. The pattern looks for the text after the first ref and should run up to the first ., while ignoring cases such as km., ca. and U.S. (the last one for strings containing U.S.A.).

However, the ca inside the pattern ends up skipping the a. in Colombia. in the first example. If I test with Colombie, it works as expected.
All the other examples work; only the first one still fails.

library(dplyr)

places <- tibble(
  text = c(
    "versity Studies No. 19; ref. 1259] Villavicencio, Río Orinoco system, Colombia. Holotype: FMNH 56646 [ex CM 5463]. Paratypes: (6) ",
    "versity Studies No. 19; ref. 1259] Villavicencio, Río Orinoco system, Colombie. Holotype: FMNH 56646 [ex CM 5463]. Paratypes: (6) ",
    "ions No. 457 (art. 17); ref. 2247] Roadside pool, 3 kilometers south of Progreso, Yucatán, southeastern Mexico. Holotype: UMMZ 102144. Paratypes: UMMZ 102145 (3). •Syn",
    "for 2001); ref. 27036] La Sapayera stream, near the bridge km. 2 vía Guadalupe-Florencia, 1°59'27\"N, 75°45'08\"W, afluent of La Viciosa stream, río Suaza system, upper Magdalena, Cachimba trail, Municipality of Guadalupe, Department of Huila, Colombia, elevation 952 meters. Holotype: IUQ 422. Par",
    "); ref. 30495] Rio Iguaçu, ca. 25°43'S, 49°44'W, Serrinha, Paraná, Brazil. Holotype: FMN",
    ". 4 (no. 2) (art. 7); ref. 1261] Corumba, río Otuquis, Ascunción, Paraguay. Holo",
    "); ref. 14367] Not: Puerto Mexico [Coatzacoalcos], Veracruz State, Isthmus of Tehuantepec, Mexico [probably Panama]. Holotype",
    "hia v. 7; ref. 168] Upper tributaries of Nueces River, Texas, U.S.A. Lectotype: BMNH 1883.12.14.107. Paralectotype"
  )
)


places %>% 
  mutate(
    type_locality = stringr::str_extract(text, "(?<=ref.\\s[:digit:]{1,6}\\]\\s).*?[^\\s(km|ca|U\\.S\\.)\\.](?=\\.)")) %>% 
  relocate(type_locality)
#> # A tibble: 8 x 2
#>   type_locality                                                            text 
#>   <chr>                                                                    <chr>
#> 1 "Villavicencio, Río Orinoco system, Colombia. Holotype: FMNH 56646 [ex ~ "ver~
#> 2 "Villavicencio, Río Orinoco system, Colombie"                            "ver~
#> 3 "Roadside pool, 3 kilometers south of Progreso, Yucatán, southeastern M~ "ion~
#> 4 "La Sapayera stream, near the bridge km. 2 vía Guadalupe-Florencia, 1°5~ "for~
#> 5 "Rio Iguaçu, ca. 25°43'S, 49°44'W, Serrinha, Paraná, Brazil"             "); ~
#> 6 "Corumba, río Otuquis, Ascunción, Paraguay"                              ". 4~
#> 7 "Not: Puerto Mexico [Coatzacoalcos], Veracruz State, Isthmus of Tehuant~ "); ~
#> 8 "Upper tributaries of Nueces River, Texas, U.S.A"                        "hia~

Created on 2022-12-19 with reprex v2.0.2

Bruno, the approach I came up with splits the regex into two steps. See whether this meets your requirements:

places <- tibble::tibble(
  text = c(
    "versity Studies No. 19; ref. 1259] Villavicencio, Río Orinoco system, Colombia. Holotype: FMNH 56646 [ex CM 5463]. Paratypes: (6) ",
    "versity Studies No. 19; ref. 1259] Villavicencio, Río Orinoco system, Colombie. Holotype: FMNH 56646 [ex CM 5463]. Paratypes: (6) ",
    "ions No. 457 (art. 17); ref. 2247] Roadside pool, 3 kilometers south of Progreso, Yucatán, southeastern Mexico. Holotype: UMMZ 102144. Paratypes: UMMZ 102145 (3). •Syn",
    "for 2001); ref. 27036] La Sapayera stream, near the bridge km. 2 vía Guadalupe-Florencia, 1°59'27\"N, 75°45'08\"W, afluent of La Viciosa stream, río Suaza system, upper Magdalena, Cachimba trail, Municipality of Guadalupe, Department of Huila, Colombia, elevation 952 meters. Holotype: IUQ 422. Par",
    "); ref. 30495] Rio Iguaçu, ca. 25°43'S, 49°44'W, Serrinha, Paraná, Brazil. Holotype: FMN",
    ". 4 (no. 2) (art. 7); ref. 1261] Corumba, río Otuquis, Ascunción, Paraguay. Holo",
    "); ref. 14367] Not: Puerto Mexico [Coatzacoalcos], Veracruz State, Isthmus of Tehuantepec, Mexico [probably Panama]. Holotype",
    "hia v. 7; ref. 168] Upper tributaries of Nueces River, Texas, U.S.A. Lectotype: BMNH 1883.12.14.107. Paralectotype"
  )
)

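places |>
  dplyr::mutate(
    # Step 1: keep everything after "ref. <digits>] "
    type_locality = stringr::str_extract(text, "(?<=ref\\.\\s[:digit:]{1,6}\\]\\s).+"),
    # Step 2: drop everything from the first ". " that does not follow a space, "km" or "ca"
    type_locality = stringr::str_remove(type_locality, "(?<!\\s|km|ca)\\.\\s.+")
  ) |>
  dplyr::relocate(type_locality)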
#> # A tibble: 8 × 2
#>   type_locality                                                            text 
#>   <chr>                                                                    <chr>
#> 1 "Villavicencio, Río Orinoco system, Colombia"                            "ver…
#> 2 "Villavicencio, Río Orinoco system, Colombie"                            "ver…
#> 3 "Roadside pool, 3 kilometers south of Progreso, Yucatán, southeastern M… "ion…
#> 4 "La Sapayera stream, near the bridge km. 2 vía Guadalupe-Florencia, 1°5… "for…
#> 5 "Rio Iguaçu, ca. 25°43'S, 49°44'W, Serrinha, Paraná, Brazil"             "); …
#> 6 "Corumba, río Otuquis, Ascunción, Paraguay"                              ". 4…
#> 7 "Not: Puerto Mexico [Coatzacoalcos], Veracruz State, Isthmus of Tehuant… "); …
#> 8 "Upper tributaries of Nueces River, Texas, U.S.A"                        "hia…

Created on 2022-12-19 with reprex v2.0.2
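
As for why the original pattern skips Colombia., a likely explanation is that (km|ca|U\\.S\\.) sits inside square brackets, where it is read as a set of single characters rather than as a group of alternatives. The sketch below isolates that part of the pattern and compares it with the negative lookbehind used in step 2. The test strings are shortened, made-up variants of the examples, and the variable names are only illustrative.

```r
library(stringr)

# Inside [...], "(km|ca|U\\.S\\.)" is not a group: every character is a
# separate member, so the class rejects any single k, m, c, a, U, S, etc.
char_class <- "[^\\s(km|ca|U\\.S\\.)\\.](?=\\.)"

ends_in_a <- str_detect("Colombia.", char_class) # FALSE: "a" is excluded
ends_in_e <- str_detect("Colombie.", char_class) # TRUE: "e" is allowed

# A negative lookbehind checks whole tokens ("km", "ca") before the period.
token_aware <- "(?<!\\s|km|ca)\\.\\s.+"

cut_plain <- str_remove("Colombia. Holotype: FMNH 56646", token_aware)
cut_abbrev <- str_remove("ca. 25 km north, Brazil. Holotype: FMN", token_aware)

print(c(cut_plain, cut_abbrev))
```

One caveat with the lookbehind: it has no word boundary, so a locality that happens to end in "ca" or "km" right before the period would not be cut at that point.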

That's great, @clente!
Really nice idea, and it even looks simpler! I ended up using this approach:

places |>
  dplyr::mutate(
    # Consume "km.", "ca." and "U.S." as whole tokens; stop lazily at the first remaining ". "
    type_locality = stringr::str_extract(text, "(?<=ref\\.\\s[0-9]{1,6}\\]\\s)(?:\\b(?:km|ca|U\\.S)\\.|.)*?(?=\\.\\s)")
  )
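
To see how this single pattern behaves, here is a small self-contained check. It is only a sketch: the first string is a made-up ASCII example that combines km., ca. and U.S.A., the other two are taken from the examples above, and the variable names are illustrative. The idea is that the alternation tries to consume an abbreviation as one token (with \\b requiring it to start a word) before falling back to a single character, so the lazy match can only stop at a period that is not part of those tokens.

```r
library(stringr)

examples <- c(
  # Hypothetical string, written for this check only
  "ref. 123] Creek at km. 2, ca. 5 km north of town, Texas, U.S.A. Holotype: XX 1. Par",
  # Taken from the examples in this thread
  "); ref. 14367] Not: Puerto Mexico [Coatzacoalcos], Veracruz State, Isthmus of Tehuantepec, Mexico [probably Panama]. Holotype",
  "hia v. 7; ref. 168] Upper tributaries of Nueces River, Texas, U.S.A. Lectotype: BMNH 1883.12.14.107. Paralectotype"
)

pattern <- paste0(
  "(?<=ref\\.\\s[0-9]{1,6}\\]\\s)",  # must follow "ref. <digits>] "
  "(?:\\b(?:km|ca|U\\.S)\\.|.)*?",   # abbreviation as one token, else any character
  "(?=\\.\\s)"                       # stop before the first remaining ". "
)

localities <- str_extract(examples, pattern)
print(localities)
```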