Scraping a list of links

raphaelsaldanha · June 9, 2022, 1:59pm

Hi everyone,

I'm having trouble scraping this page: https://idapcap.mdr.gov.br

It returns a list of XML files that I want to download (and then read one by one) with R. How can I build a list of these XMLs to download? I tried rvest but couldn't get it to work.

Cheers!

beatrizmilz · June 9, 2022, 2:45pm

Hi Rafael!

I see two steps: first get the list of URLs of the XMLs to download; then write a function that does what you need with each one and iterate over the result of the first step.
This code is a bit incomplete: step 1 works, and I kept step 2 simple, but it should be enough to get the idea and adapt it. See if it makes sense!

# fetching the complete list ------
url_mdr <- "https://idapcap.mdr.gov.br"

g_mdr <- httr::GET(url_mdr)

result <- httr::content(g_mdr) |>
  # find the link elements in the directory listing
  rvest::html_nodes(xpath = "/html/body/pre/a") |>
  # extract the href attribute (the file name) from each link
  rvest::html_attr("href")
  
# [997] "6863121022021-SP.xml"         
# [998] "6863221022021-SP.xml"         
# [999] "6863321022021-SP.xml"    

complete_url <- paste0("https://idapcap.mdr.gov.br/", result)
# [723] "https://idapcap.mdr.gov.br/6834617022021-SP.xml"         
# [724] "https://idapcap.mdr.gov.br/6834717022021-SP.xml"         
# [725] "https://idapcap.mdr.gov.br/6834817022021-SP.xml"  


# fetch each XML
buscar_infos_xml <- function(url_idapcap){
  # put whatever you want to do with each file here.
  # I used a simple GET + content;
  # to save the file to disk instead, pass
  # httr::write_disk() inside GET()
  url_idapcap |>
    httr::GET() |>
    httr::content() # what do you want to extract from them?
}

infos_xml <- complete_url[2:10] |> 
  purrr::map(buscar_infos_xml)


# > infos_xml  
# [[1]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
#   [1] <title>Rss Idap</title>
#   [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
#   [3] <updated>2021-02-15T17:56:07Z</updated>
#   [4] <dc:date>2021-02-15T17:56:07Z</dc:date>
#   [5] <entry>\n  <title>CHUVAS INTENSAS</title>\n  <author>\ ...
# 
# [[2]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
#   [1] <title>Rss Idap</title>
#   [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
#   [3] <updated>2022-05-04T16:46:58Z</updated>
#   [4] <dc:date>2022-05-04T16:46:58Z</dc:date>
#   [5] <entry>\n  <title>DESLIZAMENTOS</title>\n  <author>\n  ...
# 
# [[3]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
#   [1] <title>Rss Idap</title>
#   [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
#   [3] <updated>2020-12-29T14:04:34Z</updated>
#   [4] <dc:date>2020-12-29T14:04:34Z</dc:date>
#   [5] <entry>\n  <title>CHUVAS INTENSAS</title>\n  <author>\ ...
# 
# [[4]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
#   [1] <title>Rss Idap</title>
#   [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
#   [3] <updated>2020-12-29T14:42:36Z</updated>
#   [4] <dc:date>2020-12-29T14:42:36Z</dc:date>
#   [5] <entry>\n  <title>CHUVAS INTENSAS</title>\n  <author>\ ...
# 
# [[5]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
#   [1] <title>Rss Idap</title>
#   [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
#   [3] <updated>2020-12-29T14:59:44Z</updated>
#   [4] <dc:date>2020-12-29T14:59:44Z</dc:date>
#   [5] <entry>\n  <title>TEMPESTADE LOCAL/CONVECTIVA - CHUVAS ...
# 
# [[6]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
#   [1] <title>Rss Idap</title>
#   [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
#   [3] <updated>2020-12-29T15:13:57Z</updated>
#   [4] <dc:date>2020-12-29T15:13:57Z</dc:date>
#   [5] <entry>\n  <title>CHUVAS INTENSAS</title>\n  <author>\ ...
# 
# [[7]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
#   [1] <title>Rss Idap</title>
#   [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
#   [3] <updated>2020-12-29T15:16:43Z</updated>
#   [4] <dc:date>2020-12-29T15:16:43Z</dc:date>
#   [5] <entry>\n  <title>CHUVAS INTENSAS</title>\n  <author>\ ...
# 
# [[8]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
#   [1] <title>Rss Idap</title>
#   [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
#   [3] <updated>2020-12-29T15:47:36Z</updated>
#   [4] <dc:date>2020-12-29T15:47:36Z</dc:date>
#   [5] <entry>\n  <title>CHUVAS INTENSAS</title>\n  <author>\ ...
# 
# [[9]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
#   [1] <title>Rss Idap</title>
#   [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
#   [3] <updated>2020-12-29T16:04:51Z</updated>
#   [4] <dc:date>2020-12-29T16:04:51Z</dc:date>
#   [5] <entry>\n  <title>CHUVAS INTENSAS</title>\n  <author>\ ...
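
To illustrate the two options mentioned in the comments above (saving each file with httr::write_disk(), or extracting something from the parsed XML), here is a minimal sketch. The helper names `download_xml` and `extract_entry_titles` and the folder name `idapcap_xml` are hypothetical and only for illustration. It assumes each file is an Atom feed with `<entry>` elements containing a `<title>`, as in the printed output above.

```r
library(xml2)

# Hypothetical helper: save one XML file into `dest_dir` and return the local path.
download_xml <- function(url_idapcap, dest_dir = "idapcap_xml") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, basename(url_idapcap))
  # write_disk() writes the response body to `dest` instead of keeping it in memory
  httr::GET(url_idapcap, httr::write_disk(dest, overwrite = TRUE))
  dest
}

# Hypothetical helper: return the text of every <entry><title> in one feed.
# `x` can be a local file path or a string containing the XML itself.
extract_entry_titles <- function(x) {
  doc <- xml2::read_xml(x)
  # drop the default Atom namespace so plain XPath names match
  xml2::xml_ns_strip(doc)
  doc |>
    xml2::xml_find_all("//entry/title") |>
    xml2::xml_text()
}

# Possible usage (needs network access, so it is left commented out):
# paths  <- purrr::map_chr(complete_url[2:10], download_xml)
# titles <- purrr::map(paths, extract_entry_titles)
```

Saving the files first means the parsing step can be rerun without hitting the server again.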

raphaelsaldanha · June 9, 2022, 8:08pm

Perfect! Thank you very much.