Hi everyone,
I'm having trouble scraping this page: https://idapcap.mdr.gov.br
It returns a list of XML files that I want to download (and then read one by one) with R. How can I build a list of these XMLs to download? I tried with rvest but couldn't get it to work.
Cheers!
Hi Rafael!
I'd split this into 2 steps: first fetch the list of XML URLs to download; then write a function that does what you need with each file and iterate it over the result of the first step.
This code is a bit incomplete: step 1 works, and I kept step 2 simple, but it should be enough to get the idea and adapt it. See if it makes sense!
# fetching the complete list ------
url_mdr <- "https://idapcap.mdr.gov.br"
g_mdr <- httr::GET(url_mdr)
result <- httr::content(g_mdr) |>
  # find the link elements in the directory listing
  rvest::html_nodes(xpath = "/html/body/pre/a") |>
  # take the href attribute of each link
  rvest::html_attr("href")
# [997] "6863121022021-SP.xml"
# [998] "6863221022021-SP.xml"
# [999] "6863321022021-SP.xml"
complete_url <- paste0("https://idapcap.mdr.gov.br/", result)
# [723] "https://idapcap.mdr.gov.br/6834617022021-SP.xml"
# [724] "https://idapcap.mdr.gov.br/6834717022021-SP.xml"
# [725] "https://idapcap.mdr.gov.br/6834817022021-SP.xml"
# fetch each xml
buscar_infos_xml <- function(url_idapcap) {
  # put here whatever you want to do with each file.
  # I used a simple GET + content.
  # to save the content to disk, pass
  # httr::write_disk() inside GET()
  url_idapcap |>
    httr::GET() |>
    httr::content() # what do you want to extract from them?
}
infos_xml <- complete_url[2:10] |>
  purrr::map(buscar_infos_xml)
# > infos_xml
# [[1]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
# [1] <title>Rss Idap</title>
# [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
# [3] <updated>2021-02-15T17:56:07Z</updated>
# [4] <dc:date>2021-02-15T17:56:07Z</dc:date>
# [5] <entry>\n <title>CHUVAS INTENSAS</title>\n <author>\ ...
#
# [[2]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
# [1] <title>Rss Idap</title>
# [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
# [3] <updated>2022-05-04T16:46:58Z</updated>
# [4] <dc:date>2022-05-04T16:46:58Z</dc:date>
# [5] <entry>\n <title>DESLIZAMENTOS</title>\n <author>\n ...
#
# [[3]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
# [1] <title>Rss Idap</title>
# [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
# [3] <updated>2020-12-29T14:04:34Z</updated>
# [4] <dc:date>2020-12-29T14:04:34Z</dc:date>
# [5] <entry>\n <title>CHUVAS INTENSAS</title>\n <author>\ ...
#
# [[4]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
# [1] <title>Rss Idap</title>
# [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
# [3] <updated>2020-12-29T14:42:36Z</updated>
# [4] <dc:date>2020-12-29T14:42:36Z</dc:date>
# [5] <entry>\n <title>CHUVAS INTENSAS</title>\n <author>\ ...
#
# [[5]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
# [1] <title>Rss Idap</title>
# [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
# [3] <updated>2020-12-29T14:59:44Z</updated>
# [4] <dc:date>2020-12-29T14:59:44Z</dc:date>
# [5] <entry>\n <title>TEMPESTADE LOCAL/CONVECTIVA - CHUVAS ...
#
# [[6]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
# [1] <title>Rss Idap</title>
# [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
# [3] <updated>2020-12-29T15:13:57Z</updated>
# [4] <dc:date>2020-12-29T15:13:57Z</dc:date>
# [5] <entry>\n <title>CHUVAS INTENSAS</title>\n <author>\ ...
#
# [[7]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
# [1] <title>Rss Idap</title>
# [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
# [3] <updated>2020-12-29T15:16:43Z</updated>
# [4] <dc:date>2020-12-29T15:16:43Z</dc:date>
# [5] <entry>\n <title>CHUVAS INTENSAS</title>\n <author>\ ...
#
# [[8]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
# [1] <title>Rss Idap</title>
# [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
# [3] <updated>2020-12-29T15:47:36Z</updated>
# [4] <dc:date>2020-12-29T15:47:36Z</dc:date>
# [5] <entry>\n <title>CHUVAS INTENSAS</title>\n <author>\ ...
#
# [[9]]
# {xml_document}
# <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
# [1] <title>Rss Idap</title>
# [2] <id>http://idapapi.mi.gov.br/api/rss/cap</id>
# [3] <updated>2020-12-29T16:04:51Z</updated>
# [4] <dc:date>2020-12-29T16:04:51Z</dc:date>
# [5] <entry>\n <title>CHUVAS INTENSAS</title>\n <author>\ ...
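If you'd rather keep the files on disk and then pull out a few fields, something along these lines might work. It's only a sketch: `download_xml`, `extract_fields` and the `xml_idapcap` folder are names I made up for illustration, and I'm assuming (from the output above) that every file is an Atom feed whose `<entry>` has a `<title>` with the event type. Adjust the XPaths to whatever you actually need.

```r
# Hypothetical helper: save one XML file into `dir_out` and return the
# local path. Files that are already on disk are not downloaded again.
download_xml <- function(url_idapcap, dir_out = "xml_idapcap") {
  dir.create(dir_out, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(dir_out, basename(url_idapcap))
  if (!file.exists(path)) {
    httr::GET(url_idapcap, httr::write_disk(path, overwrite = TRUE))
  }
  path
}

# Hypothetical helper: extract a couple of fields from one feed.
# The feed uses the Atom default namespace (see the printed output),
# so the XPath needs an explicit prefix bound to that URL.
extract_fields <- function(doc) {
  ns <- c(atom = "http://www.w3.org/2005/Atom")
  data.frame(
    updated = xml2::xml_text(
      xml2::xml_find_first(doc, "/atom:feed/atom:updated", ns)
    ),
    event = xml2::xml_text(
      xml2::xml_find_first(doc, "/atom:feed/atom:entry/atom:title", ns)
    ),
    stringsAsFactors = FALSE
  )
}

# Usage, reusing complete_url from the code above (needs network access):
# paths <- vapply(complete_url[2:10], download_xml, character(1))
# infos <- do.call(
#   rbind,
#   lapply(paths, function(p) extract_fields(xml2::read_xml(p)))
# )
```

Keeping the download and the parsing in separate functions means you only hit the server once per file, and can rerun the parsing step as many times as you like on the local copies.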
Perfect! Thank you very much.