I am trying to automatically collect articles from a database that first requires me to log in.

I have written the following code using Selenium to open the search results page, then wait so that I can log in. That works, and it gets the link to each item in the search results.

I then want to keep using Selenium to visit each of the links in the search results and collect the article text.

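import time
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get("LINK")
time.sleep(60)  # pause so I can log in by hand
# take anchors 20-39 on the page, which are the search result links
lnks = browser.find_elements(By.TAG_NAME, "a")[20:40]
for lnk in lnks:
    link = lnk.get_attribute('href')
    print(link)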

I can't get any further. How should I then make it visit these links in turn and get the text of the articles for each one?

When I tried adding browser.get(link) inside the for loop, I got a 'selenium.common.exceptions.StaleElementReferenceException'.
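
The stale element error is most likely because every WebElement in lnks points at a node in the search results page; as soon as the browser navigates to an article, those references no longer point at anything. A minimal sketch of one way around it, assuming the hrefs can all be read before leaving the results page: copy them out as plain strings first, then visit them one by one in the same browser. The names collect_hrefs, fetch_article_texts and main are hypothetical, and taking the text of the whole body tag is only a stand-in, since the real article markup is not known here.

```python
import time


def collect_hrefs(driver, start=20, stop=40):
    """Copy the href strings out of the anchors while the page is still loaded."""
    # "tag name" is the string value of selenium's By.TAG_NAME
    anchors = driver.find_elements("tag name", "a")[start:stop]
    hrefs = [a.get_attribute("href") for a in anchors]
    return [h for h in hrefs if h]  # drop anchors without an href


def fetch_article_texts(driver, hrefs):
    """Visit each URL in the same browser session and return {url: visible text}."""
    texts = {}
    for href in hrefs:
        driver.get(href)
        # Assumption: the whole <body> text is good enough; a narrower
        # locator for the article container would normally be better.
        texts[href] = driver.find_element("tag name", "body").text
    return texts


def main():
    # Imported here so the helpers above can be read and tested without a browser.
    from selenium import webdriver

    browser = webdriver.Firefox()
    browser.get("LINK")  # placeholder URL from the question
    time.sleep(60)  # time to log in by hand
    hrefs = collect_hrefs(browser)
    for url, text in fetch_article_texts(browser, hrefs).items():
        print(url, len(text))
    browser.quit()
```

Calling main() runs this against a real browser. Because the same browser instance is reused for every link, the login should normally carry over, although that depends on how the site handles sessions.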

On the request of the database owner, I have removed the screenshots previously posted in this post, as well as information about the database. I would like to delete the post completely, but am unable to do so.

Nick Olczak
  • duplicate of https://stackoverflow.com/q/64865614/1387701? – DMart Nov 17 '20 at 15:38
  • Thanks, that link looks useful and covers similar ground. I see I need to open the links in a new window, but I don't know Java to do so, and if I do that then I would need to log in again each time, no? – Nick Olczak Nov 17 '20 at 15:52

1 Answer

You should look up some bs4 (BeautifulSoup) tutorials, but here is a starter:

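import bs4

# `browser` is the logged-in Selenium WebDriver from the question
html_source_code = browser.execute_script("return document.body.innerHTML;")
soup = bs4.BeautifulSoup(html_source_code, 'lxml')  # 'lxml' needs the lxml package installed
# replace the placeholder with the tag that matches the result links
links = soup.find_all('what-ever-the-html-code-is')
for l in links:
    print(l['href'])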
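
One thing to watch with this approach: bs4 returns the href exactly as written in the HTML, so it may be a relative path, and tags without an href will raise a KeyError on l['href']. Here is a rough sketch of how that could be handled, assuming the links are plain a tags. The function name extract_links and the sample HTML are made up for illustration, and it uses the built-in 'html.parser' instead of 'lxml' only to avoid the extra dependency.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href at all
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


# Made-up sample markup standing in for the search results page
sample_html = (
    '<a href="/article/1">One</a>'
    '<a name="x">no href</a>'
    '<a href="https://example.org/2">Two</a>'
)
print(extract_links(sample_html, "https://example.org/search"))
```

With Selenium this would be called as extract_links(browser.page_source, browser.current_url), and each returned URL can then be passed to browser.get().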

coderoftheday