import newspaper
import urllib.parse

class ArticleScraper:
    def __init__(self):
        pass

    def articleScraper(self, article_links):
        article_content = []
        for url in article_links:
            # Download and parse each article with newspaper3k
            article = newspaper.Article(url=url, language='en')
            article.download()
            article.parse()
            content = f"TITLE: {article.title} ARTICLES: {article.text}"
            print(urllib.parse.unquote(content))
            article_content.append(content)

        return "\n".join(article_content)

sol = ArticleScraper()
print(sol.articleScraper(list_of_urls))  # list_of_urls is defined elsewhere

This is my current code, and the problem I'm having is that when it outputs the content, the UTF-8 text doesn't come through correctly.

like this: (screenshot of the garbled output)

I've tried using urllib3 and bs4 as well, with no luck on urllib3; with bs4 the encoding and decoding works, but I wanted to use newspaper3k because it's more efficient for scraping.
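A quick way to tell whether the text was scraped correctly and only the display is broken is to write the result to a file opened explicitly as UTF-8 and inspect it in an editor. This is a minimal sketch assuming the ArticleScraper above; the file name scraped.txt is just an example:

    # Write the scraped text to a UTF-8 file instead of printing it.
    # If the characters look right in the file, the scraping itself is fine
    # and only the console is misrendering them.
    text = sol.articleScraper(list_of_urls)
    with open("scraped.txt", "w", encoding="utf-8") as f:
        f.write(text)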

3 Comments
  • Maybe it is not in UTF-8 (some pages still use other encodings), or maybe the problem is only your terminal, which may not use UTF-8. Better to show the URL of the page and create a minimal working example so we can check the problem. Commented Aug 31 at 12:25
  • You're right, it was a problem in my terminal: it couldn't process UTF-8. I tried it in a different interpreter and all the special characters showed. Commented Aug 31 at 12:53
  • Change default code page of Windows console to UTF-8 - Super User. Commented Aug 31 at 18:17
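Since the comments point to the console encoding rather than the scraping, one way to work around it from the Python side is to force the interpreter's standard output to UTF-8. A minimal sketch, assuming Python 3.7+ where sys.stdout.reconfigure() is available (setting the PYTHONIOENCODING=utf-8 environment variable, or running chcp 65001 in the Windows console, are alternatives):

    import sys

    # Force stdout to UTF-8 regardless of the console's default code page.
    # Put this before any print() calls that emit non-ASCII text.
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

    print("Café, naïve, 東京")  # special characters should now print intact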

