Do y’all know about Textise? I didn’t see any mention of it in a quick search. https://www.textise.net/

It can be used with the DuckDuckGo bang !textise

It also works over Tor, where I can use it as a proxy to avoid Cloudflare checkpoints.

I don’t think it is open source, but I’m not completely sure.

Copy from the site intro:

Textise is a new way of looking at the Web. It’s an internet tool that removes everything from a web page except for its text. In practice, this means that images, forms, scripts, adverts, they all go, leaving plain text. Find out more here… (https://textise.wordpress.com/about-textise/)

How to use this page

  1. Type or paste the URL of a web page into the box below and click “Textise”. A text only version of the web page will be displayed.
  2. Type a search term into the box, select a search engine from the drop-down list, and click “Search”. You will be taken to a text only version of the search results.
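For anyone curious what "removing everything except text" could look like in practice, here is a rough sketch using only Python's standard library. To be clear, this is not Textise's code (as far as I know the source isn't published); the class name `TextOnly`, the function `textise_like`, and the sample HTML are all made up for illustration, and a real page would need more care with whitespace, links, and encodings.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags, scripts, and styles."""

    # Elements whose contents are never shown to the reader.
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Images, forms, and other tags are simply ignored; only SKIP tags
        # change state so that their inner content is discarded too.
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def textise_like(html):
    """Return the text of an HTML string, one text node per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><script>alert(1)</script>"
    "<p>Hello <b>world</b></p><img src='a.png'>"
    "<form><input value='x'></form></body></html>"
)
print(textise_like(sample))
```

Each text node ends up on its own line, so inline formatting like `<b>` splits a sentence across lines. That is a limitation of this sketch, not a claim about how Textise behaves.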
  • deleted@lemmy.world · 1 year ago

    Technically, you’re correct.

    However, many websites don’t follow the appropriate HTML standards and just abuse h1 and p.

    I just tried it with Google.com and it seems to strip out all the HTML markup, leaving only the text.

    It’s useful in some cases, such as one-page WordPress websites that put their story, mission, products, etc. on a single page.