• keepthepace@slrpnk.net
    English
    arrow-up
    6
    ·
    18 days ago

    I love that PDFs are so difficult to transform into HTML, too

    FYI, if that’s relevant to your field, every new article published on arxiv.org now has an HTML render as well.

    And for many older publications, changing “arxiv.org” to “ar5iv.org” in the URL leads to an HTML rendering, from a best-effort experiment they ran for a while.
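
    If you want to script that URL swap, here is a minimal sketch. The function name `to_ar5iv` and the sample paper ID are only illustrative, and it assumes nothing beyond the host change described above (no guarantee that a given paper actually has a rendering there).

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url: str) -> str:
    """Swap an arxiv.org host for ar5iv.org; leave any other URL untouched."""
    parts = urlsplit(url)
    if parts.netloc.lower() in ("arxiv.org", "www.arxiv.org"):
        # Only the host changes; path, query and fragment are kept as-is.
        parts = parts._replace(netloc="ar5iv.org")
    return urlunsplit(parts)


# Illustrative paper ID, not one mentioned in this thread.
print(to_ar5iv("https://arxiv.org/abs/1706.03762"))
```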

    • JackbyDev@programming.dev
      English
      arrow-up
      2
      ·
      18 days ago

      That’s really cool! What I would really like is a tool that converts PDFs to semantic HTML files. I took a peek there and it seems easier for them because they have the original LaTeX source.

      I think for arbitrary PDF files the information just isn’t there. I’ve looked into it a bit and it’s sort of all over the place. A tool called pdf2htmlEX is pretty good, but it makes the HTML look exactly like the PDF.

      • keepthepace@slrpnk.net
        English
        arrow-up
        2
        ·
        18 days ago

        Yes, PDFs are much more permissive and may not have any semantic information at all. Hell, some old publications are just scanned images!

        PDF -> semantic seems to be a hard problem that basically requires OCR, like these people are doing

        • thevoidzero@lemmy.world
          English
          arrow-up
          1
          ·
          16 days ago

          Not just semantics. PDFs don’t even have segmentation like spaces/lines/paragraphs. It’s just text drawn at whatever locations the text processor (or any other software) placed it. Many PDF editors just look at how close the characters are to group them together.
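
          To make the "closeness" idea concrete, here is a rough sketch of that kind of grouping. It is not how any particular PDF editor does it; the `Glyph` type, the tolerance values and the sample coordinates are all made up for illustration, and it assumes you already have each character's position and width.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the character
    y: float      # baseline (PDF coordinates: larger y is higher on the page)
    width: float


def glyphs_to_lines(glyphs, y_tol=2.0, gap_factor=0.3):
    """Rebuild lines of text from positioned characters using distance heuristics."""
    # 1. Bucket glyphs into lines: same line if baselines are within y_tol.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    # 2. Within each line, read left to right and guess where the spaces were.
    result = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_factor * prev.width:
                text += " "  # a wide gap is treated as a word break
            text += cur.char
        result.append(text)
    return result


# Characters in arbitrary drawing order, as a PDF is free to store them.
sample = [
    Glyph("o", 20, 100, 6), Glyph("k", 6, 88, 6), Glyph("H", 0, 100, 6),
    Glyph("y", 14, 100, 6), Glyph("o", 0, 88, 6), Glyph("i", 6, 100, 3),
]
print(glyphs_to_lines(sample))
```

          Real documents break this quickly (columns, justified text, rotated text), which is why the thresholds here are guesses rather than anything principled.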

          One step further, you can convert text to paths, which leaves no glyph (character) or font information at all: every character is just a geometric shape. In that case you can’t even copy the text, and OCR is your only choice.

          PDF is for finalizing something and printing/sharing without the ability to edit.