All times are UTC - 4
Makoto

Joined: 03 Sep 2009 Posts: 2097 Location: Out wandering... maybe.
Posted: Tue 21 Aug 2012, 01:58 Post subject: Simple converter for .doc/.odt to HTML?
Is there anything simple that will let me convert document files (mostly .doc and .odt, I guess), with their styles, etc., to an HTML file? I can use (and have used) OpenOffice/LibreOffice, but that keeps parity with the MS Office way of creating an HTML file from a document, and adds a LOT of unnecessary code overhead to the resulting HTML file.
Seamonkey's Composer won't open those document filetypes directly. I can use it (among other HTML editors, of course) to try to strip out what the Office programs have done to the text... but that's a massive undertaking. (Though if there's an automatic way to 'optimize' the HTML page in Seamonkey, I wouldn't mind.)
_________________ [ Puppy 4.3.1 JP, Frugal install | 1GB RAM | 1.3GB swap ] * [ Puppy Precise 5.7.1 JP, Frugal install ]
In memory of our beloved American Eskimo puppy (1995-2010) and black Lab puppy (1997-2011).
don570

Joined: 10 Mar 2010 Posts: 4989 Location: Ontario
Posted: Tue 21 Aug 2012, 20:17
There is a nicely written word processor that opens Microsoft Word documents and saves to various formats, including HTML. Softmaker 2012 beta is the latest version. It's a commercial product, but there's a free trial, so you can find out if it's good enough.
Here's more info ---->
http://murga-linux.com/puppy/viewtopic.php?p=647950#647950
Makoto

Joined: 03 Sep 2009 Posts: 2097 Location: Out wandering... maybe.
Posted: Wed 22 Aug 2012, 00:20
Yeah, but I'm a little reluctant to install another office suite just for that one 'simple' feature.
Why MS Office/Word and Open/LibreOffice feel they have to add that much code even for a simple text page in HTML, I don't know. I tried exactly that recently with a document that had a monospace font set for the whole thing, and the resulting HTML page from OpenOffice redefined the font on every single line of text. Among other things, of course.
...Then again, I experimented with doing it in AbiWord. Not only did I lose some of the formatting, but it also insisted on adding CSS rules to the document. (It's just plain text, with the occasional italicized, bolded and maybe underlined word. That's all. No real need for a stylesheet, is there? (No, really. I'm not sure.))
technosaurus

Joined: 18 May 2008 Posts: 4787 Location: Kingwood, TX
Posted: Sun 26 Aug 2012, 22:21
Abiword
_________________ Check out my github repositories. I may eventually get around to updating my blogspot.
Makoto

Joined: 03 Sep 2009 Posts: 2097 Location: Out wandering... maybe.
Posted: Mon 27 Aug 2012, 05:30
I did try Abiword, as I mentioned above. Not only did it insist on adding CSS to the document, it still generated an HTML page around the same size as the versions Word and OpenOffice created.
All of them converted a roughly 66k (7-bit) text document into a roughly 166k HTML file. Text should not need 100k of markup.
(I used to do it manually, so I should know.)
technosaurus

Joined: 18 May 2008 Posts: 4787 Location: Kingwood, TX
Posted: Mon 27 Aug 2012, 10:02
That sounds about right, actually, for preserving formatting... they have to cover cases that aren't as simple as yours. If you don't care about preserving the formatting at all, convert to text and then:
Code:
# write the page header and open a <pre> block, which keeps the text's line breaks
echo '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title></title>
</head>
<body>
<pre>' > file.html
# append the plain text as-is
cat file.txt >> file.html
# close the <pre> block and the page
echo '</pre>
</body>
</html>' >> file.html
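One caveat with the snippet above: it copies file.txt verbatim, so any `<`, `>` or `&` in the text would be read as markup by a browser. A minimal sketch of the same idea with those three characters escaped; `txt2html` is a hypothetical function name, and it reads stdin and writes stdout instead of using fixed file names:

```shell
# txt2html: wrap plain text from stdin in a minimal HTML page on stdout.
txt2html() {
    # page header, then open a <pre> block so line breaks are kept
    printf '%s\n' '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">' \
        '<html>' '<head>' '<title></title>' '</head>' '<body>' '<pre>'
    # escape & first, so the & in &lt; and &gt; is not escaped a second time
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
    # close the <pre> block and the page
    printf '%s\n' '</pre>' '</body>' '</html>'
}
```

It would be used as `txt2html < file.txt > file.html`. The markup overhead is fixed (a few lines), so the page stays close to the size of the text, though italics, bold and underlines are lost along with everything else.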
Makoto

Joined: 03 Sep 2009 Posts: 2097 Location: Out wandering... maybe.
Posted: Mon 27 Aug 2012, 22:49
I know, but there's usually something about the generated HTML that just seems... weird, for whatever reason. Much more than it probably needs to be, maybe. Like OpenOffice's insistence on restating the font on every single line of text (sure, I set a monospace font for the entire document, but does it really need to be renewed on every line?). Or an earlier version of MS Word insisting on tokenizing practically everything.
Of course, I'll be the first to admit I'm not any sort of expert on HTML.
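For the repeated font declarations specifically, one blunt option is to delete the font tags after exporting. A minimal sketch, assuming the export wraps text in `<font ...>...</font>` pairs (the exact markup OpenOffice emits is not shown in this thread) and that no tag is split across lines; `strip_font_tags` is a hypothetical name:

```shell
# strip_font_tags: remove opening and closing <font> tags from HTML on stdin.
# Matching is case-insensitive via bracket expressions; text between the tags is kept.
strip_font_tags() {
    sed -e 's/<[Ff][Oo][Nn][Tt][^>]*>//g' -e 's/<\/[Ff][Oo][Nn][Tt]>//g'
}
```

It would be used as `strip_font_tags < exported.html > smaller.html`. Removing the tags also removes the monospace setting, so the font would have to be restated once for the whole page if it matters.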
technosaurus

Joined: 18 May 2008 Posts: 4787 Location: Kingwood, TX
Posted: Tue 28 Aug 2012, 22:29
Still sticking by my original suggestion. I just tested abiword-2.8.6 in Wary 5.3 on /usr/share/examples/test.doc... just uncheck all of the boxes when you save as HTML. It actually reduced the total size fourfold and looks acceptable.
(BTW, AbiWord does have a command-line interface that you can use for batch processing.)
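As a rough illustration of the batch processing mentioned above, the sketch below loops over a directory and calls AbiWord's `--to=html` conversion option on each document. `batch_to_html` and the `ABIWORD` override variable are hypothetical names, and whether the command-line export matches what the save dialog produces with all boxes unchecked is not confirmed in this thread:

```shell
# batch_to_html [DIR]: convert every .doc and .odt file in DIR (default: .) to HTML.
# ABIWORD is a hypothetical override so the loop can be tried with a stand-in command.
batch_to_html() {
    dir="${1:-.}"
    for f in "$dir"/*.doc "$dir"/*.odt; do
        [ -e "$f" ] || continue    # skip a pattern that matched no files
        # AbiWord is expected to write the .html file next to the input file
        "${ABIWORD:-abiword}" --to=html "$f" || echo "failed: $f" >&2
    done
}
```

Running `batch_to_html ~/docs` would then convert each document in turn and print a message for any file that fails.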
Makoto

Joined: 03 Sep 2009 Posts: 2097 Location: Out wandering... maybe.
Posted: Wed 29 Aug 2012, 00:18
Abiword eats (doesn't support) some of the simple formatting elements I use, though, like horizontal lines. They disappear from the document when I load it... and, of course, don't make it into the final HTML.
(That's aside from the fact that Abiword usually behaves rather badly, for me. I'm surprised I managed to get it to export a document to an HTML page without something bad happening, aside from the missing elements.)
Hmm... wonder how much of a dent HTML Tidy might make in it?
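One way to find out would be to run HTML Tidy over an exported page and compare the sizes. The sketch below is only that, a sketch: `report_saving`, `tidy_and_report` and the file names are hypothetical, and nothing in this thread measures what Tidy actually saves. The `-clean` option asks Tidy to replace `<font>` and similar tags with CSS rules, which should merge repeated font declarations but does add a small stylesheet.

```shell
# report_saving BEFORE AFTER: print both file sizes and the percentage saved.
report_saving() {
    # $(( ... )) also strips the padding some wc implementations print
    before=$(( $(wc -c < "$1") ))
    after=$(( $(wc -c < "$2") ))
    if [ "$before" -eq 0 ]; then
        echo "$1 is empty" >&2
        return 1
    fi
    echo "before: $before bytes, after: $after bytes, saved: $(( (before - after) * 100 / before ))%"
}

# tidy_and_report IN OUT: tidy IN into OUT, then report the size change.
tidy_and_report() {
    # -q: quiet, -clean: replace <font> and similar tags with CSS,
    # -wrap 0: do not re-wrap lines, -o: output file
    tidy -q -clean -wrap 0 -o "$2" "$1"
    # Tidy exits with status 1 when it only has warnings; 2 or more is a failure
    [ $? -le 1 ] || return 1
    report_saving "$1" "$2"
}
```

It would be used as `tidy_and_report exported.html tidied.html`. For scale, going from the 166k HTML file mentioned earlier back down to the 66k of the original text would be a saving of about 60%.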