Commit e7a1270

MM-12877
Additional updates to resolve some earlier oversights:
- Ensure that we can work with documents that lack a <head> section. Previously we assumed that such a section always existed, but <head> is optional in HTML5 and is sometimes omitted in other HTML versions even where it is technically required.
- If Encoding::FixLatin is available, use it when no charset is detected; otherwise fall back to ASCII.
- Update the version to 4002 and add some documentation.
1 parent fd8f7b6 commit e7a1270
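
The commit treats Encoding::FixLatin as an optional dependency and probes for it inside an `eval`. A minimal sketch of that probing idiom follows; `has_module` is a hypothetical helper name used only for illustration and is not part of CSS::Inliner.

```perl
use strict;
use warnings;

# Returns 1 if the named module can be loaded, 0 otherwise.
# has_module() is a hypothetical helper, not part of CSS::Inliner.
sub has_module {
    my ($module) = @_;

    # Convert Some::Module into Some/Module.pm so require accepts a string.
    (my $file = "$module.pm") =~ s{::}{/}g;

    # require dies when the module is missing; eval traps that failure.
    return eval { require $file; 1 } ? 1 : 0;
}

print "Encode available:   ", has_module('Encode'), "\n";
print "FixLatin available: ", has_module('Encoding::FixLatin'), "\n";
```

Probing once and caching the result, as the commit does in `new`, avoids repeating the `require` attempt on every document.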

3 files changed: 69 additions, 29 deletions

ChangeLog

Lines changed: 2 additions & 1 deletion
@@ -211,12 +211,13 @@
     - Adds proper headers for remote fetching of files
   * Fix issues within pod documentation

-4001 2015-11-23 Kevin Kamel <[email protected]>
+4002 2015-11-23 Kevin Kamel <[email protected]>
   * Update POD within Inliner.pm such that it generates more consistent documentation for CPAN/GitHub
   * Set URI flag allowing urls containing leading dots to be handled correctly
   * Extend support for foreign character sets
     - implement charset detection algorithm, roughly based off of HTML5 W3C specification
     - implement character encoding/decoding based upon detected charset
+    - implement fallback mode for when no charset is detected, leverage Encoding::FixLatin if available
     - add tests for exercising new charset related features
     - update documentation regarding new methods to support foreign charsets
   * Add reference to contributor Dave Gray ([email protected]) to contributors section

README

Lines changed: 11 additions & 5 deletions
@@ -45,9 +45,9 @@ METHODS

   fetch_file
     Fetches a remote HTML file that supposedly contains both HTML and a
-    style declaration, properly tags the data with the proper characterset
-    as provided by the remote webserver (if any). Subsequently calls the
-    read method automatically.
+    style declaration, properly tags the data with the proper charset as
+    provided by the remote webserver (if any). Subsequently calls the read
+    method automatically.

     This method expands all relative urls, as well as fully expands the
     stylesheet reference within the document.
@@ -67,8 +67,8 @@ METHODS

   read_file
     Opens and reads an HTML file that supposedly contains both HTML and a
-    style declaration. It subsequently calls the read() method
-    automatically.
+    style declaration, properly tags the data with the proper charset if
+    specified. It subsequently calls the read() method automatically.

     This method requires you to pass in a params hash that contains a
     filename argument. For example:
@@ -113,6 +113,12 @@ METHODS
     "determining the character encoding" section:
     http://www.w3.org/TR/html5/syntax.html

+    NOTE: In the event that no charset can be identified the library will
+    handle the content as a mix of UTF-8/CP-1252/8859-1/ASCII by attempting
+    to use the Encoding::FixLatin module, as this combination is relatively
+    common in the wild. Finally, if Encoding::FixLatin is unavailable the
+    content will be treated as ASCII.
+
     Input Parameters: content - scalar presumably containing both html and
     css charset - (optional) programmer specified charset for the passed
     content ctcharset - (optional) content-type specified charset for
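
The NOTE above describes a three-step decoding order: a detected charset, then Encoding::FixLatin, then ASCII. The sketch below shows one way that order could be expressed as a standalone function. `decode_with_fallback` and `$HAVE_FIXLATIN` are hypothetical names for illustration; the commit itself implements this inline in `fetch_file` and `read_file`.

```perl
use strict;
use warnings;
use Encode ();

# Probe for the optional module once, at load time.
my $HAVE_FIXLATIN = eval { require Encoding::FixLatin; 1 } ? 1 : 0;

# decode_with_fallback() is a hypothetical helper, not part of CSS::Inliner.
sub decode_with_fallback {
    my ($bytes, $charset) = @_;

    # 1. A detected or caller-supplied charset takes priority.
    return Encode::decode($charset, $bytes) if $charset;

    # 2. No hint: let Encoding::FixLatin handle mixed
    #    UTF-8/CP-1252/8859-1/ASCII input, if it is installed.
    return Encoding::FixLatin::fix_latin($bytes) if $HAVE_FIXLATIN;

    # 3. Last resort: treat the content as ASCII.
    return Encode::decode('ascii', $bytes);
}

my $text = decode_with_fallback("caf\xc3\xa9", 'UTF-8');
printf "decoded %d characters\n", length $text;
```

With Encode's default error handling, bytes that are invalid for the chosen encoding are replaced with a substitution character, so the ASCII fallback in step 3 is lossy for any non-ASCII input.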

lib/CSS/Inliner.pm

Lines changed: 56 additions & 23 deletions
@@ -2,7 +2,7 @@ package CSS::Inliner;
 use strict;
 use warnings;

-our $VERSION = '4001';
+our $VERSION = '4002';

 use Carp;
 use Encode;
@@ -39,7 +39,7 @@ support top level <style> declarations.
 =cut

 BEGIN {
-  my $members = ['stylesheet','css','html','html_tree','query','strip_attrs','relaxed','leave_style','warns_as_errors','content_warnings','agent','charset'];
+  my $members = ['stylesheet','css','html','html_tree','query','strip_attrs','relaxed','leave_style','warns_as_errors','content_warnings','agent','fixlatin'];

   #generate all the getter/setter we need
   foreach my $member (@{$members}) {
@@ -109,7 +109,7 @@ sub new {
     leave_style => (defined($$params{leave_style}) && $$params{leave_style}) ? 1 : 0,
     warns_as_errors => (defined($$params{warns_as_errors}) && $$params{warns_as_errors}) ? 1 : 0,
     agent => (defined($$params{agent}) && $$params{agent}) ? $$params{agent} : 'Mozilla/4.0',
-    charset => undef
+    fixlatin => eval { require Encoding::FixLatin; return 1; } ? 1 : 0
   };

   bless $self, $class;
@@ -122,7 +122,7 @@ sub new {
 =head2 fetch_file

 Fetches a remote HTML file that supposedly contains both HTML and a
-style declaration, properly tags the data with the proper characterset
+style declaration, properly tags the data with the proper charset
 as provided by the remote webserver (if any). Subsequently calls the
 read method automatically.

@@ -158,7 +158,21 @@ sub fetch_file {

   my $charset = $self->detect_charset({ content => $content, charset => $$params{charset}, ctcharset => $ctcharset });

-  my $decoded_html = $self->decode_characters({ content => $content, charset => $charset });
+  my $decoded_html;
+  if ($charset) {
+    $decoded_html = $self->decode_characters({ content => $content, charset => $charset });
+  }
+  else {
+    # no good hints found, do the best we can
+
+    if ($self->_fixlatin()) {
+      Encoding::FixLatin->import('fix_latin');
+      $decoded_html = fix_latin($content);
+    }
+    else {
+      $decoded_html = $self->decode_characters({ content => $content, charset => 'ascii' });
+    }
+  }

   my $html = $self->_absolutize_references({ content => $decoded_html, baseref => $baseref });

@@ -170,8 +184,8 @@ sub fetch_file {
 =head2 read_file

 Opens and reads an HTML file that supposedly contains both HTML and a
-style declaration. It subsequently calls the read() method
-automatically.
+style declaration, properly tags the data with the proper charset
+if specified. It subsequently calls the read() method automatically.

 This method requires you to pass in a params hash that contains a
 filename argument. For example:
@@ -203,7 +217,21 @@ sub read_file {

   my $charset = $self->detect_charset({ content => $content, charset => $$params{charset} });

-  my $decoded_html = $self->decode_characters({ content => $content, charset => $charset });
+  my $decoded_html;
+  if ($charset) {
+    $decoded_html = $self->decode_characters({ content => $content, charset => $charset });
+  }
+  else {
+    # no good hints found, do the best we can
+
+    if ($self->_fixlatin()) {
+      Encoding::FixLatin->import('fix_latin');
+      $decoded_html = fix_latin($content);
+    }
+    else {
+      $decoded_html = $self->decode_characters({ content => $content, charset => 'ascii' });
+    }
+  }

   $self->read({ html => $decoded_html, charset => $charset });

@@ -249,7 +277,6 @@ sub read {
   $self->_html_tree->parse_content($$params{html});

   $self->_init_query();
-  $self->_charset($$params{charset});

   #suck in the styles for later use from the head section - stylesheets anywhere else are invalid
   my $stylesheet = $self->_parse_stylesheet();
@@ -270,6 +297,11 @@ which lays out a recommendation for determining the character set of a received
 can be seen here under the "determining the character encoding" section:
 http://www.w3.org/TR/html5/syntax.html

+NOTE: In the event that no charset can be identified the library will handle the content as a mix of
+UTF-8/CP-1252/8859-1/ASCII by attempting to use the Encoding::FixLatin module, as this combination
+is relatively common in the wild. Finally, if Encoding::FixLatin is unavailable the content will be
+treated as ASCII.
+
 Input Parameters:
  content - scalar presumably containing both html and css
  charset - (optional) programmer specified charset for the passed content
@@ -304,8 +336,7 @@ sub detect_charset {
       $charset = $meta_charset;
     }
     else {
-      # no hints found, assume ascii until support for additional steps from the working group is added
-      $charset = 'ascii';
+      # no hints found...
     }
   }

@@ -938,21 +969,23 @@ sub _extract_meta_charset {

   my $head = $doc->look_down("_tag", "head"); # there should only be one

-  # pull key header meta elements
-  my $meta_charset_elem = $head->look_down('_tag','meta','charset',qr/./);
-  my $meta_equiv_charset_elem = $head->look_down('_tag','meta','http-equiv',qr/content-type/i,'content',qr/./);
-
-  # assign meta charset, we give precedence to meta http_equiv content type
   my $meta_charset;
-  if ($meta_equiv_charset_elem) {
-    my $meta_equiv_content = $meta_equiv_charset_elem->attr('content');
+  if ($head) {
+    # pull key header meta elements
+    my $meta_charset_elem = $head->look_down('_tag','meta','charset',qr/./);
+    my $meta_equiv_charset_elem = $head->look_down('_tag','meta','http-equiv',qr/content-type/i,'content',qr/./);
+
+    # assign meta charset, we give precedence to meta http_equiv content type
+    if ($meta_equiv_charset_elem) {
+      my $meta_equiv_content = $meta_equiv_charset_elem->attr('content');

-    if ($meta_equiv_content =~ /charset=(.*)(?:[";,]?)/i) {
-      $meta_charset = $1;
+      if ($meta_equiv_content =~ /charset=(.*)(?:[";,]?)/i) {
+        $meta_charset = $1;
+      }
+    }
+    elsif ($meta_charset_elem) {
+      $meta_charset = $meta_charset_elem->attr('charset');
     }
-  }
-  elsif ($meta_charset_elem) {
-    $meta_charset = $meta_charset_elem->attr('charset');
   }

   return $meta_charset;
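
A side note on the pattern `/charset=(.*)(?:[";,]?)/i` carried through this hunk: because `(.*)` is greedy, anything following the charset name on the same line (for example a further `; name=value` parameter) would be captured as well. The sketch below shows one possible tighter pattern. It is an illustration under that assumption, not part of this commit, and `charset_from_content_type` is a hypothetical helper name.

```perl
use strict;
use warnings;

# charset_from_content_type() is a hypothetical helper, not part of CSS::Inliner.
# It extracts the charset parameter from a Content-Type style string.
sub charset_from_content_type {
    my ($content) = @_;
    return unless defined $content;

    # Stop capturing at whitespace, quotes, semicolons, or commas so that
    # trailing parameters are not included in the charset name.
    if ($content =~ /charset\s*=\s*["']?([^\s"';,]+)/i) {
        return lc $1;
    }

    return;
}

for my $value ('text/html; charset=UTF-8', 'text/html; charset=ISO-8859-1; foo=bar', 'text/html') {
    my $charset = charset_from_content_type($value);
    printf "%-45s => %s\n", $value, defined $charset ? $charset : '(none)';
}
```

Lowercasing the result is a design choice made here for easier comparison; Encode accepts charset names case-insensitively, so it should not affect decoding.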
