Commit 16e323a

Merge pull request #13 from kamelkev/utf8
Merge branch utf8 to master
2 parents f0f953f + 276af58 commit 16e323a

21 files changed: +810 -583 lines changed

ChangeLog

Lines changed: 20 additions & 0 deletions
@@ -210,3 +210,23 @@
  * Add patch provided by Dave Gray ([email protected])
  - Adds proper headers for remote fetching of files
  * Fix issues within pod documentation
+
+ 39XX 2015-11-23 Kevin Kamel <[email protected]>
+ * Update POD within Inliner.pm such that it generates more consistent documentation for CPAN/GitHub
+ * Set URI flag allowing urls containing leading dots to be handled correctly
+ * Extend support for foreign character sets
+ - implement charset detection algorithm, roughly based off of HTML5 W3C specification
+ - implement character encoding/decoding based upon detected charset
+ - add tests for exercising new charset related features
+ - update documentation regarding new methods to support foreign charsets
+ * Add reference to contributor Dave Gray ([email protected]) to contributors section
+ * Add reference to contributor Chelsea Rio ([email protected]) to contributors section
+ * Add new TreeBuilder configuration method, which ensures all instances are configured identically
+ * Remove all entity handling intentionally or unintentionally done, retain original state of all read chars
+ - Modify configuration of all TreeBuilder instances, remove all entity decoding done during parsing
+ - Modify configuration of TreeBuilder output, skip calls for entity encoding
+ - strip all documentation and argument handling related to entity encoding
+ - All entity encoding is now the responsibility of the caller
+ * Update MANIFEST to reference all added tests/assets
+ * Fix minor formatting issues within some tests/assets
+ * Address concerns raised by CPAN RT96414, conditionally test for connectivity instead of outright failing
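
The charset detection entry above does not spell out the order in which sources are consulted. The Perl sketch below shows one plausible precedence, loosely in the spirit of the HTML5 guidance: a caller-supplied charset first, then a Content-Type charset, then a <meta> declaration in the markup, then a fallback. The helper name guess_charset, its argument handling, and the 'ascii' fallback are assumptions made for illustration and are not taken from Inliner.pm; only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, NOT the module's detect_charset(): picks a charset by
# checking a caller-supplied value, then a Content-Type value, then a <meta>
# declaration, and finally an assumed fallback.
sub guess_charset {
    my (%args) = @_;
    my $content = defined $args{content} ? $args{content} : '';

    for my $candidate ($args{charset}, $args{ctcharset}) {
        next unless defined $candidate && length $candidate;
        # Encode::find_encoding returns undef for names it does not know
        return $candidate if Encode::find_encoding($candidate);
    }

    # Matches <meta charset="..."> as well as the http-equiv/content form
    if ($content =~ /<meta[^>]*?charset\s*=\s*["']?([\w.:-]+)/i) {
        my $declared = $1;
        return $declared if Encode::find_encoding($declared);
    }

    return 'ascii';    # assumed fallback for this sketch
}

# Raw bytes: "cafe" with an e-acute encoded as two UTF-8 bytes
my $html = qq{<html><head><meta charset="utf-8"></head><body>caf\xc3\xa9</body></html>};

my $charset = guess_charset(content => $html);
my $text    = Encode::decode($charset, $html);    # bytes -> characters

print "$charset\n";
```

Decoding once, up front, and keeping the result as a character string mirrors the README's note that decoding is expected to happen before read() is called.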

MANIFEST

Lines changed: 8 additions & 4 deletions
@@ -15,21 +15,22 @@ t/basic_media.t
  t/basic_pseudo.t
  t/basic_redeclare.t
  t/cascade.t
+ t/charset.t
  t/css_url.t
  t/custom_html_tree.t
  t/embedded_style_block.t
- t/encoding.t
- t/entities.t
+ t/entities_default.t
+ t/entities_utf8.t
  t/fetch-filter.perl
  t/fetch.perl
  t/html/acidtest.html
  t/html/acidtest_result.html
  t/html/atruletest.html
  t/html/atruletest_result.html
+ t/html/charset.html
+ t/html/charset_result.html
  t/html/embedded_style.html
  t/html/embedded_style_result.html
- t/html/encoding.html
- t/html/encoding_result.html
  t/html/linebreaktest.html
  t/html/linebreaktest_result.html
  t/html/linktest.html
@@ -40,6 +41,7 @@ t/html/pseudotest.html
  t/html/pseudotest_result.html
  t/html/relaxed.html
  t/html/relaxed_result.html
+ t/important.t
  t/linebreaktest.t
  t/linktest.t
  t/mediaquery.t
@@ -53,3 +55,5 @@ t/pod.t
  t/pseudotest.t
  t/relaxed.t
  t/specificity.t
+ t/utf8_attributes.t
+ t/utf8_content.t

README

Lines changed: 107 additions & 72 deletions
@@ -18,114 +18,149 @@ DESCRIPTION
  top level <style> declarations.

  METHODS
- new
- Instantiates the Inliner object. Sets up class variables that are
- used during file parsing/processing. Possible options are:
+ new
+ Instantiates the Inliner object. Sets up class variables that are used
+ during file parsing/processing. Possible options are:

- entities - (optional) Pass in a string containing characters to
- entity encode in all output, overrides the internal default provided
- by the module
+ html_tree - (optional) Pass in a fresh unparsed instance of
+ HTML::Treebuilder

- html_tree - (optional) Pass in a fresh unparsed instance of
- HTML::Treebuilder
+ NOTE: Any passed references to HTML::TreeBuilder will be substantially
+ altered by passing it in here...

- NOTE: Any passed references to HTML::TreeBuilder will be
- substantially altered by passing it in here...
+ strip_attrs - (optional) Remove all "id" and "class" attributes during
+ inlining

- strip_attrs - (optional) Remove all "id" and "class" attributes
- during inlining
+ leave_style - (optional) Leave style/link tags alone within <head>
+ during inlining

- leave_style - (optional) Leave style/link tags alone within <head>
- during inlining
+ relaxed - (optional) Relaxed HTML parsing which will attempt to
+ interpret non-HTML4 documents.

- relaxed - (optional) Relaxed HTML parsing which will attempt to
- interpret non-HTML4 documents.
+ NOTE: This argument is not compatible with passing an html_tree.

- NOTE: This argument is not compatible with passing an html_tree.
+ agent - (optional) Pass in a string containing a preferred user-agent,
+ overrides the internal default provided by the module for handling
+ remote documents

- agent - (optional) Pass in a string containing a preferred
- user-agent, overrides the internal default provided by the module
- for handling remote documents
+ fetch_file
+ Fetches a remote HTML file that supposedly contains both HTML and a
+ style declaration, properly tags the data with the proper characterset
+ as provided by the remote webserver (if any). Subsequently calls the
+ read method automatically.

- fetch_file
- Fetches a remote HTML file that supposedly contains both HTML and a
- style declaration, properly tags the data with the proper
- characterset as provided by the remote webserver (if any).
- Subsequently calls the read method automatically.
+ This method expands all relative urls, as well as fully expands the
+ stylesheet reference within the document.

- This method expands all relative urls, as well as fully expands the
- stylesheet reference within the document.
+ This method requires you to pass in a params hash that contains a url
+ argument for the requested document. For example:

- This method requires you to pass in a params hash that contains a
- url argument for the requested document. For example:
+ $self->fetch_file({ url => 'http://www.example.com' });

- $self->fetch_file({ url => 'http://www.example.com' });
+ Note that you can specify a user-agent to override the default
+ user-agent of 'Mozilla/4.0' within the constructor. Doing so may avoid
+ certain issues with agent filtering related to quirky webserver configs.

- Note that you can specify a user-agent to override the default
- user-agent of 'Mozilla/4.0' within the constructor. Doing so may
- avoid certain issues with agent filtering related to quirky
- webserver configs.
+ Input Parameters: url - the desired url for a remote asset presumably
+ containing both html and css charset - (optional) programmer specified
+ charset for the pass url

- read_file
- Opens and reads an HTML file that supposedly contains both HTML and
- a style declaration. It subsequently calls the read() method
- automatically.
+ read_file
+ Opens and reads an HTML file that supposedly contains both HTML and a
+ style declaration. It subsequently calls the read() method
+ automatically.

- This method requires you to pass in a params hash that contains a
- filename argument. For example:
+ This method requires you to pass in a params hash that contains a
+ filename argument. For example:

- $self->read_file({ filename => 'myfile.html' });
+ $self->read_file({ filename => 'myfile.html' });

- Additionally you can specify the character encoding within the file,
- for example:
+ Additionally you can specify the character encoding within the file, for
+ example:

- $self->read_file({ filename => 'myfile.html', charset => 'utf8' });
+ $self->read_file({ filename => 'myfile.html', charset => 'utf8' });

- read
- Reads passed html data and parses it. The intermediate data is
- stored in class variables.
+ Input Parameters: filename - name of local file presumably containing
+ both html and css charset - (optional) programmer specified charset of
+ the passed file

- The <style> block is ripped out of the html here, and stored
- separately. Class/ID/Names used in the markup are left alone.
+ read
+ Reads passed html data and parses it. The intermediate data is stored in
+ class variables.

- This method requires you to pass in a params hash that contains
- scalar html data. For example:
+ The <style> block is ripped out of the html here, and stored separately.
+ Class/ID/Names used in the markup are left alone.

- $self->read({ html => $html });
+ This method requires you to pass in a params hash that contains scalar
+ html data. For example:

- NOTE: You are required to pass a properly encoded perl reference to
- the html data. This method does *not* do the dirty work of encoding
- the html as utf8 - do that before calling this method.
+ $self->read({ html => $html });

- inlinify
- Processes the html data that was entered through either 'read' or
- 'read_file', returns a scalar that contains a composite chunk of
- html that has inline styles instead of a top level <style>
- declaration.
+ NOTE: You are required to pass a properly encoded perl reference to the
+ html data. This method does *not* do the dirty work of encoding the html
+ as utf8 - do that before calling this method.

- query
- Given a particular selector return back the applicable styles
+ Input Parameters: html - scalar presumably containing both html and css
+ charset - (optional) scalar representing the original charset of the
+ passed html

- specificity
- Given a particular selector return back the associated selectivity
+ detect_charset
+ Detect the charset of the passed content.

- content_warnings
- Return back any warnings thrown while inlining a given block of
- content.
+ The algorithm present here is roughly based off of the HTML5 W3C working
+ group document, which lays out a recommendation for determining the
+ character set of a received document, which can be seen here under the
+ "determining the character encoding" section:
+ http://www.w3.org/TR/html5/syntax.html

- Note: content warnings are initialized at inlining time, not at read
- time. In order to receive back content feedback you must perform
- inlinify first
+ Input Parameters: content - scalar presumably containing both html and
+ css charset - (optional) programmer specified charset for the passed
+ content ctcharset - (optional) content-type specified charset for
+ content retrieved via a url
+
+ decode_characters
+ Implement the character decoding algorithm for HTML as outlined by the
+ various working groups
+
+ Basically apply best practices for determining the applied character
+ encoding and properly decode it
+
+ It is expected that this method will be called before any calls to
+ read()
+
+ Input Parameters: content - scalar presumably containing both html and
+ css charset - known charset for the passed content
+
+ inlinify
+ Processes the html data that was entered through either 'read' or
+ 'read_file', returns a scalar that contains a composite chunk of html
+ that has inline styles instead of a top level <style> declaration.
+
+ query
+ Given a particular selector return back the applicable styles
+
+ specificity
+ Given a particular selector return back the associated selectivity
+
+ content_warnings
+ Return back any warnings thrown while inlining a given block of content.
+
+ Note: content warnings are initialized at inlining time, not at read
+ time. In order to receive back content feedback you must perform
+ inlinify first

  Sponsor
  This code has been developed under sponsorship of MailerMailer LLC,
  http://www.mailermailer.com/

  AUTHOR
- Kevin Kamel <[email protected]>
+ Kevin Kamel <[email protected]>

  CONTRIBUTORS
- Vivek Khera <[email protected]>, Michael Peters <[email protected]>
+ Dave Gray <[email protected]>
+ Vivek Khera <[email protected]>
+ Michael Peters <[email protected]>
+ Chelsea Rio <[email protected]>

  LICENSE
  This module is Copyright 2015 Khera Communications, Inc. It is licensed
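
This README change drops the entities option from new, and the ChangeLog states that entity encoding is now the caller's responsibility. A caller that still wants entity-encoded output would therefore need to post-process the inlined markup. A minimal sketch of that step in plain Perl follows, assuming the output is already a decoded character string. The helper encode_non_ascii is hypothetical and not part of CSS::Inliner, and $inlined merely stands in for whatever inlinify returned.

```perl
use strict;
use warnings;

# Hypothetical caller-side helper: replaces every character outside printable
# ASCII (tab, CR and LF are left alone) with a numeric character reference.
sub encode_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7E\n\r\t])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

# Stand-in for inlined output: contains U+00E9 and U+2603
my $inlined = "<p style=\"color: red;\">caf\x{e9} \x{2603}</p>";

print encode_non_ascii($inlined), "\n";
# prints: <p style="color: red;">caf&#233; &#9731;</p>
```

Numeric references are used here because they need no entity table. A caller who prefers named entities would need a mapping of their own or a dedicated module.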
