Performance issues on large files

Hi there,

I am trying to use your script to compare two large HTML files (about 500,000 tokens in each array), but it never reaches the end (and I've been pretty patient ;) )

I'm trying to find out what exactly is going on (now testing with files of around 30,000 tokens).
It takes about 68 seconds for the script to complete, which unfortunately is nowhere near fast enough for my needs.

I decided to see if I could find some things to improve, and the first thing I noticed is in the create_index function.

I think the code below should only be executed when index[token] == null.

    index[token] = []
    idx = p.in_these.indexOf token
    while idx isnt -1
      index[token].push idx
      idx = p.in_these.indexOf token, idx+1

Consider that among 30,000 tokens there are a lot of duplicates (e.g. " " (roughly 12,000 occurrences), "a", "and", "for", etc.).
Running the code as is means that you look up the positions of all 12,000 spaces in the target array 12,000 times, each time overwriting index[token].

Putting a check around the above lines immediately took a solid 15 seconds off the total time (the time for building the index went down from 16 seconds to 1).
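
A minimal sketch of the guarded version, written in plain JavaScript rather than the project's CoffeeScript so it can be run on its own. The function name `createIndex`, the parameter `findThese`, and the use of a `Map` are assumptions made for illustration; only `in_these` (here `inThese`) and the `indexOf` loop come from the snippet above.

```javascript
// Hypothetical stand-in for create_index (names are assumptions):
// maps each distinct token in findThese to the positions where it
// occurs in inThese.
function createIndex(findThese, inThese) {
  const index = new Map();
  for (const token of findThese) {
    // The guard: a token that was already scanned is skipped, so the
    // indexOf loop runs once per distinct token, not once per occurrence.
    if (index.has(token)) continue;

    const positions = [];
    let idx = inThese.indexOf(token);
    while (idx !== -1) {
      positions.push(idx);
      idx = inThese.indexOf(token, idx + 1);
    }
    index.set(token, positions);
  }
  return index;
}

// Small usage example: " " and "a" each appear twice in findThese,
// but each is scanned in inThese only once.
const index = createIndex(["a", " ", "b", " ", "a"], ["a", " ", "c", " ", "a"]);
console.log(index.get(" "));
```

With the guard, the cost of building the index depends on the number of distinct tokens rather than on the total number of tokens, which would be consistent with the drop from 16 seconds to 1.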

Just thought I'd let you know; I think this should be fixed in your main file.

I'm now trying to find a way to get the recursive search for matching blocks down to an acceptable execution time (at the moment that part takes a little over 50 seconds).

Any ideas on this would be great!

Thx,
Rutger Scheepens

