This repository was archived by the owner on Oct 30, 2018. It is now read-only.

Conversation

@raisercostin

Merged various forks to bring as many as possible of the improvements made to goose into the main trunk.

skyshard and others added 30 commits May 17, 2013 18:31
(cherry picked from commit 6c7f98523a0cee3e08d506f656850c8e29974602)
Conflicts:
	pom.xml
	src/main/scala/com/gravity/goose/network/HtmlFetcher.scala
	src/main/scala/com/gravity/goose/text/StopWords.scala
… not found properly inside the for() for crawler… leaving for later)
…?) doesn't actually do extraction in this case...
warrd and others added 30 commits October 13, 2014 18:17
Goose uses a HashSet for iterating topNode candidates
But HashSet doesn't guarantee iteration order, so when two candidates have
the same score, the choice is effectively arbitrary. This is not acceptable.
Now, by using LinkedHashSet, we make sure that in case of a tie we choose
the first tag that was found in the DOM tree.
Using LinkedHashSet to avoid inconsistency
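A minimal sketch of the tie-breaking behaviour described in the commit message above. It is not Goose's actual code: `Candidate`, `TopNodeTieBreak`, and `pickTopNode` are hypothetical names used only for illustration, and the scores are made up. It relies only on the documented property that `scala.collection.mutable.LinkedHashSet` iterates in insertion order.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a scored DOM node; not a Goose type.
final case class Candidate(tag: String, score: Int)

object TopNodeTieBreak {
  // Returns the highest-scoring candidate. The strict `>` means a later
  // candidate with an equal score never replaces an earlier one, so on a
  // tie the candidate that was iterated first wins.
  def pickTopNode(candidates: Iterable[Candidate]): Option[Candidate] =
    candidates.foldLeft(Option.empty[Candidate]) {
      case (None, c)       => Some(c)
      case (Some(best), c) => if (c.score > best.score) Some(c) else Some(best)
    }

  // LinkedHashSet iterates in insertion order, so adding candidates in the
  // order they were found in the DOM makes "iterated first" mean "found first".
  val inDomOrder: mutable.LinkedHashSet[Candidate] = mutable.LinkedHashSet(
    Candidate("div#intro", 40),
    Candidate("div#main", 75),
    Candidate("article", 75), // same score as div#main, but found later
    Candidate("footer", 5)
  )

  def main(args: Array[String]): Unit =
    println(pickTopNode(inDomOrder)) // prints Some(Candidate(div#main,75))
}
```

With a plain `mutable.HashSet`, the same fold could return either of the two tied candidates, because iteration order would depend on hash codes rather than on the order of insertion.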
Accept cookies from web sites which put all the cookies into one request
header.
Conflicts:
	build.sbt
	src/main/scala/com/gravity/goose/Configuration.scala
Conflicts:
	README.md
	build.sbt
	pom.xml
	src/main/scala/com/gravity/goose/Article.scala
	src/main/scala/com/gravity/goose/Configuration.scala
	src/main/scala/com/gravity/goose/opengraph/OpenGraphData.scala
	src/test/scala/com/gravity/goose/GooseTest.scala
Conflicts:
	pom.xml
	src/main/scala/com/gravity/goose/Article.scala
	src/main/scala/com/gravity/goose/Configuration.scala
	src/main/scala/com/gravity/goose/Crawler.scala
	src/main/scala/com/gravity/goose/images/ImageExtractor.scala
	src/main/scala/com/gravity/goose/images/StandardImageExtractor.scala
	src/main/scala/com/gravity/goose/images/UpgradedImageIExtractor.scala
	src/main/scala/com/gravity/goose/network/HtmlFetcher.scala
	src/test/scala/com/gravity/goose/TestUtils.scala
	src/test/scala/com/gravity/goose/TextExtractionsTest.scala