[WIP][SPARK-1405][MLLIB]collapsed Gibbs sampling based latent Dirichlet allocation #1983
QA tests have started for PR 1983 at commit
QA tests have finished for PR 1983 at commit
@mengxr This patch removed the
Tests timed out after a configured wait of
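@witgo Can the following piece of code be made multithreaded?
Change this code to
My situation is that each machine in the cluster has many CPU cores (24) but limited memory, so I cannot make full use of the CPU. I would like to multithread part of the code.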
@allwefantasy Spark lets you adjust the number of tasks an executor runs concurrently.
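@witgo Thanks for sharing this tip. I have run into another problem. Yesterday you asked how many words my 240,000-document corpus contains. I counted 24 million words, computed as (parsedData.map(f:Document=>f.content.size).sum()), and 80,000 terms. Initialization is very fast and finishes in about a minute. In the first iteration, however, each task needs to serialize about 26 MB of data, and once the log reaches "Cleaned broadcast" spark-shell stops responding. On a URL such as http://csdn-hdp-nn-01:4040/stages/stage/?id=11 every task is shown as running, and the old generation and other memory figures look normal on each worker, but the CPUs are idle and the tasks do not appear to be doing any work. Have you run into this problem? From that point on the job stays stuck with no response.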
@allwefantasy The current code creates too many TopicModel instances during the iterative computation. I am trying to fix this now.
@witgo OK. Please let me know when there is an update, and I can test it on my side right away.
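For a rough sense of scale, the corpus figures quoted in the exchange above can be turned into back-of-envelope numbers. This is only a sketch: the document, word, and term counts come from the comment, while the topic count (100), the 4-byte counters, and the dense term-by-topic layout are assumptions made for illustration and are not taken from the patch. The function names are hypothetical.

```python
def average_doc_length(total_words, num_docs):
    """Mean number of words per document."""
    return total_words / num_docs


def dense_term_topic_bytes(num_terms, num_topics, bytes_per_count=4):
    """Size of a dense term-by-topic count matrix (assumed layout, not the patch's)."""
    return num_terms * num_topics * bytes_per_count


# Figures reported in the comment above.
num_docs = 240_000
total_words = 24_000_000
num_terms = 80_000

# Assumption for illustration only: the topic count of that run is not stated.
assumed_topics = 100

avg_len = average_doc_length(total_words, num_docs)
model_mb = dense_term_topic_bytes(num_terms, assumed_topics) / 1e6

print(f"average document length: {avg_len:.0f} words")
print(f"dense term-topic counts at K={assumed_topics}: {model_mb:.0f} MB")
```

Under these assumptions a single copy of the counts is already tens of megabytes, so any per-task or per-iteration copy adds up quickly.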
@witgo @allwefantasy We had an offline discussion about LDA's implementation. Please check the JIRA page for the notes.
The current broadcast-based implementation loses a lot of performance, especially when the corpus is large. Next week I will submit a GraphX-based implementation.
@rxin @mengxr
The closure of the mapPartitions method does not seem to be cleaned correctly. The serialized corpus RDD and the serialized topicModel broadcast are almost the same size.
cat spark.log | grep 'stored as values in memory'
=>
14/09/13 00:47:59 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 218.2 KB, free 2.8 GB)
14/09/13 00:48:04 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 2.8 GB)
14/09/13 00:48:08 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.7 KB, free 2.8 GB)
14/09/13 00:48:20 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.4 KB, free 2.8 GB)
14/09/13 00:48:23 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.6 KB, free 2.8 GB)
14/09/13 00:48:25 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.6 KB, free 2.8 GB)
14/09/13 00:48:25 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.1 KB, free 2.8 GB)
14/09/13 00:48:30 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 2.9 KB, free 2.8 GB)
14/09/13 00:48:35 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 3.2 KB, free 2.8 GB)
14/09/13 00:48:44 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 68.6 KB, free 2.8 GB)
14/09/13 00:48:45 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 41.7 KB, free 2.8 GB)
14/09/13 00:49:21 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 197.5 MB, free 2.6 GB)
14/09/13 00:49:24 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 197.7 MB, free 2.3 GB)
14/09/13 00:53:25 INFO MemoryStore: Block broadcast_13 stored as values in memory (estimated size 163.9 MB, free 2.1 GB)
14/09/13 00:53:28 INFO MemoryStore: Block broadcast_14 stored as values in memory (estimated size 164.0 MB, free 1878.0 MB)
14/09/13 00:57:34 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 149.7 MB, free 1658.5 MB)
14/09/13 00:57:36 INFO MemoryStore: Block broadcast_16 stored as values in memory (estimated size 150.0 MB, free 1444.0 MB)
14/09/13 01:01:34 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 141.1 MB, free 1238.3 MB)
14/09/13 01:01:36 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 141.2 MB, free 1036.2 MB)
14/09/13 01:05:12 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 134.5 MB, free 840.7 MB)
14/09/13 01:05:14 INFO MemoryStore: Block broadcast_20 stored as values in memory (estimated size 134.7 MB, free 647.8 MB)
14/09/13 01:08:39 INFO MemoryStore: Block broadcast_21 stored as values in memory (estimated size 218.3 KB, free 589.5 MB)
14/09/13 01:08:39 INFO MemoryStore: Block broadcast_22 stored as values in memory (estimated size 218.3 KB, free 589.2 MB)
14/09/13 01:08:40 INFO MemoryStore: Block broadcast_23 stored as values in memory (estimated size 134.6 MB, free 454.6 MB)
14/09/13 01:08:53 INFO MemoryStore: Block broadcast_24 stored as values in memory (estimated size 129.3 MB, free 267.1 MB)
14/09/13 01:08:55 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 129.4 MB, free 82.0 MB)
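The pattern in the output above is easier to see when the lines are summarized. The following is a minimal sketch that assumes the MemoryStore line format shown in the log. The function names and the 100 MB threshold are hypothetical choices for illustration, and only three of the log lines are used as sample input.

```python
import re

# Matches MemoryStore lines such as:
# "... Block broadcast_11 stored as values in memory (estimated size 197.5 MB, free 2.6 GB)"
LINE_RE = re.compile(
    r"Block (broadcast_\d+) stored as values in memory "
    r"\(estimated size ([\d.]+) (B|KB|MB|GB|TB), free ([\d.]+) (B|KB|MB|GB|TB)\)"
)

# Conversion factors to megabytes, using 1024-based units.
UNIT_TO_MB = {
    "B": 1.0 / (1024 * 1024),
    "KB": 1.0 / 1024,
    "MB": 1.0,
    "GB": 1024.0,
    "TB": 1024.0 * 1024,
}


def parse_broadcast_sizes(lines):
    """Return (block_id, size_mb, free_mb) for every matching log line."""
    rows = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            block, size, size_unit, free, free_unit = match.groups()
            rows.append((
                block,
                float(size) * UNIT_TO_MB[size_unit],
                float(free) * UNIT_TO_MB[free_unit],
            ))
    return rows


def large_blocks(rows, threshold_mb=100.0):
    """Keep only the blocks whose estimated size reaches the threshold."""
    return [row for row in rows if row[1] >= threshold_mb]


# Three lines copied from the log above.
sample = [
    "14/09/13 00:48:45 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 41.7 KB, free 2.8 GB)",
    "14/09/13 00:49:21 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 197.5 MB, free 2.6 GB)",
    "14/09/13 00:49:24 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 197.7 MB, free 2.3 GB)",
]

rows = parse_broadcast_sizes(sample)
for block, size_mb, free_mb in large_blocks(rows):
    print(f"{block}: {size_mb:.1f} MB stored, {free_mb:.0f} MB free")
```

Run over the full log, a filter like this leaves mostly pairs of blocks of nearly equal size arriving a few seconds apart (197.5 MB and 197.7 MB, 163.9 MB and 164.0 MB, and so on), while free memory drops from 2.8 GB to 82.0 MB.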
@allwefantasy |
@witgo I have seen your Spark configuration for the new performance test. I will try your latest code and test it on my data today.
@witgo I have tried your latest code on my corpus. It no longer gets stuck in broadcasting. However, some exceptions are thrown.
@witgo Since we are converging on a GraphX-based implementation and distributed representation of the topic model, do you mind closing this PR? Thanks!
This PR is based on @yinxusen's #476
The performance test:
500
topics:100
1000
topics:100
1000
2000
topics:150
2000
conf/spark-defaults.conf: