Building a Code Index with SCIP

Code index construction

Use cases

Code navigation and code search

Mainstream approaches

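Format | Description
LSP | Suited to local repositories where the source code is available; the client interacts with an LSP server
LSIF | An offline index format that supports LSP-style queries
SCIP | Open-sourced by Sourcegraph; addresses problems encountered when using LSIF, such as the loosely specified JSON format and a series of issues caused by LSIF's auto-incrementing IDs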

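Exploring a self-developed index format

SCIP requires the project to be compiled, and the P50 compile time across Kuaishou is about 78 s. To reduce this cost we also explored implementing the indexing ourselves. The Java compilation process is as follows:

[Figure: Java compilation process]

In theory the same functionality can be achieved by generating an AST with java-parser and walking the tree. In testing, however, the AST-based approach ran into two problems that are hard to solve:

1. The tree produced by syntactic analysis carries language-level concepts such as Method and Field, whereas index construction is based on the lower-level notion of a "position", which is unrelated to the syntax tree. Building the index from the AST is therefore awkward. Our guess (unverified) is that SCIP does its processing as soon as the token stream has been generated.

2. Without a configured symbol resolver, AST parsing cannot produce fully qualified names, so in practice it still has to be used together with a build tool such as Maven. That leaves it with no advantage over the off-the-shelf SCIP solution.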
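The second problem can be illustrated with an analogy in Python, using the standard `ast` module rather than java-parser: a purely syntactic parse sees names only as they are written, and a separate resolution step is needed to recover qualified names. This is a minimal sketch; the sample source and the helper names `names_used` and `import_table` are made up for illustration.

```python
import ast

# Sample source (hypothetical): the alias OD hides the qualified name.
SOURCE = """
from collections import OrderedDict as OD
cache = OD()
"""

def names_used(source):
    """Return the identifiers that appear as plain names, exactly as written."""
    tree = ast.parse(source)
    return sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})

def import_table(source):
    """A tiny 'symbol resolver': map locally bound import names to qualified names."""
    table = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            for alias in node.names:
                table[alias.asname or alias.name] = f"{node.module}.{alias.name}"
    return table

syntactic = names_used(SOURCE)                        # what the AST alone gives
table = import_table(SOURCE)
resolved = [table.get(name, name) for name in syntactic]  # after resolution

print(syntactic)
print(resolved)
```

In Java the resolution step is harder than this single-file lookup, because wildcard imports, same-package classes and inherited members can only be resolved against the classpath, which is why a build tool is still needed.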

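Self-developed approach vs. SCIP

Approach | Pros | Cons
Self-developed | 1. Supports incremental code analysis, so it is fast; 2. Allows a simpler custom index format, improving storage efficiency and access speed | 1. High implementation effort; 2. Hard to validate at large scale; 3. Has to be made compatible with each language
SCIP | 1. Open source and proven by Sourcegraph; 2. Supports multiple languages | 1. Requires compilation; 2. The generated index format is complex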

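Overall implementation

Data generation and consumption

Index file generation

The index file is generated as shown in the figure below:

1. Trigger: an MR is created or updated.

2. The SCIP file is generated mainly by a Halo pipeline plugin.

3. The SCIP file for the whole repository is uploaded to blobstore for storage.

4. A RocketMQ message from the blob store notifies the parsing module to parse the file.

5. The parsed data is stored in MySQL.

[Figure: index file generation flow]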

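Index file parsing and storage

kdev-mr currently sees 2k to 3k new MRs per day across 800 to 1,000 repositories, with 5k to 6k versions and about 100,000 diff files (over all versions). Building the full repository every time would mean roughly 3,000 × 1,000 = 3 million builds. Taking the ks-serveree-cr project as an example, it contains 1.06 million occurrences (an occurrence is a symbol appearing at a given position in a given document). If all of them are indexed, the baseline data volume is roughly 1,000 × 1.06 million ≈ 1 billion rows. If the set of active repositories stays the same, later growth comes only from the occurrences in diff files. Assuming each version changes 10 files and each diff file contains 1,000 occurrences, each version adds about 10 × 1,000 = 10,000 rows, which at 5k to 6k versions per day is on the order of 50 to 60 million rows per day.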
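The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation: the variable names are made up, the upper ends of the quoted ranges are used, and the per-version figures (10 files, 1,000 occurrences per file) are the assumptions stated in the text.

```python
WAN = 10_000  # 1w (wan) = 10,000

mrs_per_day = 3_000                  # upper end of 2k-3k
active_repos = 1_000                 # upper end of 800-1,000
occurrences_per_repo = 106 * WAN     # ks-serveree-cr sample: 1.06 million
versions_per_day = (5_000, 6_000)
files_per_version = 10               # assumed in the text
occurrences_per_file = 1_000         # assumed in the text

full_builds = mrs_per_day * active_repos
baseline_rows = active_repos * occurrences_per_repo
rows_per_version = files_per_version * occurrences_per_file
rows_per_day = tuple(v * rows_per_version for v in versions_per_day)

print(f"full builds:      {full_builds:,}")
print(f"baseline rows:    {baseline_rows:,}")
print(f"rows per version: {rows_per_version:,}")
print(f"rows per day:     {rows_per_day[0]:,} to {rows_per_day[1]:,}")
```

The per-day figure is an upper bound before any deduplication of identical rows across versions.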

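Storage uses MySQL with sharded tables.

Storage structure:

To keep the design extensible, the storage structure separates the index itself from the business logic as far as possible. This phase is MR-triggered; if it is later changed to be repository-triggered, the underlying structure should need as few changes as possible.

Literal table: the base table; it avoids storing the same string repeatedly.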
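The idea behind the literal table can be sketched with an in-memory SQLite database standing in for MySQL. The table name `literal`, its columns and the helper `intern_string` are assumptions made for this example, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: each distinct string is stored once and referenced by id.
conn.execute(
    "CREATE TABLE literal (id INTEGER PRIMARY KEY, value TEXT NOT NULL UNIQUE)"
)

def intern_string(conn, value):
    """Return the id of value, inserting it only if it is not stored yet."""
    conn.execute("INSERT OR IGNORE INTO literal (value) VALUES (?)", (value,))
    row = conn.execute("SELECT id FROM literal WHERE value = ?", (value,)).fetchone()
    return row[0]

path = "src/main/java/com/example/Demo.java"
first = intern_string(conn, path)
second = intern_string(conn, path)            # same string, same id, no new row
other = intern_string(conn, "com/example/Demo#")
row_count = conn.execute("SELECT COUNT(*) FROM literal").fetchone()[0]

print(first, second, other, row_count)
```

Other tables can then store the small integer id instead of repeating long file paths and symbol strings in every row.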

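Symbol table: uniquely identifies one symbol occurrence within a repository.

hash_id generation strategy: hash(file path + literal + range + reference type)

Open question: should a projectId dimension be added?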
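One possible way to compute such a hash_id is sketched below. The choice of SHA-256, the field separator, the truncation to 32 hex characters and the optional `project_id` parameter are all assumptions for illustration; the source only specifies which fields go into the hash.

```python
import hashlib

def symbol_hash_id(file_path, literal, rng, reference_type, project_id=None):
    """Hash the identifying fields of an occurrence into a fixed-length id.

    A separator character is placed between fields so that different field
    splits (for example "ab" + "c" and "a" + "bc") cannot collide.
    """
    parts = [file_path, literal, ",".join(str(n) for n in rng), str(reference_type)]
    if project_id is not None:
        # Optional extra dimension, as raised in the open question.
        parts.insert(0, str(project_id))
    payload = "\x1f".join(parts).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]

a = symbol_hash_id("A.java", "parse", (20, 29, 34), 1)
b = symbol_hash_id("A.java", "parse", (20, 29, 34), 1)
c = symbol_hash_id("A.java", "parse", (20, 29, 35), 1)
p1 = symbol_hash_id("A.java", "parse", (20, 29, 34), 1, project_id=1)
p2 = symbol_hash_id("A.java", "parse", (20, 29, 34), 1, project_id=2)

print(a == b, a == c, p1 == p2)
```

Without the project dimension, two repositories that contain the same file at the same path (a fork, for instance) produce identical ids, which is the trade-off behind the open question: more sharing of rows versus no per-repository isolation.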

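Index table: represents one index generation run; the module field holds data about the referenced modules.

Open question: add version information to symbol (?)

[Figure]

The overall storage design is roughly as shown below:

[Figure: overall storage design]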

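On hover, or on click to jump to a definition:

1. Look up the symbol at the cursor position (hash(filePath) stands for the string id of the file path)

select * from symbol where file_string_id = hash(filePath)
and start_line = 1 and start_col = 2;

2. Use symbol_string_id and reference_type to find the matching occurrences and their files

select * from symbol where symbol_string_id = 'xxxx' and reference_type = 1;

3. Filter the results through the index table (index is a reserved word in MySQL, so the table name must be quoted with backticks)

select * from `index` where project_id = 123
and commit_hash = 'xxx' and symbol_hash_id in ('xxxx', 'zzzz');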
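The three steps can be chained as in the sketch below, which uses an in-memory SQLite database in place of MySQL. The column set, the sample rows, the `string_id` helper (a stand-in for the hash of a string) and the reading of `reference_type = 1` as "definition" are assumptions; the real schema appears only in the figures.

```python
import sqlite3
import zlib

DEFINITION = 1  # assumed meaning of reference_type = 1

def string_id(text):
    """Stand-in for hash(filePath): a stable integer id for a string."""
    return zlib.crc32(text.encode("utf-8"))

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE symbol (
    hash_id TEXT PRIMARY KEY,
    file_string_id INTEGER,
    symbol_string_id INTEGER,
    start_line INTEGER,
    start_col INTEGER,
    reference_type INTEGER
);
-- The table name is a reserved word, so it has to be quoted.
CREATE TABLE "index" (project_id INTEGER, commit_hash TEXT, symbol_hash_id TEXT);
""")

getter = string_id("ActionLogParseTarget#getCrActionLog().")
conn.executemany("INSERT INTO symbol VALUES (?, ?, ?, ?, ?, ?)", [
    ("ref-1", string_id("CommonLogParser.java"), getter, 21, 37, 0),
    ("def-old", string_id("ActionLogParseTarget.java"), getter, 8, 4, DEFINITION),
    ("def-new", string_id("ActionLogParseTarget.java"), getter, 10, 4, DEFINITION),
])
conn.executemany('INSERT INTO "index" VALUES (?, ?, ?)', [
    (123, "c0", "def-old"),
    (123, "c1", "ref-1"),
    (123, "c1", "def-new"),
])

def go_to_definition(conn, project_id, commit_hash, file_path, line, col):
    # Step 1: which symbol sits at the cursor position?
    row = conn.execute(
        "SELECT symbol_string_id FROM symbol "
        "WHERE file_string_id = ? AND start_line = ? AND start_col = ?",
        (string_id(file_path), line, col),
    ).fetchone()
    if row is None:
        return []
    # Step 2: all definition occurrences of that symbol, across versions.
    candidates = conn.execute(
        "SELECT hash_id, file_string_id, start_line, start_col FROM symbol "
        "WHERE symbol_string_id = ? AND reference_type = ? ORDER BY hash_id",
        (row[0], DEFINITION),
    ).fetchall()
    if not candidates:
        return []
    # Step 3: keep only the occurrences that belong to this project and commit.
    placeholders = ",".join("?" * len(candidates))
    live = {r[0] for r in conn.execute(
        f'SELECT symbol_hash_id FROM "index" WHERE project_id = ? '
        f"AND commit_hash = ? AND symbol_hash_id IN ({placeholders})",
        (project_id, commit_hash, *[c[0] for c in candidates]),
    )}
    return [(c[1], c[2], c[3]) for c in candidates if c[0] in live]

result = go_to_definition(conn, 123, "c1", "CommonLogParser.java", 21, 37)
print(result)
```

Step 3 is what separates versions: the symbol table may hold the same definition at different positions for different commits, and only the index table says which rows are valid for the commit being viewed.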

Finding the implementations of an interface

1. Look up the symbol at the cursor position

select * from symbol where file_string_id = hash(filePath)
and start_line = 1 and start_col = 2;

2. Query the relation table

select * from symbol_relation where right_symbol_string_id = 'xxxx'
and relation_type = 1;
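The relation lookup can be sketched as follows, again with SQLite standing in for MySQL. The rows are derived from data shaped like the `relationships` entries of a SCIP index (`is_implementation: true`). The symbol strings are shortened and partly invented, symbol strings are stored directly instead of string ids, and `relation_type = 1` meaning "implements" is an assumption.

```python
import sqlite3

IMPLEMENTATION = 1  # assumed value of relation_type for "implements"

# (symbol, [(related symbol, is_implementation), ...]) pairs, shaped like
# SCIP SymbolInformation.relationships. Illustrative sample data only.
SYMBOLS = [
    ("logparser/CommonLogParser#", [("logparser/LogParserInter#", True)]),
    ("logparser/CommentLogParser#", [("logparser/LogParserInter#", True)]),
    ("logparser/CommonLogParser#parse().", [("logparser/LogParserInter#parse().", True)]),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE symbol_relation ("
    "left_symbol_string_id TEXT, right_symbol_string_id TEXT, relation_type INTEGER)"
)
for symbol, relationships in SYMBOLS:
    for related, is_implementation in relationships:
        if is_implementation:
            # left = the implementing symbol, right = the symbol it implements
            conn.execute(
                "INSERT INTO symbol_relation VALUES (?, ?, ?)",
                (symbol, related, IMPLEMENTATION),
            )

def implementations_of(conn, interface_symbol):
    rows = conn.execute(
        "SELECT left_symbol_string_id FROM symbol_relation "
        "WHERE right_symbol_string_id = ? AND relation_type = ? "
        "ORDER BY left_symbol_string_id",
        (interface_symbol, IMPLEMENTATION),
    )
    return [row[0] for row in rows]

impls = implementations_of(conn, "logparser/LogParserInter#")
print(impls)
```

The same table answers the method-level question as well: querying for `logparser/LogParserInter#parse().` returns the overriding methods.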

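Cross-repository indexing

Indexing is currently MR-triggered, so if a dependency's repository has never had an MR (kconf, for example), it is not possible to navigate into it.

Assuming the target library does have an index, the module field in the index table is used to query the artifact repository for information about the dependency's repository, including the repository ID and the commit ID from which the artifact was published.

This intermediate information has to be built with help from the artifact repository and the build system. At present the artifact repository takes it directly from the pipeline's publish step, and that data also expires: 5 versions plus 5 days of valid data are kept, and whenever a new publish job runs, historical data is deleted according to this policy.

With this intermediate information as the link, the lookup then proceeds using the index queries described above.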
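The resolution path described above can be sketched as below. For simplicity the dependency coordinates are read from the package part of a SCIP symbol rather than from the index table's module field, and the artifact data (`ARTIFACTS`), the set of indexed commits (`INDEXED`), the example coordinates and all function names are hypothetical.

```python
from typing import NamedTuple, Optional

class Artifact(NamedTuple):
    project_id: int
    commit_id: str

# Hypothetical artifact-repository data: (package name, version) -> source.
ARTIFACTS = {("maven/com.example/demo-lib", "1.2.3"): Artifact(456, "abc123")}
# Hypothetical set of (project_id, commit) pairs that already have an index.
INDEXED = {(456, "abc123")}

def package_of(symbol: str) -> Optional[tuple]:
    """Return (package name, version), or None for local or same-repo symbols."""
    if symbol.startswith("local "):
        return None
    parts = symbol.split(" ", 4)  # scheme, manager, name, version, descriptors
    if len(parts) < 5:
        return None
    name, version = parts[2], parts[3]
    if name == ".":  # "." marks a symbol defined in the indexed repository itself
        return None
    return name, version

def resolve_target(symbol: str) -> Optional[Artifact]:
    """Find the repository and commit whose index should answer the query."""
    package = package_of(symbol)
    if package is None:
        return None
    artifact = ARTIFACTS.get(package)
    if artifact is None or (artifact.project_id, artifact.commit_id) not in INDEXED:
        return None  # never published, expired, or never indexed
    return artifact

external = "semanticdb maven maven/com.example/demo-lib 1.2.3 com/example/Demo#"
internal = "semanticdb maven . . com/kuaishou/serveree/"
print(resolve_target(external), resolve_target(internal))
```

Once a target is found, its project ID and commit ID take the place of project_id and commit_hash in the index-table filter. The sketch ignores the SCIP rule that spaces inside names are escaped by doubling them.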

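Incremental builds

At present only incremental insertion is supported, not incremental building. For each new version of an MR, a new index is generated first; the files changed relative to the previous version are then identified, and only their symbols are inserted. The overall flow:

[Figure: incremental insertion flow]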
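The incremental insertion step can be sketched as a pure function. The data shapes (a set of stored hash ids, a per-file list of hash ids for the new index, a set of changed paths) and the function name are assumptions for illustration.

```python
def plan_incremental_insert(stored_ids, new_index, changed_files):
    """Decide which symbol rows to insert for a new MR version.

    stored_ids:    hash ids already present in the symbol table
    new_index:     {file path: [hash ids]} from the freshly generated index
    changed_files: paths that differ from the previous version
    Returns (rows to insert, all ids the new version's index rows refer to).
    """
    to_insert = []
    version_refs = []
    for path, hash_ids in new_index.items():
        for hash_id in hash_ids:
            version_refs.append(hash_id)
            # Unchanged files were stored with the previous version, so only
            # changed files can contribute rows that are not stored yet.
            if path in changed_files and hash_id not in stored_ids:
                to_insert.append(hash_id)
    return to_insert, version_refs

stored = {"a1", "a2", "b1"}
new_index = {"A.java": ["a1", "a2"], "B.java": ["b1", "b2"]}
to_insert, version_refs = plan_incremental_insert(stored, new_index, {"B.java"})
print(to_insert, version_refs)
```

The index is still built in full for every version; the saving is only on the write path, which matches the "incremental insertion, not incremental building" distinction above.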

Data cleanup

None yet.

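References

Alibaba's code indexing practice

LSP/LSIF

Introduction to SCIP (including a comparison with LSIF)

The SCIP protocol in detail

scip-java on GitHub

scip-java home page

Introduction to SemanticDB

Previous version of the code indexing design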

Appendix (the index in protobuf text format after parsing):

documents {
relative_path: "ks-serveree-cr-component/src/main/java/com/kuaishou/serveree/cr/component/service/logparser/CommonLogParser.java"
occurrences {
range: 2
range: 7
range: 10
symbol: "semanticdb maven . . org/"
}
occurrences {
range: 2
range: 11
range: 26
symbol: "semanticdb maven . . org/springframework/"
}
occurrences {
range: 2
range: 27
range: 37
symbol: "semanticdb maven . . org/springframework/stereotype/"
}
occurrences {
range: 2
range: 38
range: 47
symbol: "semanticdb maven maven/org.springframework/spring-context 5.1.10-kwai-12 org/springframework/stereotype/Component#"
}
occurrences {
range: 4
range: 7
range: 10
symbol: "semanticdb maven . . com/"
}
occurrences {
range: 4
range: 11
range: 19
symbol: "semanticdb maven . . com/kuaishou/"
}
occurrences {
range: 4
range: 20
range: 28
symbol: "semanticdb maven . . com/kuaishou/serveree/"
}
occurrences {
range: 4
range: 29
range: 31
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/"
}
occurrences {
range: 4
range: 32
range: 41
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/"
}
occurrences {
range: 4
range: 42
range: 48
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/"
}
occurrences {
range: 4
range: 49
range: 55
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/domain/"
}
occurrences {
range: 4
range: 56
range: 62
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/domain/entity/"
}
occurrences {
range: 4
range: 63
range: 74
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/domain/entity/CrActionLog#"
}
occurrences {
range: 5
range: 7
range: 10
symbol: "semanticdb maven . . com/"
}
occurrences {
range: 5
range: 11
range: 19
symbol: "semanticdb maven . . com/kuaishou/"
}
occurrences {
range: 5
range: 20
range: 28
symbol: "semanticdb maven . . com/kuaishou/serveree/"
}
occurrences {
range: 5
range: 29
range: 31
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/"
}
occurrences {
range: 5
range: 32
range: 41
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/"
}
occurrences {
range: 5
range: 42
range: 48
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/"
}
occurrences {
range: 5
range: 49
range: 54
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/event/"
}
occurrences {
range: 5
range: 55
range: 72
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/event/DetailOperateType#"
}
occurrences {
range: 6
range: 7
range: 10
symbol: "semanticdb maven . . com/"
}
occurrences {
range: 6
range: 11
range: 19
symbol: "semanticdb maven . . com/kuaishou/"
}
occurrences {
range: 6
range: 20
range: 28
symbol: "semanticdb maven . . com/kuaishou/serveree/"
}
occurrences {
range: 6
range: 29
range: 31
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/"
}
occurrences {
range: 6
range: 32
range: 41
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/"
}
occurrences {
range: 6
range: 42
range: 47
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/utils/"
}
occurrences {
range: 6
range: 48
range: 57
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/utils/I18nUtils#"
}
occurrences {
range: 7
range: 7
range: 10
symbol: "semanticdb maven . . com/"
}
occurrences {
range: 7
range: 11
range: 19
symbol: "semanticdb maven . . com/kuaishou/"
}
occurrences {
range: 7
range: 20
range: 28
symbol: "semanticdb maven . . com/kuaishou/serveree/"
}
occurrences {
range: 7
range: 29
range: 31
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/"
}
occurrences {
range: 7
range: 32
range: 41
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/"
}
occurrences {
range: 7
range: 42
range: 44
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/"
}
occurrences {
range: 7
range: 45
range: 53
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/"
}
occurrences {
range: 7
range: 54
range: 56
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/vo/"
}
occurrences {
range: 7
range: 57
range: 74
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/vo/ActionLogResponse#"
}
occurrences {
range: 15
range: 1
range: 10
symbol: "semanticdb maven maven/org.springframework/spring-context 5.1.10-kwai-12 org/springframework/stereotype/Component#"
}
occurrences {
range: 16
range: 13
range: 28
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/CommonLogParser#"
symbol_roles: 1
}
occurrences {
range: 16
range: 13
range: 28
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/CommonLogParser#`<init>`()."
symbol_roles: 1
}
occurrences {
range: 16
range: 40
range: 54
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/LogParserInter#"
}
occurrences {
range: 19
range: 5
range: 13
symbol: "semanticdb maven jdk 8 java/lang/Override#"
}
occurrences {
range: 20
range: 11
range: 28
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/vo/ActionLogResponse#"
}
occurrences {
range: 20
range: 29
range: 34
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/CommonLogParser#parse()."
symbol_roles: 1
}
occurrences {
range: 20
range: 35
range: 55
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/ActionLogParseTarget#"
}
occurrences {
range: 20
range: 56
range: 62
symbol: "local 0"
symbol_roles: 1
}
occurrences {
range: 21
range: 12
range: 18
symbol: "local 0"
}
occurrences {
range: 21
range: 30
range: 36
symbol: "local 0"
}
occurrences {
range: 21
range: 37
range: 51
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/ActionLogParseTarget#getCrActionLog()."
}
occurrences {
range: 24
range: 8
range: 19
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/domain/entity/CrActionLog#"
}
occurrences {
range: 24
range: 20
range: 31
symbol: "local 1"
symbol_roles: 1
}
occurrences {
range: 24
range: 34
range: 40
symbol: "local 0"
}
occurrences {
range: 24
range: 41
range: 55
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/ActionLogParseTarget#getCrActionLog()."
}
occurrences {
range: 25
range: 8
range: 25
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/event/DetailOperateType#"
}
occurrences {
range: 25
range: 26
range: 37
symbol: "local 2"
symbol_roles: 1
}
occurrences {
range: 25
range: 40
range: 51
symbol: "local 1"
}
occurrences {
range: 25
range: 52
range: 65
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/domain/entity/CrActionLog#getObjectType()."
}
occurrences {
range: 25
range: 68
range: 81
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/event/ObjectTypeEnum#getDetailType()."
}
occurrences {
range: 25
range: 82
range: 93
symbol: "local 1"
}
occurrences {
range: 25
range: 94
range: 108
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/domain/entity/CrActionLog#getOperateType()."
}
occurrences {
range: 26
range: 15
range: 32
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/vo/ActionLogResponse#"
}
occurrences {
range: 26
range: 33
range: 40
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/vo/ActionLogResponse#builder()."
}
occurrences {
range: 27
range: 17
range: 23
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/vo/ActionLogResponse#ActionLogResponseBuilder#action()."
}
occurrences {
range: 27
range: 24
range: 35
symbol: "local 2"
}
occurrences {
range: 27
range: 36
range: 40
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/event/DetailOperateType#name()."
}
occurrences {
range: 28
range: 17
range: 27
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/vo/ActionLogResponse#ActionLogResponseBuilder#actionName()."
}
occurrences {
range: 28
range: 28
range: 39
symbol: "local 2"
}
occurrences {
range: 28
range: 40
range: 51
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/event/DetailOperateType#operateDesc()."
}
occurrences {
range: 29
range: 17
range: 34
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/vo/ActionLogResponse#ActionLogResponseBuilder#actionDescription()."
}
occurrences {
range: 29
range: 35
range: 44
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/utils/I18nUtils#"
}
occurrences {
range: 29
range: 45
range: 67
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/utils/I18nUtils#getActionLogDescByDesc()."
}
occurrences {
range: 29
range: 68
range: 79
symbol: "local 1"
}
occurrences {
range: 29
range: 80
range: 88
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/common/domain/entity/CrActionLog#getTitle()."
}
occurrences {
range: 30
range: 17
range: 31
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/vo/ActionLogResponse#ActionLogResponseBuilder#actionAtGitlab()."
}
occurrences {
range: 30
range: 32
range: 39
symbol: "semanticdb maven jdk 8 java/lang/Boolean#"
}
occurrences {
range: 30
range: 40
range: 45
symbol: "semanticdb maven jdk 8 java/lang/Boolean#FALSE."
}
occurrences {
range: 31
range: 17
range: 24
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/vo/ActionLogResponse#ActionLogResponseBuilder#display()."
}
occurrences {
range: 32
range: 17
range: 22
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/vo/response/vo/ActionLogResponse#ActionLogResponseBuilder#build()."
}
symbols {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/CommonLogParser#"
documentation: "```java\n@Component(\"commonLogParser\")\npublic class CommonLogParser\n```"
documentation: " log\350\256\260\345\275\225\347\232\204\351\200\232\347\224\250\345\256\236\347\216\260\347\261\273\n\n @author yangsimeng <yangsimeng@kuaishou.com>\n Created on 2021/3/26\n"
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/LogParserInter#"
is_implementation: true
}
}
symbols {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/CommonLogParser#`<init>`()."
documentation: "```java\npublic CommonLogParser()\n```"
}
symbols {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/CommonLogParser#parse()."
documentation: "```java\n@Override\npublic ActionLogResponse parse(ActionLogParseTarget target)\n```"
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/AddPatchLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/ChangeNotifierLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/ChangeTargetBranchParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/CommentLabelLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/CommentLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/CrCreateLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/LabelLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/MergeDeclineActionLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/MilestoneLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/RejectMergeActionLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/RelatedResourceLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/RelatedTeamLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/ReviewActionLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/UndoRejectMergeActionLogParser#parse()."
is_reference: true
is_implementation: true
}
relationships {
symbol: "semanticdb maven . . com/kuaishou/serveree/cr/component/service/logparser/LogParserInter#parse()."
is_reference: true
is_implementation: true
}
}
symbols {
symbol: "local 0"
documentation: "```java\nActionLogParseTarget target\n```"
}
symbols {
symbol: "local 1"
documentation: "```java\nCrActionLog crActionLog\n```"
}
symbols {
symbol: "local 2"
documentation: "```java\nDetailOperateType operateType\n```"
}
}
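The dump above can be read with a few small helpers. In SCIP, a three-element `range` is `[line, start column, end column]` for a single-line span and a four-element one is `[start line, start column, end line, end column]`, both zero-based; `symbol_roles: 1` is the Definition bit; and a non-local symbol string is made of a scheme, a package (manager, name, version) and descriptors separated by spaces. The function names are made up for this sketch, and the escaping of spaces inside names is ignored.

```python
from typing import NamedTuple

DEFINITION = 0x1  # SymbolRole.Definition bit

class Range(NamedTuple):
    start_line: int
    start_col: int
    end_line: int
    end_col: int

def decode_range(values):
    """Expand SCIP's compact range encoding into four explicit fields."""
    if len(values) == 3:
        line, start_col, end_col = values
        return Range(line, start_col, line, end_col)
    if len(values) == 4:
        return Range(*values)
    raise ValueError(f"range must have 3 or 4 elements, got {len(values)}")

def parse_symbol(symbol):
    """Split a symbol string into its space-separated components."""
    if symbol.startswith("local "):
        return {"local": symbol[len("local "):]}
    scheme, manager, name, version, descriptor = symbol.split(" ", 4)
    return {
        "scheme": scheme,
        "manager": manager,
        "package": name,
        "version": version,
        "descriptor": descriptor,
    }

def is_definition(symbol_roles):
    return bool(symbol_roles & DEFINITION)

class_range = decode_range([16, 13, 28])
override = parse_symbol("semanticdb maven jdk 8 java/lang/Override#")
print(class_range, override, is_definition(1))
```

Applied to the dump, the occurrence with `range: 16, 13, 28` and `symbol_roles: 1` is the definition of `CommonLogParser#`, spanning columns 13 to 28 of line 16 (zero-based).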

Building a Code Index with SCIP
http://coder-xieshijie.cn/2023/07/14/个人成长/工作领域汇总/代码索引构建scip/
Author: 谢世杰
Published: July 14, 2023